Handling Missing Annotations in Supervised Learning Data

Paper Title

Handling Missing Annotations in Supervised Learning Data

Authors

Abdel-Hakim, Alaa E.; Deabes, Wael

Abstract

Data annotation is an essential stage in supervised learning. However, the annotation process is exhausting and time-consuming, especially for large datasets. Activities of Daily Living (ADL) recognition is an example of a system that exploits very large volumes of raw sensor readings. In such systems, sensor readings are collected from activity-monitoring sensors around the clock (24/7). The generated dataset is so large that it is almost impossible for a human annotator to assign a definite label to every single instance in it. This results in annotation gaps in the input data of the supervised learning system, and these gaps negatively affect the performance of the recognition system. In this work, we propose and investigate three different paradigms for handling these gaps. In the first paradigm, the gaps are removed by dropping all unlabeled readings. In the second paradigm, a single "Unknown" or "Do-Nothing" label is assigned to all unlabeled readings. The third paradigm gives each gap a unique label that identifies the deterministic labels enclosing it. We also propose a semantic preprocessing method for annotation gaps that constructs a hybrid combination of some of these paradigms for further performance improvement. The performance of the three proposed paradigms and their hybrid combination is evaluated on an ADL benchmark dataset containing more than $2.5\times 10^6$ sensor readings collected over more than nine months. The evaluation results highlight the performance contrast between the paradigms and support a specific gap-handling approach for better performance.
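The three paradigms can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the labels arrive as a time-ordered list in which `None` marks an unlabeled reading, and the function names, the `"A->B"` gap-label format, and the `START`/`END` sentinels are hypothetical choices made only for this illustration. The hybrid combination is not sketched, because the abstract does not specify it.

```python
from typing import List, Optional

UNKNOWN = "Unknown"  # assumed name for the single catch-all label


def drop_gaps(labels: List[Optional[str]]) -> List[int]:
    """Paradigm 1: keep only the indices of labeled readings."""
    return [i for i, lab in enumerate(labels) if lab is not None]


def fill_unknown(labels: List[Optional[str]]) -> List[str]:
    """Paradigm 2: give every unlabeled reading the same label."""
    return [lab if lab is not None else UNKNOWN for lab in labels]


def label_by_neighbors(labels: List[Optional[str]]) -> List[str]:
    """Paradigm 3 (one possible reading): label each gap by the
    known labels that enclose it, e.g. "Sleep->Cook"."""
    out: List[str] = []
    n = len(labels)
    prev = "START"  # hypothetical sentinel for a gap at the beginning
    i = 0
    while i < n:
        if labels[i] is not None:
            prev = labels[i]
            out.append(prev)
            i += 1
            continue
        # Find the end of this run of unlabeled readings.
        j = i
        while j < n and labels[j] is None:
            j += 1
        nxt = labels[j] if j < n else "END"  # hypothetical sentinel
        out.extend([f"{prev}->{nxt}"] * (j - i))
        i = j
    return out


labels = ["Sleep", None, None, "Cook", None, "Eat", None]
kept = drop_gaps(labels)
filled = fill_unknown(labels)
gapped = label_by_neighbors(labels)
```

With the sample list, the first function keeps three readings, the second produces one extra class shared by all gaps, and the third produces a separate class for each distinct pair of enclosing labels, so the number of classes grows with the variety of transitions in the data.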
