Title
Quantifying Human Bias and Knowledge to Guide ML Models During Training
Authors
Abstract
This paper presents a crowdsourcing-based method that we designed to quantify the importance of different attributes of a dataset in determining the outcome of a classification problem. This human-provided heuristic acts as the initial weight seed for machine learning models and guides the model towards a better optimum during gradient descent. Skewed datasets, which overrepresent items of certain classes while underrepresenting the rest, are common in practice. They may lead to unforeseen issues such as a model learning a biased function or overfitting. Traditional data augmentation techniques in supervised learning include oversampling and training with synthetic data. We introduce an experimental approach to dealing with such unbalanced datasets by including humans in the training process. We ask humans to rank the importance of the dataset's features and, through rank aggregation, determine the initial weight bias for the model. We show that collective human bias can allow ML models to learn insights about the true population instead of the biased sample. We use two rank aggregation methods, Kemeny-Young and the Markov Chain aggregator, to quantify human opinion on the importance of features. This work mainly tests the effectiveness of human knowledge on a binary classification problem (Popular vs Not-popular) with two ML models: Deep Neural Networks and Support Vector Machines. The approach treats humans as weak learners and relies on aggregation to offset individual biases and domain unfamiliarity.
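The rank aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a brute-force Kemeny-Young search (practical only for a handful of features), and the function names, the placeholder feature names, and the linear mapping from consensus rank to initial weight are all hypothetical, since the abstract does not say how ranks are converted to weights.

```python
from itertools import combinations, permutations


def kendall_tau_distance(order_a, order_b):
    """Count the feature pairs that the two rankings order differently."""
    pos_a = {f: i for i, f in enumerate(order_a)}
    pos_b = {f: i for i, f in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kemeny_young(rankings):
    """Brute-force Kemeny-Young: return the ordering that minimizes the
    total Kendall tau distance to all individual rankings."""
    features = sorted(rankings[0])
    return min(
        permutations(features),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )


def rank_to_weights(consensus):
    """Hypothetical mapping (an assumption, not from the paper): linearly
    decreasing scores by rank, normalized to sum to 1."""
    n = len(consensus)
    total = n * (n + 1) / 2
    return {f: (n - i) / total for i, f in enumerate(consensus)}


# Three hypothetical annotators rank three placeholder features,
# most important first.
rankings = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "a", "c"],
]
consensus = kemeny_young(rankings)
weights = rank_to_weights(consensus)
print(consensus, weights)
```

In this toy input, a majority of annotators place `a` above `b` and `b` above `c`, so the consensus ordering follows those pairwise majorities even though no two annotators submitted the same ranking. This is the sense in which aggregation can offset individual disagreement.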