Paper Title

Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification

Paper Authors

Niladri S. Chatterji, Saminul Haque, Tatsunori Hashimoto

Paper Abstract

While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ balanced dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or if the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal. In the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.
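
For concreteness, the following is a minimal sketch of the undersampling baseline the abstract refers to: each group is subsampled down to the size of the smallest group before training any classifier on the balanced subset. The function name `undersample_balanced` and the synthetic data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def undersample_balanced(X, y, groups, seed=None):
    """Subsample each group down to the size of the smallest group,
    returning a group-balanced training set (the undersampling baseline)."""
    rng = np.random.default_rng(seed)
    unique_groups = np.unique(groups)
    # The minority group size sets the per-group sample budget.
    n_min = min(np.sum(groups == g) for g in unique_groups)
    keep = []
    for g in unique_groups:
        idx = np.flatnonzero(groups == g)
        keep.append(rng.choice(idx, size=n_min, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Illustrative example under label shift, where the "group" is the class label itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
X_bal, y_bal = undersample_balanced(X, y, groups=y, seed=0)
print(X_bal.shape, np.bincount(y_bal))  # (200, 5) [100 100]
```

Under this sketch, any downstream learner trained on `(X_bal, y_bal)` sees equally many samples per group, at the cost of discarding excess majority group data.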
