Paper Title

Efficient One Pass Self-distillation with Zipf's Label Smoothing

Paper Authors

Jiajun Liang, Linze Li, Zhaodong Bing, Borui Zhao, Yao Tang, Bo Lin, Haoqiang Fan

Paper Abstract

Self-distillation exploits non-uniform soft supervision from the network itself during training and improves performance without any runtime cost. However, the overhead during training is often overlooked, yet reducing time and memory overhead during training is increasingly important in the era of giant models. This paper proposes an efficient self-distillation method named Zipf's Label Smoothing (Zipf's LS), which uses a network's on-the-fly predictions to generate soft supervision that conforms to a Zipf distribution without using any contrastive samples or auxiliary parameters. Our idea comes from an empirical observation that when a network is properly trained, the output values of its final softmax layer, after being sorted by magnitude and averaged across samples, follow a distribution reminiscent of Zipf's law in the word-frequency statistics of natural languages. By enforcing this property at the sample level and throughout the whole training period, we find that the prediction accuracy can be greatly improved. Using ResNet50 on the INAT21 fine-grained classification dataset, our technique achieves a +3.61% accuracy gain over the vanilla baseline and a further 0.88% gain over previous label smoothing and self-distillation strategies. The implementation is publicly available at https://github.com/megvii-research/zipfls.
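
The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch-style sketch of that idea: build a Zipf-shaped (1/rank) soft target from the ranking of the network's own logits and add it as an auxiliary distillation term. This is not the authors' exact loss; the function names `zipf_soft_targets` and `zipf_ls_loss` and the `power`/`alpha` hyperparameters are illustrative assumptions. See the official repository above for the real implementation.

```python
# Minimal sketch (not the official implementation): Zipf-shaped soft labels
# built from the ranking of the network's own on-the-fly predictions.
import torch
import torch.nn.functional as F


def zipf_soft_targets(logits: torch.Tensor, power: float = 1.0) -> torch.Tensor:
    """Assign a normalized 1/rank^power (Zipf) probability to each class,
    ordered per sample by the magnitude of the network's own logits."""
    batch, num_classes = logits.shape
    ranks = torch.arange(1, num_classes + 1, device=logits.device, dtype=logits.dtype)
    zipf = ranks.pow(-power)
    zipf = zipf / zipf.sum()                        # Zipf distribution over ranks 1..C
    order = logits.argsort(dim=1, descending=True)  # per-sample rank order of classes
    targets = torch.zeros_like(logits)
    # targets[b, order[b, r]] = zipf[r]: the r-th ranked class gets the r-th Zipf mass.
    targets.scatter_(1, order, zipf.repeat(batch, 1))
    return targets


def zipf_ls_loss(logits: torch.Tensor, labels: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Hard-label cross-entropy plus a KL term that pulls the prediction toward
    the Zipf-shaped soft target derived from the (detached) prediction itself."""
    ce = F.cross_entropy(logits, labels)
    soft = zipf_soft_targets(logits.detach())       # no gradient through the soft target
    kl = F.kl_div(F.log_softmax(logits, dim=1), soft, reduction="batchmean")
    return ce + alpha * kl
```

In this sketch the soft target is detached so it acts purely as self-generated supervision, and `alpha` trades off the hard-label cross-entropy against the Zipf-shaped term; no contrastive samples or auxiliary parameters are introduced, matching the constraint stated in the abstract.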
