Paper title
Learning useful representations for radio astronomy "in the wild" with contrastive learning
Authors
Abstract
Unknown class distributions in unlabelled astrophysical training data have previously been shown to degrade model performance because of dataset shift between the training and validation sets. For radio galaxy classification, we demonstrate in this work that removing sources of low angular extent from the unlabelled data before training produces qualitatively different training dynamics for a contrastive model. By applying the model to an unlabelled dataset with unknown class balance and sub-population distribution to generate a representation space of radio galaxies, we show that with an appropriate cut threshold we can find a representation with FRI/FRII class separation approaching that of a supervised baseline explicitly trained to separate radio galaxies into these two classes. Furthermore, we show that an excessively conservative cut threshold blocks any increase in validation accuracy. We then use the learned representation for the downstream task of performing a similarity search on rare hybrid sources, finding that the contrastive model can reliably return semantically similar samples, with the added bonus of finding duplicates that remain after pre-processing.
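The two data-handling steps named in the abstract, an angular-extent cut before training and a similarity search in the learned representation space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `angular_size_cut` and `similarity_search`, the use of arcseconds, the strict greater-than comparison, and the choice of cosine similarity are all assumptions made for illustration; the abstract states neither a threshold value nor a distance metric.

```python
import numpy as np


def angular_size_cut(sizes_arcsec, threshold_arcsec):
    """Boolean mask keeping sources whose angular extent exceeds the threshold.

    Units (arcsec) and the strict '>' comparison are illustrative assumptions.
    """
    sizes = np.asarray(sizes_arcsec, dtype=float)
    return sizes > threshold_arcsec


def similarity_search(embeddings, query, k=5):
    """Return (indices, scores) of the k embeddings most similar to the query.

    Cosine similarity is an assumed metric; results are sorted best first.
    """
    emb = np.asarray(embeddings, dtype=float)
    q = np.asarray(query, dtype=float)
    # Normalise rows and the query to unit length so the dot product
    # equals the cosine similarity.
    emb_unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    scores = emb_unit @ q_unit
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy stand-ins for catalogue sizes and model embeddings (hypothetical values).
sizes = [5.0, 12.0, 30.0, 18.0]
mask = angular_size_cut(sizes, threshold_arcsec=15.0)

embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
indices, scores = similarity_search(embeddings, query=[1.0, 0.0], k=2)
```

In a setup like this, a returned neighbour whose similarity to the query is essentially 1 would be a natural candidate for a duplicate that survived pre-processing, although any tolerance used to flag one would need to be chosen empirically.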