Paper title
Improving the open cluster census. I. Comparison of clustering algorithms applied to Gaia DR2 data
Paper authors
Paper abstract
The census of open clusters in the Milky Way is in a never-before-seen state of flux. Recent works have reported hundreds of new open clusters thanks to the incredible astrometric quality of the Gaia satellite, but other works have also reported that many open clusters discovered in the pre-Gaia era may be associations. We aim to conduct a comparison of clustering algorithms used to detect open clusters, attempting to statistically quantify their strengths and weaknesses by deriving the sensitivity, specificity, and precision of each, as well as their true positive rate against a larger sample. We selected DBSCAN, HDBSCAN, and Gaussian mixture models for further study, owing to their speed and appropriateness for use with Gaia data. We developed a preprocessing pipeline for Gaia data and developed the algorithms further for the specific application to open clusters. We derived detection rates for all 1385 open clusters in the fields in our study, as well as more detailed performance statistics for 100 of these open clusters. DBSCAN was sensitive to between 50% and 62% of the true positive open clusters in our sample, with generally very good specificity and precision. HDBSCAN traded precision for a higher sensitivity of up to 82%, especially across different distances and scales of open clusters. Gaussian mixture models were slow and only sensitive to 33% of open clusters in our sample, which tended to be larger objects. Additionally, we report on 41 new open cluster candidates detected by HDBSCAN, three of which are closer than 500 pc. When used with additional post-processing to mitigate its false positives, we have found that HDBSCAN is the most sensitive and effective algorithm for recovering open clusters in Gaia data. Our results suggest that many more new and already reported open clusters have yet to be detected in Gaia data.
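The abstract compares algorithms by sensitivity, specificity, and precision but does not spell out the formulas. The sketch below uses the standard confusion-matrix definitions of these three quantities. It is an illustration only, not the authors' pipeline: the `ConfusionCounts` container and the function names are hypothetical, the example counts are invented, and the abstract does not say how true negatives are counted for a cluster search.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container: detections compared against a reference catalogue."""
    true_positives: int    # real clusters that the algorithm recovered
    false_positives: int   # reported objects that are not real clusters
    true_negatives: int    # non-clusters correctly left unreported (assumed countable)
    false_negatives: int   # real clusters that the algorithm missed


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a category is empty.
    return numerator / denominator if denominator else math.nan


def sensitivity(c: ConfusionCounts) -> float:
    # Fraction of real clusters recovered: TP / (TP + FN).
    return _ratio(c.true_positives, c.true_positives + c.false_negatives)


def specificity(c: ConfusionCounts) -> float:
    # Fraction of non-clusters correctly rejected: TN / (TN + FP).
    return _ratio(c.true_negatives, c.true_negatives + c.false_positives)


def precision(c: ConfusionCounts) -> float:
    # Fraction of reported objects that are real: TP / (TP + FP).
    return _ratio(c.true_positives, c.true_positives + c.false_positives)


# Invented counts for illustration only; these are not results from the paper.
example = ConfusionCounts(
    true_positives=40, false_positives=10, true_negatives=90, false_negatives=10
)

if __name__ == "__main__":
    print(f"sensitivity = {sensitivity(example):.2f}")
    print(f"specificity = {specificity(example):.2f}")
    print(f"precision   = {precision(example):.2f}")
```

Under these definitions, an algorithm that reports more candidates can raise its sensitivity while lowering its precision, which is the trade-off the abstract describes for HDBSCAN.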