Paper Title
Faster DBSCAN via subsampled similarity queries
Paper Authors
Paper Abstract
DBSCAN is a popular density-based clustering algorithm. It computes the $ε$-neighborhood graph of a dataset and uses the connected components of the high-degree nodes to determine the clusters. However, the full neighborhood graph may be too costly to compute, with a worst-case complexity of $O(n^2)$. In this paper, we propose a simple variant called SNG-DBSCAN, which clusters based on a subsampled $ε$-neighborhood graph, requires access only to similarity queries for pairs of points, and in particular avoids any complex data structures that require the embeddings of the data points themselves. The runtime of the procedure is $O(sn^2)$, where $s$ is the sampling rate. We show, under some natural theoretical assumptions, that $s \approx \log n/n$ is sufficient for statistical cluster-recovery guarantees, leading to an $O(n\log n)$ complexity. We provide an extensive experimental analysis showing that on large datasets, one can subsample as little as $0.1\%$ of the neighborhood graph, yielding over a 200x speedup and a 250x reduction in RAM consumption compared to scikit-learn's implementation of DBSCAN, while still maintaining competitive clustering performance.
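To make the procedure concrete, the sketch below shows one way a subsampled $ε$-neighborhood graph and the high-degree/connected-components clustering step could be implemented. It is a minimal illustration of the idea described in the abstract, not the authors' reference implementation; the function name `sng_dbscan`, the parameters `eps`, `min_pts`, and `sampling_rate`, and the per-point uniform sampling scheme are assumptions made for the example.

```python
# Minimal sketch of a subsampled-neighborhood DBSCAN variant in the spirit of
# SNG-DBSCAN. Parameter names and the uniform pair-sampling scheme are
# illustrative assumptions, not the paper's reference implementation.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def sng_dbscan(X, eps, min_pts, sampling_rate, rng=None):
    """Cluster X by building a subsampled eps-neighborhood graph.

    Each point is compared against roughly `sampling_rate * n` uniformly
    sampled points (O(s n^2) similarity queries in total), so only pairwise
    distances are needed -- no spatial index over the embeddings.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    n = len(X)
    m = max(1, int(sampling_rate * n))  # similarity queries per point

    rows, cols = [], []
    for i in range(n):
        candidates = rng.choice(n, size=m, replace=False)
        d = np.linalg.norm(X[candidates] - X[i], axis=1)
        hits = candidates[(d <= eps) & (candidates != i)]
        rows.extend([i] * len(hits))
        cols.extend(hits)

    # Sparse adjacency of the subsampled eps-neighborhood graph.
    A = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()
    A.data[:] = 1.0          # collapse any duplicate sampled pairs
    A = A.maximum(A.T)       # make the graph undirected

    # Core points: high degree in the subsampled graph.
    degree = np.asarray(A.sum(axis=1)).ravel()
    core = degree >= min_pts

    # Connected components restricted to core points define the clusters.
    _, core_labels = connected_components(A[core][:, core], directed=False)
    labels = np.full(n, -1)  # -1 marks noise
    labels[np.flatnonzero(core)] = core_labels

    # Attach non-core points to a neighboring core cluster if one exists.
    for i in np.flatnonzero(~core):
        neighbors = A[i].indices
        neighbor_core = neighbors[core[neighbors]]
        if len(neighbor_core) > 0:
            labels[i] = labels[neighbor_core[0]]
    return labels
```

With `sampling_rate` on the order of $\log n/n$, each point issues about $\log n$ similarity queries, so the total work in this sketch is roughly $O(n\log n)$, matching the complexity discussed in the abstract.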