Paper Title


On Approximability of Clustering Problems Without Candidate Centers

Authors

Vincent Cohen-Addad, Karthik C. S., Euiwoong Lee

Abstract


The k-means objective is arguably the most widely-used cost function for modeling clustering tasks in a metric space. In practice and historically, k-means is thought of in a continuous setting, namely where the centers can be located anywhere in the metric space. For example, the popular Lloyd's heuristic locates a center at the mean of each cluster. Despite persistent efforts on understanding the approximability of k-means, and other classic clustering problems such as k-median and k-minsum, our knowledge of the hardness of approximation factors of these problems remains quite poor. In this paper, we significantly improve upon the hardness of approximation factors known in the literature for these objectives. We show that if the input lies in a general metric space, it is NP-hard to approximate: $\bullet$ Continuous k-median to a factor of $2-o(1)$; this improves upon the previous inapproximability factor of 1.36 shown by Guha and Khuller (J. Algorithms '99). $\bullet$ Continuous k-means to a factor of $4- o(1)$; this improves upon the previous inapproximability factor of 2.10 shown by Guha and Khuller (J. Algorithms '99). $\bullet$ k-minsum to a factor of $1.415$; this improves upon the APX-hardness shown by Guruswami and Indyk (SODA '03). Our results shed new and perhaps counter-intuitive light on the differences between clustering problems in the continuous setting versus the discrete setting (where the candidate centers are given as part of the input).
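The abstract refers to Lloyd's heuristic, which places each center at the mean of its cluster; this is exactly what distinguishes the continuous setting (centers anywhere in the space) from the discrete one. The following is a minimal sketch of Lloyd's heuristic for continuous k-means on 1-D data; the function name, the toy point set, and the iteration count are illustrative choices, not part of the paper.

```python
# Minimal sketch of Lloyd's heuristic for continuous k-means (1-D points).
# All names and parameters here are illustrative, not from the paper.
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        # This is only valid in the continuous setting, where centers
        # may lie anywhere in the space rather than at input points.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    # k-means cost: sum of squared distances to the nearest center.
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, cost
```

In the discrete variant studied alongside this, the update step would instead pick the best *input point* (or candidate center) per cluster, which is the source of the continuous-versus-discrete gap the paper quantifies.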
