固定和自适应地标设置有限的伪计

论文标题

固定和自适应地标设置有限的伪计

Fixed and adaptive landmark sets for finite pseudometric spaces

论文作者

Brunson, Jason Cory, Skaf, Yara

论文摘要

拓扑数据分析（TDA）是一个扩展的字段，它利用代数拓扑的原理和工具来量化数据集的结构特征或将其转换为更易于管理的形式。随着其理论基础的发展，TDA在从高维，嘈杂和复杂的数据（例如生物医学中使用的数据）中提取有用的信息方面表现出了希望。为了提高效率，这些技术可以采用具有里程碑意义的采样器。启发式maxmin程序通过隐式构造包含一组均匀半径的盖子来获得样品点的大致分布。但是，与生物医学中常见的数据一样，密度变化或包含多个点的数据都会出现问题。我们提出了一个基于排名距离的类似程序，即“最后一个”，这意味着覆盖范围包括一组均匀的基数。我们首先严格定义该过程，并证明它获得具有所需属性的地标。然后，我们执行基准测试并将其性能与Maxmin的性能进行比较，并在涉及模拟和现实生物医学数据的特征检测和类预测任务中进行比较。 LastFirst比Maxmin更一般，因为它可以应用于可以计算任意（和不一定是对称）成对距离的任何数据。最后，最新的计算成本更高，但是我们的实施量表的尺度与Maxmin相同。我们发现LastFirst在预测任务上取得了可比的性能，并且在同源检测任务上的Maxmin优于Maxmin。在相似性措施的数值并不有意义的情况下，在许多生物医学环境中，最后一个采样也可以提高可解释性。

Topological data analysis (TDA) is an expanding field that leverages principles and tools from algebraic topology to quantify structural features of data sets or transform them into more manageable forms. As its theoretical foundations have been developed, TDA has shown promise in extracting useful information from high-dimensional, noisy, and complex data such as those used in biomedicine. To improve efficiency, these techniques may employ landmark samplers. The heuristic maxmin procedure obtains a roughly even distribution of sample points by implicitly constructing a cover comprising sets of uniform radius. However, issues arise with data that vary in density or include points with multiplicities, as are common in biomedicine. We propose an analogous procedure, "lastfirst" based on ranked distances, which implies a cover comprising sets of uniform cardinality. We first rigorously define the procedure and prove that it obtains landmarks with desired properties. We then perform benchmark tests and compare its performance to that of maxmin, on feature detection and class prediction tasks involving simulated and real-world biomedical data. Lastfirst is more general than maxmin in that it can be applied to any data on which arbitrary (and not necessarily symmetric) pairwise distances can be computed. Lastfirst is more computationally costly, but our implementation scales at the same rate as maxmin. We find that lastfirst achieves comparable performance on prediction tasks and outperforms maxmin on homology detection tasks. Where the numerical values of similarity measures are not meaningful, as in many biomedical contexts, lastfirst sampling may also improve interpretability.

下载PDF全文

下载文献需遵守相关版权规定

论文标题