A Learned Index for Exact Similarity Search in Metric Spaces

Paper title

A Learned Index for Exact Similarity Search in Metric Spaces

Authors

Tian, Yao; Yan, Tingyun; Zhao, Xi; Huang, Kai; Zhou, Xiaofang

Abstract

Indexing is an effective way to support efficient query processing in large databases. Recently, the concept of the learned index, which replaces or complements traditional index structures with machine learning models, has been actively explored to reduce storage and search costs. However, accurate and efficient similarity query processing in high-dimensional metric spaces remains an open challenge. In this paper, we propose a novel indexing approach called LIMS that uses data clustering, pivot-based data transformation techniques and learned indexes to support efficient similarity query processing in metric spaces. In LIMS, the underlying data is partitioned into clusters such that each cluster follows a relatively uniform data distribution. Data redistribution is achieved by utilizing a small number of pivots for each cluster. Similar data are mapped into compact regions and the mapped values are totally ordered. Machine learning models are developed to approximate the position of each data record on disk. Efficient algorithms are designed for processing range queries and nearest neighbor queries based on LIMS, and for index maintenance with dynamic updates. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes and state-of-the-art learned indexes.
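
The general idea of combining a pivot-based mapping with a learned position model can be illustrated with a small sketch. This is not the LIMS algorithm itself: it assumes a single cluster, a single pivot, Euclidean distance, and a plain least-squares line as the "learned" model, and the names PivotLearnedIndex, range_query and dist are hypothetical. Each point is mapped to its distance from the pivot, which gives a totally ordered one-dimensional key. The model predicts where a key sits in the sorted array, and a bounded binary search corrects the prediction. The triangle inequality guarantees that every true result has a key within r of the query's key, so the final answer stays exact once candidates are verified.

```python
import bisect
import math


def dist(a, b):
    """Euclidean distance; any metric obeying the triangle inequality would do."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PivotLearnedIndex:
    """Illustrative one-pivot index with a linear position model (an assumption)."""

    def __init__(self, points, pivot):
        self.pivot = pivot
        # Map every point to a 1-D key (its distance to the pivot) and sort.
        pairs = sorted(((dist(p, pivot), p) for p in points), key=lambda kp: kp[0])
        self.keys = [k for k, _ in pairs]
        self.points = [p for _, p in pairs]
        n = len(self.keys)
        # Least-squares fit: position ~ a * key + b.
        mean_k = sum(self.keys) / n
        mean_i = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(self.keys))
        self.a = max(0.0, cov / var) if var > 0 else 0.0  # keep the model monotone
        self.b = mean_i - self.a * mean_k
        # Largest prediction error over the stored keys bounds the local search.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return self.a * key + self.b

    def _lower_bound(self, key):
        """First position whose key is >= key, found near the model's prediction."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = max(0, min(n, math.floor(guess - self.err)))
        hi = max(lo, min(n, math.ceil(guess + self.err) + 1))
        return bisect.bisect_left(self.keys, key, lo, hi)

    def range_query(self, q, r, eps=1e-9):
        """Return all points within distance r of q (exact after verification)."""
        dq = dist(q, self.pivot)
        # Triangle inequality: |d(x, pivot) - d(q, pivot)| <= d(x, q) <= r.
        # eps is a small slack for floating-point rounding.
        start = self._lower_bound(dq - r - eps)
        results = []
        for i in range(start, len(self.keys)):
            if self.keys[i] > dq + r + eps:
                break
            if dist(self.points[i], q) <= r:  # verify with the real distance
                results.append(self.points[i])
        return results
```

With points given as tuples, PivotLearnedIndex(points, pivot).range_query(q, r) returns the same set as a linear scan, while computing distances to q only for points whose key falls in the candidate interval. A single pivot prunes weakly in high dimensions, which is one reason an approach such as the one described in the abstract would use several pivots per cluster.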
