绝地武士：这些不是您要寻找的JSON文档...（扩展版本*）

论文标题

绝地武士：这些不是您要寻找的JSON文档...（扩展版本*）

JEDI: These aren't the JSON documents you're looking for... (Extended Version*)

论文作者

Hütter, Thomas, Augsten, Nikolaus, Kirsch, Christoph M., Carey, Michael J., Li, Chen

论文摘要

JavaScript对象表示法（JSON）是文档存储中使用的流行数据格式，用于本地支持半结构化数据。在本文中，我们解决了JSON相似性查找查询的问题：给定查询文档和一个距离阈值$τ$，从查询文档中检索所有在$τ$之内的JSON文档。由于其递归定义，JSON数据自然表示为树。与XML等其他层次结构格式不同，JSON支持单个文档中有序和无序的兄弟姐妹集合。此功能对树模型和距离计算提出了新的挑战。我们提出了JSON TREED，这是JSON文档的无损树表示，并定义JSON编辑距离（JEDI），这是JSON文档的第一个基于编辑的距离度量。我们开发了一种名为QuickJedi的算法，用于通过利用一种新技术来修剪昂贵的同级匹配来计算绝地武士。它在运行时的大数量级以优于基线算法。为了提高JSON相似性查询的性能，我们引入了一个名为JSIM的索引，并根据树木分类一个高效的上限。我们的上限算法以$ O（Nτ）$时间和$ o（n +τ\ log n）$空间运行，这显着改善了先前的$ O（n^2）$ time和$ o（n \ log n）$ space（其中$ n $是树大小）。我们的实验评估表明，我们的解决方案范围缩放到具有数百万个节点的数百万个文档和JSON树的数据库。

The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold $τ$, retrieve all JSON documents that are within $τ$ from the query document. Due to its recursive definition, JSON data are naturally represented as trees. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document. This feature poses a new challenge to the tree model and distance computation. We propose JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON documents. We develop an algorithm, called QuickJEDI, for computing JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and a highly effective upper bound based on tree sorting. Our algorithm for the upper bound runs in $O(n τ)$ time and $O(n + τ\log n)$ space, which substantially improves the previous best bound of $O(n^2)$ time and $O(n \log n)$ space (where $n$ is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.

下载PDF全文

下载文献需遵守相关版权规定

论文标题