论文标题
不确定文档跨度的恒定列举
Constant-Delay Enumeration for Nondeterministic Document Spanners
论文作者
论文摘要
我们考虑称为文档跨度的信息提取框架,并研究从输入文档中提取结果的问题,其中提取任务被描述为一个顺序可变设备自动机(VA)。我们在枚举算法的设置中提出了这个问题,我们可以首先运行一个预处理阶段,然后必须在任何两个连续的结果之间以较小的延迟产生结果。我们的目标是拥有一种算法,该算法在组合的复杂性中,即输入文档和VA的大小。同时确保输入文档大小的最佳数据复杂性界限,即文档大小的恒定延迟。 PODS'18最近的几项作品提出了这种算法,但文档大小的线性延迟或指数依赖性的依赖性(通常不确定)输入VA。特别是Florenzano等。建议对于一般顺序VA,无法满足我们所需的运行时保证。我们对此进行驳斥并表明,鉴于未确定的顺序VA和输入文档,我们可以用以下范围枚举文档上VA的映射:文档大小的预处理是线性的,并且在VA的尺寸中是多项式的,并且延迟在VA的尺寸中独立于文档和多种物体。因此,所得算法在组合复杂性和最佳数据复杂性范围中实现了障碍。此外,很容易描述,特别是对于所谓的扩展VAS的受限情况。最后,我们使用原型实现从经验上评估算法。
We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA; while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.