论文标题

文档跨度语法

Grammars for Document Spanners

论文作者

Peterfreund, Liat

论文摘要

我们提出了一种新的基于语法的语言,用于从文档(文本)定义信息提取器,该语言构建在文档跨度的经过精心研究的框架上,用于从文本中提取结构化数据。虽然先前研究的文档跨度形式主义主要基于正则表达式,但我们使用称为{refaction语法}的无上下文语法的扩展来定义新的无上下文跨度类别。提取语法仅仅是扩展的无上下文语法,并具有捕获文档间隔位置的变量,即跨越跨度。虽然正则表达式有效地用于令牌化和标记,但无上下文的语法也有效地捕获结构属性。确实,我们表明,无上下文的跨度比其常规同行更具表现力。我们推理了我们的新班级的表现力,并提出了捕获它的俯卧撑模型。我们表明,可以通过多项式数据复杂性评估提取语法。然而,由于多项式的程度取决于查询,因此我们提出了一种明确提取语法的枚举算法,该算法在五五位化预处理后,在每两个连续两次连续的情况下均延迟延迟,该语法在未经重复的情况下依次输出结果。

We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called {extraction grammars}, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源