Paper Title
Business Document Information Extraction: Towards Practical Benchmarks
Paper Authors
Paper Abstract
Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the landscape of Document IE problems, datasets and benchmarks. We highlight the practical aspects missing in the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive. We discuss potential sources of available documents including synthetic data.