Learning from similarity and information extraction from structured documents

Paper title

Learning from similarity and information extraction from structured documents

Author

Holeček, Martin

Abstract

The automation of document processing has recently gained attention because improved methods and hardware offer great potential to reduce manual work. Neural networks have been applied successfully before, even though they have so far been trained only on relatively small datasets with hundreds of documents. To successfully explore deep learning techniques and improve the information extraction results, a dataset with more than twenty-five thousand documents has been compiled, anonymized and published as a part of this work. We expand our previous work, in which we showed that convolutions, graph convolutions and self-attention can work together and exploit all the information present in a structured document. Taking the fully trainable method one step further, we now design and examine various approaches to using siamese networks, concepts of similarity, one-shot learning and context/memory awareness. The aim is to improve the micro F1 of per-word classification on the huge real-world document dataset. The results verify the hypothesis that trainable access to a similar (yet still different) page together with its already known target information improves the information extraction. Furthermore, the experiments confirm that all proposed architecture parts are required to beat the previous results. The best model improves the previous state-of-the-art results by an 8.25 gain in F1 score. Qualitative analysis is provided to verify that the new model performs better for all target classes. Additionally, multiple structural observations about the causes of the underperformance of some architectures are revealed. All the source codes, parameters and implementation details are published together with the dataset in the hope of pushing the research boundaries, since none of the techniques used in this work is problem-specific and all of them can be generalized to other tasks and contexts.
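
The abstract names micro F1 of per-word classification as the target metric. As a rough illustration only, the sketch below computes a micro-averaged F1 by pooling true positives, false positives and false negatives over all words. It assumes a single label per word and a background label `"O"` that is excluded from the counts; the function name `micro_f1`, the label names and the single-label setup are illustrative assumptions, not details taken from the paper or its published code.

```python
def micro_f1(y_true, y_pred, background="O"):
    """Micro-averaged F1 over per-word labels (illustrative sketch).

    Counts are pooled across all target classes; the background
    label is treated as "no target" and never counted as a hit.
    """
    tp = fp = fn = 0
    for true_label, pred_label in zip(y_true, y_pred):
        if pred_label != background and pred_label == true_label:
            tp += 1
        else:
            if pred_label != background:
                fp += 1  # predicted a target class that is wrong
            if true_label != background:
                fn += 1  # missed (or mislabeled) a true target word
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six words on one page.
y_true = ["O", "total", "date", "O", "vendor", "O"]
y_pred = ["O", "total", "O", "date", "vendor", "O"]
print(micro_f1(y_true, y_pred))
```

In this toy page two words are labeled correctly, one target word is missed and one background word is wrongly tagged, so precision and recall are both 2/3. Because the counts are pooled before averaging, frequent classes weigh more than rare ones.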
