Paper title
Processing topical queries on images of historical newspaper pages
Paper authors
Paper abstract
Historical newspapers are a research source for the humanities and social sciences. However, these image collections are difficult to read by machine because of low print quality, the lack of page standardization, and the poor photographic quality of some files. This paper presents the processing model of a topic navigation system for images of historical newspaper pages. The general procedure consists of four modules: segmentation of text sub-images and text extraction; preprocessing and representation; induced topic extraction and representation; and a document viewing and retrieval interface. The algorithmic and technological approaches of each module are described, and initial test results on a collection spanning 28 years are presented.