Paper Title

BN-HTRd: A Benchmark Dataset for Document Level Offline Bangla Handwritten Text Recognition (HTR) and Line Segmentation

Authors

Rahman, Md. Ataur; Tabassum, Nazifa; Paul, Mitu; Pal, Riya; Islam, Mohammad Khairul

Abstract

We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla script, comprising word-, line-, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus, which serves as the ground-truth text. These texts were subsequently used to generate the annotations that were filled out by people with their handwriting. Our dataset includes 788 images of handwritten pages produced by approximately 150 different writers. It can be adopted as a basis for various handwriting classification tasks such as end-to-end document recognition, word spotting, and word or line segmentation. We also propose a scheme to segment Bangla handwritten document images into their corresponding lines in an unsupervised manner. Our line segmentation approach accounts for the variability across writing styles and accurately segments complex, curvilinear handwritten text lines. Along with several pre-processing and morphological operations, both Hough line and circle transforms were employed to distinguish different linear components. To arrange those components into their corresponding lines, we followed an unsupervised clustering approach. The average success rate of our segmentation technique is 81.57% in terms of FM metrics (similar to F-measure), with a mean Average Precision (mAP) of 0.547.
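
The abstract does not say which clustering algorithm arranges the detected components into lines, so the sketch below is only an illustration of the general idea and not the authors' method. It assumes each component has already been reduced to an (x, y) centroid, and it uses a simple vertical-gap rule as a stand-in for the unsupervised clustering step. The function name `group_into_lines` and the `max_gap` parameter are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) centroid of one detected component (assumed input)


def group_into_lines(centroids: List[Point], max_gap: float) -> List[List[Point]]:
    """Group centroids into text lines by splitting on large vertical gaps.

    This is a stand-in for the clustering step, not the paper's algorithm.
    """
    lines: List[List[Point]] = []
    for point in sorted(centroids, key=lambda p: p[1]):
        # Stay on the current line while the vertical jump from the
        # previously added centroid is at most max_gap; otherwise start a new line.
        if lines and point[1] - lines[-1][-1][1] <= max_gap:
            lines[-1].append(point)
        else:
            lines.append([point])
    # Order the components of each line left to right.
    return [sorted(line, key=lambda p: p[0]) for line in lines]


if __name__ == "__main__":
    sample = [(50, 12), (10, 10), (30, 48), (90, 14), (5, 52)]
    for line in group_into_lines(sample, max_gap=15):
        print(line)
```

A fixed vertical threshold like this would likely fail on the strongly curved or skewed lines the abstract mentions, which is presumably why the authors combine Hough transforms with a proper clustering approach.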
