Paper Title
A Multilingual Parallel Corpora Collection Effort for Indian Languages
Paper Authors
Paper Abstract
We present sentence-aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi - and English, many of which are categorized as low-resource. The corpora are compiled from online sources whose content is shared across languages. The corpora presented significantly extend existing resources, which are either not large enough or restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used independently to validate performance in the 10 Indian languages. In addition, we report on the methods used to construct such corpora, which rely on tools enabled by recent advances in deep-neural-network-based machine translation and cross-lingual retrieval.