Paper Title
Evaluating Persian Tokenizers
Paper Authors
Abstract
Tokenization plays a significant role in the process of lexical analysis. Tokens become the input for other natural language processing tasks, such as semantic parsing and language modeling. Natural language processing in Persian is challenging due to exceptional cases of the language, such as half-spaces. Thus, it is crucial to have a precise tokenizer for Persian. This article presents a novel contribution by introducing the most widely used Persian tokenizers and comparing and evaluating their performance on Persian texts using a simple algorithm and a pre-tagged Persian dependency dataset. After evaluating the tokenizers with the F1 score, the hybrid version of the Farsi Verb tokenizer and Hazm with bound-morpheme fixing performed best, achieving an F1 score of 98.97%.
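The evaluation described above can be illustrated with a minimal sketch (not the paper's actual algorithm): a predicted token counts as correct only if its character span matches a token span in the gold segmentation, and F1 is the harmonic mean of the resulting precision and recall. The example strings below are hypothetical.

```python
# Minimal sketch of token-level F1 evaluation against a gold segmentation.
# Assumption: gold and predicted tokens segment the same underlying string,
# so tokens can be aligned by their character spans.

def token_spans(tokens):
    """Map a token sequence to the set of (start, end) character spans."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def token_f1(gold_tokens, pred_tokens):
    """F1 of predicted token boundaries against the gold tokenization."""
    gold, pred = token_spans(gold_tokens), token_spans(pred_tokens)
    tp = len(gold & pred)  # tokens whose spans match exactly
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the tokenizer over-splits one gold token.
gold = ["he", "said", "hello"]
pred = ["he", "sa", "id", "hello"]
print(token_f1(gold, pred))  # 2 matches of 4 predicted / 3 gold tokens
```

In this sketch, precision is 2/4 and recall is 2/3, giving an F1 of 4/7. The paper's own pipeline additionally normalizes half-spaces and fixes bound morphemes before scoring, which this sketch does not attempt.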