Paper title
Binary and Multitask Classification Model for Dutch Anaphora Resolution: Die/Dat Prediction
Paper authors
Paper abstract
The correct use of the Dutch pronouns 'die' and 'dat' is a stumbling block for both native and non-native speakers of Dutch because of the multiplicity of their syntactic functions and their dependency on the antecedent's gender and number. Drawing on previous research on neural context-dependent dt-mistake correction models (Heyman et al. 2018), this study constructs the first neural network model for Dutch demonstrative and relative pronoun resolution that specifically focuses on the correction and part-of-speech prediction of these two pronouns. Two separate datasets are built with sentences obtained from, respectively, the Dutch Europarl corpus (Koehn 2015), which contains the proceedings of the European Parliament from 1996 to the present, and the SoNaR corpus (Oostdijk et al. 2013), which contains Dutch texts from a variety of domains such as newspapers, blogs and legal texts. Firstly, a binary classification model solely predicts the correct 'die' or 'dat'. The classifier with a bidirectional long short-term memory architecture achieves 84.56% accuracy. Secondly, a multitask classification model simultaneously predicts the correct 'die' or 'dat' and its part-of-speech tag. The model that combines a sentence encoder and a context encoder, both with a bidirectional long short-term memory architecture, achieves 88.63% accuracy for die/dat prediction and 87.73% accuracy for part-of-speech prediction. More evenly balanced data, larger word embeddings, an extra bidirectional long short-term memory layer and integrated part-of-speech knowledge positively affect die/dat prediction performance, while a context encoder architecture raises part-of-speech prediction performance. This study shows promising results and can serve as a starting point for future research on machine learning models for Dutch anaphora resolution.
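To make the binary prediction task concrete, the following is a minimal sketch of how corpus sentences might be turned into labelled examples. The abstract does not describe the actual preprocessing, so everything here is an assumption for illustration: the placeholder token `<PRONOUN>`, the helper name `make_examples` and the naive regex tokenisation are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical placeholder token; the abstract does not say how the
# target pronoun is marked in the model input.
MASK = "<PRONOUN>"
TARGETS = {"die", "dat"}


def make_examples(sentence: str) -> List[Tuple[List[str], str]]:
    """Return one (masked_tokens, label) pair per 'die'/'dat' occurrence.

    Illustrative only: tokenisation is a naive split into word
    sequences and single punctuation marks.
    """
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    examples = []
    for i, token in enumerate(tokens):
        if token.lower() in TARGETS:
            # Hide only the current occurrence; other pronouns stay visible.
            masked = tokens[:i] + [MASK] + tokens[i + 1:]
            examples.append((masked, token.lower()))
    return examples


if __name__ == "__main__":
    for masked, label in make_examples(
        "De man die daar loopt, zegt dat het regent."
    ):
        print(label, masked)
```

Each example hides exactly one occurrence, so a classifier sees the surrounding sentence and must recover the hidden pronoun. In a multitask variant, each example would carry a part-of-speech tag as a second label alongside the pronoun.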