Data balancing for boosting performance of low-frequency classes in Spoken Language Understanding

Paper title

Data balancing for boosting performance of low-frequency classes in Spoken Language Understanding

Authors

Gaspers, Judith; Do, Quynh; Triefenbach, Fabian

Abstract

Although data imbalance is becoming increasingly common in real-world Spoken Language Understanding (SLU) applications, it has not been studied extensively in the literature. To the best of our knowledge, this paper presents the first systematic study on handling data imbalance for SLU. In particular, we discuss the application of existing data balancing techniques for SLU and propose a multi-task SLU model for intent classification and slot filling. Aiming to avoid over-fitting, in our model, methods for data balancing are leveraged indirectly via an auxiliary task which makes use of a class-balanced batch generator and (possibly) synthetic data. Our results on a real-world dataset indicate that i) our proposed model can boost performance on low-frequency intents significantly while avoiding a potential performance decrease on the head intents, ii) synthetic data are beneficial for bootstrapping new intents when realistic data are not available, but iii) once a certain amount of realistic data becomes available, using synthetic data in the auxiliary task only yields better performance than adding them to the primary task training data, and iv) in a joint training scenario, balancing the intent distribution individually improves not only intent classification but also slot filling performance.
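The abstract mentions a class-balanced batch generator for the auxiliary task but does not describe how it is built. The sketch below is one common way to get class-balanced batches, not the authors' implementation: each batch slot first picks an intent uniformly at random, then picks an utterance of that intent with replacement. The function name `class_balanced_batches`, its parameters, and the toy intents `PlayMusic` and `SetAlarm` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def class_balanced_batches(examples, batch_size, num_batches, seed=0):
    """Yield batches whose intents are drawn uniformly, regardless of frequency.

    `examples` is a list of (utterance, intent) pairs. This is an assumed,
    illustrative sampling scheme, not the generator used in the paper.
    """
    rng = random.Random(seed)

    # Group utterances by intent so each intent can be sampled separately.
    by_intent = defaultdict(list)
    for utterance, intent in examples:
        by_intent[intent].append(utterance)
    intents = sorted(by_intent)

    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            # Pick the intent first (uniform over classes), then an utterance
            # of that intent. Rare intents are therefore re-used more often.
            intent = rng.choice(intents)
            batch.append((rng.choice(by_intent[intent]), intent))
        yield batch


# Toy, made-up data: a head intent with 98 utterances and a tail intent with 2.
toy_examples = [(f"play song {i}", "PlayMusic") for i in range(98)]
toy_examples += [("wake me at six", "SetAlarm"), ("set an alarm", "SetAlarm")]

for batch in class_balanced_batches(toy_examples, batch_size=4, num_batches=2):
    print([intent for _, intent in batch])
```

Because tail-intent utterances are repeated often under this scheme, a model trained only on such batches could over-fit to them, which is consistent with the abstract's motivation for applying balancing in an auxiliary task instead of the primary one.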
