Paper Title
PyTAIL: Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data
Paper Authors
Paper Abstract
Online data streams make training machine learning models hard because of distribution shift and new patterns emerging over time. For natural language processing (NLP) tasks that utilize a collection of features based on lexicons and rules, it is important to adapt these features to the changing data. To address this challenge, we introduce PyTAIL, a Python library that enables a human-in-the-loop approach to actively training NLP models. PyTAIL enhances generic active learning, which only suggests new instances to label, by also suggesting new features such as rules and lexicons to label. Furthermore, PyTAIL is flexible enough to let users accept, reject, or update rules and lexicons as the model is being trained. We simulate the performance of PyTAIL on existing social media benchmark datasets for text classification and compare various active learning strategies on these benchmarks. The model closes the performance gap using as few as 10% of the training data. Finally, we also highlight the importance of tracking evaluation metrics on the remaining data (which has not yet been merged through active learning) alongside the test dataset. This highlights the effectiveness of the model in accurately annotating the remaining dataset, which is especially suitable for batch processing of large unlabelled corpora. PyTAIL will be available at https://github.com/socialmediaie/pytail.
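The abstract describes the workflow only at a high level, so the sketch below illustrates the two kinds of human feedback it mentions, instance labels and lexicon suggestions, as a generic active learning loop. This is not PyTAIL's actual API: the function names (suggest_instances, suggest_lexicon_terms), the scikit-learn model, the uncertainty-sampling strategy, and the toy data are all assumptions made for illustration.

```python
# Minimal, hypothetical sketch of a human-in-the-loop active learning loop in the
# spirit of PyTAIL. None of these names come from PyTAIL itself; the model,
# featurization, and query strategy are assumptions for illustration only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def suggest_instances(model, X_pool, k=1):
    """Uncertainty sampling: rank pool items the model is least sure about."""
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)
    return np.argsort(-uncertainty)[:k]


def suggest_lexicon_terms(model, vectorizer, k=3):
    """Suggest high-weight vocabulary terms as candidate lexicon entries."""
    terms = np.array(vectorizer.get_feature_names_out())
    weights = np.abs(model.coef_).max(axis=0)
    return terms[np.argsort(-weights)[:k]]


# Toy labelled seed set and unlabelled pool (stand-ins for a real data stream).
seed_texts = ["great product", "terrible service", "loved it", "awful experience"]
seed_labels = [1, 0, 1, 0]
pool_texts = ["not bad at all", "worst purchase ever", "pretty decent", "so disappointing"]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(seed_texts), seed_labels)

for round_id in range(2):  # each round: query the human, then retrain
    # 1) Instance-level feedback: ask for a label on the most uncertain item.
    idx = int(suggest_instances(model, vectorizer.transform(pool_texts), k=1)[0])
    text = pool_texts.pop(idx)
    seed_texts.append(text)
    seed_labels.append(int(input(f"Label for '{text}' (0/1): ")))
    # 2) Feature-level feedback: let the human accept or reject lexicon terms.
    for term in suggest_lexicon_terms(model, vectorizer, k=3):
        decision = input(f"Keep lexicon term '{term}'? (accept/reject): ")
        print(f"{term}: {decision}")  # a real tool would fold accepted terms into the feature set
    # 3) Retrain on the augmented labelled set.
    vectorizer = CountVectorizer()
    model = LogisticRegression().fit(vectorizer.fit_transform(seed_texts), seed_labels)
```

Uncertainty sampling is only one of the active learning strategies the paper compares; in a real session the accepted lexicon terms would be folded back into the feature set rather than merely printed, mirroring the accept, reject, or update feedback on rules and lexicons described above.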