Paper title
Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions
Paper authors
Abstract
Data programming is a programmatic weak supervision approach to efficiently curate large-scale labeled training data. Writing data programs (labeling functions) requires, however, both programming literacy and domain expertise. Many subject matter experts have neither programming proficiency nor time to effectively write data programs. Furthermore, regardless of one's expertise in coding or machine learning, transferring domain expertise into labeling functions by enumerating rules and thresholds is not only time consuming but also inherently difficult. Here we propose a new framework, data programming by demonstration (DPBD), to generate labeling rules using interactive demonstrations of users. DPBD aims to relieve the burden of writing labeling functions from users, enabling them to focus on higher-level semantics such as identifying relevant signals for labeling tasks. We operationalize our framework with Ruler, an interactive system that synthesizes labeling rules for document classification by using span-level annotations of users on document examples. We compare Ruler with conventional data programming through a user study conducted with 10 data scientists creating labeling functions for sentiment and spam classification tasks. We find that Ruler is easier to use and learn and offers higher overall satisfaction, while providing discriminative model performances comparable to ones achieved by conventional data programming.
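To make the idea of a labeling function concrete, the following is a minimal Python sketch and not Ruler's actual synthesis algorithm. It assumes that each user demonstration is a highlighted text span paired with a label, and it turns each span into a simple keyword rule that votes for that label or abstains. The names `keyword_rule`, `majority_vote`, and `weak_label`, and the example spans, are hypothetical. A plain majority vote stands in for the label model that data programming normally uses to combine noisy votes.

```python
from collections import Counter
from typing import Callable, List

# Label constants for a spam task; ABSTAIN means the rule does not apply.
SPAM, HAM, ABSTAIN = 1, 0, -1

LabelingFunction = Callable[[str], int]


def keyword_rule(span: str, label: int) -> LabelingFunction:
    """Build a rule from a highlighted span: vote `label` if the span occurs."""
    needle = span.lower()

    def lf(document: str) -> int:
        return label if needle in document.lower() else ABSTAIN

    return lf


def majority_vote(votes: List[int]) -> int:
    """Combine votes, ignoring abstentions; ties and no votes give ABSTAIN."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_label(document: str, lfs: List[LabelingFunction]) -> int:
    """Apply every labeling function to a document and aggregate the votes."""
    return majority_vote([lf(document) for lf in lfs])


# Hypothetical demonstrations: (highlighted span, label chosen by the user).
demonstrations = [
    ("free prize", SPAM),
    ("click here", SPAM),
    ("see you at", HAM),
]
rules = [keyword_rule(span, label) for span, label in demonstrations]
```

Calling `weak_label(text, rules)` on unlabeled documents yields noisy training labels, and documents matched by no rule stay unlabeled. A real system would generalize beyond exact substring matches and would weight rules by estimated accuracy instead of counting votes equally.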