论文标题
哈佛大学USPTO专利数据集:大规模,结构良好且多功能专利申请的专利申请
The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications
论文作者
论文摘要
创新是经济和社会发展的主要驱动力,有关许多创新的信息嵌入了专利和专利申请的半结构化数据中。尽管在专利数据中表达的创新的影响和新颖性很难通过传统手段来衡量,但ML提供了一套有希望的技术,用于评估新颖性,汇总贡献和嵌入语义。在本文中,我们介绍了哈佛大学USPTO专利数据集(HUPD),这是一个大规模,结构良好的和多用途的英语专利专利申请,该专利申请在2004年至2018年之间提交给美国专利和商标办公室(USPTO)。与以前在NLP中提出的专利数据集不同,HUPD包含了发明人提交的专利申请版本(不是授予专利的最终版本),其中首次使用NLP方法在申请申请时可以研究专利性。它在包含丰富的结构化元数据以及专利申请的文本旁边也很新颖:通过提供每个应用程序的元数据及其所有文本字段,数据集使研究人员能够执行一组新的NLP任务,这些任务利用结构化的协变量中利用变异的变化。作为有关HUPD的研究类型的案例研究,我们向NLP社区(即专利决策的二元分类)介绍了一项新任务。我们还显示,数据集中提供的结构化元数据使我们能够对此任务进行概念转移的明确研究。最后,我们演示了如何将HUPD用于三个其他任务:专利主题领域的多类分类,语言建模和摘要。
Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.