Paper Title
LawngNLI: A Long-Premise Benchmark for In-Domain Generalization from Short to Long Contexts and for Implication-Based Retrieval
Paper Authors
Paper Abstract
Natural language inference has trended toward studying contexts beyond the sentence level. An important application area is law: past cases often do not foretell how they apply to new situations, and implications must be inferred. This paper introduces LawngNLI, constructed from U.S. legal opinions, with automatic labels that have high human-validated accuracy. Premises are long and multigranular. Experiments show two use cases. First, LawngNLI can benchmark in-domain generalization from short to long contexts. It has remained unclear whether large-scale long-premise NLI datasets actually need to be constructed: near-top performance on long premises might be achievable by fine-tuning on short premises alone. Without multigranularity, benchmarks cannot distinguish a lack of fine-tuning on long premises from domain shift between short and long datasets. In contrast, our long and short premises share the same examples and domain. Models fine-tuned on several past NLI datasets and/or our short premises fall short of top performance on our long premises, so for at least certain domains (such as ours) large-scale long-premise datasets are needed. Second, LawngNLI can benchmark implication-based retrieval. Queries are entailed or contradicted by target documents, allowing users to move between arguments and evidence. Leading retrieval models perform reasonably well zero-shot on a LawngNLI-derived retrieval task. We compare different re-ranking systems, including lexical overlap and cross-encoders fine-tuned on a modified LawngNLI or on past NLI datasets. LawngNLI can train and test systems for implication-based case retrieval and argumentation.
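The retrieve-then-re-rank setup the abstract describes can be made concrete with a minimal sketch: lexical retrieval (BM25) proposes candidate cases, and an NLI cross-encoder re-ranks them by how strongly each candidate entails or contradicts the query. This is an illustration under stated assumptions, not the paper's released pipeline: the toy corpus is invented, and the off-the-shelf MNLI cross-encoder stands in for a model fine-tuned on a modified LawngNLI.

```python
# Sketch of implication-based retrieval: BM25 candidates re-ranked by an
# NLI cross-encoder. Corpus and model choice are illustrative placeholders.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "The court held that the statute of limitations barred the claim.",
    "The appellate court reversed, finding the contract unconscionable.",
    "Summary judgment was granted because no material facts were disputed.",
]
query = "The claim was dismissed as untimely."

# Stage 1: lexical retrieval over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

# Stage 2: re-rank with an NLI cross-encoder. The target document acts as
# the premise and the query as the hypothesis, matching the task framing
# where queries are entailed or contradicted by target documents.
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
pairs = [(corpus[i], query) for i in candidates]
logits = nli.predict(pairs)  # columns: contradiction, entailment, neutral

# Implication-based retrieval treats both entailment and contradiction as
# relevant, so each candidate is scored by the stronger of the two.
reranked = sorted(
    zip(candidates, logits),
    key=lambda pair: max(pair[1][0], pair[1][1]),
    reverse=True,
)
for idx, logit in reranked:
    print(f"score={max(logit[0], logit[1]):.2f}  {corpus[idx]}")
```

Swapping the stage-2 scorer between plain lexical overlap and cross-encoders fine-tuned on different NLI data reproduces, in miniature, the re-ranking comparison the abstract reports.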