Paper Title
Principled Paraphrase Generation with Parallel Corpora
Paper Authors
Paper Abstract
Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match, and implement a relaxation of it through the Information Bottleneck method. Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible, while keeping as little information about the input as possible. Paraphrases can be generated by decoding back to the source from this representation, without having to generate pivot translations. In addition to being more principled and efficient than round-trip MT, our approach offers an adjustable parameter to control the fidelity-diversity trade-off, and obtains better results in our experiments.
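To make the Information Bottleneck framing concrete, the objective described in the abstract can be sketched as follows. This is a schematic rendering rather than the paper's exact formulation: the symbols $z$, $\theta$, $\phi$, and $\beta$ are notation introduced here for illustration, where $z = f_\theta(x)$ is the learned representation of the source $x$, $y$ is the reference translation, and $\beta$ plays the role of the adjustable fidelity-diversity parameter. The encoder-decoder seeks a representation informative about $y$ but uninformative about $x$,

\[
\max_{\theta} \; I(z; y) - \beta\, I(z; x), \qquad z = f_\theta(x),
\]

which, under standard variational bounds, can be relaxed into an adversarial minimax objective in which an auxiliary reconstructor $q_\phi(x \mid z)$ tries to recover the input while the MT model suppresses that signal:

\[
\min_{\theta} \; \max_{\phi} \;\; \mathbb{E}_{(x, y)} \Big[ -\log p_\theta\big(y \mid f_\theta(x)\big) \;+\; \beta \log q_\phi\big(x \mid f_\theta(x)\big) \Big].
\]

Under this sketch, paraphrases would be generated by decoding back into the source language directly from $z$, without producing an intermediate pivot translation.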