Paper Title

A Study on the Evaluation of Generative Models

Authors

Eyal Betzalel, Coby Penso, Aviv Navon, Ethan Fetaya

Abstract

Implicit generative models, which do not return likelihood values, such as generative adversarial networks and diffusion models, have become prevalent in recent years. While these models have shown remarkable results, evaluating their performance is challenging. This issue is of vital importance for pushing research forward and for distinguishing meaningful gains from random noise. Currently, heuristic metrics such as the Inception Score (IS) and Fréchet Inception Distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear. Additionally, there are questions regarding how meaningful their scores actually are. In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset on which we can estimate classical metrics for comparison. Our study shows that while FID and IS do correlate with several f-divergences, their rankings of close models can vary considerably, making them problematic when used for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics. Lastly, we look into the base features used for metrics such as FID.
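For reference, FID fits a Gaussian to the Inception features of the real and generated samples and reports the Fréchet distance between the two fits: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). Below is a minimal NumPy/SciPy sketch of that standard computation, not the paper's own implementation; the function name and feature arrays are illustrative placeholders, and extracting the features with an Inception network is assumed to happen elsewhere.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: (N, D) arrays of image features
    (e.g., Inception activations); placeholders for illustration.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)

    # Squared Euclidean distance between the two means.
    diff = mu1 - mu2

    # Matrix square root of the covariance product; numerical error can
    # introduce tiny imaginary components, which we discard.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

In practice the sample count N is typically in the tens of thousands, since the D×D covariance estimates are noisy for small samples and FID is biased at low N.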
