Impossibility Theorems for Feature Attribution

Paper Title

Impossibility Theorems for Feature Attribution

Authors

Blair Bilodeau, Natasha Jaques, Pang Wei Koh, Been Kim

Abstract

Despite a sea of interpretability methods that can produce plausible explanations, the field has also empirically seen many failure cases of such methods. In light of these results, it remains unclear for practitioners how to use these methods and choose between them in a principled way. In this paper, we show that for moderately rich model classes (easily satisfied by neural networks), any feature attribution method that is complete and linear -- for example, Integrated Gradients and SHAP -- can provably fail to improve on random guessing for inferring model behaviour. Our results apply to common end-tasks such as characterizing local model behaviour, identifying spurious features, and algorithmic recourse. One takeaway from our work is the importance of concretely defining end-tasks: once such an end-task is defined, a simple and direct approach of repeated model evaluations can outperform many other complex feature attribution methods.
