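Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization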

Paper title

Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization

Authors

Javier Del Ser, Alejandro Barredo-Arrieta, Natalia Díaz-Rodríguez, Francisco Herrera, Andreas Holzinger

Abstract

There is broad consensus on the importance of deep learning models in tasks involving complex data. When decisions must be transparent, as in human-critical applications, an adequate understanding of these models is often required. Alongside other explainability techniques, trustworthiness can be achieved through counterfactuals, which mirror the way a human becomes familiar with an unknown process: by understanding the hypothetical circumstances under which the output changes. In this work we argue that automated counterfactual generation should consider several aspects of the produced adversarial instances, not only their adversarial capability. To this end, we present a novel framework for generating counterfactual examples that formulates its goal as a multi-objective optimization problem balancing three objectives: 1) plausibility, i.e., how likely the counterfactual is under the distribution of the input data; 2) the intensity of the changes to the original input; and 3) adversarial power, namely, the variability of the model's output induced by the counterfactual. The framework starts from a target model to be audited and uses a Generative Adversarial Network to model the distribution of the input data, together with a multi-objective solver that discovers counterfactuals balancing these objectives. The utility of the framework is showcased on six classification tasks comprising image and three-dimensional data. The experiments verify that the framework unveils counterfactuals that comply with intuition, increase the user's trust, and lead to further insights, such as the detection of bias and data misrepresentation.
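
To make the idea of balancing three objectives concrete, the sketch below filters a handful of candidate counterfactuals down to those that are not dominated on all three scores at once (a Pareto front, one common way to express such a balance). It is an illustration only, not the paper's solver: the function names, the sign conventions (higher plausibility and adversarial power are better, lower change intensity is better), and the scores themselves are assumptions made for this example, and the Generative Adversarial Network that would produce the candidates is left out.

```python
from typing import List, Sequence, Tuple

# Each candidate is scored as (plausibility, change_intensity, adversarial_power).
# Assumed conventions for this sketch only: higher plausibility is better,
# lower change intensity is better, higher adversarial power is better.
Scores = Tuple[float, float, float]


def to_minimization(scores: Scores) -> Scores:
    """Flip signs so that every objective is to be minimized."""
    plausibility, change_intensity, adversarial_power = scores
    return (-plausibility, change_intensity, -adversarial_power)


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates: List[Scores]) -> List[int]:
    """Return indices of candidates that no other candidate dominates."""
    costs = [to_minimization(c) for c in candidates]
    return [
        i
        for i, ci in enumerate(costs)
        if not any(dominates(cj, ci) for j, cj in enumerate(costs) if j != i)
    ]


# Made-up scores for four hypothetical counterfactual candidates.
candidates = [
    (0.9, 0.10, 0.2),  # plausible, small change, weak effect on the output
    (0.6, 0.40, 0.8),  # less plausible, larger change, strong effect
    (0.5, 0.50, 0.7),  # worse than the second candidate on all three scores
    (0.9, 0.05, 0.2),  # like the first candidate, but with a smaller change
]
front = pareto_front(candidates)
print(front)
```

The candidates that survive represent different trade-offs rather than a single best answer, which is the kind of set a multi-objective solver would hand to the person auditing the model.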
