Paper Title
Constraining Representations Yields Models That Know What They Don't Know
Paper Authors
Paper Abstract
A well-known failure mode of neural networks is that they may confidently return erroneous predictions. Such unsafe behaviour is particularly frequent when the use case differs slightly from the training context, and/or in the presence of an adversary. This work presents a novel direction to address these issues in a broad, general manner: imposing class-aware constraints on a model's internal activation patterns. Specifically, we assign to each class a unique, fixed, randomly-generated binary vector - hereafter called a class code - and train the model so that its cross-depth activation patterns predict the appropriate class code according to the input sample's class. The resulting predictors are dubbed Total Activation Classifiers (TAC), and TACs may either be trained from scratch, or used with negligible cost as a thin add-on on top of a frozen, pre-trained neural network. The distance between a TAC's activation pattern and the closest valid code acts as an additional confidence score, besides that of the default unTAC'ed prediction head. In the add-on case, the original neural network's inference head is completely unaffected (so its accuracy remains the same), but we now have the option to use TAC's own confidence and prediction when determining which course of action to take in a hypothetical production workflow. In particular, we show that TAC strictly improves the value derived from models allowed to reject/defer. We provide further empirical evidence that TAC works well on multiple types of architectures and data modalities and that it is at least as good as state-of-the-art alternative confidence scores derived from existing models.
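The abstract leaves the exact architecture of the code-predicting head unspecified, but the core inference rule it describes (match the activation pattern to the nearest fixed random class code and use that distance as a confidence score) can be illustrated with a minimal sketch. The snippet below assumes the model has already pooled and projected its cross-depth activations into a vector of length CODE_BITS; the names tac_predict, CODE_BITS, and the choice of ±1 codes with Euclidean distance are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 10
CODE_BITS = 64  # hypothetical code length; the paper's choice may differ

# One fixed, randomly-generated binary code per class (encoded as +/-1 here
# so that Euclidean distance gives a simple notion of "closest valid code").
class_codes = rng.integers(0, 2, size=(NUM_CLASSES, CODE_BITS)) * 2 - 1

def tac_predict(activation_pattern: np.ndarray):
    """Given a (pooled, projected) cross-depth activation pattern of length
    CODE_BITS, return the class whose code is closest, plus a confidence
    score defined as the negated distance to that closest valid code."""
    dists = np.linalg.norm(class_codes - activation_pattern, axis=1)
    pred = int(np.argmin(dists))
    confidence = -float(dists[pred])  # higher = closer to a valid code
    return pred, confidence

# Toy usage: a lightly perturbed copy of class 3's code is recovered with
# high confidence, while unstructured noise yields a much lower score and
# could be rejected/deferred in a production workflow.
clean = class_codes[3] + 0.1 * rng.standard_normal(CODE_BITS)
noise = rng.standard_normal(CODE_BITS)
print(tac_predict(clean))
print(tac_predict(noise))
```

In the add-on setting described above, such a scorer would sit beside the frozen network's original head: the head's prediction and accuracy are untouched, and the code-distance score simply provides an extra signal for deciding when to trust, reject, or defer.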