Paper Title

D3Former: Debiased Dual Distilled Transformer for Incremental Learning

Paper Authors

Abdelrahman Mohamed, Rushali Grandhe, K J Joseph, Salman Khan, Fahad Khan

Abstract

In the class incremental learning (CIL) setting, groups of classes are introduced to a model in each learning phase. The goal is to learn a unified model that performs well on all the classes observed so far. Given the recent popularity of Vision Transformers (ViTs) in conventional classification settings, an interesting question is to study their continual learning behaviour. In this work, we develop a Debiased Dual Distilled Transformer for CIL dubbed $\textrm{D}^3\textrm{Former}$. The proposed model leverages a hybrid nested ViT design to ensure data efficiency and scalability to small as well as large datasets. In contrast to a recent ViT-based CIL approach, our $\textrm{D}^3\textrm{Former}$ does not dynamically expand its architecture when new tasks are learned and remains suitable for a large number of incremental tasks. The improved CIL behaviour of $\textrm{D}^3\textrm{Former}$ owes to two fundamental changes to the ViT design. First, we treat incremental learning as a long-tail classification problem, where the abundant samples from new classes vastly outnumber the limited exemplars available for old classes. To avoid the bias against the minority old classes, we propose to dynamically adjust logits to emphasize retaining the representations relevant to old tasks. Second, we propose to preserve the configuration of spatial attention maps as learning progresses across tasks. This helps reduce catastrophic forgetting by constraining the model to retain its attention on the most discriminative regions. $\textrm{D}^3\textrm{Former}$ obtains favorable results on incremental versions of the CIFAR-100, MNIST, SVHN, and ImageNet datasets. Code is available at https://tinyurl.com/d3former.
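
The abstract describes two mechanisms: dynamically adjusting logits to counter the long-tail bias against old classes, and distilling spatial attention maps across tasks. The following PyTorch sketch illustrates how such losses are commonly formed; the function names, the `tau` and `lambda_at` hyperparameters, and the assumed tensor shapes are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def debiased_logits(logits, class_counts, tau=1.0):
    # Logit adjustment for the long-tail view of CIL: shift each class logit
    # by the (scaled) log class prior during training so that minority old
    # classes are not systematically suppressed by the abundant new classes.
    prior = class_counts.float() / class_counts.sum()
    return logits + tau * torch.log(prior + 1e-12)

def attention_distillation_loss(attn_old, attn_new):
    # Encourage the current model to keep the spatial attention configuration
    # of the frozen previous-phase model. Both inputs are assumed to be
    # attention maps of shape (batch, heads, tokens, tokens).
    a_old = F.normalize(attn_old.flatten(start_dim=1), dim=1)
    a_new = F.normalize(attn_new.flatten(start_dim=1), dim=1)
    return (a_old - a_new).pow(2).sum(dim=1).mean()

def incremental_loss(logits, targets, class_counts, attn_old, attn_new, lambda_at=0.1):
    # Combined objective for one incremental phase: cross-entropy on the
    # debiased logits plus the attention-map distillation term.
    ce = F.cross_entropy(debiased_logits(logits, class_counts), targets)
    return ce + lambda_at * attention_distillation_loss(attn_old, attn_new)
```

In this sketch, `class_counts` would hold the number of training samples available per class in the current phase (abundant for new classes, only the stored exemplars for old ones), which is what makes the logit adjustment act as a debiasing term.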
