Domain Adaptation Principal Component Analysis: base linear method for learning with out-of-distribution data

Paper title

Domain Adaptation Principal Component Analysis: base linear method for learning with out-of-distribution data

Authors

Mirkes, Evgeny M.; Bac, Jonathan; Fouché, Aziz; Stasenko, Sergey V.; Zinovyev, Andrei; Gorban, Alexander N.

Abstract

Domain adaptation is a popular paradigm in modern machine learning that aims to tackle the divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, which frequently makes them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points and generalizes the supervised extension of principal component analysis. DAPCA is an iterative algorithm that solves a simple quadratic optimization problem at each iteration. The convergence of the algorithm is guaranteed, and the number of iterations is small in practice. We validate the suggested algorithm on previously proposed benchmarks for solving the domain adaptation task. We also show the benefit of using DAPCA in analyzing single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a practical preprocessing step in many machine learning applications, leading to reduced dataset representations that take into account possible divergence between source and target domains.
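
The abstract's idea of placing positive and negative weights between pairs of data points can be illustrated with a minimal sketch. It covers only a pairwise-weighted PCA step of the kind a supervised extension of PCA might use, not the DAPCA algorithm itself: the source-target weights and the iterative update are omitted. The function names, the `alpha` parameter, and the specific weighting rule (+1 for pairs from different classes, -alpha for pairs from the same class) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def weighted_pairwise_pca(X, W, n_components):
    """Maximize sum_ij W_ij * ||V^T (x_i - x_j)||^2 over orthonormal V.

    For symmetric W this sum equals 2 * trace(V^T X^T (D - W) X V), with
    D = diag(row sums of W), so the optimum is spanned by the top
    eigenvectors of Q = X^T (D - W) X.
    """
    W = 0.5 * (W + W.T)                      # enforce symmetry
    L = np.diag(W.sum(axis=1)) - W           # Laplacian-like matrix of the weights
    Q = X.T @ L @ X
    Q = 0.5 * (Q + Q.T)                      # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(Q)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]

def supervised_weights(y, alpha=1.0):
    """Hypothetical weighting rule (an assumption, not the paper's):
    +1 pushes apart points with different labels,
    -alpha pulls together points with the same label."""
    y = np.asarray(y)
    same = y[:, None] == y[None, :]
    W = np.where(same, -alpha, 1.0)
    np.fill_diagonal(W, 0.0)                 # self-pairs contribute nothing
    return W
```

With all weights equal to 1 the objective reduces to ordinary PCA, since the weighted matrix becomes a multiple of the centered scatter matrix. Projecting labeled data with `V, _ = weighted_pairwise_pca(X, supervised_weights(y), 2)` followed by `X @ V` gives a two-dimensional representation under the assumed weighting.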
