Informative Initialization and Kernel Selection Improves t-SNE for Biological Sequences

Paper Title

Informative Initialization and Kernel Selection Improves t-SNE for Biological Sequences

Authors

Chourasia, Prakash; Ali, Sarwan; Patterson, Murray

Abstract

The t-distributed stochastic neighbor embedding (t-SNE) is a method for interpreting high-dimensional (HD) data by mapping each point to a low-dimensional (LD) space (usually two-dimensional). It seeks to retain the structure of the data. An important component of the t-SNE algorithm is the initialization procedure, which begins with the random initialization of an LD vector. Points in this initial vector are then updated iteratively, using gradient descent, to minimize the loss function (the KL divergence). This causes similar points to attract one another while pushing dissimilar points apart. We believe that, by default, these algorithms should employ some form of informative initialization. Another essential component of t-SNE is the kernel matrix, a similarity matrix built from the pairwise distances among the sequences. For t-SNE-based visualization, the Gaussian kernel is employed by default in the literature. However, we show that kernel selection can also play a crucial role in the performance of t-SNE. In this work, we assess the performance of t-SNE with various alternative initialization methods and kernels, using four different datasets, three of which are biological sequence (nucleotide, protein, etc.) datasets obtained from various sources, such as the well-known GISAID database for sequences of the SARS-CoV-2 virus. We perform subjective and objective assessments of these alternatives. We use the resulting t-SNE plots and k-ary neighborhood agreement (k-ANA) to evaluate and compare the proposed methods with the baselines. We show that techniques such as informed initialization and kernel matrix selection make t-SNE perform significantly better. Moreover, we show that with more intelligent initialization, t-SNE converges in fewer iterations.
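
The abstract names k-ary neighborhood agreement (k-ANA) as its objective measure but does not define it. The following is a minimal sketch of one common formulation, not necessarily the authors' exact procedure: for each point, take its k nearest neighbors in the HD space and in the LD embedding, count how many neighbors the two sets share, and average that count over all points, normalized by k. The function names, the use of Euclidean distance in both spaces, and the tie-breaking by index are assumptions made for illustration.

```python
import math


def knn_indices(points, i, k):
    """Return the indices of the k nearest neighbors of points[i].

    Assumption: Euclidean distance; ties are broken by index so the
    result is deterministic. The point itself is excluded.
    """
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return set(others[:k])


def k_ary_neighborhood_agreement(hd_points, ld_points, k):
    """Fraction of k-nearest neighbors shared between the HD and LD spaces.

    Returns a value in [0, 1]; 1.0 means every point keeps exactly the
    same k nearest neighbors after embedding.
    """
    n = len(hd_points)
    if n != len(ld_points):
        raise ValueError("hd_points and ld_points must have the same length")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    shared = 0
    for i in range(n):
        hd_neighbors = knn_indices(hd_points, i, k)
        ld_neighbors = knn_indices(ld_points, i, k)
        shared += len(hd_neighbors & ld_neighbors)
    return shared / (k * n)
```

Here `hd_points` would hold the original feature vectors and `ld_points` the t-SNE output in the same order. Computing the score for a range of k values gives a curve that can be compared across initialization methods and kernels.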
