Estimation of the number of spiked eigenvalues in a covariance matrix by bulk eigenvalue matching analysis

Paper title

Estimation of the number of spiked eigenvalues in a covariance matrix by bulk eigenvalue matching analysis

Authors

Zheng Tracy Ke, Yucong Ma, Xihong Lin

Abstract

The spiked covariance model has gained increasing popularity in high-dimensional data analysis. A fundamental problem is determining the number of spiked eigenvalues, $K$. For estimation of $K$, most attention has focused on the top eigenvalues of the sample covariance matrix, and there has been little investigation into proper ways of using the bulk eigenvalues to estimate $K$. We propose a principled approach to incorporating bulk eigenvalues in the estimation of $K$. Our method imposes a working model on the residual covariance matrix, which is assumed to be a diagonal matrix whose entries are drawn from a gamma distribution. Under this model, the bulk eigenvalues are asymptotically close to the quantiles of a fixed parametric distribution. This motivates a two-step method: the first step uses bulk eigenvalues to estimate the parameters of this distribution, and the second step uses these parameters to assist the estimation of $K$. The resulting estimator $\hat{K}$ aggregates information from a large number of bulk eigenvalues. We show the consistency of $\hat{K}$ under a standard spiked covariance model. We also propose a confidence interval estimate for $K$. Our extensive simulation studies show that the proposed method is robust and outperforms the existing methods in a range of scenarios. We apply the proposed method to the analysis of a lung cancer microarray data set and the 1000 Genomes data set.
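
The two-step idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it assumes the residual covariance is $\sigma^2 I$ rather than gamma-distributed, so the reference distribution is the Marchenko-Pastur law, and it assumes $p < n$. Step 1 fits $\sigma^2$ by least squares of the bulk eigenvalues on Marchenko-Pastur quantiles; step 2 counts eigenvalues above the fitted upper bulk edge. The function names, the trimming fraction `alpha`, and the ad hoc `slack` margin are all hypothetical choices made for illustration.

```python
import numpy as np

def mp_quantiles(probs, gamma, grid_size=20001):
    """Numerical quantiles of the Marchenko-Pastur law (sigma^2 = 1, gamma = p/n < 1)."""
    a = (1 - np.sqrt(gamma)) ** 2          # lower edge of the bulk
    b = (1 + np.sqrt(gamma)) ** 2          # upper edge of the bulk
    x = np.linspace(a, b, grid_size)
    dens = np.sqrt(np.clip((b - x) * (x - a), 0, None)) / (2 * np.pi * gamma * x)
    # Trapezoid-rule CDF on the grid, normalised to end at exactly 1.
    cdf = np.concatenate([[0.0], np.cumsum((dens[1:] + dens[:-1]) / 2 * np.diff(x))])
    cdf /= cdf[-1]
    return np.interp(probs, cdf, x)

def estimate_num_spikes(X, alpha=0.2, slack=0.1):
    """Simplified bulk-matching estimate of the number of spikes (illustrative only)."""
    n, p = X.shape
    gamma = p / n
    eig = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]   # descending order
    k = np.arange(1, p + 1)
    # Step 1: keep only the middle (bulk) eigenvalues and match them to
    # Marchenko-Pastur quantiles; the k-th largest maps to the (1 - k/p) quantile.
    bulk = (k > alpha * p) & (k <= (1 - alpha) * p)
    q = mp_quantiles(1 - k[bulk] / p, gamma)
    sigma2_hat = np.sum(q * eig[bulk]) / np.sum(q ** 2)    # no-intercept least squares
    # Step 2: count eigenvalues above the fitted bulk edge, inflated by an
    # ad hoc slack (an assumption of this sketch, not the paper's threshold).
    threshold = sigma2_hat * (1 + np.sqrt(gamma)) ** 2 * (1 + slack)
    return int(np.sum(eig > threshold)), sigma2_hat

# Toy data: three spiked population eigenvalues (10, 8, 6), unit noise variance.
rng = np.random.default_rng(0)
n, p, K = 500, 200, 3
scales = np.ones(p)
scales[:K] = np.sqrt([10.0, 8.0, 6.0])
X = rng.standard_normal((n, p)) * scales
K_hat, sigma2_hat = estimate_num_spikes(X)
print(K_hat, sigma2_hat)
```

Trimming the largest and smallest eigenvalues before the fit keeps the spikes out of the estimate of $\sigma^2$, which is the sense in which the bulk, rather than the top eigenvalues alone, informs the threshold.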
