Paper Title

Estimation of the number of spiked eigenvalues in a covariance matrix by bulk eigenvalue matching analysis

Authors

Zheng Tracy Ke, Yucong Ma, Xihong Lin

Abstract

The spiked covariance model has gained increasing popularity in high-dimensional data analysis. A fundamental problem is determining the number of spiked eigenvalues, $K$. For the estimation of $K$, most attention has focused on the top eigenvalues of the sample covariance matrix, and there has been little investigation into proper ways of using the bulk eigenvalues to estimate $K$. We propose a principled approach to incorporating bulk eigenvalues in the estimation of $K$. Our method imposes a working model on the residual covariance matrix, which is assumed to be a diagonal matrix whose entries are drawn from a gamma distribution. Under this model, the bulk eigenvalues are asymptotically close to the quantiles of a fixed parametric distribution. This motivates a two-step method: the first step uses the bulk eigenvalues to estimate the parameters of this distribution, and the second step leverages these parameters to assist the estimation of $K$. The resulting estimator $\hat{K}$ aggregates information from a large number of bulk eigenvalues. We show the consistency of $\hat{K}$ under a standard spiked covariance model. We also propose a confidence interval estimate for $K$. Our extensive simulation studies show that the proposed method is robust and outperforms existing methods in a range of scenarios. We apply the proposed method to the analysis of a lung cancer microarray data set and the 1000 Genomes data set.
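
The working model described above can be written out as a short sketch. The abstract does not give notation beyond $K$ and $\hat{K}$, so the symbols $p$, $\mu_k$, $\xi_k$, $D$, $d_j$, $a$, and $b$ below are assumptions introduced only for illustration, and the gamma parameterization used in the paper may differ.

$$
\Sigma \;=\; \sum_{k=1}^{K} \mu_k\, \xi_k \xi_k^{\top} \;+\; D,
\qquad
D \;=\; \mathrm{diag}(d_1, \dots, d_p),
\qquad
d_1, \dots, d_p \ \text{i.i.d.} \ \sim\ \mathrm{Gamma}(a, b).
$$

Here $\Sigma$ is the $p \times p$ population covariance matrix, $\mu_1, \dots, \mu_K > 0$ are the spike strengths, $\xi_1, \dots, \xi_K$ are orthonormal vectors, and $D$ is the residual covariance matrix. The standard spiked covariance model, in which $D$ is a multiple of the identity, corresponds to the degenerate case where all $d_j$ are equal.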