Paper title
Generalized Kernel Two-Sample Tests
Paper authors
Paper abstract
Kernel two-sample tests have been widely used for multivariate data to test equality of distributions. However, existing tests based on mapping distributions into a reproducing kernel Hilbert space mainly target specific alternatives and, owing to the curse of dimensionality, do not work well in some scenarios when the dimension of the data is moderate to high. We propose a new test statistic that makes use of a common pattern under moderate and high dimensions and achieves substantial power improvements over existing kernel two-sample tests for a wide range of alternatives. We also propose alternative testing procedures that maintain high power at low computational cost, offering easy off-the-shelf tools for large datasets. The new approaches are compared with other state-of-the-art tests under various settings and show good performance. We showcase the new approaches through two applications: the comparison of musks and non-musks using the shape of molecules, and the comparison of taxi trips starting from John F. Kennedy airport in consecutive months. All proposed methods are implemented in the R package kerTests.
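For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of a classical kernel two-sample test: the maximum mean discrepancy (MMD) with a Gaussian kernel, calibrated by permutation. It is not the statistic proposed in the paper and does not reflect the kerTests API. All function names, the median-heuristic bandwidth, and the default of 200 permutations are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def median_bandwidth(points):
    """Median heuristic (assumed choice): median pairwise Euclidean distance."""
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    # Fall back to 1.0 if the median distance is zero (e.g. many duplicate points).
    return dists[len(dists) // 2] or 1.0


def mmd2_unbiased(gram, idx_x, idx_y):
    """Unbiased estimate of squared MMD from a precomputed pooled Gram matrix."""
    m, n = len(idx_x), len(idx_y)
    k_xx = sum(gram[i][j] for i in idx_x for j in idx_x if i != j) / (m * (m - 1))
    k_yy = sum(gram[i][j] for i in idx_y for j in idx_y if i != j) / (n * (n - 1))
    k_xy = sum(gram[i][j] for i in idx_x for j in idx_y) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def mmd_permutation_test(xs, ys, n_perm=200, seed=0):
    """Return (observed MMD^2, permutation p-value) for samples xs and ys."""
    pooled = list(xs) + list(ys)
    size, m = len(pooled), len(xs)
    bandwidth = median_bandwidth(pooled)
    # The Gram matrix is computed once; permutations only reshuffle indices.
    gram = [
        [gaussian_kernel(pooled[i], pooled[j], bandwidth) for j in range(size)]
        for i in range(size)
    ]
    indices = list(range(size))
    observed = mmd2_unbiased(gram, indices[:m], indices[m:])
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(indices)
        if mmd2_unbiased(gram, indices[:m], indices[m:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

Calling `mmd_permutation_test(xs, ys)` with two lists of equal-length numeric tuples returns the observed statistic and a p-value; small p-values suggest the two samples come from different distributions. The cost is quadratic in the pooled sample size, which is one reason cheaper procedures matter for large datasets.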