论文标题

快速多元经验累积分布函数与内核密度估计的连接

Fast multivariate empirical cumulative distribution function with connection to kernel density estimation

论文作者

Langrené, Nicolas, Warin, Xavier

论文摘要

本文重新讨论了在大型多元数据集上有效地计算经验累积分布函数(ECDF)的问题。在一个评估点计算ECDF需要$ \ MATHCAL {O}(n)$在由$ n $数据点组成的数据集上的操作。因此,对$ n $评估点的ECDF进行直接评估需要一个二次$ \ MATHCAL {O}(n^2)$操作,这对于大规模问题而言是非常敏感的。提出并比较两种快速和精确的方法。第一个是基于词典顺序的快速求和,并带有$ \ Mathcal {o}(n {\ log} n)$复杂性,并要求评估点位于常规网格上。第二个是基于划分的原理,具有$ \ MATHCAL {O}(N \ log(n)^{(d-1){\ vee} 1})$复杂性,并且需要评估点才能与输入点一致。在一般$ d $维情况下,描述并详细介绍了这两种快速算法,数值实验验证了它们的速度和准确性。其次,本文为一大类内核建立了累积分布函数与内核密度估计(KDE)之间的直接连接。该连接为多元内核密度估计和内核回归的快速精确算法铺平了道路。 Laplacian内核的数值测试验证了所提出的算法的速度和准确性。广泛的大规模多元密度估计,累积分布估计,存活函数估计和回归问题可以从提出的数值方法中受益。

This paper revisits the problem of computing empirical cumulative distribution functions (ECDF) efficiently on large, multivariate datasets. Computing an ECDF at one evaluation point requires $\mathcal{O}(N)$ operations on a dataset composed of $N$ data points. Therefore, a direct evaluation of ECDFs at $N$ evaluation points requires a quadratic $\mathcal{O}(N^2)$ operations, which is prohibitive for large-scale problems. Two fast and exact methods are proposed and compared. The first one is based on fast summation in lexicographical order, with a $\mathcal{O}(N{\log}N)$ complexity and requires the evaluation points to lie on a regular grid. The second one is based on the divide-and-conquer principle, with a $\mathcal{O}(N\log(N)^{(d-1){\vee}1})$ complexity and requires the evaluation points to coincide with the input points. The two fast algorithms are described and detailed in the general $d$-dimensional case, and numerical experiments validate their speed and accuracy. Secondly, the paper establishes a direct connection between cumulative distribution functions and kernel density estimation (KDE) for a large class of kernels. This connection paves the way for fast exact algorithms for multivariate kernel density estimation and kernel regression. Numerical tests with the Laplacian kernel validate the speed and accuracy of the proposed algorithms. A broad range of large-scale multivariate density estimation, cumulative distribution estimation, survival function estimation and regression problems can benefit from the proposed numerical methods.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源