Paper Title
Variance Reduction with Sparse Gradients
Paper Authors
Paper Abstract
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. Compared to SGD, these methods require at least double the number of operations per update to model parameters. To reduce the computational cost of these methods, we introduce a new sparsity operator: the random-top-k operator. Our operator reduces computational complexity by combining the top-k operator with the randomized coordinate descent operator to exploit the gradient sparsity exhibited in a variety of applications. With this operator, large batch gradients offer an extra benefit beyond variance reduction: a reliable estimate of gradient sparsity. Theoretically, our algorithm is at least as good as the best existing algorithm (SpiderBoost), and it performs even better whenever the random-top-k operator captures gradient sparsity. Empirically, our algorithm consistently outperforms SpiderBoost across a variety of models and tasks, including image classification, natural language processing, and sparse matrix factorization. We also provide empirical evidence supporting the intuition behind our algorithm via a simple gradient entropy computation, which quantifies gradient sparsity at every iteration.
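The abstract does not spell out the exact form of the random-top-k operator, only that it combines a top-k selection with randomized coordinate sampling. The sketch below is one plausible reading of that idea, not the paper's definition: the parameter names k1 (size of the deterministic top-k part), k2 (number of randomly sampled remaining coordinates), and the (d - k1) / k2 rescaling for unbiasedness are illustrative assumptions.

```python
import numpy as np

def random_top_k(g, k1, k2, rng=None):
    """Hypothetical sketch of a random-top-k sparsification operator.

    Keeps the k1 largest-magnitude coordinates of g (the top-k part) and
    additionally samples k2 of the remaining coordinates uniformly at
    random (the randomized-coordinate part), rescaling the sampled part
    by (d - k1) / k2 so that its expectation matches the dense gradient.
    """
    rng = rng or np.random.default_rng()
    d = g.size
    out = np.zeros_like(g)

    # Top-k part: keep the k1 coordinates with the largest magnitude.
    top_idx = np.argpartition(np.abs(g), -k1)[-k1:]
    out[top_idx] = g[top_idx]

    # Randomized part: sample k2 of the remaining d - k1 coordinates
    # and upscale them to keep the estimator unbiased on that subset.
    rest = np.setdiff1d(np.arange(d), top_idx)
    rand_idx = rng.choice(rest, size=k2, replace=False)
    out[rand_idx] = g[rand_idx] * (d - k1) / k2

    return out
```

In a variance-reduced method such as SpiderBoost, an operator of this kind could be applied to the small-batch gradient corrections, with the large-batch gradient used to pick the top-k support, which matches the abstract's claim that large batches also provide a reliable sparsity estimate.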
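The gradient entropy computation mentioned at the end of the abstract is likewise only named, not defined. A minimal sketch, assuming the entropy is taken over the normalized gradient magnitudes (the paper's exact definition may differ), is:

```python
import numpy as np

def gradient_entropy(g, eps=1e-12):
    """Shannon entropy of the normalized gradient magnitudes (illustrative).

    Treats |g_i| / sum_j |g_j| as a probability distribution; lower entropy
    means the gradient mass is concentrated on few coordinates, i.e. the
    gradient is closer to sparse.
    """
    p = np.abs(g)
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())
```

Tracking this quantity per iteration gives a simple scalar diagnostic of how much sparsity there is for the random-top-k operator to capture.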