Paper Title
LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models
Paper Authors
Paper Abstract
The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of various hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5x, 2.5x faster runtime and 1.2x, 2.0x lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy than the 350M-parameter OPT across 14 tasks, with up to 1.6x lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong, simple baseline for future NAS methods in autoregressive language modeling.
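The abstract's core recipe, namely scoring candidate architectures by decoder parameter count instead of trained perplexity, measuring cost on the target device, and keeping the non-dominated set, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' LTS implementation: the search space, the simplified parameter-count formula (biases and layer norms omitted), the use of an untrained `nn.TransformerEncoderLayer` as a stand-in decoder block, and the random sampling loop are all assumptions made for brevity.

```python
import random
import time
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class DecoderConfig:
    """Hypothetical search-space point: depth and per-layer widths."""
    n_layer: int
    d_model: int
    n_head: int
    d_inner: int


def sample_config(rng: random.Random) -> DecoderConfig:
    """Sample a random architecture from an illustrative search space."""
    return DecoderConfig(
        n_layer=rng.choice([4, 8, 12, 16]),
        d_model=rng.choice([256, 512, 768]),
        n_head=rng.choice([4, 8, 12]),
        d_inner=rng.choice([1024, 2048, 3072]),
    )


def decoder_params(cfg: DecoderConfig) -> int:
    """Training-free proxy for perplexity: count of decoder (non-embedding)
    parameters. Per layer: attention (4 * d_model^2) + FFN (2 * d_model * d_inner);
    biases and layer norms are omitted for brevity."""
    per_layer = 4 * cfg.d_model ** 2 + 2 * cfg.d_model * cfg.d_inner
    return cfg.n_layer * per_layer


def measure_latency(cfg: DecoderConfig, seq_len: int = 64, reps: int = 3) -> float:
    """On-target-device latency of an untrained stand-in model (no GPU needed)."""
    layer = nn.TransformerEncoderLayer(
        d_model=cfg.d_model, nhead=cfg.n_head,
        dim_feedforward=cfg.d_inner, batch_first=True,
    )
    model = nn.TransformerEncoder(layer, num_layers=cfg.n_layer).eval()
    x = torch.randn(1, seq_len, cfg.d_model)
    with torch.no_grad():
        model(x)  # warm-up pass
        start = time.perf_counter()
        for _ in range(reps):
            model(x)
    return (time.perf_counter() - start) / reps


def pareto_frontier(points):
    """Keep candidates not dominated by any other: a point is dominated if
    another has at least as many decoder params (better proxy) and at most
    the same latency, with at least one strict improvement."""
    frontier = []
    for cfg, params, lat in points:
        dominated = any(
            p2 >= params and l2 <= lat and (p2 > params or l2 < lat)
            for _, p2, l2 in points
        )
        if not dominated:
            frontier.append((cfg, params, lat))
    return frontier


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [sample_config(rng) for _ in range(20)]
    scored = [(c, decoder_params(c), measure_latency(c)) for c in candidates]
    for cfg, params, lat in pareto_frontier(scored):
        print(f"{cfg}  params≈{params / 1e6:.1f}M  latency≈{lat * 1e3:.1f}ms")
```

A usage note: the proxy score here only ranks candidates; no weights are ever trained during search, which is what lets the whole loop run on a commodity CPU. The paper's actual search strategy and search space are richer than this random-sampling sketch.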