Paper Title

MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network Quantization Framework

Paper Authors

Sung-En Chang, Yanyu Li, Mengshu Sun, Weiwen Jiang, Runbin Shi, Xue Lin, Yanzhi Wang

Paper Abstract

With the tremendous success of deep learning, there is an imminent need to deploy deep learning models onto edge devices. To tackle the limited computing and storage resources of edge devices, model compression techniques have been widely used to trim deep neural network (DNN) models for on-device inference execution. This paper targets the commonly used FPGA (field-programmable gate array) devices as the hardware platform for DNN edge computing. We focus on DNN quantization as the main model compression technique, since DNN quantization is of great importance for implementing DNN models on hardware platforms. The novelty of this work is twofold: (i) We propose a mixed-scheme DNN quantization method that incorporates both linear and non-linear number systems for quantization, with the aim of boosting utilization of the heterogeneous computing resources on an FPGA, i.e., LUTs (look-up tables) and DSPs (digital signal processors). Note that all existing (single-scheme) quantization methods can only utilize one type of resource (either LUTs or DSPs) for the MAC (multiply-accumulate) operations in deep learning computations. (ii) We use a quantization method that supports multiple precisions along the intra-layer dimension, whereas existing quantization methods apply multi-precision quantization along the inter-layer dimension. The intra-layer multi-precision method keeps the hardware configuration uniform across layers to reduce computation overhead, while preserving model accuracy as well as the inter-layer approach does.
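
The two ideas in the abstract lend themselves to a short illustration. The following is a minimal NumPy sketch, not the paper's released code: the function names (quantize_fixed_point, quantize_power_of_two, mixed_scheme_quantize, intra_layer_multi_precision) and all bit-width and ratio settings are hypothetical choices made here for illustration. Power-of-two quantization turns each multiplication into a bit shift (and thus maps onto LUTs), linear fixed-point quantization uses ordinary multipliers (and thus maps onto DSPs), the mixed scheme splits each layer's weight rows between the two, and intra-layer multi-precision keeps a fixed fraction of rows per layer at a higher bit-width so that every layer shares one hardware configuration.

```python
import numpy as np

def quantize_fixed_point(w, bits=4):
    """Linear (uniform) quantization: MACs map to FPGA DSP multipliers."""
    scale = np.max(np.abs(w)) + 1e-12
    levels = 2 ** (bits - 1) - 1              # symmetric signed grid
    q = np.clip(np.round(w / scale * levels), -levels, levels)
    return q * scale / levels

def quantize_power_of_two(w, bits=4):
    """Non-linear (power-of-two) quantization: each weight becomes
    sign * 2^k, so multiplications reduce to bit shifts and the MACs
    can be mapped onto LUTs instead of DSPs."""
    scale = np.max(np.abs(w)) + 1e-12
    min_exp = -(2 ** (bits - 1) - 1)          # smallest representable shift
    exp = np.clip(np.round(np.log2(np.abs(w) / scale + 1e-12)), min_exp, 0)
    return np.sign(w) * scale * (2.0 ** exp)

def mixed_scheme_quantize(w, p2_ratio=0.5, bits=4):
    """Mixed scheme within one layer: a fixed fraction of weight rows is
    quantized with the power-of-two scheme (LUT-mapped) and the rest with
    fixed-point (DSP-mapped), so both resource types are utilized."""
    out = np.empty_like(w)
    split = int(p2_ratio * w.shape[0])
    out[:split] = quantize_power_of_two(w[:split], bits)
    out[split:] = quantize_fixed_point(w[split:], bits)
    return out

def intra_layer_multi_precision(w, hi_ratio=0.25, lo_bits=4, hi_bits=8):
    """Intra-layer multi-precision: every layer keeps the same fraction of
    rows at a higher bit-width, so all layers share one hardware
    configuration instead of per-layer (inter-layer) precisions."""
    out = np.empty_like(w)
    n_hi = int(hi_ratio * w.shape[0])
    out[:n_hi] = quantize_fixed_point(w[:n_hi], hi_bits)
    out[n_hi:] = mixed_scheme_quantize(w[n_hi:], bits=lo_bits)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(16, 32)).astype(np.float32)
    wq = intra_layer_multi_precision(w)
    print("mean abs quantization error:", float(np.mean(np.abs(w - wq))))
```

Presumably the split ratios would be chosen to match the target FPGA's LUT-to-DSP availability; the 0.5 and 0.25 values above are placeholders, not settings taken from the paper.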
