Linear Delay-cell Design for Low-energy Delay Multiplication and Accumulation

Paper Title

Linear Delay-cell Design for Low-energy Delay Multiplication and Accumulation

Author

Shukla, Aditya

Abstract

Evaluating a practical deep neural network (DNN) involves thousands of multiply-and-accumulate (MAC) operations. To extend the superior inference capabilities of DNNs to energy-constrained devices, architectures and circuits that minimize energy per MAC must be developed. In this respect, analog delay-based MAC is advantageous for reasons both extrinsic and intrinsic to the MAC implementation: (1) the lower fixed-point precision required for DNN evaluation, (2) better dynamic range than charge-based accumulation at smaller technology nodes, and (3) simpler analog-digital interfacing. Implementing DNNs with delay-based MAC requires mixed-signal delay multipliers that accept digitally stored weights and analog voltages as arguments. To this end, a novel, linearly tunable delay cell is proposed, in which the delay is realized by the steady discharge of an inverted MOS capacitor (C*) from an initial charge that depends linearly on the input voltage. The cell is analytically modeled, constraints for its functional validity are determined, and jitter models are developed. Multiple cells with scaled delays, one for each bit of the digital argument, must be cascaded to form the multiplier. To realize this bit-wise delay scaling, a biasing circuit is proposed that generates sub-threshold gate voltages to scale the discharge rate of C*, thereby avoiding area-expensive transistor width scaling. For 130 nm CMOS technology, the theoretical constraints and limits on jitter are used to find the optimal design point and to quantify the trade-off between jitter and bits per multiplier. Schematic-level simulations show a worst-case energy consumption close to the state of the art, demonstrating the feasibility of the cell.
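
The multiplier structure described in the abstract can be illustrated with a purely behavioral sketch. It is not the paper's model. It assumes an ideal constant-current discharge, so that one cell's delay is C*·V_in / I; it assumes the cell for bit b discharges at I_unit / 2^b; and it assumes cells whose weight bit is 0 are bypassed and contribute no delay. The function names and the default component values (10 fF, 1 µA) are hypothetical, chosen only for illustration.

```python
import math


def cell_delay(v_in, c_star, i_discharge):
    """Ideal delay of one cell: time to drain Q = C* * V_in at a constant current."""
    return c_star * v_in / i_discharge


def multiplier_delay(weight, n_bits, v_in, c_star=10e-15, i_unit=1e-6):
    """Total delay of a cascade of n_bits cells for an unsigned digital weight.

    Assumption: the cell for bit b discharges 2**b times more slowly than the
    unit cell, and a cell whose weight bit is 0 is bypassed (adds no delay).
    """
    if not 0 <= weight < (1 << n_bits):
        raise ValueError("weight does not fit in n_bits")
    total = 0.0
    for bit in range(n_bits):
        if (weight >> bit) & 1:
            i_bit = i_unit / (1 << bit)  # slower discharge means longer delay
            total += cell_delay(v_in, c_star, i_bit)
    return total


def mac_delay(weights, voltages, n_bits=4):
    """Accumulate by chaining multipliers so that their delays add in time."""
    return sum(multiplier_delay(w, n_bits, v) for w, v in zip(weights, voltages))


unit = cell_delay(0.5, 10e-15, 1e-6)  # delay of the bit-0 cell at V_in = 0.5 V
# Under these assumptions the cascade delay is proportional to weight * V_in.
print(math.isclose(multiplier_delay(5, 4, 0.5), 5 * unit))
```

Under these idealizations the cascade delay reduces to (C*·V_in / I_unit)·w, which is the product the multiplier is meant to encode in time. Jitter, offsets, and the functional-validity constraints discussed in the abstract are not modeled.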
