Paper Title

Kernel-Segregated Transpose Convolution Operation

Authors

Vijay Srinivas Tida, Sai Venkatesh Chilukoti, Xiali Hei, Sonya Hsu

Abstract

Transpose convolution has shown prominence in many deep learning applications. However, transpose convolution layers are computationally intensive because the feature map grows once zeros are inserted after each element in every row and column. Convolving over this expanded input feature map therefore leads to poor utilization of hardware resources, and the zeros at predefined positions in the input feature map are the main source of unnecessary multiplication operations. We propose an algorithmic-level optimization technique for efficient transpose convolution implementation to solve these problems. Based on kernel activations, we segregate the original kernel into four sub-kernels. This scheme can reduce memory requirements and unnecessary multiplications. Our proposed method achieved $3.09$ ($3.02$)$\times$ faster computation on a Titan X GPU (Intel Dual Core CPU) with a flower dataset from the Kaggle website. Furthermore, the proposed optimization method can be generalized to existing devices without additional hardware requirements. A simple deep learning model containing one transpose convolution layer was used to evaluate the optimization method; it trained $2.2\times$ faster than the conventional implementation on the MNIST dataset with an Intel Dual Core CPU.
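The core idea, splitting a stride-2 kernel into four sub-kernels indexed by output-position parity so that the original (non-upsampled) input is correlated directly, can be illustrated with a short NumPy sketch. This is a minimal illustration under assumed conventions (stride 2, odd kernel size, zero padding of kernel size minus one, cross-correlation), not the authors' implementation; the function names and the zero-insertion baseline used for comparison are hypothetical.

```python
import numpy as np

def full_correlate2d(img, ker):
    """Cross-correlate img with ker after zero-padding img by (kernel size - 1) on each side."""
    kh, kw = ker.shape
    padded = np.pad(img, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    oh = padded.shape[0] - kh + 1
    ow = padded.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * ker)
    return out

def transpose_conv_naive(img, ker, stride=2):
    """Baseline: insert zeros after every element (upsample), then run one large correlation."""
    h, w = img.shape
    up = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    up[::stride, ::stride] = img                       # zeros occupy all remaining positions
    return full_correlate2d(up, ker)

def transpose_conv_segregated(img, ker, stride=2):
    """Kernel-segregated sketch: split ker into stride*stride sub-kernels by index parity
    and correlate each with the ORIGINAL input, so no multiplication ever touches a zero."""
    h, w = img.shape
    kh, kw = ker.shape                                 # assumed odd (e.g. 3x3) so parities align
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for pr in range(stride):                           # output-row parity
        for pc in range(stride):                       # output-column parity
            sub = ker[pr::stride, pc::stride]          # one of the four sub-kernels
            out[pr::stride, pc::stride] = full_correlate2d(img, sub)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 5))
    k = rng.standard_normal((3, 3))
    assert np.allclose(transpose_conv_naive(x, k), transpose_conv_segregated(x, k))
    print("kernel-segregated output matches the zero-insertion baseline")
```

In this sketch, each input element is multiplied only by the kernel weights that would have aligned with it in the upsampled view, which is the source of the reduced multiplication count and memory footprint that the abstract describes.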
