Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection

Paper title

Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection

Authors

Takuya Higuchi, Mohammad Ghasemzadeh, Kisun You, Chandra Dhir

Abstract

We propose a stacked 1D convolutional neural network (S1DCNN) for end-to-end small-footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application that lets users activate their devices simply by saying a keyword or phrase. For privacy and latency reasons, a voice trigger detection system should run on an always-on processor on the device, so low memory and compute cost is crucial. Recently, singular value decomposition filters (SVDFs) have been used for end-to-end voice trigger detection. An SVDF approximates a fully connected layer with a low-rank approximation, which reduces the number of model parameters. In this work, we propose the S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection. An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. We show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieves a 19.0% relative reduction in false reject ratio (FRR) with a similar model size and a similar time delay compared to the SVDF. By using longer time delays, the S1DCNN further improves the FRR by up to 12.2% relative.
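The layer structure described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the shapes, and the omission of bias and nonlinearity are all assumptions made for illustration. The sketch also assumes that the SVDF special case corresponds to a first convolution of kernel width 1, which the abstract does not state. Under that assumption, the check at the end confirms that the stacked layer equals a fully connected layer over a window of frames whose per-channel weight matrix is a rank-1 outer product.

```python
import random

def conv1d(x, w):
    """'Valid' 1D convolution over time (cross-correlation, no kernel flip).
    x: list of T frames, each a list of F features.
    w: list of C filters, each a list of K taps, each a list of F weights.
    Returns T-K+1 frames with C channels each."""
    K, F = len(w[0]), len(x[0])
    return [[sum(filt[k][f] * x[n + k][f] for k in range(K) for f in range(F))
             for filt in w]
            for n in range(len(x) - K + 1)]

def depthwise_conv1d(h, v):
    """Depth-wise 'valid' 1D convolution: channel c is filtered only by v[c].
    h: list of frames with C channels each; v: list of C filters with M taps."""
    C, M = len(v), len(v[0])
    return [[sum(v[c][m] * h[n + m][c] for m in range(M)) for c in range(C)]
            for n in range(len(h) - M + 1)]

def s1dcnn_layer(x, w, v):
    """Hypothetical S1DCNN layer: 1D conv followed by depth-wise 1D conv.
    Bias and nonlinearity are omitted (assumption, for brevity)."""
    return depthwise_conv1d(conv1d(x, w), v)

random.seed(0)
F, C, M, T = 4, 3, 5, 8  # features, channels, time taps, input frames
x = [[random.gauss(0, 1) for _ in range(F)] for _ in range(T)]
alpha = [[random.gauss(0, 1) for _ in range(F)] for _ in range(C)]  # feature filters
beta = [[random.gauss(0, 1) for _ in range(M)] for _ in range(C)]   # time filters

# Assumed special case: first convolution has kernel width 1.
w = [[alpha[c]] for c in range(C)]
y = s1dcnn_layer(x, w, beta)

# Reference: fully connected layer over M frames whose weight matrix
# for channel c is the rank-1 outer product of beta[c] and alpha[c].
ref = [[sum(beta[c][m] * alpha[c][f] * x[n + m][f]
            for m in range(M) for f in range(F))
        for c in range(C)]
       for n in range(T - M + 1)]

max_err = max(abs(a - b) for ya, ra in zip(y, ref) for a, b in zip(ya, ra))
print(len(y), len(y[0]), max_err < 1e-9)
```

In this sketch the width-1 case stores C·(F + M) weights, whereas an unfactored fully connected layer over the same M-frame window would store C·F·M. Widening the first convolution to K taps raises the count to C·(K·F + M).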
