Insights Into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement

Paper Title

Insights Into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement

Authors

Kristina Tesch, Timo Gerkmann

Abstract

The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing. In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. In contrast, there is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter, which means that the restriction of a linear processing model and that of a separate processing of spatial and tempo-spectral information can potentially be overcome. However, the internal mechanisms that lead to good performance of such data-driven filters for multi-channel speech enhancement are not well understood. Therefore, in this work, we analyse the properties of a non-linear spatial filter realized by a DNN as well as its interdependency with temporal and spectral processing by carefully controlling the information sources (spatial, spectral, and temporal) available to the network. We confirm the superiority of a non-linear spatial processing model, which outperforms an oracle linear spatial filter in a challenging speaker extraction scenario for a low number of microphones by 0.24 POLQA score. Our analyses reveal that in particular spectral information should be processed jointly with spatial information, as this increases the spatial selectivity of the filter. Our systematic evaluation then leads to a simple network architecture that outperforms state-of-the-art network architectures on a speaker extraction task by 0.22 POLQA score and by 0.32 POLQA score on the CHiME3 data.
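
For readers unfamiliar with the traditional pipeline that the abstract contrasts against, the following is a minimal NumPy sketch of a linear spatial filter (an MVDR beamformer for a single frequency bin) followed by a separate single-channel post-filter gain. It is not the authors' method or code: the function names, the toy steering vector, and the random noise covariance are all illustrative assumptions.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin."""
    rinv_d = np.linalg.solve(noise_cov, steering)
    return rinv_d / (steering.conj() @ rinv_d)

def linear_then_postfilter(frames, weights, gain):
    """Apply the linear spatial filter w^H y to each frame, then a scalar post-filter gain."""
    beamformed = frames @ weights.conj()  # shape (num_frames,)
    return gain * beamformed

# Hypothetical toy setup: 3 microphones, 4 STFT frames of one frequency bin.
rng = np.random.default_rng(0)
num_mics, num_frames = 3, 4
steering = np.exp(-1j * np.pi * 0.5 * np.arange(num_mics))  # assumed steering vector
a = rng.standard_normal((num_mics, num_mics)) + 1j * rng.standard_normal((num_mics, num_mics))
noise_cov = a @ a.conj().T + np.eye(num_mics)  # Hermitian positive definite by construction
weights = mvdr_weights(noise_cov, steering)

# Noise-free multi-channel frames: the target signal as seen at each microphone.
target = rng.standard_normal(num_frames) + 1j * rng.standard_normal(num_frames)
frames = target[:, None] * steering[None, :]  # shape (num_frames, num_mics)
output = linear_then_postfilter(frames, weights, gain=1.0)
```

In this sketch the spatial step is a fixed linear combination of the microphone signals and the post-filter is a separate scalar gain; the joint non-linear filter studied in the paper removes both restrictions by letting a DNN map the multi-channel input to the enhanced output directly.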
