Paper Title
Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement
Paper Authors
Paper Abstract
This paper investigates trade-offs between the number of model parameters and enhanced speech quality by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good quality performance with a reduced model parameter size. CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality, and a tensor-train (TT) output layer on top to reduce model parameters. We first derive a new upper bound on the generalization power of convolutional neural network (CNN) based vector-to-vector regression models. Then, we provide experimental evidence on the Edinburgh noisy speech corpus to demonstrate that, in single-channel speech enhancement, CNN outperforms DNN at the expense of a small increase in model size. Moreover, CNN-TT slightly outperforms its CNN counterpart while utilizing only 32% of the CNN model parameters, and further performance gains can be attained if the number of CNN-TT parameters is increased to 44% of the CNN model size. Finally, our multi-channel speech enhancement experiments on a simulated noisy WSJ0 corpus demonstrate that the proposed hybrid CNN-TT architecture achieves better enhanced speech quality with smaller parameter sizes than both DNN and CNN models.
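To make the hybrid design concrete, the following is a minimal PyTorch sketch of the general idea: convolutional layers at the bottom extract features, and a tensor-train factored output layer replaces a large dense output layer. This is our own illustration under stated assumptions, not the paper's released code; all layer sizes, the mode factorization (16, 32, 32) → (8, 8, 8), and the TT ranks (1, 4, 4, 1) are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    """Linear layer whose weight matrix is stored as tensor-train (TT) cores.

    The flat input dimension must equal prod(in_modes), the output dimension
    equals prod(out_modes), and ranks has length len(in_modes) + 1 with
    ranks[0] == ranks[-1] == 1.
    """

    def __init__(self, in_modes, out_modes, ranks):
        super().__init__()
        assert len(in_modes) == len(out_modes) == len(ranks) - 1
        assert ranks[0] == 1 and ranks[-1] == 1
        self.in_modes = list(in_modes)
        self.cores = nn.ParameterList(
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k],
                                           out_modes[k], ranks[k + 1]))
            for k in range(len(in_modes))
        )

    def forward(self, x):
        b = x.shape[0]
        # running tensor: (batch, processed out-modes, rank, remaining in-modes)
        t = x.reshape(b, 1, 1, -1)
        for k, core in enumerate(self.cores):
            # expose the current input mode m_k for contraction
            t = t.reshape(b, t.shape[1], t.shape[2], self.in_modes[k], -1)
            # contract the running rank and input mode with the TT core
            t = torch.einsum('bqrms,rmnt->bqnts', t, core)
            # fold the new output mode into the processed-out-modes axis
            t = t.reshape(b, t.shape[1] * t.shape[2], t.shape[3], -1)
        return t.reshape(b, -1)  # (batch, prod(out_modes))


class CNNTT(nn.Module):
    """Illustrative CNN-TT: a small conv feature extractor plus a TT output
    layer. Sizes below are placeholders, not the paper's configuration."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # flattened conv output: 32 channels * 8 * 64 = 16384 = 16 * 32 * 32
        self.tt_out = TTLinear(in_modes=(16, 32, 32),
                               out_modes=(8, 8, 8),   # 512-dim regression target
                               ranks=(1, 4, 4, 1))

    def forward(self, x):  # x: (batch, 1, 8, 64) spectrogram patch
        return self.tt_out(self.features(x).flatten(1))


if __name__ == "__main__":
    y = CNNTT()(torch.randn(2, 1, 8, 64))
    print(y.shape)  # torch.Size([2, 512])
```

With these placeholder shapes, the TT output layer stores roughly 5.6K parameters in its three cores, versus about 8.4M for a full 16384-by-512 dense layer, which illustrates how the TT factorization yields the parameter reduction the abstract reports.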