Paper Title

Spanish and English Phoneme Recognition by Training on Simulated Classroom Audio Recordings of Collaborative Learning Environments

Author

Esparza, Mario

Abstract

Audio recordings of collaborative learning environments contain constant cross-talk and background noise, and these environments require dynamic speech recognition between Spanish and English. To eliminate the standard requirement of large-scale ground truth, the thesis develops a simulated dataset by transforming audio transcriptions into phonemes and using 3D speaker geometry and data augmentation to generate an acoustic simulation of Spanish and English speech. The thesis also develops a low-complexity neural network for recognizing Spanish and English phonemes (available at github.com/muelitas/keywordRec). When trained on 41 English phonemes, the network achieves a PER of 0.099 on Speech Commands. When trained on 36 Spanish phonemes and tested on real recordings of collaborative learning environments, it achieves an LER of 0.7208. This is slightly better than the 0.7272 LER of Google's Speech-to-Text, which used anywhere from 15 to 1,635 times more parameters and was trained on 300 to 27,500 hours of real data, as opposed to 13 hours of simulated audio.
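
The abstract reports its results as error rates without defining them. The following is a minimal sketch of how such a rate is commonly computed, assuming PER here means the edit (Levenshtein) distance between the reference and predicted phoneme sequences divided by the reference length. That definition, the function names, and the example phoneme sequences are illustrative assumptions and are not taken from the thesis or its repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j tokens of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical phoneme sequences, chosen only for illustration.
reference = ["HH", "AH", "L", "OW"]
hypothesis = ["HH", "AH", "L", "UW"]
per = error_rate(reference, hypothesis)
print(per)
```

Under this assumed definition, the same function would yield a letter-level rate if the sequences held characters instead of phonemes.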
