Deep Learning Based Complex Spectral Mapping for Multi-Channel Speaker Separation and Speech Enhancement

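Basic Information

Award number:
2125074
Principal investigator:
Eric Fosler-Lussier
Amount:
$390,600
Institution:
Ohio State University
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Start and end dates:
2021-08-01 to 2024-07-31
Project status:
Completed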

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2125074&HistoricalAwards=false
Keywords:
Deep Learning Based Complex Spectral

Project Abstract

Despite tremendous advances in deep learning based speech separation and automatic speech recognition, a major challenge remains: how to separate concurrent speakers and recognize their speech in the presence of room reverberation and background noise. This project will develop a multi-channel complex spectral mapping approach to multi-talker speaker separation and speech enhancement in order to improve speech recognition performance in such conditions. The proposed approach trains deep neural networks to predict the real and imaginary parts of each talker's spectrogram from the multi-channel input in the complex domain. After overlapped speakers are separated into simultaneous streams, sequential grouping will be performed for speaker diarization, that is, grouping the utterances of the same talker across intervals that also contain other speakers' utterances and pauses. The proposed speaker diarization will integrate spatial and spectral speaker features, which will be extracted through multi-channel speaker localization and single-channel speaker embedding. Recurrent neural networks will be trained to perform classification for speaker diarization in a way that can handle an arbitrary number of speakers in a meeting. The proposed separation system will be evaluated using open, multi-channel speaker separation datasets that contain both room reverberation and background noise. The results from this project are expected to substantially elevate the performance of continuous speaker separation, as well as speaker diarization, in adverse acoustic environments, helping to close the performance gap between recognizing single-talker speech and recognizing multi-talker speech.
The overall goal of this project is to develop a deep learning system that can continuously separate individual speakers in a conversational or meeting setting and accurately recognize the utterances of these speakers. Building on recent advances in simultaneous grouping to separate and enhance overlapped speakers in a talker-independent fashion, the project is mainly focused on speaker diarization, which aims to group the speech utterances of the same speaker across time. To achieve speaker diarization, deep learning based sequential grouping will be performed, integrating spatial and spectral speaker characteristics. Through sequential organization, simultaneous streams will be grouped with earlier-separated speaker streams to form sequential streams, each of which corresponds to all the utterances of the same speaker up to the current time. Speaker localization and classification will be investigated to make sequential grouping capable of creating new sequential streams and handling an arbitrary number of speakers in a meeting scenario. With the added spatial dimension, the proposed diarization approach provides a solution to the question of who spoke when and where, significantly expanding the traditional scope of who spoke when. The proposed separation system will be evaluated using multi-channel speaker separation datasets that contain highly overlapped speech in recorded conversations, as well as room reverberation and background noise present in real environments. The main evaluation metric will be word error rate in automatic speech recognition. The performance of speaker diarization will be measured using diarization error rate.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
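For readers unfamiliar with the main evaluation metric named in the abstract, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a recognizer's output and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the project's scoring tooling, which the abstract does not describe. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: transcripts are tokenized by whitespace with no further
    text normalization; real scoring tools usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.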
