Computational Methods for Speech Analysis


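Basic Information

Award number:
2120087
Principal investigator:
Christopher Lucas
Amount:
About $249,300
Host institution:
Washington University
Host institution country:
United States
Project type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Project period:
2021-08-01 to 2024-07-31
Project status:
Completed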

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2120087&HistoricalAwards=false
Keywords:
Computational Methods Speech Analysis

Project Abstract

This research project will develop tools for testing hypotheses about human communication. Researchers generally study human communication from textual transcripts, which omit vocal tone. The project will directly address the disconnect between the data-generating process, in which speakers and listeners use the auditory channel to convey both textual and non-textual signals, and the widespread practice of discarding speech audio. The investigators will extend their prior speech model, the Model of Audio and Speech Structure, to address some of its limitations. In particular, the statistical extensions will accommodate multiple speakers and allow for the joint modeling of text and tone. To demonstrate the value of the statistical extensions, the model will be applied to two original video corpora: police body-worn camera footage and campaign speeches for federal office. New software will be developed that makes it easy for researchers to quickly annotate a large amount of speech audio. The browser-based tools will enable automatic and manual segmentation, along with labeling. Multiple graduate students will gain experience in computationally intensive research and software development. The tools to be developed will be incorporated into ongoing public-private collaborations to improve oversight of police officers in the field.

This research project will extend the Model of Audio and Speech Structure (MASS), which analyzes conversation as a nested stochastic process in which (i) the flow of conversation unfolds as a sequence of utterances transitioning between speakers and their vocal tones, based on contextual covariates; and (ii) the auditory signal within each utterance unfolds as a hidden Markov model that transitions between phonemes, which generate sound. The model enables social scientists to test hypotheses about how conversations are structured by fixed covariates (e.g., speaker gender, conversation role) and time-varying covariates (e.g., exogenous external stimuli, endogenous conversation trajectory such as the previous speaker's tone). In its current implementation, however, MASS has two key limitations. First, it uses resource-intensive human annotations of tone for each speaker, which limits application to contexts with many unique speakers, such as police body-worn camera footage. This project will develop extensions allowing the model to borrow strength by partial pooling across speakers with similar speech profiles. Second, MASS incorporates text as externally given metadata. The project will develop a new approach for joint modeling of text and audio, which will incorporate a dynamic topic model into the flow-of-conversation layer of MASS. The investigators will conduct two applications to demonstrate the value of the multi-speaker and joint text-audio modeling extensions.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
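The nested structure described in the abstract can be illustrated with a small sketch. This is not the investigators' implementation. It assumes a multinomial-logit upper layer that maps contextual covariates to tone probabilities, and a discrete-observation hidden Markov model per tone as the lower layer (real audio features are continuous, so discrete symbols are a simplification). All names, such as `tone_transition_probs`, `hmm_log_likelihood`, and `tone_posterior`, and all numbers are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def tone_transition_probs(covariates, weights):
    """Upper layer (assumed form): P(tone | covariates) via a softmax
    over one linear score per tone."""
    scores = [sum(w * x for w, x in zip(row, covariates)) for row in weights]
    norm = logsumexp(scores)
    return [math.exp(s - norm) for s in scores]


def hmm_log_likelihood(obs, log_init, log_trans, log_emit):
    """Lower layer: forward algorithm for a discrete-emission HMM.
    Hidden states stand in for phonemes; obs is a list of symbol indices."""
    n_states = len(log_init)
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            logsumexp([alpha[p] + log_trans[p][s] for p in range(n_states)])
            + log_emit[s][o]
            for s in range(n_states)
        ]
    return logsumexp(alpha)


def tone_posterior(obs, covariates, weights, tone_hmms):
    """Combine both layers: P(tone | audio, covariates) is proportional to
    P(tone | covariates) * P(audio | tone's HMM)."""
    prior = tone_transition_probs(covariates, weights)
    log_joint = [
        math.log(p) + hmm_log_likelihood(obs, *hmm)
        for p, hmm in zip(prior, tone_hmms)
    ]
    norm = logsumexp(log_joint)
    return [math.exp(v - norm) for v in log_joint]


def _log_rows(rows):
    """Helper: elementwise log of a probability matrix."""
    return [[math.log(p) for p in row] for row in rows]


# Hypothetical two-tone example; each tone has a two-state HMM over two symbols.
calm_hmm = (
    [math.log(0.5), math.log(0.5)],
    _log_rows([[0.9, 0.1], [0.1, 0.9]]),
    _log_rows([[0.8, 0.2], [0.7, 0.3]]),   # calm speech mostly emits symbol 0
)
tense_hmm = (
    [math.log(0.5), math.log(0.5)],
    _log_rows([[0.9, 0.1], [0.1, 0.9]]),
    _log_rows([[0.2, 0.8], [0.3, 0.7]]),   # tense speech mostly emits symbol 1
)
# Covariates: [intercept, previous speaker was tense]; one weight row per tone.
weights = [[0.0, 0.0], [0.0, 1.0]]
posterior = tone_posterior([0, 0, 1, 0], [1.0, 1.0], weights, [calm_hmm, tense_hmm])
```

In this sketch the covariates shift the prior toward one tone while the audio likelihood can pull the posterior the other way, which is the kind of interaction between the two layers that the abstract describes.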
