Accent is an important biometric characteristic that is defined by the presence of specific traits in the speaking style of an individual. These traits are identified through patterns in the speech production system, such as those present in the vocal tract or in lip movements. Evidence from linguistics and speech processing research suggests that visual information enhances speech recognition. Motivated by these findings, and by the assumption that visually perceivable, accent-related patterns are transferred from the mother tongue to a foreign language, we investigate the task of discriminating native from non-native English speech using visual features only. Training and evaluation are performed on segments of continuous visual speech, captured by mobile phones, in which all speakers read the same text. We apply various appearance descriptors to represent the mouth region in each video frame. Vocabulary-based histograms, which serve as the final representation of the dynamic features of each utterance, are used for recognition. Binary classification experiments discriminating native from non-native speakers are conducted in a subject-independent manner. Our results show that this task can be addressed by an automated approach that uses visual features only.
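For illustration, the sketch below outlines a pipeline of the kind described above: per-frame appearance descriptors of the mouth region are quantized against a learned visual vocabulary, each utterance is encoded as a histogram of visual words, and a binary classifier separates native from non-native speakers under subject-independent cross-validation. The descriptor dimensionality, k-means codebook size, and linear SVM are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sketch of the vocabulary-based recognition pipeline (assumptions noted above).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score


def encode_utterance(frame_descriptors, vocabulary):
    """Map per-frame appearance descriptors to a normalized visual-word histogram."""
    words = vocabulary.predict(frame_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)


# Synthetic stand-in data: one descriptor matrix (frames x descriptor dim) per utterance.
rng = np.random.default_rng(0)
n_utterances, frames_per_utt, descr_dim = 40, 120, 64
utterances = [rng.normal(size=(frames_per_utt, descr_dim)) for _ in range(n_utterances)]
labels = rng.integers(0, 2, size=n_utterances)   # 0 = native, 1 = non-native
speakers = np.repeat(np.arange(10), 4)           # 4 utterances per speaker

# 1. Learn a visual vocabulary (k-means codebook) from frame descriptors.
#    In a real evaluation the codebook would be learned from training folds only.
vocab = KMeans(n_clusters=50, n_init=5, random_state=0).fit(np.vstack(utterances))

# 2. Encode every utterance as a vocabulary-based histogram of dynamic features.
X = np.array([encode_utterance(u, vocab) for u in utterances])

# 3. Subject-independent binary classification: no speaker appears in both
#    the training and the test partition of any fold.
scores = cross_val_score(SVC(kernel="linear"), X, labels,
                         groups=speakers, cv=LeaveOneGroupOut())
print("mean accuracy:", scores.mean())
```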