This paper describes a novel spectral conversion method for voice transformation. We perform spectral conversion between speakers using a Gaussian mixture model (GMM) of the joint probability density of source and target features. A smooth spectral sequence can be estimated by applying maximum likelihood (ML) estimation to the GMM-based mapping using dynamic features. However, the converted speech quality still degrades because of over-smoothing of the converted spectra, which is inevitable in conventional ML-based parameter estimation. To alleviate the over-smoothing, we propose an ML-based conversion that takes into account the global variance of the converted parameters within each utterance. Experimental results show that the performance of voice conversion can be improved by using the global variance information. Moreover, the proposed algorithm is demonstrated to be more effective than spectral enhancement by postfiltering.
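The GMM-based mapping summarized above can be sketched as follows. This is a minimal, hypothetical illustration (toy data, placeholder names, scikit-learn's `GaussianMixture`), not the paper's actual implementation: a GMM is fit on joint source-target vectors, and each source frame is converted via the minimum mean-square-error estimate E[y | x] under that joint model. It omits the dynamic-feature ML trajectory estimation and global-variance term that are the paper's contribution.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy parallel features: target y is a noisy linear transform of source x
# (placeholder data, not the paper's experimental setup).
D = 2                      # feature dimension per speaker
N = 500
x = rng.normal(size=(N, D))
y = 0.8 * x + 0.1 * rng.normal(size=(N, D))

# Fit a GMM on the joint source-target vectors z = [x, y].
z = np.hstack([x, y])
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(z)

def convert(x_frame):
    """MMSE mapping E[y | x] under the joint-density GMM."""
    M = gmm.n_components
    resp = np.zeros(M)
    # Posterior P(m | x) from the marginal source-part Gaussians.
    for m in range(M):
        mu_x = gmm.means_[m, :D]
        S_xx = gmm.covariances_[m][:D, :D]
        diff = x_frame - mu_x
        resp[m] = gmm.weights_[m] * np.exp(
            -0.5 * diff @ np.linalg.solve(S_xx, diff)
        ) / np.sqrt(np.linalg.det(2 * np.pi * S_xx))
    resp /= resp.sum()
    # Mix the per-component conditional means of y given x.
    y_hat = np.zeros(D)
    for m in range(M):
        mu_x, mu_y = gmm.means_[m, :D], gmm.means_[m, D:]
        S_xx = gmm.covariances_[m][:D, :D]
        S_yx = gmm.covariances_[m][D:, :D]
        y_hat += resp[m] * (mu_y + S_yx @ np.linalg.solve(S_xx, x_frame - mu_x))
    return y_hat

converted = np.array([convert(xf) for xf in x[:100]])
```

Because this frame-by-frame MMSE mapping averages over mixture components, the converted trajectories tend to have reduced variance, which is exactly the over-smoothing the global-variance criterion is designed to counteract.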