Title: Research on Adaptive Algorithms for Speech Recognition
Author: 李国强
Degree: Doctoral
Defense date: 1999
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Place of conferral: Institute of Acoustics, Chinese Academy of Sciences
Keywords: speech recognition; statistical pattern matching; acoustics
Abstract (Chinese, translated): Automatic speech recognition has made great progress, and one of its core techniques is statistical pattern matching. To make statistical pattern matching effective, a large amount of data must be collected to cover all the acoustic variability that appears in speech recognition applications, such as speaker variability, background noise, the differing effects of microphones and communication channels, and differences between recognition tasks. In practice, however, it is impossible to collect speech data for every situation, and training on data from many different conditions makes the recognizer so diffuse that it performs well in no particular condition. It is therefore necessary to remove the effect of the mismatch between training and testing conditions and so improve recognition performance. Adaptation techniques use a small amount of speech data from the new speaker in the testing condition to reduce this mismatch effectively. Conventional estimation techniques re-estimate the parameters of each model independently and assume that every model has enough speech samples; with only a small amount of adaptation data they can re-estimate only a small fraction of the model parameters, can hardly capture the characteristics of the new speaker, and usually give poor recognition performance. This thesis proposes two adaptation algorithms; they apply equally to channel and noise adaptation, but the thesis concentrates on speaker adaptation.

1. Maximum likelihood smoothing and prediction. Because different speech models are correlated, it is worth developing a speaker adaptation technique that uses the relations between models together with a small amount of new-speaker data to adapt as many model parameters as possible, so that the final models better represent the new speaker. The correlations between speech-unit models are estimated from many sets of speaker-dependent models of the training condition; they are prior information on model correlation common to many speakers. Because training and testing conditions often mismatch, this general correlation information must be adjusted to fit the new speaker in the testing condition. For this purpose we introduce shift matrices and apply a tying technique to estimate them robustly, since they represent the new speaker's characteristics. The degree of tying and the tying classes are determined by the distribution of the adaptation data; all speech models in a tying class share one shift matrix, which is estimated from the adaptation data of those models and then used to predict them. The prediction is thus more specific, yields model parameters better suited to the new speaker, and covers models that do not appear in the adaptation data or have too little of it. Prediction is carried out at the vector level rather than the element level, so the influence of all elements of a parameter vector is considered at once and the prediction is more accurate. Because the algorithm needs considerable computation and storage, it suits simple HMMs, but exploring it helps the further study of more efficient adaptation algorithms.

2. Prior parameter transformation speaker adaptation. The introduction of a prior distribution is the most fundamental difference between the maximum a posteriori (MAP) algorithm and the maximum likelihood (ML) algorithm, and the reason for MAP's success. A strict Bayesian method assumes the prior parameters are known from subjective knowledge or past experience of the problem under study. In practice an empirical Bayesian method is mostly used, estimating the prior parameters directly from the training speech data; such priors carry no information about the new speaker in the testing condition, so the training-testing mismatch remains. To speed up adaptation we propose the prior parameter transformation method, which transforms the priors of the training condition so that they better fit the new speaker. The algorithm applies a linear transformation to the prior parameters and maximizes the posterior probability of the adaptation data, estimating the transformation parameters and the HMM parameters jointly, so that the adapted HMM parameters describe the new speaker better. Because the transformation functions are shared among the prior parameters of different models, models that are unseen in the adaptation data or have too little of it can still be estimated; the adapted models fit the new speaker's individuality, and the method remains effective with very little adaptation data (e.g. three sentences of continuous speech). Whereas the usual transformation methods (such as maximum likelihood linear regression, MLLR) transform the HMM parameters of the speaker-independent (SI) recognition system, the proposed algorithm transforms the prior parameters and, within the MAP framework, combines the transformed priors with the SI HMM parameters to estimate the new speaker's HMM parameters; compared with MLLR it has better asymptotic behaviour, more stable performance, and is less sensitive to random variation. The algorithm needs only the new speaker's adaptation data and the initial HMM parameters, with small computation and storage costs, which favours practical use; it is a promising speaker adaptation algorithm, and the next step is to apply it in a practical recognition system. Comparative experiments on clean speech and telephone speech show that the algorithm adapts well to new speakers and outperforms MLLR.
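For reference on where the prior parameters discussed above enter the estimation, the standard MAP re-estimate of a Gaussian mean (the well-known Gauvain-Lee form; whether the thesis uses exactly this parameterization is an assumption, since the abstract does not quote its formulas) is

\[
\hat{\mu} \;=\; \frac{\tau\,\mu_{0} + \sum_{t}\gamma_{t}\,o_{t}}{\tau + \sum_{t}\gamma_{t}},
\]

where \mu_0 and \tau are the prior mean and prior weight, o_t the adaptation observations, and \gamma_t their occupation probabilities. With little adaptation data the estimate stays near the prior mean; with more data it approaches the maximum likelihood estimate, which is the asymptotic behaviour referred to in the abstract.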
Abstract (English): Many advances have been achieved in automatic speech recognition, largely owing to the use of a statistical pattern recognition paradigm. To make this paradigm effective, a huge amount of training data must be used to cover all possible acoustic variability in speech recognition applications, such as speaker variability, noisy environments, microphones and channels, and different task constraints. However, it is impossible to collect speech data in all situations, and recognition systems trained on data collected over a wide range of acoustic conditions become so diffuse that they cannot do a good job in any specific condition. It is therefore necessary to reduce the effects of mismatches between training and testing conditions and thereby improve recognition performance. Speaker adaptation techniques use a small amount of speech data from a new speaker in the testing condition to compensate effectively for these mismatches. We propose two speaker adaptation algorithms in this work, which are also applicable to channel and noise adaptation.

1. Maximum likelihood smoothing and prediction (MLSP). In the usual estimation techniques, the parameters of each model are estimated separately, so enough speech samples are required for every model before all model parameters can be estimated. With only limited adaptation data available, such techniques re-estimate only a small part of the model parameters and cannot obtain models specific to the new speaker; their performance is therefore poor. It is well known, however, that intrinsic relations exist between basic speech units. It is worth developing an adaptation algorithm that uses these relations and a small amount of adaptation data to re-estimate more model parameters and thus obtain speech models better suited to the new speaker. Relationships between speech-unit models are built from many sets of speaker-dependent models; they represent prior information about the correlation between speech models for many speakers and are common to any speaker. To take the mismatches between training and testing conditions into account, these relationships must be modified to better suit the new speaker in the testing condition, and for this purpose we introduce shift matrices. To estimate robustly, from a small amount of adaptation data, the shift matrices that represent the new speaker's characteristics, we introduce a tying technique that uses prior phonetic knowledge. The scope and extent of tying are determined by phonetic classes and the distribution of the adaptation data. The speech models in each tying class share one shift matrix, which is estimated from the adaptation data of those models and then used to predict the speech models in the class. This not only makes the prediction more specific to the new speaker but also predicts unseen and under-trained models, making the best use of the small amount of adaptation data to speed up adaptation. In addition, MLSP makes predictions at the vector level instead of the element level; it can therefore take into account the effects of all of a vector's elements and make more accurate predictions.
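The tying-and-prediction step can be pictured with a minimal sketch. This is an illustration only, not the thesis's MLSP equations: it uses a simple per-class offset vector in place of the shift matrices described above, and the function and variable names are hypothetical.

    # Illustrative sketch of tied estimation and prediction of Gaussian mean
    # vectors (a simplification of the idea described in the abstract).
    import numpy as np

    def predict_means(si_means, obs_means, counts, classes, min_count=1):
        """si_means:  dict model -> speaker-independent mean vector
        obs_means: dict model -> sample mean of that model's adaptation frames
        counts:    dict model -> number of adaptation frames for the model
        classes:   dict model -> tying class id"""
        # Estimate one shift per tying class from the models that were observed.
        shifts = {}
        for c in set(classes.values()):
            seen = [m for m in si_means
                    if classes[m] == c and counts.get(m, 0) >= min_count
                    and m in obs_means]
            if seen:
                shifts[c] = np.mean([obs_means[m] - si_means[m] for m in seen],
                                    axis=0)
        # Apply the class shift to every model, including the unseen ones.
        return {m: mu + shifts.get(classes[m], np.zeros_like(mu))
                for m, mu in si_means.items()}

    # Toy usage: "a" and "b" share a tying class; only "a" has adaptation data,
    # yet "b" is still moved toward the new speaker by the shared shift.
    si = {"a": np.array([0.0, 0.0]), "b": np.array([1.0, 1.0])}
    obs = {"a": np.array([0.5, 0.2])}
    print(predict_means(si, obs, counts={"a": 10}, classes={"a": 0, "b": 0}))

The point of the sketch is that a model with no adaptation data is still updated through the shift shared with its tying class, which is how unseen and under-trained models are predicted.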
2. Prior parameter transformation (PPT). The introduction of a prior distribution is the basic difference between the maximum a posteriori (MAP) algorithm and the maximum likelihood algorithm, and also the reason for MAP's success. In a strict Bayesian method, the prior parameters are usually assumed known from subjective knowledge of the problem involved. In applications, however, an empirical Bayesian method is used to estimate the prior parameters from the training data themselves, which contain no information about the new speaker, so a mismatch remains in the use of the prior parameters. It is therefore necessary to transform the prior parameters of the training condition to make them more suitable for the new speaker and thus speed up adaptation. The prior parameters and the HMM parameters are estimated jointly by applying a linear transformation to the prior parameters and maximizing the posterior density of the adaptation data; the resulting prior parameters are more suitable for the new speaker, and the HMM parameters are therefore more representative of the new speaker. In addition, the transformation functions are tied across the prior parameters of different models and so can be estimated robustly; PPT can therefore predict unseen and under-trained models and remains effective even with a small amount of adaptation data (e.g. 3 sentences), while making the adapted models more specific to the new speaker. The proposed PPT algorithm transforms the prior parameters and combines the transformed prior parameters with the speaker-independent (SI) HMM parameters within the MAP framework, whereas the usual transformation-based adaptation algorithms (e.g. maximum likelihood linear regression, MLLR) transform the HMM parameters directly. PPT thus has a good asymptotic property and is more stable in performance than MLLR. Because PPT needs only a small amount of adaptation data and one set of HMM models, it can easily be applied in speech recognition applications; it is a promising adaptation algorithm, and the next step is to apply it in our laboratory's speech recognition system. Experiments on clean speech and telephone speech show that PPT has good adaptation performance and outperforms MLLR.
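One way to read the PPT description above is as a tied linear move of the prior means followed by the usual MAP combination. This is a hedged sketch of the idea rather than the thesis's exact derivation, and the parameterization (A_c, b_c) is an assumption:

\[
\mu_{0,i}' = A_{c}\,\mu_{0,i} + b_{c},
\qquad
\hat{\mu}_{i} \;=\; \frac{\tau_{i}\,\mu_{0,i}' + \sum_{t}\gamma_{i,t}\,o_{t}}{\tau_{i} + \sum_{t}\gamma_{i,t}},
\]

where the transform (A_c, b_c) is shared by all models i in class c and is chosen, jointly with the HMM parameters, to maximize the posterior probability of the adaptation data. Because the transform is tied, models with no adaptation data still receive updated priors \mu_{0,i}', and because the combination remains MAP, the estimate converges to the maximum likelihood solution as adaptation data grow, consistent with the asymptotic property claimed above.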
Language: Chinese
Date available: 2011-05-07
Pages: 85
Content type: Dissertation
Source URL: [http://159.226.59.140/handle/311008/664]
Collection: Institute of Acoustics / Master's and doctoral dissertations / Dissertations 1981-2009
Recommended citation:
GB/T 7714
李国强. 语音识别的自适应算法研究[D]. 中国科学院声学研究所, 1999.