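Title: Research on Acoustic Features in Robust Speech Recognition (鲁棒语音识别的声学特征研究)
Author: Ma Xin (马昕)
Degree: Doctoral
Defense date: 2005
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Place of conferral: Institute of Acoustics, Chinese Academy of Sciences
Keywords: dynamic spectral subtraction; modulation spectrum; SNR estimation; speech recognition features; robustness
Alternative title: Research on Acoustic Features in Robust Speech Recognition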
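Abstract (translated from the Chinese): Noise robustness is one of the main challenges facing current speech recognition technology. Recognition systems already perform well in quiet environments, but performance changes considerably once the acoustic environment changes; in noisy environments, for example, it usually drops substantially. Improving the recognition features is an important approach in current research on robust speech recognition. Through an in-depth study of the acoustic features of speech, this thesis proposes several improvements to the acoustic features commonly used in speech recognition. The main contributions are as follows.

1) Non-linear speech feature processing for robust speech recognition. In speech recognition, MFCCs are commonly used as the feature parameters of a short segment of speech, and are usually obtained by applying a DCT to the log Mel energies. In the log Mel energy spectrum, the formant regions are far less sensitive to noise than the valleys. Feature extraction methods based on this idea include the spectral root homomorphic deconvolution system (SRDS) [Lim, 1979] and RMFCC [Tokuda, 1994]. Experiments show, however, that relying solely on non-linear processing of the short-time Mel energy spectrum works poorly for unvoiced sounds, struggles in large-vocabulary continuous speech recognition tasks, and depends heavily on endpoint detection. To improve this kind of algorithm, this thesis designs a valley removed (VR) method combined with dynamic minimum Mel energy spectral subtraction (DMSS). Experiments show that VR combined with DMSS performs stably, has low computational requirements, and preserves the shape of the spectral formants well after processing. Experiments also show that either VR or DMSS used alone performs well in connected-word recognition, while combining VR with DMSS further improves robustness to additive noise in both simple and complex tasks. For example, for speech corrupted by babble noise at SNR = 10 dB, the proposed method raises recognition accuracy for unvoiced consonants from 35.45% to 58.75%, and for voiced consonants and vowels from 56.78% and 67.91% to 76.75% and 78.57%, respectively; syllable accuracy rises from 35.55% to 58.36%.

2) Speech recognition algorithms based on modulation spectrum features. Based on time-frequency distribution theory and conclusions from psychological experiments on speech perception, this thesis proposes a new method that uses modulation spectrum features, to which speech perception is sensitive, directly as speech recognition features. Three kinds of feature parameters are designed for this purpose: modulation spectrum parameters based on the Fourier transform, modulation spectrum parameters combined with standard MFCC parameters, and wavelet modulation scale parameters based on wavelet analysis. Experiments show that although the Fourier-based modulation spectrum parameters alone perform worse than the baseline parameters, combining them with 12-dimensional standard MFCC parameters yields a recognition rate close to that of MFCC with first- and second-order differences. This indicates that the Fourier-based modulation spectrum parameters can serve as dynamic features of speech, playing a role equivalent to the first- and second-order differences of MFCC. To overcome the low time-scale resolution of the Fourier-based modulation spectrum, this thesis designs wavelet modulation scale parameters; experiments show that using them directly as recognition parameters achieves a recognition rate close to that of MFCC parameters. Channel distortion and time-scale distortion are two common forms of distortion. This thesis designs a normalization method applicable to modulation spectrum parameters or wavelet modulation scale parameters; after this normalization the parameters are more robust to channel and time-scale distortion, as the recognition experiments in this chapter confirm.

3) Application of modulation-spectrum-based SNR estimation in recognition. Neurophysiological research indicates that the auditory nerve's responses to the modulation spectrum and to characteristic frequency are organized approximately orthogonally in the human brain. The amplitude modulation spectrum (AMS) pattern [J. Tchorz, 2001] is a new type of speech classification pattern designed on the basis of these findings in auditory physiology. AMS-based SNR estimation works well not only for voiced segments but also for unvoiced segments, which have no pitch or harmonic components [J. Tchorz, 2001]. This thesis focuses on applying the AMS SNR estimation method to speech recognition. A weighted SNR is estimated directly from the AMS pattern, and spectral subtraction is then applied to improve the recognition rate for speech containing additive noise. To further improve recognition, this thesis designs an improved spectral subtraction method and a method that applies spectral subtraction after checking the noise estimate with a confidence interval. Experimental results show that both methods further improve recognition performance. The confidence-interval method makes the recognizer more robust and is an effective way to resist additive noise.

4) Application of modulation spectrum features in embedded robust speech recognition. The study of the modulation spectrum found that, in simple recognition tasks, modulation spectrum parameters used alone perform about as well as standard MFCC parameters, and that moderately lengthening the time span used to compute the modulation spectrum has little effect on recognition. In simple recognition tasks this is very useful for resource-constrained embedded systems. In addition, in fixed-point embedded systems, word length, number representation, and storage format often cause large deviations in computed results, which lowers the recognition rate on embedded systems. Normalizing the modulation spectrum compensates for these effects. Simulation experiments on a fixed-point DSP simulation system show that using modulation spectrum features as recognition parameters in simple recognition tasks effectively reduces the demand on system resources. With normalization compensation, the isolated-word recognition rate is 3 percentage points higher than without normalization, and close to that of MFCC parameters.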
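The first contribution rests on two generic steps: subtracting a minimum-based noise estimate from the Mel band energies, then taking the log and a DCT to obtain cepstral features. The sketch below illustrates only those generic steps in plain Python. It is not the thesis's DMSS or VR algorithm: the abstract does not specify the window length, the spectral floor, or how valleys are removed, so `window`, `floor`, and all function names here are assumptions made for illustration, and valley removal is omitted.

```python
import math


def running_min_noise(mel_frames, window=5):
    """Per-band noise estimate: minimum energy over the last `window` frames.

    `window` is an illustrative assumption, not a value from the thesis.
    """
    noise = []
    for t in range(len(mel_frames)):
        start = max(0, t - window + 1)
        recent = mel_frames[start:t + 1]
        n_bands = len(mel_frames[t])
        noise.append([min(frame[b] for frame in recent) for b in range(n_bands)])
    return noise


def subtract_with_floor(mel_frames, noise, floor=0.01):
    """Subtract the noise estimate, keeping at least `floor` times the original energy.

    The floor keeps every energy positive so the later log is defined.
    """
    return [
        [max(e - n, floor * e) for e, n in zip(frame, noise_frame)]
        for frame, noise_frame in zip(mel_frames, noise)
    ]


def log_dct(frame, n_ceps):
    """Log-compress Mel energies, then apply an unnormalized DCT-II."""
    n = len(frame)
    logs = [math.log(e) for e in frame]
    return [
        sum(logs[k] * math.cos(math.pi * i * (k + 0.5) / n) for k in range(n))
        for i in range(n_ceps)
    ]


if __name__ == "__main__":
    # Toy Mel energies: 3 frames, 2 bands (made-up numbers).
    frames = [[4.0, 2.0], [5.0, 1.0], [6.0, 3.0]]
    noise = running_min_noise(frames, window=2)
    cleaned = subtract_with_floor(frames, noise, floor=0.1)
    for frame in cleaned:
        print(log_dct(frame, n_ceps=2))
```

Because the running minimum includes the current frame, any frame that is itself the minimum falls back to the floor value; real systems usually use a longer window so that speech frames are rarely the minimum.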
English abstract: Modern automatic speech recognition (ASR) systems work well in quiet environments, but their performance degrades rapidly in real-world noisy environments. Robustness against noise has therefore become one of the most challenging problems in ASR. Improving the acoustic features is one important route to robust speech recognition. By studying the effect of noise on acoustic features in ASR, this thesis proposes several improved ways of extracting speech features. The main contributions are:

1) Non-linear processing of short-time spectral features combined with dynamic minimum subband spectral subtraction (DMSS). In speech recognition systems, MFCCs are often used as short-time speech features, usually obtained by applying log and DCT operations to the short-time power spectrum. It is well known that, in the log Mel power spectrum, the peaks are less sensitive to perturbations than the valleys. Based on this view, techniques such as the spectral root homomorphic deconvolution system (SRDS) [Lim, 1979] and root Mel cepstral coefficients (RMFCC) [Tokuda, 1994] have been proposed to suppress this unnecessary sensitivity of the log Mel power spectrum of the speech signal. This thesis adopts the valley-removed log Mel power spectrum as short-time features and combines it with dynamic minimum subband spectral subtraction to improve the robustness of ASR. Experimental results show that the proposed method is stable and performs well in ASR. Using the valley removed (VR) technique alone works well in simple recognition tasks, but it depends on the precision of voice detection, and in large-vocabulary continuous speech recognition its performance degrades, especially for unvoiced speech. Combined with DMSS, the VR method performs well in both simple and complex tasks. For example, for unvoiced speech, accuracy improves from 35.45% to 58.75%.

2) Robust speech recognition based on the modulation spectrum. Based on the theory of time-frequency distributions and on human perception, we propose using the modulation spectrum (MS) directly as short-time speech features in ASR, and design a set of MS-based speech features: modulation spectrum coefficients (MSC), combined MFCC and MSC parameters, and wavelet modulation scales (WMS). Our tests show that although MSC alone performs slightly worse than MFCC with delta and acceleration coefficients, MSC combined with standard MFCC achieves almost the same performance as MFCC with delta and acceleration coefficients. FFT-based modulation spectrum features are therefore good dynamic features of speech, with a function similar to that of delta and acceleration coefficients. To improve the time-scale resolution of the FFT-based MS, we design MS features based on wavelet analysis, the wavelet modulation scales. Large-vocabulary experiments show that they achieve performance similar to MFCC with delta and acceleration coefficients. Because channel distortion and time distortion often occur in ASR, we design a normalization method for MS and WMS. Experiments show that normalized MS and WMS features are robust against channel and time distortion.

3) Application of AMS SNR estimation in speech recognition. Neurophysiological findings suggest that the "periodotopical" organization of neurons with respect to their best modulation frequencies is almost orthogonal to the tonotopical organization of neurons with respect to center frequencies. Motivated by this, the AMS pattern was designed as a novel approach to classification. The AMS pattern provides reliable discrimination not only between voiced speech and noise but also between unvoiced speech and noise. This thesis focuses on the application of AMS SNR estimation in robust speech recognition. Applying this method directly yields an SNR weighted by triangular filters; spectral subtraction is then used to reduce the effect of noise. To further improve performance, we design two post-processing methods applied after SNR estimation, "Improved SS" and "SS based Confidence Interval Test". Experimental results show that the latter is better at attenuating the effect of noise.

4) Application of the modulation spectrum in embedded robust recognition systems. An extension of the modulation spectrum experiments to simple recognition tasks shows that using the modulation spectrum as short-time speech features yields performance similar to MFCC. Moreover, if the time span used to calculate the modulation spectrum is suitably lengthened, the recognition rate drops only slightly. This property is helpful for embedded recognition systems, which have limited hardware resources. In addition, in fixed-point embedded systems, the storage mode and number representation can cause large deviations in computed results, which reduces the recognition rate. To some degree, normalized modulation spectrum features can compensate for these effects. Experiments on a fixed-point DSP simulation system show that using the modulation spectrum as features effectively reduces resource requirements in simple recognition tasks. After normalization, the recognition rate increases by 3 percentage points and comes close to that of MFCC.
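As a rough illustration of a Fourier-based modulation spectrum feature, the sketch below takes the log-energy trajectory of one frequency band over a block of frames, removes its mean, and keeps the magnitudes of the low DFT bins. Everything in it is an assumption for illustration: the abstract does not give the thesis's feature definition or its normalization, so the mean removal, the sum-to-one scaling, and the names `modulation_spectrum`, `normalize_spectrum`, and `bin_frequency_hz` are hypothetical. Mean removal in the log domain is one common way to cancel a fixed channel gain; it does not address time-scale distortion.

```python
import cmath
import math


def modulation_spectrum(log_energy, n_bins):
    """Magnitude DFT of one band's log-energy trajectory, mean removed.

    A fixed channel gain adds a constant to the log energy, so removing
    the mean cancels it. Bin 0 is skipped because it is zero afterwards.
    """
    n = len(log_energy)
    mean = sum(log_energy) / n
    centered = [x - mean for x in log_energy]
    mags = []
    for k in range(1, n_bins + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        mags.append(abs(acc))
    return mags


def normalize_spectrum(mags, eps=1e-12):
    """Scale magnitudes to sum to one (an assumed, generic normalization)."""
    total = sum(mags) + eps
    return [m / total for m in mags]


def bin_frequency_hz(k, n_frames, frame_rate_hz):
    """Modulation frequency of DFT bin k for a block of n_frames frames."""
    return k * frame_rate_hz / n_frames


if __name__ == "__main__":
    # Synthetic trajectory: 16 frames with two modulation cycles.
    traj = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
    spectrum = normalize_spectrum(modulation_spectrum(traj, n_bins=4))
    for k, value in enumerate(spectrum, start=1):
        print(bin_frequency_hz(k, 16, 100.0), value)
```

A longer block of frames gives finer modulation-frequency resolution at the cost of latency, which is the trade-off the abstract alludes to for embedded systems.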
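Language: Chinese
Release date: 2011-05-07
Pages: 99
Content type: Thesis
Source URL: [http://159.226.59.140/handle/311008/916]
Collection: Institute of Acoustics_IOA Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses
Recommended citation
GB/T 7714
Ma Xin (马昕). Research on Acoustic Features in Robust Speech Recognition (鲁棒语音识别的声学特征研究) [D]. Institute of Acoustics, Chinese Academy of Sciences. 2005.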