Title: Research on Key Issues in Spoken Speech Interaction
Author: Chen Xiao
Degree: Doctor of Engineering
Defense date: 2016-05-20
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Place of conferral: Beijing
Supervisor: Xu Bo
Keywords: speech interaction; speech recognition; text-based intonation recognition; speech-based intonation recognition; voice activity detection; incremental speech recognition; pitch extraction
Chinese abstract        In recent years, speech interaction technology has seen a new wave of development: its performance has improved greatly, and products built on it keep emerging. Speech recognition is the front-end processing module of a speech interaction system, and its performance is critical to the system as a whole. However, traditional speech recognition still falls short on the interaction side and cannot meet the needs of new products. For example, its interactive content is only the transcript, so some information is lost; its input method is pressing or clicking, which is inconvenient and unnatural and is affected by distance and ambient light; and its output method delivers the transcript only after the whole utterance has been fully input and fully recognized, so the response is not fast enough. Such limited interactive content and unnatural interaction hurt the user experience.
Therefore, aiming at friendlier speech interaction, this thesis studies two aspects: richer interactive content in speech recognition, and more convenient and natural interaction in speech recognition. The main innovations and contributions are as follows.
  1. For text-based intonation recognition, this thesis proposes a method based on global lexical information. It models a sentence at three different granularities using global lexical information, fuses the per-granularity models with a multi-layer perceptron, and thereby recognizes the intonation of a single spoken-language sentence. Experiments show that the method outperforms approaches using local lexical information, such as hidden-event language models and conditional random fields, and also outperforms a recurrent-neural-network language-model method.
  2. For speech-based intonation recognition, this thesis proposes a method based on tone and final (vowel) features. Building on acoustic, prosodic, and intonation features, it adds tone- and final-related features and further optimizes the feature set with decision-tree feature selection. Experiments show that both the added features and the feature selection improve recognition accuracy.
  3. For voice activity detection, this thesis proposes an algorithm based on subspace Gaussian mixture models and phoneme merging. Within a phoneme-recognition framework, it uses subspace Gaussian mixture models for acoustic modeling and determines the modeling units by combining expert knowledge with data-driven methods; it can train directly on existing speech recognition annotations, reducing the labeling burden. Experiments show that, compared with the speech/silence segmentation produced by force-aligning speech recognition results, the method roughly halves the frame error rate.
  4. For the unstable outputs of incremental speech recognition, this thesis proposes a method based on stable-time prediction. Using the acoustic scores of the N-best paths over successive frames, it predicts how long the current partial result will remain stable, so the stability of the current output can be judged in advance. Experiments show that the method reduces the algorithm's latency, i.e., improves its stability.
  5. For pitch extraction, a foundational technology of speech interaction, this thesis proposes an improved autocorrelation-based pitch extraction algorithm. On top of the original autocorrelation method, it takes three measures: using texture features of the speech spectrum to raise the weight of correct pitch values, enlarging the search space by increasing the number of pitch candidates, and constraining the search path with reliable seeds. Together these increase the proportion and weight of correct pitch values in the search space and thus optimize it. Experiments show that the method improves on the original pitch extraction algorithm.
English abstract        In the past few years, speech interaction technology has experienced a new wave of development: its performance has been greatly improved, and many products based on it have appeared. Speech recognition is the front-end processing module of a speech interaction system, and its performance is critical to that of the whole system. However, traditional speech recognition still has deficiencies that cannot meet the needs of new products. For example, its interactive content is only the transcript, so some information is lost; its input method is pressing or clicking, which is inconvenient and unnatural; and its output method delivers the transcript only after the whole utterance has been completely input and fully recognized, so the response is not fast enough. The limited interactive content and unnatural interaction methods hurt the user experience.
Therefore, this thesis aims to improve the friendliness of speech interaction, focusing on two aspects: enriching the interactive content of speech recognition, and making its interaction more convenient and natural. The main work and contributions are as follows.
  1. For text-based intonation recognition, this thesis proposes a method based on global lexical information. It first models the relation between global lexical information and intonation at three sentence granularities, then fuses these models with a multi-layer perceptron, and finally classifies the intonation. Results show that the proposed method outperforms HELM-based and CRF-based methods that use local lexical information, as well as an RNNLM-based method.
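The fusion step can be sketched as a small forward pass. This is an illustrative sketch, not the thesis's trained model: the four intonation classes, the per-granularity posteriors, and the randomly initialized weights are all placeholders, assuming each granularity model emits class posteriors that a one-hidden-layer perceptron then combines.

```python
import math
import random

random.seed(0)

def mlp_fuse(x, W1, b1, W2, b2):
    """One-hidden-layer perceptron: tanh hidden units, softmax output."""
    h = [math.tanh(sum(xi * w for xi, w in zip(x, row)) + b)
         for row, b in zip(W1, b1)]
    logits = [sum(hi * w for hi, w in zip(h, row)) + b
              for row, b in zip(W2, b2)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]

# Hypothetical posteriors over 4 intonation classes from three granularities.
whole  = [0.10, 0.70, 0.10, 0.10]   # whole-sentence model
halves = [0.20, 0.50, 0.20, 0.10]   # half-sentence model
words  = [0.25, 0.40, 0.25, 0.10]   # word-level model
x = whole + halves + words          # 12-dim fusion input

HIDDEN, OUT = 8, 4
W1 = [[random.gauss(0, 1) for _ in x] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(OUT)]
b2 = [0.0] * OUT

posterior = mlp_fuse(x, W1, b1, W2, b2)
```

In practice the weights would be trained on labeled spoken-sentence text; here they only demonstrate the data flow from sub-model scores to a fused class posterior.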
  2. For audio-based intonation recognition, this thesis proposes a method based on tone and vowel features. It first analyzes acoustic, prosodic, and intonation features, then proposes a new method that adds tone and vowel features and applies feature selection. Experimental results show that both the new features and the feature selection improve performance.
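A minimal sketch of tree-style feature selection, under the assumption that features are discretized: each feature is scored by the information gain of a single split (the criterion a decision tree uses at each node), and only the top-ranked features are kept. The toy data and the top-k rule are illustrative, not the thesis's exact procedure.

```python
import math

def entropy(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(column, labels):
    """Entropy reduction from splitting on one discrete feature (a tree stump)."""
    gain = entropy(labels)
    for v in set(column):
        subset = [y for f, y in zip(column, labels) if f == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def select_features(columns, labels, k):
    """Return the indices of the k features with the highest single-split gain."""
    ranked = sorted(range(len(columns)),
                    key=lambda i: info_gain(columns[i], labels),
                    reverse=True)
    return sorted(ranked[:k])

# Toy data: feature 0 predicts the label perfectly, feature 1 is noise.
columns = [[1, 1, 0, 0], [1, 0, 1, 0]]
labels  = [1, 1, 0, 0]
print(select_features(columns, labels, 1))  # → [0]
```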
  3. For voice activity detection, this thesis proposes a method based on subspace Gaussian mixture models and phoneme merging. The method is efficient and compact, and can directly use existing speech recognition corpora. Results show that it cuts the baseline's frame error rate roughly in half.
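The phoneme-merging idea can be sketched as a post-processing step: assuming a phone recognizer outputs (phone, start, end) triples, units in an assumed silence class are collapsed into silence spans and everything else into speech spans. The unit names (`sil`, `sp`, `noise`) and the example phones are placeholders, not the thesis's actual unit inventory.

```python
SILENCE_UNITS = {"sil", "sp", "noise"}  # assumed silence-class modeling units

def phones_to_vad(segments):
    """Collapse (phone, start, end) recognizer output into speech/silence spans."""
    spans = []
    for phone, start, end in segments:
        label = "silence" if phone in SILENCE_UNITS else "speech"
        if spans and spans[-1][0] == label:       # same class: extend previous span
            spans[-1] = (label, spans[-1][1], end)
        else:                                     # class change: open a new span
            spans.append((label, start, end))
    return spans

result = phones_to_vad([("sil", 0.0, 0.3), ("n", 0.3, 0.4),
                        ("i", 0.4, 0.5), ("sp", 0.5, 0.7)])
print(result)  # → [('silence', 0.0, 0.3), ('speech', 0.3, 0.5), ('silence', 0.5, 0.7)]
```

This post-processing is what lets ordinary transcribed speech recognition corpora serve as VAD training data: the phone labels already imply the speech/silence segmentation.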
  4. For the instability problem of incremental speech recognition, this thesis proposes a method based on stable-time prediction. Using the acoustic scores of the N-best paths over successive frames, it predicts how long the current best partial result will remain stable, so it can decide in advance whether to output that partial result. Results show that the method outperforms the baseline, reducing lag and improving stability.
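A minimal sketch of the early-commit decision, assuming each decoding frame yields a score-sorted N-best list of partial hypotheses: the 1-best is committed only when it has stayed on top with a sufficient acoustic-score margin over the 2-best for k consecutive frames. The fixed `k` and `margin` thresholds stand in for the thesis's learned stable-time predictor and are illustrative only.

```python
def commit_partial(nbest_history, k=3, margin=5.0):
    """Return the current 1-best partial result if it looks stable, else None.

    nbest_history: per-frame lists of (hypothesis, acoustic_score),
    each list sorted by score in descending order.
    """
    if len(nbest_history) < k:
        return None                               # not enough evidence yet
    recent = nbest_history[-k:]
    top = recent[-1][0][0]                        # current 1-best hypothesis
    for frame in recent:
        (best, best_score), (_, second_score) = frame[0], frame[1]
        if best != top or best_score - second_score < margin:
            return None                           # leader changed or margin too small
    return top

history = [
    [("how are", -100.0), ("hour", -110.0)],
    [("how are", -120.0), ("hour", -131.0)],
    [("how are", -140.0), ("hour", -152.0)],
]
print(commit_partial(history))  # → how are
```

Committing only stable prefixes is what trades a little latency for output that no longer flickers as more audio arrives.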
  5. For pitch extraction, this thesis proposes an improved autocorrelation-based pitch extraction algorithm for speech processing. The method uses three measures to rectify the weight and proportion of correct pitch values in the search space. Experimental results show that the proposed method substantially outperforms the other algorithms.
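The baseline this builds on is easy to sketch: the classic autocorrelation search that picks, within a plausible pitch range, the lag where the frame correlates best with itself. The three proposed improvements (spectral-texture weighting, extra candidates, reliable seeds) are omitted here; the 16 kHz rate, 60–400 Hz range, and synthetic test tone are assumptions for illustration.

```python
import math

def acf_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Pick the lag with the largest autocorrelation in the plausible pitch range."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]                 # remove DC offset
    lo, hi = int(sr / fmax), int(sr / fmin)       # lag bounds from pitch bounds
    best_lag, best_ac = lo, float("-inf")
    for lag in range(lo, hi + 1):
        ac = sum(x[i] * x[i + lag] for i in range(n - lag))
        if ac > best_ac:
            best_ac, best_lag = ac, lag
    return sr / best_lag                          # lag (samples) -> pitch (Hz)

sr = 16000
frame = [math.sin(2 * math.pi * 200.0 * t / sr) for t in range(640)]  # 200 Hz tone
print(round(acf_pitch(frame, sr)))  # → 200
```

On real speech this plain version suffers octave errors (harmonics also produce strong correlation peaks), which is precisely the weakness the thesis's candidate weighting and seed-constrained search address.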
Content type: Dissertation
Source URL: http://ir.ia.ac.cn/handle/173211/11818
Collection: Graduates — Doctoral Dissertations
Author's affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
Chen Xiao. Research on Key Issues in Spoken Speech Interaction [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2016.