Research on Data-Driven Visual Speech Synthesis

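Title	Research on Data-Driven Visual Speech Synthesis (基于数据驱动的可视语音合成研究)
Author	Zhou Mi (周密)
Degree type	Master of Engineering
Defense date	2008-05-28
Degree-granting institution	Graduate University of the Chinese Academy of Sciences
Place of conferral	Institute of Automation, Chinese Academy of Sciences
Advisor	Tao Jianhua (陶建华)
Keywords	visual speech synthesis; MPEG-4; visual prosody; unit selection; talking head
Alternative title	Data-Driven Visual Speech Synthesis
Degree major	Computer Application Technology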
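Abstract (translated from Chinese)	Research on visual speech synthesis has greatly narrowed the distance between humans and computers: it makes human-computer interaction more harmonious, improves the accuracy of recognition and expression during interaction, and can be applied widely in virtual reality, virtual hosts, virtual conferencing, film production, games and entertainment, and many other fields. As visual speech synthesis technology matures, researchers have begun to shift their focus to two questions: 1) how to incorporate other non-verbal information into facial animation, so that the synthesized face shows not only local lip movement but also natural expressions and head movement, moving facial animation from "stiff" to "lively" and producing expressive visual speech; 2) how to balance database size against realism, reducing the database without degrading synthesis quality while improving the flexibility and realism of the synthesis system. This thesis follows these two lines. Building on an existing visual speech synthesis system, it analyzes visual prosody in Chinese, adopts a data-driven modeling approach, incorporates non-verbal information into the original system, and builds a more expressive Chinese text-to-visual-speech conversion system. The thesis first briefly introduces the research background and content of visual speech synthesis, then describes the main work according to the three main parts of building the system: 1) For Chinese read aloud in a neutral emotional state, it studies how prosodic word boundaries affect head movement and how phoneme articulation itself affects head movement. It derives head movement patterns within two-character prosodic words, and summarizes the head-raising phonemes that strongly affect head movement as well as the head initialization movement that precedes each utterance, providing theoretical support for the later integration of visual prosody; 2) It builds several multimodal databases based on the MPEG-4 standard and suited to different applications. The CASIA multimodal database was built with a real-time motion capture device; MPEG-4-based facial motion features were analyzed and extracted from the multimodal databases; the FAP parameter extraction method removed a large amount of redundant data, and a deformable template method improved the robustness of the captured data; 3) It implements text-to-visual-speech conversion with a mapping method based on dynamic unit selection. Control parameters are synthesized with a data-driven method; after resampling and smoothing, the synthesized facial motion feature parameters are output and drive an MPEG-4 mesh animation model, forming a Chinese visual speech synthesis system.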
English abstract	The development of visual speech synthesis technology has greatly shortened the distance between human and computer. As the technology matures, more researchers are turning their focus to the following two questions: 1) How can non-verbal information be integrated into facial animation so that the system synthesizes not only lip movement but also facial expression, making the virtual talking head more lifelike? 2) How can the size of the database be balanced against expressiveness, that is, how can the database be reduced without sacrificing the expressiveness of the talking head, making the overall system more flexible and realistic? Our study addresses both points by integrating non-verbal information into a previous TTVS (text-to-visual-speech) system and seeking a new visual speech synthesis method, in order to build a more expressive Chinese TTVS system. The thesis first gives a brief introduction to the background and research content of visual speech synthesis. It then describes the work according to the three main steps of building such a system: 1) We study visual prosody in Chinese articulation, in particular how prosodic word boundaries and phonemes affect head movement in speech read in a neutral state; the findings provide useful conclusions for the subsequent synthesis. 2) We establish a labeled, MPEG-4-compliant multimodal database, the CASIA Multimodal Database, using a motion capture system. MPEG-4-compliant FAP parameters are extracted from this database with little redundancy, and a deformable template method is applied to make the captured data more robust. 3) We implement an expressive visual speech synthesis system with vivid expression output, using dynamic unit selection to synthesize the parameters that drive an MPEG-4 face model.
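Language	Chinese
Other identifier	200528014628077
Content type	Thesis
Source URL	[http://ir.ia.ac.cn/handle/173211/7459]
Collection	Graduates_Master's theses
Recommended citation (GB/T 7714)	Zhou Mi. 基于数据驱动的可视语音合成研究 [Research on Data-Driven Visual Speech Synthesis] [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences. 2008.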
