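Title: Research on Chinese Language Processing and Language Modeling (汉语语言处理及语言模型研究)
Author: Zhang Shuwu (张树武)
Degree type: Doctor of Engineering
Defense date: 1997-06-01
Degree-granting institution: Institute of Automation, Chinese Academy of Sciences
Place of conferral: Institute of Automation, Chinese Academy of Sciences
Advisors: Huang Taiyi (黄泰翼); Ma Songde (马颂德)
Keywords: Language model; Speech recognition; Natural language processing; Information integration
Degree major: Pattern Recognition and Intelligent Systems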
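Chinese abstract (translated): Language processing is an important component of the speech recognition process. It is also a multidisciplinary research topic that draws on artificial intelligence, probability theory, computational linguistics, information theory, and cognitive psychology. During three years of doctoral study, the author carried out a comprehensive theoretical study and engineering implementation of Chinese language processing and highly robust Chinese language models, in the context of the concrete task of Chinese speech recognition. The main work and research results are as follows:
1. For the concrete task of speech recognition, an engineering method for corpus processing and basic language information collection was proposed and implemented. It includes: (1) a word definition and word segmentation principles suited to the requirements of speech recognition; (2) a method for new-word discovery and smoothing of statistical bias based on word connection probability, addressing the statistical errors that large-scale corpus processing can cause; (3) a method for learning the class membership of dictionary words based on semantic classification, addressing the word classification requirements of speech recognition.
2. Based on a 20-million-character People's Daily (《人民日报》) corpus, a word trigram Chinese language model was built, among the first in China, and successfully applied to a speaker-independent, large-vocabulary Chinese dictation system. It clearly improved the recognition results and confirmed in practice the important role of language models in Chinese speech recognition.
3. The question of a reasonable value of N in Chinese N-gram language models was analyzed, with the conclusion that the four-gram model is a comparatively good statistical language model for Chinese.
4. An improved trigram language model was proposed. On top of linear interpolation for local smoothing of data sparsity, the model uses word similarity information and local POS information to estimate the trigram and bigram local probabilities in the interpolation model, further smoothing the sparsity of the statistical data.
5. Considering long-distance semantic associations between words within a sentence, a word association (WA) language model over the sentence space was studied and built. Combined with the short-range n-gram model, it can describe and reflect the self-organization regularities of word order within the sentence space.
6. Building on the two models above, an information-integration language model was proposed that uses multiple language information sources and knowledge sources to solve for word sequences jointly. It is a multi-source, multi-model integration scheme composed mainly of five relatively independent sub-models. Besides the improved trigram model and the WA model above, it includes:
· A stochastic context-free phrase grammar (SCFG) model, used to strengthen the matching of local phrase patterns and to combine n-gram word connection information with local special grammar rule knowledge in a unified language representation model.
· A long-sequence character string connection model, used to supplement and improve the probability scoring of high-confidence long character sequences.
· A dynamic adaptive model, used to adjust the overall applicability of the model dynamically in a specific language environment.
Correspondingly, the seven categories of language knowledge sources and information sources proposed in this thesis are integrated, according to their specific function and tightness of coupling, into the above ...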
English abstract: With the development of speech recognition techniques, the study of language modeling is becoming more important and is currently one of the most active topics in speech and language processing. During three years of doctoral study, the author systematically investigated problems relevant to Chinese language modeling for speech recognition and presents some original approaches that are effective in solving these problems and improving speech recognition accuracy. The main contributions of the author are as follows:
1. Oriented toward speech recognition, an engineering approach for corpus processing and language information acquisition was introduced. Some practical problems are also discussed.
2. Based on a corpus of the full text of "People's Daily" in 1993, with about 20,000,000 Chinese characters, we took the lead in setting up a word trigram Chinese language model in 1995. It has been applied successfully in our Chinese dictation system with 32K words and has drastically reduced the error rate.
3. An improved trigram model with word similarity information and local POS knowledge was presented for further data smoothing.
4. Taking into account the drawbacks of most current statistical language models, the author proposed an integrated language model with multi-KS (knowledge sources and probability sources) in multiple models, including a baseline improved trigram model, a word association (WA) model, a practical stochastic phrase grammar (SCFG) model, and a dynamic adaptive model. Preliminary experiments show that the integration of multi-KS is an effective strategy for solving the self-organization of linguistic units.
5. Some detailed problems in language processing and modeling were also investigated systematically.
6. A more general and flexible self-organizing language model was tentatively analyzed.
This dissertation is a survey of the author's work. It consists of eleven chapters.
Chapter 1 reviews the development of, and representative work in, natural language processing and speech recognition. Chapter 2 describes the current status and main approaches of language modeling. Chapters 3 and 4, based on the basic process of acquiring language information from a corpus, address some special problems and their corresponding solutions, which include:
· A set of principles for Chinese word definition and segmentation used in speech recognition.
· An approach to finding new words and smoothing statistical errors.
· A novel algorithm for language information acquisition without the help of a prime dictionary.
· A practical word class learning approach.
Chapter 5 introduces Chinese interpolated trigram modeling and its application in our speaker-independent Chinese dictation system with a 32K vocabulary. Chapter 6
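The interpolated trigram mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it shows only generic linear interpolation of maximum-likelihood trigram, bigram, and unigram estimates with fixed weights. The class name `InterpolatedTrigram`, the weights `(0.6, 0.3, 0.1)`, the sentence markers `<s>` and `</s>`, and the toy pre-segmented corpus are all assumptions made for illustration. The word-similarity and POS refinements described in the abstract are not modeled here.

```python
from collections import Counter


class InterpolatedTrigram:
    """Toy linearly interpolated trigram model (illustrative, hypothetical)."""

    def __init__(self, lambdas=(0.6, 0.3, 0.1)):
        # Assumed fixed weights for trigram, bigram, unigram; must sum to 1.
        if abs(sum(lambdas) - 1.0) > 1e-9:
            raise ValueError("interpolation weights must sum to 1")
        self.l3, self.l2, self.l1 = lambdas
        self.uni = Counter()      # c(w)
        self.bi = Counter()       # c(v, w)
        self.tri = Counter()      # c(u, v, w)
        self.bi_ctx = Counter()   # number of times v occurs as a bigram context
        self.tri_ctx = Counter()  # number of times (u, v) occurs as a trigram context
        self.total = 0            # number of predicted tokens

    def train(self, sentences):
        """Count n-grams over sentences that are already segmented into words."""
        for words in sentences:
            padded = ["<s>", "<s>", *words, "</s>"]
            for u, v, w in zip(padded, padded[1:], padded[2:]):
                self.uni[w] += 1
                self.bi[(v, w)] += 1
                self.tri[(u, v, w)] += 1
                self.bi_ctx[v] += 1
                self.tri_ctx[(u, v)] += 1
                self.total += 1

    def prob(self, u, v, w):
        """Return P(w | u, v) as a weighted sum of the three ML estimates."""
        p1 = self.uni[w] / self.total if self.total else 0.0
        p2 = self.bi[(v, w)] / self.bi_ctx[v] if self.bi_ctx[v] else 0.0
        ctx = self.tri_ctx[(u, v)]
        p3 = self.tri[(u, v, w)] / ctx if ctx else 0.0
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1


# Toy corpus, assumed to be segmented into words beforehand.
model = InterpolatedTrigram()
model.train([["我", "爱", "北京"], ["我", "爱", "中国"]])
print(model.prob("我", "爱", "北京"))
```

Because each component estimate is a proper distribution over the words seen in training, the interpolated probabilities for a seen context also sum to 1. A word that never followed the context still receives a small probability from the unigram term, which is the smoothing effect that interpolation is meant to provide. In practice the weights would be estimated on held-out data rather than fixed by hand.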
Language: Chinese
Other identifier: 400
Content type: Dissertation
Source URL: http://ir.ia.ac.cn/handle/173211/5670
Collection: Graduates_Doctoral Dissertations
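Recommended citation
GB/T 7714
Zhang Shuwu. Research on Chinese Language Processing and Language Modeling (汉语语言处理及语言模型研究)[D]. Institute of Automation, Chinese Academy of Sciences. Institute of Automation, Chinese Academy of Sciences. 1997.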