Research on Chinese Language Processing and Language Models (汉语语言处理及语言模型研究)

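Title	Research on Chinese Language Processing and Language Models (汉语语言处理及语言模型研究)
Author	Zhang Shuwu (张树武)
Degree type	Doctor of Engineering
Defense date	1997-06-01
Degree-granting institution	Institute of Automation, Chinese Academy of Sciences
Place of conferral	Institute of Automation, Chinese Academy of Sciences
Advisors	Huang Taiyi (黄泰翼); Ma Songde (马颂德)
Keywords	Language model; Speech recognition; Natural language processing; Information integration
Degree discipline	Pattern Recognition and Intelligent Systems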
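Abstract (translated from Chinese)	Language processing is an important component of the speech recognition process, and also a multidisciplinary research topic that draws on artificial intelligence, probability, computational linguistics, information theory, and cognitive psychology. During three years of doctoral study, and working from the concrete task of Chinese speech recognition, the author carried out a comprehensive theoretical study and engineering implementation of Chinese language processing and highly robust Chinese language models. The main work and results are as follows. 1. For the concrete task of speech recognition, an engineering method for corpus processing and basic language information collection was proposed and implemented, including: (1) principles for word definition and word segmentation suited to the requirements of speech recognition; (2) a method, based on word connection probabilities, for discovering new words and smoothing the statistical errors that large-scale corpus processing can introduce; (3) a method, based on semantic classification, for learning the class membership of dictionary words, in line with the word classification needs of speech recognition. 2. Based on a 20-million-character corpus from the People's Daily (《人民日报》), a word trigram Chinese language model was built, among the first in China, and applied successfully in a speaker-independent large-vocabulary Chinese dictation system, where it clearly improved the recognition results; this also confirmed in practice the important role of the language model in Chinese speech recognition. 3. The question of a reasonable value of N for Chinese N-gram language models was analyzed, with the conclusion that the four-gram model is a relatively good statistical language model for Chinese. 4. An improved trigram language model was proposed. Starting from linear interpolation as local smoothing for data sparsity, the model uses word similarity information and local POS information to estimate, respectively, the trigram and bigram local probabilities in the interpolated model, which further smooths the sparse statistics. 5. Considering long-distance semantic associations between words within a sentence, a word association (WA) language model over word combinations within the sentence space was studied and built. Combined with the short-range n-gram model, it can describe and reflect the self-organizing regularities of word order within the sentence space. 6. Building on the two models above, an information-integration language model was proposed that uses multiple linguistic information sources and knowledge sources to solve jointly for the word sequence. It is an integrated scheme with multiple information sources and multiple models, composed mainly of five relatively independent sub-models. Besides the improved trigram model and the WA model above, these are: a stochastic context-free phrase grammar (SCFG) model, which strengthens the matching of local word-phrase patterns and combines n-gram word connection information with local special grammar rules in a unified language representation; a long-order character string connection model, which supplements and raises the organization probability scores of high-confidence long character sequences; and a dynamic adaptive model, which dynamically adjusts the overall fit of the model in a specific language environment. Correspondingly, the seven classes of linguistic knowledge sources and information sources proposed in this thesis are incorporated, according to their specific function and degree of coupling, into the above …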
English abstract	With the development of speech recognition techniques, the study of language modeling is becoming more important, and is currently one of the most active topics in the field of speech and language processing. During three years of Ph.D. study, the author systematically investigated problems relevant to Chinese language modeling for speech recognition, and presents some original approaches that are effective for solving these problems and improving speech recognition accuracy. The main contributions of the author are as follows: 1. An engineering approach to corpus processing and language information acquisition, oriented toward speech recognition, was introduced. Some practical problems are also discussed. 2. Based on a corpus of the full text of "People's Daily" in 1993, with about 20,000,000 Chinese characters, we took the lead in setting up a word trigram Chinese language model in 1995. It has been applied successfully in our Chinese dictation system with 32K words and has drastically reduced the error rate. 3. An improved trigram model with word similarity information and local POS knowledge was presented for further data smoothing. 4. Taking into account the drawbacks of most current statistical language models, the author proposed a kind of integrated language modeling with multi-KS (knowledge sources and probability sources) in multiple models, including a baseline improved trigram model, a word association (WA) model, a practical stochastic phrase grammar (SCFG) model, and a dynamic adaptive model, among others. Preliminary experiments show that the integration of multi-KS is an effective strategy for the self-organization of linguistic units. 5. Some detailed problems in language processing and modeling were also investigated systematically. 6. A more general and flexible self-organizing language modeling approach was tentatively analyzed. This dissertation is a survey of the author's work. It consists of eleven chapters.
Chapter 1 reviews the development of, and representative work in, the fields of natural language processing and speech recognition. Chapter 2 describes the current status and main approaches of language modeling. Chapters 3 and 4, following the basic process of language information acquisition from a corpus, address some special problems and their corresponding solutions, which include: a set of principles for Chinese word definition and segmentation used in speech recognition; an approach to finding new words and smoothing statistical errors; a novel algorithm for language information acquisition without the help of a prime dictionary; and a practical word class learning approach. Chapter 5 introduces Chinese interpolated trigram modeling and its application in our speaker-independent Chinese dictation system with a 32K vocabulary. Chapter 6
Language	Chinese
Other identifier	400
Content type	Dissertation
Source URL	[http://ir.ia.ac.cn/handle/173211/5670]
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	张树武. 汉语语言处理及语言模型研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所. 1997.
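
The abstracts describe a word trigram model smoothed by linear interpolation. As a rough illustration only, the following minimal Python sketch shows the general technique of interpolating trigram, bigram, and unigram relative frequencies. It is not the dissertation's implementation: the class name `InterpolatedTrigram`, the fixed weights in `lambdas`, the `<s>`/`</s>` padding symbols, and the toy sentences are all assumptions made for the example, and the word similarity and POS-based estimates of the improved model are not covered.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence boundary symbols


class InterpolatedTrigram:
    """Linearly interpolated trigram model (illustrative sketch only)."""

    def __init__(self, sentences, lambdas=(0.6, 0.3, 0.1)):
        # Assumed fixed weights for trigram, bigram, unigram; they sum to 1.
        # In practice such weights are usually tuned on held-out data.
        self.l3, self.l2, self.l1 = lambdas
        self.tri, self.bi, self.uni = Counter(), Counter(), Counter()
        self.ctx2, self.ctx1 = Counter(), Counter()  # history counts
        self.total = 0
        for words in sentences:
            padded = [BOS, BOS] + list(words) + [EOS]
            for i in range(2, len(padded)):
                w1, w2, w3 = padded[i - 2], padded[i - 1], padded[i]
                self.tri[(w1, w2, w3)] += 1
                self.bi[(w2, w3)] += 1
                self.uni[w3] += 1
                self.ctx2[(w1, w2)] += 1
                self.ctx1[w2] += 1
                self.total += 1

    def prob(self, w1, w2, w3):
        """Return the interpolated estimate of P(w3 | w1, w2)."""
        c2, c1 = self.ctx2[(w1, w2)], self.ctx1[w2]
        p3 = self.tri[(w1, w2, w3)] / c2 if c2 else 0.0
        p2 = self.bi[(w2, w3)] / c1 if c1 else 0.0
        p1 = self.uni[w3] / self.total if self.total else 0.0
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1


# Toy usage with pre-segmented sentences (made-up data, not from the corpus).
model = InterpolatedTrigram([["我", "爱", "北京"], ["我", "爱", "中国"]])
print(model.prob("我", "爱", "北京"))
```

Because the lower-order terms always contribute some probability mass, a trigram that never occurred in training still receives a nonzero score as long as its last word was seen, which is the data sparsity problem that interpolation is meant to ease.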