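Title: A Study of Modeling Units in Uyghur Statistical Language Models (维吾尔语统计语言模型中建模基元的研究)
Author: Zhang Xiaoyan (张小燕)
Degree: Master's
Defense date: 2011-05-30
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Supervisors: Wang Lei (王磊); Tang Xinyu (唐新余)
Keywords: Uyghur; language model; perplexity; modeling unit; morpheme
Major: Computer Application Technology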
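Chinese abstract: A language model is a mathematical model that describes the inherent regularities of natural language, and it occupies an important position in natural language processing. Research on Uyghur language models is still at an early, exploratory stage, so building a reliable language model is critical for Uyghur natural language processing technology. The Uyghur language model is a cornerstone of that technology and is widely used in speech recognition, machine translation, information retrieval, and other fields; research on it is of great significance for advancing minority-language information processing in Xinjiang. To address the problems of current Uyghur language models, namely scarce corpus resources, data sparseness, and high perplexity, this thesis tries to identify the smoothing algorithm and modeling unit that yield the lowest perplexity. The specific work is as follows. To address data sparseness, several smoothing algorithms were studied: additive smoothing, Good-Turing smoothing, Witten-Bell smoothing, Katz smoothing, absolute discounting, and Kneser-Ney smoothing. The experimental results show that absolute discounting gives the lowest perplexity. The experimental data consist of transcripts of Uyghur spoken dialogues over telephone channels, textbooks from a bilingual teaching system, and some everyday expressions; after preprocessing, these data serve as the text corpus for building the Uyghur language models. The corpus was then segmented with two methods: dictionary-based Uyghur word segmentation and unsupervised morphological segmentation. The results show that the latter segments better than the former. Building on Uyghur segmentation, the traditional N-gram statistical language model was improved. Uyghur words were split into different units, three kinds of Uyghur language model were built with these units as modeling units, and an N-gram language model based on morpheme classes was proposed. The experiments used the SRILM 1.5.12 and MITLM 0.4 toolkits. The results show that the perplexity of the morpheme-based Uyghur language model is about two-thirds lower than that of the word-based model; in addition, the morpheme-based language model effectively reduces the dictionary vocabulary size and achieves better word coverage.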
English abstract: A language model is a mathematical model that describes the inherent regularities of natural language, and it occupies an important position in natural language processing. Research on Uyghur language models, however, is still at an early stage, so building a reliable language model is essential for Uyghur natural language processing. As a foundation of Uyghur natural language processing, the Uyghur language model is widely used in speech recognition, machine translation, information retrieval, and other fields, so further study of it is of great significance for the development of minority-language information processing in Xinjiang. To address the problems of current Uyghur language models, such as scarce corpus resources, data sparseness, and high perplexity, this dissertation attempts to find the smoothing method and modeling units that yield the lowest perplexity. The work is as follows. To address data sparseness, several smoothing methods were studied: additive smoothing, Good-Turing smoothing, Witten-Bell smoothing, Katz smoothing, absolute discounting, and Kneser-Ney smoothing. The experimental results show that absolute discounting gave the lowest perplexity. The experimental data were collected from transcripts of Uyghur spoken dialogues over telephone channels, textbooks from a bilingual teaching system, and some everyday Uyghur expressions. After preprocessing, these data served as the Uyghur text corpus. Two word segmentation methods were applied: dictionary-based Uyghur word segmentation and unsupervised morphological segmentation. The results show that the latter performed better than the former. Building on Uyghur segmentation, the traditional N-gram statistical language model was improved. Uyghur words were split into different units, three kinds of Uyghur language model were built with these units as modeling units, and an N-gram language model based on morpheme classes was proposed. Experiments were conducted with the SRILM 1.5.12 and MITLM 0.4 toolkits. The results showed that the perplexity of the morpheme-based Uyghur language model was far below that of the word-based model, a reduction of about two-thirds. Moreover, the morpheme-based language model effectively reduces the vocabulary size and achieves better word coverage.
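The evaluation described in the abstract (an N-gram model smoothed by absolute discounting and scored by perplexity) can be illustrated with a minimal Python sketch. This is not the thesis's SRILM/MITLM setup. The class name `AbsDiscountBigram`, the discount value 0.75, the add-one unigram fallback, and the toy tokens are all assumptions made for illustration only.

```python
import math
from collections import Counter


class AbsDiscountBigram:
    """Bigram model with interpolated absolute discounting (illustrative only)."""

    def __init__(self, sentences, discount=0.75):
        self.d = discount          # assumed discount, not taken from the thesis
        self.context = Counter()   # c(h): how often h occurs as a history
        self.bigram = Counter()    # c(h, w)
        self.unigram = Counter()   # c(w) over predicted tokens
        for sent in sentences:
            tokens = ["<s>"] + list(sent) + ["</s>"]
            for h, w in zip(tokens, tokens[1:]):
                self.context[h] += 1
                self.bigram[(h, w)] += 1
                self.unigram[w] += 1
        # Number of distinct tokens seen after each history h.
        self.types_after = Counter(h for h, _ in self.bigram)
        self.total = sum(self.unigram.values())
        self.vocab_size = len(self.unigram) + 1  # +1 slot for unseen tokens

    def p_unigram(self, w):
        # Add-one smoothed unigram, used as the lower-order distribution.
        return (self.unigram[w] + 1) / (self.total + self.vocab_size)

    def prob(self, h, w):
        c_h = self.context[h]
        if c_h == 0:               # unseen history: fall back to the unigram
            return self.p_unigram(w)
        discounted = max(self.bigram[(h, w)] - self.d, 0.0) / c_h
        backoff_mass = self.d * self.types_after[h] / c_h
        return discounted + backoff_mass * self.p_unigram(w)

    def perplexity(self, sentences):
        log_sum, n = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] + list(sent) + ["</s>"]
            for h, w in zip(tokens, tokens[1:]):
                log_sum += math.log(self.prob(h, w))
                n += 1
        return math.exp(-log_sum / n)


# Hypothetical morpheme-like units (placeholders, not real Uyghur data).
toy_corpus = [
    ["stem1", "+suf1", "stem2", "+suf2"],
    ["stem1", "+suf2", "stem3", "+suf1"],
    ["stem2", "+suf1", "stem1", "+suf2"],
]
model = AbsDiscountBigram(toy_corpus)
print(model.perplexity(toy_corpus))
```

The same class can be trained once on word-level tokens and once on morpheme-level tokens to compare modeling units, with the caveat that per-token perplexities over different unit inventories are not directly comparable without normalizing to a common unit.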
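Content type: Thesis
Source URL: [http://ir.xjipc.cas.cn/handle/365002/4412]
Collection: Xinjiang Technical Institute of Physics and Chemistry_Multilingual Information Technology Research Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Recommended citation
GB/T 7714
Zhang Xiaoyan. A Study of Modeling Units in Uyghur Statistical Language Models [D]. Beijing. Graduate University of Chinese Academy of Sciences. 2011.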