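Title: Research on Text Representation in Uyghur Text Classification (维吾尔文文本分类中文本表示的研究)
Author: Dong Rui (董瑞)
Degree type: Master's
Defense date: 2012-04
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Advisor: Zhou Xi (周喜)
Keywords: imbalanced dataset; feature selection; text classification; Uyghur; text representation; chi-square test; inverse document frequency
Degree discipline: Computer Application Technology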
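Chinese abstract (translated): The growth of the Internet has caused the number of electronic text documents to increase rapidly, and automatic text classification is increasingly in demand. As an active topic in data mining, information retrieval, and machine learning, text classification has evolved from manual classification to classification performed automatically by computers. English and Chinese text classification have been studied extensively, are now fairly mature, and are already used in practice. Research on Uyghur text classification, by contrast, started relatively late and remains sparse, and no mature, stable method has yet been established for it. Text representation is a very important part of text classification; its purpose is to convert unstructured text documents into a form that computers can process and recognize. Text representation comprises text preprocessing, feature selection, and feature weight calculation. Starting from Uyghur text representation, this thesis studies in detail how each factor of the representation affects the final classification result. A comparative experiment on Uyghur text with and without stemming shows that classification accuracy is higher with stemming. For feature selection, as in text classification for other languages, the traditional methods CHI and IG perform similarly, and both achieve better classification accuracy than DF. For feature weighting, a comparison of weighting algorithms shows that TF*IDF performs better than the Boolean and TF methods. To address the imbalanced-dataset problem in Uyghur, a new feature selection method combining CHI and IDF, called CIDF, is proposed. Experiments show that this method outperforms the traditional feature selection methods on imbalanced datasets.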
English abstract: With the rapid development of the World Wide Web, the number of electronic text documents is growing quickly, and automatic text classification is becoming more and more important. As an active topic in data mining, information retrieval, machine learning, and related research areas, text classification has developed from manual classification to automatic classification by machine. Many researchers have worked on English and Chinese text classification, and their results have been put into practice. Uyghur text classification, by contrast, is still at an early stage, with far less research than for English and Chinese, and there is not yet a stable method for solving the Uyghur text classification problem. Text representation is a very important issue in text classification; it aims to convert unstructured text documents into a form that a computer can process. Text representation includes text preprocessing, feature selection, and feature weight calculation. In this thesis, the factors of Uyghur text representation are studied and their effects on the classification results are compared. A comparative experiment on stemmed and unstemmed Uyghur text shows that classification accuracy is higher with stemming. In the comparison of feature selection methods, Uyghur text classification behaves like that of other languages: the traditional methods CHI and IG perform comparably, and both perform better than DF. In the comparison of feature weighting methods, TF*IDF performs better than the Boolean and TF methods. For the Uyghur imbalanced-dataset problem, a new feature selection method that combines CHI and IDF, called CIDF, is proposed. Experiments show that this method outperforms the traditional feature selection methods on imbalanced datasets.
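The quantities named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of the chi-square (CHI) term-category statistic, IDF, and TF*IDF. The abstract does not state the CIDF formula, so the function `chi_times_idf` below is only a hypothetical stand-in (a plain product of CHI and IDF) and should not be read as the thesis's definition. All function and parameter names are assumptions made for illustration.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no evidence of association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def idf(n_docs, doc_freq):
    """Inverse document frequency, log(N / df), natural logarithm."""
    return math.log(n_docs / doc_freq)


def tf_idf(tf, n_docs, doc_freq):
    """TF*IDF weight of a term in one document."""
    return tf * idf(n_docs, doc_freq)


def chi_times_idf(a, b, c, d):
    """HYPOTHETICAL combination of CHI and IDF (not the thesis's CIDF).

    The term's document frequency is a + b; the corpus size is a + b + c + d.
    """
    n = a + b + c + d
    return chi_square(a, b, c, d) * idf(n, a + b)
```

Under this sketch, a term that appears in every document of one category and in no other document gets a high chi-square score, while a term spread evenly across categories scores zero. Multiplying by IDF additionally favors rarer terms, which is one plausible motivation for combining the two on imbalanced data.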
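Content type: Thesis
Source URL: [http://ir.xjipc.cas.cn/handle/365002/4372]
Collection: Xinjiang Technical Institute of Physics and Chemistry, Multilingual Information Technology Research Laboratory (新疆理化技术研究所_多语种信息技术研究室)
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences (中国科学院新疆理化技术研究所)
Recommended citation
GB/T 7714
Dong Rui. Research on Text Representation in Uyghur Text Classification [D]. Beijing. Graduate University of Chinese Academy of Sciences. 2012.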