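Title: 使用语言概念空间特征的文本分类研究 (Research on Text Categorization Using Conceptual Language Space Features)
Author: 张运良 (Zhang Yunliang)
Degree type: Doctoral
Defense date: 2007-06-06
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Place of conferral: Institute of Acoustics
Keywords: conceptual language space; HNC theory; text categorization; subject; author writing style; effect
Alternative title: The Study on Text Categorization Based on Features of Conceptual Language Space
Degree discipline: Signal and Information Processing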
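Abstract (translated from the Chinese): Categorization is an important foundational task in text processing. Subject-oriented text categorization can be used in processing electronic books and periodical resources; categorization oriented to an author's writing style can be used to detect forged works, confirm the authorship of lost works, and identify the authors of documents in judicial settings. Text categorization can also serve other applications such as information retrieval and improve their results. According to HNC, the conceptual language space is the universal part shared by all natural languages that exists in the human brain, and it is the basis of human communication. The features of the conceptual language space go beyond the surface forms of individual languages and reveal the deep network of conceptual associations in language. The aim of this dissertation is to explore ways of improving text categorization through theoretical analysis and experimental study of the use of conceptual language space features in text categorization. The dissertation combines theoretical exploration with practical verification. Its main content includes: analyzing the characteristics of the various kinds of features in the conceptual language space; selecting features that are both worth studying and practically feasible and applying them to text categorization; examining how these features perform in text categorization and analyzing the causes; and improving on shortcomings of existing algorithms, with emphasis on the principle of each improved algorithm, its test results, and the determination of the relevant parameters. The dissertation achieved the following results: (1) It proposes a text categorization strategy that combines conceptual language space features, which represent the deep level of semantics, with the vector space model. Classifiers built with this strategy performed well: MAFMmax (the maximum micro-averaged F-measure) reached 0.904 in subject-oriented categorization and 0.984 in categorization by author writing style. (2) It proposes a strategy for converting compound sentence category features into primitive sentence category features, which effectively reduces the dimension of the sentence category vector space while preserving as much compound sentence category information as possible. (3) Based on the non-uniform distribution of features in some texts, it proposes and implements a long-text splitting judgment algorithm, which improves classifier performance. (4) It proposes and implements three schemes of a multi-feature integration judgment algorithm, which improve categorization to varying degrees; it also proposes a feature selection strategy and gives a reference order table for feature selection (covering 13 kinds of conceptual language space features under the two different requirements of subject and author writing style). (5) It proposes and implements a flexible KNN algorithm, which improves categorization, and gives the specific conditions under which the algorithm applies. Using conceptual language space features and the related improved algorithms, the dissertation obtained good categorization results, and performance should improve further as the ability to analyze conceptual language space features grows and the related algorithms are refined.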
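The evaluation measure named in the abstract, the maximum micro-averaged F-measure (MAFMmax), can be illustrated with a minimal sketch. It uses the standard definition of micro-averaging (pool true positives, false positives, and false negatives over all classes, then compute F1). The abstract does not say what the maximum is taken over, so `max_micro_f` below assumes, hypothetically, one prediction list per parameter setting; the function names and the toy labels are illustrative and not taken from the dissertation.

```python
def micro_f_measure(true_labels, predicted_labels):
    """Micro-averaged F-measure: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    classes = set(true_labels) | set(predicted_labels)
    for c in classes:
        for t, p in zip(true_labels, predicted_labels):
            if p == c and t == c:
                tp += 1          # predicted c and it really is c
            elif p == c and t != c:
                fp += 1          # predicted c but it is another class
            elif p != c and t == c:
                fn += 1          # missed a document of class c
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def max_micro_f(true_labels, runs):
    """Best micro-averaged F-measure over several runs.

    `runs` maps a (hypothetical) parameter setting to its list of predictions.
    """
    return max(micro_f_measure(true_labels, predicted) for predicted in runs.values())


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    runs = {"k=1": ["a", "b", "b", "b"], "k=3": ["a", "a", "b", "a"]}
    print(max_micro_f(truth, runs))
```

When every document receives exactly one label, the pooled precision and recall coincide, so the micro-averaged F-measure equals plain accuracy.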
Abstract (English): Categorization is an important and basic task in text processing. Text categorization by subject benefits the processing of electronic books and periodicals. Text categorization by authorship can be used to detect forged works, attribute lost works to their authors, and identify document authors in judicial appraisal. Text categorization is also helpful in information retrieval and other applications. Conceptual language space is the universal part of all human languages and the basis of human communication. The features of conceptual language space break through the surface of language phenomena and reveal the network of associations among concepts. This dissertation aims to improve the effect of text categorization through theoretical and experimental study of conceptual language space features and improved categorization algorithms. The dissertation combines theoretical exploration with experiment and mainly covers the following aspects: the analysis of the characteristics of features in conceptual language space, mainly sentence category space and concept space; the effect of different features with the KNN algorithm and its causes; different text categorization algorithms for different application backgrounds; and the theory, experiments, and empirical parameters of the different improved algorithms. The main results of this dissertation are as follows: (1) It proposes a text categorization strategy that combines conceptual language space features with the vector space model, which leads to good categorization results. The MAFM (maximum micro-average F-measure) is 0.812 in text categorization by subject and 0.9 by authorship. (2) It proposes and implements a strategy for transforming compound sentence categories into primitive sentence categories, which effectively reduces the dimension of the sentence category vector space with limited loss of compound sentence category information. (3) Based on the non-uniformity of feature distribution in text, it proposes and implements a resolution (long-text splitting) judgment algorithm, which improves the effect of categorization to some extent. (4) It proposes and implements a multi-feature integration judgment algorithm with 3 schemes, which improve the categorization effect to varying degrees. It also proposes a strategy for choosing features for integration and gives an ordered list of 13 conceptual language space features for subject and authorship. (5) It proposes and implements the flexible KNN algorithm, which improves the effect of text categorization. The application constraints of this algorithm are also given. The use of conceptual language space features and the improved algorithms achieves good results in text categorization. Performance can be improved further by strengthening the analysis of conceptual language space and by developing better algorithms.
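The combination of a vector space model, KNN, and the splitting of long texts described in results (1) and (3) can be sketched roughly as follows. This is not the dissertation's algorithm: the abstract does not specify how a text is split or how the per-segment decisions are combined, so the sketch assumes fixed-size chunks and a simple majority vote. The feature codes (`"X"`, `"Y"`, `"Z"`, `"W"`), the class labels, and all function names are hypothetical placeholders, not HNC sentence category codes.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(features, training, k=3):
    """Label a feature sequence by similarity-weighted vote of its k nearest neighbours.

    `training` is a non-empty list of (feature_list, label) pairs.
    """
    query = Counter(features)
    scored = sorted(
        ((cosine(query, Counter(f)), label) for f, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


def classify_long_text(features, training, chunk_size=4, k=3):
    """Assumed scheme: split into fixed-size chunks, classify each, take a majority vote."""
    chunks = [features[i:i + chunk_size] for i in range(0, len(features), chunk_size)]
    chunks = chunks or [features]  # an empty text is treated as one empty chunk
    chunk_votes = Counter(knn_classify(chunk, training, k) for chunk in chunks)
    return chunk_votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        (["X", "X", "Y"], "sports"),
        (["X", "Y", "Y"], "sports"),
        (["Z", "Z", "W"], "finance"),
        (["Z", "W", "W"], "finance"),
    ]
    long_text = ["Z", "W", "Z", "W", "Z", "Z", "W", "W", "X", "Y", "X", "Y"]
    print(classify_long_text(long_text, training))
```

Voting per chunk lets a text whose features are unevenly distributed be judged by the class that dominates most of its segments, which is one plausible reading of why splitting helps; other splitting and combination rules are equally compatible with the abstract.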
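Language: Chinese
Release date: 2011-05-07
Pages: 134
Content type: Dissertation
Source URL: [http://159.226.59.140/handle/311008/220]
Collection: Institute of Acoustics_IOA Master's and Doctoral Dissertations_1981-2009 Master's and Doctoral Dissertations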
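Recommended citation (GB/T 7714):
张运良 (Zhang Yunliang). 使用语言概念空间特征的文本分类研究 [The Study on Text Categorization Based on Features of Conceptual Language Space][D]. Institute of Acoustics. Institute of Acoustics, Chinese Academy of Sciences. 2007.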