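Title: 基于概念知识关联的中文人名和机构名称识别
Author: Jia Ning (贾宁)
Degree type: Doctoral
Defense date: 2008-05-29
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Place of conferral: Institute of Acoustics
Keywords: HNC; person name recognition; organization name recognition; conceptual knowledge association
Alternative title: Chinese Person and Organization Entity Names Recognition Based on Conceptual Relationship Knowledge
Degree discipline: Signal and Information Processing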
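Chinese abstract: Recognizing named entities among out-of-vocabulary words is an important foundational problem in natural language processing; information retrieval, information extraction, question answering, and machine translation all place high demands on it. Named entities occur in large numbers in real corpora and take flexible forms, which makes them hard to process and valuable to study.

This dissertation centers on the recognition of person names and organization names within Chinese named entity recognition. It proposes recognition based on conceptual knowledge association, treating person and organization names as a special class of concepts. With HNC theory as the theoretical foundation and extended sentence category analysis as the engineering foundation, the approach takes the results of extended sentence category analysis and uses the conceptual associations among the semantic components of a sentence to determine which components contain person or organization name concepts. Those components are then decomposed to locate the person and organization name concepts more precisely, and the names are finally extracted from the located word strings. The main contributions and innovations are:

1. Proposed an algorithm for person and organization name recognition built on sentence category analysis and domain sentence categories. Working from semantics, the algorithm uses sentence category analysis and domain sentence category expressions to identify the semantic chunks that contain person and organization concepts. It then decomposes each chunk according to its internal structure to locate those concepts more precisely, and a recognition algorithm yields the person and organization names. Experiments show that the system locates person and organization concepts with good accuracy.
2. Studied in detail the semantic chunk association knowledge and the knowledge base of the sentence category space, designed semantic chunk association rules at the conceptual level and the lexical level, and built a rule base of semantic chunk association rules for person and organization concepts. Tests show that the rules identify semantic chunks containing the pp concept with an accuracy above 99%.
3. Proposed a method for establishing a mapping from domain knowledge to the sentence category space. The correspondence between sentence category expressions and domain sentence category expressions is either explicit or non-explicit. For explicit correspondence, the semantic chunks of domain sentence category expressions and those of the sentence category space are classified, and a mapping between the two is established. This mapping ties domain knowledge to the concrete language space, so that the expectations a domain sentence category holds for its semantic chunks can take effect.
4. Proposed the core composition principle for the BC (object-content) composite structure of generalized object semantic chunks without sentence degeneration, together with an algorithm for BC decomposition. This resolves the two key problems of BC decomposition: deciding whether a semantic chunk is a composite structure, and deciding which part is B and which is C. BC decomposition rules were established for all generalized objects of the basic sentence categories, so that BC decomposition can be carried out by computer.
5. Proposed an algorithm, based on sentence category analysis, for recovering ellipsis within the scope of a sentence group. The analysis focuses on ellipsis caused by partial sharing of semantic chunks, examines the different sharing cases, and gives processing rules for each. The algorithm resolves ellipsis formed by sharing a whole semantic chunk and also handles ellipsis formed by partial sharing well.

In summary, working within the framework of HNC theory and drawing on existing theory and knowledge, this dissertation proposes and implements a top-down method that proceeds from sentence category to semantic chunk, from semantic chunk to the chunk component that contains a naming concept, and from that component to the specific name. It also studies the problems that arise along the way at the level of domain sentence categories, sentence categories, and semantic chunks, and proposes solutions. The work explores existing HNC results in depth, supplements them usefully, and offers a new line of thinking for applying HNC theory in practice.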
English abstract: Entity name recognition is a basic problem in natural language processing. It is widely used in information extraction, information retrieval, question answering, and machine translation. Because entity names are numerous and varied in structure, their automatic recognition is a valuable research field.

This dissertation focuses on the recognition of person and organization names and presents a method based on conceptual relationship knowledge. Person and organization names are tags in the language space; their tag in the conceptual language space is the ‘pp’ tag. Sentence category knowledge and domain sentence category knowledge contain the relationships between semantic chunks and the anticipated concepts of semantic chunks. Using these two kinds of knowledge, the method extracts the semantic chunks that contain ‘pp’. After the structure of such a semantic chunk is analyzed, the position of ‘pp’ within it is found. A recognition algorithm then extracts person and organization names from the semantic chunk. The main contributions of this dissertation are as follows:

1. Presented a method for person and organization name recognition based on sentence category analysis and domain sentence categories. The method has three steps. First, semantic chunks that contain ‘pp’ are extracted using semantic chunk relationship rules. Second, the structure of the semantic chunks extracted in step 1 is analyzed, and the parts that contain ‘pp’ are extracted. Third, person and organization names are recognized from the parts found in step 2. The experiment shows that the method achieves a precision of more than 99% in extracting semantic chunks that contain the ‘pp’ concept.
2. Studied semantic relationship knowledge in the sentence category space and the HNC knowledge database. Designed semantic chunk relationship rules at the conceptual layer and the lexical layer, and built a database of semantic chunk relationship rules aimed at the ‘pp’ concept. The experiment shows that the semantic chunk relationship rules are effective for extracting semantic chunks that contain the ‘pp’ concept from sentences.
3. Established the mapping between the domain sentence category space and the sentence category space. The correspondence between the two spaces is either explicit or non-explicit. For explicit correspondence, the mapping is established by classifying the semantic chunks of domain sentence categories and of sentence categories. Through this mapping, the anticipation that a domain sentence category holds for its semantic chunks can be applied to the semantic chunks of sentence categories.
4. Presented the principle of object-content structure in GBK (generalized object semantic chunks) without sentence degeneration, and designed a method for object-content decomposition of such chunks. Object-content decomposition involves two pivotal problems: whether a GBK has an object-content structure, and which part is the object and which is the content. This dissertation resolved both problems and designed rules for all GBK in the basic sentence categories.
5. Studied ellipsis caused by semantic chunks shared between sentences, especially ellipsis of the ‘pp’ concept. Presented a method for ellipsis resolution that uses the relationships between sentences and analysis of semantic chunk structure. The experiment shows that the method resolves ellipsis caused by sharing of a full semantic chunk exactly, and resolves ellipsis caused by sharing of part of a semantic chunk effectively.

In summary, within the framework of HNC theory, this dissertation presents a method for person and organization name recognition based on sentence category analysis, domain sentence categories, and analysis of semantic chunk structure. It also studies solutions to several problems in HNC theory, such as object-content decomposition in GBK, semantic relationship knowledge at the conceptual and lexical layers, and resolution of ellipsis of the ‘pp’ concept. These studies strengthen the practicability of HNC theory and provide a new approach to applying it.
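
The three-step, top-down flow described in the abstract (select the chunks expected to contain ‘pp’, locate the ‘pp’ part inside each chunk, then extract the name) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the rule table `PP_ROLE_RULES`, the marker list `PP_MARKERS`, the role names, and the function names are hypothetical and are not taken from the dissertation or from any HNC knowledge base. The marker-based split in step 2 is a simple stand-in for the structural decomposition the abstract mentions, not the author's algorithm.

```python
from dataclasses import dataclass

# Hypothetical rule table: sentence category -> chunk roles expected to hold 'pp'.
PP_ROLE_RULES = {"speech_act": {"agent"}, "transfer": {"agent", "recipient"}}
# Hypothetical title words that mark where a 'pp' concept sits inside a chunk.
PP_MARKERS = {"教授", "主席", "经理"}


@dataclass
class Chunk:
    role: str          # role of the semantic chunk in the sentence category
    tokens: list[str]  # segmented words of the chunk


def select_pp_chunks(category: str, chunks: list[Chunk]) -> list[Chunk]:
    """Step 1: keep the chunks whose role is expected to contain 'pp'."""
    expected = PP_ROLE_RULES.get(category, set())
    return [c for c in chunks if c.role in expected]


def locate_pp_part(chunk: Chunk) -> list[str]:
    """Step 2: keep the tokens after a marker word, or those before it if none follow."""
    for i, tok in enumerate(chunk.tokens):
        if tok in PP_MARKERS:
            after = chunk.tokens[i + 1:]
            return after or chunk.tokens[:i]
    return chunk.tokens  # no marker found: keep the whole chunk as the candidate


def recognize(category: str, chunks: list[Chunk]) -> list[str]:
    """Step 3: join each located part into a candidate name string."""
    return ["".join(locate_pp_part(c)) for c in select_pp_chunks(category, chunks)]


sentence = [Chunk("agent", ["教授", "张三"]), Chunk("content", ["明天", "开会"])]
print(recognize("speech_act", sentence))  # ['张三']
```

The sketch assumes the sentence has already been segmented and split into role-labelled chunks; in the dissertation that input comes from sentence category analysis, which is not modelled here.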
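Language: Chinese
Release date: 2011-05-07
Pages: 128
Content type: Dissertation
Source URL: [http://159.226.59.140/handle/311008/310]
Collection: Institute of Acoustics_Institute of Acoustics Doctoral and Master's Theses_Doctoral and Master's Theses 1981-2009
Recommended citation
GB/T 7714
贾宁 (Jia Ning). 基于概念知识关联的中文人名和机构名称识别[D]. Institute of Acoustics. Institute of Acoustics, Chinese Academy of Sciences. 2008.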