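Title: Research on Semantic Retrieval Technology and Construction of a Uyghur Semantic Retrieval Model (语义检索技术研究及维吾尔文语义检索模型构建)
Author: Ma Bo (马博)
Degree type: Doctoral
Defense date: 2012-04
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Supervisor: Zhou Junlin (周俊林)
Keywords: Uyghur; semantic retrieval; semantic annotation; semantic similarity; query analysis
Degree discipline: Computer Application Technology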
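Chinese abstract (translated): The exponential growth of information on the Internet has made search engines the most widely used Internet application. As users demand more of retrieval results, search technology faces increasingly serious challenges. The development of the Semantic Web points to a new direction for improving search. The Semantic Web is the direction in which the Internet is evolving, and the semantic information carried by its documents provides a basis for intelligent data processing. Studying the key techniques of semantic retrieval and applying them to search engines can effectively improve retrieval, raising both the precision and the recall of the results.
Research on Uyghur semantic retrieval is still at an early stage. To address the lack of semantic information in current Uyghur search engines, a semantically enhanced Uyghur information retrieval model is proposed, based on Uyghur word formation and linguistic characteristics. The model consists of a knowledge base management module, a semantic annotation module, a semantic indexing module, a query analysis module, and a result ranking module. First, Uyghur words are stemmed and web page information is stored as triples to form a Uyghur web page knowledge base. Document content is then mapped to ontology concepts by computing the similarity between documents and ontology concepts and the similarity of the relations between concepts. The associations between semantic entities and web pages are stored as an inverted index, and the query target is analyzed by expanding the user's input and analyzing the relations between query words. Finally, results are ranked by computing the degree of match between the user query and the document content together with the relation similarity.
The research covers the following four areas and contributions:
(1) Semantic retrieval model framework. Traditional search models are not fully suited to semantic retrieval. A semantic retrieval framework based on the Semantic Web can improve the results of traditional search engines and lay a foundation for next-generation search engines built on the Semantic Web. Once documents have been semantically annotated against ontologies and other knowledge bases, a semantic search engine can query and reason over the semantic information they contain. Taking Uyghur grammar into account, this work studies a semantic retrieval framework suited to minority languages and scripts and builds a semantic search engine for the less widely used languages of the Xinjiang region.
(2) Document-to-ontology mapping. Mapping documents to ontologies is an important part of semantic retrieval: it represents documents in a machine-processable form and underlies semantic information processing and retrieval. The mapping is realized by studying the selection of feature words (concepts, attributes, and so on) in the document, the selection of concept and attribute pairs in the ontology together with their similarity computation, and the computation of similarity between instances and ontology concepts and attributes during mapping.
(3) Semantic similarity computation. The quality of the similarity measure determines the quality of the results a semantic search engine returns. Building on Uyghur grammatical features, this work proposes an unsupervised, context-based semantic similarity measure. For a group of words to be compared, their context information is extracted to form context vectors, and their similarity is computed with a feature similarity measure. The method avoids the heavy cost of manual annotation and also accounts for the effect of a web page's in-link and out-link information on the similarity computation.
(4) Query target analysis and result ranking. Query target analysis measures how similar the query keywords are to ontology concepts. Earlier methods mainly judged how well query words match ontology concepts; this work additionally considers the lexical relations among the query words and the semantic relations among the multiple ontology concepts they match. Because relations between concepts are already taken into account when web pages are annotated and the semantic index is built, the ranking of search results uses the associations between concepts as its basis. For words not covered by the knowledge base, TF/IDF is used as a supplement, and an adjustable ranking algorithm orders the results. The ranked results are then clustered: document snippets are taken as input, cluster labels are generated using the vector space model and singular value decomposition, and the results are grouped under a small number of meaningful labels, which improves the usability of the model.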
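As an illustration of item (3), the following is a minimal sketch of a context-vector similarity, not the thesis's actual measure. It assumes a fixed co-occurrence window and plain cosine similarity, and it leaves out the in-link and out-link weighting, whose form this record does not describe. The function names and the English placeholder tokens (standing in for stemmed Uyghur tokens) are hypothetical.

```python
from collections import Counter
from math import sqrt


def context_vector(target, sentences, window=2):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_similarity(w1, w2, sentences, window=2):
    """Compare two words by the contexts they occur in (no labeled data needed)."""
    return cosine(context_vector(w1, sentences, window),
                  context_vector(w2, sentences, window))


# Placeholder corpus: each sentence is a list of already-stemmed tokens.
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["a", "car", "needs", "fuel"],
]
sim_cat_dog = context_similarity("cat", "dog", sentences)
sim_cat_car = context_similarity("cat", "car", sentences)
```

Here "cat" and "dog" share two of their three context words, so they score higher than "cat" and "car", which share none. Any link-based adjustment would be applied on top of this score.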
English abstract: With the rapid growth of web information, the search engine has become the most widely used application on the Internet. Helping users find useful information has become an increasingly urgent problem for search technology. The development of the Semantic Web offers search engines a new approach. The Semantic Web is the direction in which the World Wide Web is heading; its resources carry enough structural information to make semantic data processing possible. Researching and applying semantic information retrieval technologies can effectively improve the precision and recall of search engines.
Research on Uyghur semantic information retrieval is still at an initial stage. To address the lack of semantic information in Uyghur search engines, a semantically enhanced Uyghur information retrieval model is proposed based on the characteristics of the Uyghur language. The model comprises a knowledge management module, a semantic annotation module, a semantic indexing module, a query analysis module, and a result ranking module. First, words are stemmed and web pages are represented in N-triple form to construct the Uyghur knowledge base; then the mapping between ontologies and web pages is established by computing concept similarity and relation similarity. A semantic inverted index stores the associations between semantic entities and web pages, and user queries are analyzed by expanding the query and analyzing the relations among the query terms. Finally, a ranking algorithm combines the benefits of keyword-based and semantic-based methods.
The work in this dissertation is divided into the following parts:
(1) Architecture of semantic information retrieval. The traditional IR model is not fully suitable for semantic information retrieval. An architecture based on the Semantic Web can improve search quality and provide the foundation for the next generation of search engines. After documents are semantically annotated against the knowledge base, a semantic search engine can query and reason over the semantic information they contain. Based on the characteristics of the Uyghur language, a semantic information retrieval model is constructed for minority languages in Xinjiang.
(2) Mapping between documents and ontologies. The mapping module is an important part of the semantic information retrieval model; it converts documents into a format that machines can process. The mapping is carried out through feature selection and semantic similarity computation.
(3) Semantic similarity metric. The effectiveness of the semantic similarity metric determines the quality of the search results. A new unsupervised, context-based semantic similarity metric for Uyghur is proposed that draws on the features of the language. For a given word pair, the context feature vectors are first constructed, and the similarity is then computed in combination with link information. The proposed metric is automatic and does not require any annotated knowledge resources.
(4) Query analysis and result ranking. The task of query analysis is to establish the mapping between query keywords and the ontology knowledge base. Earlier research mainly considered the mapping between keywords and ontologies; this work presents a method that combines lexical and semantic relationships to analyze the user's query. Because the semantic annotation already takes the relationships between concepts into account, the returned results can be ranked by semantic relation. For words not included in the knowledge base, TF/IDF is used as a supplement, and an adjustable method is proposed to rank the search results. The snippets of the returned results are then used as input, and the vector space model and singular value decomposition are used to generate cluster labels. The returned documents are clustered under these labels, which improves the usability of the model.
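The ranking step in item (4) can be pictured with a small sketch. It assumes that "adjustable" means a tunable weight between a semantic relation score and a TF/IDF keyword score; the record does not give the actual formula, so the weight `alpha`, the function names, and the precomputed `semantic_scores` are all hypothetical.

```python
from math import log


def tf_idf(term, doc_tokens, corpus):
    """Plain TF-IDF: relative term frequency times log(N / document frequency)."""
    if not doc_tokens:
        return 0.0
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0
    return tf * log(len(corpus) / df)


def rank(query_terms, corpus, semantic_scores, alpha=0.7):
    """Return document indices ordered by a weighted mix of the two scores.

    semantic_scores: hypothetical precomputed relation-based score per
    document index, assumed to lie in [0, 1]; missing documents score 0.
    alpha: weight on the semantic score; (1 - alpha) goes to TF-IDF.
    """
    scored = []
    for idx, tokens in enumerate(corpus):
        keyword = sum(tf_idf(t, tokens, corpus) for t in query_terms)
        combined = alpha * semantic_scores.get(idx, 0.0) + (1 - alpha) * keyword
        scored.append((combined, idx))
    # Highest score first; ties are broken by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


# Placeholder corpus of tokenized documents.
corpus = [
    ["uyghur", "search", "engine"],
    ["semantic", "web", "ontology"],
    ["uyghur", "ontology", "search", "uyghur"],
]
semantic_scores = {1: 0.9}  # only document 1 is covered by the knowledge base
semantic_only = rank(["uyghur"], corpus, semantic_scores, alpha=1.0)
keyword_only = rank(["uyghur"], corpus, semantic_scores, alpha=0.0)
```

Setting `alpha` to 1.0 ranks purely by the semantic score, while 0.0 falls back to TF/IDF alone, which is the behavior one would want for queries whose words the knowledge base does not cover.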
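Content type: Dissertation
Source URL: [http://ir.xjipc.cas.cn/handle/365002/4363]
Collection: Xinjiang Technical Institute of Physics and Chemistry_Multilingual Information Technology Research Laboratory
Author affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
Recommended citation
GB/T 7714
Ma Bo. Research on Semantic Retrieval Technology and Construction of a Uyghur Semantic Retrieval Model [D]. Beijing. Graduate University of Chinese Academy of Sciences. 2012.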