Searching for historical events on a large-scale web archive
Huang, Lian'en ; Lin, Wu ; Li, Xiaoming
2010
English abstract: Finding knowledge on the Web has long been an active research topic. Today the Web has become a popular medium for publishing news and opinion articles, which are important carriers of human knowledge, especially social knowledge. Techniques for automatically collecting and analysing these articles on a large scale are therefore desirable. In this paper we propose techniques for searching for events on the Web, and we have tested them on a large-scale web archive. Given an event, or a news topic that many people care about, the goal is to find nearly all news stories related to it. First, we propose a novel domain-independent approach to extracting news stories from web pages, which is based on anchor text and is applicable to most websites. Experiments show that our approach performs well and outperforms another approach we have found. Second, we propose a domain-based method of representing events, in which hundreds of keywords represent an event and compose the query expression. This retrieval setting differs from that of most search engines in that the number of keywords is large. We then propose several retrieval algorithms based on BM25 for this method. Evaluation shows that these algorithms perform better than unmodified BM25 in our setting, and the best one is chosen as the algorithm of our system. Finally, an experimental system has been built on a collection of 2 billion web pages, and its running performance is reported, which shows the effectiveness of our approaches. © 2010 IEEE. Indexed in: EI.
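Illustrative note: the abstract names unmodified BM25 as the baseline but does not describe the modified algorithms. The sketch below shows only standard Okapi BM25 scoring with a multi-keyword query. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with standard Okapi BM25.

    This is the unmodified baseline only; the paper's variants are not shown.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query_terms):
            if term not in tf:
                continue
            # Smoothed IDF variant, which is never negative (an assumption here).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy data: an event represented by several keywords.
docs = [
    ["earthquake", "rescue", "sichuan", "relief"],
    ["football", "match", "result"],
    ["earthquake", "aftershock", "report"],
]
query = ["earthquake", "rescue", "relief", "aftershock"]
scores = bm25_scores(query, docs)
```

With a query of hundreds of keywords, as in the setting the abstract describes, the inner loop runs over many more terms, which is the situation the abstract says differs from typical search engine queries.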
Language: English
DOI: 10.1109/SKG.2010.37
Content type: Other
Source URL: http://ir.pku.edu.cn/handle/20.500.11897/329612
Collection: School of Information Science and Technology
Recommended citation
GB/T 7714
Huang, Lian'en, Lin, Wu, Li, Xiaoming, et al. Searching for historical events on a large-scale web archive. 2010-01-01.