Achieving both high precision and high recall in near-duplicate detection
Huang, Lian; Wang, Lei; Li, Xiaoming
2008
Abstract: To find near-duplicate documents, fingerprint-based paradigms such as Broder's shingling and Charikar's simhash algorithms have been recognized as effective approaches and are considered the state-of-the-art. Nevertheless, we see two aspects of these approaches which may be improved. First, a high score under these algorithms' similarity measurement implies a high probability of similarity between documents, which is different from the documents actually being highly similar; but how similar two documents are is what we really need to know. Second, there has to be a tradeoff between hash-code length and hash-code multiplicity in fingerprint paradigms, which makes it hard to maintain a satisfactory recall level while improving precision. In this paper our contributions are twofold. First, we propose a framework for implementing the longest common subsequence (LCS) as a similarity measurement in reasonable computing time, which leads to both high precision and high recall. Second, we present an algorithm to extract a trustable partition from the LCS to reduce the negative impact of templates used in web page design. A comprehensive experiment was conducted to evaluate our method in terms of its effectiveness, efficiency, and quality of result. More specifically, the method has been successfully used to partition a set of 430 million web pages into 68 million subsets of similar pages, which demonstrates its effectiveness. For quality, we compared our method with simhash and a Cosine-based method through a sampling process (Cosine is compared to LCS as an alternative similarity measurement). The results showed that our algorithm reached an overall precision of 0.95, while simhash reached 0.71 and Cosine 0.82. At the same time, our method obtains 1.86 times as much recall as simhash and 1.56 times as much recall as Cosine. A comparison experiment was also done for documents within the same web sites.
In that setting, our algorithm, simhash, and Cosine find almost the same number of true positives, at precisions of 0.91, 0.50, and 0.63 respectively. In terms of efficiency, our algorithm takes 118 hours to process the whole archive of 430 million topic-type pages on a cluster of six Linux boxes, while simhash and Cosine take 94 hours and 68 hours respectively. When the word segmentation needed for languages such as Chinese is taken into account, the processing time of Cosine grows substantially; in our experiment it was 602 hours. Copyright 2008 ACM.
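The LCS-based similarity measurement at the heart of the paper can be illustrated with a minimal sketch. This is not the authors' implementation (which adds a framework to keep computing time reasonable at web scale); it only shows the underlying measure, assuming documents are treated as token sequences and similarity is the LCS length normalized by the longer document's token count.

```python
def lcs_length(a, b):
    """Classic dynamic-programming LCS length in O(len(a) * len(b)) time,
    using a single rolling row for O(len(b)) space."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # holds dp[j-1] from the previous row
        for j, y in enumerate(b, start=1):
            cur = dp[j]  # dp[j] from the previous row, saved before overwrite
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def lcs_similarity(doc_a, doc_b):
    """Similarity in [0, 1]: LCS length over the longer token count.
    Tokenization by whitespace split is an assumption for this sketch."""
    a, b = doc_a.split(), doc_b.split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

Unlike a fingerprint comparison, this returns the actual degree of overlap between two documents rather than a probability that they are similar, which is the distinction the abstract draws.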
Language: English
DOI: 10.1145/1458082.1458094
Content type: Other
Source URL: http://ir.pku.edu.cn/handle/20.500.11897/327547
Collection: 信息科学技术学院 (School of Electronics Engineering and Computer Science)
Recommended citation (GB/T 7714):
Huang, Lian, Wang, Lei, Li, Xiaoming, et al. Achieving both high precision and high recall in near-duplicate detection. 2008-01-01.
 
