Pass-Join-K: Similarity Join Method Based on Multi-Match Partition
余海洋 ; 林琛 ; 陈珂 ; 江弋 ; 邹权
2013
Keywords: edit distance; similarity join; multi-match; data cleaning; Pass-Join-K
Abstract: Similarity join is a basic operation in data cleaning and has attracted considerable attention from the database community. This paper studies edit-distance-based similarity join, which finds all pairs of strings from two string sets whose edit distance is less than a given threshold, and proposes Pass-Join-K, an improved algorithm built on Pass-Join. Pass-Join-K performs well on both short and long strings. Its main idea is to reuse the partition principle of Pass-Join, divide the query string into more segments, and select candidate pairs more strictly by requiring multiple segment matches. Experimental results show that Pass-Join-K reduces the number of candidate pairs and runs 2 to 5 times faster on real datasets than the original Pass-Join algorithm, which itself outperforms state-of-the-art methods.
Funding: National Natural Science Foundation of China (Nos. 61102136, 61001013); Natural Science Foundation of Fujian Province (No. 2011J05158); Shenzhen Science and Technology Innovation Basic Research (No. JCYJ20120618155655087)
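The multi-match idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the record gives no pseudocode, so the details below are assumptions. In particular, the sketch assumes each string is split into `tau + k` segments and a pair survives only if at least `k` segments of one string occur in the other within a position window of `tau`. It treats "within the threshold" as edit distance at most `tau`, and it uses a plain nested loop instead of the inverted index a real join would need. The names `even_partition`, `count_matching_segments` and `similarity_join` are hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def even_partition(s: str, parts: int) -> List[Tuple[int, str]]:
    """Split s into `parts` contiguous segments of near-equal length.

    Returns (start_position, segment) pairs; the longer segments come last.
    """
    base, extra = divmod(len(s), parts)
    segments, pos = [], 0
    for i in range(parts):
        length = base + (1 if i >= parts - extra else 0)
        segments.append((pos, s[pos:pos + length]))
        pos += length
    return segments


def count_matching_segments(r: str, s: str, tau: int, k: int) -> int:
    """Count segments of r (out of tau + k) found in s near the same position.

    Assumption: tau edits can touch at most tau segments, so a pair within
    distance tau keeps at least k segments intact, each shifted by at most tau.
    """
    matches = 0
    for start, seg in even_partition(r, tau + k):
        lo = max(0, start - tau)
        hi = min(len(s) - len(seg), start + tau)
        if any(s[p:p + len(seg)] == seg for p in range(lo, hi + 1)):
            matches += 1
    return matches


def similarity_join(R: List[str], S: List[str], tau: int, k: int = 2):
    """Return (similar pairs, number of candidate pairs that reached verification)."""
    results, candidates = [], 0
    for r in R:
        for s in S:
            if abs(len(r) - len(s)) > tau:               # length filter
                continue
            if count_matching_segments(r, s, tau, k) < k:  # multi-match filter
                continue
            candidates += 1
            if edit_distance(r, s) <= tau:               # final verification
                results.append((r, s))
    return results, candidates


if __name__ == "__main__":
    names = ["venkatesh", "vankatesh", "avataresha", "kaushik chakrabarti"]
    pairs, n_candidates = similarity_join(names, names, tau=2, k=2)
    print(pairs, n_candidates)
```

With `k=1` the filter reduces to the familiar Pass-Join condition of `tau + 1` segments with at least one match. Larger `k` trades more segment lookups for a stricter candidate test, which is the trade-off the abstract describes.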
Language: zh_CN
Content type: Journal article
Source URL: [http://dspace.xmu.edu.cn/handle/2288/106749]
Collection: Chemistry and Chemical Engineering - Published Papers
Recommended citation
GB/T 7714: 余海洋, 林琛, 陈珂, et al. Pass-Join-K: Similarity Join Method Based on Multi-Match Partition[J], 2013.
APA: 余海洋, 林琛, 陈珂, 江弋, & 邹权. (2013). Pass-Join-K: Similarity Join Method Based on Multi-Match Partition.
MLA: 余海洋, et al. "Pass-Join-K: Similarity Join Method Based on Multi-Match Partition." (2013).