Classifying Digital Resources in a Practical and Coherent Way with Easy-to-Get Features
Chen, Chong ; Yan, Hongfei ; Li, Xiaoming
2008
Keywords: Digital resource classification; feature; probability estimation
Abstract: With a rich variety of forms and types, digital resources are complex data objects. They grow fast in volume on the Web but are hard to classify efficiently. The paper presents a practical classification solution using features from the file names and extensions of digital resources. These features are easy to obtain and common to all resources, but they are generally low-frequency and sparse, which implies that a statistical approach may not work well. Our solution combines a Naive Bayes (NB) classifier with Simple Good-Turing (SGT) probability estimation, which shows great promise under these conditions, with a total accuracy of 80%. In our opinion, the results arise because 1) the features fit NB's conditional independence hypothesis well; 2) the abundant one-time-occurrence features lead to reasonable probability estimates for unobserved features, which also means that a general feature selection strategy is not needed in this case. A 7.4TB digital resource collection, CDAL, is used to train and evaluate the model.
http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000261730100016&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=8e1609b174ce4e31116a60747a720701 ; Computer Science, Artificial Intelligence; Computer Science, Information Systems; EI; CPCI-S(ISTP); 1
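The idea in the abstract, a Naive Bayes classifier over file-name and extension features with Good-Turing-style mass reserved for unseen features, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simplified Good-Turing estimate (unseen mass = share of one-time-occurrence features) rather than the full Simple Good-Turing smoothing, and the names `extract_features` and `GoodTuringNB`, the 0.5 cap, the ASCII-only tokenizer, and the toy file names are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(filename):
    """Split a file name into lowercase ASCII tokens and tag the extension."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension present
        stem, ext = filename, ""
    tokens = [t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t]
    if ext:
        tokens.append("ext:" + ext.lower())
    return tokens


class GoodTuringNB:
    """Naive Bayes with a simplified Good-Turing estimate for unseen features."""

    def fit(self, filenames, labels):
        self.class_docs = Counter(labels)
        self.counts = defaultdict(Counter)
        for name, label in zip(filenames, labels):
            self.counts[label].update(extract_features(name))
        self.vocab = {f for c in self.counts.values() for f in c}
        self.totals, self.p0 = {}, {}
        for label, c in self.counts.items():
            n = sum(c.values())
            n1 = sum(1 for v in c.values() if v == 1)  # one-time features
            self.totals[label] = n
            if n == 0:
                self.p0[label] = 1.0  # class with no features at all
            elif n1 == 0:
                self.p0[label] = 1.0 / (n + 1)  # assumed small fallback mass
            else:
                # Good-Turing unseen mass N1/N, capped (an assumption) so that
                # seen features keep probability on tiny, singleton-heavy data.
                self.p0[label] = min(n1 / n, 0.5)
        return self

    def _log_prob(self, feature, label):
        c, p0 = self.counts[label], self.p0[label]
        if feature in c:
            # Seen feature: relative frequency scaled by the remaining mass.
            return math.log((1.0 - p0) * c[feature] / self.totals[label])
        # Unseen feature: share p0 among types this class has not seen,
        # plus one extra slot for tokens absent from the whole training set.
        unseen_types = len(self.vocab) - len(c) + 1
        return math.log(p0 / unseen_types)

    def predict(self, filename):
        total_docs = sum(self.class_docs.values())
        scores = {}
        for label, docs in self.class_docs.items():
            score = math.log(docs / total_docs)  # class prior
            for f in extract_features(filename):
                score += self._log_prob(f, label)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data (invented for this example, not taken from CDAL).
train = [
    ("beethoven_symphony_05.mp3", "audio"),
    ("jazz_live_session.mp3", "audio"),
    ("piano_symphony.wav", "audio"),
    ("lecture_notes_week1.pdf", "document"),
    ("thesis_draft_v2.pdf", "document"),
    ("lecture_notes_week2.doc", "document"),
]
model = GoodTuringNB().fit([n for n, _ in train], [l for _, l in train])
print(model.predict("mozart_symphony.mp3"))
print(model.predict("week3_notes.pdf"))
```

Because most features in such data occur only once, the unseen mass is large, which is why a previously unseen token such as `mozart` does not zero out a class score. A faithful reproduction would replace the capped N1/N estimate with the full Simple Good-Turing procedure.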
Language: English
DOI: 10.1007/978-3-540-89447-6_18
Content type: Other
Source URL: http://ir.pku.edu.cn/handle/20.500.11897/406849
Collection: School of Electronics Engineering and Computer Science
Recommended citation
GB/T 7714
Chen, Chong,Yan, Hongfei,Li, Xiaoming. Classifying Digital Resources in a Practical and Coherent Way with Easy-to-Get Features. 2008-01-01.