Enhancing domain portability of Chinese segmentation model using chi-square statistics and bootstrapping
Chang, Baobao ; Han, Dongxu
2010
English abstract: Almost all Chinese language processing tasks begin with word segmentation of the input, so robust and reliable segmentation techniques are required for those tasks to perform well. In recent years, machine learning and sequence labeling models such as Conditional Random Fields (CRFs) have often been used to segment Chinese text. Compared with traditional lexicon-driven models, machine-learned models achieve higher F-measure scores, but they depend heavily on their training material. Although they process text from the same domain as the training data effectively, they perform relatively poorly on text from new domains. In this paper, we propose using χ² statistics when training an SVM-HMM based segmentation model to improve its ability to recall OOV words, and then using bootstrapping strategies to maintain its ability to recall IV words. Experiments show that the proposed approach enhances the domain portability of the Chinese word segmentation model and prevents a drastic decline in performance when processing texts across domains. © 2010 Association for Computational Linguistics.; EI; 0
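Language: English
Content type: Other
Source URL: [http://ir.pku.edu.cn/handle/20.500.11897/329586]
Collection: 信息科学技术学院 (School of Electronics Engineering and Computer Science)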
Recommended citation
GB/T 7714
Chang, Baobao, Han, Dongxu. Enhancing domain portability of Chinese segmentation model using chi-square statistics and bootstrapping. 2010-01-01.