Morpheme-based Method for Japanese Word Segmentation and Its Application in OCR (基于词素的日文分词方法及其在OCR系统中的应用)


	Jin Chunshi (金春实) ; Ding Xiaoqing (丁晓青) ; Peng Liangrui (彭良瑞) ; Liu Changsong (刘长松)
	2010-06-09
Keywords	word segmentation ; morpheme ; Japanese ; inflection (word-ending change) ; OCR error detection ; TP391.1
Alternative title	Morpheme-based Method for Japanese Word Segmentation and Its Application in OCR
Abstract	In large-scale document entry systems based on OCR, automatic error detection can greatly reduce the cost of manual proofreading. In error detection for Japanese OCR, automatic word segmentation is difficult because Japanese verbs, adjectives, and quasi-adjectives (形容动词) change form at the word ending. This paper proposes a morpheme-based Japanese word segmentation method: the segmentation lexicon uses the morpheme as its basic unit, and inflected words are split out of the text by maximum-length-first entry matching. This avoids the oversized lexicon that traditional Japanese segmentation produces when it lists every inflected form of every word. In experiments, the method reaches an average segmentation accuracy of 99.0% on the tested corpus. When it is applied in the OCR error-detection module with the rejection rate (the proportion of all characters that the module flags as suspicious) held at 1/5, the miss rate (residual error rate) on the test set is 0.05%, which indicates that the method is effective. Supported by the National Natural Science Foundation of China (Grant No. 60472002).
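The idea in the abstract can be illustrated with a minimal sketch. This record does not give the paper's lexicon format or matching details, so everything below is an assumption made for illustration: the toy entries in `STEMS`, `ENDINGS`, and `WORDS`, the inflection class names, and the function names `longest_match`, `segment`, and `flagged_ratio` are all hypothetical. The sketch stores each inflecting word once as a stem plus an inflection class, tries stem and ending combinations at each position, and keeps the longest match; characters that match nothing are flagged as suspicious, which is one simple way an error-detection module could use the segmenter.

```python
# Toy lexicon (hypothetical entries, not the paper's data). Inflecting words
# are stored once, as a stem plus an inflection class, instead of once per
# inflected form.
STEMS = {"書": "verb-k", "高": "adj-i", "静か": "adj-na"}
ENDINGS = {
    "verb-k": ["か", "き", "く", "け", "こ", "いた", "いて"],
    "adj-i": ["い", "く", "かった", "ければ"],
    "adj-na": ["だ", "な", "に", "で"],
}
WORDS = {"本", "を", "山", "が", "は", "部屋"}  # non-inflecting entries


def longest_match(text, i):
    """Return the longest lexicon match starting at index i, or None."""
    candidates = [w for w in WORDS if text.startswith(w, i)]
    for stem, cls in STEMS.items():
        if text.startswith(stem, i):
            for ending in ENDINGS[cls]:
                if text.startswith(ending, i + len(stem)):
                    candidates.append(stem + ending)
    return max(candidates, key=len) if candidates else None


def segment(text):
    """Split text into (surface, matched) pairs, scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        word = longest_match(text, i)
        if word is None:
            # Assumption: an unmatched character is treated as suspicious.
            tokens.append((text[i], False))
            i += 1
        else:
            tokens.append((word, True))
            i += len(word)
    return tokens


def flagged_ratio(tokens):
    """Share of characters that fall in unmatched (suspicious) tokens."""
    total = sum(len(s) for s, _ in tokens)
    flagged = sum(len(s) for s, ok in tokens if not ok)
    return flagged / total if total else 0.0
```

With this toy lexicon, `segment("本を書いた")` yields the three matched tokens 本, を, and 書いた, even though 書いた itself is not a lexicon entry. If the character 書 is replaced by an out-of-lexicon character such as 害, that character and the two after it are left unmatched, and `flagged_ratio` reports 3 of 5 characters as suspicious.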
Language	Chinese
Content type	Journal article
Source URL	[http://hdl.handle.net/123456789/54885]
Collection	Tsinghua University
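Recommended citation GB/T 7714	Jin Chunshi, Ding Xiaoqing, Peng Liangrui, et al. Morpheme-based Method for Japanese Word Segmentation and Its Application in OCR[J], 2010.
APA	Jin, C., Ding, X., Peng, L., & Liu, C. (2010). Morpheme-based Method for Japanese Word Segmentation and Its Application in OCR.
MLA	Jin, Chunshi, et al. "Morpheme-based Method for Japanese Word Segmentation and Its Application in OCR." (2010).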
