Statistical Estimation for Ultimate Entropy of Chinese Characters (基于统计的汉字极限熵估测)


	SUN Fan (孙帆) ; SUN Mao-Song (孙茂松)
	2010-07-15
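Conference	Frontier Advances in Chinese Information Processing: Proceedings of the 25th Anniversary Academic Conference of the Chinese Information Processing Society of China ; 25th Anniversary Academic Conference of the Chinese Information Processing Society of China ; Beijing, China ; CNKI ; Chinese Information Processing Society of China
Keywords	ultimate entropy ; language model ; n-gram ; smoothing ; linear interpolation ; TP391.1
Alternative title	Statistical Estimation for Ultimate Entropy of Chinese Characters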
Abstract	The ultimate entropy of a writing system is the average amount of information carried by each character once contextual information is fully taken into account. This paper uses two statistical methods to estimate the ultimate entropy of Chinese characters. The first computes the n-order entropy of characters in order to approach the ultimate entropy. The second builds a word-based statistical language model and computes the cross-entropy between that model and a balanced test corpus, which gives an estimate of an upper bound on the ultimate entropy. Comparing the two methods experimentally, we conclude that the word-based language-model estimate is more accurate than the character-based direct computation, with a best result of 5.31 bits. Several smoothing techniques are also applied to the models, and their performance is compared.
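Proceedings publisher	Tsinghua University Press
Language	Chinese
Content type	Conference paper
Source URL	[http://hdl.handle.net/123456789/70078]
Collection	Tsinghua University
Recommended citation (GB/T 7714)	SUN Fan, SUN Mao-Song. Statistical Estimation for Ultimate Entropy of Chinese Characters[C]. In: Frontier Advances in Chinese Information Processing: Proceedings of the 25th Anniversary Academic Conference of the Chinese Information Processing Society of China, 25th Anniversary Academic Conference of the Chinese Information Processing Society of China, Beijing, China, CNKI, Chinese Information Processing Society of China.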
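
The two estimation approaches named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a character-level bigram model (the paper's model is word-based), add-alpha smoothing of the unigram term, and a fixed interpolation weight. The function names and the parameters `lam`, `alpha`, and `vocab_size` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def n_order_entropy(text, n):
    """Empirical entropy (bits) of a character given its n-1 predecessors."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Context counts are derived from the n-grams so the ratio is a valid
    # conditional probability (the result is never negative).
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    total = sum(grams.values())
    return -sum(
        count / total * math.log2(count / contexts[gram[:-1]])
        for gram, count in grams.items()
    )


def interpolated_bigram(train, vocab_size, lam=0.7, alpha=1.0):
    """Return prob(prev, cur) for a linearly interpolated bigram model.

    vocab_size is an assumed alphabet size; lam weights the bigram term.
    """
    unigrams = Counter(train)
    bigrams = Counter(train[i:i + 2] for i in range(len(train) - 1))
    contexts = Counter(train[:-1])
    total = len(train)

    def prob(prev, cur):
        # Add-alpha smoothed unigram probability (never zero).
        p_uni = (unigrams[cur] + alpha) / (total + alpha * vocab_size)
        if contexts[prev] == 0:
            return p_uni  # unseen context: fall back to the unigram term
        p_bi = bigrams[prev + cur] / contexts[prev]
        return lam * p_bi + (1 - lam) * p_uni

    return prob


def cross_entropy(prob, test):
    """Average negative log2 probability per predicted character of test."""
    logs = [math.log2(prob(prev, cur)) for prev, cur in zip(test, test[1:])]
    return -sum(logs) / len(logs)


if __name__ == "__main__":
    sample = "abababab"
    print(n_order_entropy(sample, 1), n_order_entropy(sample, 2))
    model = interpolated_bigram(sample, vocab_size=2, lam=0.5)
    print(cross_entropy(model, "abab"))
```

On real data, the first function would be run for increasing n on a large corpus, and the cross-entropy would be measured on held-out text. The cross-entropy of a properly normalized model on test data upper-bounds the entropy of the source in expectation, which is why the abstract describes the second method as giving an upper-bound estimate.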
