Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction


	Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction; original title: 基于网页格式信息量的博客文章和评论抽取模型
	Cao Donglin (曹冬林) ; Liao Xiangwen (廖祥文) ; Xu Hongbo (许洪波) ; Bai Shuo (白硕)
	2009
Keywords	blog information extraction; minimal main text subtree; effective information ratio; Web format information; vision information; information quantity of separate position
Abstract	From an information-theoretic perspective, this paper proposes a model for extracting blog posts and comments based on Web format information quantity. First, the visual position information of the blog Web page is combined with its effective text information to locate the main text, which represents the theme of the page. Second, the format information of the page is used as the information unit: the format information quantity of each block is computed, and the minimal information quantity over the candidate separating positions is used to detect the boundary between the post and the comments in the main text. Because the model is language independent, it can be applied to blogs written in different natural languages. Experimental results show that the model achieves high precision both in locating the main text and in separating posts from comments.; Funding: National Basic Research Program of China (973), Nos. 2004CB318109, 2007CB311100; National High Technology Research and Development Program of China (863), No. 2007AA01Z441
Language	zh_CN
Content type	Journal article
Source URL	[http://dspace.xmu.edu.cn/handle/2288/122474]
Collection	Information Technology - Published Papers
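Recommended citation (GB/T 7714)	Cao Donglin, Liao Xiangwen, Xu Hongbo, et al. Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction[J], 2009.
APA	Cao Donglin, Liao Xiangwen, Xu Hongbo, & Bai Shuo. (2009). Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction.
MLA	Cao Donglin, et al. "Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction". (2009).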
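
Illustrative sketch	The abstract describes splitting the main text into post and comments by computing a format information quantity per block and choosing the separating position with the minimal value, but this record does not give the formulas. The Python sketch below is therefore only one possible reading, under assumptions that are not from the paper: each block is reduced to a list of HTML tag names, the information quantity of a region is its tag count times the Shannon entropy of its tag distribution, and the cost of a separating position is the sum over the two sides. The names `entropy_bits`, `split_cost`, and `best_split` are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(tags):
    """Shannon entropy, in bits per tag, of a list of format tags."""
    total = len(tags)
    if total == 0:
        return 0.0
    counts = Counter(tags)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def split_cost(blocks, k):
    """Assumed cost of separating the blocks before index k.

    Each side contributes (number of tags) * (entropy of its tags), so a
    position that leaves both sides uniform in format is cheap.
    This scoring is an assumption, not the formula from the paper.
    """
    left = [tag for block in blocks[:k] for tag in block]
    right = [tag for block in blocks[k:] for tag in block]
    return len(left) * entropy_bits(left) + len(right) * entropy_bits(right)


def best_split(blocks):
    """Return the index k (1 <= k < len(blocks)) with the minimal cost."""
    if len(blocks) < 2:
        raise ValueError("need at least two blocks to place a boundary")
    return min(range(1, len(blocks)), key=lambda k: split_cost(blocks, k))


# Toy main text: two post blocks made of <p>, then two comment blocks of <li>.
toy_blocks = [["p", "p"], ["p"], ["li", "li"], ["li"]]
print(best_split(toy_blocks))  # -> 2
```

Because the sketch only looks at tag names and never at the text itself, it is language independent in the same sense the abstract claims for the model; the scoring function is the part most likely to differ from the published method.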
