Methods for Unconstrained Offline Handwritten Chinese Document Analysis (脱机手写中文文档分析方法研究)

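Title	脱机手写中文文档分析方法研究
Author	殷飞 (Yin Fei)
Degree type	Doctor of Engineering
Defense date	2010-06-01
Degree-granting institution	Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral	Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor	刘成林 (Liu Chenglin)
Keywords	offline handwritten document analysis; text line segmentation; variational Bayes Gaussian mixture model; character string alignment; geometric context
Other title	Methods for Unconstrained Offline Handwritten Chinese Document Analysis
Degree major	Pattern Recognition and Intelligent Systems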
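Abstract (translated from Chinese)	With the rapid improvement of personal computer performance and the emergence of various digitization devices, society is in a period of transition from paper documents to electronic documents. Digitizing large volumes of paper documents remains an important technical challenge. Because handwritten documents have irregular layouts and arbitrary writing styles, algorithms that perform well on printed documents give unsatisfactory results on handwritten ones. Although handwritten character recognition has been studied for nearly half a century, the analysis of unconstrained handwritten documents has received growing attention only in recent years. Targeting the recognition of Chinese handwritten documents, this thesis studies in depth three key problems of handwritten document analysis: handwritten text line segmentation, document database annotation, and geometric context modeling. The main contributions are as follows. [1] To segment text lines in handwritten documents, we propose a text line segmentation algorithm based on distance metric learning and the minimum spanning tree. Under a given distance metric, the connected components of the document are first clustered into a tree structure; then, constrained by two criteria, a hypervolume reduction criterion and a straightness measure, the edges that link different text lines are dynamically removed, which yields the extracted text lines. The distance metric is obtained by supervised learning on training data labeled with pairwise constraints on connected components, and it enables the algorithm to extract curved and multi-oriented text lines easily. Experiments on a handwritten database of 803 document images demonstrate the effectiveness of the algorithm. [2] We also propose a text line segmentation algorithm based on the variational Bayes Gaussian mixture model. The algorithm treats the document image as a mixture model in which each text line is modeled by a corresponding Gaussian component. We use the Variational Bayes Expectation-Maximization (VBEM) algorithm to estimate the parameters of each component of the document model, and we extend the algorithm so that it can automatically eliminate and split components in order to estimate the number of mixture components. Experiments on Chinese handwritten documents demonstrate the effectiveness of the algorithm. [3] To reduce the manual effort of annotating large-scale Chinese document databases, we developed GTLC (Ground-Truthing Text Lines and Characters), a tool that automatically annotates text line and character information. The tool uses a text line segmentation algorithm to annotate text lines and a character string alignment algorithm to segment and label characters; the alignment algorithm casts string alignment as an optimization process that integrates character segmentation and recognition. The tool also provides several interactive operations for correcting the residual errors of automatic line segmentation and character alignment. We verified the effectiveness of the tool experimentally and used it to annotate a large number of offline Chinese handwritten documents. [4] We propose a method for describing the geometric context of Chinese characters and use it to improve text line alignment accuracy during document annotation. Specifically, we use four statistical models to estimate the geometric features of single characters and of adjacent characters; by effectively combining the geometric models with the character recognizer in a unified framework, we significantly improved the alignment accuracy for unconstrained handwritten Chinese text lines in experiments.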
English abstract	With the development of computing technology and the emergence of a variety of digitization devices, people are moving from paper documents to electronic documents. The digitization of large volumes of paper documents remains a heavy burden and a technical challenge. Methods designed for printed documents do not work well on unconstrained handwritten documents because of layout irregularity and writing style variability. Handwritten character recognition has been studied for nearly half a century, but the analysis of unconstrained handwritten documents has received intensive effort only in recent years. Aiming at automated Chinese handwritten document recognition, this thesis studies three key issues of handwritten document analysis: text line segmentation, document annotation, and geometric context modeling. The main contributions of this thesis are as follows. [1] For separating text lines in unconstrained handwritten documents, we propose a novel text line segmentation algorithm based on minimum spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components of the document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting edges using a new hypervolume reduction criterion and a straightness measure. Because the distance metric is learned in a supervised manner on a dataset of pairs of connected components, the proposed algorithm is robust to documents with multi-skewed and curved text lines. Experiments on a database of 803 unconstrained handwritten Chinese document images demonstrate the superiority of the proposed algorithm. [2] We propose a robust text line segmentation algorithm based on variational Bayes Gaussian mixture density estimation. Viewing the document image as a mixture density model, with each text line approximated by a Gaussian component, we use the Variational Bayes Expectation-Maximization (VBEM) method to automatically determine the number of components, and extend the VBEM method such that it can both eliminate and split components. Experiments on Chinese handwritten document images demonstrate the effectiveness of the approach. [3] For annotating large volumes of handwritten documents with low human effort, we develop a ground-truthing tool, GTLC, which can automatically segment and annotate the text lines and characters in handwritten document images. We use a text line segmentati...
Language	Chinese
Other identifier	200618014628051
Content type	Dissertation
Source URL	[http://ir.ia.ac.cn/handle/173211/6282]
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	殷飞. 脱机手写中文文档分析方法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院. 2010.
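
Illustrative sketch for contribution [1] of the abstract. The abstract describes clustering connected components into a minimum spanning tree and then cutting the edges that join different text lines. The Python sketch below shows only that general idea under loud simplifying assumptions, and it is not the thesis implementation: the learned distance metric is replaced by a hand-weighted Euclidean distance, and the hypervolume reduction and straightness criteria are replaced by a plain length threshold. The names `distance`, `minimum_spanning_tree`, `segment_lines`, `wx`, `wy`, and `cut_threshold` are hypothetical.

```python
from math import sqrt


def distance(a, b, wx=1.0, wy=4.0):
    """Weighted Euclidean distance between two component centroids.

    ASSUMPTION: the hand-picked weights stand in for the learned metric;
    vertical offsets are penalized more than horizontal ones.
    """
    dx, dy = a[0] - b[0], a[1] - b[1]
    return sqrt(wx * dx * dx + wy * dy * dy)


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph; returns (dist, i, j) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = distance(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def segment_lines(points, cut_threshold):
    """Drop MST edges longer than cut_threshold; return groups of indices.

    ASSUMPTION: a fixed threshold replaces the hypervolume reduction and
    straightness criteria named in the abstract.
    """
    root = list(range(len(points)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for d, i, j in minimum_spanning_tree(points):
        if d <= cut_threshold:
            root[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy centroids forming two horizontal "text lines" near y=10 and y=50.
centroids = [(0, 10), (20, 11), (40, 9), (0, 50), (20, 52), (40, 49)]
lines = segment_lines(centroids, cut_threshold=30.0)
```

On this toy input the only MST edge longer than the threshold is the one joining the two rows, so cutting it leaves one group per row. A learned metric, as described in the abstract, would take the place of the fixed `wx` and `wy` weights.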