Unsupervised style classification of document page images

ICIP (2) (2005)


Abstract

Style classification of document page images is crucial for logical structure analysis of heterogeneous collections of documents. Both layout and contextual features contain significant information about document styles. Most existing methods are supervised: specific document models or classifiers are learned from a training set of document page images with known class labels. In this paper, we propose an unsupervised classification method that involves no training or manual selection of algorithm parameters. In particular, we first represent each document page as an ordered labeled X-Y tree. A tree matching algorithm is then used to compute the style dissimilarity between two document pages. We propose a set of tree edit cost functions based on the Karl Pearson distance between two multivariate feature observations, which is robust to the over-segmentation problem and to zone length variations of the same logical entities. Finally, the K-medoids algorithm is used to find an optimal grouping of the trees into K clusters, each of which corresponds to a distinct document style. We evaluate our algorithm on test datasets with different cluster sizes and degrees of style similarity. Experimental results show that our algorithm achieves an average classification accuracy of 95.69% over six datasets consisting of 150 pages of 11 different styles.
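The final clustering step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the Karl Pearson distance means the standardized Euclidean distance (squared differences divided by per-feature variance), it computes dissimilarities directly between hypothetical per-page feature vectors instead of through tree edit costs on X-Y trees, and it finds medoids by exhaustive search, which is only feasible for very small inputs. All function names and the toy data are illustrative assumptions.

```python
import math
from itertools import combinations


def feature_variances(rows):
    """Sample variance of each feature column."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
            for j in range(dims)]


def pearson_distance(x, y, variances):
    """Standardized Euclidean distance (assumed reading of the
    Karl Pearson distance): each squared difference is scaled by
    that feature's variance."""
    return math.sqrt(sum((a - b) ** 2 / v
                         for a, b, v in zip(x, y, variances)))


def k_medoids(dist, k):
    """Exhaustive K-medoids on a precomputed dissimilarity matrix:
    try every set of k medoids and keep the one with the lowest
    total distance from each item to its nearest medoid."""
    n = len(dist)
    best_cost, best_medoids = math.inf, None
    for medoids in combinations(range(n), k):
        cost = sum(min(dist[i][m] for m in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Assign each item to the index of its nearest medoid.
    labels = [min(range(k), key=lambda c: dist[i][best_medoids[c]])
              for i in range(n)]
    return list(best_medoids), labels


# Hypothetical per-page features (not taken from the paper),
# e.g. number of zones and mean zone height.
pages = [[4.0, 10.0], [5.0, 11.0], [4.5, 10.5],
         [12.0, 30.0], [13.0, 31.0], [12.5, 29.0]]
variances = feature_variances(pages)
dist = [[pearson_distance(p, q, variances) for q in pages] for p in pages]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

In the method described above, the entries of `dist` would instead come from the tree matching algorithm applied to pairs of X-Y trees. For collections of realistic size, an iterative swap-based K-medoids procedure would replace the exhaustive search.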


Keywords

tree matching algorithm, image representation, unsupervised style classification, document page images, logical structure analysis, image classification, document image processing, Karl Pearson distance, K-medoids algorithm, structure analysis, cost function
