Text Classification via iVector Based Feature Representation

Document Analysis Systems (2014)

Abstract
In this paper, we address the problem of text classification: distinguishing modern machine-printed text, handwritten text, and historical typewritten text in degraded, noisy documents. We propose a novel text classification approach based on the iVector, a recently developed concept in speaker verification. For a given text line, the iVector is a fixed-length feature vector obtained from a high-dimensional supervector built from the means of a Gaussian mixture model (GMM); the text-dependent component is separated from a universal background model (UBM) and represented by a low-dimensional set of factors. We classify the text lines in iVector space with a discriminative classifier, a support vector machine (SVM). A baseline approach that classifies text using GMMs in feature space is also presented for comparison. Experimental results on an Arabic document database show an accuracy of 92.04% for text line classification using the proposed method. Furthermore, the word error rate (WER) of optical character recognition (OCR) decreases by a relative 9.6% when OCR is coupled with the proposed iVector-SVM classifier. The proposed iVector-SVM approach is language independent and can therefore be applied to other scripts as well.
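
The pipeline described in the abstract (UBM, per-line supervector, low-dimensional projection, SVM) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the frame-level features are synthetic, the helper names make_line and supervector are hypothetical, the relevance factor of 16 is an arbitrary choice, and PCA is swapped in as a much simpler stand-in for the factor analysis that produces true iVectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)
DIM, N_CLASSES, N_COMPONENTS = 6, 3, 4

# Synthetic stand-in data (an assumption): shared clusters plus a small
# class-dependent offset that plays the role of the text-dependent component.
centers = 8.0 * np.eye(N_COMPONENTS, DIM)
offsets = np.stack([
    np.zeros(DIM),
    np.full(DIM, 0.8),
    np.tile([0.8, -0.8], DIM // 2),
])

def make_line(label, n_frames=80):
    """Return hypothetical frame-level features for one text line."""
    which = rng.integers(0, N_COMPONENTS, size=n_frames)
    noise = rng.normal(size=(n_frames, DIM))
    return centers[which] + offsets[label] + noise

def supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to one line; return the stacked mean offsets."""
    post = ubm.predict_proba(frames)             # (n_frames, n_components)
    n_c = post.sum(axis=0)                       # zeroth-order statistics
    f_c = post.T @ frames                        # first-order statistics
    ml_means = f_c / np.maximum(n_c, 1e-8)[:, None]
    alpha = (n_c / (n_c + relevance))[:, None]   # adaptation weight per component
    adapted = alpha * ml_means + (1.0 - alpha) * ubm.means_
    return (adapted - ubm.means_).ravel()

labels = np.tile(np.arange(N_CLASSES), 40)       # 120 lines, 40 per class
lines = [make_line(y) for y in labels]
train, test = slice(0, 90), slice(90, 120)

# 1. UBM: one GMM trained on frames pooled over all training lines.
ubm = GaussianMixture(n_components=N_COMPONENTS, covariance_type="diag",
                      n_init=3, random_state=0)
ubm.fit(np.vstack(lines[train]))

# 2. Fixed-length representation per line, then a low-dimensional projection.
#    PCA is a simplification here, not the total variability model.
X = np.array([supervector(ubm, frames) for frames in lines])
pca = PCA(n_components=5).fit(X[train])

# 3. Discriminative classifier in the low-dimensional space.
svm = SVC(kernel="linear").fit(pca.transform(X[train]), labels[train])
accuracy = svm.score(pca.transform(X[test]), labels[test])
print(f"held-out accuracy on synthetic lines: {accuracy:.2f}")
```

The point of the sketch is that lines of any length map to a vector of the same size (here 4 components times 6 dimensions before projection), which is what allows a standard SVM to be applied. The accuracy it prints refers only to the synthetic data and says nothing about the results reported above.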
Keywords
gaussian mixture model, gmm, text dependent component, ubm, handwritten text classification, image representation, gaussian processes, historical typewritten text classification, discriminative classifier, universal background model, ivector space, fixed-length feature vector representation, text detection, ocr, mixture models, relative word error rate, ivector based feature representation, degraded noisy documents, feature extraction, image classification, support vector machine, ivector-svm classifier, arabic document database, handwritten character recognition, optical character recognition, wer, feature space, high-dimensional super vector, document image processing, support vector machines, text line classification, modern machine-printed text classification, vectors, hidden markov models