Analyzing the Effect of Dimensionality Reduction in Document Categorization for Basque

Archives of Control Sciences (2005)

This paper analyzes the effect of dimensionality reduction techniques on the categorization of documents written in Basque. The classification techniques selected are Naïve Bayes, Winnow, SVMs and k-NN. Singular Value Decomposition (SVD) is used for dimensionality reduction, together with lemmatization and noun selection. The results show that the approach combining SVD and k-NN on a lemmatized corpus gives the best accuracy rates of all, by a remarkable margin.

[...] different corpora in both experiments: words, lemmas and nouns. The results obtained show that the SVD dimensionality reduction technique combined with the k-NN classification algorithm gives the best results. Moreover, we find that they are obtained for the lemmatized corpus.

This paper is structured as follows. First, we reference previous work on the algorithms we use for document categorization and examine the foundations of LSI. Afterwards, the experimental setup is introduced: both training and test corpora are described, and the lemmatization, noun selection and document-frequency-based feature selection processes are presented. In the next section, experimental results are shown, compared and discussed. Finally, conclusions and future work are presented.
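The combination of SVD and k-NN named above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes NumPy, a tiny made-up term-by-document count matrix, cosine similarity as the neighbor measure, and majority voting. The function names `fit_lsi`, `fold_in` and `knn_predict` are hypothetical.

```python
from collections import Counter

import numpy as np


def fit_lsi(X, k):
    """Truncated SVD of a term-by-document matrix X (terms x docs).

    Returns the top-k left singular vectors and the coordinates of
    every training document in the reduced k-dimensional space.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]
    docs_k = Vt[:k].T * s[:k]  # equals (U_k.T @ X).T, shape (docs, k)
    return U_k, docs_k


def fold_in(q, U_k):
    """Project a new term vector q into the same reduced space."""
    return q @ U_k


def knn_predict(q_k, docs_k, labels, n_neighbors=3):
    """Majority vote among the n_neighbors most cosine-similar documents."""
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    sims = (docs_k @ q_k) / np.where(norms == 0, 1.0, norms)
    nearest = np.argsort(-sims)[:n_neighbors]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (illustrative only). Rows are terms, columns are documents.
# Terms: goal, match, team, bank, market, stock
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)
labels = ["sports", "sports", "economy", "economy"]

U_k, docs_k = fit_lsi(X, k=2)

# A new document containing the terms "goal" and "team".
query = np.array([1, 0, 1, 0, 0, 0], dtype=float)
predicted = knn_predict(fold_in(query, U_k), docs_k, labels)
```

In a setup like the one described, the rows of `X` would be words, lemmas or nouns depending on the corpus variant, so lemmatization and noun selection change the matrix before the SVD step rather than the classifier itself.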
singular value decomposition