Feature selection, perceptron learning, and a usability case study for text categorization
Special Interest Group on Information Retrieval, no. SI (1997): 67-73
In this paper, we describe an automated learning approach to text categorization based on perceptron learning and a new feature selection metric, called correlation coefficient. Our approach has been tested on the standard Reuters text categorization collection. E...
- The phenomenal growth of the Internet has resulted in the availability of huge amounts of online information.
- Much of this information is in the form of natural language texts.
- A computer system that can categorize real-world, unrestricted English texts into a predefined set of categories would be most useful.
- When tested on the standard Reuters text categorization collection, the approach outperforms the best published results on this Reuters corpus.
- We live in a world of information explosion.
- We present an automated learning approach to building a robust, efficient and practical text categorization system, called CLASSI, using the perceptron learning algorithm.
- We describe a new feature selection metric, called correlation coefficient, which yields considerable improvement in categorization accuracy.
- Our evaluation has shown that CLASSI outperforms existing approaches on the standard Reuters corpus.
- We conducted a case study which indicates that a semi-automated approach can achieve categorization performance close to the manual, expert system approach of building text categorization systems.
- By manually modifying and augmenting the set of words to be used as features in a topic categorizer, the authors achieve accuracy very close to the manual rule-based approach.
- The authors achieved an F-measure accuracy of 0.522, which is still substantially lower than the accuracy of 0.733 achieved by TCS.
- The authors have successfully built a robust, efficient and practical text categorization system, CLASSI, using the perceptron learning algorithm.
- The authors' evaluation has shown that CLASSI outperforms existing approaches on the standard Reuters corpus.
- The use of a new correlation coefficient in feature selection results in considerable improvement in categorization performance.
- The authors conducted a case study which indicates that a semi-automated approach can achieve categorization performance close to the manual, expert system approach of building text categorization systems.
- Table 1: The perceptron learning algorithm
- Table 2: Effect of Feature Selection Method and Feature Set Size on Break-even point
- Table 3: Results on the Reuters test corpus
- Table 4: Successive improvements to CLASSI and Comparison with TCS
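
The summary above names perceptron learning as the basis of CLASSI but does not reproduce the algorithm from Table 1. As a rough illustration only, the following is a minimal sketch of the textbook mistake-driven perceptron for a single category. It is not the authors' implementation: the binary word-presence features, the zero initial weights, the learning rate, the epoch limit, and all function and variable names (`train_perceptron`, `classify`, the toy documents) are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def train_perceptron(
    docs: List[Set[str]],
    labels: List[int],
    features: List[str],
    learning_rate: float = 1.0,
    max_epochs: int = 100,
) -> Tuple[Dict[str, float], float]:
    """Train one binary category classifier with the classic perceptron rule.

    docs     -- each document as a set of words (assumed representation)
    labels   -- +1 if the document belongs to the category, -1 otherwise
    features -- the words kept after feature selection (chosen by hand here)
    """
    weights = {f: 0.0 for f in features}
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for words, y in zip(docs, labels):
            # Binary features: a feature is active if the word occurs in the document.
            active = [f for f in features if f in words]
            score = bias + sum(weights[f] for f in active)
            predicted = 1 if score > 0 else -1
            if predicted != y:
                # Update only on mistakes, moving the weights toward the true label.
                for f in active:
                    weights[f] += learning_rate * y
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:
            break  # every training document is classified correctly
    return weights, bias


def classify(words: Set[str], weights: Dict[str, float], bias: float) -> bool:
    """Return True if the document is assigned to the category."""
    score = bias + sum(w for f, w in weights.items() if f in words)
    return score > 0


# Toy training data for a hypothetical "wheat" category.
docs = [
    {"wheat", "harvest", "tonnes"},
    {"wheat", "export"},
    {"oil", "barrel", "price"},
    {"bank", "interest", "rate"},
]
labels = [1, 1, -1, -1]
features = ["wheat", "oil", "bank", "price"]

weights, bias = train_perceptron(docs, labels, features)
print(classify({"wheat", "shipment"}, weights, bias))
print(classify({"oil", "barrel"}, weights, bias))
```

In a setup like the one summarized here, one such classifier would be trained per category, and `features` would come from a feature selection step rather than being listed by hand.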