An Empirical Comparison of Text Categorization Methods
Ana Cardoso-Cachopo and Arlindo L. Oliveira
acardoso@gia.ist.utl.pt and aml@inesc-id.pt
Instituto Superior Técnico / ALGOS-INESC-ID
An Empirical Comparison of Text Categorization Methods – p. 1/16
Outline
- Data sets
- Information Retrieval methods
- Evaluation
- Experimental setup
- Results
- Conclusions
Data sets
- C10 (in Portuguese): 461 help desk messages, with answers; 10 classes, with 34 to 58 messages each
- mini20 (in English): a 2000-message subset of 20Newsgroups; 100 messages for each newsgroup
- Pre-processing:
  - Discard words shorter than 3 characters
  - Discard words longer than 20 characters
  - Remove numbers and non-letter characters
  - Case and special character unification
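The pre-processing steps above can be sketched as a small filter. This is a minimal illustration, not the authors' actual code; note that simply stripping non-ASCII letters (as below) is a crude stand-in for the "special character unification" a Portuguese corpus would need (e.g. mapping accented letters to their base forms).

```python
import re

def preprocess(text):
    """Sketch of the slides' pre-processing: lowercase the text, remove
    numbers and non-letter characters, and keep only words whose length
    is between 3 and 20 characters."""
    text = text.lower()                    # case unification
    text = re.sub(r"[^a-z\s]", " ", text)  # drop numbers / non-letters
    return [w for w in text.split() if 3 <= len(w) <= 20]

print(preprocess("Re: HELP!! My PC-2000 won't boot in 2024..."))
# -> ['help', 'won', 'boot']
```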
IR methods
- Vector model
- Latent Semantic Analysis/Indexing (LSA)
- Support Vector Machines (SVM)
- k-NN Vector
- k-NN LSA
IR methods — Vector
- Words ~ terms
- Docs are vectors in an N-dimensional space
- Similarity between docs is the cosine of the angle formed by the vectors representing the docs
- A doc's class is the class of the most similar doc
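The Vector-model classification rule above can be sketched with sparse term-frequency vectors and cosine similarity. The training documents and class names below are hypothetical toy data, not from the paper's corpora:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(query, train_docs):
    """Return the class of the single most similar training doc."""
    vec = Counter(query.split())
    best = max(train_docs, key=lambda d: cosine(vec, Counter(d[0].split())))
    return best[1]

train = [("printer toner empty", "hardware"),
         ("password reset login", "accounts")]
print(classify("printer out of toner", train))  # -> hardware
```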
IR methods — LSA
- Words ~ terms
- Docs are vectors in an N-dimensional space
- Apply Singular Value Decomposition to obtain an (M << N)-dimensional space representing concepts
- Similarity between docs is the cosine of the angle formed by the vectors representing the docs in this lower-dimensional space
- A doc's class is the class of the most similar doc
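The LSA projection can be sketched with a rank-M truncated SVD of a term-document matrix. The matrix below is a toy example (rows = terms, columns = docs), not data from the paper:

```python
import numpy as np

# Toy term-document matrix: 4 terms (rows), 3 docs (columns).
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# SVD, then keep only the top M singular values: each doc becomes an
# M-dimensional vector in the "concept" space (M << N in practice).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
M = 2
docs_lsa = (np.diag(s[:M]) @ Vt[:M]).T   # one M-dim row per document

def cos(a, b):
    """Cosine similarity between two dense concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(docs_lsa.shape)                    # each doc is now 2-dimensional
print(round(cos(docs_lsa[0], docs_lsa[2]), 3))
```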
IR methods — SVM
- Words ~ terms
- Docs are vectors in an N-dimensional space
- Transform the space using a kernel function
- Find a decision surface for each class that separates it from the others
- One-against-one or one-against-all approach for multiclass problems
- A doc belongs to the class that receives the most "belongs" votes
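The one-against-one voting scheme above can be sketched independently of the SVM training itself. The pairwise "classifiers" below are hypothetical keyword-based stand-ins; in the paper each would be a binary SVM (trained via LIBSVM), but the vote-counting logic is the same:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(doc, classifiers, classes):
    """One-against-one voting: each pairwise classifier votes for one of
    its two classes; the doc gets the class with the most votes."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[classifiers[pair](doc)] += 1   # classifier returns one of the pair
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained binary SVMs.
classes = ["hardware", "accounts", "network"]
classifiers = {
    ("hardware", "accounts"): lambda d: "hardware" if "printer" in d else "accounts",
    ("hardware", "network"):  lambda d: "hardware" if "printer" in d else "network",
    ("accounts", "network"):  lambda d: "network" if "vpn" in d else "accounts",
}
print(one_vs_one_predict("my printer is broken", classifiers, classes))  # -> hardware
```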
IR methods — k-NN Vector / k-NN LSA
- Words ~ terms
- Docs are vectors in an N-dimensional space
- A doc's class is the most heavily weighted class among its k nearest neighbours
- The weight is the cosine similarity in the Vector/LSA space
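The weighted k-NN vote can be sketched once the cosine similarities to all training docs are available (whether computed in the Vector space or in the LSA concept space). The neighbour list below is illustrative toy data:

```python
def knn_weighted(similarities, k=3):
    """Weighted k-NN vote: among the k most similar training docs, pick the
    class whose members have the highest total cosine similarity.
    `similarities` is a list of (cosine_sim, class) pairs."""
    top = sorted(similarities, reverse=True)[:k]
    weights = {}
    for sim, cls in top:
        weights[cls] = weights.get(cls, 0.0) + sim
    return max(weights, key=weights.get)

neighbours = [(0.9, "hardware"), (0.5, "accounts"),
              (0.45, "accounts"), (0.1, "network")]
print(knn_weighted(neighbours, k=3))  # -> accounts (0.5 + 0.45 > 0.9)
print(knn_weighted(neighbours, k=1))  # -> hardware (reduces to 1-NN)
```

With k = 1 this reduces to the plain Vector/LSA rule of the previous slides, which is why the two families of methods are directly comparable.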
IR methods — overview
[Diagram: words ~ terms map docs into an N-dimensional term space; cosine similarity in that space gives the Vector model and kNN Vector. Applying SVD yields an M << N dimensional concept space, giving LSA and kNN LSA (again via cosine similarity). SVM operates on the term space with an RBF kernel and a voting strategy.]
Evaluation
- Text Categorization task: each document has exactly ONE category (so Recall is not important)
- The rank of the first correct answer is important (so Precision alone is not enough)
- Preferably a single number: Mean Reciprocal Rank (MRR)
- The MRR of an individual query is the reciprocal of the rank at which the first correct response was returned, or 0 if no returned response was correct. The score for a sequence of queries is the mean of the individual queries' reciprocal ranks.
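The MRR definition above translates directly into code. The query data below is a hypothetical example, not from the paper's experiments:

```python
def mrr(ranked_results):
    """Mean Reciprocal Rank: for each query, take 1/rank of the first
    correct answer (0 if none appears), then average over all queries.
    `ranked_results` is a list of (ranked class list, correct class) pairs."""
    total = 0.0
    for ranking, correct in ranked_results:
        total += next((1.0 / (i + 1) for i, c in enumerate(ranking) if c == correct), 0.0)
    return total / len(ranked_results)

queries = [(["hardware", "accounts"], "hardware"),  # rank 1 -> 1.0
           (["network", "accounts"], "accounts"),   # rank 2 -> 0.5
           (["network", "hardware"], "accounts")]   # missed -> 0.0
print(mrr(queries))  # -> 0.5
```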
Experimental setup
Pipeline: Documents → Read documents → IREP documents → Filter documents → Filtered documents → Test IR models ... → IREP results → Write results → Results
External packages: IGLU (Vector), FAQO (LSA/I), LIBSVM (SVMs)
Results – C10
[Plot: Mean Reciprocal Rank (0.65 to 1) vs. number of terms (0 to 2500, plus "All terms") for LSA, k-NN LSA, Vector, k-NN Vector, and SVM.]
Results – mini20
[Plot: Mean Reciprocal Rank (0.35 to 0.85) vs. number of terms (0 to 2500, plus "All terms") for LSA, k-NN LSA, Vector, k-NN Vector, and SVM.]
Significance Tests – C10
Results of the t-test for dataset C10 (p-values for each pair of methods):

              SVM   k-NN LSA  LSA     k-NN Vector  Vector
SVM           –     0.1581    0.0001  0.0004       0.0012
k-NN LSA            –         0.0001  0.0007       0.0020
LSA                           –       0.0026       0.0038
k-NN Vector                           –            0.7221
Vector                                             –

Ordering shown on the slide: k-NN LSA, SVM, Vector, k-NN Vector, LSA
Significance Tests – mini20
Results of the t-test for dataset mini20 (p-values for each pair of methods):

              SVM   k-NN LSA  LSA     k-NN Vector  Vector
SVM           –     0.0010    0.0001  0.0000       0.0001
k-NN LSA            –         0.0001  0.0000       0.0000
LSA                           –       0.0004       0.0000
k-NN Vector                           –            0.0079
Vector                                             –

Ordering shown on the slide: SVM, k-NN LSA, LSA, Vector, k-NN Vector
Conclusions
- 2500 is a good upper bound for the number of terms
- k-NN LSA and SVM are both significantly better than the other methods
- MRR is useful for one-class Text Categorization tasks