Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization

Ana Cardoso-Cachopo, Arlindo Oliveira
Instituto Superior Técnico, Technical University of Lisbon / INESC-ID
EWLSATEL, March 2007

Outline

1. Introduction
2. Classification Methods
3. Combinations Between Methods
4. Experimental Setup
5. Experimental Results
6. Conclusions and Future Work

Introduction

Text Classification
Single-label
Classification Methods
- Vector
- k-NN
- SVM
- LSI
Goal: improve Accuracy

Classification Methods

[Diagram: each method and the components it relies on]
- Vector: cosine similarity
- k-NN: cosine similarity + voting strategy
- SVM: kernel
- LSI: SVD from the p-dimensional term space to an s << p dimensional concept space, then cosine similarity
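
The cosine similarity and k-NN voting components named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say which voting strategy was used, so an unweighted majority vote is assumed here, and the function names, the toy vectors, and the labels are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def knn_classify(query, train_vectors, train_labels, k=3):
    """Label the query by majority vote among its k most similar documents.

    Assumption: plain (unweighted) majority voting; other voting
    strategies are possible and the slide does not specify one.
    """
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term space with p = 3 terms; the vectors and labels are made up.
train_vectors = [[2, 0, 1], [3, 1, 0], [0, 2, 3], [0, 1, 2]]
train_labels = ["finance", "finance", "sports", "sports"]
predicted = knn_classify([1, 0, 0], train_vectors, train_labels, k=3)
```

With k = 1 the same function returns the class of the single most similar training document.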

Combinations Between Methods

[Diagram: SVD maps the p-dimensional term space to an s << p dimensional concept space, where the second method is applied]
- k-NN-LSI: SVD, then k-NN (cosine similarity + voting strategy)
- SVM-LSI: SVD, then SVM (kernel)
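
One way to read the k-NN-LSI combination is: project documents from the term space into the concept space with a truncated SVD, then run k-NN with cosine similarity there. The sketch below follows that reading using NumPy. It rests on several assumptions that the slide does not confirm: documents are projected as U_s^T d, voting is an unweighted majority, and the function names and the toy term-document matrix are hypothetical. The same projected vectors could in principle be handed to an SVM instead of k-NN, which is how SVM-LSI is read here.

```python
from collections import Counter

import numpy as np


def fit_lsi(term_doc, s):
    """Return the first s left singular vectors of a (p terms x n docs) matrix."""
    U, _sigma, _Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :s]


def knn_lsi_classify(query, term_doc, labels, s=2, k=3):
    """Project into an s-dimensional concept space, then vote among k neighbours."""
    U_s = fit_lsi(term_doc, s)
    train = U_s.T @ term_doc   # shape (s, n): training docs in concept space
    q = U_s.T @ query          # shape (s,): query in concept space
    # Cosine similarity between the query and every training document;
    # the small constant guards against division by zero.
    norms = np.linalg.norm(train, axis=0) * np.linalg.norm(q) + 1e-12
    sims = (train.T @ q) / norms
    top = np.argsort(-sims)[:k]
    return Counter(labels[i] for i in top).most_common(1)[0][0]


# Toy matrix: p = 4 terms (rows), n = 4 documents (columns); values are made up.
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])
labels = ["finance", "finance", "sports", "sports"]
predicted = knn_lsi_classify(np.array([1.0, 0.0, 0.0, 0.0]), term_doc, labels)
```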

Experimental Setup

Methods (6 already mentioned + Dumb)
Datasets
- Bank's Data - Bank37
- Reuters 21578 - R8, R52
- 20 Newsgroups - 20Ng
- Web Knowledge Base - Web4
- Cade - Cade12
Evaluation Measure
Accuracy = #Correctly classified documents / #Total documents
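
The accuracy measure defined on this slide amounts to a single ratio. A minimal sketch, with a hypothetical function name and made-up labels used only for illustration:

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy over zero documents")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up example: 3 of 4 documents are classified correctly.
example = accuracy(["earn", "acq", "earn", "crude"],
                   ["earn", "acq", "acq", "crude"])
```

Because each document has exactly one label in the single-label setting, no per-class averaging is needed for this measure.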

Characteristics of the Datasets

          Train   Test    Total   Smallest  Largest
          Docs    Docs    Docs    Class     Class
Bank37      928     463    1391         5      346
20Ng      11293    7528   18821       628      999
R8         5485    2189    7674        51     3923
R52        6532    2568    9100         3     3923
Web4       2803    1396    4199       504     1641
Cade12    27322   13661   40983       625     8473

Numbers of documents for the datasets: number of training documents, number of test documents, total number of documents, number of documents in the smallest class, and number of documents in the largest class.
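
The figures in this table can be checked and summarized with a few lines of code. The sketch below verifies that train and test counts add up to the totals, and computes the largest class's share of each dataset as a rough indicator of class imbalance. Two assumptions are made that the slide does not state: class sizes are counted over the total number of documents, and the share is only loosely indicative of what a classifier that always answers with the largest class would score. The variable and function names are hypothetical.

```python
# (train docs, test docs, total docs, smallest class, largest class),
# copied from the table on this slide.
DATASETS = {
    "Bank37": (928, 463, 1391, 5, 346),
    "20Ng": (11293, 7528, 18821, 628, 999),
    "R8": (5485, 2189, 7674, 51, 3923),
    "R52": (6532, 2568, 9100, 3, 3923),
    "Web4": (2803, 1396, 4199, 504, 1641),
    "Cade12": (27322, 13661, 40983, 625, 8473),
}


def splits_are_consistent(datasets):
    """True if train + test equals total for every dataset."""
    return all(train + test == total
               for train, test, total, _small, _large in datasets.values())


def largest_class_share(name):
    """Largest class size divided by total documents (assumed common basis)."""
    _train, _test, total, _small, largest = DATASETS[name]
    return largest / total


if __name__ == "__main__":
    print("splits consistent:", splits_are_consistent(DATASETS))
    for name in DATASETS:
        print(f"{name}: {largest_class_share(name):.3f}")
```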

Experimental Results

[Figure: bar chart of accuracy (0.0 to 1.0) on Bank37, 20Ng, R8, R52, Web4, and Cade12 for the methods Dumb, Vector, k-NN, SVM, LSI, k-NN-LSI, and SVM-LSI]

Accuracy values for the six datasets using each method.