Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization. Ana Cardoso-Cachopo, Arlindo Oliveira. Instituto Superior Técnico, Technical University of Lisbon / INESC-ID. EWLSATEL, March 2007


  1. Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization. Ana Cardoso-Cachopo, Arlindo Oliveira. Instituto Superior Técnico, Technical University of Lisbon / INESC-ID. EWLSATEL, March 2007. (IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo EWLSATEL, March 2007 1 / 11

  2. Outline
     1. Introduction
     2. Classification Methods
     3. Combinations Between Methods
     4. Experimental Setup
     5. Experimental Results
     6. Conclusions and Future Work

  8. Introduction
     - Text Classification
     - Single-label
     - Classification Methods: Vector, k-NN, SVM, LSI
     - Goal: improve Accuracy

  16. Classification Methods
      - Term space (p dimensional)
        - Vector: cosine similarity
        - k-NN: cosine similarity + voting strategy
        - SVM: kernel
      - Concept space (s << p dimensional), obtained from the term space by SVD
        - LSI: cosine similarity
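The term-space methods on this slide can be illustrated with a minimal sketch. The toy vocabulary, the counts, and the labels below are made up, and plain majority voting is an assumption: the slides do not say which voting strategy is used. With `k=1` the function simply returns the class of the single most similar document, which is one plausible reading of the "Vector" box.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query, train_vectors, train_labels, k=3):
    """Label the query by majority vote among its k most similar documents.

    Majority voting is an assumption; other voting strategies are possible.
    """
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical term space with p = 4 terms: [bank, loan, goal, match].
train_vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 3]]
train_labels = ["finance", "finance", "sports", "sports"]

query = [0, 1, 0, 0]  # a document that only mentions "loan"
nearest_label = knn_classify(query, train_vectors, train_labels, k=1)
voted_label = knn_classify(query, train_vectors, train_labels, k=3)
print(nearest_label, voted_label)
```

Both calls rely only on cosine similarity in the original term space, so a query that shares no terms with a training document gets similarity zero with it.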

  17. Combinations Between Methods
      - SVD maps the p dimensional term space to the s << p dimensional concept space
      - k-NN-LSI: cosine similarity + voting strategy, applied in the concept space
      - SVM-LSI: kernel, applied in the concept space
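A rough sketch of the k-NN-LSI combination follows: documents are projected into a small concept space and k-NN with cosine similarity runs there. Several choices are assumptions rather than the authors' implementation: power iteration on A Aᵀ stands in for a library SVD routine, documents are projected with the left singular vectors only (no scaling by singular values), voting is a plain majority, and the toy data is invented.

```python
import math
from collections import Counter


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm > 0.0 else 0.0


def top_left_singular_vectors(docs, s, iters=500):
    """Approximate the top-s left singular vectors of the term-document matrix.

    Uses power iteration with deflation on the p x p matrix A A^T
    (a stand-in for a proper SVD routine, adequate for tiny examples).
    """
    p = len(docs[0])
    # gram[i][j] = sum over documents of term_i * term_j, i.e. (A A^T)[i][j]
    gram = [[sum(d[i] * d[j] for d in docs) for j in range(p)] for i in range(p)]
    basis = []
    for _concept in range(s):
        v = [1.0 / math.sqrt(p)] * p  # deterministic start vector
        for _step in range(iters):
            w = [dot(row, v) for row in gram]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in gram])
        basis.append(v)
        # Deflate so the next pass converges to the next singular vector.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
                for i in range(p)]
    return basis


def project(doc, basis):
    """Map a p-dimensional term vector to s-dimensional concept coordinates."""
    return [dot(b, doc) for b in basis]


def knn_lsi_classify(query, train_docs, train_labels, basis, k=3):
    """k-NN with cosine similarity and majority voting in the concept space."""
    concept_query = project(query, basis)
    ranked = sorted(
        zip(train_docs, train_labels),
        key=lambda pair: cosine(concept_query, project(pair[0], basis)),
        reverse=True,
    )
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


# Hypothetical term space with p = 4 terms: [bank, loan, goal, match].
train_docs = [[2, 1, 0, 0], [1, 2, 0, 0], [3, 0, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
train_labels = ["finance", "finance", "finance", "sports", "sports"]

basis = top_left_singular_vectors(train_docs, s=2)  # s = 2 << p in spirit
query = [0, 1, 0, 0]          # mentions only "loan"
bank_only_doc = [3, 0, 0, 0]  # mentions only "bank"

predicted = knn_lsi_classify(query, train_docs, train_labels, basis, k=3)
term_space_sim = cosine(query, bank_only_doc)
concept_space_sim = cosine(project(query, basis), project(bank_only_doc, basis))
print(predicted, term_space_sim, concept_space_sim)
```

In this toy case the query and the "bank"-only document share no terms, so they are unrelated in the term space, yet they land on the same concept after projection. SVM-LSI would follow the same pattern with the projected vectors fed to an SVM instead of k-NN.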

  18. Experimental Setup
      - Methods (6 already mentioned + Dumb)
      - Datasets
        - Bank's Data: Bank37
        - Reuters 21578: R8, R52
        - 20 Newsgroups: 20Ng
        - Web Knowledge Base: Web4
        - Cade: Cade12
      - Evaluation Measure: $\text{Accuracy} = \dfrac{\#\text{Correctly classified documents}}{\#\text{Total documents}}$
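The evaluation measure is simple enough to state as code. The sketch below also includes a baseline under the assumption that "Dumb" means a classifier that ignores the document and always predicts the most frequent training class; the slides do not define it, so treat that part as a guess. The labels are placeholders.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of documents whose predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def make_dumb_classifier(train_labels):
    """Assumed baseline: always answer with the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda document: majority


# Placeholder labels, not taken from any of the datasets.
train_labels = ["a", "a", "a", "b", "b"]
test_documents = ["doc1", "doc2", "doc3", "doc4"]
test_labels = ["a", "b", "a", "a"]

dumb = make_dumb_classifier(train_labels)
predictions = [dumb(doc) for doc in test_documents]
dumb_accuracy = accuracy(predictions, test_labels)
print(dumb_accuracy)
```

Under this reading, the baseline's accuracy equals the share of test documents that belong to the majority training class, which makes it a useful floor when class sizes are very uneven.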

  26. Characteristics of the Datasets

      | Dataset | Train Docs | Test Docs | Total Docs | Smallest Class | Largest Class |
      |---------|-----------:|----------:|-----------:|---------------:|--------------:|
      | Bank37  |        928 |       463 |       1391 |              5 |           346 |
      | 20Ng    |      11293 |      7528 |      18821 |            628 |           999 |
      | R8      |       5485 |      2189 |       7674 |             51 |          3923 |
      | R52     |       6532 |      2568 |       9100 |              3 |          3923 |
      | Web4    |       2803 |      1396 |       4199 |            504 |          1641 |
      | Cade12  |      27322 |     13661 |      40983 |            625 |          8473 |

      Numbers of documents for the datasets: number of training documents, number of test documents, total number of documents, number of documents in the smallest class, and number of documents in the largest class.
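As a quick sanity check on these figures, the snippet below copies the numbers from this slide and verifies that the train and test counts add up to the totals. It also derives the test fraction and the share of the largest class; the latter assumes the class sizes are counted over the whole dataset, which the caption does not state explicitly.

```python
# (train docs, test docs, total docs, smallest class, largest class),
# copied from the dataset characteristics on this slide.
datasets = {
    "Bank37": (928, 463, 1391, 5, 346),
    "20Ng": (11293, 7528, 18821, 628, 999),
    "R8": (5485, 2189, 7674, 51, 3923),
    "R52": (6532, 2568, 9100, 3, 3923),
    "Web4": (2803, 1396, 4199, 504, 1641),
    "Cade12": (27322, 13661, 40983, 625, 8473),
}

test_fraction = {}
largest_class_share = {}
for name, (train, test, total, smallest, largest) in datasets.items():
    # The split should partition the dataset exactly.
    if train + test != total:
        raise ValueError(f"{name}: train + test does not equal total")
    test_fraction[name] = test / total
    # Assumption: class sizes refer to the whole dataset, not only the training set.
    largest_class_share[name] = largest / total
    print(f"{name}: test fraction {test_fraction[name]:.3f}, "
          f"largest class share {largest_class_share[name]:.3f}")
```

The gap between the smallest and largest classes in R52 (3 versus 3923 documents) shows how skewed some of these collections are, which is worth keeping in mind when reading plain accuracy figures.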

  30. Experimental Results
      [Figure: accuracy (vertical axis, 0.0 to 1.0) for each dataset (Bank37, 20Ng, R8, R52, Web4, Cade12), one series per method (Dumb, Vector, k-NN, SVM, LSI, k-NN-LSI, SVM-LSI).]
      Accuracy values for the six datasets using each method.
