
Author Profiling using Complementary Second Order Attributes and Stylometric Features



  1. Author Profiling using Complementary Second Order Attributes and Stylometric Features
     Konstantinos Bougiatiotis*, Anastasia Krithara
     Institute of Information and Telecommunication, N.C.S.R. "Demokritos", Greece
     September 3, 2016

  2. Outline
     1. Introduction: Overview
     2. Proposed Method: General Workflow, Preprocessing, Feature Extraction, Classification
     3. Experimental Results: PAN'16 Data, Results on Train Data, Results on Test Data
     4. Conclusions and Future Work

  7. Introduction: Author Profiling
     - Find specific characteristics of authors by studying their texts
     - Target characteristics: age, gender, personality traits, emotions
     - Applications: marketing, security, forensics, ...
     - PAN'16
       - Languages: English, Spanish and Dutch (gender only)
       - Focus on cross-genre evaluation

  9. General Workflow
     1. Aggregate tweets: tweets of each user → raw tweets
     2. Preprocessing (clean HTML, detwittify, remove numbers, remove punctuation) → clean tweets
     3. Feature extraction → extracted features
        - Stylometry features (model used in PAN'15)
        - Document-profile features (Second Order Attributes)
     4. Feature concatenation
     5. Support Vector Machine

  11. Preprocessing: Tweets
      Concatenate the tweets of each user → Profile Based Approach
      Sample tweet at each stage:
      - Raw tweet (noisy data: HTML tags, links, etc.): Thanks for the follow back <a href="/WolfgangDigital" class="twitter-atreply pretty-link js-nav" data-mentioned-user-id="391869708" ><s>@</s><b>WolfgangDigital </b></a> I&#39;ll be keeping an eye out for any vacancies you advertise in the near future.
      - After cleaning HTML: Thanks for the follow back @WolfgangDigital I&#39;ll be keeping an eye out for any vacancies you advertise in the near future.
      - After detwittifying (removing hashtags, replies, etc.): Thanks for the follow back I&#39;ll be keeping an eye out for any vacancies you advertise in the near future.
      - After removing all non-letter characters (numbers, ...): Thanks for the follow back I ll be keeping an eye out for any vacancies you advertise in the near future
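The cleaning stages shown on the sample tweet can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `clean_tweet` and the regular expressions are assumptions, and here HTML entities are decoded during the HTML-cleaning stage for simplicity.

```python
import html
import re


def clean_tweet(raw: str) -> str:
    """Hypothetical helper: approximate the slide's three cleaning stages."""
    # Stage 1, clean HTML: drop tags, then decode entities such as &#39;
    text = re.sub(r"<[^>]+>", "", raw)
    text = html.unescape(text)
    # Stage 2, detwittify: drop links, mentions/replies and hashtags
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    # Stage 3, remove non-letter characters (digits, punctuation, underscores);
    # Unicode letters are kept so Spanish and Dutch text survives
    text = re.sub(r"[\W\d_]+", " ", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


raw_tweet = (
    'Thanks for the follow back <a href="/WolfgangDigital" '
    'class="twitter-atreply pretty-link js-nav" '
    'data-mentioned-user-id="391869708" ><s>@</s><b>WolfgangDigital </b></a> '
    "I&#39;ll be keeping an eye out for any vacancies you advertise "
    "in the near future."
)
print(clean_tweet(raw_tweet))
```

For a profile-based approach, the cleaned tweets of one user would then be joined into a single document, for example with `" ".join(clean_tweet(t) for t in tweets)`.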

  17. Stylometric and Structural Features (PAN'15)
      Experimented with many profiling features:
      - Structural: number of hashtags, number of links, number of mentions
      - Stylometry: tf-idf of n-grams, bag of smileys, n-gram graphs, word length, number of uppercase
      Finally settled on term frequencies of 3-grams (age) and unigrams (gender)
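As an illustration of the final feature choice, n-gram term frequencies can be computed as below. This is only a sketch: the slide does not say whether the n-grams are over words or characters, so word-level n-grams are assumed here, and `ngram_term_frequencies` is a hypothetical name.

```python
from collections import Counter


def ngram_term_frequencies(tokens, n):
    """Relative frequency of each word n-gram in one user's document.

    Assumption: n-grams are taken over word tokens, not characters.
    """
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


tokens = "i ll be keeping an eye".split()
trigram_tf = ngram_term_frequencies(tokens, 3)  # 3-grams, as used for age
unigram_tf = ngram_term_frequencies(tokens, 1)  # unigrams, as used for gender
```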

  20. Second Order Attributes (SOA)
      Idea originally from the PAN'13 winning team (INAOE, Mexico) [1]
      2-step method, similar approach to Naive Bayes
      Intuition:
      1. Associate the different terms in our collection with target profiles (age or gender classes) → calculate word-class vectors based on word frequency
      2. Project the documents into the profile space according to the weighted aggregation of their terms → calculate document-class vectors
      [1] López-Monroy et al.: INAOE's participation at PAN'13: Author Profiling task. Notebook for PAN at CLEF 2013. In: CLEF 2013 Evaluation Labs and Workshop
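The two steps can be sketched as follows. The slides do not give the exact weighting, so this assumes a log-scaled relative term frequency in step 1 and a relative-frequency weighted sum in step 2; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_profile_vectors(docs, labels, classes):
    """Step 1: one vector per term with one entry per target class."""
    raw = defaultdict(lambda: [0.0] * len(classes))
    for tokens, label in zip(docs, labels):
        i = classes.index(label)
        for term, tf in Counter(tokens).items():
            # Assumed weighting: log-scaled relative frequency in the document
            raw[term][i] += math.log2(1 + tf / len(tokens))
    # Normalise each term vector so that its entries sum to 1
    return {t: [w / sum(ws) for w in ws] for t, ws in raw.items()}


def document_profile_vector(tokens, term_vectors, n_classes):
    """Step 2: project a document into the profile (class) space."""
    doc = [0.0] * n_classes
    for term, tf in Counter(tokens).items():
        # Terms never seen in training carry no class information
        for i, w in enumerate(term_vectors.get(term, [])):
            doc[i] += (tf / len(tokens)) * w
    return doc


classes = ["male", "female"]
train_docs = [["football", "beer", "football"], ["makeup", "love"]]
train_labels = ["male", "female"]
vectors = term_profile_vectors(train_docs, train_labels, classes)
doc_vec = document_profile_vector(["football", "makeup"], vectors, len(classes))
```

The resulting document vector has one dimension per class, which keeps the representation very compact compared with a bag-of-words vector.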

  21. Example of Age Specific Terms

  22. Example of Gender Specific Terms

  23. Example illustration of generated SOA

  24. Weighted SOA: Complementary
      Novelties introduced:
      → Use the documents of the complementary classes for each word-class relation
      Intuition: counter the skewed class distribution of the data
      → Use complementary classes for each term-profile relation
      → More even amount of data for each class
      → Robust estimates and less bias
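One possible reading of the complementary idea, in the spirit of Complement Naive Bayes, is sketched below. The slides do not give the formula, so everything here is an assumption: each term-class entry is estimated from the documents outside that class and averaged over the size of that complement, so a low value suggests a strong association with the class. The function name and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def complement_term_vectors(docs, labels, classes):
    """Term-class weights estimated from the documents NOT in each class."""
    if len(classes) < 2:
        raise ValueError("complement estimates need at least two classes")
    # Number of documents outside each class (the size of its complement)
    n_outside = [sum(1 for lab in labels if lab != c) for c in classes]
    raw = defaultdict(lambda: [0.0] * len(classes))
    for tokens, label in zip(docs, labels):
        for term, tf in Counter(tokens).items():
            weight = math.log2(1 + tf / len(tokens))
            # The document contributes to every class except its own
            for i, c in enumerate(classes):
                if c != label:
                    raw[term][i] += weight
    vectors = {}
    for term, sums in raw.items():
        # Averaging over the complement size evens out skewed classes
        means = [s / n if n else 0.0 for s, n in zip(sums, n_outside)]
        total = sum(means)
        vectors[term] = [m / total for m in means]
    return vectors


# Skewed toy data: class "a" has twice as many documents as "b" or "c"
docs = [["x"], ["x"], ["y"], ["z"]]
labels = ["a", "a", "b", "c"]
comp = complement_term_vectors(docs, labels, ["a", "b", "c"])
```

In this toy example the term "x" occurs only in class "a", so its complement weight for "a" is zero and the remaining mass is split between "b" and "c".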
