Author Profiling using Complementary Second Order Attributes and Stylometric Features
Konstantinos Bougiatiotis*, Anastasia Krithara
Institute of Information and Telecommunication, N.C.S.R "Demokritos", Greece
September 3, 2016
Outline
1 Introduction
- Overview
2 Proposed Method
- General Workflow
- Preprocessing
- Feature Extraction
- Classification
3 Experimental Results
- PAN'16 Data
- Results on Train Data
- Results on Test Data
4 Conclusions and Future Work
Introduction
Author Profiling
- Find specific characteristics of authors by studying their texts
- Age, gender, personality traits, emotions
- Applications: marketing, security, forensics, ...
PAN'16
- Languages: English, Spanish and Dutch (gender only)
- Focus on cross-genre evaluation
General Workflow
1 Aggregate tweets: tweets of each user → raw tweets
2 Preprocessing → clean tweets
- Clean HTML
- Detwittify
- Remove numbers
- Remove punctuation
3 Feature extraction → extracted features
- Stylometry features (model used in PAN'15)
- Document-profile features (Second Order Attributes)
4 Feature concatenation
5 Support Vector Machine
Tweets
- Concatenate the tweets of each user → Profile Based Approach
Sample Tweet
- Raw tweet (noisy data: HTML tags, links, etc.):
  Thanks for the follow back <a href="/WolfgangDigital" class="twitter-atreply pretty-link js-nav" data-mentioned-user-id="391869708" ><s>@</s><b>WolfgangDigital</b></a> I'll be keeping an eye out for any vacancies you advertise in the near future.
- After cleaning HTML:
  Thanks for the follow back @WolfgangDigital I'll be keeping an eye out for any vacancies you advertise in the near future.
- After detwittifying (removing hashtags, replies, etc.):
  Thanks for the follow back I'll be keeping an eye out for any vacancies you advertise in the near future.
- After removing all non-letter characters (numbers, ...):
  Thanks for the follow back I ll be keeping an eye out for any vacancies you advertise in the near future
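The three preprocessing steps applied to the sample tweet can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expressions and the function names (`clean_html`, `detwittify`, `remove_non_letters`, `preprocess`) are hypothetical and not taken from the authors' implementation.

```python
import re
from html import unescape

# All patterns below are illustrative assumptions, not the authors' exact rules.
TAG_RE = re.compile(r"<[^>]+>")          # any HTML tag
URL_RE = re.compile(r"https?://\S+")     # links
TWITTER_RE = re.compile(r"[@#]\w+")      # replies/mentions and hashtags
NON_LETTER_RE = re.compile(r"[\W\d_]+")  # everything that is not a letter


def clean_html(text):
    """Drop HTML tags and decode HTML entities."""
    return unescape(TAG_RE.sub("", text))


def detwittify(text):
    """Remove Twitter-specific tokens: links, mentions, hashtags."""
    return TWITTER_RE.sub(" ", URL_RE.sub(" ", text))


def remove_non_letters(text):
    """Replace numbers, punctuation and other non-letters with spaces."""
    return NON_LETTER_RE.sub(" ", text)


def preprocess(raw_tweet):
    """Run the three steps in order and normalise whitespace."""
    text = remove_non_letters(detwittify(clean_html(raw_tweet)))
    return " ".join(text.split())


raw = ('Thanks for the follow back <a href="/WolfgangDigital" '
       'class="twitter-atreply pretty-link js-nav" '
       'data-mentioned-user-id="391869708" ><s>@</s><b>WolfgangDigital</b></a> '
       "I'll be keeping an eye out for any vacancies you advertise "
       "in the near future.")
print(preprocess(raw))
```

On the sample tweet this reproduces the final cleaned text shown on the slide, including the apostrophe in "I'll" becoming a space.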
Stylometric and Structural Features - PAN'15
Experimented with many profiling features:
- Structural: number of hashtags, number of links, number of mentions
- Stylometry: tf-idf of n-grams, bag of smileys, n-gram graphs, word length, number of uppercase characters
Finally settled on term frequencies of 3-grams (age) and unigrams (gender)
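A minimal sketch of n-gram term frequencies is given below. The slide does not say whether the n-grams are word-level or character-level, nor how frequencies are normalised; this example assumes word-level n-grams and relative frequencies, and the function name `ngram_term_frequencies` is hypothetical.

```python
from collections import Counter


def ngram_term_frequencies(text, n):
    """Relative frequency of each word n-gram in a preprocessed text.

    Assumption: word-level n-grams over whitespace tokens, lowercased.
    """
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Unigrams (used for gender on the slide) and 3-grams (used for age).
profile = "keeping an eye out for any vacancies"
unigram_features = ngram_term_frequencies(profile, 1)
trigram_features = ngram_term_frequencies(profile, 3)
```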
Second Order Attributes (SOA)
- Idea originally from the PAN'13 winning team (INAOE, Mexico) [1]
- 2-step method, similar in approach to Naive Bayes
Intuition
1 Associate the different terms in our collection with target profiles (age or gender classes) → Calculate word-class vectors based on word frequency
2 Project the documents into the profile space according to the weighted aggregation of their terms → Calculate document-class vectors
[1] López-Monroy et al.: INAOE's participation at PAN'13: Author Profiling task. Notebook for PAN at CLEF 2013. In: CLEF 2013 Evaluation Labs and Workshop
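The two steps can be illustrated with a small sketch. The slide only states that word-class vectors come from word frequencies and that documents are a weighted aggregation of their terms; the log-scaled weighting, the per-term normalisation, and all function names below are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def term_profile_vectors(docs, labels, classes):
    """Step 1: one vector per term, holding one weight per target class.

    Assumption: each document adds log2(1 + tf / doc_length) to the entry
    of its own class, and every term vector is normalised to sum to 1.
    """
    index = {cls: j for j, cls in enumerate(classes)}
    raw = defaultdict(lambda: [0.0] * len(classes))
    for tokens, label in zip(docs, labels):
        if not tokens:
            continue
        for term, tf in Counter(tokens).items():
            raw[term][index[label]] += math.log2(1 + tf / len(tokens))
    return {term: [w / sum(vec) for w in vec] for term, vec in raw.items()}


def document_profile_vector(tokens, term_vectors, n_classes):
    """Step 2: aggregate term vectors, weighted by relative term frequency."""
    doc_vec = [0.0] * n_classes
    if not tokens:
        return doc_vec
    for term, tf in Counter(tokens).items():
        vec = term_vectors.get(term)
        if vec is None:  # term never seen in training
            continue
        for j in range(n_classes):
            doc_vec[j] += (tf / len(tokens)) * vec[j]
    return doc_vec


# Toy data, invented purely for illustration.
docs = [["football", "game"], ["makeup", "game"]]
labels = ["male", "female"]
classes = ["male", "female"]
terms = term_profile_vectors(docs, labels, classes)
doc_vec = document_profile_vector(["football", "game"], terms, len(classes))
```

In this toy setup each document ends up with one feature per class, so the SOA representation is very low-dimensional compared with a bag-of-words vector.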
[Figure: Example of Age Specific Terms]
[Figure: Example of Gender Specific Terms]
[Figure: Example illustration of generated SOA]
Weighted SOA Complementary
Novelties introduced:
- Use the documents of the complementary classes for each word-class relation
Intuition
Counter the skewed class distribution of the data
→ Use complementary classes for each term-profile relation
→ More even amount of data for each class
→ More robust estimates and less bias
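One possible reading of the complementary idea is sketched below: for each term-class pair, weights are accumulated from the documents of all other classes rather than from the class itself. This is an assumption for illustration only; the slide does not give the exact formula, and the log-scaled weight and the name `complementary_term_vectors` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def complementary_term_vectors(docs, labels, classes):
    """Accumulate, for every term and class, evidence from documents that
    do NOT belong to that class (the complementary classes).

    Assumption: each document contributes log2(1 + tf / doc_length).
    """
    vectors = defaultdict(lambda: [0.0] * len(classes))
    for tokens, label in zip(docs, labels):
        if not tokens:
            continue
        for term, tf in Counter(tokens).items():
            weight = math.log2(1 + tf / len(tokens))
            for j, cls in enumerate(classes):
                if cls != label:  # only complementary classes receive weight
                    vectors[term][j] += weight
    return dict(vectors)


# Toy data with placeholder class names "A", "B", "C".
docs = [["lol", "exam"], ["exam", "work"], ["work", "pension"]]
labels = ["A", "B", "C"]
comp = complementary_term_vectors(docs, labels, ["A", "B", "C"])
```

With more than two classes, each complementary estimate pools the documents of several classes, so a rare class no longer relies only on its own few documents. Under this reading, a low complementary weight for a class suggests a strong association between the term and that class.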