Author Profiling PAN-AP-2014 - CLEF 2014 Sheffield, 15-18 September 2014

Francisco Rangel (Autoritas / Universitat Politècnica de València), Paolo Rosso (Universitat Politècnica de València), Irina Chugur (UNED), Martin Potthast, Martin Trenkmann, Benno Stein (Bauhaus-Universität Weimar), Ben Verhoeven, Walter Daelemans (University of Antwerp)

What’s Author Profiling?
‣ Who is who? Author profile...
‣ Gender? Age? Native language? Personality traits? Emotions?

Why Author Profiling?
‣ Forensics: language as evidence
‣ Security: profile possible delinquents
‣ Marketing: segmenting users

Task Goal
‣ Given a collection of documents retrieved from different social media in English and Spanish, identify the age and gender of their authors.

Related Work on Author Profiling (age & gender)
Each entry lists: author; collection; features; results; other characteristics.
‣ Argamon et al., 2002: British National Corpus; part-of-speech; gender: 80% accuracy
‣ Holmes & Meyerhoff, 2003: formal texts; -; age and gender
‣ Burger & Henderson, 2006: blogs; posts length, capital letters, punctuation, HTML features; they only reported "low percentage errors"; two age classes: [0,18[, [18,-]
‣ Koppel et al., 2003: blogs; simple lexical and syntactic functions; gender: 80% accuracy; self-labeling
‣ Schler et al., 2006: blogs; stylistic features + content words with the highest information gain; gender: 80% accuracy, age: 75% accuracy
‣ Goswami et al., 2009: blogs; slang + sentence length; gender: 89.18% accuracy, age: 80.32% accuracy
‣ Zhang & Zhang, 2010: segments of blog; words, punctuation, average words/sentence length, POS, word factor analysis; gender: 72.10% accuracy
‣ Nguyen et al., 2011 and 2013: blogs & Twitter; unigrams, POS, LIWC; correlation: 0.74, mean absolute error: 4.1 - 6.8 years; manual labeling, age as continuous variable
‣ Peersman et al., 2011: Netlog; unigrams, bigrams, trigrams and tetragrams; gender+age: 88.8% accuracy; self-labeling, min 16 plus 16,18,25

News on PAN-AP 2014
‣ News on Author Profiling: new datasets. PAN-AP13 -> Social Media; Blogs; Twitter (with RepLab); TripAdvisor (EN)
‣ PAN-RepLab collaboration: two complementary perspectives of Author Profiling; PAN virtual machines for RepLab participants
‣ TIRA platform @ Weimar: all participants with the same computing power; increases participants' engagement; improves sustainability, replicability and reproducibility; allows cross-year evaluations

Difficulty of collecting data
‣ Big Data?
‣ High variety of themes
‣ Real people vs. robots (chatbots)
‣ Multilingual: English + Spanish + ...
‣ Difficulty of obtaining well-labelled data automatically
‣ Manual annotation?

Corpus
‣ Social Media: subset of PAN-AP13; N. words > 100; manual review
‣ Blogs: manually annotated (3 independent annotations); personal blogs; up to 25 posts; RSS content
‣ Twitter: manually annotated (3 independent annotations); personal accounts; up to 1000 tweets; tweet Id.; RepLab collaboration
‣ Hotel reviews: TripAdvisor; N. words > 10; manual review
‣ Languages: English and Spanish (hotel reviews: English only)
‣ Balanced by gender
‣ Age groups: 18-24; 25-34; 35-49; 50-64; 65+

Corpus - Social Media
Number of authors. Gender: male / female in every row.

LANG  AGE     TRAINING  EARLY BIRDS   TEST
EN    18-24      1,550          140    680
EN    25-34      2,098          180    900
EN    35-49      2,246          200    980
EN    50-64      1,838          160    790
EN    65+           14           12     26
EN    Total      7,746          692  3,376
ES    18-24        330           30    150
ES    25-34        426           36    180
ES    35-49        324           28    138
ES    50-64        160           14     70
ES    65+           32           14     28
ES    Total      1,272          122    566

Corpus - Blogs
Number of authors. Gender: male / female in every row.

LANG  AGE     TRAINING  EARLY BIRDS   TEST
EN    18-24          6            4     10
EN    25-34         60            6     24
EN    35-49         54            8     32
EN    50-64         23            4     10
EN    65+            4            2      2
EN    Total        147           24     78
ES    18-24          4            2      4
ES    25-34         26            4     12
ES    35-49         42            4     26
ES    50-64         12            2     10
ES    65+            4            2      2
ES    Total         88           14     56

Corpus - Twitter
Number of authors. Gender: male / female in every row.

LANG  AGE     TRAINING  EARLY BIRDS   TEST
EN    18-24         20            2     12
EN    25-34         88            6     56
EN    35-49        130           16     58
EN    50-64         60            4     26
EN    65+            8            2      2
EN    Total        306           30    154
ES    18-24         12            2      4
ES    25-34         42            4     26
ES    35-49         86           12     46
ES    50-64         32            6     12
ES    65+            6            2      2
ES    Total        178           26     90

Corpus - Hotel reviews
Number of authors. Gender: male / female in every row.

LANG  AGE     TRAINING   TEST
EN    18-24        180     74
EN    25-34        500    200
EN    35-49        500    200
EN    50-64        500    200
EN    65+          400    147
EN    Total      2,080    821

Corpus (test)
Number of authors by gender and age. SM = Social Media, BL = Blogs, TW = Twitter, REV = Hotel reviews.

GENDER  AGE     SM EN  SM ES  BL EN  BL ES  TW EN  TW ES  REV EN
FEMALE  18-24     340     75      5      2      6      2      74
FEMALE  25-34     450     90     12      6     28     13     200
FEMALE  35-49     490     69     16     13     29     23     200
FEMALE  50-64     395     35      5      5     13      6     200
FEMALE  65+        13     14      1      1      1      1     147
MALE    18-24     340     75      5      2      6      2      86
MALE    25-34     450     90     12      6     28     13     250
MALE    35-49     490     69     16     13     29     23     302
MALE    50-64     395     35      5      5     13      6     268
MALE    65+        13     14      1      1      1      1     178
Total            3376    566     78     56    154     90    1905

Identification accuracies
Figure: accuracy for gender, accuracy for age and joint accuracy, for English and for Spanish, with the average accuracy per subcorpus (SM, Blog, TW, Trip).
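The three measures named on this slide can be sketched as follows. This is a minimal illustration, assuming that joint accuracy counts an author as correct only when both gender and age are right, and that an author without a prediction counts as wrong; the function name `accuracies` and the toy data are hypothetical.

```python
def accuracies(truth, predicted):
    """Return (gender, age, joint) accuracy over all authors in `truth`.

    Both arguments map an author id to a (gender, age_group) pair.
    Assumption: an author missing from `predicted` counts as wrong everywhere.
    """
    n = len(truth)
    gender_ok = age_ok = joint_ok = 0
    for author, (gender, age) in truth.items():
        pred_gender, pred_age = predicted.get(author, (None, None))
        g = pred_gender == gender
        a = pred_age == age
        gender_ok += g
        age_ok += a
        joint_ok += g and a  # correct only if both attributes match
    return gender_ok / n, age_ok / n, joint_ok / n


# Toy data, not taken from the PAN corpus.
truth = {
    "a1": ("female", "25-34"),
    "a2": ("male", "35-49"),
    "a3": ("male", "18-24"),
    "a4": ("female", "65+"),
}
predicted = {
    "a1": ("female", "25-34"),
    "a2": ("male", "25-34"),
    "a3": ("female", "18-24"),
}
gender_acc, age_acc, joint_acc = accuracies(truth, predicted)
print(gender_acc, age_acc, joint_acc)
```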

Participants’ ranking
Figure: ranking of participants by accuracy for Social Media, Blogs, Twitter and Hotel Reviews, and by average accuracy, with the winner of the task highlighted.
‣ BASELINE: the 1000 most frequent character trigrams with SVM
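The feature side of the baseline can be sketched with the standard library alone. This is an illustration under assumptions, not the organisers' code: the slide does not say how the trigrams were weighted, so relative frequencies are used here, and the SVM training step is left out. The names `char_trigrams`, `build_vocabulary` and `vectorize` are hypothetical.

```python
from collections import Counter


def char_trigrams(text):
    """All overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_vocabulary(documents, size=1000):
    """The `size` most frequent character trigrams over all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(char_trigrams(doc))
    return [gram for gram, _ in counts.most_common(size)]


def vectorize(text, vocabulary):
    """Relative frequency of each vocabulary trigram in `text` (an assumption)."""
    counts = Counter(char_trigrams(text))
    total = sum(counts.values()) or 1
    return [counts[gram] / total for gram in vocabulary]


# Toy documents; a real run would use size=1000 on the training corpus.
docs = ["the hotel was great", "the room was small"]
vocab = build_vocabulary(docs, size=5)
vector = vectorize("the hotel", vocab)
print(vocab, vector)
```

The resulting vectors would then be passed to an SVM implementation of choice.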

Statistical significance
‣ Approximate randomisation testing*
‣ Pairwise comparison of accuracies of all systems
‣ p < 0.05 -> the systems are significantly different
*Eric W. Noreen. Computer intensive methods for testing hypotheses: an introduction. Wiley, New York, 1989.
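A pairwise approximate randomisation test on accuracies can be sketched as below. It is a generic version of the technique, assuming per-item correctness flags for two systems on the same test items, random swapping of paired outcomes, and the (count + 1) / (trials + 1) estimate of the p-value; the number of trials, the seed and the function name are arbitrary choices for illustration.

```python
import random


def approximate_randomisation(correct_a, correct_b, trials=10000, seed=0):
    """Two-sided p-value for the accuracy difference between two systems.

    correct_a[i] and correct_b[i] are 1 if the system got test item i right,
    else 0. Both lists must cover the same items in the same order.
    """
    rng = random.Random(seed)
    # Accuracy difference is proportional to the difference in correct counts.
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:  # swap the two outcomes for this item
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)


# Toy outcomes: system A is right on all 40 items, system B on the first 20.
system_a = [1] * 40
system_b = [1] * 20 + [0] * 20
p_diff = approximate_randomisation(system_a, system_b, trials=2000)
p_same = approximate_randomisation(system_a, system_a, trials=2000)
print(p_diff, p_same)
```

With p < 0.05 as on the slide, the first pair would be reported as significantly different and the second would not.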

Distances in age misidentification
‣ Age groups (truth and predicted): 18-24, 25-34, 35-49, 50-64, 65+
‣ Distance between the true and the predicted group: 0 to 4
‣ Missing predictions penalised with distance equal to 5
‣ Standard deviation of all the individual distances
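The distance measure can be sketched as follows, assuming the distance is the number of steps between the true and the predicted group in the ordered list of age groups. The slide does not say whether the population or the sample standard deviation was used; the population form is an assumption here, as are the names `age_distance` and `MISSING_PENALTY`.

```python
from statistics import mean, pstdev

AGE_GROUPS = ["18-24", "25-34", "35-49", "50-64", "65+"]
MISSING_PENALTY = 5  # distance assigned when no prediction was made


def age_distance(true_group, predicted_group):
    """Number of age groups between truth and prediction (0 to 4)."""
    if predicted_group is None:
        return MISSING_PENALTY
    return abs(AGE_GROUPS.index(true_group) - AGE_GROUPS.index(predicted_group))


# Toy (truth, prediction) pairs; None marks a missing prediction.
pairs = [("18-24", "18-24"), ("25-34", "50-64"), ("65+", "18-24"), ("35-49", None)]
distances = [age_distance(t, p) for t, p in pairs]
print(distances, mean(distances), pstdev(distances))
```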

Participants
‣ 10 participants
‣ 8 countries
‣ 8 papers

Approaches
‣ What kind of preprocessing, features and methods did the teams use?

Approaches: Preprocessing
‣ HTML cleaning to obtain plain text. 5 teams: [shrestha][marquardt][baker][ashok][weren]
‣ Deletion of URLs, hashtags and user mentions in Twitter. 1 team: [ashok]
‣ Case conversion, invalid characters, multiple white spaces... 2 teams: [baker][weren]
‣ Tokenisation. 2 teams: [villenaroman][weren]
‣ Subset selection. 1 team: [weren]
‣ Discrimination between human-like posts and spam-like posts (chatbots). 1 team: [marquardt]
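Several of these cleaning steps can be combined in a few lines of regular expressions. This is a generic sketch, not any team's pipeline: the patterns are simplifying assumptions (for example, any `<...>` span is treated as an HTML tag) and the function name `clean_post` is hypothetical.

```python
import html
import re


def clean_post(text):
    """Strip HTML, URLs, hashtags and mentions; lowercase; collapse spaces."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = html.unescape(text)                 # entities such as &amp;
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # user mentions and hashtags
    text = text.lower()                        # case conversion
    return re.sub(r"\s+", " ", text).strip()   # multiple white spaces


cleaned = clean_post("Check <b>this</b> out @user #pan http://example.org/x   NOW")
print(cleaned)
```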

Approaches: Features
‣ Stylistic features: frequencies of punctuation marks, size of sentences, words that appear once and twice, use of deflections, number of characters, words and sentences... 7 teams: [mechti][marquardt][ashok][baker][weren][shrestha][liau]
‣ Number of posts per user. 1 team: [marquardt]
‣ Correctness, cleanliness, diversity of texts. 1 team: [weren]
‣ HTML tags such as img, href, br. 2 teams: [weren][marquardt]
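A few of the stylistic features listed above can be computed as in the sketch below. The tokenisation and sentence splitting are deliberately naive assumptions (words are runs of `\w` characters, sentences end at `.`, `!` or `?`), and the feature names are hypothetical.

```python
import re
import string
from collections import Counter


def stylistic_features(text):
    """Simple counts: characters, words, sentences, punctuation, rare words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    word_counts = Counter(words)
    punctuation = sum(ch in string.punctuation for ch in text)
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "punctuation_ratio": punctuation / max(len(text), 1),
        "words_once": sum(1 for c in word_counts.values() if c == 1),
        "words_twice": sum(1 for c in word_counts.values() if c == 2),
    }


features = stylistic_features(
    "We loved the room. We loved the view! Great breakfast, too."
)
print(features)
```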

Approaches: Features
‣ Readability measures: Automated Readability Index, Coleman-Liau Index, Rix Readability Index, Gunning Fog Index, Flesch-Kincaid Index... 5 teams: [mechti][marquardt][ashok][baker][weren]
‣ Lexical analysis: PoS, proper nouns, character flooding... 2 teams: [mechti][ashok]
‣ Emoticons. 3 teams: [shrestha][marquardt][liau]
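Two of the readability measures named above have simple closed forms, sketched here with their standard published coefficients. The counting rules are assumptions: characters are counted inside `\w+` tokens and sentences are split at `.`, `!` or `?`, which real implementations refine. The helper name `text_counts` is hypothetical.

```python
import re


def text_counts(text):
    """Return (characters in words, number of words, number of sentences)."""
    words = re.findall(r"\w+", text)
    if not words:
        raise ValueError("text contains no words")
    chars = sum(len(w) for w in words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return chars, len(words), max(len(sentences), 1)


def automated_readability_index(text):
    chars, words, sentences = text_counts(text)
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43


def coleman_liau_index(text):
    chars, words, sentences = text_counts(text)
    letters_per_100_words = 100 * chars / words
    sentences_per_100_words = 100 * sentences / words
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


sample = "The cat sat. The dog ran."
ari = automated_readability_index(sample)
cli = coleman_liau_index(sample)
print(ari, cli)
```

Very short, simple sentences such as the sample give negative scores on both indices; the values become meaningful as grade levels only on longer texts.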

Approaches: Features
‣ Content features: n-grams, bag-of-words. 3 teams: [villenaroman][shrestha][liau]
‣ Topic words: money, home, smartphone... 1 team: [mechti]
‣ MRC, LIWC: familiarity, concreteness, imagery, motion, emotion, religion... 1 team: [marquardt]
‣ Dictionaries per subcorpus and class, lexical errors, foreign words, specific phrases: my husband, my wife... 4 teams: [baker][marquardt][ashok][liau]

Approaches: Features
‣ Sentiment. 1 team: [marquardt]
‣ Text to be identified is used as a query for a search engine: cosine similarity, Okapi BM25. 1 team: [weren]
‣ Second order representation based on relationships among terms, documents, profiles and subprofiles. 1 team: [pastor]
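The query-based idea can be illustrated with plain cosine similarity over term counts: the text to be profiled is compared against labelled texts, and the labels of the closest ones serve as evidence. This sketch only shows the ranking step under simplifying assumptions (raw term frequencies, no search engine, no BM25 weighting) and is not the team's implementation; all names and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by_similarity(query, labelled_docs):
    """Return (similarity, label) pairs, most similar first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(text)), label)
              for text, label in labelled_docs]
    return sorted(scored, reverse=True)


# Toy labelled texts; the age labels are purely illustrative.
labelled_docs = [
    ("great hotel near the beach", "25-34"),
    ("the hotel breakfast was great", "50-64"),
    ("football match tonight", "18-24"),
]
ranking = rank_by_similarity("the hotel was great", labelled_docs)
print(ranking)
```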

Approaches: Methods
‣ Logistic Regression. 3 teams: [shrestha][liau][weren]
‣ LogitBoost, Rotation Forest, Multi-Class Classifier, Multilayer Perceptron, Simple Logistic. 1 team: [weren]
‣ Multinomial Naïve Bayes. 1 team: [villenaroman]
‣ libLINEAR. 1 team: [lopezmonroy]
‣ Random Forest. 1 team: [ashok]
‣ Support Vector Machines. 1 team: [marquardt]
‣ Decision Tables. 1 team: [mechti]
‣ Own frequency-based prediction function. 1 team: [baker]

Early birds (best) results
‣ 7 teams participated

CORPUS         EN JOINT              EN GENDER             EN AGE                ES JOINT              ES GENDER           ES AGE
SOCIAL MEDIA   liau (0.2153)         liau (0.5390)         liau (0.3728)         shrestha (0.3033)     liau (0.7295)       liau (0.4262)
BLOG           lopezmonroy (0.2083)  lopezmonroy (0.6250)  4 teams (0.2500)      lopezmonroy (0.3571)  marquardt (0.6429)  2 teams (0.4286)
TWITTER        lopezmonroy (0.5333)  lopezmonroy (0.7667)  lopezmonroy (0.6333)  shrestha (0.6154)     shrestha (0.8846)   shrestha (0.6923)
HOTEL REVIEWS  liau (0.2622)         liau (0.7317)         lopezmonroy (0.3720)  -                     -                   -
