Vote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification
Roman Kern 1,2, Stefan Klampfl 2, Mario Zechner 2
1 Knowledge Management Institute - Graz University of Technology
2 Know-Center
rkern@tugraz.at, {rkern, sklampfl, mzechner}@know-center.at
PAN Workshop @ CLEF 2012 / 2012-09-20

Authorship Attribution - Approach
Vote/Veto Classification
◮ Same as last year ⇒ Compare data-sets
◮ Three different feature sets ⇒ Compare influence of unigram features vs. stylometric features

Authorship Attribution - Classification
Classification Algorithm
◮ Combine feature spaces via individual base classifiers
◮ Based on performance in training phase
◮ In classification phase combine results
Base Feature Spaces
◮ Basic statistics, token statistics, grammar statistics
◮ Stop-word terms, slang terms, pronoun terms
◮ Intro-outro terms, bigram terms, unigram terms, terms
Feature Space Combinations
◮ Terms
◮ Stylometric
◮ Statistics

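The combination step can be illustrated with a minimal sketch. The slides do not spell out the exact vote/veto rule, so everything here is an assumption made for illustration: each base classifier returns a probability per author, `weights` stands in for its performance in the training phase, a classifier votes for its most probable author, and it vetoes any author whose probability falls below the hypothetical `veto_threshold`.

```python
from collections import defaultdict


def vote_veto_scores(predictions, weights, veto_threshold=0.05):
    """Combine base classifier outputs into one score per author.

    predictions: {classifier name: {author: probability}}
    weights:     {classifier name: weight taken from training performance}
    Both the weighting and the threshold rule are illustrative assumptions.
    """
    scores = defaultdict(float)
    for name, distribution in predictions.items():
        weight = weights[name]
        # Vote: the classifier supports its most probable author.
        favourite = max(distribution, key=distribution.get)
        scores[favourite] += weight
        # Veto: the classifier argues against very unlikely authors.
        for author, probability in distribution.items():
            if probability < veto_threshold:
                scores[author] -= weight
    return dict(scores)


# Made-up outputs of three base classifiers for one document.
predictions = {
    "unigram terms": {"A": 0.60, "B": 0.30, "C": 0.10},
    "stop-word terms": {"A": 0.50, "B": 0.48, "C": 0.02},
    "basic statistics": {"A": 0.01, "B": 0.59, "C": 0.40},
}
weights = {"unigram terms": 0.8, "stop-word terms": 0.6, "basic statistics": 0.5}

scores = vote_veto_scores(predictions, weights)
winner = max(scores, key=scores.get)
print(winner, scores)
```

In this toy input author A receives two votes and one veto and still ends up with the highest combined score.
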
Authorship Attribution - Data-Sets

Basic Statistics
#  PAN 2011                  PAN 2012
1  Paragraph to lines ratio  Number of characters
2  Text to lines ratio       Number of words
3  Number of lines           Number of lines
4  Empty lines ratio         Number of stop-words
5  Number of paragraphs      Number of tokens

Token Statistics
#  PAN 2011                              PAN 2012
1  Likelihood of proper nouns            Number of tokens
2  Number of tokens                      Likelihood of proper nouns
3  Average token length                  Average verb length
4  Likelihood of infrequent word groups  Average token length
5  Likelihood of tokens of length 9      Likelihood of pronouns

Authorship Attribution - Feature Types
Comparison of configurations
[Figure: bar chart comparing the Terms, Statistics and Stylometric configurations; vertical axis from 0 to 70]

Authorship Clustering - Approach
Ensemble Clustering
◮ Multi-tier clustering
◮ Combine output of base clusters
◮ Only use stylometric features
Ensemble clustering is also known as consensus clustering or clustering aggregation

Authorship Clustering - Features
Multiple feature spaces
◮ Basic statistics (same as for authorship attribution)
◮ Stylometric features (hapax-legomena, hapax-dislegomena, yules-k, simpsons-d, brunets-w, sichels-s, honores-h, ...)
◮ Stem-suffixes, stop-words, pronouns
◮ Character 1-grams, 2-grams, 3-grams
⇒ Total of 7 feature spaces

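A few of the stylometric measures named above can be sketched from the frequency spectrum of a text. This is a minimal illustration using the standard textbook definitions (hapax legomena and dislegomena as the number of words occurring once and twice, Yule's K, Simpson's D). The whitespace tokenisation and the function name `stylometric_features` are assumptions, and the exact variants used in the system may differ.

```python
from collections import Counter


def stylometric_features(tokens):
    """Compute a few vocabulary richness measures for a list of tokens."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    word_counts = Counter(tokens)
    # Frequency spectrum: spectrum[i] = number of distinct words occurring i times.
    spectrum = Counter(word_counts.values())
    sum_i2 = sum(i * i * v for i, v in spectrum.items())
    return {
        "hapax_legomena": spectrum[1],
        "hapax_dislegomena": spectrum[2],
        # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
        "yules_k": 1e4 * (sum_i2 - n) / (n * n),
        # Simpson's D = sum_i V_i * i * (i - 1) / (N * (N - 1))
        "simpsons_d": sum(v * i * (i - 1) for i, v in spectrum.items()) / (n * (n - 1)),
    }


tokens = "the cat sat on the mat and the dog sat".split()
features = stylometric_features(tokens)
print(features)
```
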
Authorship Clustering - Clustering
Base clustering
◮ k-means clustering
◮ k-means++ seed selection
◮ Different relatedness measures for different feature spaces
    ◮ Cosine similarity
    ◮ Euclidean distance (after normalising the features)
Ensemble clustering
◮ Create a meta-space from the individual clustering solutions
◮ In meta-space the distance between instances depends on the agreement of the clustering solutions
◮ Give different base clusters different weights
◮ k-means clustering

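One way to build such a meta-space is sketched below. It is an assumption made for illustration, not necessarily the construction used in the system: every base clustering contributes a one-hot block per instance, scaled by the square root of a hypothetical weight, so that the squared Euclidean distance between two instances equals twice the summed weight of the base clusterings that disagree about them. The final k-means step on these vectors is omitted.

```python
import math


def meta_space(labelings, weights):
    """Map each instance to a vector built from its base cluster labels.

    labelings: list of label lists, one per base clustering (same length each)
    weights:   one non-negative weight per base clustering (assumed values)
    """
    n_instances = len(labelings[0])
    vectors = [[] for _ in range(n_instances)]
    for labels, weight in zip(labelings, weights):
        clusters = sorted(set(labels))
        scale = math.sqrt(weight)
        for i, label in enumerate(labels):
            # One-hot block for this base clustering, scaled by sqrt(weight).
            vectors[i].extend(scale if label == c else 0.0 for c in clusters)
    return vectors


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Three made-up base clusterings of four documents.
labelings = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
weights = [1.0, 0.5, 2.0]
vectors = meta_space(labelings, weights)

print(squared_distance(vectors[0], vectors[1]))  # all clusterings agree
print(squared_distance(vectors[0], vectors[2]))  # first and third disagree
```

Documents 0 and 1 share a cluster in every base clustering and therefore coincide in the meta-space, while documents 0 and 2 are separated by the two clusterings that split them.
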
Authorship Clustering - Evaluation
Ensemble clustering results

Feature Space          A vs B  C vs D  E vs F
1-grams                51.52%  53.98%  61.87%
2-grams                50.91%  54.46%  56.70%
3-grams                50.91%  51.33%  52.37%
Stop-Words & Pronouns  62.20%  50.72%  72.91%
Stem Suffixes          65.85%  63.01%  54.61%
Stylometry             52.74%  59.76%  64.25%
Basic Statistics       57.01%  56.87%  65.22%
Ensemble               66.10%  80.34%  78.44%

Sexual Predator Identification - Approach
Sequence classification
◮ Do not classify predators directly
◮ Classify individual messages/lines in chats
◮ Simple features

Sexual Predator Identification - Classes
Chat message classes/labels
◮ normal, predator, offending, reaction, post-offending

Line  Chat #1         Chat #2
1     normal          normal
2     predator        predator
3     normal          normal
4     normal          normal
5     offending       predator
6     reaction        normal
7     post-offending  predator
8     post-offending  predator
9     reaction        normal
10    reaction

[Figure: the two example chats are shown side by side, with rotated "pre" and "post" labels next to the lines]

Sexual Predator Identification - Features
Simple features
◮ Unigrams
◮ Double Metaphone
◮ isInitialAuthor, isLastAuthor, isMostVerboseAuthor, isFewerAuthors, hasTermFromPrevious
Classification algorithm
◮ Maximum entropy & beam search

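The per-message features could be extracted along the following lines. The slides only give the feature names, so the interpretations here are guesses: `isInitialAuthor` and `isLastAuthor` are read as "wrote the first/last message of the chat", `isMostVerboseAuthor` as "wrote the most tokens", and `hasTermFromPrevious` as "shares a token with the preceding message". Double Metaphone and `isFewerAuthors` are left out because the former needs a phonetic library and the meaning of the latter is not clear from the slides. The input format, a non-empty list of `(author, text)` pairs, is also an assumption.

```python
from collections import Counter


def message_features(chat):
    """Return one feature dict per message of a non-empty chat."""
    tokens = [text.lower().split() for _, text in chat]
    # Total number of tokens written by each author.
    verbosity = Counter()
    for (author, _), message_tokens in zip(chat, tokens):
        verbosity[author] += len(message_tokens)
    most_verbose = verbosity.most_common(1)[0][0]
    initial_author, last_author = chat[0][0], chat[-1][0]

    rows = []
    for i, ((author, _), message_tokens) in enumerate(zip(chat, tokens)):
        previous = set(tokens[i - 1]) if i > 0 else set()
        rows.append({
            "unigrams": sorted(set(message_tokens)),
            "isInitialAuthor": author == initial_author,
            "isLastAuthor": author == last_author,
            "isMostVerboseAuthor": author == most_verbose,
            "hasTermFromPrevious": bool(previous & set(message_tokens)),
        })
    return rows


# Made-up chat with two participants.
chat = [
    ("a", "hi there"),
    ("b", "hi how are you today"),
    ("a", "fine and you"),
    ("b", "bye"),
]
rows = message_features(chat)
for row in rows:
    print(row)
```

Each feature dict would then be fed, in message order, to the sequence classifier.
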
Sexual Predator Identification - Training

Sexual Predator Identification - Results

Class              Count  Precision  Recall
normal             3,117  0.955      0.995
predator           29     0.3        0.103
offending          52     0          0
post-offending     216    0.871      0.847
reaction           275    0.959      0.764
Identify predators 2      0.667      1

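The "Identify predators" row can be checked with a short calculation. Assuming the count of 2 is the number of actual predators, a recall of 1 means both were found, and a precision of 0.667 is consistent with three users being flagged, one of them wrongly. The counts `tp`, `fp` and `fn` below are this reading of the row, not figures reported in the evaluation.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Assumed counts: 2 predators found, 1 user flagged wrongly, none missed.
tp, fp, fn = 2, 1, 0
precision, recall = precision_recall(tp, fp, fn)
print(round(precision, 3), recall)
```
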
The End
Thank you!
Open-source code
https://www.knowminer.at/svn/opensource/projects/pan2012/trunk
Corresponding Author
Roman Kern <rkern@tugraz.at>