Authorship identification in large email collections: Experiments using features that belong to different linguistic levels
George K. Mikros & Kostas Perifanos
National and Kapodistrian University of Athens
PAN 2011 Lab, 19-22 September 2011, Amsterdam

Style
• Our approach to authorship identification rests mainly on the idea that an author's style is a complex, multifaceted phenomenon affecting the whole spectrum of his/her linguistic production.
• Following the classic notion of "double articulation" from the Prague School of Linguistics, we assume that stylistic information is constructed in parallel blocks of increasing semantic load, from character n-grams to word n-grams.
• In order to capture this multilevel manifestation of stylistic traits, we should detect features belonging to many different linguistic levels and ultimately combine them to achieve the most accurate representation of an author's style.
A hierarchical representation of features and related linguistic levels
• Semantics: word trigrams
• Syntax: word bigrams
• Morphology: word unigrams
• Phonology: character trigrams, character bigrams
Features
The 1000 most frequent n-grams from each of the following feature groups:
• Character bigrams (cbg): character n-grams are a robust indicator of authorship, and many studies have confirmed their superiority on large datasets.
• Character trigrams (ctg): character trigrams capture a significant amount of stylistic information and have the additional merit of representing common email acronyms such as FYI, FAQ, BTW, etc.
• Word unigrams (ung): word frequency is among the oldest and most reliable indicators of authorship, sometimes outperforming even the n-gram features.
• Word bigrams (wbg): word bigrams have long been used successfully in authorship attribution.
• Word trigrams (wtg): word trigrams have also been found to convey useful stylistic information, since they more closely approximate the syntactic structure of the document.
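As a rough illustration of these feature groups, the sketch below extracts the most frequent character and word n-grams with scikit-learn's `CountVectorizer` and concatenates them into an "All" matrix. The toy corpus and the group abbreviations used as dictionary keys are illustrative; the original experiments used 1000 features per group.

```python
# Sketch of the feature extraction described above (not the authors' code):
# the most frequent n-grams per feature group, one vectorizer per group.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "FYI the meeting moved to Friday, see the attached agenda.",
    "BTW can you send the report before the meeting on Friday?",
]

# max_features keeps only the most frequent n-grams (1000 in the paper).
feature_groups = {
    "cbg": CountVectorizer(analyzer="char", ngram_range=(2, 2), max_features=1000),
    "ctg": CountVectorizer(analyzer="char", ngram_range=(3, 3), max_features=1000),
    "ung": CountVectorizer(analyzer="word", ngram_range=(1, 1), max_features=1000),
    "wbg": CountVectorizer(analyzer="word", ngram_range=(2, 2), max_features=1000),
    "wtg": CountVectorizer(analyzer="word", ngram_range=(3, 3), max_features=1000),
}

matrices = {name: vec.fit_transform(corpus) for name, vec in feature_groups.items()}

# The "All" configuration concatenates the per-group matrices column-wise.
X_all = hstack(list(matrices.values()))
```

Each row of `X_all` is then one document's combined multilevel feature vector, ready for a classifier.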
Algorithms and Datasets
• Large and Small datasets (authorship attribution scenario)
▫ L2 regularized logistic regression
• Large+ and Small+ datasets (combined authorship attribution and verification scenario)
▫ One-Class SVM and L2 regularized logistic regression
• Verify 1, 2 & 3 datasets (pure authorship verification)
▫ One-Class SVM, using only the 2000 most frequent character bigrams
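A minimal sketch of the two model families named above, using scikit-learn (not the authors' original implementation) on synthetic data: an L2 penalized logistic regression for multi-author attribution and a One-Class SVM trained on a single author's documents for verification.

```python
# Minimal sketch, assuming scikit-learn; the data is synthetic stand-in
# for n-gram frequency vectors (3 authors, 20 documents each, 50 features).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(20, 50)) for i in range(3)])
y = np.repeat([0, 1, 2], 20)

# Attribution: multiclass logistic regression with an L2 penalty.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# Verification: a One-Class SVM fitted on one author's documents only;
# predict() returns +1 for "consistent with this author", -1 otherwise.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(X[y == 0])
```

The `C` and `nu` values here are defaults for illustration; in practice both would be tuned on held-out data.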
Results in Large Train Dataset
[Bar chart: accuracy (Acc) and F1 for the Cbg, Wtg, Ctg, Wbg, Ung, and All feature groups; values range from 0.246 to 0.481.]
F1 in Large Test Dataset
[Bar chart of per-author F1 scores, ranging from 0 to 0.658.]
Results in Small Train Dataset
[Bar chart: accuracy (Acc) and F1 for the Cbg, Wtg, Ctg, Wbg, Ung, and All feature groups; values range from 0.407 to 0.683.]
F1 in Small Test Dataset
[Bar chart of per-author F1 scores, ranging from 0 to 0.717.]
Procedure in Large+ & Small+ Datasets
• The test set contains documents both by known authors and by authors absent from the training data.
• A One-Class SVM, trained on the known authors' documents, decides whether a test document belongs to the known-author set or to an unknown author.
• Documents accepted as known-author texts are then attributed to one of the known authors (Author 1 … Author n) by L2 regularized logistic regression.
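The combined procedure can be sketched as a two-stage pipeline; this is a hedged reconstruction with toy data and illustrative names (`gate`, `identify`), not the authors' code: a One-Class SVM first gates out documents by unknown authors, and only accepted documents are passed to the logistic regression for attribution.

```python
# Two-stage sketch of the combined attribution/verification procedure,
# assuming scikit-learn; data and parameter values are toy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_known = np.vstack([rng.normal(loc=i, scale=0.3, size=(30, 20)) for i in range(2)])
y_known = np.repeat([0, 1], 30)

gate = OneClassSVM(kernel="rbf", nu=0.05).fit(X_known)     # known vs. unknown
clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X_known, y_known)

def identify(doc_vec):
    """Return a known-author id, or 'UNKNOWN' if the gate rejects the doc."""
    if gate.predict(doc_vec.reshape(1, -1))[0] == -1:
        return "UNKNOWN"
    return int(clf.predict(doc_vec.reshape(1, -1))[0])

# A vector far from every known author should be rejected by the gate.
far_away = np.full(20, 10.0)
result = identify(far_away)
```

The design choice is that the verifier runs first, so the attribution model is never forced to assign an unknown author's document to one of the known classes.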
F1 in Large+ and Small+ Datasets
[Two bar charts of per-author F1 scores: Large+ values range from 0.001 to 0.587; Small+ values range from 0 to 0.588.]
Results in Verification Datasets
[Bar chart of precision and recall for Verify1, Verify2, and Verify3; values range from 0.035 to 0.667.]
Conclusions
• Features spanning multiple linguistic levels capture an author's stylistic variation better than features focused on a single level.
• L2 regularized logistic regression performs very well on high-dimensional data.
• Authorship verification remains a difficult problem, and research should focus on new algorithms for handling one-class problems.
• We need common benchmark corpora in order to further advance authorship identification tools and methods.