Author Identification Using Semi-supervised Learning Ioannis - PowerPoint PPT Presentation

Author Identification Using Semi-supervised Learning Ioannis Kourtis and Efstathios Stamatatos University of the Aegean

Outline • Introduction • The proposed method – Common n-grams – SVM – Semi-supervised learning • Evaluation – Tuning the model parameters – Results • Conclusions

Author Identification • Authorship attribution vs. authorship verification • Closed-set vs. open-set classification • Text representation – Low-level (e.g., char n-grams) vs. high-level (e.g., syntactic) features • Classification method – Profile-based vs. instance-based paradigm

One Text vs. Groups of Texts • Most author identification methods are based on a fixed and stable training set • There are many cases where we need to decide about the authorship of groups of texts – Alternatively, a long text (a book) of unknown authorship can be segmented into multiple parts • Test sets can be used as unlabeled examples • Semi-supervised learning methods can then be used • Guzman-Cabrera et al. (2009) proposed the use of unlabeled examples found in the Web to enrich the training set

The Proposed Method • We propose a combination of two well-known classification methods – Common n-grams – Support Vector Machines • Both methods are based on character n-gram representation • Test texts are used as unlabeled examples • A semi-supervised learning method enriches the training set • Applied to closed-set classification tasks

 ∑ = − +        ∈ Common n-grams • A profile-based method • Originally proposed by Keselj et al. 2003 • Alternative dissimilarity measure proposed by Stamatatos, 2007 Unseen Training texts text x t1 , x t2 , …, x tn + 2 Dissimilarity 2 ( f ( g ) f ( g )) x 11 , x 12 , …, x 1n , y 1 x T = d ( P ( x ), P ( T )) a function 1 a f ( g ) f ( g ) g P ( x ) x T + a … Author profile + Distance estimation

SVM • Well-known and effective algorithm • Character 3-gram representation • Number of features defined using intrinsic dimension • [Diagram: training texts (x_11, x_12, …, x_1d, y_1), (x_21, x_22, …, x_2d, y_2), …, (x_m1, x_m2, …, x_md, y_m) → learned SVM model; unseen text (x_t1, x_t2, …, x_td) → most likely author]

Comparison • CNG – Robust to class imbalance – Vulnerable when there are many candidate authors – Robust when the distributions of the training and test sets are not similar • SVM – Vulnerable to class imbalance – Robust when there are many candidate authors – Robust when the distributions of the training and test sets are similar – Better exploitation of very high dimensionality

Semi-supervised Learning Algorithm • Inspired by co-training (Blum & Mitchell, 1998) • Given: – a set of training documents (labeled examples) – a set of test documents (unlabeled examples) • Repeat – Train CNG and SVM models on the training set – Apply CNG and SVM models to the test set – Select the test texts on which the CNG and SVM predictions agree – If a selected text is larger than a size threshold, move it from the test set to the training set • Use SVM as the default classifier for the remaining test texts
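The loop on this slide can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name agree_and_grow, the fit_cng and fit_svm callables (each takes labeled (text, author) pairs and returns a predict function), the default min_bytes value, and the stopping rule (stop when no text moves, or after max_rounds rounds) are assumptions, since the slide only says "Repeat".

```python
def agree_and_grow(train, test, fit_cng, fit_svm, min_bytes=500, max_rounds=10):
    """Grow the training set with test texts on which both classifiers agree.

    train: list of (text, author) pairs; test: list of texts.
    Returns (predictions for every test text, enlarged training set).
    """
    train = list(train)
    predictions = [None] * len(test)
    pending = list(range(len(test)))  # indices of test texts not yet moved

    for _ in range(max_rounds):
        if not pending:
            break
        cng = fit_cng(train)
        svm = fit_svm(train)
        moved, still_pending = [], []
        for i in pending:
            label = cng(test[i])
            long_enough = len(test[i].encode("utf-8")) > min_bytes
            if label == svm(test[i]) and long_enough:
                moved.append(i)
                predictions[i] = label
            else:
                still_pending.append(i)
        if not moved:  # assumed stopping rule: nothing new to learn from
            break
        train.extend((test[i], predictions[i]) for i in moved)
        pending = still_pending

    # SVM is the default classifier for whatever is left in the test set.
    svm = fit_svm(train)
    for i in pending:
        predictions[i] = svm(test[i])
    return predictions, train
```

Moved texts are collected first and appended only after the pass over the test set, so the models used within one round never see labels assigned during that same round.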

Comparison with Co-training • Proposed algorithm: – Based on heterogeneous classifiers – Common feature types – Uses cases where the 2 classifiers agree • Co-training: – Based on homogeneous classifiers – Non-overlapping feature sets – Uses cases where the 2 classifiers are most confident

Evaluation Corpora - Small • [Figure: number of texts (axis from 0 to 800) per candidate author, training corpus vs. validation corpus] • 26 authors • Imbalanced • Similar distribution in training and validation sets

Evaluation Corpora - Large • [Figure: number of texts (axis from 0 to 800) per candidate author, training corpus vs. validation corpus] • 72 authors • Imbalanced • Similar distribution in training and validation sets

Frequency Threshold (SVM model) • [Figure: two panels, one for the small corpus and one for the large corpus]

Text-size Threshold • A threshold of 500 bytes excludes most of the cases where the two models agree but the predicted author is not the correct answer

Settings • Labeled examples: – Training and validation sets • Unlabeled examples: – Test set • CNG – n =3, L =3,000 • SVM – n =3, max intrinsic dimension

Performance

Corpus   MacroAvg Prec.   MacroAvg Recall   MacroAvg F1   MicroAvg accuracy   Rank
Small    0.476            0.374             0.38          0.638               7/17
Large    0.549            0.532             0.52          0.658               1/18
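For readers unfamiliar with the measures reported on this slide, the sketch below shows one common way to compute them. It is an assumption that the evaluation used exactly these definitions: here the macro averages are unweighted means of per-author precision, recall, and F1, and the micro-averaged accuracy is the fraction of texts attributed correctly. The function name macro_micro and the dictionary keys are hypothetical.

```python
from collections import Counter


def macro_micro(gold, predicted):
    """Macro-averaged precision/recall/F1 and micro-averaged accuracy."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    authors = sorted(set(gold) | set(predicted))
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_count = Counter(gold)
    pred_count = Counter(predicted)

    precisions, recalls, f1s = [], [], []
    for a in authors:
        p = correct[a] / pred_count[a] if pred_count[a] else 0.0
        r = correct[a] / gold_count[a] if gold_count[a] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    k = len(authors)
    return {
        "macro_precision": sum(precisions) / k,
        "macro_recall": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,  # mean of per-author F1 (an assumption)
        "micro_accuracy": sum(correct.values()) / len(gold),
    }
```

Because every author counts equally in the macro averages, authors with few texts weigh as much as prolific ones, which is why macro scores on an imbalanced corpus can sit well below the micro-averaged accuracy.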

Conclusions • First attempt to apply semi-supervised learning to author identification • Encouraging results for closed-set tasks • Character n-gram representation proves to be very effective • More diversity is needed in the classifier decisions • Plan to extend this approach to open-set tasks
