Feature Bagging for Author Attribution – PAN / CLEF 2012
François-Marie Giraud, Thierry Artières
LIP6 – University Paris 6, France
Motivation
• From the literature on author attribution
– It is hard to beat a simple and efficient system: a linear SVM on a bag of features
• Hypothetical explanations
– Intrinsic difficulty of defining relevant stylistic features
• Individual stylistic features are embedded and hidden in a large set of features
• Stylistic features depend on the writer
– Optimization concern
• Undertraining phenomenon [McCallum et al., CIIR 2005]
Motivation
• Undertraining phenomenon
– Training document set: bag of features (words sorted from most to least frequent)
Motivation
• Undertraining phenomenon
– Training document set: bag of features (words sorted from most to least frequent)
– The red subset of features alone allows perfect linear-SVM discrimination of the training set
– The blue subset of features alone allows it as well
– The green subset is useless
– Yet the learned discrimination is based on the red features only
Motivation
• Undertraining phenomenon
– A test document containing no red features (their counts are all 0) gets a near-random prediction from the linear SVM
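The failure mode above can be made concrete with a tiny sketch (toy numbers, not the paper's data): a linear model whose weight mass sits on a few "red" features produces a near-zero margin, i.e. a near-random decision, on a test document that contains none of them.

```python
# Hypothetical 8-feature vocabulary; the first 3 play the role of the
# "red" features and carry almost all of the learned weight.
weights = [2.0, -1.5, 1.8, 0.0, 0.0, 0.1, -0.1, 0.0]
bias = 0.0

def margin(doc):
    """Signed distance of a bag-of-features count vector from the hyperplane."""
    return sum(w * x for w, x in zip(weights, doc)) + bias

doc_with_red = [3, 1, 2, 0, 1, 4, 2, 1]   # uses red features -> large margin
doc_no_red   = [0, 0, 0, 5, 2, 4, 2, 1]   # no red features  -> margin near 0

print(margin(doc_with_red))   # clearly positive: confident prediction
print(margin(doc_no_red))     # close to zero: prediction is near random
```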
Undertraining investigation
• Document: bag of 2,500 features (words sorted from most to least frequent)
• [Plots: training and validation accuracy for classifiers using (a) only the first X features, (b) all but the first X features, (c) a random X features]
Principle of feature bagging
• Document: bag of words over the ~3,000 most frequent words
• Random selection of 50 to 200 features per base classifier
• K base classifiers learned on random subsets of features
• Aggregation of the base classifiers' results by majority vote
• Prediction of the author
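The pipeline above can be sketched end-to-end in a few lines (toy data throughout). The paper's base learners are linear SVMs; a nearest-centroid classifier stands in here so the sketch needs no external library, and all names are illustrative rather than taken from the authors' code.

```python
import random
from collections import Counter

def train_centroid(docs, labels, feat_idx):
    """Per-author centroid over the selected feature subset."""
    sums, counts = {}, Counter(labels)
    for doc, y in zip(docs, labels):
        acc = sums.setdefault(y, [0.0] * len(feat_idx))
        for j, f in enumerate(feat_idx):
            acc[j] += doc[f]
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_centroid(centroids, feat_idx, doc):
    proj = [doc[f] for f in feat_idx]           # project onto the subset
    def dist2(y):
        return sum((a - b) ** 2 for a, b in zip(proj, centroids[y]))
    return min(centroids, key=dist2)

def feature_bagging_fit(docs, labels, n_features, k=25, subset_size=3, seed=0):
    """Learn k base classifiers, each on a random feature subset."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        idx = rng.sample(range(n_features), subset_size)
        models.append((idx, train_centroid(docs, labels, idx)))
    return models

def feature_bagging_predict(models, doc):
    """Aggregate the base classifiers' predictions by majority vote."""
    votes = Counter(predict_centroid(c, idx, doc) for idx, c in models)
    return votes.most_common(1)[0][0]

# Toy corpus: two authors, four tiny bag-of-features vectors.
docs = [[5, 1, 0, 2], [4, 2, 1, 2], [0, 1, 5, 3], [1, 0, 4, 3]]
labels = ["A", "A", "B", "B"]
models = feature_bagging_fit(docs, labels, n_features=4, k=15, subset_size=2)
print(feature_bagging_predict(models, [5, 1, 1, 2]))  # -> "A"
```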
Preliminary results
• Publicly available English blog corpus
• Statistics on base classifiers
• Comparison with the baseline
Experimental methodology for PAN
• Learning stage: training documents A1, A2, B1, B2, C1, C2 (two per author)
• [Diagram: the training set is split into a train part (A1, B1, C1) and a validation part (A2, B2, C2)]
Experimental methodology for PAN
• Learning stage: training documents A1, A2, B1, B2, C1, C2 (two per author)
• [Diagram: several train/validation splits, each assigning one document per author to train and the other to validation, e.g. train (A1, B1, C1) / valid (A2, B2, C2), train (A2, B2, C1) / valid (A1, B1, C2), ...]
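With two documents per author, every assignment of one document to training and the other to validation yields a distinct split, and base classifiers can be learned over several of them. A minimal enumeration sketch (document names are illustrative):

```python
from itertools import product

author_docs = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]}

def enumerate_splits(author_docs):
    """Yield every (train, valid) split that takes one doc per author."""
    authors = sorted(author_docs)
    for choice in product((0, 1), repeat=len(authors)):
        train = [author_docs[a][c] for a, c in zip(authors, choice)]
        valid = [author_docs[a][1 - c] for a, c in zip(authors, choice)]
        yield train, valid

splits = list(enumerate_splits(author_docs))
print(len(splits))   # 8 possible splits for 3 authors with 2 docs each
print(splits[0])     # (['A1', 'B1', 'C1'], ['A2', 'B2', 'C2'])
```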
Comments on PAN results
• Fewer random features per base classifier works better
• Better ranks on closed tasks
• The reject method has to be improved
• It is of interest to use several training/validation splits
Perspective: a two-stage approach
• Motivation
– The way the classifier behaves when features are removed depends on the author: author profiles for the unmasking method [Koppel 2007]
• Investigate combining this result with our feature-bagging approach
Two-stage approach
1. Bagging approach: learn multiple base classifiers exploiting randomly selected subsets of features
2. Build a new datum (called a profile) for each (document, author) pair: the profile vector for document d and author a
3. (Optional) sort all vectors of the new dataset by value
4. Learn a binary classifier to decide whether a profile is correct or not
Two-stage approach
1. Bagging approach: learn multiple base classifiers exploiting randomly selected subsets of features
2. Build a new datum (called a profile) for each (document, author) pair
3. (Optional) sort all vectors of the new dataset by value
4. Learn a binary classifier to decide whether a profile is correct or not
• [Plot: feature-value curves of true-author (sorted) profiles vs. false-author profiles]
• Similar results to the bagging approach
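The two stages can be sketched as follows (toy stand-ins throughout): stage one produces, for a (document, author) pair, one score per base classifier; sorting that vector by value gives the "profile" fed to a stage-two binary classifier that decides whether the attribution is correct. The thresholded mean used as the second-stage classifier here is the simplest possible stand-in, not the paper's choice.

```python
def profile(base_scores, sort=True):
    """Profile vector for one (document, author) pair.

    base_scores[k] is the score base classifier k assigns to this author
    for this document (here: arbitrary toy numbers).
    """
    return sorted(base_scores, reverse=True) if sort else list(base_scores)

# True-author pairs tend to get consistently high scores across the bagged
# classifiers; false-author pairs do not (illustrative values).
true_profile  = profile([0.9, 0.7, 0.8, 0.6])
false_profile = profile([0.2, 0.6, 0.1, 0.3])

def is_correct_author(p, threshold=0.5):
    """Stage two: decide whether a profile belongs to the true author."""
    return sum(p) / len(p) > threshold

print(is_correct_author(true_profile))   # True
print(is_correct_author(false_profile))  # False
```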
Conclusion and future work
• Feature-bagging approach that enforces exploiting all features
⇒ Outperforms the SVM baseline
⇒ Should be improved for handling open problems (cf. PAN results)
• The second approach gives similar results while using a different representation
⇒ The two should be combined
Any questions?
Additional results on PAN