

  1. Proximity-based one-class classification with Common N-Gram dissimilarity for the authorship verification task
PAN 2013 Author Identification
Magdalena Jankowska, Vlado Kešelj and Evangelos Milios
Faculty of Computer Science, Dalhousie University, Halifax, Canada
PAN Workshop, CLEF 2013, Valencia, September 25, 2013

  2. Authorship verification problem
Input: a set A of "known" documents by a given author, and an "unknown" document u of questioned authorship

  3. Authorship verification problem
Input: a set A of "known" documents by a given author, and an "unknown" document u of questioned authorship
Question: was u written by the same author?

  4. Our approach to the authorship verification problem
• Proximity-based one-class classification: is u "similar enough" to A?
• The idea is similar to the k-centres method for one-class classification
• The CNG dissimilarity is applied between documents

  5. Common N-Gram (CNG) dissimilarity
Proposed by Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles for authorship attribution. In Proc. of the Conference Pacific Association for Computational Linguistics, 2003.
Proposed as the dissimilarity measure of the Common N-Gram (CNG) classifier for multi-class classification: a questioned document is assigned to the least dissimilar class (e.g., works of Carroll vs. works of Shakespeare vs. works of Twain).
Successfully applied to the authorship attribution problem.

  6. CNG dissimilarity - formula
Profile: the sequence of the L most common n-grams of a given length n in a document

  7. CNG dissimilarity - formula
Profile: the sequence of the L most common n-grams of a given length n.
Example for n = 4, L = 6; document 1: Alice's Adventures in Wonderland by Lewis Carroll, profile Q1 ("_" denotes a space):

    n-gram   normalized frequency f1
    _the     0.0127
    the_     0.0098
    and_     0.0052
    _and     0.0049
    ing_     0.0047
    _to_     0.0044
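The profile construction on this slide can be sketched in a few lines of Python (a minimal illustration, not the authors' implementation; the function name and the toy text are invented for the example):

```python
from collections import Counter

def profile(text, n=4, L=6):
    """Return the L most frequent character n-grams of `text`
    together with their normalized frequencies."""
    # Slide all windows of length n over the text.
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(ngrams)
    total = sum(counts.values())
    # Keep only the L most common n-grams, frequency-normalized.
    return {g: c / total for g, c in counts.most_common(L)}

# Toy usage (invented text, not the novels from the slide):
p = profile("the cat and the hat and the bat", n=4, L=6)
```

As in the slide's example, "the " (with its surrounding spaces) dominates the top of the profile for typical English text.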

  8. CNG dissimilarity - formula
Profile: the sequence of the L most common n-grams of a given length n.
Example for n = 4, L = 6 ("_" denotes a space).
Document 1: Alice's Adventures in Wonderland by Lewis Carroll, profile Q1.
Document 2: Tarzan of the Apes by Edgar Rice Burroughs, profile Q2.

    Q1: n-gram   f1          Q2: n-gram   f2
    _the         0.0127      _the         0.0148
    the_         0.0098      the_         0.0115
    and_         0.0052      and_         0.0053
    _and         0.0049      _of_         0.0052
    ing_         0.0047      _and         0.0052
    _to_         0.0044      ing_         0.0040

  9. CNG dissimilarity - formula
Profile: the sequence of the L most common n-grams of a given length n.
For the profiles Q1 (Alice's Adventures in Wonderland by Lewis Carroll) and Q2 (Tarzan of the Apes by Edgar Rice Burroughs) from the previous slide, the CNG dissimilarity between these documents is:

    D(Q1, Q2) = Σ_{x ∈ Q1 ∪ Q2} ( 2 (f1(x) − f2(x)) / (f1(x) + f2(x)) )²

where f_i(x) = 0 if the n-gram x does not appear in Q_i
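The formula above translates directly into Python (a sketch; representing a profile as a dict from n-gram to normalized frequency is our choice, not prescribed by the slides):

```python
def cng_dissimilarity(q1, q2):
    """CNG dissimilarity D(Q1, Q2): the sum, over the union of
    both profiles, of (2 * (f1(x) - f2(x)) / (f1(x) + f2(x)))**2,
    with f_i(x) = 0 when x is absent from Q_i."""
    total = 0.0
    for x in set(q1) | set(q2):
        f1, f2 = q1.get(x, 0.0), q2.get(x, 0.0)
        total += (2.0 * (f1 - f2) / (f1 + f2)) ** 2
    return total
```

Identical profiles yield 0; an n-gram present in only one profile contributes (2f/f)² = 4 regardless of its frequency, so fully disjoint profiles of length L give 8L.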

  10. Proximity-based one-class classification: dissimilarity between instances
Given the set A of "known" documents by a given author and the "unknown" document u:
D(d_j, u) - dissimilarity between a given "known" document d_j and the "unknown" document

  11. Proximity-based one-class classification: dissimilarity between instances
Given the set A of "known" documents by a given author and the "unknown" document u:
D(d_j, u) - dissimilarity between a given "known" document d_j and the "unknown" document
Dmax(d_j, A) - maximum dissimilarity between d_j and any other "known" document, i.e., the dissimilarity to this author's document most dissimilar to d_j

  12. Proximity-based one-class classification: dissimilarity between instances
Dissimilarity ratio of d_j: how much more (or less) dissimilar the "unknown" document is than the most dissimilar document by the same author:

    r(d_j, u, A) = D(d_j, u) / Dmax(d_j, A)

where Dmax(d_j, A) is the dissimilarity between d_j and this author's document most dissimilar to d_j

  13. Proximity-based one-class classification: proximity between a sample and the positive class instances
Measure of proximity between the "unknown" document u and the set A of documents by a given author:
M(u, A) - the average of the dissimilarity ratios r(d_j, u, A) over all "known" documents d_j

  14. Proximity-based one-class classification: thresholding on the proximity
Iff M(u, A) is less than or equal to a threshold θ: classify u as belonging to A, i.e., as written by the same author.
M(u, A) - the average of the dissimilarity ratios r(d_j, u, A) over all "known" documents d_j
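Slides 10 to 14 combine into one short sketch (the helper names are ours; `dissim` stands for any document dissimilarity, e.g. the CNG measure, and at least two "known" documents are assumed here):

```python
def proximity(u, known, dissim):
    """M(u, A): the average, over known documents d_j, of the
    ratio r(d_j, u, A) = D(d_j, u) / Dmax(d_j, A), where
    Dmax(d_j, A) is the dissimilarity between d_j and the known
    document most dissimilar to it."""
    ratios = []
    for j, d_j in enumerate(known):
        # Most dissimilar document by the same author.
        d_max = max(dissim(d_j, d_k)
                    for k, d_k in enumerate(known) if k != j)
        ratios.append(dissim(d_j, u) / d_max)
    return sum(ratios) / len(ratios)

def same_author(u, known, dissim, theta):
    """Classify u as written by the author of `known`
    iff M(u, A) <= theta."""
    return proximity(u, known, dissim) <= theta
```

With numbers as stand-in documents and absolute difference as the dissimilarity, an "unknown" point inside the known cluster gets M well below 1, while a far-away point gets M well above 1, matching the intuition behind thresholds near 1 on the parameter-selection slide.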

  15. Real scores
Obtained by linearly scaling the M(u, A) measure so that the threshold θ maps to 0.5, with cut-offs at θ ± 0.1:
    M(u, A) < θ − 0.1  →  score 1
    M(u, A) > θ + 0.1  →  score 0
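The linear scaling on this slide can be sketched as follows (the function name is invented; linear interpolation between the two cut-offs is implied by the slide, with θ itself mapping to 0.5):

```python
def real_score(m, theta):
    """Scale the proximity measure M(u, A) into [0, 1]:
    1 below theta - 0.1, 0 above theta + 0.1, and linear in
    between, so that m == theta maps to 0.5."""
    if m < theta - 0.1:
        return 1.0
    if m > theta + 0.1:
        return 0.0
    return (theta + 0.1 - m) / 0.2
```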

  16. Special conditions used
• When only one "known" document by a given author is provided: the single "known" document is divided into two halves, which are treated as two "known" documents
• When some documents do not have enough character n-grams to create a profile of the chosen length: all documents in the instance are represented by profiles of equal length, the maximum length for which this is possible
• Additional preprocessing (tends to increase accuracy on training data): all documents in a given instance are cut to an equal length in words
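The first and third special conditions can be sketched as below (helper names are ours; cutting to the length of the shortest document is our assumption, since the slide only says "an equal length in words"):

```python
def split_in_half(doc):
    """Single-known-document case: split the one known document
    into two halves and treat them as two known documents."""
    mid = len(doc) // 2
    return [doc[:mid], doc[mid:]]

def cut_to_equal_length(docs):
    """Preprocessing: cut all documents in an instance to an
    equal length in words (here, the shortest document's length)."""
    words = [d.split() for d in docs]
    n = min(len(w) for w in words)
    return [" ".join(w[:n]) for w in words]
```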

  17. Parameters
Parameters of our method:
• type of tokens: we used characters
• n - n-gram length
• L - profile length
• θ - threshold on the proximity measure M for classification (the biggest problem)

  18. Parameter selection
Parameters for the final competition run were selected using experiments on training data in Greek and English:
• provided by the competition organizers
• compiled by ourselves from existing datasets for other authorship attribution problems
For Spanish: the same parameters as for English.

    Parameter                                        English   Greek
    n (length of character n-grams)                  6         7
    L (profile length)                               2000      2000
    θ (threshold), at least two "known" documents    1.02      1.008
    θ (threshold), only one "known" document         1.06      1.04

  19. Results on the PAN 2013 competition test dataset

                                   Entire set            English subset        Greek subset          Spanish subset
    F1 of our method               0.659                 0.733                 0.600                 0.640
    competition rank               5th (shared) of 18    5th (shared) of 18    7th (shared) of 16    9th of 16
    best F1 of other competitors   0.753                 0.800                 0.833                 0.840
    AUC                            0.777                 0.842                 0.711                 0.804

  20. Conclusion
• Very encouraging results in terms of the power of our measure M for ordering the instances
• Choosing the threshold is difficult, and depends strongly on the corpus

  21. Future work
• Further parameter analysis
• Exploring user interaction and insight through visualization
• Exploring improvements of the method

  22. Acknowledgement
• This research was funded by a contract from the Boeing Company, a Collaborative Research and Development grant from the Natural Sciences and Engineering Research Council of Canada, and a Killam Predoctoral Scholarship.

  23. Thank you!
