Stopword Graphs and Authorship Attribution in Text Corpora
R. Arun, V. Suresh, C. E. Veni Madhavan (2009)
Idea
• Identify interactions of stopwords (noise words) in text corpora
• View these interactions as graphs: stopwords are nodes, interaction strengths are the weights of edges between stopwords
• An interaction is defined by the distance between a pair of words
Idea
• Given a list of possible authors, a graph is computed for each author
• i.e. closed-case authorship attribution
• The unknown text is attributed to the author whose graph is closest (see the sketch below)
• Closeness is computed with the Kullback-Leibler divergence
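To make the pipeline concrete, here is a minimal Python sketch of the closed-case attribution loop. The helpers build_stopword_graph and symmetric_kl are sketched after the later slides; all names are illustrative choices of mine, not from the paper.

    def attribute_author(unknown_text, author_texts, stopwords):
        # Compare the unknown text's graph against each candidate
        # author's graph; a smaller symmetric KL divergence means
        # the two stopword graphs are more similar.
        unknown = build_stopword_graph(unknown_text, stopwords)
        divergences = {
            author: symmetric_kl(build_stopword_graph(text, stopwords), unknown)
            for author, text in author_texts.items()
        }
        return min(divergences, key=divergences.get)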
Stop Words
• "Words that convey very little semantic meaning, but help to add detail"
• Similar to function words, but stopword lists may include more words
• Defined by their prevalence in text (they occupy roughly 50 % of a text)
• Lists used: 571 stopwords (~480 in my approach)
Example: The kids are playing in the garden.
Construction of the Graphs
• Stopwords are the nodes of the graph
• Distance is captured by edge weights
• Stopword pairs at smaller distances get more weight
• Distance: number of words between them

Example: The kids are playing in the garden.
d(The, the) > d(The, in) > d(The, are) = d(are, in) > d(in, the)
w(The, the) < w(The, in) < w(The, are) = w(are, in) < w(in, the)
(d: distance function, w: weight function)
Construction of the Graphs
Example: The kids are playing in the garden.
[Figure: the resulting stopword graph for the example sentence]
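A minimal sketch of the graph construction under stated assumptions: the slides only say that smaller distances yield larger weights, so the 1/(d+1) weight increment, the window size, and the naive tokenization below are illustrative choices of mine, not the paper's.

    from collections import defaultdict

    def build_stopword_graph(text, stopwords, window=10):
        # Nodes are stopwords; each co-occurrence of two stopwords
        # within the window adds weight to their (undirected) edge,
        # with closer pairs contributing more.
        tokens = [t.strip(".,;:!?").lower() for t in text.split()]
        positions = [(i, t) for i, t in enumerate(tokens) if t in stopwords]
        graph = defaultdict(float)
        for a in range(len(positions)):
            i, u = positions[a]
            for b in range(a + 1, len(positions)):
                j, v = positions[b]
                d = j - i - 1            # words between the two stopwords
                if d >= window:
                    break                # later occurrences are even farther
                edge = tuple(sorted((u, v)))
                graph[edge] += 1.0 / (d + 1)  # smaller distance, more weight
        return graph

On the example sentence with stopwords {"the", "are", "in"} this reproduces the ordering from the previous slide: the adjacent pair (in, the) receives the largest weight.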
Kullback-Leibler Divergence
For P, Q discrete probability distributions:
KL(P, Q) = Σ_x P(x) · log( P(x) / Q(x) )
Properties:
(i) KL(P, Q) is non-negative
(ii) KL(P, Q) = 0 iff P = Q a.s.
(Proof: follows directly from Gibbs' inequality.)
Kullback-Leibler Divergence
Since the KL divergence is not symmetric, we use the symmetrized form:
KLsym(P, Q) = KL(P, Q) + KL(Q, P)
• The more similar P and Q, the smaller KLsym(P, Q)
Calculation of KL Divergence
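Since the worked calculation on this slide is not preserved, here is a hedged sketch of how the symmetrized divergence could be computed from two edge-weight graphs. Normalizing the weights into distributions over the union of edges and the eps-smoothing (which keeps the divergence finite when an edge is missing from one graph) are assumptions of mine, not details given in the paper.

    import math

    def symmetric_kl(graph_p, graph_q, eps=1e-9):
        # Turn each graph's edge weights into a probability
        # distribution over the union of edges, then return
        # KL(P, Q) + KL(Q, P).
        edges = set(graph_p) | set(graph_q)
        zp = sum(graph_p.values()) + eps * len(edges)
        zq = sum(graph_q.values()) + eps * len(edges)
        kl_pq = kl_qp = 0.0
        for e in edges:
            p = (graph_p.get(e, 0.0) + eps) / zp
            q = (graph_q.get(e, 0.0) + eps) / zq
            kl_pq += p * math.log(p / q)
            kl_qp += q * math.log(q / p)
        return kl_pq + kl_qp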
Experiments
• 571 stopwords
• 10 well-known English authors
• Books taken from Project Gutenberg
• Training corpus: 50,000 words
• Test corpus: 10,000 words
• Unclear what texts were used for what purpose… (an assumed split is sketched below)
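Since the slides leave the exact protocol open, here is a small sketch of how such a split could be produced from a downloaded Gutenberg book; taking the first 50,000 words for training and the next 10,000 for testing is an assumption of mine.

    def train_test_split(path, n_train=50000, n_test=10000):
        # Read a plain-text book and cut it into a training part
        # and a disjoint test part (assumed split; the slides do
        # not say how the corpora were actually cut).
        with open(path, encoding="utf-8") as f:
            words = f.read().split()
        return words[:n_train], words[n_train:n_train + n_test]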
Results
Observations/Thoughts
• Quality of the results is largely influenced by the training graph
• Which training graph should be used (e.g. for Twain)?
• Should the training graph change over time?
• Does the approach work for other languages?
• How well does it work for shorter texts?
Own implementation
• Python 3.4
• Is running (runtime to be improved!)
• (or was running before I tried to speed it up…)
• Small changes needed
• Waiting for more books to be downloaded so I can get more results
And finally…
• Algorithm fairly easy to reproduce
• (even though I had enough issues…)
• Blanks could be filled in with some common sense
• It was clear what to do, even though I would sometimes have loved an explanation of why…