  1. Deep Learning and Computational Authorship Attribution for Ancient Greek Texts The Case of the Attic Orators Mike Kestemont, Francesco Mambrini & Marco Passarotti Digital Classicist Seminar, Berlin, Germany 16 February 2016

  2. A “Golden Age” of oratory Athens, 5th-4th centuries BCE

  3. Even today! The preamble of the 2003 draft EU Constitution quotes Thucydides (II, 37): “Our Constitution ... is called a democracy because power is in the hands not of a minority but of the greatest number.”

  4. Oratory • Public speech: court, parliament, ceremonies • Often written by professional speech writers (‘orators’ vs. logographoi), hired by a third party (who might deliver the speech themselves) • Often survived in memory first, only later in a written tradition

  5. A canon of 10 names (dates BCE, number of transmitted speeches):
     Antiphon      ca 480-411   6*
     Andocides     ca 440-390   4*
     Lysias        ca 445-380   35*
     Isocrates     436-338      21
     Isaeus        ca 420-350   12
     Demosthenes   384-321      61*
     Aeschines     ca 390-322   3
     Hyperides     ca 390-322   6
     Lycurgus      ca 390-324   1
     Dinarchus     ca 360-290   3

  6. Demosthenes, Lysias: multiple genres, authenticity issues, professional writers

  7. Why the orators? • large corpus (600K+ words) • homogeneous chronology, genre and dialect • different personalities • interesting problems of authorship, and the effects of: • genre • patron • authenticity

  8. Computational Authorship Attribution • Stylometry: how a text is written • Metaphors: ‘fingerprint’, ‘stylome’, ‘stylistic DNA’ • (tendentious metaphors)

  9. • Young paradigm (1960s) • Mosteller & Wallace (US) • The Federalist papers (1780s) • Innovation on 2 levels: • Quantitative approach • Function words

  10. Traditional vs. Mosteller & Wallace
      Traditional: • Guesswork • Conspicuous features (e.g. odd verbs) • But the tradition (schools, workshops, forgeries, imitation, ...) makes such features unreliable, and the Attic orators have a rich tradition!
      Mosteller & Wallace: • Checklist • Inconspicuous features • Function words: articles (the, it, a), prepositions (on, from, to), pronouns (self, he), ...

  11. Advantages? • Many observations • The same feature set for all authors • Relatively content-independent

  12. Count the number of f’s on the following slide...

  13. Finished files are the result of years of scientific study combined with the experience of many years.

  14. How many?

  15. Do we process function words ‘subconsciously’? Finished files are the result of years of scientific study combined with the experience of many years.

  16. Which text is on the following slide?

  17. So?

  18. Difficult to spot errors...

  19. Unimportant?

  20. Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

  21. Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

  22. “Functors” • Function words = ‘grammatical morphemes’ • = “functors” in psycholinguistics • In English often individual words • In more inflected languages: often affixes • Easy (naive?) solution: n-grams

  23. N-grams • Intuitive concept: slices of length n • bigrams (n=2): ‘_b’, ‘bi’, ‘ig’, ‘gr’, ‘ra’, ‘am’, ‘ms’, ‘s_’ • Originally used in language identification • So far the best-performing feature type in authorship attribution • Sensitive to morphemic information (e.g. ‘s_’) • ‘Functional’ n-grams work best (incl. punctuation)
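A minimal sketch of this slicing in pure Python (the helper name and the underscore padding convention are ours, mirroring the slide's bigram example):

    from collections import Counter

    def char_ngrams(text, n=2):
        """Slice a text into overlapping character n-grams.
        Spaces become '_' so word boundaries show up as features."""
        padded = "_" + text.replace(" ", "_") + "_"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("bigrams", n=2))
    # ['_b', 'bi', 'ig', 'gr', 'ra', 'am', 'ms', 's_']

    # Tetragram (n=4) frequency profile, as used for the Greek corpus:
    freqs = Counter(char_ngrams("περὶ τῆς εἰρήνης", n=4))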

  24. character tetragrams (top 100) _ αὐτ - _ γὰρ - _ δʼ _ - _ δὲ _ - _ εἰς - _ κατ - _ καὶ - _ μὲν - _ μὴ _ - _ οὐ _ - _ οὐδ - _ οὐκ - _ παρ - _ περ - _ πολ - _ προ - _ πρὸ - _ πόλ - _ ταῦ - _ τού - _ τοὺ - _ τοῖ - _ τοῦ - _ τὰ _ - _ τὴν - _ τὸ _ - _ τὸν - _ τῆς - _ τῶν - _ τῷ _ - _ ἀλλ - _ ἀπο - _ ἂν _ - _ ἐν _ - _ ἐπι - _ ὡς _ - ίαν _ - ίας _ - αι _ τ - αὐτο - αὶ _ π - αὶ _ τ - γὰρ _ - δὲ _ τ - ειν _ - ερὶ _ - εἰς _ - εῖν _ - θαι _ - ι _ κα - ι _ το - καὶ _ - μένο - μενο - μὲν _ - ν _ αὐ - ν _ εἰ - ν _ κα - ν _ οὐ - ν _ πρ - ν _ το - ν _ ἐπ - ναι _ - νον _ - νος _ - ντα _ - ντας - ντες - ντων - νων _ - οις _ - ους _ - οὐκ _ - οὺς _ - οῖς _ - οῦτο - περὶ - πρὸς - ρὸς _ - ς _ κα - ς _ οὐ - ς _ το - σθαι - σιν _ - ται _ - τας _ - τες _ - τον _ - τος _ - τούτ - τοὺς - τοῖς - τοῦ _ - των _ - τὰς _ - τὴν _ - τὸν _ - τῆς _ - τῶν _ - ὶ _ το

  25. Advances, but many challenges • (Large) benchmark datasets (cf. PAN) • Cross-genre attribution (cf. suicide notes) • Document length (cf. tweets) • Separating content from style: • Function words work well for long texts • Mine stylistic information from content words too

  26. Artificial Intelligence (AI) Reproduce human intelligence in software

  27. Machine Learning • “Learning” is a central component of human intelligence • Optimise behaviour by anticipating the future • All applications map input to output • Huge recent advances via Deep Learning, a specific paradigm [LeCun et al. 2015]

  28. Deep Learning paradigm Layered neural networks

  29. ‘Shallow’ versus ‘Deep’

  30. Computer Vision Importance of layers

  31. Low-level features Used to be ‘handcrafted’!

  32. Higher-level features

  33. Analogies with the human brain, e.g. [Cadieu et al. 2014]

  34. Representation Learning • More ‘objective’ name • Networks learn to represent data • To large extent autonomously • (As opposed to ‘handcrafting’)

  35. Cat paper: 10 million 200x200 images from YouTube (1 week) [Le et al. 2012]

  36. Cat paper (2) [Le et al. 2012]

  37. How does it work? An analogy: chancellor elections (diagram: candidates C1-C3, faculties F1-F5)

  38. Every faculty gets a vote (diagram: each faculty F1-F5 connects to each candidate C1-C3)

  39. Votes get weighted: some faculties are more important (diagram: connections carry weights such as .25, .10, .05)

  40. A ‘dense’ layer (diagram: the full set of weighted connections from F1-F5 to C1-C3 forms a ‘dense’ layer)

  41. We add layers of ‘representation’: student union, professors, departments, library, ... get different weights too. Different sensitivities at different layers (students like free beers, librarians like free books, ...)

  42. Learning = optimising the weights (“lobbying” for a certain candidate)
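In network terms, the weighted vote of the last few slides is just a weighted sum, i.e. a matrix multiplication. A minimal numpy sketch (the vote and weight values are made up for illustration):

    import numpy as np

    # Five faculties each cast a "vote" (an activation), F1..F5.
    faculty_votes = np.array([0.9, 0.2, 0.4, 0.7, 0.1])

    # One row of weights per candidate C1..C3: how much each candidate
    # "listens" to each faculty (illustrative values only).
    weights = np.array([
        [0.25, 0.10, 0.10, 0.25, 0.30],  # C1
        [0.05, 0.40, 0.20, 0.10, 0.25],  # C2
        [0.10, 0.25, 0.25, 0.30, 0.10],  # C3
    ])

    # A dense layer computes exactly this weighted sum: scores = W @ x.
    scores = weights @ faculty_votes
    # Learning = nudging the entries of W until the desired candidate
    # gets the highest score.
    print(scores, "winner: C%d" % (scores.argmax() + 1))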

  43. Neural architecture (3 layers), bottom to top: input features → dense layer → highway layer → dense layer → softmax over the ten authors
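A minimal sketch of this stack in Keras. The hidden sizes and activations are our assumptions (the slide only names the layer types), and the highway layer [Srivastava et al. 2015] is written out by hand, since Keras has no built-in one:

    import tensorflow as tf
    from tensorflow.keras import layers

    class Highway(layers.Layer):
        """Highway layer: y = t * h(x) + (1 - t) * x."""
        def build(self, input_shape):
            dim = input_shape[-1]
            self.h = layers.Dense(dim, activation="tanh")     # candidate transform
            self.t = layers.Dense(dim, activation="sigmoid")  # transform gate
        def call(self, x):
            t = self.t(x)
            return t * self.h(x) + (1.0 - t) * x

    n_features, n_authors, hidden = 2000, 10, 128  # 2,000 MFI; ten orators

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        layers.Dense(hidden, activation="relu"),        # dense layer
        Highway(),                                      # highway layer
        layers.Dense(hidden, activation="relu"),        # dense layer
        layers.Dense(n_authors, activation="softmax"),  # softmax over authors
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # integer author labels
                  metrics=["accuracy"])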

  44. Networks are uncommon in stylometry (data size?) • Burrows’s Delta: a nearest-neighbour method, intuitive • Support Vector Machine: discriminative margin, ‘black magic’
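For comparison, a minimal sketch of Burrows’s Delta as such a nearest-neighbour method (the standard formulation: z-score relative frequencies over the most frequent items, then take the mean absolute difference; function and variable names are ours):

    import numpy as np

    def delta_attribute(candidates, disputed):
        """Attribute `disputed` to the candidate with the smallest Delta.
        candidates: dict author -> relative-frequency vector over the same MFI
        disputed:   relative-frequency vector for the disputed text"""
        authors = list(candidates)
        X = np.array([candidates[a] for a in authors])
        mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-9   # per-feature stats
        z, z_disp = (X - mu) / sigma, (disputed - mu) / sigma
        deltas = np.abs(z - z_disp).mean(axis=1)           # Burrows's Delta
        return authors[int(deltas.argmin())], dict(zip(authors, deltas))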

  45. Document-feature matrix (‘bag of words’ model): one row per text sample (Dem1, Dem2, Dem3, Lyc1, Lyc2, ..., Ant1, Ant2), one column per feature (tetragrams such as ‘_ αἰσ’, ‘_ βασ’, ‘γενέ’, ‘ημέν’, ‘εσθα’, ‘ες _ ἐ’, ‘ναῖο’, ‘ν _ ἀφ’); e.g. 2,000 columns (the number of most frequent items, MFI)
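A matrix like this can be assembled with scikit-learn’s CountVectorizer; a sketch (the text samples are placeholders):

    from sklearn.feature_extraction.text import CountVectorizer

    samples = {"Dem1": "...", "Dem2": "...", "Lys1": "..."}  # placeholder texts

    # Character tetragrams; "char_wb" pads words with spaces, so word
    # boundaries are kept. Keep the 2,000 most frequent items (MFI).
    vec = CountVectorizer(analyzer="char_wb", ngram_range=(4, 4),
                          max_features=2000)
    X = vec.fit_transform(samples.values()).toarray()  # documents x features
    X_rel = X / X.sum(axis=1, keepdims=True)           # relative frequencies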

  46. Experiment • Leave-one-text-out attribution • Non-disputed texts only • (But: class imbalance...) • Evaluation: accuracy, F1 (weighted), F1 (macro-averaged) • Different feature types + numbers of MFI
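A sketch of this leave-one-text-out protocol with scikit-learn; the SVM stands in for any of the classifiers compared, and X, y are placeholders for the real document-feature matrix and author labels:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score, f1_score

    X, y = np.random.rand(20, 2000), np.repeat(np.arange(10), 2)  # placeholders

    preds = np.empty_like(y)
    for train, test in LeaveOneOut().split(X):   # hold out one text at a time
        preds[test] = LinearSVC().fit(X[train], y[train]).predict(X[test])

    print("Acc  ", accuracy_score(y, preds))
    print("F1(w)", f1_score(y, preds, average="weighted"))  # weighted by class size
    print("F1(m)", f1_score(y, preds, average="macro"))     # unweighted class mean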

  47. Results: accuracy / weighted F1 / macro F1 (in %), per feature type and number of MFI

      words
                   200                    2,000                  20,000
              Acc    F1(w)  F1(m)    Acc    F1(w)  F1(m)    Acc    F1(w)  F1(m)
      Delta   76.04  75.50  50.22    60.00  60.70  34.66    23.20  29.83  15.57
      SVM     83.20  81.06  53.21    81.97  77.57  45.27    63.45  52.04  19.01
      Net     80.74  79.30  49.68    85.67  83.83  55.60    83.95  81.52  55.92

      character tetragrams
                   200                    2,000                  20,000
              Acc    F1(w)  F1(m)    Acc    F1(w)  F1(m)    Acc    F1(w)  F1(m)
      Delta   76.04  74.33  46.10    82.22  80.76  59.46    50.86  51.03  41.73
      SVM     79.75  77.69  48.54    84.44  81.42  54.46    78.02  72.37  39.15
      Net     79.50  78.37  48.16    85.92  84.29  60.41    84.69  81.38  46.38

  48. Results • The net produces the single highest score • But it is mostly on par with the SVM • (Delta does surprisingly well, but is never best) • The net is impressively robust to a large input space

  49. Visualization (PCA)

  50. Visualization (PCA)

  51. Visualization (Net)

  52. Visualization (Net)

  53. Features

  54. Features

  55. Features

  56. (Figure: highlighted samples Dem 7, Dem 58, Dem 60, Dem 61)

  57. Thank you! Deep Learning and Computational Authorship Attribution for Ancient Greek Texts The Case of the Attic Orators Mike Kestemont, Francesco Mambrini & Marco Passarotti Digital Classicist Seminar, Berlin, Germany 16 February 2016
