Exploring the internal heterogeneity of a corpus of Classical French with DiaCollo

Bryan Jurish (Berlin-Brandenburgische Akademie der Wissenschaften, jurish@bbaw.de)
Annette Gerstenberg (Freie Universität Berlin, annette.gerstenberg@fu-berlin.de)

Global Philology Open Conference, Universität Leipzig, 22nd February, 2016

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo

Overview

- The Situation
  - Diachronic (heterogeneous) Text Corpora
  - Collocation Profiling
  - Diachronic Collocation Profiling
- DiaCollo
  - Requests & Parameters
  - Profile, Diffs & Indices
- APWCF Corpus
  - Background
  - Subcorpora
  - Sources & Enrichments
- Examples
- Summary & Conclusion

The Situation: Diachronic Text Corpora

- heterogeneous text collections, especially with respect to date of origin
  - other partitionings may be relevant too, e.g. by genre, location, etc.
- increasing number available for linguistic & humanities research, e.g.
  - Deutsches Textarchiv (DTA) (Geyken 2013)
  - Referenzkorpus Altdeutsch (DDD) (Richling 2011)
  - Corpus of Historical American English (COHA) (Davies 2012)
- ... but even putatively “synchronic” corpora have a temporal extension, e.g.
  - DWDS/ZEIT (Kohl ∼ politician vs. “cabbage”) (1946–2016)
  - DDR Presseportal (Ausreise ∼ “departure”) (1945–1993)
  - DWDS/Blogs (“Browser”) (1994–2016)
- should expose temporal effects of e.g. semantic shift, discourse trends
- problematic for conventional natural language processing tools
  - implicit assumptions of homogeneity

The Situation: Collocation Profiling

“You shall know a word by the company it keeps” (J. R. Firth)

Basic Idea (Church & Hanks 1990; Manning & Schütze 1999; Evert 2005)
- look up all candidate collocates (w2) occurring with the target term (w1)
- rank candidates by association score
  - “chance” co-occurrences with high-frequency items must be filtered out!
  - statistical methods require a large data sample

What for?
- computational lexicography (Kilgarriff & Tugwell 2002; Didakowski & Geyken 2013)
- neologism detection (Kilgarriff et al. 2015)
- distributional semantics (Schütze 1992; Sahlgren 2006)
- “text mining” / “distant reading” (Heyer et al. 2006; Moretti 2013)

Diachronic Collocation Profiling

The Problem: (temporal) heterogeneity
- conventional collocation extractors assume corpus homogeneity
- co-occurrence frequencies are computed only for word-pairs (w1, w2)
- influence of occurrence date (and other document properties) is irrevocably lost

A Solution (sketch)
- represent terms as n-tuples of independent attributes (including occurrence date)
  - alternative: “document” level co-occurrences over sparse TDF matrix
- partition corpus on-the-fly into user-specified intervals (“date slices”, “epochs”)
- collect independent slice-wise profiles into final result set

Advantages
- full support for diachronic axis
- variable query-level granularity
- flexible attribute selection
- multiple association scores

Drawbacks
- sparse data requires larger corpora
- computationally expensive
- large index size
- no syntactic relations (yet)
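
The solution sketch above can be illustrated with a toy example. The following is a minimal sketch, not DiaCollo's actual implementation: the input format (a list of `(year, tokens)` pairs), the fixed token window, and the names `slice_of` and `slicewise_cofreqs` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def slice_of(year, slice_size):
    """Map a year to the lower bound of its date slice.

    A slice_size of 0 puts everything into one date-independent slice.
    """
    return 0 if slice_size == 0 else (year // slice_size) * slice_size


def slicewise_cofreqs(docs, target, slice_size, window=5):
    """Count collocates of `target` separately for each date slice.

    docs: iterable of (year, [token, ...]) pairs (hypothetical input format).
    Returns {slice_label: Counter({collocate: co-occurrence frequency})}.
    """
    profiles = defaultdict(Counter)
    for year, tokens in docs:
        label = slice_of(year, slice_size)
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            # Collect every token within `window` positions of the target.
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    profiles[label][tokens[j]] += 1
    return dict(profiles)


# Toy corpus: the occurrence date travels with each document.
docs = [
    (1644, ["la", "paix", "de", "Munster"]),
    (1644, ["la", "paix", "generale"]),
    (1647, ["traitte", "de", "paix"]),
]
profiles = slicewise_cofreqs(docs, "paix", slice_size=2)
for label in sorted(profiles):
    print(label, profiles[label].most_common(3))
```

Because the slice label is computed at query time from the stored date, the same data can be re-partitioned with a different `slice_size` without rebuilding anything.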

DiaCollo: Requests & Parameters

- Perl API, RESTful web-service (Fielding 2000) + web-form GUI
- accepts user requests as set of parameter=value pairs
- parameter passing via URL query string or HTTP POST request
- common parameters:

Parameter  Description
query      target collocant lemma(ta), regular expression, or DDC query
date       target date(s), interval, or regular expression
slice      epoch granularity or “0” (zero) for a date-independent profile
groupby    projected collocate attributes with optional restrictions
score      association score function for collocate ranking
kbest      maximum number of collocate items to return per epoch
diff       binary score comparison operation for diff profiles
global     request global profile pruning (vs. default epoch-local pruning)
profile    profile type to be computed ({native,tdf,ddc} × {unary,diff})
format     output format or visualization mode (e.g. TSV, JSON, HTML, d3-cloud, ...)
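
As a rough illustration of passing parameter=value pairs via a URL query string, the sketch below encodes a request with Python's standard library. The base URL is the one shown in Example 1; the parameter names come from the table above, but the concrete values (`paix`, `ld`, `json`, ...) and the assumption that the service accepts these long parameter names in exactly this spelling are illustrative only. No request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL as it appears in Example 1 of these slides.
BASE_URL = "http://kaskade.dwds.de/dstar/apwcf/diacollo/"


def build_request_url(base_url, params):
    """Encode a dict of parameter=value pairs as a URL query string."""
    return base_url + "?" + urlencode(params)


# Illustrative values only; none of them is taken from a documented request.
params = {
    "query": "paix",   # target collocant (hypothetical lemma)
    "slice": 0,        # 0 = date-independent profile
    "score": "ld",     # log-Dice association score
    "kbest": 10,       # at most 10 collocates per epoch
    "format": "json",  # assumed spelling of the JSON output format
}
url = build_request_url(BASE_URL, params)
print(url)

# Round trip: decoding the query string recovers the same pairs.
decoded = parse_qs(urlsplit(url).query)
```

The same dictionary could equally be sent as the body of an HTTP POST request, the second transport mentioned above.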

DiaCollo: Profiles, Diffs & Indices

Profiles & Diffs
- simple request → unary profile for target term(s) (profile, query)
  - filtered & projected to selected attribute(s) (groupby)
  - trimmed to k-best collocates for target word(s) (score, kbest, global)
  - aggregated into independent epoch-wise sub-intervals (date, slice)
- diff request → comparison of two independent targets (profile, bquery, ...)
  - highlights differences or similarities of target queries (diff)
  - can be used to compare different words (query ≠ bquery) ... or different corpus subsets w.r.t. a given word (e.g. date ≠ bdate)

Indices & Attributes
- compile-time filtering of native indices: frequency thresholds, PoS-tags
- default index attributes: Lemma (l), Pos (p)
- finer-grained queries possible with TDF or DDC back-ends
- batteries not included: corpus preprocessing, analysis, & full-text search index
  - see e.g. Jurish (2003); Geyken & Hanneforth (2006); Jurish et al. (2014), ...
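
The difference between a unary profile and a diff profile can be sketched on plain dictionaries. This is a simplified illustration under stated assumptions, not DiaCollo's actual pruning logic: profiles are modelled as `{collocate: score}` dicts, a collocate missing from one side is scored 0.0, and the diff profile keeps the items with the largest absolute score difference. The function names and the toy scores are hypothetical.

```python
def kbest(profile, k):
    """Trim a unary profile to its k highest-scoring collocates."""
    top = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


def diff_profile(prof_a, prof_b, k):
    """Compare two unary profiles item by item.

    Assumptions: a collocate absent from one profile counts as 0.0 there,
    and the k items with the largest absolute difference are kept.
    """
    items = set(prof_a) | set(prof_b)
    diffs = {w: prof_a.get(w, 0.0) - prof_b.get(w, 0.0) for w in items}
    # Sort by descending absolute difference; ties broken alphabetically.
    top = sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:k]
    return dict(top)


# Toy scores for two targets (e.g. two subcorpora for the same query).
prof_a = {"paix": 9.1, "guerre": 7.5, "roy": 6.0}
prof_b = {"paix": 8.9, "guerre": 3.0, "empereur": 5.5}

print(kbest(prof_a, 2))
print(diff_profile(prof_a, prof_b, 2))
```

Positive values mark collocates that are more strongly associated with the first target, negative values those more strongly associated with the second.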

DiaCollo: Scoring & Comparison Functions

Selected Association Score Functions
- f: raw collocation frequency $= f_{12}$
- lf: collocation log-frequency $= \log_2(f_{12} + \varepsilon)$
- mi1: pointwise mutual information $\approx \log_2 \frac{f_{12} \times N}{f_1 \times f_2}$
- milf: pointwise MI × log-frequency $\approx \log_2 \frac{f_{12} \times N}{f_1 \times f_2} \times \log_2 f_{12}$
- ll: log-likelihood (Dunning 1993) $\approx \operatorname{sgn}(f_{12} \mid f_1, f_2) \times \log(1 + \log \lambda)$
- ld: log-Dice coefficient (Rychlý 2008) $\approx 14 + \log_2 \frac{2 \times f_{12}}{f_1 + f_2}$

Selected Diff Operations
- diff: raw score difference $= s_a - s_b$
- adiff: absolute score difference $= |s_a - s_b|$
- avg: arithmetic average $= \frac{1}{2}(s_a + s_b)$
- max: maximum $= \max\{s_a, s_b\}$
- min: minimum $= \min\{s_a, s_b\}$
- havg: harmonic average $\approx \frac{2 \times s_a \times s_b}{s_a + s_b}$
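
Several of the formulas above translate directly into code. The sketch below follows the approximate (≈) forms as given on this slide; any smoothing the actual implementation may apply (such as the ε in lf) is left out, and the function names and the `DIFF_OPS` table are illustrative assumptions rather than DiaCollo's API.

```python
import math


def mi1(f1, f2, f12, n):
    """Pointwise mutual information: log2((f12 * N) / (f1 * f2))."""
    return math.log2((f12 * n) / (f1 * f2))


def milf(f1, f2, f12, n):
    """Pointwise MI weighted by the log-frequency of the pair."""
    return mi1(f1, f2, f12, n) * math.log2(f12)


def log_dice(f1, f2, f12):
    """log-Dice coefficient: 14 + log2(2 * f12 / (f1 + f2))."""
    return 14 + math.log2((2 * f12) / (f1 + f2))


# Binary comparison operations for diff profiles, keyed by the slide's names.
DIFF_OPS = {
    "diff": lambda a, b: a - b,
    "adiff": lambda a, b: abs(a - b),
    "avg": lambda a, b: (a + b) / 2,
    "max": max,
    "min": min,
    "havg": lambda a, b: 2 * a * b / (a + b),  # undefined when a + b == 0
}

# Small hand-checkable inputs.
print(mi1(4, 4, 2, 32))        # log2(64 / 16)
print(log_dice(10, 10, 10))    # the pair always co-occurs
print(DIFF_OPS["havg"](2.0, 6.0))
```

Note that log-Dice does not depend on the corpus size N, whereas mi1 and milf do, so their scores are not directly comparable across corpora of different sizes.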

APWCF: From diplomatic correspondence to a corpus of Classical French

- Classical French: same structures as modern French, but
  - linguistic norm is not yet stable
  - variation and patterns of usage of grammatical features
  - linguistic change on the levels of semantics and pragmatics
- Acta Pacis Westphalicae (1643–1648): The French correspondence
  - Diplomatic letters between Paris (government) and diplomats at Münster
  - Ambassadors are committed to achieving diplomatic goals → convincing the government to adapt the instructions
  - Diplomatic letters: formal constraints versus expressive needs
- Linguistic interest
  - Diachronic variation: comparison with existing resources of Classical French
  - Synchronic variation: genre-internal heterogeneity
- Genre-internal heterogeneity: hypothesis of different levels of formality
  - Two subcorpora: “government” (Paris) and “ambassadors” (Münster)
  - Register-variation reflected in the use of linguistic variables

APWCF: Correspondence

APWCF: Subcorpora

APWCF: Data

- Letters mostly conserved in French Archives
  - Archives du Ministère des Affaires Etrangères
  - microforms: Zentrum für Historische Forschung, Bonn
- Digital edition (PDF, XML)
  - Bayerische Staatsbibliothek
  - Zentrum für historische Forschung Bonn
- Linguistic corpus: AG
  - Part-of-Speech Tagging (PRESTO, Cologne/Lyon)
  - XML / TXM (Lyon)
- Corpus size: 8 volumes of French edition, 2.4M Tokens

APWCF: Transcription

Diplomatic transcription, spelling variants preserved
- traittés vs. traittéz vs. traittez
- estat (old) or état (mod.) as appearing in the manuscript
- Punctuation almost preserved, but ...

Modernized
- Some adaptations of punctuation
- u / v, i / j modernized
- Capitalization of proper names and titles
- Diacritics normalized (lavis → l’avis, francais → français)
- Abbreviated titles/words: full form

Examples

Example 1: ledict (chancellery style)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=mi1&f=html ...

query: doc.loc=Paris ∼ query: doc.loc=Münster
slice: 0 ∼ slice: 0
score: mi1
groupby: w=ledict

                         Paris       Münster
N                        2,462,443   2,462,443
f1                       746,786     1,153,939
f2 = f12                 284         414
score (mi1)              1.721       1.093
diff (Paris − Münster)   0.6278

- simple example uses “unigrams” comparison profile (f2 = f12)
- pointwise mutual information (mi1) score function
- “Paris” shows a definite preference for the archaic form ledict (chancellery style)
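
The scores in the table can be recomputed from the listed frequencies. The sketch below applies the approximate mi1 formula from the scoring slide to the figures above; it is a plausibility check under that assumption, not a reproduction of DiaCollo's internal computation, so small deviations in the last digit are to be expected.

```python
import math

# Frequencies as listed in the Example 1 table.
N = 2_462_443
f1_paris, f1_muenster = 746_786, 1_153_939
f12_paris, f12_muenster = 284, 414


def mi1(f1, f2, f12, n):
    """Approximate pointwise MI: log2((f12 * N) / (f1 * f2))."""
    return math.log2((f12 * n) / (f1 * f2))


# "Unigrams" profile: f2 is set equal to f12, so the f12 terms cancel
# and the score reduces to log2(N / f1).
score_paris = mi1(f1_paris, f12_paris, f12_paris, N)
score_muenster = mi1(f1_muenster, f12_muenster, f12_muenster, N)
score_diff = score_paris - score_muenster

print(f"Paris:   {score_paris:.4f}")
print(f"Münster: {score_muenster:.4f}")
print(f"diff:    {score_diff:.4f}")
```

The recomputed values agree with the table's 1.721, 1.093 and 0.6278 to within rounding.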
