
Exploring the internal heterogeneity of a corpus of Classical French with DiaCollo



  1. Exploring the internal heterogeneity of a corpus of Classical French with DiaCollo
     - Bryan Jurish, Berlin-Brandenburgische Akademie der Wissenschaften, jurish@bbaw.de
     - Annette Gerstenberg, Freie Universität Berlin, annette.gerstenberg@fu-berlin.de
     - Global Philology Open Conference, Universität Leipzig, 22nd February, 2016
     - 2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [1]

  2. Overview
     - The Situation
       - Diachronic (heterogeneous) Text Corpora
       - Collocation Profiling
       - Diachronic Collocation Profiling
     - DiaCollo
       - Requests & Parameters
       - Profile, Diffs & Indices
     - APWCF Corpus
       - Background
       - Subcorpora
       - Sources & Enrichments
     - Examples
     - Summary & Conclusion

  3. The Situation: Diachronic Text Corpora
     - heterogeneous text collections, especially with respect to date of origin
       - other partitionings may be relevant too, e.g. by genre, location, etc.
     - increasing number available for linguistic & humanities research, e.g.
       - Deutsches Textarchiv (DTA) (Geyken 2013)
       - Referenzkorpus Altdeutsch (DDD) (Richling 2011)
       - Corpus of Historical American English (COHA) (Davies 2012)
     - ... but even putatively “synchronic” corpora have a temporal extension, e.g.
       - DWDS/ZEIT (Kohl ∼ politician vs. “cabbage”) (1946–2016)
       - DDR Presseportal (Ausreise ∼ “departure”) (1945–1993)
       - DWDS/Blogs (“Browser”) (1994–2016)
     - should expose temporal effects of e.g. semantic shift, discourse trends
     - problematic for conventional natural language processing tools
       - implicit assumptions of homogeneity

  4. The Situation: Collocation Profiling
     - “You shall know a word by the company it keeps” (J. R. Firth)
     - Basic Idea (Church & Hanks 1990; Manning & Schütze 1999; Evert 2005)
       - look up all candidate collocates ($w_2$) occurring with the target term ($w_1$)
       - rank candidates by association score
         - “chance” co-occurrences with high-frequency items must be filtered out!
         - statistical methods require a large data sample
     - What for?
       - computational lexicography (Kilgarriff & Tugwell 2002; Didakowski & Geyken 2013)
       - neologism detection (Kilgarriff et al. 2015)
       - distributional semantics (Schütze 1992; Sahlgren 2006)
       - “text mining” / “distant reading” (Heyer et al. 2006; Moretti 2013)

  5. Diachronic Collocation Profiling
     - The Problem: (temporal) heterogeneity
       - conventional collocation extractors assume corpus homogeneity
       - co-occurrence frequencies are computed only for word-pairs ($w_1$, $w_2$)
       - influence of occurrence date (and other document properties) is irrevocably lost
     - A Solution (sketch)
       - represent terms as $n$-tuples of independent attributes (including occurrence date)
         - alternative: “document” level co-occurrences over sparse TDF matrix
       - partition corpus on-the-fly into user-specified intervals (“date slices”, “epochs”)
       - collect independent slice-wise profiles into final result set
     - Advantages
       - full support for diachronic axis
       - variable query-level granularity
       - flexible attribute selection
       - multiple association scores
     - Drawbacks
       - sparse data requires larger corpora
       - computationally expensive
       - large index size
       - no syntactic relations (yet)
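The solution sketched on this slide can be illustrated with a toy example. This is a minimal sketch only, not DiaCollo's native index: it uses the document-level co-occurrence alternative mentioned on the slide, and the function names and sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def slice_of(year, slice_size):
    """Map a year to the lower bound of its epoch.

    A slice_size of 0 puts everything into one date-independent slice.
    """
    return 0 if slice_size == 0 else (year // slice_size) * slice_size


def slicewise_cofreqs(docs, target, slice_size):
    """Count document-level co-occurrences with `target`, one profile per slice.

    `docs` is an iterable of (year, tokens) pairs (hypothetical input format).
    """
    profiles = defaultdict(Counter)
    for year, tokens in docs:
        types = set(tokens)
        if target in types:
            # every other word type in the document counts once as a collocate
            profiles[slice_of(year, slice_size)].update(types - {target})
    return profiles


# Invented toy documents, for illustration only
docs = [
    (1643, ["paix", "traité", "roi"]),
    (1644, ["paix", "roi"]),
    (1647, ["paix", "traité"]),
    (1648, ["guerre", "roi"]),
]

profiles = slicewise_cofreqs(docs, "paix", 5)
undated = slicewise_cofreqs(docs, "paix", 0)
```

Because the date is kept as an attribute of each document, the same data can be re-partitioned at query time simply by changing `slice_size`.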

  6. DiaCollo: Requests & Parameters
     - Perl API, RESTful web-service (Fielding 2000) + web-form GUI
     - accepts user requests as set of parameter=value pairs
     - parameter passing via URL query string or HTTP POST request
     - common parameters:

     | Parameter | Description |
     |-----------|-------------|
     | query     | target collocant lemma(ta), regular expression, or DDC query |
     | date      | target date(s), interval, or regular expression |
     | slice     | epoch granularity or “0” (zero) for a date-independent profile |
     | groupby   | projected collocate attributes with optional restrictions |
     | score     | association score function for collocate ranking |
     | kbest     | maximum number of collocate items to return per epoch |
     | diff      | binary score comparison operation for diff profiles |
     | global    | request global profile pruning (vs. default epoch-local pruning) |
     | profile   | profile type to be computed ({native,tdf,ddc} × {unary,diff}) |
     | format    | output format or visualization mode (e.g. TSV, JSON, HTML, d3-cloud, ...) |
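As a rough illustration of passing parameter=value pairs via the URL query string, the sketch below assembles a request URL with Python's standard library. The base URL is the one shown on the Example 1 slide. The helper name and the parameter values are hypothetical, and it is an assumption that the service accepts the long parameter names from the table as URL keys and these value spellings. No request is sent.

```python
from urllib.parse import urlencode

# Base URL as it appears on the Example 1 slide
BASE_URL = "http://kaskade.dwds.de/dstar/apwcf/diacollo/"


def build_request_url(params, base_url=BASE_URL):
    """Encode parameter=value pairs as a URL query string (hypothetical helper)."""
    # skip parameters that were left unset
    query_string = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{query_string}"


# Illustrative values only; parameter names follow the table on the slide
params = {
    "query": "ledict",
    "slice": 0,
    "score": "mi1",
    "kbest": 10,
    "format": "json",
    "date": None,  # unset, so it is omitted from the URL
}

url = build_request_url(params)
print(url)
```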

  7. DiaCollo: Profiles, Diffs & Indices
     - Profiles & Diffs
       - simple request → unary profile for target term(s) (profile, query)
         - filtered & projected to selected attribute(s) (groupby)
         - trimmed to $k$-best collocates for target word(s) (score, kbest, global)
         - aggregated into independent epoch-wise sub-intervals (date, slice)
       - diff request → comparison of two independent targets (profile, bquery, ...)
         - highlights differences or similarities of target queries (diff)
         - can be used to compare different words (query ≠ bquery) ... or different corpus subsets w.r.t. a given word (e.g. date ≠ bdate)
     - Indices & Attributes
       - compile-time filtering of native indices: frequency thresholds, PoS-tags
       - default index attributes: Lemma (l), Pos (p)
       - finer-grained queries possible with TDF or DDC back-ends
       - batteries not included: corpus preprocessing, analysis, & full-text search index
         - see e.g. Jurish (2003); Geyken & Hanneforth (2006); Jurish et al. (2014), ...

  8. DiaCollo: Scoring & Comparison Functions
     - Selected Association Score Functions
       - f: raw collocation frequency, $= f_{12}$
       - lf: collocation log-frequency, $= \log_2(f_{12} + \varepsilon)$
       - mi1: pointwise mutual information, $\approx \log_2 \frac{f_{12} \times N}{f_1 \times f_2}$
       - milf: pointwise MI × log-frequency, $\approx \log_2 \frac{f_{12} \times N}{f_1 \times f_2} \times \log_2 f_{12}$
       - ll: log-likelihood (Dunning 1993), $\approx \operatorname{sgn}(f_{12} \mid f_1, f_2) \times \log(1 + \log \lambda)$
       - ld: log-Dice coefficient (Rychlý 2008), $\approx 14 + \log_2 \frac{2 \times f_{12}}{f_1 + f_2}$
     - Selected Diff Operations
       - diff: raw score difference, $= s_a - s_b$
       - adiff: absolute score difference, $= |s_a - s_b|$
       - avg: arithmetic average, $= \frac{1}{2}(s_a + s_b)$
       - max: maximum, $= \max\{s_a, s_b\}$
       - min: minimum, $= \min\{s_a, s_b\}$
       - havg: harmonic average, $\approx \frac{2 \times s_a \times s_b}{s_a + s_b}$
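The score and diff functions on this slide can be written out directly. The following is a minimal sketch of the formulas as given, not DiaCollo's actual code: the slide marks most of them as approximations (≈), the value of ε is an assumption, and `ll` is left out because λ is not defined here.

```python
import math

EPS = 0.5  # assumed smoothing constant; the slide only names it ε


def lf(f12, eps=EPS):
    """Collocation log-frequency: log2(f12 + ε)."""
    return math.log2(f12 + eps)


def mi1(f1, f2, f12, n):
    """Pointwise mutual information: log2((f12 × N) / (f1 × f2))."""
    return math.log2((f12 * n) / (f1 * f2))


def milf(f1, f2, f12, n):
    """Pointwise MI weighted by the log-frequency of the pair."""
    return mi1(f1, f2, f12, n) * math.log2(f12)


def ld(f1, f2, f12):
    """Log-Dice coefficient: 14 + log2((2 × f12) / (f1 + f2))."""
    return 14 + math.log2((2 * f12) / (f1 + f2))


# Binary comparison operations on two scores s_a and s_b
DIFF_OPS = {
    "diff": lambda a, b: a - b,
    "adiff": lambda a, b: abs(a - b),
    "avg": lambda a, b: (a + b) / 2,
    "max": max,
    "min": min,
    "havg": lambda a, b: 2 * a * b / (a + b),
}
```

Keeping the diff operations in a lookup table mirrors how a `diff` request parameter could select one of them by name.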

  9. APWCF: From diplomatic correspondence to a corpus of Classical French
     - Classical French: same structures as modern French, but
       - linguistic norm is not yet stable
       - variation and patterns of usage of grammatical features
       - linguistic change on the levels of semantics and pragmatics
     - Acta Pacis Westphalicae (1643–1648): The French correspondence
       - Diplomatic letters between Paris (government) and diplomats at Münster
       - Ambassadors are committed to achieving diplomatic goals; convincing the government to adapt the instructions
       - Diplomatic letters: formal constraints versus expressive needs
     - Linguistic interest
       - Diachronic variation: comparison with existing resources of Classical French
       - Synchronic variation: genre-internal heterogeneity
     - Genre-internal heterogeneity: hypothesis of different levels of formality
       - Two subcorpora: “government” (Paris) and “ambassadors” (Münster)
       - Register-variation reflected in the use of linguistic variables

  10. APWCF: Correspondence

  11. APWCF: Subcorpora

  12. APWCF: Data
      - Letters mostly conserved in French Archives
        - Archives du Ministère des Affaires Etrangères
        - microforms: Zentrum für Historische Forschung, Bonn
      - Digital edition (PDF, XML)
        - Bayerische Staatsbibliothek
        - Zentrum für historische Forschung Bonn
      - Linguistic corpus: AG
        - Part-of-Speech Tagging (PRESTO, Cologne/Lyon)
        - XML / TXM (Lyon)
      - Corpus size: 8 volumes of French edition, 2.4M Tokens

  13. APWCF: Transcription
      - Diplomatic transcription, spelling variants preserved
        - traittés vs. traittéz vs. traittez
        - estat (old) or état (mod.) as appearing in the manuscript
        - Punctuation almost preserved, but ...
      - Modernized
        - Some adaptations of punctuation
        - u / v, i / j modernized
        - Capitalization of proper names and titles
        - Diacritics normalized (lavis → l’avis, francais → français)
        - Abbreviated titles/words: full form

  14. Examples

  15. Example 1: ledict (chancellery style)
      - http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=mi1&f=html ...
      - Request:
        - query: doc.loc=Paris
        - slice: 0
        - ∼query: doc.loc=Münster
        - ∼slice: 0
        - score: mi1
        - groupby: w=ledict

      |                        | Paris     | Münster   |
      |------------------------|-----------|-----------|
      | $N$                    | 2,462,443 | 2,462,443 |
      | $f_1$                  | 746,786   | 1,153,939 |
      | $f_2 = f_{12}$         | 284       | 414       |
      | score (mi1)            | 1.721     | 1.093     |
      | diff (Paris - Münster) | 0.6278    |           |

      - simple example uses “unigrams” comparison profile ($f_2 = f_{12}$)
      - pointwise mutual information (mi1) score function
      - “Paris” shows a definite preference for the archaic form ledict (chancellery style)
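The scores in this example can be checked by hand. Because the comparison profile is a unigram profile with $f_2 = f_{12}$, the mi1 formula from the scoring slide reduces to $\log_2(N / f_1)$. The derivation below is a sketch that ignores any smoothing the actual implementation may apply:

```latex
\begin{align*}
\mathrm{mi1} &\approx \log_2 \frac{f_{12} \times N}{f_1 \times f_2}
              = \log_2 \frac{N}{f_1} \qquad (f_2 = f_{12}) \\
\mathrm{mi1}_{\text{Paris}} &\approx \log_2 \frac{2{,}462{,}443}{746{,}786}
              \approx \log_2 3.2974 \approx 1.721 \\
\mathrm{mi1}_{\text{Münster}} &\approx \log_2 \frac{2{,}462{,}443}{1{,}153{,}939}
              \approx \log_2 2.1339 \approx 1.093 \\
\mathrm{diff} &= \mathrm{mi1}_{\text{Paris}} - \mathrm{mi1}_{\text{Münster}}
              \approx 1.721 - 1.093 \approx 0.628
\end{align*}
```

The result agrees with the reported diff of 0.6278 up to rounding of the intermediate scores. Under this reduction the score depends only on $N / f_1$, so the positive diff reflects the different $f_1$ values of the two subcorpora.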
