Historical Treebanks The Penn Historical Corpora and the Icelandic - PowerPoint PPT Presentation

Historical Treebanks
The Penn Historical Corpora and the Icelandic Historical Parsed Corpus

The Penn Historical Corpora
• Consist of:
  - the Penn-Helsinki Parsed Corpus of Middle English, 2nd edition (PPCME2) (1150-1500)
  - the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) (1500-1710)
  - the Penn Parsed Corpus of Modern British English (PPCMBE) (1700-1914)
  - the Parsed Corpus of Early English Correspondence (PCEEC)

People
Tony Kroch and Beatrice Santorini, together with Ann Taylor, Susan Pintzuk, and the people behind the Helsinki Corpus, among others

Icelandic Parsed Historical Corpus (IcePaHC)
Wallenberg, Joel C., Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. Version 0.5. http://www.linguist.is/icelandic_treebank

IcePaHC
• Guidelines are based on, and supplement, the Penn historical corpora guidelines.
• Texts range in time from the 12th century to modern times.
• There are fewer really old texts, so these are covered in full. Later texts are sampled partially.
• Begins with Fyrsta málfræðiritgerðin (The First Grammatical Treatise) from the 12th century.

Philosophy and Goals 1
• The aim is to create an annotation system that facilitates automated searches, not to give a correct linguistic analysis of each sentence.
• If a construction can be found unambiguously through a combination of properties of a bracketed sentence, the annotation may not contain all of the structure that a full phrase structure diagram of the sentence would have.

Philosophy and Goals 2
• Information is to be added in a monotonic way.
• Future revisions of the bracketed structures should always add information, never change it.
• Hence avoid subjective judgments, since they are extremely error-prone:
  - no distinguishing adjectival from verbal passive participles
  - no argument-adjunct distinction

Philosophy and Goals 3
• As many categories as possible should have clear meanings, so that unclear cases can be relegated to a small number of categories of residual cases.
• The price of making most categories homogeneous is that these residual categories will not be.
• Future revisions of the corpus may make it possible to divide some of these residual categories into homogeneous subcategories.

Philosophy and Goals 4
• Avoid making decisions that would be controversial, whether with regard to text interpretation or to linguistic theory.
• In doubtful cases, either avoid specifying structure, or use default rules to decide the case for search purposes.
  - VPs are normally not indicated in the corpus, since VP boundaries are normally indeterminate.
  - PP attachment: whenever it is unclear where a PP attaches, attach it by default as high as possible.

Icelandic and English treebanks
• The Icelandic treebank guidelines try to hew to the Penn Historical Treebank guidelines and to their overall decisions concerning the organization of the treebank, diverging only where the languages differ.
• This makes crosslinguistic comparisons easy to identify and document.

Layout
Each text in the corpus comes in three different formats, each with a characteristic filename extension:
• text (.txt)
• part-of-speech (POS) tagged (.pos)
• parsed (.psd)

The .txt file
<P_2>
<heading>
I . (CMMALORY,2.3)
Merlin (CMMALORY,2.4)
</heading>
HIT befel in the dayes of Uther Pendragon , when he was kynge of all Englond and so regned , that there was a myghty duke in Cornewaill that helde warre ageynst hym long tyme . (CMMALORY,2.6)
and the duke was called the duke of Tyntagil . (CMMALORY,2.7)
And so by meanes kynge Uther send for this duk chargyng hym to brynge his wyf with hym . (CMMALORY,2.8)

The .pos file
<P_2>_CODE
<heading>_CODE
I_NUM ._. CMMALORY,2.3_ID
Merlin_NPR CMMALORY,2.4_ID
</heading>_CODE
HIT_PRO befel_VBD in_P the_D dayes_NS of_P Uther_NPR Pendragon_NPR ,_, when_P he_PRO was_BED kynge_N of_P all_Q Englond_NPR and_CONJ so_ADV regned_VBD ,_, that_C there_EX was_BED a_D myghty_ADJ duke_N in_P Cornewaill_NPR that_C helde_VBD warre_N ageynst_P hym_PRO long_ADJ tyme_N ._. CMMALORY,2.6_ID
and_CONJ the_D duke_N was_BED called_VAN the_D duke_N of_P Tyntagil_NPR ._. CMMALORY,2.7_ID
And_CONJ so_ADV by_P meanes_NS kynge_NPR Uther_NPR send_VBD for_P this_D duk_N chargyng_VAG hym_PRO to_TO brynge_VB his_PRO$ wyf_N with_P hym_PRO ._. CMMALORY,2.8_ID
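The word_TAG layout shown on this slide can be read with a few lines of Python. This is only a sketch, assuming that tokens are separated by whitespace and that the tag follows the last underscore of each token; the function name `parse_pos_line` is hypothetical and is not part of any corpus tool.

```python
def parse_pos_line(text):
    """Split whitespace-separated word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore, so that <P_2>_CODE keeps "<P_2>" whole.
        word, sep, tag = token.rpartition("_")
        if not sep:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, tag))
    return pairs


sample = ("and_CONJ the_D duke_N was_BED called_VAN the_D duke_N "
          "of_P Tyntagil_NPR ._. CMMALORY,2.7_ID")
pairs = parse_pos_line(sample)

# Drop bookkeeping tokens (IDs and CODE markup) to recover the running text.
words = [word for word, tag in pairs if tag not in ("ID", "CODE")]
print(" ".join(words))
```

Splitting on the last underscore rather than the first is the one design choice that matters here, because markup tokens such as `<P_2>_CODE` contain an underscore of their own.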

The .psd file
Parsed files have the extension .psd. Each token is enclosed, together with its ID, in a set of unlabelled parentheses.
( (CODE <P_2>))
( (CODE <heading>))
( (NUMP (NUM I) (. .)) (ID CMMALORY,2.3))
( (NP (NPR Merlin)) (ID CMMALORY,2.4))
( (CODE </heading>))
( (IP-MAT (CONJ and)
          (NP-SBJ-1 (D the) (N duke))
          (BED was)
          (VAN called)
          (IP-SMC (NP-SBJ *-1)
                  (NP-OB1 (D the) (N duke)
                          (PP (P of)
                              (NP (NPR Tyntagil)))))
          (. .))
  (ID CMMALORY,2.7))
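The bracketed format is simple enough to load with a small recursive reader. The sketch below is not CorpusSearch code; it assumes that labels and words never contain whitespace or parentheses, and the names `read_tree` and `leaves` are hypothetical.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def read_tree(text):
    """Parse one bracketed corpus token into nested lists: [label, child, ...]."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return node

    return parse()


def leaves(node):
    """Yield (tag, text) for every terminal node, in text order."""
    if len(node) == 2 and isinstance(node[0], str) and isinstance(node[1], str):
        yield (node[0], node[1])
    else:
        for child in node:
            if isinstance(child, list):
                yield from leaves(child)


sample = ("( (IP-MAT (CONJ and) (NP-SBJ-1 (D the) (N duke)) (BED was) "
          "(VAN called) (IP-SMC (NP-SBJ *-1) (NP-OB1 (D the) (N duke) "
          "(PP (P of) (NP (NPR Tyntagil))))) (. .)) (ID CMMALORY,2.7))")
tree = read_tree(sample)
print(tree[0][0])         # label of the parsed clause
print(list(leaves(tree))) # terminals, including the trace and the ID
```

The unlabelled outer wrapper becomes a list whose two elements are the clause and the ID node, so `tree[0]` is the syntactic tree and `tree[1]` is the ID.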

Tags and Dash Tags
• Tags: ADJP, ADVP, CP, FOREIGN, IP, NP, NUMP, PP, QP, W*P
• Dash Tags:
  - CP-CLF, CP-DEG, CP-EOP, CP-EXL, CP-QUE, CP-REL, CP-THT, CP-TMC
  - IP-ABS, IP-INF, IP-MAT, IP-PPL, IP-SMC, IP-SUB
  - NP-OB1, NP-OB2, NP-SBJ, NP-VOC, NP-TMP

Empty Categories
• 0 - empty operator
• *arb* - arbitrary PRO
• *con* - subject elided under conjunction
• *exp* - expletive subject
• *pro* - pro subject
• *ICH* - trace of movement that is neither A- nor A-bar movement
• *T* - trace of A-bar movement
• * - trace of A-movement
• -# - a numeric index that marks co-indexation between an XP and an empty category (e.g. NP-SBJ-1 and *-1)
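For scripting purposes, the empty categories listed above can be recognised with a single regular expression. This is a sketch under the assumption that the co-index is written as a hyphen plus digits after the empty category, as in the `*-1` of the .psd example; `empty_category` is a hypothetical helper name.

```python
import re

# The zero operator, a named empty category such as *T*, or a bare A-trace (*),
# optionally followed by a numeric co-index.
EMPTY = re.compile(r"(0|\*(?:arb|con|exp|pro|ICH|T)\*|\*)(?:-(\d+))?")


def empty_category(text):
    """Return (kind, index) for an empty category, or None for an overt word."""
    match = EMPTY.fullmatch(text)
    if match is None:
        return None
    kind, index = match.group(1), match.group(2)
    return (kind, int(index) if index else None)


for leaf in ["*-1", "*T*-2", "*con*", "0", "duke"]:
    print(leaf, empty_category(leaf))
```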

English vs. Icelandic
• Case information is not marked for the most part in English.
• Case information is represented explicitly in Icelandic at the word level but not at the phrase level: (NP-SBJ (PRO-D þér-þú))
  - Case information is marked on nouns, determiners, adjectives and participial verbs.
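A word-level node such as (PRO-D þér-þú) can be unpacked into tag, case, word form and lemma. The sketch below assumes the node shape TAG-CASE form-lemma seen in that example. Only the -D suffix appears on the slide; the other three case letters in `CASES`, and the helper name `split_leaf`, are assumptions made for illustration.

```python
# Only "D" is attested in the slide's example; N, A and G are assumed here.
CASES = {"N": "nominative", "A": "accusative", "D": "dative", "G": "genitive"}


def split_leaf(tag, text):
    """Return (base_tag, case, form, lemma) for an overt word-level node."""
    base, sep, suffix = tag.rpartition("-")
    if sep and suffix in CASES:
        case = CASES[suffix]
    else:
        base, case = tag, None  # no case suffix on this tag
    form, sep, lemma = text.rpartition("-")
    if not sep:
        form, lemma = text, None  # no lemma attached to this word
    return base, case, form, lemma


print(split_leaf("PRO-D", "þér-þú"))
print(split_leaf("VBD", "befel"))
```

The function is meant for overt words only; a trace such as `*-1` also contains a hyphen and should be filtered out before calling it.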

CorpusSearch
http://corpussearch.sourceforge.net/
- a Java program for searching annotated corpora
- finds and counts lexical and syntactic configurations of any complexity
- can also be used for corpus development
- uses syntactic annotation in Penn Treebank format

CorpusSearch
The Penn Historical Corpora and IcePaHC come bundled with CorpusSearch. There is also a web interface that comes with the DIGS_WORKSHOP demo.

CorpusSearch
node: IP-SUB
query: IP-SUB idoms NP-OB*
NP-OB* matches anything that begins with NP-OB.
node: IP*
query: (IP* idoms NP-SBJ) AND (NP-SBJ idoms \*T*)
Traces are marked by * (e.g. *T*), but * is a special character and hence must be "escaped" with \.
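The wildcard and escaping behaviour described on this slide can be imitated in Python by translating a search pattern into a regular expression. This is an illustration of the idea only, not CorpusSearch's own implementation: it models just the `*` wildcard and the backslash escape, and `pattern_to_regex` and `matches` are hypothetical names.

```python
import re


def pattern_to_regex(pattern):
    """Translate a label pattern: * is a wildcard, backslash makes a character literal."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped character, taken literally
            i += 2
        elif ch == "*":
            parts.append(".*")  # wildcard: any run of characters
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, label):
    """True if the whole label fits the pattern."""
    return pattern_to_regex(pattern).fullmatch(label) is not None


print(matches("NP-OB*", "NP-OB1"))   # wildcard after NP-OB
print(matches(r"\*T*", "*T*-1"))     # literal asterisk, then wildcard
print(matches(r"\*T*", "NP-SBJ"))
```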

CorpusSearch
Naming in CorpusSearch: search patterns are treated like names, e.g. if you re-use NP*, then all uses refer to the same element.
query: (IP* idoms NP*) AND (NP* idoms D)
node: IP*
query: (IP* idoms NP-OB*) AND (IP* idoms NP-SBJ) AND (NP-SBJ precedes NP-OB*)

CorpusSearch
Naming nodes:
node: IP*
query: (IP* idoms [1]NP-*) AND (IP* idoms [2]NP-*) AND ([1]NP-* precedes [2]NP-*)
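To make the meaning of the query with numbered nodes concrete, the same condition can be spelled out over a tree held as nested lists. This is a sketch of what the query asks for, not of how CorpusSearch evaluates it. The tree is the CMMALORY,2.7 token from the .psd slide, `fnmatchcase` stands in for CorpusSearch wildcards, and `subtrees` and `ordered_np_pairs` are hypothetical names.

```python
from fnmatch import fnmatchcase
from itertools import combinations

# The CMMALORY,2.7 token as nested lists: [label, child, ...].
tree = ["IP-MAT",
        ["CONJ", "and"],
        ["NP-SBJ-1", ["D", "the"], ["N", "duke"]],
        ["BED", "was"],
        ["VAN", "called"],
        ["IP-SMC",
         ["NP-SBJ", "*-1"],
         ["NP-OB1", ["D", "the"], ["N", "duke"],
          ["PP", ["P", "of"], ["NP", ["NPR", "Tyntagil"]]]]],
        [".", "."]]


def subtrees(node):
    """Yield the node itself and every non-terminal or terminal below it."""
    yield node
    for child in node[1:]:
        if isinstance(child, list):
            yield from subtrees(child)


def ordered_np_pairs(root):
    """Find IP* nodes with two distinct NP-* daughters, the first preceding the second."""
    hits = []
    for node in subtrees(root):
        if not fnmatchcase(node[0], "IP*"):
            continue
        # Immediate daughters only ("idoms"), kept in left-to-right order.
        nps = [child[0] for child in node[1:]
               if isinstance(child, list) and fnmatchcase(child[0], "NP-*")]
        # combinations() preserves order, so the first member precedes the second.
        hits.extend((node[0], first, second) for first, second in combinations(nps, 2))
    return hits


print(ordered_np_pairs(tree))
```

The numbered prefixes [1] and [2] correspond to the two distinct members of each pair; without them, both mentions of NP-* would name the same node and the precedence condition could never hold.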

CorpusSearch
Negation in CorpusSearch: ! is placed after the relation name, directly before the argument it negates.
node: IP*
query: (IP* idoms V*) AND (V* iPrecedes !NP-OB1)
This means that V* does not immediately precede NP-OB1 (and precedes something else).
node: IP-SUB
query: IP-SUB idoms !NP-OB*

Case Studies
• Historical Stability of Dative Subjects in Icelandic (Ingason, Wallenberg & Sigurðsson)
• The analysis of Heavy NP shift and Auxiliary contraction (Ingason & MacKenzie)
