Corpus Construction and Annotation


1. Corpus Construction and Annotation

Why are annotated corpora important for computational linguists?
- training and evaluation of NLP tools
  - classification (POS, word sense)
  - parsing (syntactic structure)
  - extraction (named entity, semantic role, coreference, events)
- they make it possible to search for particular linguistic phenomena

2. Annotation Process

- target phenomena
- corpus selection
- annotation efficiency and consistency (annotation infrastructure)
- annotation evaluation

3. Annotation Process: Target Phenomena

- What linguistic phenomena do you want to annotate?
- Do we really need manual annotation, or can it be done automatically?
- What resources and prior annotation are needed?
  - syntactic annotation often depends on POS annotation
  - semantic annotation often depends on syntactic annotation

4. Annotation Process: Corpus Selection

- What data do you want to annotate?
- Written or spoken data? (Transcribed spoken data?)
- A single genre or mixed genres? A representative sampling of genres?
- How much data do you need to find enough examples of your phenomena?
  - a 1-million-word corpus doesn't always contain enough occurrences of
    particular words for semantic role labelling or sense tagging
    (see the sketch below)
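To get a feel for the sparsity problem, here is a minimal sketch (assuming
NLTK and its copy of the 1-million-word Brown Corpus; the choice of words
is arbitrary) that counts how often a few verbs occur. Frequent verbs have
thousands of tokens, while rarer ones drop to a handful, too few to train
a sense tagger on.

    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)  # fetch the corpus on first use

    freq = nltk.FreqDist(w.lower() for w in brown.words())
    for word in ["say", "expect", "baffle"]:
        print(word, freq[word])  # rarer words yield only a handful of tokens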

5. Annotation Process: Efficiency and Consistency

- How difficult is the annotation task?
- What kind of annotation guidelines need to be written?
  - OntoNotes verb sense annotation: 11 pages
  - Penn Treebank syntactic annotation guidelines: 300 pages
- How much training do the annotators need? several weeks? a degree in linguistics?
- How consistent are the annotators?
  - are errors due to carelessness/fatigue, or to a lack of clear guidelines?

6. Annotation Process: Evaluation

Two perspectives:
- external: does this annotation improve performance on a particular task?
- internal: do human annotators agree with each other?
  → inter-annotator agreement (IAA)
  - simplest method: percentage of cases where the annotators agree
  - more meaningful measures take into account how often the annotators
    would have agreed by chance, e.g. Cohen's kappa (see the sketch below)
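As an illustration, here is a minimal sketch (not from the slides; the
labels are made-up POS tags) of both measures: raw percent agreement and
Cohen's kappa, the standard chance-corrected statistic.

    from collections import Counter

    def percent_agreement(a, b):
        """Fraction of items on which the two annotators chose the same label."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        """Agreement corrected for how often annotators would agree by chance."""
        n = len(a)
        p_observed = percent_agreement(a, b)
        # chance agreement estimated from each annotator's label distribution
        counts_a, counts_b = Counter(a), Counter(b)
        p_chance = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
        return (p_observed - p_chance) / (1 - p_chance)

    ann1 = ["NN", "VB", "NN", "JJ", "NN", "VB"]   # hypothetical annotations
    ann2 = ["NN", "VB", "JJ", "JJ", "NN", "NN"]
    print(percent_agreement(ann1, ann2))  # 0.67 (4 of 6 labels match)
    print(cohens_kappa(ann1, ann2))       # ~0.48

Kappa is noticeably lower than raw agreement here because the annotators
would often agree on the frequent NN label by chance alone.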

7. A Brief Look at Existing Corpora

There are many different types of corpora, including corpora with:
- large amounts of representative data
- syntactic annotation
- semantic annotation
  - word senses
  - semantic roles
- temporal annotation

8. Focus on Large Amounts of Representative Data

large amounts of data = no annotation, or only automatic POS tagging

A few English and German corpora:
- 1960s: Brown Corpus (1 million words, American English, written)
- 1980s: Lancaster/Oslo/Bergen Corpus (1 million words, British English, written)
- British National Corpus (100 million words, 10% spoken data)
  - BNC Sampler (2 million words, more detailed hand-checked tags)
- American National Corpus (22 million words, written and spoken)
- DeReKo German Reference Corpus (4 billion words)

9. British National Corpus

Uses inline annotation: the annotation is inserted into the text.

    <head>
    <s n="1"><w NN2>Surnames <w CJC>and <w DPS>their <w NN2>meanings
    </head>
    <p>
    <s n="2"><w AT0>The <w NN1>study <w PRF>of <w AT0>the <w NN2>surnames
    <w PRF>of <w NN0>people <w VVG>living <w PRP-AVP>in <w AT0>a <w NN1>place
    <w VM0>can <w VBI>be <w CRD>one <w PRF>of <w AT0>the <w AV0>most
    <w AJ0>time-consuming<c PUN>, <w AJ0>frustrating<c PUN>,
    <w VVG-AJ0>baffling<c PUN>, <w CJC>but <w AJ0-VVG>rewarding
    <w NN2>activities <w AT0>a <w NN1>researcher <w VM0>can
    <w VVI>undertake<c PUN>.
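Because the annotation is inline, token/tag pairs can be recovered with
simple pattern matching. A minimal sketch (an illustration, not a BNC tool):

    import re

    sample = '<s n="1"><w NN2>Surnames <w CJC>and <w DPS>their <w NN2>meanings'

    # each token is introduced by "<w TAG>" (words) or "<c TAG>" (punctuation)
    pairs = re.findall(r"<[wc] ([A-Z0-9-]+)>([^<]+)", sample)
    print([(tag, word.strip()) for tag, word in pairs])
    # [('NN2', 'Surnames'), ('CJC', 'and'), ('DPS', 'their'), ('NN2', 'meanings')]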

10. American National Corpus

Uses standoff annotation: the text and the annotation are kept in separate files.

One file contains the text (with a little document structure):

    <turn who="A" start="54.18" end="55.32" id="t1">
      <u id="t1u1">So, how are you?</u>
    </turn>

Another file contains the linguistic annotation, which points back into the
text by character offsets:

    <chunk type="utterance" xml:base="#t1u1">
      <tok xlink:href="xpointer(string-range('', 0, 2))">
        <msd>ql++++</msd>
        <base>so</base>
      </tok>
      <tok xlink:href="xpointer(string-range('', 2, 3))">
        <base>,</base>
        <msd>,+clp+++</msd>
      </tok>
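The standoff idea itself is easy to sketch: the text lives in one place,
and annotations refer to it only by character offsets. A toy illustration
(the labels here are invented, not the ANC's msd codes):

    # the primary data: plain text, stored once and never modified
    text = "So, how are you?"

    # standoff annotations: (start, end, label) spans over `text`
    annotations = [(0, 2, "RB"), (2, 3, ","), (4, 7, "WRB")]

    for start, end, tag in annotations:
        print(repr(text[start:end]), tag)  # 'So' RB / ',' , / 'how' WRB

One advantage of this design is that several annotation layers can point at
the same immutable text without interfering with each other.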

11. Treebanks: Syntactic Annotation

A few corpora with syntactic annotation:
- Penn Treebank (1 million words, WSJ)
- NEGRA/TIGER (20,000/50,000 sentences)
- TüBa-D/Z (50,000 sentences)
- Prague Dependency Treebank (2 million words)

12. Penn Treebank Format

Syntactic phrase structure is annotated with parentheses:

    (S (NP Casey)
       (AUX should)
       (VP have
           (VP thrown
               (NP the ball))))
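Strings in this bracketed format can be loaded with off-the-shelf tools;
for example, a minimal sketch using NLTK (assuming it is installed):

    from nltk import Tree

    s = "(S (NP Casey) (AUX should) (VP have (VP thrown (NP the ball))))"
    tree = Tree.fromstring(s)
    tree.pretty_print()   # renders the structure as an ASCII tree
    print(tree.leaves())  # ['Casey', 'should', 'have', 'thrown', 'the', 'ball']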

13. Penn Treebank Format: Converted to Tree

[Tree diagram of the bracketed structure above: S dominates NP (Casey),
AUX (should), and a VP headed by "have", which contains VP (thrown) and
NP (the ball).]

14. Penn Treebank Format

Separate files for:
- raw text
- POS tags
- parses

15. Penn Treebank Format

Raw text:

    The forest-products concern currently has about 38 million shares outstanding.

POS-tagged:

    [ The/DT forest-products/NNS concern/NN ]
    currently/RB has/VBZ about/RB
    [ 38/CD million/CD shares/NNS ]
    outstanding/JJ ./.

Parsed:

    ( (S (NP-SBJ The forest-products concern)
         (ADVP-TMP currently)
         (VP has
             (NP (NP (QP about 38 million)
                     shares)
                 (ADJP outstanding)))
         .))

16. Penn Treebank Format: Merged

    ( (S (NP-SBJ (DT The) (NNS forest-products) (NN concern))
         (ADVP-TMP (RB currently))
         (VP (VBZ has)
             (NP (NP (QP (RB about) (CD 38) (CD million))
                     (NNS shares))
                 (ADJP (JJ outstanding))))
         (. .)))
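Since the merged format contains both POS tags and structure, both views
can be recovered from a single file. A minimal sketch with NLTK (the extra
outer bracket is stripped with remove_empty_top_bracketing):

    from nltk import Tree

    merged = """( (S (NP-SBJ (DT The) (NNS forest-products) (NN concern))
         (ADVP-TMP (RB currently))
         (VP (VBZ has)
             (NP (NP (QP (RB about) (CD 38) (CD million)) (NNS shares))
                 (ADJP (JJ outstanding))))
         (. .)))"""

    tree = Tree.fromstring(merged, remove_empty_top_bracketing=True)
    print(tree.pos()[:4])
    # [('The', 'DT'), ('forest-products', 'NNS'), ('concern', 'NN'), ('currently', 'RB')]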

17. NEGRA/TIGER Treebanks

[Tree diagram of a TIGER-style analysis, with edge labels such as SB, OC,
MO, and HD, for the sentence "Dieser Meinung kann ich nur voll zustimmen."
("I can only fully agree with this opinion."), tagged
PDAT NN VMFIN PPER ADV ADJD VVINF $.]

18. TüBa-D/Z Treebank

[Tree diagram of a TüBa-D/Z analysis, with topological fields (VF, LK, MF,
VC) and edge labels such as ON, OA, and HD, for the sentence "Dafür wird
Andrea Fischer wenig Zeit haben." ("Andrea Fischer will have little time
for that."), tagged PROP VAFIN NE NE PIAT NN VAINF $.]

19. Treebank Search Tools

- token-level annotation: Corpus Query Processor (CQP)
- syntactic annotation: TIGERSearch

20. Prague Dependency Treebank

[Figure-only slide.]

21. Semantic Annotation

A few English corpora with semantic annotation:
- word senses: SemCor (360,000 words, Brown Corpus)
- semantic roles: PropBank (113,000 verb instances, Penn Treebank WSJ)

22. Word Sense Annotation: Semantic Concordance

Texts from the Brown Corpus annotated with WordNet senses:

    <s snum=1>
    <wf pos=JJ wnsn=1 lexsn=3:00:02::>Most</wf>
    <wf pos=NN wnsn=1 lexsn=1:04:00::>recreation</wf>
    <wf pos=NN wnsn=1 lexsn=1:04:00::>work</wf>
    <wf pos=VB wnsn=2 lexsn=2:42:00::>calls_for</wf>
    <wf pos=DT>a</wf>
    <wf pos=NN wnsn=1 lexsn=1:23:00::>good_deal</wf>
    <wf pos=IN>of</wf>
    <wf pos=JJ wnsn=0 lexsn=5:00:00:preceding(a):00>pre</wf>
    <wf pos=NN wnsn=1 lexsn=1:04:00::>planning</wf>
    <punc>.</punc>
    </s>

23. Word Sense Annotation: Semantic Concordance

    <wf pos=NN wnsn=1 lexsn=1:04:00::>recreation</wf>

WordNet entry:
- S: (n) diversion#1, recreation#1 (an activity that diverts or amuses or
  stimulates) "scuba diving is provided as a diversion for tourists"
- S: (n) refreshment#2, recreation#2 (activity that refreshes and recreates;
  activity that renews your health and spirits by enjoyment and relaxation)
  "time for rest and refreshment by the pool"
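The lemma together with its lexsn forms a WordNet sense key, so the
annotated sense can be resolved programmatically. A minimal sketch using
NLTK's WordNet interface (assuming the WordNet data is installed):

    from nltk.corpus import wordnet as wn

    # lemma "recreation" + lexsn "1:04:00::" = a WordNet sense key
    lemma = wn.lemma_from_key("recreation%1:04:00::")
    print(lemma.synset())               # Synset('diversion.n.01')
    print(lemma.synset().definition())  # the gloss shown in the entry above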

24. Semantic Roles: PropBank

Frame file for the verb "expect":

    Roles:
      Arg0: expecter
      Arg1: thing expected

Example: transitive, active:

    Portfolio managers expect further declines in interest rates.
      Arg0: Portfolio managers
      REL:  expect
      Arg1: further declines in interest rates

Example: transitive, passive:

    Regulatory approval is expected soon by everyone.
      Arg0: everyone
      REL:  is expected
      Arg1: Regulatory approval
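NLTK ships a small PropBank sample, including frame files, so rolesets
like the one above can be inspected programmatically. A minimal sketch
(assuming the sample includes the frame file for "expect"):

    from nltk.corpus import propbank

    roleset = propbank.roleset("expect.01")  # an XML element
    for role in roleset.findall("roles/role"):
        print(role.attrib["n"], role.attrib["descr"])
    # 0 expecter
    # 1 thing expected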

25. Where to Find Corpora and Linguistic Resources?

- Linguistic Data Consortium (LDC): http://www.ldc.upenn.edu
- European Language Resources Association (ELRA): http://www.elra.info
- Within SfS: /afs/sfs/resource
