On annotating learner corpora

ICALL (Intelligent Computer-Assisted Language Learning), Part IV: On annotating learner corpora. Detmar Meurers, Universität Tübingen.


  1. ICALL: Part IV: On Annotating Learner Corpora
     Intelligent Computer-Assisted Language Learning
     Detmar Meurers (Universität Tübingen)
     based on joint research with Luiz Amaral, Holger Wunsch, Ana Díaz-Negrillo, Salvador Valera; cf. also: Díaz-Negrillo/Meurers/Valera/Wunsch (2009): Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. http://purl.org/dm/papers/diaz-negrillo-et-al-09.html
     European Summer School in Language, Logic, and Information, Bordeaux. July 27–31, 2009

     Outline: Learner Corpora; Why they're useful; On compiling learner corpora; Why annotate corpora; Data in SLA research; Error annotation & beyond; Error annotation; Linguistic Annotation; Annotation quality; Why it's important; DECCA: Variation n-gram error detection; A Concrete Case; NOCE Corpus; Linguistic Information; Tokenization; POS-Tagging; Representation: XML, TEI; Automatic POS-Tagging; Analyzing learner language; Sources of Evidence; Mismatching Evidence; Mismatch-free errors; Conclusion

     Roadmap
     ◮ Which role can learner corpora play in Foreign Language Teaching & Second Language Acquisition (SLA) research?
     ◮ Why is linguistic annotation relevant?
     ◮ How can high quality annotation be obtained?
     ◮ Corpus Representation: A Concrete Case
       ◮ The NOCE (NOn-native Corpus of English) learner corpus
       ◮ XML and TEI representation of the annotated corpus
       ◮ Towards linguistic annotation of NOCE
     ◮ Analyzing learner language:
       ◮ sources of evidence for POS annotation
       ◮ mismatches in combining evidence

     Learner Corpora
     ◮ Learner corpora can serve
       ◮ as a teaching resource for Foreign Language Teaching materials design,
       ◮ provide insights into typical student needs, and
       ◮ contribute an empirical basis for theories of Second Language Acquisition.
     ◮ Depending on the corpus composition, it can support qualitative and quantitative analysis of examples found

     On compiling learner corpora
     ◮ Many current learner language corpora consist of essays.
     ◮ Yet learners produce language in a wide range of contexts, naturalistic or instructed, e.g.,
       ◮ email and chat messages
       ◮ answering reading or listening comprehension questions
       ◮ asking questions in information gap activities
     ⇒ To obtain corpora representative of learner language, it is important to include language produced in a variety of contexts, ideally also including longitudinal data.
     ◮ Including explicit task contexts in the meta-information of a corpus can also provide constraining information useful for interpreting learner language.
       ◮ e.g., it's easier to infer what a learner wanted to say if one knows the text they are answering questions about.

  2. Annotation of Learner Corpora
     ◮ Effective querying of corpora for specific phenomena often requires reference to corpus annotation.
     ◮ To find relevant classes of examples, the terminology used to single out learner language aspects of interest needs to be mapped to instances in the corpus (Meurers 2005; Meurers & Müller 2009).
     ◮ Annotations function as an index to classes of data which cannot easily be identified in the surface form.
     ◮ What type of learner language annotations are needed to support the searches for the data which are important for FLT and SLA research?

     Annotation of Learner Corpora (cont.)
     ◮ Example: Finding all sentences containing modal verbs using only the surface forms is possible, but involves specifying a long list of all forms of all modal verbs.
     ◮ Even so, sentences where can is not actually a modal would be wrongly identified:
       (1) Pass me a can of beer.
       (2) I can tuna for a living.
     ◮ Many search patterns cannot be specified in finite form, e.g., finding all sentences with past participle verbs.

     Data in SLA research: Clahsen & Muysken (1986)
     ◮ They studied word order acquisition in German by native speakers of Romance languages
     ◮ Stages of acquisition:
       1. S (Aux) V O
       2. (AdvP/PP) S (Aux) V O
       3. S V[+fin] O V[-fin]
       4. XP V[+fin] S O
       5. S V[+fin] (Adv) O
       6. dass S O V[+fin]
       Stage 2 example: Früher ich kannte den Mann
                        earlier_AdvP I_S knew_V [the man]_O
       Stage 4 example: Früher kannte ich den Mann
                        earlier_AdvP knew_V[+fin] I_S [the man]_O
     ◮ How is the data characterized?
       ◮ lexical and syntactic categories and functions

     Data in SLA research: Kanno (1997), Pérez-Leroux & Glass (1997)
     ◮ They studied the use of overt and null pronouns by non-native speakers of Japanese and Spanish.
     ◮ Examples:
       (3) Nadie dice que él ganará el premio.
           nobody says that he will win the prize
           'Nobody_i says that he_*i/j will win the prize.'
       (4) Nadie dice que ganará el premio.
           nobody says that pro will win the prize
           'Nobody_i says that he_i/j will win the prize.'
     ◮ How is the data characterized?
       ◮ syntactic functions and semantic relations
       ◮ not overtly expressed but interpreted elements
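
The modal-verb example from the "Annotation of Learner Corpora (cont.)" slide can be illustrated with a minimal Python sketch. It is not taken from the slides: the tiny hand-tagged corpus, the Penn Treebank-style tags (MD, NN, VBP, VBN), the two extra sentences, and the function names `surface_hits` and `annotation_hits` are all assumptions made for illustration. The sketch contrasts a search over surface forms, which needs an explicit word list and still over-matches, with a search over POS annotation used as an index.

```python
# Hypothetical hand-tagged mini corpus (Penn Treebank-style tags assumed).
# Sentences 0 and 1 are examples (1) and (2) from the slide; sentences 2
# and 3 are invented here for illustration only.
CORPUS = [
    [("Pass", "VB"), ("me", "PRP"), ("a", "DT"), ("can", "NN"),
     ("of", "IN"), ("beer", "NN"), (".", ".")],
    [("I", "PRP"), ("can", "VBP"), ("tuna", "NN"), ("for", "IN"),
     ("a", "DT"), ("living", "NN"), (".", ".")],
    [("She", "PRP"), ("can", "MD"), ("swim", "VB"), (".", ".")],
    [("The", "DT"), ("letter", "NN"), ("was", "VBD"),
     ("written", "VBN"), (".", ".")],
]

# Surface search requires listing the modal forms explicitly
# (abbreviated list, no contracted or negated forms).
MODAL_FORMS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}


def surface_hits(corpus):
    """Return indices of sentences containing a listed modal word form."""
    return [i for i, sent in enumerate(corpus)
            if any(tok.lower() in MODAL_FORMS for tok, _ in sent)]


def annotation_hits(corpus, tag="MD"):
    """Return indices of sentences containing a token with the given tag."""
    return [i for i, sent in enumerate(corpus)
            if any(t == tag for _, t in sent)]


print(surface_hits(CORPUS))            # [0, 1, 2]: noun and verb "can" over-match
print(annotation_hits(CORPUS))         # [2]: only the modal use
print(annotation_hits(CORPUS, "VBN"))  # [3]: open class, no finite word list needed
```

The last call corresponds to the past participle case on the slide: the annotation-based query stays a single tag lookup, whereas a surface-form query would need an open-ended list of participle forms.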
