Exploring CEFR classification for German based on rich linguistic modeling

Julia Hancke, Detmar Meurers
Universität Tübingen

Learner Corpus Research Conference (LCR 2013)
Bergen, Norway. September 27–29, 2013

Outline
◮ Introduction
◮ Data
◮ Features
  ◮ Lexical
  ◮ Syntactic
    ◮ Language Model
    ◮ Constituency
    ◮ Dependency
  ◮ Morphological
◮ NLP used for feature identification
◮ Experimental setup
◮ Results
  ◮ Individual Feature Groups
  ◮ Feature Groups
  ◮ Feature Selection
  ◮ Qualitative feature analysis
◮ Summary

Introduction

◮ The Common European Framework of Reference for Languages (CEFR) is an increasingly used standard for
  ◮ characterizing the foreign language ability of a learner
  ◮ based on functional abilities to use language in different domains (public, private, occupational, etc.).
◮ But there is a lack of
  ◮ authentic learner data illustrating CEFR levels and
  ◮ insight into the precise linguistic characteristics correlating with the proficiency levels.

Introduction
Towards addressing the desiderata

◮ MERLIN is creating a learner corpus with CEFR-rated essays for German, Italian & Czech (Abel et al. 2013).
◮ How can we explore the impact of different aspects of linguistic modeling on the CEFR classification?
⇒ Use machine learning to quantify the value of different linguistic features for automatic proficiency classification.

Data used: German portion of MERLIN corpus

◮ 1027 German learner texts
  ◮ about 200 texts per exam type (A1–C1)
  ◮ range of lengths (6–366 words) with average 122 words
◮ texts also vary in other parameters:
  ◮ written for different tasks (one of three tasks per level)
  ◮ written by learners with different native languages (> 12)
◮ Each text was graded in terms of CEFR levels
  ◮ by multiple trained human raters at TELC, a major language test provider in Germany
  ◮ reliability of ratings externally validated (Univ. Leipzig)
  ◮ most common rating: B1
Features to be investigated

◮ Goal: richer linguistic modeling of CEFR levels
  ⇒ explore potentially relevant language features
  ⇒ test their impact on predicting CEFR class of each essay
◮ We explored:
  ◮ lexical features
  ◮ syntactic features
    ◮ statistical language model
    ◮ constituency-based
    ◮ dependency-based
  ◮ morphological features

Distribution of Ratings over CEFR levels

[Figure: bar chart, number of texts per essay rating level; x-axis: CEFR level (A1, A2, B1, B2, C1); y-axis: number of texts (0–300)]

Features explored
Lexical features

◮ Lexical density (Lu 2012)
  ◮ ratio of number of lexical words to total number of words
◮ Lexical diversity:
  ◮ TTR variants, MTLD, lexical word variation (McCarthy & Jarvis 2010; Crossley et al. 2011a; Lu 2012)
◮ Depth of lexical knowledge
  ◮ lexical frequency scores (Crossley et al. 2011b)
◮ Lexical relatedness
  ◮ hypernym & polysemy scores (Crossley et al. 2009)
◮ Shallow measures
  ◮ spelling errors per number of words, word length

Features explored
Syntactic features: 1. Statistical Language Models

◮ inspired by readability assessment research (Schwarm & Ostendorf 2005; Petersen & Ostendorf 2009; Feng 2010)
◮ used SRILM Language Modeling Toolkit (Stolcke 2002)
◮ trained on two data sets (Hancke, Meurers & Vajjala 2012)
  ◮ easy: 2000 texts, German kid news website News4Kids
  ◮ hard: 2000 texts, German news channel NTV website
◮ 12 features: unigram, bigram and trigram perplexity for
  ◮ easy or hard text models based on
  ◮ word or mixed (word+POS) representations
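The count of 12 language-model features follows from crossing three n-gram orders with two training sets and two text representations (3 × 2 × 2). The sketch below enumerates that grid and computes one perplexity value. It is an illustration under stated assumptions, not the setup from the slides: the feature names are hypothetical, the toy corpus is invented, and the model is a plain add-one smoothed unigram model standing in for the SRILM models.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical feature names; the slides only state the three dimensions.
ORDERS = ("unigram", "bigram", "trigram")
MODELS = ("easy", "hard")            # News4Kids vs. NTV training data
REPRESENTATIONS = ("word", "mixed")  # words vs. word+POS
FEATURE_NAMES = [
    f"{order}_ppl_{model}_{rep}"
    for order, model, rep in product(ORDERS, MODELS, REPRESENTATIONS)
]


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one smoothed unigram model.

    Assumption: add-one smoothing is used here only for brevity; it is
    not the smoothing applied by SRILM.
    """
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    log_prob = 0.0
    for token in test_tokens:
        # Counter returns 0 for tokens that never occurred in training.
        log_prob += math.log((counts[token] + 1) / (total + vocab_size))
    return math.exp(-log_prob / len(test_tokens))


if __name__ == "__main__":
    # Invented toy data standing in for an "easy" training corpus.
    easy_corpus = "der hund ist gross der hund ist klein".split()
    learner_text = "der hund ist gross".split()
    print(len(FEATURE_NAMES))
    print(unigram_perplexity(easy_corpus, learner_text))
```

In a full setup each of the 12 names would hold the perplexity of the learner text under the corresponding model, so a text scoring low on the easy models and high on the hard ones would look closer to the simpler reference corpus.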
Features explored
Syntactic features: 2. Data-driven constituency features

◮ Is the frequency of common rules characteristic? (Briscoe et al. 2010; Yannakoudakis et al. 2011)
◮ Extracted all rules in the parse trees assigned by Stanford Parser in 700 articles from the NTV corpus

  Example parse of "Norway is beautiful":
    S → NP VP
    NP → NNP
    VP → VPZ ADJP

◮ Given a learner text, for each rule, we use as feature: rule frequency in text / number of words in text

Features explored
Syntactic features: 3. Theory-driven constituency features
(Hancke, Meurers & Vajjala 2012)

Syntactic properties assumed to be characteristic of complexity or difficulty in SLA proficiency and readability research:
◮ number and length of
  ◮ clauses, sentences, T-units
  ◮ NPs, VPs, PPs
◮ dependent clauses and coordinated phrases
  ◮ per clause, sentence, T-unit
◮ interrogative, relative, conjoined clause ratios
◮ nonterminals per sentence
◮ parse tree height

Features explored
Syntactic features: 4. Theory-driven dependency features
(Vor der Brück et al. 2008; Yannakoudakis et al. 2011; Dell’Orletta et al. 2011)

Linguistic properties based on dependency analysis used in SLA proficiency and readability assessment research:
◮ number of words between head and dependent
  ◮ maximum
  ◮ average number per sentence
◮ avg. number of dependents per verb (in words)
◮ number of dependents per NP (in words)

Features explored
Morphological features

◮ Word Formation
  ◮ ratios of nominal suffixes (-ung, -heit) and compounds
◮ Inflectional Morphology
  ◮ of verb: person, mood, verb-form (participle, infinitive)
  ◮ of noun: case
◮ Tense:
  ◮ frequency ratios of verbal tense features
  ◮ data-driven, based on 700 texts from NTV corpus
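The data-driven constituency features (rule frequency in text / number of words in text) can be sketched as follows. This is a minimal illustration, assuming parse trees are already available as nested tuples of the form (label, child, ...) with words as plain strings. The tuple encoding and the function names are hypothetical, and no parser is called; in the slides the trees come from the Stanford Parser and the rule inventory from the NTV articles.

```python
from collections import Counter


def extract_rules(tree):
    """Return the phrase-structure rules used in a tree, skipping lexical rules."""
    label, *children = tree
    subtrees = [child for child in children if isinstance(child, tuple)]
    if not subtrees:
        return []  # node dominates only words: no phrase-structure rule
    rules = [label + " -> " + " ".join(sub[0] for sub in subtrees)]
    for sub in subtrees:
        rules.extend(extract_rules(sub))
    return rules


def count_words(tree):
    """Count the leaf words of a tree."""
    _, *children = tree
    return sum(1 if isinstance(c, str) else count_words(c) for c in children)


def rule_features(trees, rule_inventory):
    """Map each rule in the inventory to its frequency per word of the text."""
    n_words = sum(count_words(tree) for tree in trees)
    if n_words == 0:
        raise ValueError("text contains no words")
    counts = Counter(rule for tree in trees for rule in extract_rules(tree))
    return {rule: counts[rule] / n_words for rule in rule_inventory}


if __name__ == "__main__":
    # The example tree from the slide, encoded as nested tuples.
    tree = ("S",
            ("NP", ("NNP", "Norway")),
            ("VP", ("VPZ", "is"), ("ADJP", "beautiful")))
    # Assumed inventory; "NP -> DT NN" is added to show a rule absent from the text.
    inventory = ["S -> NP VP", "NP -> NNP", "VP -> VPZ ADJP", "NP -> DT NN"]
    for rule, value in rule_features([tree], inventory).items():
        print(rule, value)
```

Fixing the inventory in advance keeps the feature vector the same length for every learner text, with rules that do not occur in a text receiving the value 0.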