


  1. L1-Identification. Serhiy Bykh, Detmar Meurers. Second Tübingen-Berlin Meeting on Analyzing Learner Language, 5./6. December 2011

  2. Contents. 1. Introduction; 2. Previous work on L1 identification; 3. Our baseline approach: surface-based classification (3.1 Features, 3.2 Method, 3.3 Results on ICLEv2 and FALKO); 4. Future work: towards more linguistic modeling; 5. References

  3. Introduction. [Diagram: a learner corpus of texts in L2 German, grouped by the writers' L1: L1 = English, L1 = Russian, L1 = French]

  4. Introduction. [Diagram: a corpus of texts in L2 German with the L1s unknown; the task is to identify each writer's L1: English, Russian, or French]

  5. Contents. 1. Introduction; 2. Previous work on L1 identification; 3. Our baseline approach: surface-based classification (3.1 Features, 3.2 Method, 3.3 Results on ICLEv2 and FALKO); 4. Future work: towards more linguistic modeling; 5. References

  6. Previous work: Wong & Dras (2009). Corpus: 665 ICLEv2 essays, seven L1s with 95 (+ 15) essays per language. Features: 3 error types (subject-verb disagreement, noun-number disagreement, misuse of determiners); 70/363/398 function words; 300 letter n-grams, n ∈ [1, 3]; 450 POS n-grams, n ∈ [2, 3]. Method: SVM, 70 essays per L1 for training, 25 for testing. Result: 73.7% accuracy (feature combination)
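A feature type like the letter n-grams above can be sketched in a few lines. This is not the authors' code; the function names, parameters, and toy inputs are ours, assuming the common setup of keeping the most frequent letter n-grams across the corpus as the feature inventory:

```python
from collections import Counter

def letter_ngrams(text, n_values=(1, 2, 3)):
    """Count all letter n-grams of each length in n_values for one text."""
    counts = Counter()
    chars = text.lower()
    for n in n_values:
        for i in range(len(chars) - n + 1):
            counts[chars[i:i + n]] += 1
    return counts

def top_k_ngrams(texts, k=300, n_values=(1, 2, 3)):
    """The k most frequent letter n-grams over the corpus, as a feature list."""
    total = Counter()
    for text in texts:
        total.update(letter_ngrams(text, n_values))
    return [gram for gram, _ in total.most_common(k)]
```

Each essay is then represented by its values over this fixed inventory (counts or presence), and a classifier such as an SVM is trained on those vectors.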

  7. Contents. 1. Introduction; 2. Previous work on L1 identification; 3. Our baseline approach: surface-based classification (3.1 Features, 3.2 Method, 3.3 Results on ICLEv2 and FALKO); 4. Future work: towards more linguistic modeling; 5. References

  8. Our baseline approach: Features. Features used: word-based recurring n-grams. Examples (from FALKO, including learner errors): n=2: und zwar, 30 Jahre, wirkliche Welt, berüfliche Ausbildung, der Abitur; n=3: was mich betrifft, von geringen Wert, müssen die Studenten; n=6: die Studenten auf die wirkliche Welt, ... We take all n-grams occurring in ≥ 2 texts of the corpus, for all occurring lengths 2 ≤ n ≤ max_n(corpus)
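The recurring-n-gram extraction can be sketched as follows, assuming simple whitespace tokenization (the helper names are ours, not from the slides): keep every word n-gram, for all lengths n ≥ 2, that occurs in at least two texts of the corpus.

```python
from collections import defaultdict

def word_ngrams(tokens, n):
    """The set of word n-grams of length n in one tokenized text."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def recurring_ngrams(texts):
    """All word n-grams (n >= 2, up to the longest text) occurring
    in at least two different texts of the corpus."""
    doc_freq = defaultdict(int)
    token_lists = [text.split() for text in texts]
    max_n = max(len(tokens) for tokens in token_lists)
    for tokens in token_lists:
        for n in range(2, max_n + 1):
            for gram in word_ngrams(tokens, n):
                doc_freq[gram] += 1  # set-based, so at most once per text
    return {gram for gram, df in doc_freq.items() if df >= 2}
```

Counting over sets of n-grams per text (rather than raw occurrences) ensures the "occurs in ≥ 2 texts" criterion is about document frequency, not total frequency.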

  9. Our baseline approach: Method. Machine learning: k-NN with different distance metrics; cosine and dot-product metrics work best for sparse vectors. Testing: leave-one-out. Features encoded as bit vectors (0 = feature absent, 1 = feature present), e.g. text A = (0, 0, 1, ..., 0), text B = (1, 1, 1, ..., 1), text X = (1, 0, 0, ..., 0)
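The references cite TiMBL as the memory-based learner actually used; purely to illustrate the method (binary feature vectors, cosine similarity, k-NN, leave-one-out testing), here is a small self-contained sketch with names and toy data of our own choosing:

```python
import math

def cosine(a, b):
    """Cosine similarity between two binary (0/1) feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(a))  # for 0/1 vectors, sum(a) equals the squared norm
    nb = math.sqrt(sum(b))
    return dot / (na * nb) if na and nb else 0.0

def knn_loo_accuracy(vectors, labels, k=1):
    """Leave-one-out evaluation: classify each text by the majority label
    of its k nearest neighbours under cosine similarity."""
    correct = 0
    for i, v in enumerate(vectors):
        sims = sorted(
            ((cosine(v, w), labels[j]) for j, w in enumerate(vectors) if j != i),
            reverse=True,
        )
        top = [label for _, label in sims[:k]]
        prediction = max(set(top), key=top.count)  # majority vote
        correct += prediction == labels[i]
    return correct / len(vectors)
```

For 0/1 vectors the dot product counts shared features, so cosine is the number of shared n-grams normalized by the geometric mean of the two texts' feature counts, which is why it copes well with sparse vectors of very different sizes.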

  10. Baseline approach: ICLEv2 task. Replication of Wong & Dras (2009): same dataset, but our own features and machine-learning setup. Corpus: ICLEv2, seven L1s (Bulgarian, Czech, French, Russian, Spanish, Chinese, Japanese) x 95 essays = 665 essays. Feature set: word-based recurring n-grams: 1. single n ∈ {2, 3, 4, 5}; 2. intervals: [n, 29] with n ∈ [2, 5] (max_n(corpus) = 29), and [2, n] with n ∈ [3, 6]; 3. picked subsets: {2, 4}, {2, 5}, {2, 3, 5}, {2, 4, 5}, ...

  11. Baseline approach: ICLEv2 results

  12. Baseline approach: ICLEv2 results. Confusion matrix for the best result
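A confusion matrix like the one on this slide simply cross-tabulates gold L1 (rows) against predicted L1 (columns); a minimal sketch with illustrative names:

```python
def confusion_matrix(gold, pred, labels):
    """Cross-tabulate gold vs. predicted labels: rows = gold L1,
    columns = predicted L1."""
    idx = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, pred):
        matrix[idx[g]][idx[p]] += 1
    return matrix
```

The diagonal holds the correctly classified essays; off-diagonal cells show which L1 pairs the classifier confuses most.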

  13. Baseline approach: FALKO setup. Corpus: FALKO subset with 6 L1s (Russian, Uzbek, French, English, Danish, Turkish) x 10 essays = 60 essays. Feature set: recurring n-grams, intervals [2, n] with n ∈ [2, 6] (plus exploration of some other n-gram subsets)

  14. Baseline approach: FALKO results. Word-based n-grams. [Bar charts of accuracy (%) and feature count (#) for the intervals [2], [2,3], [2,4], [2,5], [2,6]; accuracies range from 36.7% to 63.3%, feature counts from 2361 to 3399; best: 63.3% with 2361 features for [2]]

  15. Baseline approach: FALKO results. Part-of-speech-based n-grams. [Bar charts of accuracy (%) and feature count (#) for the intervals [2] to [2,6]; accuracies range from 20% to 46.7%, feature counts from 670 to 10924; best: 46.7% with 6560 features for [2,4]]

  16. Baseline approach: FALKO results. Word + open-class (N.*, VV.*, ADJ.*, CARD classes) n-grams. [Bar charts of accuracy (%) and feature count (#) for the intervals [2] to [2,6]; accuracies range from 43.3% to 53.3%, feature counts from 1917 to 7894]

  17. Baseline approach: FALKO results. Word + open-class POS (matching N.*, VV.*, ADJ.*, CARD) n-grams. [Bar charts of accuracy (%) and feature count (#) for the intervals [2] to [2,6]; accuracies range from 46.7% to 53.3%, feature counts from 2135 to 7741]

  18. Baseline approach: FALKO results. Word + ADJ.* POS (ADJA, ADJD) n-grams. [Bar charts of accuracy (%) and feature count (#) for the intervals [2] to [2,6]; accuracies range from 35% to 56.7%, feature counts from 2541 to 4039]

  19. Baseline approach: FALKO results. Word + VV.* POS (VVFIN, VVIMP, VVINF, VVIZU, VVPP) n-grams. [Bar charts of accuracy (%) and feature count (#) for the intervals [2] to [2,6]; accuracies range from 38.3% to 53.3%, feature counts from 2551 to 4165]

  20. Baseline approach: FALKO results. Word + N.* POS (NN, NE) n-grams. [Bar charts of accuracy (%) and feature count (#) for the intervals [2] to [2,6]; accuracies range from 46.7% to 56.7%, feature counts from 2322 to 5242]

  21. Baseline approach: FALKO results. Best results (chance baseline for 6 classes: 16.7%). Word-based: n = 2 (single n), cosine, 2361 features (max. 3801): 63.3% accuracy. POS-based: n interval [2, 4], cosine, 6560 features (max. 12246): 46.7% accuracy. Word + open-class POS based: N.*, ADJ.*, VV.*, n interval [2, 5], cosine, 7530 features (max. 8232): 53.3% accuracy; N.*, n subset {2, 3, 6}, cosine, 4236 features (max. 5663): 58.3% accuracy

  22. Contents. 1. Introduction; 2. Previous work on L1 identification; 3. Our baseline approach: surface-based classification (3.1 Features, 3.2 Method, 3.3 Results on ICLEv2 and FALKO); 4. Future work: towards more linguistic modeling; 5. References

  23. Towards more linguistic modeling. Features: from surface forms to more linguistic modeling; modeling at different levels of abstraction (words, POS, lemmas, induced classes, ...); modeling at different levels of units (phrases, dependency triples, clauses, sentences, discourse, ...). Evaluation method: use of other machine learning and data mining techniques, e.g. PCA, SVM, etc.

  24. Towards more linguistic modeling. Example: choice of Adj N vs. N N: typical?

  25. References
  Daelemans, W. / Zavrel, J. / van der Sloot, K. / van den Bosch, A. (2010): TiMBL: Tilburg Memory Based Learner, version 6.3, Reference Guide. ILK Research Group Technical Report Series no. 10-01. http://ilk.uvt.nl/downloads/pub/papers/Timbl_6.3_Manual.pdf
  Diehl, E. / Christen, H. / Leuenberger, S. / Pelvat, I. / Studer, T. (2000): Grammatikunterricht: Alles für der Katz? Untersuchungen zum Zweitspracherwerb Deutsch. In: Henne, H. et al. (eds.): Reihe Germanistische Linguistik 220. Niemeyer, Tübingen.
  Granger, S. / Dagneaux, E. / Meunier, F. / Paquot, M. (2009): International Corpus of Learner English (Version 2). Presses Universitaires de Louvain, Louvain-la-Neuve.
  van Halteren, H. (2008): Source Language Markers in EUROPARL Translations. In: Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 937-944.
  Koppel, M. / Schler, J. / Zigdon, K. (2005): Automatically Determining an Anonymous Author's Native Language. In: Intelligence and Security Informatics, volume 3495 of Lecture Notes in Computer Science. Springer, pages 209-217.
  Odlin, T. (1989): Language Transfer: Cross-linguistic Influence in Language Learning. Cambridge University Press, New York.
  Reznicek, M. / Walter, M. / Schmid, K. / Lüdeling, A. / Hirschmann, H. / Krummes, C. (2010): Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 1.0.1.
  Swan, M. / Smith, B. (eds.) (2001): Learner English. A teacher's guide to interference and other problems. Cambridge University Press, Cambridge.
  Tsur, O. / Rappoport, A. (2007): Using Classifier Features for Studying the Effect of Native Language on the Choice of Written Second Language Words. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition (CACLA '07), pages 9-16.
  Wong, S.-M. J. / Dras, M. (2009): Contrastive Analysis and Native Language Identification. In: Proceedings of the Australasian Language Technology Association Workshop, pages 53-61.

  26. Thank you for your attention!

  27. Previous work: Koppel / Schler / Zigdon (2005). Corpus: ICLEv1, 5 L1s x 258 essays = 1290 essays. Features: 400 function words; 200 character n-grams; 185 error types; 250 POS bigrams. Method: SVM, 10-fold cross-validation. Result: 80.2% accuracy (feature combination)

  28. Previous work: Tsur / Rappoport (2007). Corpus: ICLEv1, 5 L1s x 258 essays = 1290 essays. Features: character n-grams, n ∈ {1, 2, 3} (motivation: influence of the L1 syllable structure on the L2 lexis); 460 function words. Method: SVM, 10-fold cross-validation. Result: 65.6% accuracy (bigrams)
