Transferring NLP models across languages and domains. Barbara Plank, ITU, Copenhagen, Denmark. August 28, 2019, #SyntaxFest2019, Paris
Statistical NLP: The Need for Data. Learn a mapping Y = f(X) with ML, e.g., from words X to tags Y: the/DET dog/NOUN barks/VERB
Adverse Conditions ‣ Data dependence: our models dreadfully lack the ability to generalize to new conditions: CROSS-DOMAIN, CROSS-LINGUAL
Data variability ‣ Training and test distributions typically differ (are not i.i.d.), e.g., noisy social media text: "LMAO!", "OMG", "I have no idea!", "LOL", "what you're saying!", "ROFL!" ‣ Domain changes ‣ Extreme case of adaptation: a new language
What to do about it?
Typical setup. Traditional ML: train & evaluate on the same domain/task/language (a separate model, Model A or Model B, per setting)
Adaptation / Transfer Learning: knowledge gained from solving one problem (Model A) is reused to help solve a related problem (Model B)
Transfer Learning - Details (1/2), adapted from Ruder (2019): ‣ Transductive transfer (same task): (1) different domains, learning under domain shift; (2) different languages, cross-lingual learning ‣ Inductive transfer (different task): (3) tasks learned simultaneously, multi-task learning; (4) tasks learned sequentially, continual learning
Transfer Learning - Details (2/2) ‣ different text types: P(X_src) ≠ P(X_trg), Domain Adaptation (DA) ‣ different languages: X_src ≠ X_trg, Cross-lingual Learning (CL) ‣ different tasks: Y_src ≠ Y_trg, Multi-task Learning (MTL) ‣ timing/availability of tasks. Notation: ‣ Domain D = {X, P(X)}, where X is the feature space (e.g., BOW) and P(X) a probability distribution over X ‣ Task T = {Y, P(Y|X)}, where Y is the label space (e.g., +/-)
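To make the D = {X, P(X)} / T = {Y, P(Y|X)} notation concrete, here is a minimal Python sketch (not from the slides): a domain's marginal P(X) is estimated as a term distribution over bag-of-words features, and two domains differ exactly when those distributions differ. The example documents are invented.

```python
from collections import Counter

# A "domain" D = {X, P(X)}: a feature space X (here: bag-of-words vocabulary)
# plus a marginal distribution P(X) estimated from a sample of that domain.
def term_distribution(docs):
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# Two domains differ when P(X_src) != P(X_trg), even if the task
# T = {Y, P(Y|X)} (e.g., Y = {positive, negative}) stays the same.
src_docs = ["the plot was predictable and dull", "a great , moving film"]
trg_docs = ["the blender is sturdy and quiet", "broke after one week"]

p_src = term_distribution(src_docs)
p_trg = term_distribution(trg_docs)
print(sorted(p_src)[:5], sorted(p_trg)[:5])  # largely disjoint vocabularies
```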
Roadmap: 1. Domains: learning to select data; 2. Languages: cross-lingual learning; 3. Multi-task learning
Learning to select data for transfer learning with Bayesian optimization. Sebastian Ruder and Barbara Plank, EMNLP 2017
Data Setup: Multiple Source Domains. Given one target domain and several source domains, how do we select the most relevant data?
Motivation. Why don't we just train on all source data? ‣ To prevent negative transfer ‣ e.g., "predictable" is negative in one domain but positive in another. Prior approaches: ‣ use a single similarity metric in isolation; ‣ focus on a single task.
Our approach. Intuition ‣ Different tasks and domains require different notions of similarity. Idea ‣ Learn a data selection policy using Bayesian Optimization.
Our approach: training examples x_1, …, x_n are scored with the selection policy S_i = φ(x_i)^⊤ w, sorted, and the top m examples are selected. ‣ Related: curriculum learning (Tsvetkov et al., 2016). Tsvetkov, Y., Faruqui, M., Ling, W., & Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. In Proceedings of ACL 2016.
Bayesian Data Selection Policy: S = φ(X) · w^⊤, where w are the learned feature weights and φ(X) are different similarity/diversity features
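A minimal sketch of this selection policy, assuming a precomputed feature matrix φ(X); random search stands in for the Bayesian optimization used in the actual paper, and the toy train/eval functions are placeholders rather than the authors' implementation (the released code is linked in the take-aways below).

```python
import numpy as np

def select_top_m(phi, w, m):
    """Score every source example with S = phi(x) . w and keep the m highest-scoring ones."""
    scores = phi @ w                    # shape: (n_examples,)
    return np.argsort(-scores)[:m]

def objective(w, phi, train_fn, eval_fn, m):
    """Train on the selected subset and return target-domain dev performance."""
    subset = select_top_m(phi, w, m)
    return eval_fn(train_fn(subset))

def optimize_weights(phi, train_fn, eval_fn, m, n_iter=30, seed=0):
    """Random-search stand-in for the Bayesian optimization loop over feature weights w."""
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(n_iter):
        w = rng.uniform(-1.0, 1.0, size=phi.shape[1])
        score = objective(w, phi, train_fn, eval_fn, m)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Toy usage: 100 source examples with 3 similarity/diversity features each.
phi = np.random.default_rng(1).normal(size=(100, 3))
train_fn = lambda subset: subset                        # "training" just returns the subset
eval_fn = lambda subset: float(phi[subset, 0].mean())   # pretend feature 0 predicts utility
w, score = optimize_weights(phi, train_fn, eval_fn, m=20)
print(w, score)
```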
Features • Similarity: Jensen-Shannon divergence, Rényi divergence, Bhattacharyya distance, cosine similarity, Euclidean distance, variational distance; representations: term distributions, topic distributions, word embeddings (Plank, 2011) • Diversity: #types, type-token ratio (TTR), entropy, Simpson's index, Rényi entropy, quadratic entropy
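As a simplified illustration of two of these feature families, the sketch below computes Jensen-Shannon divergence over term distributions (a similarity feature) and type-token ratio plus entropy (diversity features); it is not the paper's exact feature extraction code.

```python
import math
from collections import Counter

def term_dist(tokens, vocab):
    """Relative frequency of each vocabulary item in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in vocab]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same vocabulary."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens)

def entropy(tokens):
    dist = term_dist(tokens, set(tokens))
    return -sum(p * math.log(p) for p in dist if p > 0)

example = "the plot was so predictable".split()
domain = "the kitchen blender was great the blender broke".split()
vocab = sorted(set(example) | set(domain))
print(js_divergence(term_dist(example, vocab), term_dist(domain, vocab)))
print(type_token_ratio(example), entropy(example))
```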
Data & Tasks. Three tasks and their domains: sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007); POS tagging and dependency parsing on SANCL 2012 (Petrov and McDonald, 2012). Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of ACL 2007. Petrov, S., & McDonald, R. (2012). Overview of the 2012 Shared Task on Parsing the Web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
Sentiment Analysis Results: selecting 2,000 from 6,000 source-domain examples. [Chart: accuracy (%) per target domain (Book, DVD, Electronics, Kitchen) for Random, JS divergence (examples), JS divergence (domain), Similarity (topics), Diversity, Similarity + diversity, and All source data (6,000 examples).] ‣ Selecting relevant data is useful when domains are very different.
POS Tagging Results: selecting 2,000 from 14-17.5k source-domain examples. [Chart: accuracy (%) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for Random, JS divergence (examples), JS divergence (domain), Similarity (terms), Diversity, Similarity + diversity, and All source data.] ‣ Learned data selection outperforms static selection, but is less useful when domains are very similar.
Dependency Parsing Results: selecting 2,000 from 14-17.5k source-domain examples. [Chart: Labeled Attachment Score (LAS) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for Random, JS divergence (examples), JS divergence (domain), Similarity (terms), Diversity, Similarity + diversity, and All source data.] (BIST parser, Kiperwasser & Goldberg, 2016)
Do the weights transfer?
Cross-task transfer: performance on the target tasks (columns) when the selection policy is learned on source task T_S (rows), per feature set.

Feature set   T_S    POS     Parsing   SA
Sim           POS    93.51   83.11     74.19
Sim           Pars   92.78   83.27     72.79
Sim           SA     86.13   67.33     79.23
Div           POS    93.51   83.11     69.78
Div           Pars   93.02   83.41     68.45
Div           SA     90.52   74.68     79.65
Sim+div       POS    93.54   83.24     69.79
Sim+div       Pars   93.11   83.51     72.27
Sim+div       SA     89.80   75.17     80.36
Take-aways ‣ Domains & tasks have different notions of similarity; learning a task-specific data selection policy helps. ‣ Preferring certain examples is mainly useful when domains are dissimilar. ‣ The learned policy transfers (to some extent) across models, tasks, and domains. Code: https://github.com/sebastianruder/learn-to-select-data
Roadmap: 1. Domains: learning to select data; 2. Languages: cross-lingual learning; 3. Multi-task learning
🔦 Cross-lingual learning is on the rise 🔦 [Chart: number of papers in the ACL Anthology per year (2004-2019) whose title contains "cross(-)lingual", rising from a handful per year in the mid-2000s to 81 in 2019.] ‣ Includes many advances on cross-lingual representations, e.g., see the ACL 2019 tutorial (Ruder et al., 2019)
Motivation. We want to process all languages, but most of them are severely under-resourced. How do we build taggers, parsers, etc. for those?
Approaches: 1. annotation transfer (annotation projection); 2. model transfer (multilingual embeddings, zero-shot/few-shot learning, delexicalization, …)
Multi-Source Annotation Projection for Dependency Parsing. TACL, 2016
Annotation projection: project word-level annotations through word alignments, e.g., Hwa et al. (2005). Source sentence: Was/PRON machst/VERB du/PRON heute/ADV ?/P ("What are you doing today?"); the tags are projected via the alignments onto the target sentence Che fesa ncuei ?, yielding PRON VERB ADV P.
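A toy sketch of the projection step: copy each source tag to its aligned target token, optionally taking a majority vote when several source languages project onto the same sentence. The alignment pairs and the second source language are invented, and real pipelines (Hwa et al., 2005; Agić et al., 2016) add alignment filtering and, for parsing, tree constraints.

```python
from collections import Counter

def project_pos(src_tags, alignments, trg_len):
    """Copy the source tag to each aligned target token (one source sentence)."""
    projected = [None] * trg_len
    for src_i, trg_j in alignments:
        projected[trg_j] = src_tags[src_i]
    return projected

def vote(projections):
    """Majority vote over projections from multiple source languages."""
    merged = []
    for candidates in zip(*projections):
        candidates = [t for t in candidates if t is not None]
        merged.append(Counter(candidates).most_common(1)[0][0] if candidates else None)
    return merged

# Toy example in the spirit of the slide (German source, 4-token target sentence).
src_tags = ["PRON", "VERB", "PRON", "ADV", "PUNCT"]   # Was machst du heute ?
alignments = [(0, 0), (1, 1), (3, 2), (4, 3)]         # (source index, target index) pairs
proj_de = project_pos(src_tags, alignments, trg_len=4)

# A second (hypothetical) source language projecting onto the same sentence.
proj_other = ["PRON", "VERB", "NOUN", "PUNCT"]
print(vote([proj_de, proj_other]))   # ['PRON', 'VERB', 'ADV', 'PUNCT'] (tie at index 2 broken by first vote)
```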
Multi-Source Annotation Projection ‣ Project from 21 source languages (Agić et al., 2015; 2016), using the Bible as massively multi-parallel data covering ~100 languages
Approach: Projecting dependencies
Results: [Chart: dependency parsing, average UAS over 26 languages, comparing Delex and Multi-source projection trained on Bible vs. WTC (Watchtower) data.]
[Chart: Unlabeled Attachment Score for Best single source, Multi-Source Proj, and Delex-SelectBest.] ‣ The single best source can be better than multi-source ‣ The typologically closest language is not always the best (Lynn et al., 2014): Indonesian is the best source for Irish in delexicalized transfer ‣ Similar recent findings on NER (Rahimi et al., ACL 2019)
Interim discussion (1/2)
How to automatically select the best source parser?
Interim discussion (2/2): Lin et al. (ACL 2019) • Data-dependent features (some similar to Ruder & Plank, 2017), including word/subword overlap and data size • Data-independent features (geographic/genetic distance, etc.)
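As a rough sketch of how such features might be combined, the snippet below scores candidate source languages with a hand-set linear combination of word overlap, data size, and geographic proximity; the feature set, weights, and toy candidates are illustrative only, whereas Lin et al. learn the ranker from observed transfer performance.

```python
def word_overlap(src_vocab, trg_vocab):
    """Data-dependent feature: Jaccard overlap of word types."""
    return len(src_vocab & trg_vocab) / max(len(src_vocab | trg_vocab), 1)

def rank_sources(candidates, trg_vocab, weights):
    """Score each candidate source language with a simple linear combination of features."""
    scored = []
    for name, info in candidates.items():
        feats = {
            "word_overlap": word_overlap(info["vocab"], trg_vocab),
            "data_size": min(info["n_sents"] / 100_000, 1.0),   # capped and normalized
            "geo_proximity": 1.0 - info["geo_distance"],         # assumed to lie in [0, 1]
        }
        score = sum(weights[k] * v for k, v in feats.items())
        scored.append((score, name))
    return sorted(scored, reverse=True)

# Illustrative candidates and hand-set weights (a learned ranker would fit these instead).
candidates = {
    "lang_A": {"vocab": {"casa", "sol", "mar"}, "n_sents": 40_000, "geo_distance": 0.2},
    "lang_B": {"vocab": {"haus", "sonne"},      "n_sents": 90_000, "geo_distance": 0.7},
}
print(rank_sources(candidates, trg_vocab={"casa", "mar", "luna"},
                   weights={"word_overlap": 0.5, "data_size": 0.2, "geo_proximity": 0.3}))
```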
Interim discussion: Results (Lin et al., ACL 2019) • Evaluation on 4 NLP tasks, including dependency parsing (DEP) • For dependency parsing: geographic distance is more predictive than WALS syntactic features; geographic distance and word overlap are the most indicative features
Overview: the settings differ in the amount of supervision, from unlabeled data only to labeled data: (1) do we have parallel data, perhaps even multi-parallel? (2) embeddings? lexicons? (3) (some) gold annotated data? (4) or just a couple of rules?
Lexical Resources for Low-Resource POS tagging in Neural Times. NoDaLiDa 2019 & EMNLP 2018 (Plank & Klerke, 2019; Plank & Agić, 2018)
More and more evidence suggests that integrating symbolic lexical knowledge into neural models aids learning. Question: does neural POS tagging benefit from lexical information?
Lexicons: Wiktionary, UniMorph
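One simple way to feed such lexicons to a tagger, sketched below, is to look up each word form and expose its licensed tags as an n-hot vector; the miniature lexicon is a hand-made stand-in for Wiktionary/UniMorph entries, and the encoding is an assumption, not the exact representation of Plank & Agić (2018).

```python
# Tiny stand-in lexicon: word form -> set of POS tags licensed by Wiktionary/UniMorph.
LEXICON = {
    "run": {"NOUN", "VERB"},
    "beautiful": {"ADJ"},
    "fast": {"ADJ", "ADV"},
}
TAGSET = ["ADJ", "ADV", "NOUN", "VERB"]  # fixed order defines the vector dimensions

def lexicon_vector(word):
    """n-hot vector over the tagset; all zeros for out-of-lexicon words."""
    tags = LEXICON.get(word.lower(), set())
    return [1.0 if t in tags else 0.0 for t in TAGSET]

print(lexicon_vector("run"))    # [0.0, 0.0, 1.0, 1.0]
print(lexicon_vector("xyzzy"))  # [0.0, 0.0, 0.0, 0.0]
# In a neural tagger, this vector would be concatenated to (or embedded alongside)
# the word and character representations fed into the bi-LSTM.
```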
Base bi-LSTM model ‣ Hierarchical bi-LSTM with word & character embeddings (Plank et al., 2016) ‣ Character-level information captures useful regularities, e.g., *able (98% ADJ in WSJ) and bi* (85% NOUN in Danish)
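A condensed PyTorch sketch of such a tagger: a character-level bi-LSTM builds a subword representation that is concatenated with the word embedding before a word-level bi-LSTM predicts one tag per token. Dimensions, batching, and the auxiliary loss of Plank et al. (2016) are omitted, so treat this as an assumption-laden skeleton rather than the released model.

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    """Words are encoded as a word embedding concatenated with the final states of a
    character-level bi-LSTM; a word-level bi-LSTM then predicts one tag per token."""

    def __init__(self, n_words, n_chars, n_tags, w_dim=64, c_dim=32, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.char_lstm = nn.LSTM(c_dim, c_dim, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(w_dim + 2 * c_dim, hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (seq_len,); char_ids: list of (word_len,) tensors, one per token
        char_reprs = []
        for chars in char_ids:
            _, (h, _) = self.char_lstm(self.char_emb(chars).unsqueeze(0))
            char_reprs.append(torch.cat([h[0, 0], h[1, 0]]))   # forward + backward final states
        feats = torch.cat([self.word_emb(word_ids),
                           torch.stack(char_reprs)], dim=-1).unsqueeze(0)
        hidden_states, _ = self.word_lstm(feats)
        return self.out(hidden_states).squeeze(0)              # (seq_len, n_tags)

# Toy forward pass for a 3-token sentence.
tagger = CharWordTagger(n_words=100, n_chars=50, n_tags=17)
words = torch.tensor([5, 12, 7])
chars = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
print(tagger(words, chars).shape)   # torch.Size([3, 17])
```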
How far do we get with an “all-you-can-get” approach to low-resource POS tagging?