Transferring NLP models across languages and domains. Barbara Plank, ITU, Copenhagen, Denmark. August 28, 2019, #SyntaxFest2019, Paris
Statistical NLP: The Need for Data. Learn a mapping Y = f(X) with ML, e.g., from words X to tags Y: the/DET dog/NOUN barks/VERB
Adverse Conditions ‣ Data dependence: our models dreadfully lack the ability to generalize to new conditions: CROSS-DOMAIN, CROSS-LINGUAL
Data variability ‣ Training and test distributions typically differ (are not i.i.d.), e.g., noisy social media text: "LMAO!", "OMG", "I have no idea!", "LOL", "what you're saying!", "ROFL!" ‣ Domain changes ‣ Extreme case of adaptation: a new language
What to do about it?
Typical setup. Traditional ML: train & evaluate on the same domain/task/language (a separate model, Model A or Model B, per setting)
Adaptation / Transfer Learning: knowledge gained from solving one problem (Model A) is reused to help solve a related problem (Model B)
Transfer Learning - Details (1/2), adapted from Ruder (2019): ‣ Transductive transfer (same task): (1) different domains, learning under domain shift; (2) different languages, cross-lingual learning ‣ Inductive transfer (different task): (3) tasks learned simultaneously, multi-task learning; (4) tasks learned sequentially, continual learning
Transfer Learning - Details (2/2) ‣ different text types: P(X_src) ≠ P(X_trg), Domain Adaptation (DA) ‣ different languages: X_src ≠ X_trg, Cross-lingual Learning (CL) ‣ different tasks: Y_src ≠ Y_trg, Multi-task Learning (MTL) ‣ timing/availability of tasks. Notation: ‣ Domain D = {X, P(X)}, where X is the feature space (e.g., BOW) and P(X) a probability distribution over X ‣ Task T = {Y, P(Y|X)}, where Y is the label space (e.g., +/-)
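To make the D = {X, P(X)} / T = {Y, P(Y|X)} notation concrete, here is a minimal Python sketch (not from the slides): a domain's marginal P(X) is estimated as a term distribution over bag-of-words features, and two domains differ exactly when those distributions differ. The example documents are invented.

```python
from collections import Counter

# A "domain" D = {X, P(X)}: a feature space X (here: bag-of-words vocabulary)
# plus a marginal distribution P(X) estimated from a sample of that domain.
def term_distribution(docs):
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# Two domains differ when P(X_src) != P(X_trg), even if the task
# T = {Y, P(Y|X)} (e.g., Y = {positive, negative}) stays the same.
src_docs = ["the plot was predictable and dull", "a great , moving film"]
trg_docs = ["the blender is sturdy and quiet", "broke after one week"]

p_src = term_distribution(src_docs)
p_trg = term_distribution(trg_docs)
print(sorted(p_src)[:5], sorted(p_trg)[:5])  # largely disjoint vocabularies
```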
Roadmap: 1. Domains: learning to select data; 2. Languages: cross-lingual learning; 3. Multi-task learning
Learning to select data for transfer learning with Bayesian optimization. Sebastian Ruder and Barbara Plank, EMNLP 2017
Data Setup: Multiple Source Domains. Given one target domain and several source domains, how do we select the most relevant data?
Motivation. Why don't we just train on all source data? ‣ To prevent negative transfer ‣ e.g., "predictable" is negative in one domain but positive in another. Prior approaches: ‣ use a single similarity metric in isolation; ‣ focus on a single task.
Our approach. Intuition ‣ Different tasks and domains require different notions of similarity. Idea ‣ Learn a data selection policy using Bayesian Optimization.
Our approach: training examples x_1, …, x_n are scored with the selection policy S_i = φ(x_i)^⊤ w, sorted, and the top m examples are selected. ‣ Related: curriculum learning (Tsvetkov et al., 2016). Tsvetkov, Y., Faruqui, M., Ling, W., & Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. In Proceedings of ACL 2016.
Bayesian Data Selection Policy: S = φ(X) · w^⊤, where w are the learned feature weights and φ(X) are different similarity/diversity features
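A minimal sketch of this selection policy, assuming a precomputed feature matrix φ(X); random search stands in for the Bayesian optimization used in the actual paper, and the toy train/eval functions are placeholders rather than the authors' implementation (the released code is linked in the take-aways below).

```python
import numpy as np

def select_top_m(phi, w, m):
    """Score every source example with S = phi(x) . w and keep the m highest-scoring ones."""
    scores = phi @ w                    # shape: (n_examples,)
    return np.argsort(-scores)[:m]

def objective(w, phi, train_fn, eval_fn, m):
    """Train on the selected subset and return target-domain dev performance."""
    subset = select_top_m(phi, w, m)
    return eval_fn(train_fn(subset))

def optimize_weights(phi, train_fn, eval_fn, m, n_iter=30, seed=0):
    """Random-search stand-in for the Bayesian optimization loop over feature weights w."""
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(n_iter):
        w = rng.uniform(-1.0, 1.0, size=phi.shape[1])
        score = objective(w, phi, train_fn, eval_fn, m)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Toy usage: 100 source examples with 3 similarity/diversity features each.
phi = np.random.default_rng(1).normal(size=(100, 3))
train_fn = lambda subset: subset                        # "training" just returns the subset
eval_fn = lambda subset: float(phi[subset, 0].mean())   # pretend feature 0 predicts utility
w, score = optimize_weights(phi, train_fn, eval_fn, m=20)
print(w, score)
```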
Features • Similarity: Jensen-Shannon divergence, Rényi divergence, Bhattacharyya distance, cosine similarity, Euclidean distance, variational distance; representations: term distributions, topic distributions, word embeddings (Plank, 2011) • Diversity: #types, type-token ratio (TTR), entropy, Simpson's index, Rényi entropy, quadratic entropy
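As a simplified illustration of two of these feature families, the sketch below computes Jensen-Shannon divergence over term distributions (a similarity feature) and type-token ratio plus entropy (diversity features); it is not the paper's exact feature extraction code.

```python
import math
from collections import Counter

def term_dist(tokens, vocab):
    """Relative frequency of each vocabulary item in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in vocab]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same vocabulary."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens)

def entropy(tokens):
    dist = term_dist(tokens, set(tokens))
    return -sum(p * math.log(p) for p in dist if p > 0)

example = "the plot was so predictable".split()
domain = "the kitchen blender was great the blender broke".split()
vocab = sorted(set(example) | set(domain))
print(js_divergence(term_dist(example, vocab), term_dist(domain, vocab)))
print(type_token_ratio(example), entropy(example))
```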
Data & Tasks. Three tasks and their domains: sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007); POS tagging and dependency parsing on SANCL 2012 (Petrov and McDonald, 2012). Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of ACL 2007. Petrov, S., & McDonald, R. (2012). Overview of the 2012 Shared Task on Parsing the Web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
Sentiment Analysis Results: selecting 2,000 from 6,000 source-domain examples. [Chart: accuracy (%) per target domain (Book, DVD, Electronics, Kitchen) for Random, JS divergence (examples), JS divergence (domain), Similarity (topics), Diversity, Similarity + diversity, and All source data (6,000 examples).] ‣ Selecting relevant data is useful when domains are very different.
POS Tagging Results: selecting 2,000 from 14-17.5k source-domain examples. [Chart: accuracy (%) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for Random, JS divergence (examples), JS divergence (domain), Similarity (terms), Diversity, Similarity + diversity, and All source data.] ‣ Learned data selection outperforms static selection, but is less useful when domains are very similar.
Dependency Parsing Results: selecting 2,000 from 14-17.5k source-domain examples. [Chart: Labeled Attachment Score (LAS) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for Random, JS divergence (examples), JS divergence (domain), Similarity (terms), Diversity, Similarity + diversity, and All source data.] (BIST parser, Kiperwasser & Goldberg, 2016)
Do the weights transfer?
Cross-task transfer: performance on the target tasks (columns) when the selection policy is learned on source task T_S (rows), per feature set.

Feature set   T_S    POS     Parsing   SA
Sim           POS    93.51   83.11     74.19
Sim           Pars   92.78   83.27     72.79
Sim           SA     86.13   67.33     79.23
Div           POS    93.51   83.11     69.78
Div           Pars   93.02   83.41     68.45
Div           SA     90.52   74.68     79.65
Sim+div       POS    93.54   83.24     69.79
Sim+div       Pars   93.11   83.51     72.27
Sim+div       SA     89.80   75.17     80.36
Take-aways ‣ Domains & tasks have different notions of similarity; learning a task-specific data selection policy helps. ‣ Preferring certain examples is mainly useful when domains are dissimilar. ‣ The learned policy transfers (to some extent) across models, tasks, and domains. Code: https://github.com/sebastianruder/learn-to-select-data
Roadmap: 1. Domains: learning to select data; 2. Languages: cross-lingual learning; 3. Multi-task learning
🔦 Cross-lingual learning is on the rise 🔦 [Chart: number of papers in the ACL Anthology per year (2004-2019) whose title contains "cross(-)lingual", rising from a handful per year in the mid-2000s to 81 in 2019.] ‣ Includes many advances on cross-lingual representations, e.g., see the ACL 2019 tutorial (Ruder et al., 2019)
Motivation. We want to process all languages, but most of them are severely under-resourced. How do we build taggers, parsers, etc. for those?
Approaches: 1. annotation transfer (annotation projection); 2. model transfer (multilingual embeddings, zero-shot/few-shot learning, delexicalization, …)
Multi-Source Annotation Projection for Dependency Parsing. TACL, 2016
Annotation projection: project word-level annotations through word alignments, e.g., Hwa et al. (2005). Source sentence: Was/PRON machst/VERB du/PRON heute/ADV ?/P ("What are you doing today?"); the tags are projected via the alignments onto the target sentence Che fesa ncuei ?, yielding PRON VERB ADV P.
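A toy sketch of the projection step: copy each source tag to its aligned target token, optionally taking a majority vote when several source languages project onto the same sentence. The alignment pairs and the second source language are invented, and real pipelines (Hwa et al., 2005; Agić et al., 2016) add alignment filtering and, for parsing, tree constraints.

```python
from collections import Counter

def project_pos(src_tags, alignments, trg_len):
    """Copy the source tag to each aligned target token (one source sentence)."""
    projected = [None] * trg_len
    for src_i, trg_j in alignments:
        projected[trg_j] = src_tags[src_i]
    return projected

def vote(projections):
    """Majority vote over projections from multiple source languages."""
    merged = []
    for candidates in zip(*projections):
        candidates = [t for t in candidates if t is not None]
        merged.append(Counter(candidates).most_common(1)[0][0] if candidates else None)
    return merged

# Toy example in the spirit of the slide (German source, 4-token target sentence).
src_tags = ["PRON", "VERB", "PRON", "ADV", "PUNCT"]   # Was machst du heute ?
alignments = [(0, 0), (1, 1), (3, 2), (4, 3)]         # (source index, target index) pairs
proj_de = project_pos(src_tags, alignments, trg_len=4)

# A second (hypothetical) source language projecting onto the same sentence.
proj_other = ["PRON", "VERB", "NOUN", "PUNCT"]
print(vote([proj_de, proj_other]))   # ['PRON', 'VERB', 'ADV', 'PUNCT'] (tie at index 2 broken by first vote)
```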
Multi-Source Annotation Projection ‣ Project from 21 source languages (Agić et al., 2015; 2016), using the Bible as massively multi-parallel data covering ~100 languages
Approach: Projecting dependencies
Results: [Chart: dependency parsing, average UAS over 26 languages, comparing Delex and Multi-source projection trained on Bible vs. WTC (Watchtower) data.]
[Chart: Unlabeled Attachment Score for Best single source, Multi-Source Proj, and Delex-SelectBest.] ‣ The single best source can be better than multi-source ‣ The typologically closest language is not always the best (Lynn et al., 2014): Indonesian is the best source for Irish in delexicalized transfer ‣ Similar recent findings on NER (Rahimi et al., ACL 2019)
Interim discussion (1/2)
How to automatically select the best source parser?
Interim discussion (2/2): Lin et al. (ACL 2019) • Data-dependent features (some similar to Ruder & Plank, 2017), including word/subword overlap and data size • Data-independent features (geographic/genetic distance, etc.)
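As a rough sketch of how such features might be combined, the snippet below scores candidate source languages with a hand-set linear combination of word overlap, data size, and geographic proximity; the feature set, weights, and toy candidates are illustrative only, whereas Lin et al. learn the ranker from observed transfer performance.

```python
def word_overlap(src_vocab, trg_vocab):
    """Data-dependent feature: Jaccard overlap of word types."""
    return len(src_vocab & trg_vocab) / max(len(src_vocab | trg_vocab), 1)

def rank_sources(candidates, trg_vocab, weights):
    """Score each candidate source language with a simple linear combination of features."""
    scored = []
    for name, info in candidates.items():
        feats = {
            "word_overlap": word_overlap(info["vocab"], trg_vocab),
            "data_size": min(info["n_sents"] / 100_000, 1.0),   # capped and normalized
            "geo_proximity": 1.0 - info["geo_distance"],         # assumed to lie in [0, 1]
        }
        score = sum(weights[k] * v for k, v in feats.items())
        scored.append((score, name))
    return sorted(scored, reverse=True)

# Illustrative candidates and hand-set weights (a learned ranker would fit these instead).
candidates = {
    "lang_A": {"vocab": {"casa", "sol", "mar"}, "n_sents": 40_000, "geo_distance": 0.2},
    "lang_B": {"vocab": {"haus", "sonne"},      "n_sents": 90_000, "geo_distance": 0.7},
}
print(rank_sources(candidates, trg_vocab={"casa", "mar", "luna"},
                   weights={"word_overlap": 0.5, "data_size": 0.2, "geo_proximity": 0.3}))
```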
Interim discussion: Results (Lin et al., ACL 2019) • Evaluation on 4 NLP tasks, including dependency parsing (DEP) • For dependency parsing: geographic distance is more predictive than WALS syntactic features; geographic distance and word overlap are the most indicative features
Overview: the settings differ in the amount of supervision, from unlabeled data only to labeled data: (1) do we have parallel data, perhaps even multi-parallel? (2) embeddings? lexicons? (3) (some) gold annotated data? (4) or just a couple of rules?
Lexical Resources for Low-Resource POS tagging in Neural Times. NoDaLiDa 2019 & EMNLP 2018 (Plank & Klerke, 2019; Plank & Agić, 2018)
More and more evidence suggests that integrating symbolic lexical knowledge into neural models aids learning. Question: does neural POS tagging benefit from lexical information?
Lexicons: Wiktionary, UniMorph
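One simple way to feed such lexicons to a tagger, sketched below, is to look up each word form and expose its licensed tags as an n-hot vector; the miniature lexicon is a hand-made stand-in for Wiktionary/UniMorph entries, and the encoding is an assumption, not the exact representation of Plank & Agić (2018).

```python
# Tiny stand-in lexicon: word form -> set of POS tags licensed by Wiktionary/UniMorph.
LEXICON = {
    "run": {"NOUN", "VERB"},
    "beautiful": {"ADJ"},
    "fast": {"ADJ", "ADV"},
}
TAGSET = ["ADJ", "ADV", "NOUN", "VERB"]  # fixed order defines the vector dimensions

def lexicon_vector(word):
    """n-hot vector over the tagset; all zeros for out-of-lexicon words."""
    tags = LEXICON.get(word.lower(), set())
    return [1.0 if t in tags else 0.0 for t in TAGSET]

print(lexicon_vector("run"))    # [0.0, 0.0, 1.0, 1.0]
print(lexicon_vector("xyzzy"))  # [0.0, 0.0, 0.0, 0.0]
# In a neural tagger, this vector would be concatenated to (or embedded alongside)
# the word and character representations fed into the bi-LSTM.
```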
Base bi-LSTM model ‣ Hierarchical bi-LSTM with word & character embeddings (Plank et al., 2016) ‣ Character-level information captures useful regularities, e.g., *able (98% ADJ in WSJ) and bi* (85% NOUN in Danish)
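A condensed PyTorch sketch of such a tagger: a character-level bi-LSTM builds a subword representation that is concatenated with the word embedding before a word-level bi-LSTM predicts one tag per token. Dimensions, batching, and the auxiliary loss of Plank et al. (2016) are omitted, so treat this as an assumption-laden skeleton rather than the released model.

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    """Words are encoded as a word embedding concatenated with the final states of a
    character-level bi-LSTM; a word-level bi-LSTM then predicts one tag per token."""

    def __init__(self, n_words, n_chars, n_tags, w_dim=64, c_dim=32, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.char_lstm = nn.LSTM(c_dim, c_dim, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(w_dim + 2 * c_dim, hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (seq_len,); char_ids: list of (word_len,) tensors, one per token
        char_reprs = []
        for chars in char_ids:
            _, (h, _) = self.char_lstm(self.char_emb(chars).unsqueeze(0))
            char_reprs.append(torch.cat([h[0, 0], h[1, 0]]))   # forward + backward final states
        feats = torch.cat([self.word_emb(word_ids),
                           torch.stack(char_reprs)], dim=-1).unsqueeze(0)
        hidden_states, _ = self.word_lstm(feats)
        return self.out(hidden_states).squeeze(0)              # (seq_len, n_tags)

# Toy forward pass for a 3-token sentence.
tagger = CharWordTagger(n_words=100, n_chars=50, n_tags=17)
words = torch.tensor([5, 12, 7])
chars = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
print(tagger(words, chars).shape)   # torch.Size([3, 17])
```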
How far do we get with an “all-you-can-get” approach to low-resource POS tagging?