  1. Modelling Multiple Sequences: Explorations, Consequences and Challenges. Orhan Firat, Dagstuhl Seminar - C2NLU, January 2017

  2. Before we start! The Fog of Progress* and Artificial General Intelligence. *Hinton video lectures, https://www.youtube.com/watch?v=ZuvRXGX8cY8

  3. What is going on? 2014 → 2017: ⋆ Neural-MT ⋆ Large-Vocabulary NMT ⋆ Image Captioning - NMT ⋆ OpenMT’15 - NMT ⋆ WMT’15 - NMT ⋆ Subword-NMT ⋆ Multilingual-NMT ⋆ Multi-Source NMT ⋆ Character-Dec NMT ⋆ WMT’16 - NMT ⋆ Zero-Resource NMT ⋆ Google - NMT ⋆ Fully Char-NMT ⋆ Google Zero-Shot NMT

  4. Conclusion slide of Machine Translation Marathon ’16: What lies ahead? Perhaps we’ve only scratched the surface! ▸ Language barrier, surpassing human-level quality. Revisiting the new territory: character-level, larger-context, multilingual neural machine translation using ▸ multiple modalities ▸ better error signals ▸ and better GPUs

  5. Multi-Sequence Modelling. What is a sequence? ▸ A sequence (x_1, ..., x_T) can be: ▸ sentence (“visa”, “process”, “is”, “taking”, “so”, “long”, “.”) ▸ image ▸ video ▸ speech ▸ ...

  6. Multi-Sequence Modelling. What is sequence modelling? “It is all about telling how likely a sequence is.” — Kyunghyun Cho ▸ Modelling in the sense of predictive modelling ▸ What is the probability of (x_1, ..., x_T)? ▸ p(x_1, x_2, ..., x_T) = ∏_{t=1}^{T} p(x_t | x_{<t}) = ? ▸ Example: RNN language models
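
To make the chain-rule factorisation above concrete, here is a minimal RNN language-model sketch in PyTorch that scores how likely a token sequence is; the vocabulary size, dimensions, GRU cell and the <bos>=0 convention are illustrative assumptions, not details from the talk.

```python
# A minimal sketch (not the talk's implementation): an RNN language model
# scoring a sequence via p(x_1, ..., x_T) = prod_t p(x_t | x_{<t}).
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, x):
        """Log-probability of token sequences x: (batch, T) of token ids.
        Position t is predicted from the hidden state summarising x_{<t}."""
        bos = torch.zeros(x.size(0), 1, dtype=torch.long)   # assumed <bos> id = 0
        inp = torch.cat([bos, x[:, :-1]], dim=1)             # shift right by one
        h, _ = self.rnn(self.embed(inp))                      # (batch, T, hidden)
        logp = torch.log_softmax(self.proj(h), dim=-1)        # (batch, T, vocab)
        tok_logp = logp.gather(-1, x.unsqueeze(-1)).squeeze(-1)
        return tok_logp.sum(dim=1)                            # sum of log p(x_t | x_{<t})

lm = RNNLanguageModel()
x = torch.randint(1, 1000, (2, 7))                            # two length-7 sequences
print(lm.log_prob(x))                                         # how likely each sequence is
```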

  7. Multi-Sequence Modelling. What do we mean by multiple sequences? Well, we really mean more than one (multiple) sequence. Conditional-LM: ▸ p(x_1, ..., x_T | Y) = ∏_{t=1}^{T} p(x_t | x_{<t}, Y) ▸ seq2seq models, NMT
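
A corresponding sketch of the conditional LM: an encoder summarises Y into a context that conditions every decoder step, giving p(x_t | x_{<t}, Y). The sizes and the single-context-vector design (no attention) are simplifying assumptions for illustration.

```python
# A minimal seq2seq / conditional-LM sketch: the decoder factorises the target
# as prod_t p(x_t | x_{<t}, Y), with Y summarised by an encoder RNN.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        _, y = self.encoder(self.src_embed(src))          # y summarises the source Y
        h, _ = self.decoder(self.tgt_embed(tgt_in), y)    # every step conditioned on Y
        return self.proj(h)                               # logits for p(x_t | x_{<t}, Y)

model = Seq2Seq()
src = torch.randint(1, 1000, (2, 9))
tgt_in = torch.randint(1, 1000, (2, 7))                   # shifted target as decoder input
print(model(src, tgt_in).shape)                           # (2, 7, tgt_vocab)
```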

  8. Multi-Sequence Modelling. What do we mean by multiple sequences? Well, we really mean more than one (multiple) sequence. Conditional-LM: ▸ p(x_1, ..., x_T | Y) = ∏_{t=1}^{T} p(x_t | x_{<t}, Y) ▸ seq2seq models, NMT. Multi-Way [diagram: Enc 1, Enc 2, Enc 3 → shared Att → Dec 1, Dec 2, Dec 3]: ▸ Multi-lingual models ▸ Multi-modal models
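
A sketch of the multi-way setup in the diagram: language-specific encoders and decoders that all reuse one shared attention function, in the spirit of Firat et al. 2016. The language set, dimensions and cell choices are assumptions, not the published configuration.

```python
# A minimal multi-way sketch: one encoder and one decoder per language,
# sharing a single attention mechanism across all (source, target) pairs.
import torch
import torch.nn as nn

class SharedAttention(nn.Module):
    """One attention function reused by every (encoder, decoder) pair."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dim); enc_states: (batch, T, dim)
        q = dec_state.unsqueeze(1).expand_as(enc_states)
        a = torch.softmax(self.score(torch.cat([q, enc_states], -1)).squeeze(-1), -1)
        return (a.unsqueeze(-1) * enc_states).sum(1)              # context vector

class MultiWayNMT(nn.Module):
    def __init__(self, langs=("en", "de", "fr"), vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.ModuleDict({l: nn.Embedding(vocab, dim) for l in langs})
        self.encoders = nn.ModuleDict({l: nn.GRU(dim, dim, batch_first=True) for l in langs})
        self.decoders = nn.ModuleDict({l: nn.GRUCell(2 * dim, dim) for l in langs})
        self.attention = SharedAttention(dim)                      # shared across all pairs

    def step(self, src_lang, tgt_lang, src, dec_state, tgt_tok):
        enc_states, _ = self.encoders[src_lang](self.embed[src_lang](src))
        ctx = self.attention(dec_state, enc_states)                # same attention, any pair
        dec_in = torch.cat([self.embed[tgt_lang](tgt_tok), ctx], -1)
        return self.decoders[tgt_lang](dec_in, dec_state)          # next decoder state

model = MultiWayNMT()
src = torch.randint(1, 1000, (2, 9))
state = torch.zeros(2, 128)
tok = torch.randint(1, 1000, (2,))
print(model.step("de", "en", src, state, tok).shape)               # (2, 128)
```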

  9. Warren Weaver, “Translation”, 1949. Tall towers analogy: ▸ Do not shout from tower to tower, ▸ Go down to the common basement of all towers: interlingua

  10. Warren Weaver, “Translation”, 1949. Tall towers analogy: ▸ Red Tower: source language ▸ Blue Tower: target language ▸ Green Car: alignment function

  11. Warren Weaver, “Translation”, 1949. Tall towers analogy: ▸ Do NOT model the individual behaviour of a car, ▸ Model how the highway works!

  12. Sequence Modelling with Finer Tokens. Issues with tokenization and segmentation: ▸ Ineffective handling of morphological variants: ’run’, ’runs’, ’running’ and ’runner’ ▸ How are we doing with compound words? Issues with treating each and every token separately: ▸ Fills the vocabulary with similar words ▸ Vocabulary size grows linearly w.r.t. the corpus size ▸ Rare words, numbers and misspelled words: “9/11” carries huge contextual information ▸ We lose the learning signal of words marked as <UNK> (slide credit: Junyoung Chung)
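
A toy illustration (my own example, not from the slides) of the vocabulary point: every surface variant adds a new word type, while a character vocabulary stays small and still covers misspellings and rare tokens.

```python
# Toy comparison of word-level vs. character-level vocabularies.
corpus = "run runs running runner runing 9/11 visa-process".split()

word_vocab = set(corpus)            # one entry per surface form
char_vocab = set("".join(corpus))   # a handful of characters covers everything

print(len(word_vocab), "word types:", sorted(word_vocab))
print(len(char_vocab), "character types:", sorted(char_vocab))
# Every new variant ('runing', '9/11', ...) adds a word type, but the character
# vocabulary already covers it, so nothing has to be mapped to <UNK>.
```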

  13. Granularity of Input and Output Spaces (finer tokens)

  14. Sequence Modelling with Finer Tokens. We are still concerned: ▸ Less data sparsity (some will still remain though, Bengio et al., 2003) ▸ Consequences of increased sequence length (e.g., a 7-word sentence becomes roughly 30 characters)! ▸ Capturing long-term dependencies ▸ Harder to train (but we have GRU, LSTM and attention) ▸ Speed loss, 2-3 times slower, but ... ▸ No need to worry about segmentation ▸ Open vocabularies, saving us giant matrices or tricks ▸ Naturally embeds multiple languages (Lee et al. ’16) ▸ And maybe multiple modalities, with even finer tokens.

  15. Explorations: Shared Medium. Interlingua as Shared Functional Form

  16. Consequences: Shared Medium. Interlingua as Shared Functional Form. ▸ Luong et al. 2015 - Examines multi-task sequence-to-sequence learning ▸ One-to-many: MT and Syntactic Parsing ▸ Many-to-one: Translation and Image Captioning ▸ Many-to-many: Unsupervised objectives and MT

  17. Consequences: Shared Medium. Interlingua as Shared Functional Form. ▸ Firat et al. 2016a - Shared attention mechanism ▸ Notion of a shared function representing the interlingua ▸ Trained using parallel data only ▸ Positive language transfer for low-resource languages (Firat et al. 2016b) ▸ A single model that can translate 10 pairs

  18. Consequences: Shared Medium. Interlingua as Shared Functional Form. *Training with multiple language pairs has encouraged the model to find a common context vector space (we can exploit flattened manifolds). ▸ Enables multi-source translation: ▸ Multi-source training (Zoph and Knight, 2016) ▸ Multi-source decoding (Firat et al. 2016c) ▸ Enables zero-resource translation (Firat et al. 2016c) ▸ Easily extendible to larger-context NMT and system combination

  19. Consequences: Shared Medium. Interlingua as Shared Functional Form. ▸ Johnson et al. 2016 - Google Multilingual NMT ▸ Thanh-Le Ha et al. 2016 - Karlsruhe, universal encoder and decoder ▸ Mixed (multilingual) sub-word vocabularies (not characters) ▸ Enables zero-shot translation ▸ Source-side code-switching (translate from a mixed source) ▸ Target-side language selection (generate a mixed translation)
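
Target-side language selection works by prepending an artificial token to the source sentence. Below is a minimal sketch, assuming whitespace tokenisation and the <2xx> token spelling described in Johnson et al. 2016; the function name is my own.

```python
# A minimal sketch of the Johnson et al. 2016 trick: one multilingual model is
# steered by an artificial target-language token prepended to the source.
def make_multilingual_example(src_sentence, tgt_lang):
    """Prepend the target-language token so the shared encoder/decoder knows
    which language to generate, enabling zero-shot directions."""
    return [f"<2{tgt_lang}>"] + src_sentence.split()

print(make_multilingual_example("visa process is taking so long .", "de"))
# ['<2de>', 'visa', 'process', 'is', 'taking', 'so', 'long', '.']
```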

  20. Consequences: Shared Medium. Interlingua as Shared Functional Form. ▸ Lee et al. 2016 - Fully Character-Level Multilingual NMT ▸ A character-based decoder was already proposed (Chung et al. 2016) ▸ What makes it challenging? ▸ Training time (naive approach = 3 months, Luong et al. 2016) ▸ Mapping a character sequence to meaning ▸ Long-range dependencies in text ▸ Map a character sequence to meaning without sacrificing speed!

  21. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Model details: ▸ RNNSearch model ▸ Source and target at character level ▸ CNN+RNN encoder ▸ Two-layer simple GRU decoder ▸ {Fi, De, Cs, Ru} → En. Training: ▸ Mix mini-batches ▸ Use bi-text only

  22. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Hybrid Character Encoder [figure]
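
A rough sketch of the hybrid CNN+RNN character encoder idea: character embeddings, convolutions over character windows, strided max-pooling to shorten the sequence, then a recurrent layer over the pooled segments. The filter width, stride and dimensions are illustrative guesses, not the paper's hyper-parameters.

```python
# A rough sketch of the char2char encoder idea from Lee et al. 2016 (dimensions
# and layer choices are assumptions, not the published configuration).
import torch
import torch.nn as nn

class CharCNNRNNEncoder(nn.Module):
    def __init__(self, n_chars=300, emb=64, n_filters=128, width=5, stride=5, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, n_filters, kernel_size=width, padding=width // 2)
        self.pool = nn.MaxPool1d(kernel_size=stride, stride=stride)  # shrink sequence ~5x
        self.rnn = nn.GRU(n_filters, hidden, batch_first=True, bidirectional=True)

    def forward(self, chars):                    # chars: (batch, n_characters)
        e = self.embed(chars).transpose(1, 2)    # (batch, emb, n_characters)
        c = torch.relu(self.conv(e))             # local character n-gram features
        p = self.pool(c).transpose(1, 2)         # (batch, n_characters/stride, n_filters)
        states, _ = self.rnn(p)                  # contextualised segment states
        return states

enc = CharCNNRNNEncoder()
chars = torch.randint(1, 300, (2, 100))          # two sentences of 100 characters
print(enc(chars).shape)                          # (2, 20, 512)
```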

  23. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Experimental results for char2char: 1. Bilingual char2char >= bilingual bpe2char 2. Multilingual char2char > multilingual bpe2char 2.1 More flexible in assigning model capacity to different languages 2.2 Works better than most bilingual models (as well as being more parameter-efficient). From Rico (comparison of bpe2bpe, bpe2char and char2char): [table]

  24. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Human evaluation: ▸ Multilingual char2char >= bilingual char2char ▸ Bilingual char2char > bilingual bpe2char

  25. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Additional qualitative analysis: 1. Spelling mistakes 2. Rare/long words 3. Morphology 4. Nonsense words 5. Multilingual sentences (code-switching). Also from Rico 2016 (LingEval97 - 97,000 contrastive translation pairs): 1. Noun-phrase agreement 2. Subject-verb agreement 3. Separable verb particle 4. Polarity 5. Transliteration

  26. [figure-only slide]

  27. How far can we extend the existing approaches? Bigger models, complicated architectures! RNNs can express/approximate a set of Turing machines, BUT* ... expressivity ≠ learnability. *Edward Grefenstette, Deep Learning Summer School 2016

  28. How far can we extend the existing approaches? Fast-Forward Connections for NMT (Zhou et al., 2016). Bigger models are harder to train! ▸ Deep topology for recurrent networks (16 layers) ▸ Performance boost (+6.2 BLEU points) ▸ Fast-forward connections for gradient flow
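
To illustrate why direct connections help gradients flow through deep recurrent stacks, here is a residual-style sketch. Note this shows the general principle only, not the exact fast-forward formulation of Zhou et al. 2016; dimensions and depth are assumptions.

```python
# A minimal sketch of deep recurrent stacks with direct (residual-style) paths
# around each layer so gradients can flow through many layers.
import torch
import torch.nn as nn

class DeepRecurrentEncoder(nn.Module):
    def __init__(self, dim=128, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.GRU(dim, dim, batch_first=True) for _ in range(n_layers)])

    def forward(self, x):                       # x: (batch, T, dim)
        for rnn in self.layers:
            out, _ = rnn(x)
            x = x + out                         # direct path around each recurrent layer
        return x

enc = DeepRecurrentEncoder()
print(enc(torch.randn(2, 15, 128)).shape)       # (2, 15, 128)
```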

  29. How far can we extend the existing approaches? Multi-task/Multilingual Models. Bigger models are harder to train and behave differently! ▸ Scheduling the learning process ▸ Balanced-batch trick and early-stopping heuristics (Firat et al. 2016b, Lee et al. 2016, Johnson et al. 2016)
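
A toy sketch of a balanced-batch schedule: cycle over language pairs so each pair contributes updates equally often. The pair list, batch size and sampling are illustrative; the cited papers use their own schedules and early-stopping heuristics.

```python
# A toy balanced-batch schedule for multilingual training: round-robin over
# language pairs so every pair is seen roughly equally often.
import itertools
import random

def balanced_batches(datasets, batch_size=4):
    """datasets maps a language pair to a list of (src, tgt) examples.
    Yields (pair, batch) tuples, cycling over the pairs."""
    for pair in itertools.cycle(datasets):
        batch = random.sample(datasets[pair], k=min(batch_size, len(datasets[pair])))
        yield pair, batch

data = {
    "de-en": [("Hallo Welt", "Hello world")] * 10,
    "fi-en": [("Hei maailma", "Hello world")] * 10,
    "cs-en": [("Ahoj svete", "Hello world")] * 10,
}
schedule = balanced_batches(data)
for _ in range(6):
    pair, batch = next(schedule)
    print(pair, len(batch))        # de-en, fi-en, cs-en, de-en, ... each gets equal turns
```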

  30. How far can we extend the existing approaches? Multi-task/Multilingual Models. Bigger models are harder to train and behave differently! ▸ Preventing unlearning (catastrophic forgetting) ▸ Update-scheduling heuristics (!)

  31. Interpretability ▸ Information is distributed, which makes it hard to interpret ▸ What is the attention model doing exactly? (Johnson et al. 2016) ▸ How to dissect these giant models? ▸ Which sub-task should we use to evaluate models? ▸ Simultaneous Neural Machine Translation (Gu et al. 2016) ▸ Character-level alignments or importance

  32. Longer Sequences, Training Latency ▸ Longer credit-assignment paths (BPTT) ▸ Extended training times: ▸ Bilingual bpe2bpe: 1 week ▸ Bilingual char2char: 2 weeks ▸ Multilingual (10 pairs) bpe2bpe: 3 weeks (2 GPUs) ▸ Multilingual (4 pairs) char2char: 2.5 months ▸ Training latency limits the search over diverse model architectures and allows only limited hyper-parameter search ▸ How to extend to larger context, document level?

  33. What about multiple modalities? “Multi-modal Attention for Neural Machine Translation”, Caglayan, Barrault and Bougares, 2016

  34. What about multiple modalities? “Lip Reading Sentences in the Wild”, Chung et al., 2016

  35. Why stop at characters?
