Modelling Multiple Sequences: Explorations, Consequences and Challenges
Orhan Firat
Dagstuhl Seminar - C2NLU, January 2017

Before we start! The Fog of Progress and Artificial General Intelligence
(Hinton video lectures, https://www.youtube.com/watch?v=ZuvRXGX8cY8)

What is going on? A timeline, 2014-2017:
⋆ Neural-MT
⋆ Large-Vocabulary NMT
⋆ Image Captioning - NMT
⋆ OpenMT'15 - NMT
⋆ WMT'15 - NMT
⋆ Subword-NMT
⋆ Multilingual-NMT
⋆ Multi-Source NMT
⋆ Character-Dec NMT
⋆ WMT'16 - NMT
⋆ Zero-Resource NMT
⋆ Google-NMT
⋆ Fully Char-NMT
⋆ Google Zero-Shot NMT

Conclusion Slide of Machine Translation Marathon '16: What Lies Ahead?
Perhaps we've only scratched the surface!
▸ Language barrier, surpassing human-level quality.
Revisiting the new territory: character-level, larger-context, multilingual neural machine translation using
▸ Multiple modalities
▸ Better error signals
▸ and better GPUs

Multi-Sequence Modelling: What is a sequence?
▸ A sequence (x_1, …, x_T) can be:
  ▸ a sentence ("visa", "process", "is", "taking", "so", "long", ".")
  ▸ an image
  ▸ a video
  ▸ speech
  ▸ ...

Multi-Sequence Modelling: What is sequence modelling?
"It is all about telling how likely a sequence is." — Kyunghyun Cho
▸ Modelling in the sense of predictive modelling.
▸ What is the probability of (x_1, …, x_T)?
▸ p(x_1, x_2, …, x_T) = ∏_{t=1}^{T} p(x_t ∣ x_{<t}) = ?
▸ Example: RNN language models
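
A minimal sketch of the factorization above, assuming a toy vanilla RNN with random weights and illustrative sizes (this is not the speaker's code): the hidden state summarises x_<t, and the per-token log-probabilities are summed to score the whole sequence.

```python
# Toy RNN language model scoring a sequence via p(x) = prod_t p(x_t | x_<t).
# Sizes, weights and token ids are illustrative assumptions.
import numpy as np

V, H = 1000, 64                      # vocabulary size, hidden size (assumed)
rng = np.random.RandomState(0)
E  = rng.randn(V, H) * 0.01          # token embeddings
Wh = rng.randn(H, H) * 0.01          # recurrent weights
Wx = rng.randn(H, H) * 0.01          # input-to-hidden weights
Wo = rng.randn(H, V) * 0.01          # hidden-to-vocabulary weights

def log_prob(tokens):
    """Return log p(x_1, ..., x_T) under the toy RNN LM."""
    h, logp = np.zeros(H), 0.0
    prev = 0                         # assume index 0 is a <bos> symbol
    for t in tokens:
        h = np.tanh(E[prev] @ Wx + h @ Wh)      # state summarises x_<t
        logits = h @ Wo
        logits -= logits.max()                  # numerical stability
        log_softmax = logits - np.log(np.exp(logits).sum())
        logp += log_softmax[t]                  # add log p(x_t | x_<t)
        prev = t
    return logp

print(log_prob([5, 42, 7, 999]))     # hypothetical token ids
```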

Multi-Sequence Modelling: What do we mean by multiple sequences?
Well, we really mean more than one (multiple) sequence.
Conditional LM
▸ p(x_1, …, x_T ∣ Y) = ∏_{t=1}^{T} p(x_t ∣ x_{<t}, Y)
▸ seq2seq models, NMT
Multi-Way
(figure: multiple encoders Enc 1-3 and decoders Dec 1-3 connected through a shared attention module)
▸ Multilingual models
▸ Multi-modal models
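
Extending the sketch to the conditional case (again a toy, with assumed sizes and a single context vector rather than attention): an encoder summarises Y, and every decoder step conditions on both x_<t and that summary.

```python
# Toy conditional LM in the seq2seq spirit: p(x | Y) = prod_t p(x_t | x_<t, Y).
import numpy as np

Vs, Vt, H = 1000, 1000, 64            # source/target vocab sizes, hidden size (assumed)
rng = np.random.RandomState(1)
Es = rng.randn(Vs, H) * 0.01          # source embeddings
Et = rng.randn(Vt, H) * 0.01          # target embeddings
We = rng.randn(H, H) * 0.01           # encoder recurrence
Wd = rng.randn(H, H) * 0.01           # decoder recurrence
Wx = rng.randn(H, H) * 0.01           # decoder input weights
Wc = rng.randn(H, H) * 0.01           # context conditioning weights
Wo = rng.randn(H, Vt) * 0.01          # output projection

def encode(y_tokens):
    """Summarise the conditioning sequence Y into a single context vector."""
    h = np.zeros(H)
    for y in y_tokens:
        h = np.tanh(Es[y] @ We + h)
    return h

def log_prob(x_tokens, y_tokens):
    """log p(x | Y): each decoder step conditions on x_<t and the context of Y."""
    c, h, logp, prev = encode(y_tokens), np.zeros(H), 0.0, 0
    for t in x_tokens:
        h = np.tanh(Et[prev] @ Wx + h @ Wd + c @ Wc)
        logits = h @ Wo
        logits -= logits.max()
        logp += (logits - np.log(np.exp(logits).sum()))[t]
        prev = t
    return logp

print(log_prob([3, 17, 9], [12, 8, 451]))   # hypothetical token ids
```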

Warren Weaver, "Translation" (1949): the tall towers analogy
▸ Do not shout from tower to tower,
▸ Go down to the common basement of all towers: the interlingua.

Warren Weaver, "Translation" (1949): the tall towers analogy
▸ Red tower: source language
▸ Blue tower: target language
▸ Green car: alignment function

Warren Weaver, "Translation" (1949): the tall towers analogy
▸ Do NOT model the individual behaviour of a car,
▸ Model how the highway works!

Sequence Modelling with Finer Tokens
Issues with tokenization and segmentation:
▸ Ineffective handling of morphological variants: 'run', 'runs', 'running' and 'runner'
▸ How are we doing with compound words?
Issues with treating each and every token separately:
▸ The vocabulary fills up with similar words
▸ Vocabulary size keeps growing with the corpus size
▸ Rare words, numbers and misspelled words: "9/11" carries huge contextual information
▸ We lose the learning signal of words mapped to <UNK>
(slide credit: Junyoung Chung)
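
A toy illustration of the vocabulary pressure described above, on an assumed two-sentence corpus: the word-level vocabulary keeps acquiring near-duplicates and maps unseen or misspelled forms to <UNK>, while the character-level inventory stays small and still covers them.

```python
# Word-level vs character-level vocabularies on a made-up corpus.
corpus = [
    "the visa process is taking so long .",
    "visas are processed slowly , runners keep running .",
]

word_vocab = set()
char_vocab = set()
for sent in corpus:
    word_vocab.update(sent.split())
    char_vocab.update(sent.replace(" ", ""))

print(len(word_vocab), "word types vs", len(char_vocab), "character types")

# A misspelled or unseen word falls out of the word vocabulary ...
test = "the visa proccess is runing"
print([w if w in word_vocab else "<UNK>" for w in test.split()])
# ... but every character of it is still representable.
print(all(c in char_vocab or c == " " for c in test))
```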

Granularity of Input and Output Spaces (finer tokens)

Sequence Modelling with Finer Tokens
We are still concerned:
▸ Less data sparsity (though it will still remain; Bengio et al., 2003)
▸ Consequences of increased sequence length!
  ▸ Capturing long-term dependencies
  ▸ Harder to train (but wait, we have GRUs, LSTMs and attention)
  ▸ Speed loss, 2-3 times slower, but ...
▸ No need to worry about segmentation
▸ Open vocabularies, saving us giant matrices or tricks
▸ Naturally embeds multiple languages (Lee et al., 2016)
▸ And maybe multiple modalities, with even finer tokens.

Explorations: Shared Medium - Interlingua as Shared Functional Form

Consequences: Shared Medium - Interlingua as Shared Functional Form
▸ Luong et al., 2015 - examines multi-task sequence-to-sequence learning
  ▸ One-to-many: MT and syntactic parsing
  ▸ Many-to-one: translation and image captioning
  ▸ Many-to-many: unsupervised objectives and MT

Consequences: Shared Medium - Interlingua as Shared Functional Form
▸ Firat et al., 2016a - shared attention mechanism
  ▸ Notion of a shared function representing the interlingua
  ▸ Trained using parallel data only
  ▸ Positive language transfer for low-resource pairs (Firat et al., 2016b)
  ▸ A single model that can translate 10 language pairs
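
A rough sketch of the wiring (not the authors' code): one encoder per source language, one decoder per target language, and a single attention function reused by every (source, target) pair. The dot-product attention, sizes and language set are my assumptions.

```python
# Multi-way multilingual wiring: per-language encoders/decoders, one shared attention.
import numpy as np

H, LANGS = 64, ["en", "de", "fi"]
rng = np.random.RandomState(2)

encoders = {l: rng.randn(H, H) * 0.01 for l in LANGS}   # per-language encoder params
decoders = {l: rng.randn(H, H) * 0.01 for l in LANGS}   # per-language decoder params
W_att = rng.randn(H, H) * 0.01                          # the ONE shared attention

def encode(src_lang, src_vectors):
    """Language-specific encoder: one annotation per source position."""
    W, h, annots = encoders[src_lang], np.zeros(H), []
    for x in src_vectors:
        h = np.tanh(x @ W + h)
        annots.append(h)
    return np.stack(annots)

def shared_attention(dec_state, annotations):
    """Shared across every (source, target) pair: scores + weighted context."""
    scores = annotations @ (W_att @ dec_state)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ annotations

def decode_step(tgt_lang, dec_state, annotations):
    context = shared_attention(dec_state, annotations)
    return np.tanh((dec_state + context) @ decoders[tgt_lang])

# Any of the 3x3 translation directions reuses the same W_att:
src = [rng.randn(H) for _ in range(5)]                  # dummy source "embeddings"
annots = encode("de", src)
state = decode_step("en", np.zeros(H), annots)
print(state.shape)
```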

Consequences: Shared Medium - Interlingua as Shared Functional Form
* Training with multiple language pairs has encouraged the model to find a common context-vector space (we can exploit flattened manifolds).
▸ Enables multi-source translation:
  ▸ Multi-source training (Zoph and Knight, 2016)
  ▸ Multi-source decoding (Firat et al., 2016c)
▸ Enables zero-resource translation (Firat et al., 2016c)
▸ Easily extensible to larger-context NMT and system combination

Consequences: Shared Medium - Interlingua as Shared Functional Form
▸ Johnson et al., 2016 - Google multilingual NMT
▸ Ha et al., 2016 - Karlsruhe, universal encoder and decoder
  ▸ Mixed (multilingual) subword vocabularies (not characters)
  ▸ Enables zero-shot translation
  ▸ Source-side code-switching (translate from a mixed source)
  ▸ Target-side language selection (generate a mixed translation)
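
The zero-shot setup of Johnson et al. (2016) hinges on a simple trick: prepend an artificial token to the source sentence that names the desired target language. A minimal illustration, with made-up token spellings and training pairs:

```python
# Target-language-token trick: one model, many directions (including unseen ones).
def add_target_token(source_sentence: str, target_lang: str) -> str:
    return f"<2{target_lang}> {source_sentence}"

training_pairs = [
    ("Hello world", "en", "de", "Hallo Welt"),
    ("Hallo Welt", "de", "fr", "Bonjour le monde"),
]

for src, src_lang, tgt_lang, tgt in training_pairs:
    model_input = add_target_token(src, tgt_lang)
    print(model_input, "->", tgt)

# Zero-shot direction: no en->fr data above, but we can still ask for it.
print(add_target_token("Hello world", "fr"))
```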

Consequences: Shared Medium - Interlingua as Shared Functional Form
▸ Lee et al., 2016 - fully character-level multilingual NMT
  ▸ A character-based decoder was already proposed (Chung et al., 2016)
  ▸ What makes it challenging?
    ▸ Training time (naive approach = 3 months, Luong et al., 2016)
    ▸ Mapping a character sequence to meaning
    ▸ Long-range dependencies in text
  ▸ Map a character sequence to meaning without sacrificing speed!

Fully Character-Level Multilingual NMT (Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016)
Model details:
▸ RNNSearch model
▸ Character level on both source and target
▸ CNN+RNN encoder
▸ Two-layer GRU decoder
▸ {Fi, De, Cs, Ru} → En
Training:
▸ Mix mini-batches
▸ Use bi-text only

Fully Character-Level Multilingual NMT (Lee, Cho and Hofmann, 2016)
Hybrid character encoder (architecture figure)
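
A rough sketch of a char2char-style hybrid character encoder: character embeddings, convolutions of several widths, strided max-pooling to shorten the sequence, a highway layer, and a bidirectional GRU over the resulting segments. Filter counts, widths, stride and sizes here are illustrative assumptions, not the paper's exact configuration.

```python
# Hybrid character encoder sketch (assumed hyper-parameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridCharEncoder(nn.Module):
    def __init__(self, n_chars=300, emb=64, widths=(1, 2, 3, 4), filters=64, stride=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb, filters, w, padding=w // 2) for w in widths])
        total = filters * len(widths)
        self.stride = stride
        self.highway_t = nn.Linear(total, total)   # highway transform gate
        self.highway_h = nn.Linear(total, total)
        self.gru = nn.GRU(total, total, bidirectional=True, batch_first=True)

    def forward(self, char_ids):                   # (batch, characters)
        x = self.embed(char_ids).transpose(1, 2)   # (batch, emb, length)
        feats = torch.cat([F.relu(c(x))[:, :, :x.size(2)] for c in self.convs], dim=1)
        pooled = F.max_pool1d(feats, self.stride)  # shorten the sequence ~5x
        h = pooled.transpose(1, 2)                 # (batch, length/stride, total)
        gate = torch.sigmoid(self.highway_t(h))
        h = gate * torch.relu(self.highway_h(h)) + (1 - gate) * h
        annotations, _ = self.gru(h)               # segment-level annotations
        return annotations

enc = HybridCharEncoder()
print(enc(torch.randint(0, 300, (2, 100))).shape)  # e.g. (2, 20, 512)
```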

Fully Character-Level Multilingual NMT (Lee, Cho and Hofmann, 2016)
Experimental results for char2char:
1. Bilingual char2char >= bilingual bpe2char
2. Multilingual char2char > multilingual bpe2char
   2.1 More flexible in assigning model capacity to different languages
   2.2 Works better than most bilingual models (as well as being more parameter-efficient)
Also, from Rico Sennrich: a comparison of bpe2bpe, bpe2char and char2char.

Fully Character-Level Multilingual NMT (Lee, Cho and Hofmann, 2016)
Human evaluation:
▸ Multilingual char2char >= bilingual char2char
▸ Bilingual char2char > bilingual bpe2char

Fully Character-Level Multilingual NMT (Lee, Cho and Hofmann, 2016)
Additional qualitative analysis:
1. Spelling mistakes
2. Rare/long words
3. Morphology
4. Nonsense words
5. Multilingual sentences (code-switching)
Also from Rico Sennrich, 2016 (LingEval97 - 97,000 contrastive translation pairs):
1. Noun-phrase agreement
2. Subject-verb agreement
3. Separable verb particles
4. Polarity
5. Transliteration

How far can we extend the existing approaches?
Bigger models, complicated architectures!
RNNs can express/approximate a set of Turing machines, BUT* ...
expressivity ≠ learnability
(* Edward Grefenstette, Deep Learning Summer School 2016)

How far can we extend the existing approaches?
Fast-Forward Connections for NMT (Zhou et al., 2016)
Bigger models are harder to train!
▸ Deep topology for recurrent networks (16 layers)
▸ Performance boost (+6.2 BLEU points)
▸ Fast-forward connections for gradient flow
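
The specific fast-forward formulation is in the paper; as a generic illustration of the underlying idea (extra linear paths so gradients can bypass the recurrent transformations in a deep stack), here is a plain additive skip connection over stacked GRUs, with assumed depth and sizes:

```python
# Deep recurrent stack with additive skip connections (generic illustration,
# not the exact fast-forward connections of Zhou et al., 2016).
import torch
import torch.nn as nn

class DeepRecurrentEncoder(nn.Module):
    def __init__(self, size=256, depth=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.GRU(size, size, batch_first=True) for _ in range(depth)])

    def forward(self, x):                       # (batch, time, size)
        for gru in self.layers:
            out, _ = gru(x)
            x = x + out                         # linear path: gradients can skip the GRU
        return x

enc = DeepRecurrentEncoder()
print(enc(torch.randn(2, 30, 256)).shape)       # (2, 30, 256)
```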

How far can we extend the existing approaches?
Multi-task / Multilingual Models
Bigger models are harder to train and behave differently!
▸ Scheduling the learning process
▸ Balanced-batch trick and early-stopping heuristics (Firat et al., 2016b; Lee et al., 2016; Johnson et al., 2016)

How far can we extend the existing approaches?
Multi-task / Multilingual Models
Bigger models are harder to train and behave differently!
▸ Preventing unlearning (catastrophic forgetting)
▸ Update scheduling heuristics (!)
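
A toy sketch of the kind of update scheduling these heuristics point at (my assumptions, not a published recipe): cycle over language pairs so every pair contributes updates at a controlled rate, which also limits how long any pair goes unseen.

```python
# Balanced round-robin scheduling of mini-batches across language pairs.
import itertools

pair_data = {
    ("de", "en"): ["de-en batch"] * 1000,   # high-resource (dummy batches)
    ("fi", "en"): ["fi-en batch"] * 100,    # low-resource
    ("cs", "en"): ["cs-en batch"] * 300,
}

def balanced_schedule(pair_data):
    """Yield one mini-batch per language pair in round-robin order, cycling
    the smaller corpora so no pair starves (or gets 'unlearned')."""
    iterators = {p: itertools.cycle(batches) for p, batches in pair_data.items()}
    for pair in itertools.cycle(sorted(iterators)):
        yield pair, next(iterators[pair])

schedule = balanced_schedule(pair_data)
for _ in range(6):                          # pretend these are SGD updates
    pair, batch = next(schedule)
    print("update on", pair, "with", batch)
```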

Interpretability
▸ Information is distributed, which makes it hard to interpret
▸ What is the attention model doing exactly? (Johnson et al., 2016)
▸ How do we dissect these giant models?
▸ Which sub-task should we use to evaluate models?
  ▸ Simultaneous neural machine translation (Gu et al., 2016)
  ▸ Character-level alignments or importance

Longer Sequences: Training Latency
▸ Longer credit-assignment paths (BPTT)
▸ Extended training times:
  ▸ Bilingual bpe2bpe: 1 week
  ▸ Bilingual char2char: 2 weeks
  ▸ Multilingual (10 pairs) bpe2bpe: 3 weeks (2 GPUs)
  ▸ Multilingual (4 pairs) char2char: 2.5 months
▸ Training latency limits the search:
  ▸ Diverse model architectures
  ▸ Limited hyper-parameter search
▸ How do we extend to larger context, document level?

What about multiple modalities?
"Multi-modal Attention for Neural Machine Translation" (Caglayan, Barrault and Bougares, 2016)

What about multiple modalities?
"Lip Reading Sentences in the Wild" (Chung et al., 2016)

Why stop at characters?