Neural Network Language Models
Philipp Koehn
1 October 2020
N-Gram Backoff Language Model

• Previously, we approximated p(W) = p(w_1, w_2, ..., w_n)
• ... by applying the chain rule
  p(W) = ∏_i p(w_i | w_1, ..., w_{i-1})
• ... and limiting the history (Markov order)
  p(w_i | w_1, ..., w_{i-1}) ≃ p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})
• Each p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) may not have enough statistics to estimate
  → we back off to p(w_i | w_{i-3}, w_{i-2}, w_{i-1}), p(w_i | w_{i-2}, w_{i-1}), etc., all the way to p(w_i)
  – exact details of backing off get complicated ("interpolated Kneser-Ney")
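To make the backoff idea concrete, here is a minimal Python sketch that uses "stupid backoff" (relative frequencies with a fixed penalty per backoff step) in place of the interpolated Kneser-Ney mentioned above; the toy corpus, the trigram order, and the 0.4 penalty are illustrative assumptions, and the values returned are scores rather than normalized probabilities.

```python
from collections import Counter

# Toy corpus; real language models are estimated from much larger text.
corpus = "the cat sat on the mat the dog sat on the rug".split()
ORDER = 3   # trigram model

# Count all n-grams up to the model order.
counts = Counter()
for n in range(1, ORDER + 1):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i:i + n])] += 1

def score(word, history, alpha=0.4):
    """Relative frequency if the n-gram was seen, otherwise back off to a
    shorter history with a fixed penalty (scores, not true probabilities)."""
    history = tuple(history)[-(ORDER - 1):]
    if not history:
        return counts[(word,)] / len(corpus)                  # unigram fallback
    if counts[history] > 0 and counts[history + (word,)] > 0:
        return counts[history + (word,)] / counts[history]
    return alpha * score(word, history[1:], alpha)            # drop earliest context word

print(score("sat", ["the", "cat"]))   # seen trigram: plain relative frequency
print(score("dog", ["on", "the"]))    # unseen trigram: backed-off estimate
```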
Refinements

• A whole family of back-off schemes
• Skip n-gram models that may back off to p(w_i | w_{i-2})
• Class-based models p(C(w_i) | C(w_{i-4}), C(w_{i-3}), C(w_{i-2}), C(w_{i-1}))
⇒ We are wrestling here with
  – using as much relevant evidence as possible
  – pooling evidence between words
First Sketch

[Diagram: the history words w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1} feed into a feed-forward hidden layer h, followed by a softmax that predicts the output word w_i]
Representing Words

• Words are represented with a one-hot vector, e.g.,
  – dog = (0,0,0,0,1,0,0,0,0,...)
  – cat = (0,0,0,0,0,0,0,1,0,...)
  – eat = (0,1,0,0,0,0,0,0,0,...)
• That's a large vector!
• Remedies
  – limit to, say, the 20,000 most frequent words, rest are OTHER
  – place words in √n classes, so each word is represented by
    ∗ 1 class label
    ∗ 1 word-in-class label
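A minimal Python sketch of the one-hot representation with a capped vocabulary; the toy vocabulary and the <OTHER> token name are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for the 20,000 most frequent words; everything else maps to OTHER.
vocab = ["<OTHER>", "the", "cat", "dog", "eat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 at the word's position; unknown words share OTHER."""
    vec = np.zeros(len(vocab))
    vec[word_to_id.get(word, word_to_id["<OTHER>"])] = 1.0
    return vec

print(one_hot("cat"))        # [0. 0. 1. 0. 0.]
print(one_hot("platypus"))   # unknown word -> OTHER slot: [1. 0. 0. 0. 0.]
```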
Word Classes for Two-Hot Representations

• WordNet classes
• Brown clusters
• Frequency binning
  – sort words by frequency
  – place them in order into classes
  – each class has the same total token count
  → very frequent words have their own class
  → rare words share a class with many other words
• Anything goes: assign words randomly to classes
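A minimal Python sketch of frequency binning into roughly √n classes with approximately equal total token counts; the toy word counts are made up for illustration.

```python
import math
from collections import Counter

# Assumed toy token counts; in practice these come from the training corpus.
token_counts = Counter({"the": 50, "of": 30, "cat": 8, "dog": 7,
                        "eats": 3, "sleeps": 2, "platypus": 1})

n = len(token_counts)
num_classes = max(1, math.isqrt(n))                    # about sqrt(n) classes
tokens_per_class = sum(token_counts.values()) / num_classes

word_class, current_class, running = {}, 0, 0
for word, count in token_counts.most_common():         # most frequent words first
    word_class[word] = current_class
    running += count
    if running >= tokens_per_class and current_class < num_classes - 1:
        current_class, running = current_class + 1, 0  # start the next class

print(word_class)   # frequent words end up (nearly) alone in a class,
                    # rare words share a class with many other words
```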
word embeddings
Add a Hidden Layer

[Diagram: as before, but each history word w_{i-4}, ..., w_{i-1} is first mapped by a shared embedding layer E to an embedding Ew, which feeds the feed-forward hidden layer h and the softmax over the output word w_i]

• Map each word first into a lower-dimensional real-valued space
• Shared weight matrix E
Details (Bengio et al., 2003)

• Add direct connections from embedding layer to output layer
• Activation functions
  – input → embedding: none
  – embedding → hidden: tanh
  – hidden → output: softmax
• Training
  – loop through the entire corpus
  – update weights based on the error between the predicted probabilities and the one-hot vector of the observed output word
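To make the model concrete, here is a minimal numpy sketch of the forward pass: a shared embedding matrix E, a tanh hidden layer over the concatenated history embeddings, and a softmax output. The layer sizes, random weights, and example word ids are illustrative assumptions, and the direct embedding-to-output connections of Bengio et al. are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H, CONTEXT = 1000, 32, 64, 4                 # vocab, embedding, hidden sizes, history length

E  = rng.normal(scale=0.1, size=(V, D))            # shared embedding matrix
W1 = rng.normal(scale=0.1, size=(CONTEXT * D, H))  # embeddings -> hidden
b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, V))            # hidden -> output
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(history_ids):
    """Probability distribution over the next word given 4 history word ids."""
    x = E[history_ids].reshape(-1)        # look up and concatenate embeddings
    h = np.tanh(x @ W1 + b1)              # hidden layer with tanh activation
    return softmax(h @ W2 + b2)           # softmax over the vocabulary

p = predict([12, 7, 400, 3])
print(p.shape, p.sum())                   # (1000,) 1.0
```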
Word Embeddings

[Figure: a word and its continuous word embedding C]

• By-product: embedding of word into continuous space
• Similar contexts → similar embedding
• Recall: distributional semantics
Word Embeddings

[Figures: visualizations of learned word embeddings]
Are Word Embeddings Magic?

• Morphosyntactic regularities (Mikolov et al., 2013)
  – adjectives: base form vs. comparative, e.g., good, better
  – nouns: singular vs. plural, e.g., year, years
  – verbs: present tense vs. past tense, e.g., see, saw
• Semantic regularities
  – clothing is to shirt as dish is to bowl
  – evaluated on human judgment data of semantic similarities
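A minimal Python sketch of the vector-arithmetic view of these regularities: the offset good → better should roughly match year → years. The tiny hand-made 3-dimensional vectors are illustrative assumptions, not trained embeddings.

```python
import numpy as np

# Made-up toy embeddings, just to show the arithmetic.
emb = {
    "good":   np.array([0.10, 0.80, 0.10]),
    "better": np.array([0.15, 0.80, 0.60]),
    "year":   np.array([0.70, 0.20, 0.12]),
    "years":  np.array([0.74, 0.21, 0.61]),
    "see":    np.array([0.30, 0.50, 0.15]),
}

def nearest(vec, exclude=()):
    """Word whose embedding has the highest cosine similarity to vec."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

# "good is to better as year is to ?"
query = emb["year"] + (emb["better"] - emb["good"])
print(nearest(query, exclude={"good", "better", "year"}))   # ideally "years"
```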
recurrent neural networks
Recurrent Neural Networks

[Diagram: the first word w_1 is embedded and, together with an extra input layer, fed through a tanh hidden layer and a softmax to predict the next word]

• Start: predict second word from first
• Mystery layer with nodes all with value 1
Recurrent Neural Networks

[Diagrams: the network unrolled over two and then three time steps; at each step the tanh hidden layer is copied over as the history input for predicting the next word]
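A minimal numpy sketch of one step of the recurrent language model pictured above: embed the current word, combine it with the copied-over hidden state through tanh, and predict the next word with a softmax. Sizes and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H = 1000, 32, 64                     # vocab, embedding, hidden sizes

E   = rng.normal(scale=0.1, size=(V, D))   # embedding matrix
W_x = rng.normal(scale=0.1, size=(D, H))   # embedding -> hidden
W_h = rng.normal(scale=0.1, size=(H, H))   # previous hidden -> hidden (the "copy" arrow)
W_o = rng.normal(scale=0.1, size=(H, V))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_id, h_prev):
    """One time step: returns the next-word distribution and the new hidden state."""
    h = np.tanh(E[word_id] @ W_x + h_prev @ W_h)
    return softmax(h @ W_o), h

h = np.zeros(H)                            # initial history
for w in [12, 7, 400]:                     # process w1, w2, w3
    p_next, h = step(w, h)
print(p_next.shape)                        # (1000,)
```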
Training

[Diagram: the first word w_1 is embedded (Ew_t), passed through the RNN hidden layer h_t (with initial history) and a softmax output y_t, from which a cost is computed against the correct output word]

• Process first training example
• Update weights with back-propagation
Training

[Diagram: the same network at the second position: the hidden layer now receives the RNN state from the previous step, the second word w_2 is embedded, and a cost is computed from the softmax output y_t]

• Process second training example
• Update weights with back-propagation
• And so on...
• But: no feedback to previous history
Back-Propagation Through Time

[Diagram: the network unfolded over three time steps w_1, w_2, w_3, each with its own embedding, RNN hidden state, softmax output y_t, and cost]

• After processing a few training examples, update through the unfolded recurrent neural network
Back-Propagation Through Time

• Carry out back-propagation through time (BPTT) after each training example
  – 5 time steps seems to be sufficient
  – network learns to store information for more than 5 time steps
• Or: update in mini-batches
  – process 10-20 training examples
  – update backwards through all examples
  – removes the need for multiple update steps for each training example
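A minimal PyTorch sketch of truncated back-propagation through time: unroll over a short window (5 steps, as suggested above), update, then detach the hidden state so gradients do not flow further back. The model sizes, the random token stream, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

V, D, H, STEPS = 1000, 32, 64, 5
model = nn.ModuleDict({
    "embed": nn.Embedding(V, D),
    "rnn":   nn.RNN(D, H, batch_first=True),
    "out":   nn.Linear(H, V),
})
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

corpus = torch.randint(0, V, (1, 101))           # fake token stream (batch of 1)
hidden = torch.zeros(1, 1, H)

for start in range(0, corpus.size(1) - 1 - STEPS, STEPS):
    x = corpus[:, start:start + STEPS]           # input words
    y = corpus[:, start + 1:start + 1 + STEPS]   # next words to predict
    output, hidden = model["rnn"](model["embed"](x), hidden)
    loss = loss_fn(model["out"](output).reshape(-1, V), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    hidden = hidden.detach()                     # truncate: stop gradients here
```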
long short term memory
Vanishing Gradients

• Error is propagated to previous steps
• Updates consider
  – prediction at that time step
  – impact on future time steps
• Vanishing gradient: propagated error disappears
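A toy numeric illustration of the effect (not the actual BPTT computation): back-propagating through many tanh steps multiplies the error by tanh'(z) · w at each step, which drives it toward zero whenever those factors are below 1; the weight and pre-activation values are assumptions.

```python
import numpy as np

w, z = 0.9, 1.0                            # assumed recurrent weight and pre-activation
grad = 1.0
for t in range(20):
    grad *= w * (1.0 - np.tanh(z) ** 2)    # tanh'(z) = 1 - tanh(z)^2  (about 0.42 here)
    if t in (0, 4, 9, 19):
        print(f"after {t + 1:2d} steps: {grad:.2e}")
# the propagated error shrinks geometrically -> earlier time steps barely learn
```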
Recent vs. Early History

• Hidden layer plays double duty
  – memory of the network
  – continuous space representation used to predict output words
• Sometimes only recent context important
  After much economic progress over the years, the country → has
• Sometimes much earlier context important
  The country which has made much economic progress over the years still → has
Long Short Term Memory (LSTM)

• Design quite elaborate, although not very complicated to use
• Basic building block: LSTM cell
  – similar to a node in a hidden layer
  – but: has an explicit memory state
• Output and memory state change depends on gates
  – input gate: how much new input changes memory state
  – forget gate: how much of prior memory state is retained
  – output gate: how strongly memory state is passed on to next layer
• Gates can not just be open (1) or closed (0), but also slightly ajar (e.g., 0.2)
LSTM Cell

[Diagram: an LSTM cell between time steps t−1 and t. The input X from the preceding layer and the hidden value from the previous time step are combined into the cell input i, scaled by the input gate (⊗), and added (⊕) to the previous memory state m, which is itself scaled by the forget gate; the resulting memory state is scaled by the output gate to give the output o, which becomes the hidden value h passed to the next layer Y and to the LSTM layer at time t]
LSTM Cell (Math)

• Memory and output values at time step t
  memory_t = gate_input × input_t + gate_forget × memory_{t−1}
  output_t = gate_output × memory_t
• Hidden node value h_t passed on to the next layer applies an activation function f
  h_t = f(output_t)
• Input is computed as in a recurrent neural network node
  – given node values for the prior layer x_t = (x_t^1, ..., x_t^X)
  – given values for the hidden layer from the previous time step h_{t−1} = (h_{t−1}^1, ..., h_{t−1}^H)
  – input value is a combination of matrix multiplication with weights w^x and w^h and an activation function g
    input_t = g( Σ_{i=1}^X w_i^x x_t^i + Σ_{i=1}^H w_i^h h_{t−1}^i )
Values for Gates

• Gates are very important
• How do we compute their value? → with a neural network layer!
• For each gate a ∈ (input, forget, output)
  – weight matrix W^{xa} to consider node values in the previous layer x_t
  – weight matrix W^{ha} to consider the hidden layer h_{t−1} at the previous time step
  – weight matrix W^{ma} to consider the memory memory_{t−1} at the previous time step
  – activation function h
    gate_a = h( Σ_{i=1}^X w_i^{xa} x_t^i + Σ_{i=1}^H w_i^{ha} h_{t−1}^i + Σ_{i=1}^H w_i^{ma} memory_{t−1}^i )
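A minimal numpy sketch of the cell as written on this and the preceding slide, including the gates that also look at the previous memory state. The layer sizes, random weights, and the choice of sigmoid for the gate activation h and tanh for g and f are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X, H = 8, 16                                 # size of preceding layer / hidden layer
GATES = ("input", "forget", "output")

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate for x_t, h_{t-1}, memory_{t-1} (W^{xa}, W^{ha}, W^{ma}),
# plus the weights w^x, w^h for the node input itself.
Wx = {a: rng.normal(scale=0.1, size=(X, H)) for a in GATES}
Wh = {a: rng.normal(scale=0.1, size=(H, H)) for a in GATES}
Wm = {a: rng.normal(scale=0.1, size=(H, H)) for a in GATES}
wx_node = rng.normal(scale=0.1, size=(X, H))
wh_node = rng.normal(scale=0.1, size=(H, H))

def lstm_cell(x_t, h_prev, mem_prev):
    """memory_t = gate_input * input_t + gate_forget * memory_{t-1};
    output_t = gate_output * memory_t;  h_t = f(output_t)."""
    gate = {a: sigmoid(x_t @ Wx[a] + h_prev @ Wh[a] + mem_prev @ Wm[a]) for a in GATES}
    input_t = np.tanh(x_t @ wx_node + h_prev @ wh_node)        # activation g = tanh
    mem_t = gate["input"] * input_t + gate["forget"] * mem_prev
    out_t = gate["output"] * mem_t
    return np.tanh(out_t), mem_t                               # activation f = tanh

h, mem = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(3, X)):          # run three time steps on random inputs
    h, mem = lstm_cell(x_t, h, mem)
print(h.shape, mem.shape)                    # (16,) (16,)
```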