Character-Aware Neural Language Models: Yoon Kim, Yacine Jernite (PowerPoint PPT Presentation)



  1. Character-Aware Neural Language Models. Yoon Kim, Yacine Jernite, David Sontag, Alexander Rush. Harvard SEAS / New York University. Code: https://github.com/yoonkim/lstm-char-cnn

  2. Language Model. A language model (LM) is a probability distribution over a sequence of words: p(w_1, ..., w_T) for any sequence of length T from a vocabulary V (with w_i ∈ V for all i). Important for many downstream applications: machine translation, speech recognition, text generation.

  3. Count-based Language Models. By the chain rule, any distribution can be factorized as p(w_1, ..., w_T) = ∏_{t=1}^{T} p(w_t | w_1, ..., w_{t−1}). Count-based n-gram language models make a Markov assumption: p(w_t | w_1, ..., w_{t−1}) ≈ p(w_t | w_{t−n+1}, ..., w_{t−1}). Need smoothing to deal with rare n-grams.
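
A minimal Python sketch (not from the slides) of a count-based bigram model with add-one smoothing, illustrating the Markov assumption and why smoothing is needed for rare n-grams; the toy corpus and function names are purely illustrative:

    from collections import Counter

    # Toy corpus; in practice this is a large tokenized training set.
    corpus = "the cat sat on the mat . the dog sat on the log .".split()
    vocab = set(corpus)

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_add1(word, prev):
        # p(word | prev) under a bigram Markov assumption, with add-one smoothing
        # so that unseen bigrams do not get probability zero.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

    print(p_add1("mat", "the"))   # seen bigram
    print(p_add1("mat", "dog"))   # unseen bigram: nonzero only because of smoothing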

  4. Neural Language Models. Neural Language Models (NLM) represent words as dense vectors in R^n (word embeddings). w_t ∈ R^{|V|}: one-hot representation of the word ∈ V at time t ⇒ x_t = X w_t: word embedding (X ∈ R^{n×|V|}, n < |V|). Train a neural net that composes the history to predict the next word: p(w_t = j | w_1, ..., w_{t−1}) = exp(p_j · g(x_1, ..., x_{t−1}) + q_j) / Σ_{j′∈V} exp(p_{j′} · g(x_1, ..., x_{t−1}) + q_{j′}) = softmax(P g(x_1, ..., x_{t−1}) + q)_j. p_j ∈ R^m, q_j ∈ R: output word embedding/bias for word j ∈ V. g: composition function.
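
A small NumPy sketch (mine, not the authors') of the output layer above: given the composed history g(x_1, ..., x_{t−1}) ∈ R^m, the next-word distribution is softmax(P g + q). The dimensions are illustrative.

    import numpy as np

    V, m = 10000, 128                 # vocabulary size, hidden size (illustrative)
    P = np.random.randn(V, m) * 0.01  # rows are output word embeddings p_j
    q = np.zeros(V)                   # output biases q_j

    def next_word_distribution(g):
        # softmax(P g + q): probability of each word j given the history summary g
        scores = P @ g + q
        scores -= scores.max()        # subtract the max for numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()

    g = np.random.randn(m)            # stand-in for g(x_1, ..., x_{t-1})
    p_next = next_word_distribution(g)
    assert abs(p_next.sum() - 1.0) < 1e-6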

  5.–8. Feed-forward NLM (Bengio, Ducharme, and Vincent 2003) [architecture figure, shown step by step]

  9. Recurrent Neural Network LM (Mikolov et al. 2011). Maintain a hidden state vector h_t that is recursively calculated: h_t = f(W x_t + U h_{t−1} + b). h_t ∈ R^m: hidden state at time t (summary of the history). W ∈ R^{m×n}: input-to-hidden transformation. U ∈ R^{m×m}: hidden-to-hidden transformation. f(·): non-linearity. Apply the softmax to h_t.
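
A minimal NumPy sketch (illustrative, not the authors' code) of the recurrence h_t = f(W x_t + U h_{t−1} + b) with f = tanh, run over a short sequence of word embeddings:

    import numpy as np

    n, m = 64, 128                    # embedding size, hidden size (illustrative)
    W = np.random.randn(m, n) * 0.01  # input-to-hidden
    U = np.random.randn(m, m) * 0.01  # hidden-to-hidden
    b = np.zeros(m)

    def run_rnn(xs):
        # xs: word embeddings x_t in R^n; returns the hidden states h_1, ..., h_T.
        h, states = np.zeros(m), []
        for x in xs:
            h = np.tanh(W @ x + U @ h + b)   # h_t = f(W x_t + U h_{t-1} + b)
            states.append(h)
        return states

    hs = run_rnn([np.random.randn(n) for _ in range(5)])
    # Each h_t is then fed to the softmax layer from slide 4 to predict the next word.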

  10.–12. Recurrent Neural Network LM (Mikolov et al. 2011) [architecture figure, shown step by step]

  13. Word Embeddings (Collobert et al. 2011; Mikolov et al. 2012). Key ingredient in Neural Language Models. After training, similar words are close in the vector space. (Not unique to NLMs.)
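
One common way to see this closeness is cosine similarity between embedding vectors. A tiny illustrative sketch (my own; the embedding matrix and word indices are made up, and the similarities are only meaningful after training):

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    X = np.random.randn(64, 10000)          # word embedding matrix X (n x |V|), as on slide 4
    w_trade, w_trades, w_cat = 10, 11, 500  # hypothetical word indices
    print(cosine(X[:, w_trade], X[:, w_trades]))  # high for related words after training
    print(cosine(X[:, w_trade], X[:, w_cat]))     # lower for unrelated words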

  14. NLM Performance (on Penn Treebank). Difficult/expensive to train, but performs well.
    Language Model                                   Perplexity
    5-gram count-based (Mikolov and Zweig 2012)      141.2
    RNN (Mikolov and Zweig 2012)                     124.7
    Deep RNN (Pascanu et al. 2013)                   107.5
    LSTM (Zaremba, Sutskever, and Vinyals 2014)       78.4
    Renewed interest in language modeling.
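
For reference, perplexity is the exponential of the average negative log-likelihood per word on held-out text; a quick illustrative computation (not from the slides):

    import numpy as np

    def perplexity(word_probs):
        # word_probs: model probabilities p(w_t | w_1, ..., w_{t-1}) for each held-out word.
        word_probs = np.asarray(word_probs)
        return float(np.exp(-np.mean(np.log(word_probs))))

    # A model that assigns every word probability 1/141.2 has perplexity 141.2.
    print(perplexity([1 / 141.2] * 1000))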

  15. NLM Issue. Issue: The fundamental unit of information is still the word. Separate embeddings for “trading”, “leading”, “training”, etc.

  16. NLM Issue. Issue: The fundamental unit of information is still the word. Separate embeddings for “trading”, “trade”, “trades”, etc.

  17. NLM Issue. No parameter sharing across orthographically similar words. Orthography contains much semantic/syntactic information. How can we leverage subword information for language modeling?

  18. Previous (NLM-based) Work. Use a morphological segmenter as a preprocessing step: unfortunately ⇒ un (PRE) + fortunate (STM) + ly (SUF). Luong, Socher, and Manning 2013: recursive neural network over morpheme embeddings. Botha and Blunsom 2014: sum over word/morpheme embeddings.
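
A rough sketch (my own simplification, not the authors' code) of the Botha and Blunsom 2014 idea above: a word's input representation is its word embedding plus the sum of its morpheme embeddings, given a pre-computed segmentation. The lookup tables here are hypothetical.

    import numpy as np

    n = 64  # embedding size (illustrative)
    word_emb = {"unfortunately": np.random.randn(n)}                        # word table
    morph_emb = {m: np.random.randn(n) for m in ("un", "fortunate", "ly")}  # morpheme table

    def word_vector(word, morphemes):
        # word representation = word embedding + sum of morpheme embeddings
        return word_emb[word] + sum(morph_emb[m] for m in morphemes)

    x = word_vector("unfortunately", ["un", "fortunate", "ly"])  # used as x_t in the NLM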

  19. This Work. Main Idea: No morphology, use characters directly.

  20. This Work. Main Idea: No morphology, use characters directly. Convolutional Neural Networks (CNN) (LeCun et al. 1989): central to deep learning systems in vision; shown to be effective for NLP tasks (Collobert et al. 2011). CNNs in NLP typically involve temporal (rather than spatial) convolutions over words.

  21. Network Architecture: Overview

  22. Character-level CNN (CharCNN)

  23. Character-level CNN (CharCNN). C ∈ R^{d×l}: matrix representation of a word (of length l). H ∈ R^{d×w}: convolutional filter matrix. d: dimensionality of character embeddings (e.g. 15). w: width of the convolution filter (e.g. 1–7).

  24. Character-level CNN (CharCNN). C ∈ R^{d×l}: matrix representation of a word (of length l). H ∈ R^{d×w}: convolutional filter matrix. d: dimensionality of character embeddings (e.g. 15). w: width of the convolution filter (e.g. 1–7). 1. Apply a convolution between C and H to obtain a vector f ∈ R^{l−w+1}: f[i] = ⟨C[∗, i : i+w−1], H⟩, where ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product.

  25. Character-level CNN (CharCNN). C ∈ R^{d×l}: matrix representation of a word (of length l). H ∈ R^{d×w}: convolutional filter matrix. d: dimensionality of character embeddings (e.g. 15). w: width of the convolution filter (e.g. 1–7). 1. Apply a convolution between C and H to obtain a vector f ∈ R^{l−w+1}: f[i] = ⟨C[∗, i : i+w−1], H⟩, where ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product. 2. Take the max-over-time (with bias and nonlinearity) y = tanh(max_i {f[i]} + b) as the feature corresponding to the filter H (for a particular word).
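
A compact NumPy sketch (illustrative, not the released Torch code) of steps 1–2 for a single filter: slide H over the character-embedding matrix C, take the Frobenius inner product at each position, then max-over-time with bias and tanh.

    import numpy as np

    d, l, w = 15, 9, 3         # char-embedding dim, word length, filter width (illustrative)
    C = np.random.randn(d, l)  # columns are the character embeddings of one word
    H = np.random.randn(d, w)  # one convolutional filter matrix
    b = 0.0                    # bias for this filter

    # Step 1: f[i] = <C[:, i:i+w-1], H>, the Frobenius inner product at each position.
    f = np.array([np.sum(C[:, i:i + w] * H) for i in range(l - w + 1)])

    # Step 2: max-over-time, then bias and nonlinearity -> one scalar feature per filter.
    y = np.tanh(f.max() + b)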

  26. Character-level CNN (CharCNN)

  27. Character-level CNN (CharCNN). C ∈ R^{d×l}: representation of the word “absurdity”.

  28. Character-level CNN (CharCNN). H ∈ R^{d×w}: convolutional filter matrix of width w = 3.

  29.–30. Character-level CNN (CharCNN). f[1] = ⟨C[∗, 1 : 3], H⟩

  31. Character-level CNN (CharCNN). f[2] = ⟨C[∗, 2 : 4], H⟩

  32. Character-level CNN (CharCNN). f[l−2] = ⟨C[∗, l−2 : l], H⟩

  33. Character-level CNN (CharCNN). y[1] = max_i {f[i]}

  34. Character-level CNN (CharCNN). Each filter picks out a character n-gram.

  35. Character-level CNN (CharCNN). f′[1] = ⟨C[∗, 1 : 2], H′⟩

  36. Character-level CNN (CharCNN). y[2] = max_i {f′[i]}

  37. Character-level CNN (CharCNN). Many filter matrices (25–200) per width (1–7).

  38. Character-level CNN (CharCNN). Add bias, apply nonlinearity.
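
Putting slides 26–38 together: a sketch of the full CharCNN word encoder with several filter widths, written in PyTorch (the released code is Torch/Lua, so this is my approximation; the character-vocabulary size and filter counts are illustrative). The conv layer's built-in bias is added before the max, which is equivalent to adding it after since it is constant over positions.

    import torch
    import torch.nn as nn

    class CharCNN(nn.Module):
        def __init__(self, num_chars=60, d=15, widths=(1, 2, 3, 4, 5, 6, 7), filters_per_width=25):
            super().__init__()
            self.char_emb = nn.Embedding(num_chars, d)
            # One Conv1d per width; each holds `filters_per_width` filter matrices H.
            self.convs = nn.ModuleList([
                nn.Conv1d(in_channels=d, out_channels=filters_per_width, kernel_size=w)
                for w in widths
            ])

        def forward(self, char_ids):                       # char_ids: (batch, max_word_length)
            C = self.char_emb(char_ids).transpose(1, 2)    # (batch, d, l): columns are char embeddings
            feats = []
            for conv in self.convs:
                f = conv(C)                                # (batch, filters, l - w + 1)
                y = torch.tanh(f.max(dim=2).values)        # max-over-time, then nonlinearity
                feats.append(y)
            return torch.cat(feats, dim=1)                 # word representation for the LSTM LM

    enc = CharCNN()
    word_repr = enc(torch.randint(0, 60, (2, 10)))  # 2 words, up to 10 characters each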

  39. Character-level CNN (CharCNN).
    Before: word embedding input. PTB perplexity: 85.4
    Now: output from the CharCNN. PTB perplexity: 84.6
    The CharCNN is slower, but convolution operations on GPUs are highly optimized. Can we model more complex interactions between the character n-grams picked up by the filters?

  40. Highway Network
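
Slide 40 introduces the highway network (Srivastava, Greff, and Schmidhuber 2015) applied to the CharCNN output: z = t ⊙ g(W_H y + b_H) + (1 − t) ⊙ y, with transform gate t = σ(W_T y + b_T). A minimal PyTorch sketch of one such layer (my own; the dimension is illustrative):

    import torch
    import torch.nn as nn

    class Highway(nn.Module):
        # One highway layer: mixes a nonlinear transform of y with y itself via a learned gate.
        def __init__(self, dim):
            super().__init__()
            self.transform = nn.Linear(dim, dim)  # W_H, b_H
            self.gate = nn.Linear(dim, dim)       # W_T, b_T

        def forward(self, y):
            t = torch.sigmoid(self.gate(y))       # transform gate t
            g = torch.relu(self.transform(y))     # g(W_H y + b_H)
            return t * g + (1 - t) * y            # carry the rest of y through unchanged

    layer = Highway(dim=525)   # e.g. the total number of CharCNN filters (illustrative)
    z = layer(torch.randn(2, 525))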
