Character-Aware Neural Language Models: Yoon Kim, Yacine Jernite (PowerPoint PPT Presentation)



  1. Character-Aware Neural Language Models. Yoon Kim, Yacine Jernite, David Sontag, Alexander Rush. Harvard SEAS / New York University. Code: https://github.com/yoonkim/lstm-char-cnn

  2. Language Model. A language model (LM) is a probability distribution over a sequence of words: p(w_1, ..., w_T) for any sequence of length T from a vocabulary V (with w_i ∈ V for all i). Important for many downstream applications: machine translation, speech recognition, text generation.

  3. Count-based Language Models. By the chain rule, any distribution can be factorized as p(w_1, ..., w_T) = ∏_{t=1}^{T} p(w_t | w_1, ..., w_{t−1}). Count-based n-gram language models make a Markov assumption: p(w_t | w_1, ..., w_{t−1}) ≈ p(w_t | w_{t−n+1}, ..., w_{t−1}). Need smoothing to deal with rare n-grams.
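
A minimal Python sketch (not from the slides) of a count-based bigram model with add-one smoothing, illustrating the Markov assumption and why smoothing is needed for rare n-grams; the toy corpus and function names are purely illustrative:

    from collections import Counter

    # Toy corpus; in practice this is a large tokenized training set.
    corpus = "the cat sat on the mat . the dog sat on the log .".split()
    vocab = set(corpus)

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_add1(word, prev):
        # p(word | prev) under a bigram Markov assumption, with add-one smoothing
        # so that unseen bigrams do not get probability zero.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

    print(p_add1("mat", "the"))   # seen bigram
    print(p_add1("mat", "dog"))   # unseen bigram: nonzero only because of smoothing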

  4. Neural Language Models. Neural Language Models (NLM) represent words as dense vectors in R^n (word embeddings). w_t ∈ R^{|V|}: one-hot representation of the word ∈ V at time t ⇒ x_t = X w_t: word embedding (X ∈ R^{n×|V|}, n < |V|). Train a neural net that composes the history to predict the next word: p(w_t = j | w_1, ..., w_{t−1}) = exp(p_j · g(x_1, ..., x_{t−1}) + q_j) / Σ_{j′∈V} exp(p_{j′} · g(x_1, ..., x_{t−1}) + q_{j′}) = softmax(P g(x_1, ..., x_{t−1}) + q)_j. p_j ∈ R^m, q_j ∈ R: output word embedding/bias for word j ∈ V. g: composition function.
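
A small NumPy sketch (mine, not the authors') of the output layer above: given the composed history g(x_1, ..., x_{t−1}) ∈ R^m, the next-word distribution is softmax(P g + q). The dimensions are illustrative.

    import numpy as np

    V, m = 10000, 128                 # vocabulary size, hidden size (illustrative)
    P = np.random.randn(V, m) * 0.01  # rows are output word embeddings p_j
    q = np.zeros(V)                   # output biases q_j

    def next_word_distribution(g):
        # softmax(P g + q): probability of each word j given the history summary g
        scores = P @ g + q
        scores -= scores.max()        # subtract the max for numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()

    g = np.random.randn(m)            # stand-in for g(x_1, ..., x_{t-1})
    p_next = next_word_distribution(g)
    assert abs(p_next.sum() - 1.0) < 1e-6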

  5.–8. Feed-forward NLM (Bengio, Ducharme, and Vincent 2003) [architecture figure, shown step by step]

  9. Recurrent Neural Network LM (Mikolov et al. 2011). Maintain a hidden state vector h_t that is recursively calculated: h_t = f(W x_t + U h_{t−1} + b). h_t ∈ R^m: hidden state at time t (summary of the history). W ∈ R^{m×n}: input-to-hidden transformation. U ∈ R^{m×m}: hidden-to-hidden transformation. f(·): non-linearity. Apply the softmax to h_t.
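
A minimal NumPy sketch (illustrative, not the authors' code) of the recurrence h_t = f(W x_t + U h_{t−1} + b) with f = tanh, run over a short sequence of word embeddings:

    import numpy as np

    n, m = 64, 128                    # embedding size, hidden size (illustrative)
    W = np.random.randn(m, n) * 0.01  # input-to-hidden
    U = np.random.randn(m, m) * 0.01  # hidden-to-hidden
    b = np.zeros(m)

    def run_rnn(xs):
        # xs: word embeddings x_t in R^n; returns the hidden states h_1, ..., h_T.
        h, states = np.zeros(m), []
        for x in xs:
            h = np.tanh(W @ x + U @ h + b)   # h_t = f(W x_t + U h_{t-1} + b)
            states.append(h)
        return states

    hs = run_rnn([np.random.randn(n) for _ in range(5)])
    # Each h_t is then fed to the softmax layer from slide 4 to predict the next word.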

  10.–12. Recurrent Neural Network LM (Mikolov et al. 2011) [architecture figure, shown step by step]

  13. Word Embeddings (Collobert et al. 2011; Mikolov et al. 2012). Key ingredient in Neural Language Models. After training, similar words are close in the vector space. (Not unique to NLMs.)
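
One common way to see this closeness is cosine similarity between embedding vectors. A tiny illustrative sketch (my own; the embedding matrix and word indices are made up, and the similarities are only meaningful after training):

    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    X = np.random.randn(64, 10000)          # word embedding matrix X (n x |V|), as on slide 4
    w_trade, w_trades, w_cat = 10, 11, 500  # hypothetical word indices
    print(cosine(X[:, w_trade], X[:, w_trades]))  # high for related words after training
    print(cosine(X[:, w_trade], X[:, w_cat]))     # lower for unrelated words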

  14. NLM Performance (on Penn Treebank). Difficult/expensive to train, but performs well.
    Language Model                                   Perplexity
    5-gram count-based (Mikolov and Zweig 2012)      141.2
    RNN (Mikolov and Zweig 2012)                     124.7
    Deep RNN (Pascanu et al. 2013)                   107.5
    LSTM (Zaremba, Sutskever, and Vinyals 2014)       78.4
    Renewed interest in language modeling.
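
For reference, perplexity is the exponential of the average negative log-likelihood per word on held-out text; a quick illustrative computation (not from the slides):

    import numpy as np

    def perplexity(word_probs):
        # word_probs: model probabilities p(w_t | w_1, ..., w_{t-1}) for each held-out word.
        word_probs = np.asarray(word_probs)
        return float(np.exp(-np.mean(np.log(word_probs))))

    # A model that assigns every word probability 1/141.2 has perplexity 141.2.
    print(perplexity([1 / 141.2] * 1000))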

  15. NLM Issue. Issue: The fundamental unit of information is still the word. Separate embeddings for “trading”, “leading”, “training”, etc.

  16. NLM Issue. Issue: The fundamental unit of information is still the word. Separate embeddings for “trading”, “trade”, “trades”, etc.

  17. NLM Issue. No parameter sharing across orthographically similar words. Orthography contains much semantic/syntactic information. How can we leverage subword information for language modeling?

  18. Previous (NLM-based) Work. Use a morphological segmenter as a preprocessing step: unfortunately ⇒ un (PRE) + fortunate (STM) + ly (SUF). Luong, Socher, and Manning 2013: recursive neural network over morpheme embeddings. Botha and Blunsom 2014: sum over word/morpheme embeddings.
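
A rough sketch (my own simplification, not the authors' code) of the Botha and Blunsom 2014 idea above: a word's input representation is its word embedding plus the sum of its morpheme embeddings, given a pre-computed segmentation. The lookup tables here are hypothetical.

    import numpy as np

    n = 64  # embedding size (illustrative)
    word_emb = {"unfortunately": np.random.randn(n)}                        # word table
    morph_emb = {m: np.random.randn(n) for m in ("un", "fortunate", "ly")}  # morpheme table

    def word_vector(word, morphemes):
        # word representation = word embedding + sum of morpheme embeddings
        return word_emb[word] + sum(morph_emb[m] for m in morphemes)

    x = word_vector("unfortunately", ["un", "fortunate", "ly"])  # used as x_t in the NLM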

  19. This Work. Main Idea: No morphology, use characters directly.

  20. This Work. Main Idea: No morphology, use characters directly. Convolutional Neural Networks (CNN) (LeCun et al. 1989): central to deep learning systems in vision; shown to be effective for NLP tasks (Collobert et al. 2011). CNNs in NLP typically involve temporal (rather than spatial) convolutions over words.

  21. Network Architecture: Overview

  22. Character-level CNN (CharCNN)

  23. Character-level CNN (CharCNN). C ∈ R^{d×l}: matrix representation of a word (of length l). H ∈ R^{d×w}: convolutional filter matrix. d: dimensionality of character embeddings (e.g. 15). w: width of the convolution filter (e.g. 1–7).

  24. Character-level CNN (CharCNN). C ∈ R^{d×l}: matrix representation of a word (of length l). H ∈ R^{d×w}: convolutional filter matrix. d: dimensionality of character embeddings (e.g. 15). w: width of the convolution filter (e.g. 1–7). 1. Apply a convolution between C and H to obtain a vector f ∈ R^{l−w+1}: f[i] = ⟨C[∗, i : i+w−1], H⟩, where ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product.

  25. Character-level CNN (CharCNN). C ∈ R^{d×l}: matrix representation of a word (of length l). H ∈ R^{d×w}: convolutional filter matrix. d: dimensionality of character embeddings (e.g. 15). w: width of the convolution filter (e.g. 1–7). 1. Apply a convolution between C and H to obtain a vector f ∈ R^{l−w+1}: f[i] = ⟨C[∗, i : i+w−1], H⟩, where ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product. 2. Take the max-over-time (with bias and nonlinearity) y = tanh(max_i {f[i]} + b) as the feature corresponding to the filter H (for a particular word).
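
A compact NumPy sketch (illustrative, not the released Torch code) of steps 1–2 for a single filter: slide H over the character-embedding matrix C, take the Frobenius inner product at each position, then max-over-time with bias and tanh.

    import numpy as np

    d, l, w = 15, 9, 3         # char-embedding dim, word length, filter width (illustrative)
    C = np.random.randn(d, l)  # columns are the character embeddings of one word
    H = np.random.randn(d, w)  # one convolutional filter matrix
    b = 0.0                    # bias for this filter

    # Step 1: f[i] = <C[:, i:i+w-1], H>, the Frobenius inner product at each position.
    f = np.array([np.sum(C[:, i:i + w] * H) for i in range(l - w + 1)])

    # Step 2: max-over-time, then bias and nonlinearity -> one scalar feature per filter.
    y = np.tanh(f.max() + b)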

  26. Character-level CNN (CharCNN)

  27. Character-level CNN (CharCNN). C ∈ R^{d×l}: representation of the word “absurdity”.

  28. Character-level CNN (CharCNN). H ∈ R^{d×w}: convolutional filter matrix of width w = 3.

  29.–30. Character-level CNN (CharCNN). f[1] = ⟨C[∗, 1 : 3], H⟩

  31. Character-level CNN (CharCNN). f[2] = ⟨C[∗, 2 : 4], H⟩

  32. Character-level CNN (CharCNN). f[l−2] = ⟨C[∗, l−2 : l], H⟩

  33. Character-level CNN (CharCNN). y[1] = max_i {f[i]}

  34. Character-level CNN (CharCNN). Each filter picks out a character n-gram.

  35. Character-level CNN (CharCNN). f′[1] = ⟨C[∗, 1 : 2], H′⟩

  36. Character-level CNN (CharCNN). y[2] = max_i {f′[i]}

  37. Character-level CNN (CharCNN). Many filter matrices (25–200) per width (1–7).

  38. Character-level CNN (CharCNN). Add bias, apply nonlinearity.
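
Putting slides 26–38 together: a sketch of the full CharCNN word encoder with several filter widths, written in PyTorch (the released code is Torch/Lua, so this is my approximation; the character-vocabulary size and filter counts are illustrative). The conv layer's built-in bias is added before the max, which is equivalent to adding it after since it is constant over positions.

    import torch
    import torch.nn as nn

    class CharCNN(nn.Module):
        def __init__(self, num_chars=60, d=15, widths=(1, 2, 3, 4, 5, 6, 7), filters_per_width=25):
            super().__init__()
            self.char_emb = nn.Embedding(num_chars, d)
            # One Conv1d per width; each holds `filters_per_width` filter matrices H.
            self.convs = nn.ModuleList([
                nn.Conv1d(in_channels=d, out_channels=filters_per_width, kernel_size=w)
                for w in widths
            ])

        def forward(self, char_ids):                       # char_ids: (batch, max_word_length)
            C = self.char_emb(char_ids).transpose(1, 2)    # (batch, d, l): columns are char embeddings
            feats = []
            for conv in self.convs:
                f = conv(C)                                # (batch, filters, l - w + 1)
                y = torch.tanh(f.max(dim=2).values)        # max-over-time, then nonlinearity
                feats.append(y)
            return torch.cat(feats, dim=1)                 # word representation for the LSTM LM

    enc = CharCNN()
    word_repr = enc(torch.randint(0, 60, (2, 10)))  # 2 words, up to 10 characters each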

  39. Character-level CNN (CharCNN).
    Before: word embedding input. PTB perplexity: 85.4
    Now: output from the CharCNN. PTB perplexity: 84.6
    The CharCNN is slower, but convolution operations on GPUs are highly optimized. Can we model more complex interactions between the character n-grams picked up by the filters?

  40. Highway Network
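
Slide 40 introduces the highway network (Srivastava, Greff, and Schmidhuber 2015) applied to the CharCNN output: z = t ⊙ g(W_H y + b_H) + (1 − t) ⊙ y, with transform gate t = σ(W_T y + b_T). A minimal PyTorch sketch of one such layer (my own; the dimension is illustrative):

    import torch
    import torch.nn as nn

    class Highway(nn.Module):
        # One highway layer: mixes a nonlinear transform of y with y itself via a learned gate.
        def __init__(self, dim):
            super().__init__()
            self.transform = nn.Linear(dim, dim)  # W_H, b_H
            self.gate = nn.Linear(dim, dim)       # W_T, b_T

        def forward(self, y):
            t = torch.sigmoid(self.gate(y))       # transform gate t
            g = torch.relu(self.transform(y))     # g(W_H y + b_H)
            return t * g + (1 - t) * y            # carry the rest of y through unchanged

    layer = Highway(dim=525)   # e.g. the total number of CharCNN filters (illustrative)
    z = layer(torch.randn(2, 525))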
