Natural Language Processing (CSE 490U): Neural Language Models
Noah Smith
© 2017 University of Washington
nasmith@cs.washington.edu
January 13–18, 2017
1 / 57
Quick Review
A language model is a probability distribution over V†. Typically p decomposes into probabilities p(x_i | h_i).
◮ n-gram: h_i is the (n − 1) previous symbols; estimate by counting and normalizing (with smoothing)
◮ log-linear: featurized representation of ⟨h_i, x_i⟩; estimate iteratively by gradient descent
Next: neural language models
2 / 57
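As a refresher on the count-and-normalize recipe for n-grams, here is a minimal bigram sketch in Python; the toy corpus, the add-k smoothing constant, and the helper names are illustrative assumptions, not anything from the slides:

```python
from collections import Counter, defaultdict

def bigram_lm(tokens, k=0.1):
    """Estimate p(x_i | x_{i-1}) by counting and normalizing, with add-k smoothing."""
    vocab = set(tokens)
    bigram_counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

    def p(cur, prev):
        num = bigram_counts[prev][cur] + k
        den = sum(bigram_counts[prev].values()) + k * len(vocab)
        return num / den

    return p

# toy usage
p = bigram_lm("the cat sat on the mat".split())
print(p("cat", "the"))
```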
Neural Network: Definitions
Warning: there is no widely accepted standard notation!
A feedforward neural network n_ν is defined by:
◮ A function family that maps parameter values to functions of the form n : R^{d_in} → R^{d_out}; typically:
  ◮ Non-linear
  ◮ Differentiable with respect to its inputs
  ◮ “Assembled” through a series of affine transformations and non-linearities, composed together
  ◮ Symbolic/discrete inputs handled through lookups
◮ Parameter values ν
  ◮ Typically a collection of scalars, vectors, and matrices
  ◮ We often assume they are linearized into R^D
3 / 57
A Couple of Useful Functions
◮ softmax : R^k → R^k
  ⟨x_1, x_2, ..., x_k⟩ ↦ ⟨ e^{x_1} / Σ_{j=1}^k e^{x_j}, e^{x_2} / Σ_{j=1}^k e^{x_j}, ..., e^{x_k} / Σ_{j=1}^k e^{x_j} ⟩
◮ tanh : R → [−1, 1]
  x ↦ (e^x − e^{−x}) / (e^x + e^{−x})
  Generalized to be elementwise, so that it maps R^k → [−1, 1]^k.
◮ Others include: ReLUs, logistic sigmoids, PReLUs, ...
4 / 57
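A minimal NumPy sketch of these two functions; the max-subtraction in softmax is a standard numerical-stability trick, not something the slide specifies:

```python
import numpy as np

def softmax(x):
    """Map R^k -> R^k so the outputs are positive and sum to one."""
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

def tanh(x):
    """Elementwise tanh: maps R^k -> [-1, 1]^k."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

print(softmax(np.array([1.0, 2.0, 3.0])))  # entries sum to 1
print(tanh(np.array([-2.0, 0.0, 2.0])))    # entries lie in [-1, 1]
```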
“One Hot” Vectors
Arbitrarily order the words in V, giving each an index in {1, ..., V}.
Let e_i ∈ R^V contain all zeros, with the exception of a 1 in position i.
This is the “one hot” vector for the i-th word in V.
5 / 57
Feedforward Neural Network Language Model
(Bengio et al., 2003)
Define the n-gram probability as follows:

p(· | ⟨h_1, ..., h_{n−1}⟩) = n_ν(⟨e_{h_1}, ..., e_{h_{n−1}}⟩)
  = softmax( b + Σ_{j=1}^{n−1} e_{h_j}^⊤ M A_j + W tanh( u + Σ_{j=1}^{n−1} e_{h_j}^⊤ M T_j ) )

where each e_{h_j} ∈ R^V is a one-hot vector and H is the number of “hidden units” in the neural network (a “hyperparameter”).
Parameters ν include:
◮ M ∈ R^{V×d}, which are called “embeddings” (row vectors), one for every word in V
◮ Feedforward NN parameters b ∈ R^V, A ∈ R^{(n−1)×d×V}, W ∈ R^{V×H}, u ∈ R^H, T ∈ R^{(n−1)×d×H}
6 / 57
Breaking It Down
Look up each of the history words h_j, ∀j ∈ {1, ..., n − 1}, in M; keep two copies.

e_{h_j}^⊤ M        e_{h_j}^⊤ M        (e_{h_j} ∈ R^V, M ∈ R^{V×d})

7 / 57
Breaking It Down
Look up each of the history words h_j, ∀j ∈ {1, ..., n − 1}, in M; keep two copies. Rename the embedding for h_j as m_{h_j}.

e_{h_j}^⊤ M = m_{h_j}
e_{h_j}^⊤ M = m_{h_j}

8 / 57
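The “lookup” is literal: multiplying a one-hot row vector by M just selects the corresponding row. A tiny check, with dimensions chosen arbitrarily for illustration:

```python
import numpy as np

V, d = 5, 3
M = np.random.randn(V, d)       # embedding matrix, one row per word
i = 2                           # index of some word h_j
e = np.zeros(V); e[i] = 1.0     # one-hot vector e_{h_j}
m = e @ M                       # e_{h_j}^T M
assert np.allclose(m, M[i])     # identical to just reading row i of M
```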
Breaking It Down
Apply an affine transformation to the second copy of the history-word embeddings (u, T).

u + Σ_{j=1}^{n−1} m_{h_j} T_j        (u ∈ R^H, T_j ∈ R^{d×H})

9 / 57
Breaking It Down
Apply an affine transformation to the second copy of the history-word embeddings (u, T) and a tanh nonlinearity.

tanh( u + Σ_{j=1}^{n−1} m_{h_j} T_j )

10 / 57
Breaking It Down
Apply an affine transformation to everything (b, A, W).

b + Σ_{j=1}^{n−1} m_{h_j} A_j + W tanh( u + Σ_{j=1}^{n−1} m_{h_j} T_j )        (b ∈ R^V, A_j ∈ R^{d×V}, W ∈ R^{V×H})

11 / 57
Breaking It Down
Apply a softmax transformation to make the vector sum to one.

softmax( b + Σ_{j=1}^{n−1} m_{h_j} A_j + W tanh( u + Σ_{j=1}^{n−1} m_{h_j} T_j ) )

12 / 57
Breaking It Down

softmax( b + Σ_{j=1}^{n−1} m_{h_j} A_j + W tanh( u + Σ_{j=1}^{n−1} m_{h_j} T_j ) )

Like a log-linear language model with two kinds of features:
◮ Concatenation of context-word embedding vectors m_{h_j}
◮ tanh-affine transformation of the above
New parameters arise from (i) the embeddings and (ii) the affine transformation “inside” the nonlinearity.
13 / 57
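Putting the pieces together, here is a minimal NumPy sketch of one forward pass of this model; the vocabulary size, dimensions, and random parameter values are placeholders for illustration, not those of Bengio et al. (2003):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H, n = 1000, 30, 50, 5            # toy sizes (Bengio et al. used V ~ 18000)

# parameters nu
M = rng.normal(size=(V, d))             # word embeddings, one row per word
b = rng.normal(size=V)
A = rng.normal(size=(n - 1, d, V))      # direct (linear) connections
W = rng.normal(size=(V, H))
u = rng.normal(size=H)
T = rng.normal(size=(n - 1, d, H))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def next_word_distribution(history):
    """history: list of n-1 word indices h_1..h_{n-1}; returns p(. | history) in R^V."""
    m = [M[h] for h in history]                              # m_{h_j} = e_{h_j}^T M
    hidden = np.tanh(u + sum(m[j] @ T[j] for j in range(n - 1)))
    scores = b + sum(m[j] @ A[j] for j in range(n - 1)) + W @ hidden
    return softmax(scores)

p = next_word_distribution([3, 17, 42, 8])                   # arbitrary history indices
print(p.shape, p.sum())                                      # (1000,) ~1.0
```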
Visualization
[Architecture diagram: embeddings looked up in M feed the affine transformation (u, T) and a tanh; its output passes through W and, together with the direct affine terms (b, A), into a softmax over the vocabulary.]
14 / 57
Number of Parameters

D = Vd (M) + V (b) + (n − 1)dV (A) + VH (W) + H (u) + (n − 1)dH (T)

For Bengio et al. (2003):
◮ V ≈ 18000 (after OOV processing)
◮ d ∈ {30, 60}
◮ H ∈ {50, 100}
◮ n − 1 = 5
So (at d = 60 and H = 100) D = 461V + 30100 parameters, compared to O(V^n) for classical n-gram models.
◮ Forcing A = 0 eliminated 300V parameters and performed a bit better, but was slower to converge.
◮ If we averaged the m_{h_j} instead of concatenating, we’d get to 221V + 6100 (this is a variant of “continuous bag of words,” Mikolov et al., 2013).
15 / 57
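To check the arithmetic, a few lines of Python using the values stated above:

```python
def num_params(V, d, H, n_minus_1):
    # M      b    A                 W      u   T
    return V*d + V + n_minus_1*d*V + V*H + H + n_minus_1*d*H

V, d, H = 18000, 60, 100
print(num_params(V, d, H, 5))              # 8328100
print(461*V + 30100)                       # same total: 461V + 30100
# averaging instead of concatenating drops the (n-1) factors:
print(num_params(V, d, H, 1), 221*V + 6100)   # both 3984100
```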
Why does it work?
◮ Historical answer: multiple layers and nonlinearities allow feature combinations a linear model can’t get.
◮ Suppose y = xor(x_1, x_2); this can’t be expressed as a linear function of x_1 and x_2.
16–18 / 57
xor Example
[Plot with axes x_1, x_2, and y. Correct tuples are marked in red; incorrect tuples are marked in blue.]
19 / 57
Why does it work?
◮ Historical answer: multiple layers and nonlinearities allow feature combinations a linear model can’t get.
◮ Suppose y = xor(x_1, x_2); this can’t be expressed as a linear function of x_1 and x_2. But:
  z = x_1 · x_2
  y = x_1 + x_2 − 2z
20 / 57
xor Example (D = 13)
Credit: Chris Dyer (https://github.com/clab/cnn/blob/master/examples/xor.cc)
[Plot: mean squared error decreasing toward zero over roughly 30 iterations of training.]

min_{v, a, W, b}  Σ_{x_1 ∈ {0,1}}  Σ_{x_2 ∈ {0,1}}  ( xor(x_1, x_2) − ( v^⊤ tanh( W [x_1, x_2]^⊤ + b ) + a ) )^2

with W ∈ R^{3×2}, b, v ∈ R^3, a ∈ R (6 + 3 + 3 + 1 = 13 parameters).
21 / 57
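A sketch of the same idea in NumPy rather than the cnn C++ example credited above: a 2–3–1 tanh network with squared error, trained by plain gradient descent. The learning rate, initialization, and iteration count are arbitrary choices for illustration, not taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # all four inputs
y = np.array([0., 1., 1., 0.])                                # xor targets

W = rng.normal(size=(3, 2))   # hidden weights, R^{3x2}
b = np.zeros(3)               # hidden bias, R^3
v = rng.normal(size=3)        # output weights, R^3
a = 0.0                       # output bias -> 6 + 3 + 3 + 1 = 13 parameters
lr = 0.1

for it in range(2000):
    h = np.tanh(X @ W.T + b)                 # hidden activations, shape (4, 3)
    pred = h @ v + a                         # network outputs, shape (4,)
    err = pred - y                           # squared-error residuals

    # gradients of the summed squared error, by hand
    g_pred = 2 * err
    g_v = h.T @ g_pred
    g_a = g_pred.sum()
    g_h = np.outer(g_pred, v) * (1 - h**2)   # back through tanh
    g_W = g_h.T @ X
    g_b = g_h.sum(axis=0)

    W -= lr * g_W; b -= lr * g_b; v -= lr * g_v; a -= lr * g_a

final = np.tanh(X @ W.T + b) @ v + a
print(np.round(final, 2))   # typically close to [0, 1, 1, 0]; a bad seed may need more steps
```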
Why does it work?
◮ Historical answer: multiple layers and nonlinearities allow feature combinations a linear model can’t get.
◮ Suppose y = xor(x_1, x_2); this can’t be expressed as a linear function of x_1 and x_2. But:
  z = x_1 · x_2
  y = x_1 + x_2 − 2z
◮ With high-dimensional inputs, there are a lot of conjunctive features to search through (recall from last time that Della Pietra et al., 1997 did so, greedily).
◮ Neural models seem to smoothly explore lots of approximately-conjunctive features.
◮ Modern answer: representations of words and histories are tuned to the prediction problem.
◮ Word embeddings: a powerful idea ...
22–25 / 57
Important Idea: Words as Vectors
The idea of “embedding” words in R^d is much older than neural language models.
You should think of this as a generalization of the discrete view of V.
◮ Considerable ongoing research on learning word representations to capture linguistic similarity (Turney and Pantel, 2010); this is known as vector space semantics.
26–28 / 57
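One common way to operationalize similarity between word vectors is cosine similarity; a tiny sketch, where the vectors are made-up toy values rather than learned embeddings:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors: 1 = same direction, 0 = orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 3-dimensional "embeddings"
cat   = np.array([0.9, 0.1, 0.3])
mouse = np.array([0.8, 0.2, 0.4])
baby  = np.array([0.1, 0.9, 0.2])

print(cosine(cat, mouse))   # relatively high
print(cosine(cat, baby))    # lower
```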
Words as Vectors: Example
[Two-dimensional plot placing words as points: first “baby” and “cat”; then “pig” and “mouse” added.]
29–30 / 57