  1. Word Embeddings in Feedforward Networks; Tagging and Dependency Parsing using Feedforward Networks Michael Collins, Columbia University

  2. Overview ◮ Introduction ◮ Multi-layer feedforward networks ◮ Representing words as vectors (“word embeddings”) ◮ The dependency parsing problem ◮ Dependency parsing using a shift-reduce neural-network model

  3. Multi-Layer Feedforward Networks
  ◮ An integer $d$ specifying the input dimension. A set $\mathcal{Y}$ of output labels with $|\mathcal{Y}| = K$.
  ◮ An integer $J$ specifying the number of hidden layers in the network.
  ◮ An integer $m_j$ for $j \in \{1 \ldots J\}$ specifying the number of hidden units in the $j$'th layer.
  ◮ A matrix $W^1 \in \mathbb{R}^{m_1 \times d}$ and a vector $b^1 \in \mathbb{R}^{m_1}$ associated with the first layer.
  ◮ For each $j \in \{2 \ldots J\}$, a matrix $W^j \in \mathbb{R}^{m_j \times m_{j-1}}$ and a vector $b^j \in \mathbb{R}^{m_j}$ associated with the $j$'th layer.
  ◮ For each $j \in \{1 \ldots J\}$, a transfer function $g^j : \mathbb{R}^{m_j} \rightarrow \mathbb{R}^{m_j}$ associated with the $j$'th layer.
  ◮ A matrix $V \in \mathbb{R}^{K \times m_J}$ and a vector $\gamma \in \mathbb{R}^K$ specifying the parameters of the output layer.

  4. Multi-Layer Feedforward Networks (continued)
  ◮ Calculate the output of the first layer:
  $z^1 \in \mathbb{R}^{m_1} = W^1 x^i + b^1, \quad h^1 \in \mathbb{R}^{m_1} = g^1(z^1)$
  ◮ Calculate the outputs of layers $2 \ldots J$: for $j = 2 \ldots J$,
  $z^j \in \mathbb{R}^{m_j} = W^j h^{j-1} + b^j, \quad h^j \in \mathbb{R}^{m_j} = g^j(z^j)$
  ◮ Calculate the output value:
  $l \in \mathbb{R}^K = V h^J + \gamma, \quad q \in \mathbb{R}^K = \mathrm{LS}(l), \quad o \in \mathbb{R} = -\log q_{y^i}$
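
  As a concrete illustration of slides 3 and 4, here is a minimal NumPy sketch of the forward computation. It is a sketch only: the layer sizes, the tanh transfer function, and the log_softmax helper standing in for LS are illustrative assumptions, not part of the slides.

```python
import numpy as np

def log_softmax(l):
    # LS(l): numerically stable log-softmax of a score vector l.
    m = np.max(l)
    return l - m - np.log(np.sum(np.exp(l - m)))

def forward(x, y, W, b, V, gamma, g=np.tanh):
    """Forward pass of a J-layer feedforward network.
    W and b are lists of per-layer matrices/vectors; g is the transfer function."""
    h = x                                # h^0 = x^i
    for W_j, b_j in zip(W, b):           # layers j = 1 .. J
        z = W_j @ h + b_j                # z^j = W^j h^{j-1} + b^j
        h = g(z)                         # h^j = g^j(z^j)
    l = V @ h + gamma                    # l = V h^J + gamma, in R^K
    q = log_softmax(l)                   # q = LS(l)
    return -q[y]                         # o = -log q_{y^i}

# Tiny example: d = 4 inputs, hidden layers of sizes 5 and 3, K = 2 labels.
rng = np.random.default_rng(0)
sizes = [4, 5, 3]
W = [rng.normal(size=(m, d)) for d, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]
V, gamma = rng.normal(size=(2, sizes[-1])), np.zeros(2)
print(forward(rng.normal(size=4), 1, W, b, V, gamma))
```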

  5. Overview ◮ Introduction ◮ Multi-layer feedforward networks ◮ Representing words as vectors (“word embeddings”) ◮ The dependency parsing problem ◮ Dependency parsing using a shift-reduce neural-network model

  6. An Example: Part-of-Speech Tagging
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
  • There are many possible tags in the position ??: {NN, NNS, Vt, Vi, IN, DT, ...}
  • The task: model the distribution $p(t_j \mid t_1, \ldots, t_{j-1}, w_1 \ldots w_n)$, where $t_j$ is the $j$'th tag in the sequence and $w_j$ is the $j$'th word
  • The input to the neural network will be $\langle t_1 \ldots t_{j-1}, w_1 \ldots w_n, j \rangle$
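
  For concreteness, one plausible Python representation of that input tuple, using the example sentence above; the list-based encoding and the 1-based position j are assumptions for illustration.

```python
# Words of the example sentence and the tags predicted so far (t_1 ... t_{j-1}).
words = ["Hispaniola", "quickly", "became", "an", "important", "base", "from",
         "which", "Spain", "expanded", "its", "empire", "into", "the", "rest",
         "of", "the", "Western", "Hemisphere", "."]
tags_so_far = ["NNP", "RB", "VB", "DT", "JJ"]
j = 6                                    # 1-based position of "base", the word being tagged

# The network input <t_1 ... t_{j-1}, w_1 ... w_n, j>:
x = (tags_so_far, words, j)
print(words[j - 1], tags_so_far[-2:])    # "base" and the previous two tags (t_{j-2}, t_{j-1})
```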

  7. One-Hot Encodings of Words, Tags etc.
  ◮ A dictionary $D$ with size $s(D)$ maps each word $w$ in the vocabulary to an integer $\mathrm{Index}(w, D)$ in the range $1 \ldots s(D)$:
  $\mathrm{Index}(\textrm{the}, D) = 1, \quad \mathrm{Index}(\textrm{dog}, D) = 2, \quad \mathrm{Index}(\textrm{cat}, D) = 3, \quad \mathrm{Index}(\textrm{saw}, D) = 4, \ldots$
  ◮ For any word $w$ and dictionary $D$, $\mathrm{Onehot}(w, D)$ maps $w$ to a "one-hot vector" $u = \mathrm{Onehot}(w, D) \in \mathbb{R}^{s(D)}$. We have $u_j = 1$ for $j = \mathrm{Index}(w, D)$ and $u_j = 0$ otherwise.

  8. One-Hot Encodings of Words, Tags etc. (continued)
  ◮ A dictionary $D$ with size $s(D)$ maps each word $w$ in the vocabulary to an integer in the range $1 \ldots s(D)$:
  $\mathrm{Index}(\textrm{the}, D) = 1, \quad \mathrm{Index}(\textrm{dog}, D) = 2, \quad \mathrm{Index}(\textrm{cat}, D) = 3, \ldots$
  $\mathrm{Onehot}(\textrm{the}, D) = [1, 0, 0, \ldots], \quad \mathrm{Onehot}(\textrm{dog}, D) = [0, 1, 0, \ldots], \quad \mathrm{Onehot}(\textrm{cat}, D) = [0, 0, 1, \ldots]$
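
  A minimal sketch of Index and Onehot, assuming the dictionary is simply a Python dict from word to a 1-based integer as on the slides.

```python
import numpy as np

D = {"the": 1, "dog": 2, "cat": 3, "saw": 4}   # a toy dictionary; s(D) = len(D)

def index(w, D):
    return D[w]                                # Index(w, D), in the range 1 .. s(D)

def onehot(w, D):
    u = np.zeros(len(D))                       # u in R^{s(D)}
    u[index(w, D) - 1] = 1.0                   # u_j = 1 for j = Index(w, D), 0 otherwise
    return u

print(onehot("dog", D))                        # [0. 1. 0. 0.]
```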

  9. The Concatenation Operation
  ◮ Given column vectors $v^i \in \mathbb{R}^{d_i}$ for $i = 1 \ldots n$,
  $z \in \mathbb{R}^d = \mathrm{Concat}(v^1, v^2, \ldots, v^n)$ where $d = \sum_{i=1}^n d_i$
  ◮ $z$ is a vector formed by concatenating the vectors $v^1 \ldots v^n$
  ◮ $z$ is a column vector of dimension $\sum_i d_i$

  10. The Concatenation Operation (continued)
  ◮ Given vectors $v^i \in \mathbb{R}^{d_i}$ for $i = 1 \ldots n$,
  $z \in \mathbb{R}^d = \mathrm{Concat}(v^1, v^2, \ldots, v^n)$ where $d = \sum_{i=1}^n d_i$
  ◮ The Jacobians $\frac{\partial z}{\partial v^i} \in \mathbb{R}^{d \times d_i}$ have entries
  $\left[\frac{\partial z}{\partial v^i}\right]_{j,k} = 1$ if $j = k + \sum_{i' < i} d_{i'}$, $\quad \left[\frac{\partial z}{\partial v^i}\right]_{j,k} = 0$ otherwise
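
  A sketch of Concat and a direct check of the Jacobian structure above; np.concatenate stands in for Concat, and the code's 0-based indexing replaces the slides' 1-based indexing.

```python
import numpy as np

def concat_jacobian(dims, i):
    """Jacobian dz/dv^i for z = Concat(v^1, ..., v^n), as a d x d_i matrix.
    Entry (j, k) is 1 exactly when j = k + sum of the dimensions before v^i."""
    d, offset = sum(dims), sum(dims[:i])
    J = np.zeros((d, dims[i]))
    for k in range(dims[i]):
        J[offset + k, k] = 1.0
    return J

rng = np.random.default_rng(0)
dims = [2, 3, 4]
vs = [rng.normal(size=d) for d in dims]
z = np.concatenate(vs)                   # z in R^d with d = 2 + 3 + 4 = 9

J2 = concat_jacobian(dims, 1)            # Jacobian with respect to v^2
assert np.allclose(J2 @ vs[1], np.concatenate([np.zeros(2), vs[1], np.zeros(4)]))
```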

  11. A Single-Layer Computational Network for Tagging
  Inputs: A training example $x^i = \langle t_1 \ldots t_{j-1}, w_1 \ldots w_n, j \rangle$, $y^i \in \mathcal{Y}$. A word dictionary $D$ with size $s(D)$, a tag dictionary $T$ with size $s(T)$. Parameters of a single-layer feedforward network.
  Computational Graph:
  $t'_{-2} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-2}, T)$
  $t'_{-1} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-1}, T)$
  $w'_{-1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j-1}, D)$
  $w'_{0} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j}, D)$
  $w'_{+1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j+1}, D)$
  $u \in \mathbb{R}^{2s(T) + 3s(D)} = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$
  $z = Wu + b, \quad h = g(z), \quad l = Vh + \gamma, \quad q = \mathrm{LS}(l), \quad o = -\log q_{y^i}$
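
  A sketch of this computational graph with one-hot inputs; the toy sizes, random parameters, and tanh transfer function g are assumptions for illustration.

```python
import numpy as np

def onehot(idx, size):
    u = np.zeros(size)
    u[idx] = 1.0
    return u

def log_softmax(l):
    m = np.max(l)
    return l - m - np.log(np.sum(np.exp(l - m)))

# Toy sizes: s(T) tags, s(D) words, m hidden units, K output labels.
sT, sD, m, K = 5, 8, 10, 5
rng = np.random.default_rng(0)
W, b = rng.normal(size=(m, 2 * sT + 3 * sD)), np.zeros(m)
V, gamma = rng.normal(size=(K, m)), np.zeros(K)

# 0-based indices of t_{j-2}, t_{j-1}, w_{j-1}, w_j, w_{j+1}, and the gold label y^i.
t_jm2, t_jm1, w_jm1, w_j, w_jp1, y = 1, 3, 0, 4, 7, 2

u = np.concatenate([onehot(t_jm2, sT), onehot(t_jm1, sT),
                    onehot(w_jm1, sD), onehot(w_j, sD), onehot(w_jp1, sD)])
z = W @ u + b            # z = Wu + b
h = np.tanh(z)           # h = g(z)
l = V @ h + gamma        # l = Vh + gamma
q = log_softmax(l)       # q = LS(l)
o = -q[y]                # o = -log q_{y^i}
print(o)
```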

  12. The Number of Parameters
  $t'_{-2} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-2}, T)$
  ...
  $w'_{+1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j+1}, D)$
  $u = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$
  $z \in \mathbb{R}^m = Wu + b$
  ...
  ◮ An example: $s(T) = 50$ (50 tags), $s(D) = 10{,}000$ (10,000 words), $m = 1000$ (1000 neurons in the single layer)
  ◮ Then $W \in \mathbb{R}^{m \times (2s(T) + 3s(D))}$ with $m = 1000$ and $2s(T) + 3s(D) = 30{,}100$, so there are $m \times (2s(T) + 3s(D)) = 30{,}100{,}000$ parameters in the matrix $W$
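
  The parameter count can be checked directly (same numbers as on the slide):

```python
s_T, s_D, m = 50, 10_000, 1000
input_dim = 2 * s_T + 3 * s_D      # 2 s(T) + 3 s(D) = 30,100
print(input_dim, m * input_dim)    # 30100 30100000 parameters in W
```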

  13. An Example
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
  $t'_{-2} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-2}, T)$
  $t'_{-1} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-1}, T)$
  $w'_{-1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j-1}, D)$
  $w'_{0} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j}, D)$
  $w'_{+1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j+1}, D)$
  $u = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$
  ...

  14. Embedding Matrices
  ◮ Given a word $w$ and a word dictionary $D$, we can map $w$ to a one-hot representation $w' \in \mathbb{R}^{s(D) \times 1} = \mathrm{Onehot}(w, D)$
  ◮ Now assume we have an embedding dictionary $E \in \mathbb{R}^{e \times s(D)}$ where $e$ is some integer. Typical values of $e$ are $e = 100$ or $e = 200$
  ◮ We can now map the one-hot representation $w'$ to
  $\underbrace{w''}_{e \times 1} = \underbrace{E}_{e \times s(D)} \underbrace{w'}_{s(D) \times 1} = E \times \mathrm{Onehot}(w, D)$
  ◮ Equivalently, a word $w$ is mapped to the vector $E(:, j) \in \mathbb{R}^e$, where $j = \mathrm{Index}(w, D)$ is the integer that word $w$ is mapped to and $E(:, j)$ is the $j$'th column of the matrix.
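
  A quick sketch verifying that multiplying E by a one-hot vector selects column Index(w, D) of E; the sizes are illustrative.

```python
import numpy as np

e, sD = 4, 6                              # embedding size e and dictionary size s(D)
rng = np.random.default_rng(0)
E = rng.normal(size=(e, sD))              # embedding matrix E in R^{e x s(D)}

j = 3                                     # j = Index(w, D), 1-based as on the slide
w_onehot = np.zeros(sD)
w_onehot[j - 1] = 1.0                     # Onehot(w, D)

w_emb = E @ w_onehot                      # w'' = E x Onehot(w, D)
assert np.allclose(w_emb, E[:, j - 1])    # equals the j'th column E(:, j)
print(w_emb)
```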

  15. Embedding Matrices vs. One-hot Vectors
  ◮ One-hot representation: $w' \in \mathbb{R}^{s(D) \times 1} = \mathrm{Onehot}(w, D)$. This representation is high-dimensional and sparse
  ◮ Embedding representation: $\underbrace{w''}_{e \times 1} = \underbrace{E}_{e \times s(D)} \underbrace{w'}_{s(D) \times 1} = E \times \mathrm{Onehot}(w, D)$. This representation is low-dimensional and dense
  ◮ The embedding matrices can be learned using stochastic gradient descent and backpropagation (each entry of $E$ is a new parameter in the model)
  ◮ Critically, embeddings allow information to be shared between words: e.g., words with similar meaning or syntax get mapped to "similar" embeddings

  16. A Single-Layer Computational Network for Tagging
  Inputs: A training example $x^i = \langle t_1 \ldots t_{j-1}, w_1 \ldots w_n, j \rangle$, $y^i \in \mathcal{Y}$. A word dictionary $D$ with size $s(D)$, a tag dictionary $T$ with size $s(T)$. A word embedding matrix $E \in \mathbb{R}^{e \times s(D)}$. A tag embedding matrix $A \in \mathbb{R}^{a \times s(T)}$. Parameters of a single-layer feedforward network.
  Computational Graph:
  $t'_{-2} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-2}, T)$
  $t'_{-1} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-1}, T)$
  $w'_{-1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j-1}, D)$
  $w'_{0} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j}, D)$
  $w'_{+1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j+1}, D)$
  $u \in \mathbb{R}^{2a + 3e} = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$
  $z = Wu + b, \quad h = g(z), \quad l = Vh + \gamma, \quad q = \mathrm{LS}(l), \quad o = -\log q_{y^i}$
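
  The same graph as the earlier one-hot sketch, but with the one-hot vectors replaced by lookups into assumed embedding matrices A and E; note that E × Onehot(w, D) is implemented as column selection.

```python
import numpy as np

def log_softmax(l):
    m = np.max(l)
    return l - m - np.log(np.sum(np.exp(l - m)))

# Toy sizes: s(T), s(D), embedding sizes a and e, m hidden units, K labels.
sT, sD, a, e, m, K = 5, 8, 3, 4, 10, 5
rng = np.random.default_rng(0)
A = rng.normal(size=(a, sT))                 # tag embedding matrix, a x s(T)
E = rng.normal(size=(e, sD))                 # word embedding matrix, e x s(D)
W, b = rng.normal(size=(m, 2 * a + 3 * e)), np.zeros(m)
V, gamma = rng.normal(size=(K, m)), np.zeros(K)

t_jm2, t_jm1, w_jm1, w_j, w_jp1, y = 1, 3, 0, 4, 7, 2   # 0-based indices

# Embedding lookups: E @ Onehot(w, D) is just selecting a column of E.
u = np.concatenate([A[:, t_jm2], A[:, t_jm1],
                    E[:, w_jm1], E[:, w_j], E[:, w_jp1]])   # u in R^{2a+3e}
z = W @ u + b
h = np.tanh(z)
l = V @ h + gamma
q = log_softmax(l)
o = -q[y]
print(o)
```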

  17. An Example
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .
  $t'_{-2} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-2}, T)$
  $t'_{-1} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-1}, T)$
  $w'_{-1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j-1}, D)$
  $w'_{0} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j}, D)$
  $w'_{+1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j+1}, D)$
  $u \in \mathbb{R}^{2a + 3e} = \mathrm{Concat}(t'_{-2}, t'_{-1}, w'_{-1}, w'_{0}, w'_{+1})$

  18. Calculating Jacobians
  $w'_{0} \in \mathbb{R}^e = E \times \mathrm{Onehot}(w, D)$
  Equivalently: $(w'_{0})_j = \sum_k E_{j,k} \times \mathrm{Onehot}_k(w, D)$
  ◮ Need to calculate the Jacobian $\frac{\partial w'_{0}}{\partial E}$. This has entries
  $\left[\frac{\partial w'_{0}}{\partial E}\right]_{j,(j',k)} = 1$ if $j = j'$ and $\mathrm{Onehot}_k(w, D) = 1$, $\quad 0$ otherwise
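
  A sketch that builds this Jacobian with the (j', k) index pair flattened row-major, then checks it against the (linear) map from E to E × Onehot(w, D); the sizes are illustrative.

```python
import numpy as np

e, sD = 3, 5
rng = np.random.default_rng(0)
E = rng.normal(size=(e, sD))

k_star = 2                                   # position where Onehot(w, D) is 1 (0-based)
onehot_w = np.zeros(sD)
onehot_w[k_star] = 1.0

# Jacobian of w'_0 = E @ onehot_w with respect to E, with (j', k) flattened row-major:
# entry [j, j' * sD + k] = 1 iff j == j' and onehot_w[k] == 1.
Jac = np.zeros((e, e * sD))
for j in range(e):
    Jac[j, j * sD + k_star] = 1.0

# Check: the Jacobian times a flattened perturbation of E gives the change in w'_0.
dE = rng.normal(size=(e, sD))
lhs = Jac @ dE.ravel()                       # linearised change
rhs = (E + dE) @ onehot_w - E @ onehot_w     # exact change (the map is linear in E)
assert np.allclose(lhs, rhs)
```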

  19. An Additional Perspective
  The network with embedded inputs (slide 16):
  $t'_{-2} \in \mathbb{R}^{a} = A \times \mathrm{Onehot}(t_{j-2}, T)$, ..., $w'_{+1} \in \mathbb{R}^{e} = E \times \mathrm{Onehot}(w_{j+1}, D)$
  $u = \mathrm{Concat}(t'_{-2} \ldots w'_{+1}), \quad z \in \mathbb{R}^m = Wu + b$
  The network with one-hot inputs (slide 11), written with barred variables:
  $\bar{t}'_{-2} \in \mathbb{R}^{s(T)} = \mathrm{Onehot}(t_{j-2}, T)$, ..., $\bar{w}'_{+1} \in \mathbb{R}^{s(D)} = \mathrm{Onehot}(w_{j+1}, D)$
  $\bar{u} = \mathrm{Concat}(\bar{t}'_{-2} \ldots \bar{w}'_{+1}), \quad \bar{z} \in \mathbb{R}^m = \bar{W}\bar{u} + b$
  ◮ If we set
  $\underbrace{\bar{W}}_{m \times (2s(T)+3s(D))} = \underbrace{W}_{m \times (2a+3e)} \times \underbrace{\mathrm{Diag}(A, A, E, E, E)}_{(2a+3e) \times (2s(T)+3s(D))}$
  then $Wu + b = \bar{W}\bar{u} + b$, hence $z = \bar{z}$

  20. An Additional Perspective (continued)
  ◮ If we set
  $\underbrace{\bar{W}}_{m \times (2s(T)+3s(D))} = \underbrace{W}_{m \times (2a+3e)} \times \underbrace{\mathrm{Diag}(A, A, E, E, E)}_{(2a+3e) \times (2s(T)+3s(D))}$
  then $Wu + b = \bar{W}\bar{u} + b$, hence $z = \bar{z}$
  ◮ An example: $s(T) = 50$ (50 tags), $s(D) = 10{,}000$ (10,000 words), $a = e = 100$ (recall $a$, $e$ are the sizes of the embeddings for tags and words respectively), $m = 1000$ (1000 neurons)
  ◮ Then we have parameters
  $\underbrace{\bar{W}}_{1000 \times 30{,}100}$ vs. $\underbrace{W}_{1000 \times 500}, \quad \underbrace{A}_{100 \times 50}, \quad \underbrace{E}_{100 \times 10{,}000}$
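
  A numerical check of the equivalence on slides 19 and 20, at scaled-down sizes, followed by the parameter counts from this slide; scipy.linalg.block_diag plays the role of Diag(A, A, E, E, E), and the toy indices are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import block_diag

# Scaled-down sizes (the slide uses s(T)=50, s(D)=10,000, a=e=100, m=1000).
sT, sD, a, e, m = 6, 9, 3, 4, 5
rng = np.random.default_rng(0)
A = rng.normal(size=(a, sT))
E = rng.normal(size=(e, sD))
W = rng.normal(size=(m, 2 * a + 3 * e))      # embedding-model matrix
b = rng.normal(size=m)

def onehot(idx, size):
    u = np.zeros(size)
    u[idx] = 1.0
    return u

# One-hot inputs for t_{j-2}, t_{j-1}, w_{j-1}, w_j, w_{j+1}.
onehots = [onehot(1, sT), onehot(4, sT), onehot(0, sD), onehot(3, sD), onehot(8, sD)]

u_bar = np.concatenate(onehots)                                        # concatenated one-hots
u = np.concatenate([M @ v for M, v in zip([A, A, E, E, E], onehots)])  # concatenated embeddings

W_bar = W @ block_diag(A, A, E, E, E)        # W-bar = W x Diag(A, A, E, E, E)
assert np.allclose(W @ u + b, W_bar @ u_bar + b)                       # hence z = z-bar

# Parameter counts from slide 20: W-bar alone vs. W, A, E in the embedding model.
sT, sD, a, e, m = 50, 10_000, 100, 100, 1000
print(m * (2 * sT + 3 * sD), m * (2 * a + 3 * e) + a * sT + e * sD)
# 30,100,000 vs. 500,000 + 5,000 + 1,000,000 = 1,505,000
```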
