Learning Sentence Embeddings through Tensor Methods

1. Learning Sentence Embeddings through Tensor Methods. Anima Anandkumar. Joint work with Dr. Furong Huang. ACL Workshop 2016.

2. Representations for Text Understanding. Example sentences: "The weather is good." "Her life spanned years of incredible change for women." "Mary lived through an era of liberating reform for women." (Figure: word embeddings of single words such as "tree", "soccer", "football" vs. word-sequence embeddings of whole sentences.) Word embeddings: incorporate short-range relationships, easy to train. Sentence embeddings: incorporate long-range relationships, hard to train.

3. Various Frameworks for Sentence Embeddings. Compositional models (M. Iyyer et al. '15, T. Kenter '16): composition of word embedding vectors, usually simple averaging; the compositional operator (the averaging weights) is based on neural nets; weakly supervised (only the averaging weights are trained from labels) or strongly supervised (joint training). Paragraph Vector (Q. V. Le & T. Mikolov '14): augmented representation of paragraph + word embeddings; supervised framework to train the paragraph vector. For both frameworks: Pros: simple and cheap to train; can use existing word embeddings. Cons: word order not incorporated; supervised; not universal.
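
To make the compositional (averaging) baseline concrete, here is a minimal sketch in Python; the function name, lookup table, and toy vectors are hypothetical and not taken from any of the cited systems:

```python
import numpy as np

def average_embedding(sentence, word_vectors, weights=None):
    """Compositional baseline: a sentence embedding as a (weighted) average
    of the embeddings of its words. `word_vectors` maps word -> np.ndarray."""
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    vecs = np.stack([word_vectors[t] for t in tokens])
    if weights is None:
        return vecs.mean(axis=0)                      # plain averaging
    w = np.array([weights.get(t, 1.0) for t in tokens])
    return (w[:, None] * vecs).sum(axis=0) / w.sum()  # learned averaging weights

# usage with a toy lookup table
word_vectors = {"the": np.ones(3), "weather": np.array([0.5, 0.2, 0.1]),
                "is": np.zeros(3), "good": np.array([0.9, 0.1, 0.4])}
emb = average_embedding("The weather is good", word_vectors)
```

Note that word order plays no role here, which is exactly the limitation listed above.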

4. Skip-thought Vectors for Sentence Embeddings. Learn a sentence embedding based on the joint probability of words, represented using an RNN.

5. Skip-thought Vectors for Sentence Embeddings. Learn a sentence embedding based on the joint probability of words, represented using an RNN. Pros: incorporates word order, unsupervised, universal. Cons: requires contiguous long text and lots of data, slow to train, and cannot use domain-specific training. R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, S. Fidler, "Skip-Thought Vectors," NIPS 2015.
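
For intuition, a bare-bones RNN sentence encoder of the kind skip-thought builds on; this is a sketch in PyTorch with illustrative dimensions, and it omits the decoders that predict the surrounding sentences (which are what make skip-thought training unsupervised):

```python
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """GRU encoder: the final hidden state is used as the sentence embedding."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        _, h = self.gru(self.embed(token_ids))
        return h.squeeze(0)                  # (batch, hidden_dim)
```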

6. Convolutional Models for Sentence Embeddings (N. Kalchbrenner, E. Grefenstette, P. Blunsom '14). (Figure: a sample sentence is encoded as a word-encoding matrix that preserves word order; convolution with filter maps yields activation maps, followed by max-k pooling and a label layer.)

7. Convolutional Models for Sentence Embeddings (N. Kalchbrenner, E. Grefenstette, P. Blunsom '14). (Figure: word encoding, convolution with filter maps, max-k pooling, label layer, as on the previous slide.) Pros: incorporates word order; detects polysemy. Cons: supervised training; not universal.
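
A rough sketch of the forward pass described here: row-wise 1-D convolution of the word-encoding matrix with filter maps, followed by k-max pooling. The supervised label layer is omitted and all names are illustrative:

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """Keep the k largest activations in each row, preserving their original order."""
    idx = np.sort(np.argsort(feature_map, axis=1)[:, -k:], axis=1)
    return np.take_along_axis(feature_map, idx, axis=1)

def conv_layer(word_encoding, filters, k=3):
    """word_encoding: (d, n) matrix of a sentence, one column per word (order kept).
    filters: list of (d, m) filter maps. Returns one pooled activation map per filter."""
    maps = []
    for f in filters:
        rows = [np.convolve(word_encoding[r], f[r], mode="full")   # 1-D conv per row
                for r in range(f.shape[0])]
        maps.append(k_max_pooling(np.vstack(rows), k))
    return maps
```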

8. Convolutional Models for Sentence Embeddings (F. Huang & A. Anandkumar '15). (Figure: the word-encoding matrix of a sample sentence is modeled as a sum of convolutions, each a filter map convolved with an activation map; max-k pooling and a label layer follow.)

9. Convolutional Models for Sentence Embeddings (F. Huang & A. Anandkumar '15). (Figure: same architecture as the previous slide.) Pros: word order, polysemy, unsupervised, universal. Cons: difficult to train.

10. Intuition behind the Convolutional Model. Shift invariance is natural in images: image templates appear in different locations. (Figure: dictionary elements and an image.)

11. Intuition behind the Convolutional Model. Shift invariance is natural in images: image templates appear in different locations. (Figure: dictionary elements and an image.) Shift invariance in language: phrase templates appear in different parts of the sentence.

12. Learning Convolutional Dictionary Models. x = f_1 ∗ w_1 + f_2 ∗ w_2: input x, phrase templates (filters) f_1, f_2, activations w_1, w_2.

13. Learning Convolutional Dictionary Models. x = f_1 ∗ w_1 + f_2 ∗ w_2: input x, phrase templates (filters) f_1, f_2, activations w_1, w_2. Training objective: min_{f_i, w_i} ‖x − Σ_i f_i ∗ w_i‖₂².

14. Learning Convolutional Dictionary Models. x = f_1 ∗ w_1 + f_2 ∗ w_2: input x, phrase templates (filters) f_1, f_2, activations w_1, w_2. Training objective: min_{f_i, w_i} ‖x − Σ_i f_i ∗ w_i‖₂². Challenges: Nonconvex optimization: no guaranteed solution in general. Alternating minimization: fix the w_i's to update the f_i's and vice versa; not guaranteed to reach the global optimum (or even a stationary point!). Expensive in the large-sample regime: requires repeated updating of the w_i's.
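
For reference, a bare-bones alternating-minimization baseline for this objective, i.e. the procedure the slide warns carries no global guarantees. This is a sketch only, assuming circular convolution, filters zero-padded to the signal length, and made-up names:

```python
import numpy as np
from scipy.linalg import circulant

def alt_min_conv_dictionary(x, L, filt_len, n_iters=50, seed=0):
    """Alternating least-squares updates for x ~ sum_i f_i (*) w_i (circular convolution)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    f = rng.standard_normal((L, n)) * (np.arange(n) < filt_len)  # short filters, zero-padded
    w = rng.standard_normal((L, n))
    for _ in range(n_iters):
        # fix the filters, solve jointly for the activations (least squares)
        A = np.hstack([circulant(f[i]) for i in range(L)])
        w = np.linalg.lstsq(A, x, rcond=None)[0].reshape(L, n)
        # fix the activations, solve for the filters (convolution is commutative)
        B = np.hstack([circulant(w[i]) for i in range(L)])
        f = np.linalg.lstsq(B, x, rcond=None)[0].reshape(L, n)
        f *= (np.arange(n) < filt_len)                            # keep filters short
    return f, w
```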

15. Convex vs. Non-convex Optimization. Guarantees exist mostly for convex problems, but non-convex is trending! Images taken from https://www.facebook.com/nonconvex

16. Convex vs. Non-convex Optimization. Convex: a unique optimum, global = local. Non-convex: multiple local optima. Are there guaranteed approaches for reaching the global optimum?

17. Non-convex Optimization in High Dimensions. Critical/stationary points: x such that ∇_x f(x) = 0. Curse of dimensionality: exponentially many critical points. Saddle points slow down progress. Lack of stopping criteria for local search methods. (Figure: local maxima, saddle points, local minima.) Fast escape from saddle points in high dimensions?

18. Outline: 1. Introduction, 2. Why Tensors?, 3. Tensor Decomposition Methods, 4. Other Applications, 5. Conclusion.

19. Example: Discovering Latent Factors. A list of scores for students (Alice, Bob, Carol, Dave, Eve) in different tests (Classics, Physics, Music, Math). Learn hidden factors for Verbal and Mathematical Intelligence [C. Spearman 1904]. Score(student, test) = student_verbal-intlg × test_verbal + student_math-intlg × test_math.

20. Matrix Decomposition: Discovering Latent Factors. (Figure: the students × tests score matrix decomposes as the sum of a Verbal and a Math rank-one component.) Identifying hidden factors influencing the observations. Characterized as matrix decomposition.

21. Matrix Decomposition: Discovering Latent Factors. (Figure: the same score matrix admits more than one decomposition into rank-one components.) Decomposition is not necessarily unique. Decomposition cannot be overcomplete.
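
A tiny numerical illustration of the non-uniqueness, with toy random factors: any invertible 2×2 mixing R turns one exact decomposition into another.

```python
import numpy as np

rng = np.random.default_rng(3)
u1, v1, u2, v2 = rng.random(5), rng.random(4), rng.random(5), rng.random(4)
M = np.outer(u1, v1) + np.outer(u2, v2)          # rank-2 "score" matrix

# The same M also factors through any invertible 2x2 mixing R:
U, V = np.column_stack([u1, u2]), np.column_stack([v1, v2])
R = np.array([[1.0, 1.0], [0.0, 1.0]])
assert np.allclose(M, (U @ R) @ (V @ np.linalg.inv(R).T).T)
```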

22. Tensor: Shared Matrix Decomposition. (Figure: the Oral and Written score matrices share the same Verbal and Math rank-one components, with different scaling factors.) Shared decomposition with different scaling factors. Combine matrix slices as a tensor.

23. Tensor Decomposition. (Figure: the students × tests × (written/oral) score tensor decomposes into a Verbal and a Math rank-one component.) Outer product notation: T = u ⊗ v ⊗ w + ũ ⊗ ṽ ⊗ w̃, i.e. T_{i1,i2,i3} = u_{i1} · v_{i2} · w_{i3} + ũ_{i1} · ṽ_{i2} · w̃_{i3}.
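
In code, the outer-product notation looks like this, with toy random factors standing in for the Verbal and Math components:

```python
import numpy as np

def rank_one(u, v, w):
    """Outer product u (x) v (x) w: a third-order tensor with T[i,j,k] = u[i]*v[j]*w[k]."""
    return np.einsum("i,j,k->ijk", u, v, w)

rng = np.random.default_rng(0)
u, v, w = rng.random(5), rng.random(4), rng.random(2)       # "Verbal" component
ut, vt, wt = rng.random(5), rng.random(4), rng.random(2)    # "Math" component
T = rank_one(u, v, w) + rank_one(ut, vt, wt)                # students x tests x (written/oral)
```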

24. Identifiability under Tensor Decomposition. T = v_1^{⊗3} + v_2^{⊗3} + · · · . Uniqueness of Tensor Decomposition [J. Kruskal 1977]: the decomposition above is unique when the rank-one components are linearly independent. Matrix case: unique only when the rank-one components are orthogonal.

25. Identifiability under Tensor Decomposition. T = v_1^{⊗3} + v_2^{⊗3} + · · · . Uniqueness of Tensor Decomposition [J. Kruskal 1977]: the decomposition above is unique when the rank-one components are linearly independent. Matrix case: unique only when the rank-one components are orthogonal. (Figure: the two components λ_1 a_1 and λ_2 a_2.)

27. Moment-based Estimation. Matrix: pairwise moments. E[x ⊗ x] ∈ R^{d×d} is a second order tensor: E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices: E[x ⊗ x] = E[x x^⊤]; M = u u^⊤ is rank-1 and M_{i,j} = u_i u_j. Tensor: higher order moments. E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third order tensor: E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}]; T = u ⊗ u ⊗ u is rank-1 and T_{i,j,k} = u_i u_j u_k.
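
Empirically these moments are just averages of outer products of the samples; a minimal sketch, assuming the samples are stacked as rows of a matrix X:

```python
import numpy as np

def empirical_moments(X):
    """X: (n_samples, d). Returns estimates of E[x (x) x] (d x d) and E[x (x) x (x) x] (d x d x d)."""
    M2 = np.einsum("ni,nj->ij", X, X) / len(X)
    M3 = np.einsum("ni,nj,nk->ijk", X, X, X) / len(X)
    return M2, M3
```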

28. Moment Forms for Linear Dictionary Models. (Figure: the linear dictionary model.)

29. Moment Forms for Linear Dictionary Models. (Figure: the linear dictionary model.) Independent component analysis (ICA): independent coefficients, e.g. Bernoulli-Gaussian. Can be relaxed to sparse coefficients with limited dependency. Fourth order cumulant: M_4 = Σ_{j ∈ [k]} κ_j a_j ⊗ a_j ⊗ a_j ⊗ a_j.
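
A sketch of how that fourth-order cumulant tensor can be estimated from zero-mean samples; this is the direct, memory-heavy construction, shown only to spell out the formula (the components a_j would then be recovered by decomposing the tensor):

```python
import numpy as np

def fourth_order_cumulant(X):
    """Empirical 4th-order cumulant for zero-mean samples X: (n, d).
    kappa_{ijkl} = E[x_i x_j x_k x_l] - E[x_i x_j]E[x_k x_l]
                   - E[x_i x_k]E[x_j x_l] - E[x_i x_l]E[x_j x_k]."""
    n = len(X)
    M2 = X.T @ X / n
    M4 = np.einsum("ni,nj,nk,nl->ijkl", X, X, X, X) / n
    return (M4 - np.einsum("ij,kl->ijkl", M2, M2)
               - np.einsum("ik,jl->ijkl", M2, M2)
               - np.einsum("il,jk->ijkl", M2, M2))
```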

30. Convolutional Dictionary Model. (a) Convolutional model: x = f_1 ∗ w_1 + · · · + f_L ∗ w_L. (b) Reformulated model: x = Σ_i f_i ∗ w_i = Σ_i Cir(f_i) w_i = F* w*.
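
A quick numerical check of the reformulation x = Σ_i Cir(f_i) w_i = F* w*, with toy sizes; scipy's circulant builds Cir(f_i), and the circular convolution is computed via the FFT:

```python
import numpy as np
from scipy.linalg import circulant

n, L = 8, 2
rng = np.random.default_rng(1)
f = rng.standard_normal((L, n))   # filters (already zero-padded to length n)
w = rng.standard_normal((L, n))   # activations

# x = sum_i f_i (*) w_i  ==  sum_i Cir(f_i) w_i  ==  F* w*
x_conv = sum(np.real(np.fft.ifft(np.fft.fft(f[i]) * np.fft.fft(w[i]))) for i in range(L))
F_star = np.hstack([circulant(f[i]) for i in range(L)])   # stacked circulant matrices
assert np.allclose(x_conv, F_star @ w.reshape(-1))
```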

31. Moment Forms and Optimization. x = Σ_i f_i ∗ w_i = Σ_i Cir(f_i) w_i = F* w*. Assume the coefficients w_i are independent (convolutional ICA model). The cumulant tensor then has a decomposition whose components are the columns F*_i: M_3 = (F*_1)^{⊗3} + shift(F*_1)^{⊗3} + · · · + (F*_2)^{⊗3} + shift(F*_2)^{⊗3} + · · · . Learning the convolutional model through tensor decomposition.

32. Outline: 1. Introduction, 2. Why Tensors?, 3. Tensor Decomposition Methods, 4. Other Applications, 5. Conclusion.

33. Notion of Tensor Contraction. Extends the notion of the matrix product. Matrix product: Mv = Σ_j v_j M_j (a combination of the columns M_j). Tensor contraction: T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:}.
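
Both operations are plain index contractions; in numpy they can be written with einsum (toy data):

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
M = rng.standard_normal((d, d))
T = rng.standard_normal((d, d, d))
u, v = rng.standard_normal(d), rng.standard_normal(d)

Mv = np.einsum("ij,j->i", M, v)        # matrix-vector product: sum_j M_ij v_j
t = np.einsum("ijk,i,j->k", T, u, v)   # T(u, v, .): contract two modes, leave the third free
```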

34. Tensor Decomposition - ALS. Objective: min_{a_i, b_i, c_i} ‖T − Σ_i a_i ⊗ b_i ⊗ c_i‖₂². (Figure: a third-order tensor with modes i1, i2, i3 written as a sum of rank-one terms.)

35. Tensor Decomposition - ALS. Objective: min_{a_i, b_i, c_i} ‖T − Σ_i a_i ⊗ b_i ⊗ c_i‖₂². Key observation: if the b_i's and c_i's are fixed, the objective is linear in the a_i's. (Figure: the same rank-one expansion of the tensor.)

36. Tensor Decomposition - ALS. Objective: min_{a_i, b_i, c_i} ‖T − Σ_i a_i ⊗ b_i ⊗ c_i‖₂². Key observation: if the b_i's and c_i's are fixed, the objective is linear in the a_i's. Tensor unfolding. (Figure: the same rank-one expansion of the tensor.)

37. Tensor Decomposition - ALS. Objective: min_{a_i, b_i, c_i} ‖T − Σ_i a_i ⊗ b_i ⊗ c_i‖₂². Key observation: if the b_i's and c_i's are fixed, the objective is linear in the a_i's. Tensor unfolding. (Figure: the tensor unfolded into a matrix along one mode.)
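
A bare-bones numpy sketch of the ALS updates described on these slides, using the unfolding/Khatri-Rao identity; no normalization or convergence checks, and no guarantees (cf. the challenges discussed earlier):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the remaining modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def kr(X, Y):
    """Column-wise Khatri-Rao product, matching the unfolding convention above."""
    return np.einsum("ir,jr->ijr", X, Y).reshape(-1, X.shape[1])

def als_cp(T, rank, n_iters=100, seed=0):
    """Plain ALS for a rank-`rank` CP decomposition T ~ sum_i a_i (x) b_i (x) c_i."""
    rng = np.random.default_rng(seed)
    d1, d2, d3 = T.shape
    A = rng.standard_normal((d1, rank))
    B = rng.standard_normal((d2, rank))
    C = rng.standard_normal((d3, rank))
    for _ in range(n_iters):
        # fixing two factor sets makes the objective linear in the third,
        # so each update is a least-squares solve against a mode unfolding
        A = np.linalg.lstsq(kr(B, C), unfold(T, 0).T, rcond=None)[0].T
        B = np.linalg.lstsq(kr(A, C), unfold(T, 1).T, rcond=None)[0].T
        C = np.linalg.lstsq(kr(A, B), unfold(T, 2).T, rcond=None)[0].T
    return A, B, C
```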
