

  1. Learning Task-specific Bilexical Embeddings. Pranava Madhyastha (1), Xavier Carreras (1, 2), Ariadna Quattoni (1, 2). (1) Universitat Politècnica de Catalunya, (2) Xerox Research Centre Europe

  2. Bilexical Relations
  ◮ Increasing interest in bilexical relations (relations between pairs of words)
  ◮ Dependency parsing: lexical items (words) connected by binary relations
    [Dependency tree of "Small birds sing loud songs" with labelled arcs ROOT, SUBJ, OBJ, NMOD, NMOD]
  ◮ Bilexical predictions can be modelled as Pr(modifier | head)

  3. In Focus: Unseen Words
  Adjective-noun relation, where an adjective modifies a noun:
    [Example: "Vinyl can be applied to electronic devices and cases", with candidate NMOD arcs for the adjective]
  ◮ If one or more of the nouns or adjectives above have not been observed in the supervision, estimating Pr(adjective | noun) is hard
  ◮ Word frequencies follow a Zipf distribution
  ◮ Generalisation is a challenge

  4. Distributional Word Space Models
  ◮ Distributional Hypothesis: linguistic items with similar distributions have similar meanings
    [Concordance excerpt: corpus contexts surrounding occurrences of the word "moon"]
  ◮ For every word w we can compute an n-dimensional vector-space representation φ(w) ∈ R^n from a large corpus
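A minimal Python sketch of building such distributional vectors from co-occurrence counts; the tiny corpus, window size, and function names are illustrative, not the setup used in the paper:

    from collections import Counter, defaultdict
    import numpy as np

    def build_cooccurrence(sentences, window=2):
        # Map each word to a Counter of context words within a symmetric window.
        contexts = defaultdict(Counter)
        for tokens in sentences:
            for i, w in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        contexts[w][tokens[j]] += 1
        return contexts

    def to_vectors(contexts):
        # Turn co-occurrence counts into dense vectors phi(w) in R^n.
        vocab = sorted({c for ctr in contexts.values() for c in ctr})
        index = {c: k for k, c in enumerate(vocab)}
        phi = {w: np.zeros(len(vocab)) for w in contexts}
        for w, ctr in contexts.items():
            for c, n in ctr.items():
                phi[w][index[c]] = n
        return phi

    sentences = [["the", "moon", "rises", "full"], ["the", "sun", "rises", "early"]]
    phi = to_vectors(build_cooccurrence(sentences))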

  5. Contributions
  Formulation of statistical models to improve bilexical prediction tasks:
  ◮ a supervised framework to learn bilexical models over distributional representations, based on learning bilinear forms
  ◮ compressed representations, obtained by imposing low-rank constraints on the bilinear forms
  ◮ lexical embeddings tailored to a specific bilexical task

  6. Overview
  ◮ Bilexical Models
  ◮ Low Rank Constraints
  ◮ Learning
  ◮ Experiments

  7. Overview
  ◮ Bilexical Models
  ◮ Low Rank Constraints
  ◮ Learning
  ◮ Experiments

  8. Unsupervised Bilexical Models
  ◮ We can define a simple bilexical model as:
      Pr(m | h) = exp{⟨φ(m), φ(h)⟩} / Σ_{m'} exp{⟨φ(m'), φ(h)⟩}
    where ⟨φ(x), φ(y)⟩ denotes the inner product
  ◮ Problem: designing appropriate contexts for the required relations
  ◮ Solution: leverage a supervised training corpus
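A minimal sketch of this unsupervised scorer in Python: Pr(m | h) as a softmax over inner products of distributional vectors. The toy vectors and the small modifier set are placeholders, not real corpus statistics:

    import numpy as np

    phi = {                                   # phi(w): toy 3-dimensional distributional vectors
        "songs": np.array([1.0, 0.2, 0.0]),
        "loud":  np.array([0.9, 0.1, 0.3]),
        "small": np.array([0.1, 1.0, 0.2]),
    }
    modifiers = ["loud", "small"]

    def prob_modifier_given_head(m, h):
        scores = np.array([phi[m2] @ phi[h] for m2 in modifiers])
        scores -= scores.max()                # stabilise the softmax numerically
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[modifiers.index(m)]

    print(prob_modifier_given_head("loud", "songs"))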

  9. Supervised Bilexical Model
  ◮ We define the bilexical model in a bilinear setting as:
      φ(m)^⊤ W φ(h)
    where φ(m) and φ(h) are n-dimensional representations of m and h, and W ∈ R^{n×n} is a matrix of parameters

  10. Interpreting the Bilinear Models
  ◮ If we write the bilinear model as:
      Σ_{i=1}^{n} Σ_{j=1}^{n} f_{i,j}(m, h) · W_{i,j},  with f_{i,j}(m, h) = φ(m)[i] · φ(h)[j]
  ◮ ⇒ bilinear models are linear models, with an extended feature space!
  ◮ ⇒ we can re-use all the algorithms designed for linear models.
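A quick numerical check of this equivalence: the bilinear score equals a linear score over the flattened outer-product feature vector. The random vectors and matrix are only for the check:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
    W = rng.normal(size=(n, n))

    bilinear = phi_m @ W @ phi_h                          # phi(m)^T W phi(h)
    linear = np.outer(phi_m, phi_h).ravel() @ W.ravel()   # <f(m,h), vec(W)>
    assert np.isclose(bilinear, linear)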

  11. Using Bilexical Models
  ◮ We define the bilexical operator as:
      Pr(m | h) = exp{φ(m)^⊤ W φ(h)} / Σ_{m'∈M} exp{φ(m')^⊤ W φ(h)}
  ◮ ⇒ a standard conditional log-linear model

  12. Overview
  ◮ Bilexical Models
  ◮ Low Rank Constraints
  ◮ Learning
  ◮ Experiments

  13. Rank Constraints
  The bilexical score written out as a full matrix product:
      φ(m)^⊤ W φ(h) = [m_1 m_2 ... m_n] × [w_11 ... w_1n; ...; w_n1 ... w_nn] × [h_1; h_2; ...; h_n]
    (row vector φ(m)^⊤, full n×n parameter matrix W, column vector φ(h))

  14. Rank Constraints
  ◮ Factorizing W with the singular value decomposition:
      SVD(W) = U Σ V^⊤,  with U ∈ R^{n×k}, Σ = diag(σ_1, ..., σ_k), V^⊤ ∈ R^{k×n}
    so the score becomes φ(m)^⊤ U Σ V^⊤ φ(h)
  ◮ Please note: W has rank k
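A small numpy sketch that builds a rank-k version of W by truncating its SVD and checks the resulting rank; the size of W and its random entries are illustrative:

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 8, 3
    W = rng.normal(size=(n, n))

    U, S, Vt = np.linalg.svd(W)
    W_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation of W
    print(np.linalg.matrix_rank(W_k))             # -> 3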

  15. Low Rank Embedding
  ◮ Regrouping, we get:
      φ(m)^⊤ W φ(h) = (φ(m)^⊤ U) Σ (V^⊤ φ(h))
  ◮ We can see φ(m)^⊤ U as a projection of m, and V^⊤ φ(h) as a projection of h
  ◮ ⇒ rank(W) defines the dimensionality of the induced space, hence the embedding
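A quick check of the regrouping claim: with a rank-k W = U Σ V^⊤, the n-dimensional bilinear score equals a k-dimensional product of the projected vectors. The random factors are only for the numerical check:

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 8, 3
    U = rng.normal(size=(n, k))
    S = np.abs(rng.normal(size=k))
    Vt = rng.normal(size=(k, n))
    W = U @ np.diag(S) @ Vt                    # rank-k parameter matrix
    phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)

    full_score = phi_m @ W @ phi_h             # n-dimensional bilinear score
    m_proj = phi_m @ U                         # projection of m, shape (k,)
    h_proj = Vt @ phi_h                        # projection of h, shape (k,)
    assert np.isclose(full_score, m_proj @ np.diag(S) @ h_proj)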

  16. Computational Properties
  ◮ In many tasks, given a head, we must rank a huge number of modifiers
  ◮ Strategy:
    ◮ project each lexical item in the vocabulary into its low-dimensional embedding of size k
    ◮ compute the bilexical score as a k-dimensional inner product
  ◮ Substantial computational gain, as long as we obtain low-rank models
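A sketch of this precomputation strategy: project every candidate modifier once with U (scaled by Σ), project the head with V^⊤, and rank all modifiers with a single k-dimensional matrix-vector product. Matrix sizes and the random data are illustrative:

    import numpy as np

    rng = np.random.default_rng(2)
    n, k, vocab = 100, 10, 5000
    U = rng.normal(size=(n, k))
    S = np.abs(rng.normal(size=k))
    Vt = rng.normal(size=(k, n))
    Phi_m = rng.normal(size=(vocab, n))   # one row per candidate modifier
    phi_h = rng.normal(size=n)            # the head

    M_emb = (Phi_m @ U) * S               # precomputed rank-k modifier embeddings, shape (vocab, k)
    h_emb = Vt @ phi_h                    # rank-k head embedding, shape (k,)

    scores = M_emb @ h_emb                # all phi(m)^T W phi(h) scores at once
    top = np.argsort(-scores)[:5]         # indices of the top-ranked modifiers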

  17. Summary
  ◮ Induce high-dimensional representations from a huge corpus
  ◮ Learn embeddings suited to a given task
  ◮ Our bilexical formulation is, in principle, a linear model over an extended feature space
  ◮ The low-rank bilexical embedding is computationally efficient

  18. Overview
  ◮ Bilexical Models
  ◮ Low Rank Constraints
  ◮ Learning
  ◮ Experiments

  19. Formulation
  ◮ Given:
    ◮ a set of training tuples D = (m_1, h_1), ..., (m_l, h_l), where the m are modifiers and the h are heads
    ◮ distributional representations φ(m) and φ(h), computed over some corpus
  ◮ We set it as a conditional log-linear distribution:
      Pr(m | h) = exp{φ(m)^⊤ W φ(h)} / Σ_{m'∈M} exp{φ(m')^⊤ W φ(h)}

  20. Learning and Regularization
  ◮ Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:
      log Pr(D) = Σ_{(m,h)∈D} [ φ(m)^⊤ W φ(h) − log Σ_{m'∈M} exp{φ(m')^⊤ W φ(h)} ]
  ◮ Adding a regularization penalty, our algorithm essentially minimizes:
      − Σ_{(m,h)∈D} log Pr(m | h) + λ ‖W‖_p
  ◮ Regularization using the proximal gradient method (FOBOS):
    ◮ ℓ1 regularization, ‖W‖_1 ⇒ sparse feature space
    ◮ ℓ2 regularization, ‖W‖_2 ⇒ dense parameters
    ◮ nuclear-norm (ℓ*) regularization, ‖W‖_* ⇒ low-rank embedding
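A sketch of the negative log-likelihood and its gradient in W for this model, over a small dense setup; the data layout and helper name are illustrative:

    import numpy as np

    def neg_log_likelihood_and_grad(W, pairs, Phi_m, Phi_h):
        # pairs: list of (modifier_index, head_index); Phi_m, Phi_h: rows are phi vectors.
        nll, grad = 0.0, np.zeros_like(W)
        for mi, hi in pairs:
            h = Phi_h[hi]
            scores = Phi_m @ W @ h                     # score of every candidate modifier
            scores -= scores.max()                     # numerical stability
            log_z = np.log(np.exp(scores).sum())
            probs = np.exp(scores - log_z)
            nll -= scores[mi] - log_z
            # gradient: expected minus observed outer-product features
            grad += np.outer((probs[:, None] * Phi_m).sum(axis=0), h)
            grad -= np.outer(Phi_m[mi], h)
        return nll, grad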

  21. Algorithm: Proximal Algorithm for Bilexical Operators
  while iteration < MaxIteration do
      W_{t+0.5} = W_t − η_t g(W_t)          // g: gradient of the negative log-likelihood
      // add the regularization penalty via the proximal operator:
      //   W_{t+1} = argmin_W ‖W_{t+0.5} − W‖_2^2 + η_t λ r(W)
      if ℓ1 regularizer then
          W_{t+1}(i,j) = sign(W_{t+0.5}(i,j)) · max(|W_{t+0.5}(i,j)| − η_t λ, 0)   // soft thresholding
      else if ℓ2 regularizer then
          W_{t+1} = W_{t+0.5} / (1 + η_t λ)                                        // scaling
      else if nuclear-norm regularizer then
          compute the SVD  W_{t+0.5} = U Σ V^⊤
          σ'_i = max(σ_i − η_t λ, 0)  for each singular value σ_i                  // singular-value thresholding
          W_{t+1} = U Σ' V^⊤  with Σ' = diag(σ'_1, ..., σ'_k)
  end
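A compact numpy version of one such proximal (FOBOS) step, with the three proximal operators from the slide; the gradient would come from the objective above, and eta, lam are the step size and regularization strength:

    import numpy as np

    def fobos_step(W, grad, eta, lam, reg="nuclear"):
        W_half = W - eta * grad                       # gradient step on the negative log-likelihood
        if reg == "l1":                               # soft thresholding -> sparse W
            return np.sign(W_half) * np.maximum(np.abs(W_half) - eta * lam, 0.0)
        if reg == "l2":                               # scaling -> dense W
            return W_half / (1.0 + eta * lam)
        if reg == "nuclear":                          # singular-value thresholding -> low-rank W
            U, S, Vt = np.linalg.svd(W_half, full_matrices=False)
            S = np.maximum(S - eta * lam, 0.0)
            return (U * S) @ Vt
        raise ValueError(reg)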

  22. Overview
  ◮ Bilexical Models
  ◮ Low Rank Constraints
  ◮ Learning
  ◮ Experiments
