Learning Task-specific Bilexical Embeddings
Pranava Madhyastha (1), Xavier Carreras (1,2), Ariadna Quattoni (1,2)
(1) Universitat Politècnica de Catalunya
(2) Xerox Research Centre Europe
Bilexical Relations
◮ Increasing interest in bilexical relations (relations between pairs of words)
◮ Dependency parsing: lexical items (words) connected by binary relations
  [Example dependency tree: "Small birds sing loud songs", with arcs labelled ROOT, SUBJ, OBJ, NMOD, NMOD]
◮ Bilexical predictions can be modelled as Pr(modifier | head)
In Focus: Unseen Words
Adjective-noun relation, where an adjective modifies a noun
  [Example: "Vinyl can be applied to electronic devices and cases", with two candidate NMOD attachments for the adjective]
◮ If one or more of these nouns or adjectives have not been observed in the supervision, estimating Pr(adjective | noun) is hard
◮ Word frequencies follow a Zipf distribution, so many words are rare or unseen in training
◮ Generalisation is a challenge
Distributional Word Space Models
◮ Distributional Hypothesis: linguistic items with similar distributions have similar meanings
  [Concordance excerpt: corpus contexts of the word "moon"]
◮ For every word w we can compute an n-dimensional vector space representation φ(w) ∈ R^n from a large corpus
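A minimal sketch of how such distributional vectors might be built, assuming a plain bag-of-words context window; the actual context definition used in the paper may differ, and all names here are illustrative:

```python
from collections import Counter, defaultdict

def build_cooccurrence_vectors(sentences, vocab, window=2):
    """Build sparse co-occurrence count vectors phi(w) over a fixed context vocabulary.

    sentences: iterable of tokenised sentences (lists of words)
    vocab: list of context words that define the dimensions of the space
    """
    dim = {c: i for i, c in enumerate(vocab)}
    phi = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in dim:
                    phi[w][dim[sent[j]]] += 1
    return phi  # phi[w] maps context-dimension index -> count

# Toy usage:
sents = [["the", "moon", "rises", "full"], ["the", "sun", "rises"]]
vectors = build_cooccurrence_vectors(sents, vocab=["the", "rises", "full"])
```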
Contributions
Formulation of statistical models to improve bilexical prediction tasks
◮ A supervised framework to learn bilexical models over distributional representations, based on learning bilinear forms
◮ Compressed representations obtained by imposing low-rank constraints on the bilinear forms
◮ Lexical embeddings tailored to a specific bilexical task
Overview
Bilexical Models · Low Rank Constraints · Learning · Experiments
Unsupervised Bilexical Models
◮ We can define a simple bilexical model as:
  $$\Pr(m \mid h) = \frac{\exp\{\langle \phi(m), \phi(h) \rangle\}}{\sum_{m'} \exp\{\langle \phi(m'), \phi(h) \rangle\}}$$
  where ⟨φ(x), φ(y)⟩ denotes the inner product.
◮ Problem: designing appropriate distributional contexts for the required relation
◮ Solution: leverage a supervised training corpus
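A minimal sketch of this unsupervised scorer, assuming dense NumPy vectors for φ (function and variable names are illustrative):

```python
import numpy as np

def unsupervised_bilexical_prob(phi_h, phi_candidates):
    """Pr(m | h) proportional to exp(<phi(m), phi(h)>) over a candidate modifier set.

    phi_h: (n,) distributional vector of the head
    phi_candidates: (V, n) matrix whose rows are phi(m') for every candidate modifier
    returns: (V,) vector of probabilities
    """
    scores = phi_candidates @ phi_h        # inner products <phi(m'), phi(h)>
    scores -= scores.max()                 # numerical stability before exponentiating
    probs = np.exp(scores)
    return probs / probs.sum()
```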
Supervised Bilexical Model
◮ We define the bilexical model in a bilinear setting as:
  $$\phi(m)^\top W \,\phi(h)$$
  where φ(m) and φ(h) are n-dimensional representations of m and h, and W ∈ R^{n×n} is a matrix of parameters
Interpreting the Bilinear Model
◮ We can write the bilinear model as:
  $$\sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j}(m,h)\, W_{i,j}, \qquad f_{i,j}(m,h) = \phi(m)[i]\, \phi(h)[j]$$
◮ ⇒ Bilinear models are linear models over an extended feature space (the outer product of the two representations)!
◮ ⇒ We can reuse all the algorithms designed for linear models.
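A quick numerical check of this equivalence (a sketch assuming NumPy; `np.outer` gives the extended feature map f):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
W = rng.normal(size=(n, n))

bilinear = phi_m @ W @ phi_h                   # phi(m)^T W phi(h)
linear = np.sum(np.outer(phi_m, phi_h) * W)    # linear score over the extended features f_{i,j}

assert np.isclose(bilinear, linear)            # the two views give the same score
```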
Using Bilexical Models
◮ We define the bilexical operator as:
  $$\Pr(m \mid h) = \frac{\exp\{\phi(m)^\top W \phi(h)\}}{\sum_{m' \in \mathcal{M}} \exp\{\phi(m')^\top W \phi(h)\}}$$
◮ ⇒ A standard conditional log-linear model
Overview
Bilexical Models · Low Rank Constraints · Learning · Experiments
Rank Constraints
◮ The bilinear score is a vector–matrix–vector product:
  $$\phi(m)^\top W \phi(h) =
  \begin{bmatrix} m_1 & m_2 & \cdots & m_n \end{bmatrix}
  \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}
  \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix}$$
Rank Constraints
◮ Factorizing W with the singular value decomposition:
  $$\phi(m)^\top W \phi(h) = \phi(m)^\top \underbrace{U}_{n \times k}\, \underbrace{\Sigma}_{k \times k}\, \underbrace{V^\top}_{k \times n} \phi(h), \qquad \mathrm{SVD}(W) = U \Sigma V^\top$$
◮ Note: W has rank k
Low Rank Embedding
◮ Regrouping, we get:
  $$\phi(m)^\top W \phi(h) = \underbrace{\big(\phi(m)^\top U\big)}_{1 \times k}\; \Sigma\; \underbrace{\big(V^\top \phi(h)\big)}_{k \times 1}$$
◮ We can see φ(m)^⊤ U as a k-dimensional projection of m and V^⊤ φ(h) as a k-dimensional projection of h (see the sketch after this slide)
◮ ⇒ rank(W) defines the dimensionality of the induced space, hence the embedding
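A minimal NumPy sketch of extracting the task-specific embeddings from a learned W, assuming W is exactly or approximately rank k; variable names are illustrative:

```python
import numpy as np

def low_rank_embeddings(W, k):
    """Split a bilinear parameter matrix W into rank-k modifier/head projection maps."""
    U, S, Vt = np.linalg.svd(W)
    U_k = U[:, :k] * S[:k]   # fold the singular values into the modifier-side projection
    Vt_k = Vt[:k, :]
    return U_k, Vt_k         # phi(m)^T W phi(h) ~= (phi(m)^T U_k) (Vt_k phi(h))

# Usage: project a word's distributional vector into the induced k-dimensional space
# m_embed = phi_m @ U_k ;  h_embed = Vt_k @ phi_h ;  score = m_embed @ h_embed
```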
Computational Properties
◮ In many tasks, given a head, we must rank a huge number of modifiers
◮ Strategy (sketched after this slide):
  ◮ Project each lexical item in the vocabulary into its low-dimensional embedding of size k
  ◮ Compute the bilexical score as a k-dimensional inner product
◮ Substantial computational gain as long as we obtain low-rank models
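A sketch of the scoring strategy, reusing the hypothetical `U_k`/`Vt_k` factors from the previous slide's example:

```python
import numpy as np

def rank_modifiers(phi_h, modifier_matrix, U_k, Vt_k):
    """Rank all candidate modifiers for a given head using k-dimensional embeddings.

    phi_h: (n,) distributional vector of the head
    modifier_matrix: (V, n) matrix of distributional vectors, one row per candidate modifier
    U_k, Vt_k: rank-k factors of W (see previous slide)
    """
    mod_embeds = modifier_matrix @ U_k   # (V, k): precomputable once for the whole vocabulary
    head_embed = Vt_k @ phi_h            # (k,): computed once per head
    scores = mod_embeds @ head_embed     # V scores, each a k-dim inner product instead of n^2 work
    return np.argsort(-scores)           # indices of candidate modifiers, best first
```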
Summary
◮ Induce high-dimensional representations from a huge corpus
◮ Learn embeddings suited to a given task
◮ Our bilexical formulation is, in principle, a linear model, but over an extended feature space
◮ Low-rank bilexical embeddings are computationally efficient
Overview
Bilexical Models · Low Rank Constraints · Learning · Experiments
Formulation
◮ Given:
  ◮ A set of training tuples D = (m_1, h_1), …, (m_l, h_l), where the m are modifiers and the h are heads
  ◮ Distributional representations φ(m) and φ(h) computed over some corpus
◮ We model the task as a conditional log-linear distribution:
  $$\Pr(m \mid h) = \frac{\exp\{\phi(m)^\top W \phi(h)\}}{\sum_{m' \in \mathcal{M}} \exp\{\phi(m')^\top W \phi(h)\}}$$
Learning and Regularization
◮ Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:
  $$\log \Pr(\mathcal{D}) = \sum_{(m,h) \in \mathcal{D}} \Big( \phi(m)^\top W \phi(h) - \log \sum_{m' \in \mathcal{M}} \exp\{\phi(m')^\top W \phi(h)\} \Big)$$
◮ Adding a regularization penalty, our algorithm essentially maximizes:
  $$\sum_{(m,h) \in \mathcal{D}} \log \Pr(m \mid h) - \lambda \|W\|_p$$
◮ Regularization handled with the proximal gradient method (FOBOS):
  ◮ ℓ1 regularization, ‖W‖_1 ⇒ sparse feature space
  ◮ ℓ2 regularization, ‖W‖_2 ⇒ dense parameters
  ◮ Nuclear-norm regularization, ‖W‖_* ⇒ low-rank embedding
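A sketch of the negative log-likelihood term and its gradient for a single training pair, assuming NumPy and a candidate modifier matrix as before; only the likelihood term is differentiated here, since the penalty is handled by the proximal step on the next slide:

```python
import numpy as np

def neg_log_likelihood_and_grad(W, phi_m, phi_h, modifier_matrix):
    """Negative log-likelihood of one training pair (m, h) and its gradient w.r.t. W.

    phi_m, phi_h: (n,) vectors for the observed modifier and head
    modifier_matrix: (V, n) matrix of phi(m') for all candidate modifiers m' in M
    """
    scores = modifier_matrix @ W @ phi_h                  # phi(m')^T W phi(h) for every m'
    log_Z = np.logaddexp.reduce(scores)                   # log sum_m' exp(score), computed stably
    nll = -(phi_m @ W @ phi_h - log_Z)                    # -log Pr(m | h)

    probs = np.exp(scores - log_Z)                        # Pr(m' | h)
    expected = (probs[:, None] * modifier_matrix).sum(0)  # E[phi(m') | h], an (n,) vector
    grad = -np.outer(phi_m - expected, phi_h)             # d(nll)/dW = -(phi(m) - E[phi(m')]) phi(h)^T
    return nll, grad
```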
Algorithm: Proximal Algorithm for Bilexical Operators

while iteration < MaxIteration do
    W_{t+0.5} = W_t − η_t g(W_t)                       // g = gradient of the negative log-likelihood
    // add the regularization penalty via the proximal operator:
    // W_{t+1} = argmin_W ||W_{t+0.5} − W||_2^2 + η_t λ r(W)
    if ℓ1 regularizer then
        W_{t+1}(i,j) = sign(W_{t+0.5}(i,j)) · max(|W_{t+0.5}(i,j)| − η_t λ, 0)   // soft thresholding
    else if ℓ2 regularizer then
        W_{t+1} = W_{t+0.5} / (1 + η_t λ)               // basic scaling
    else if nuclear-norm regularizer then
        U Σ V^⊤ = SVD(W_{t+0.5})
        σ̄_i = max(σ_i − η_t λ, 0)                       // shrink each singular value σ_i of Σ
        W_{t+1} = U Σ̄ V^⊤                               // Σ̄ is diagonal with the shrunk values σ̄_i
end
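A minimal NumPy sketch of the three proximal operators used in the algorithm; this is an illustrative sketch rather than the authors' implementation, with `eta` and `lam` standing for the step size η_t and regularization strength λ:

```python
import numpy as np

def prox_l1(W, eta, lam):
    """Soft thresholding: proximal operator of the l1 norm (yields sparse W)."""
    return np.sign(W) * np.maximum(np.abs(W) - eta * lam, 0.0)

def prox_l2(W, eta, lam):
    """Simple scaling: proximal operator of the squared l2 (Frobenius) norm."""
    return W / (1.0 + eta * lam)

def prox_nuclear(W, eta, lam):
    """Singular-value shrinkage: proximal operator of the nuclear norm (yields low-rank W)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S_shrunk = np.maximum(S - eta * lam, 0.0)
    return (U * S_shrunk) @ Vt

# One proximal-gradient step, given a gradient g of the negative log-likelihood at W_t:
# W_half = W_t - eta * g
# W_next = prox_nuclear(W_half, eta, lam)   # or prox_l1 / prox_l2
```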
Overview
Bilexical Models · Low Rank Constraints · Learning · Experiments