Tractable Semi-Supervised Learning of Complex Structured Prediction Models

Kai-Wei Chang
University of Illinois at Urbana-Champaign
(Work conducted while interning at Microsoft)

Joint work with Sundararajan S (Microsoft Research) and Sathiya Keerthi S (Microsoft CISL)

September 24, 2013
Structured Prediction Problems (examples)

Sequence learning (e.g., input: a sentence; output: POS tags)
  The  President  came  to  the  office
  DT   N          V     P   DT   N

Multi-label classification (e.g., a document belongs to more than one class - finance, politics)
  Object → {cl_1, cl_2, ..., cl_K}

In this paper, we consider general structures. Characteristics:
◮ Exponential number of output combinations for a given input (e.g., 2^K in a K-output multi-label classification problem)
◮ Label dependency across the outputs
Semi-supervised Learning (SSL)

Manual labeling is expensive; unlabeled data is freely available (e.g., web pages, emails).

Additional domain knowledge or side information is often available:
◮ Label distribution in the unlabeled data (e.g., 80% positive examples)
◮ Label correlation (e.g., in a multi-label classification problem)

For SSL, we need an inference engine that can handle domain constraints.

Make use of unlabeled data with domain knowledge or side information to constrain the solution space ⇒ improved performance.
SSL of Complex Structured Prediction Models

Most works assume that the output structure is simple (e.g., Dhillon et al. 12, Chang et al. 12)
⇒ cannot handle problems with complex structure

Contributions: we propose an approximate semi-supervised learning algorithm that
◮ uses piecewise training for estimating the model weights
◮ uses a dual decomposition method for the inference problem
⇒ extends SSL to general structured prediction problems

Our inference engine can be applied to various SSL frameworks.
Outline

Background
◮ Semi-supervised Learning Problem Setting
◮ Decomposable Scoring Function

Semi-supervised Learning for Structured Predictions
◮ Composite Likelihood - approximate learning
◮ Dual Decomposition Method - approximate inference (with constraints)

Experimental Results

Conclusion
SSL Problem

Input space X, output space Y
A small set of labeled examples X_L = {x_i}_{i=1}^n, Y_L = {y_i}_{i=1}^n
A large set of unlabeled examples X_U = {x_j}_{j=1}^m
Domain knowledge or a set of constraints C

Learning problem: learn a scoring function
  s(x, y; θ) = θ · f(x, y)
where θ denotes the model parameters and f(·) is the feature function.

Inference problem:
  y* = argmax_{y ∈ Y} s(x, y; θ)
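To make these two problems concrete, here is a minimal Python sketch (names such as feat are hypothetical, and the brute-force enumeration is viable only for a handful of binary outputs):

    import itertools
    import numpy as np

    def score(x, y, theta, feat):
        # s(x, y; theta) = theta . f(x, y)
        return float(np.dot(theta, feat(x, y)))

    def infer(x, theta, feat, K):
        # y* = argmax_y s(x, y; theta), by brute-force enumeration of
        # all 2^K binary label vectors -- feasible only for tiny K,
        # shown here just to make the inference problem concrete.
        best_y, best_s = None, -np.inf
        for y in itertools.product([-1, +1], repeat=K):
            s = score(x, y, theta, feat)
            if s > best_s:
                best_y, best_s = y, s
        return best_y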
SSL Problem (2)

(Exact) likelihood model (using the scoring function s(·)):
  p(y | x; θ) = exp(s(x, y; θ)) / Σ_{y'} exp(s(x, y'; θ))

Supervised learning:
  max_θ S(θ) = R(θ) + L(Y_L; X_L, θ)

Regularization: R(θ) = −||θ||² / (2σ²)   (σ²: regularization parameter)

Log-likelihood function:
  L(Y; X, θ) = (1/n) log p(Y | X; θ) = (1/n) Σ_i log p(y_i | x_i; θ)

Semi-supervised learning:
  max_{θ, Y_U} S(θ) + L(Y_U; X_U, θ)   s.t. label constraints on Y_U
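A minimal sketch of the exact log-likelihood for binary label vectors (hypothetical feat helper as before; the explicit 2^K enumeration of the partition function is precisely the cost that motivates the approximations below):

    import itertools
    import numpy as np

    def log_likelihood(X, Y, theta, feat, K):
        # L(Y; X, theta) = (1/n) sum_i log p(y_i | x_i; theta).
        # The partition function sums over all 2^K labelings --
        # tractable only for tiny K.
        total = 0.0
        for x, y in zip(X, Y):
            scores = np.array([np.dot(theta, feat(x, yp))
                               for yp in itertools.product([-1, +1], repeat=K)])
            log_Z = np.log(np.exp(scores).sum())  # prefer scipy's logsumexp
            total += np.dot(theta, feat(x, y)) - log_Z
        return total / len(X)

For the semi-supervised objective, a common strategy is to alternate between optimizing θ with Y_U fixed and inferring the constrained labels Y_U with θ fixed.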
Decomposable Scoring Function

Learning the probabilistic model is intractable (except for simple models):
  p(y | x; θ) = exp(s(x, y; θ)) / Σ_{y'} exp(s(x, y'; θ))
The partition function sums over an exponential number of label combinations.

The inference involved in SSL is also intractable: the number of output combinations is exponentially large.

Decomposable scoring function s(·):
  s(y; x, θ) = Σ_c φ_c(y_{π_c})
where c indexes a component.
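In code, decomposability just means the score is a sum of cheap local terms; a sketch with a hypothetical list of (phi_c, pi_c) pairs:

    def decomposed_score(x, y, components):
        # s(y; x, theta) = sum_c phi_c(y_{pi_c}): each component c
        # scores only the label sub-vector indexed by pi_c, so no
        # single term ever touches the full exponential label space.
        total = 0.0
        for phi_c, pi_c in components:
            y_c = tuple(y[p] for p in pi_c)
            total += phi_c(x, y_c)
        return total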
Decomposable Scoring Function (2)

Decomposable scoring function s(·):
  s(y; x, θ) = Σ_c φ_c(y_{π_c})
where c indexes a component.

Can we use a simplified likelihood model to learn the model parameters efficiently?
◮ Composite likelihood approach - compose the likelihood from the likelihoods of the individual components

Can we use popular decomposition methods for solving inference problems with domain constraints efficiently (e.g., dual decomposition)? A sketch of the idea follows below.
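A minimal subgradient dual decomposition sketch for two components that must agree on the same K binary labels (all names hypothetical; the per-component solvers argmax_a and argmax_b are assumed cheap, e.g., dynamic programming over a tree):

    import numpy as np

    def dual_decomposition(argmax_a, argmax_b, K, iters=100, eta=1.0):
        # Lagrangian relaxation: component a solves
        #   max_y s_a(y) + lmbd . y,   component b solves
        #   max_y s_b(y) - lmbd . y,
        # and the subgradient update on lmbd pushes their
        # solutions toward agreement.
        lmbd = np.zeros(K)
        ya = None
        for t in range(iters):
            ya = np.array(argmax_a(+lmbd))
            yb = np.array(argmax_b(-lmbd))
            if np.array_equal(ya, yb):
                return ya            # agreement certifies an exact solution
            lmbd -= (eta / np.sqrt(t + 1.0)) * (ya - yb)
        return ya                    # heuristic fall-back if no agreement

In the constrained SSL setting, domain constraints can, for example, be folded into one of the subproblems or added as an extra component with its own Lagrange multipliers.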
Composite Likelihood

Composite (log-)likelihood:
  L̃(y; x, θ) = Σ_c L_c(y_{π_c}; x, θ) = Σ_c φ_c(y_{π_c}) − Σ_c log Z_c
where π_c ⊂ {1, ..., N} is an index set associated with component c.

Key: the partition function of each component is easy to compute.

Example: let y = (y_1, y_2, ..., y_K), y_k ∈ {+, −}, and decompose the likelihood function using K spanning trees (each involving all variables y):
◮ Score of each tree:
  φ_k(y_{π_k}) = (1/K) Σ_p θ_p(y_p) · x + (1/2) Σ_{q ≠ k} θ_{kq}(y_{kq}) · x
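A sketch of the composite likelihood for small components (the same hypothetical (phi_c, pi_c) pairs as before; Z_c is enumerated here for clarity, while for tree-structured components like the spanning trees above one would compute log Z_c by sum-product in time linear in the tree size):

    import itertools
    import numpy as np

    def composite_log_likelihood(x, y, components):
        # L~(y; x, theta) = sum_c [ phi_c(y_{pi_c}) - log Z_c ]:
        # each Z_c sums over the labelings of one small component
        # only, so it stays cheap even when the full partition
        # function is intractable.
        total = 0.0
        for phi_c, pi_c in components:
            y_c = tuple(y[p] for p in pi_c)
            scores = np.array([phi_c(x, yc) for yc in
                               itertools.product([-1, +1], repeat=len(pi_c))])
            log_Z_c = np.log(np.exp(scores).sum())  # prefer logsumexp
            total += phi_c(x, y_c) - log_Z_c
        return total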