Learning with Structured Output Spaces
Keerthiram Murugesan
Standard Prediction
• Find a function from input space X to output space Y such that the prediction error is low. (Typically Y is "simple".)
Examples:
– x: "Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters…" → y: 1 or −1 (classification)
– x: a DNA sequence (GATACAACCTATCCCCGTATATATATTCTA…) → y: 7.3 (regression)
Structured Prediction
Examples where Y is a structured object:
– Parsing: x = "The dog chased the cat." → y = parse tree (S → NP VP; NP → Det N; VP → V NP; …)
– Protein structure: x = amino-acid sequence (APPGEAYLQV…) → y = folded structure
– Coreference resolution: linking mentions in text, e.g. [Obama] / [his], [many young voters] / [this group]
– Conservation reservoir corridor design
[Figure: example input–output pairs with structured outputs]
Talk Overview
• Structured Prediction (Quick Review)
– Conventional Approach
• Structured Prediction Cascades
– Ensemble Cascades
• Ensemble Learning for Structured Prediction
– Online algorithm
– Boosting-style algorithm
Structured Prediction
Structured Output Spaces
• Input: x
• Predict: y ∈ Y(x)   (structured!)
• Quality determined by a utility (scoring) function
• Conventional Approach:
– Train: learn a model U(x, y) of utility
– Test: predict via h(x) = argmax_{y ∈ Y(x)} U(x, y)   (can be challenging)
Example: Sequence Prediction
• Part-of-Speech Tagging: h(x) = argmax_{y ∈ Y(x)} U(x, y)
– Given a sequence of words x
– Predict a sequence of tags y
x: The rain wet the cat
y: Det N V Det N   (other candidates: Det V V Adj V; Adv N V V Det; …)
Example: Sequence Prediction
• MAP inference in 1st-order Markov models
[Figure: chain y_1 – y_2 – y_3 – y_4 with observations x_1 … x_4; 1st-order dynamics]
Similar models include CRFs, Kalman Filters, Linear Dynamical Systems, etc.
Example: Sequence Prediction
• Utility function (sum over maximal cliques):
U(x, y) = Σ_{t=1}^{n} u(x_t, y_t, y_{t−1})
• Prediction (dynamic programming):
h(x) = argmax_y Σ_{t=1}^{n} u(x_t, y_t, y_{t−1})
Scoring Function as a Linear Model
• U (and u) are parameterized linearly:
U(x, y; θ) = Σ_t u(x_t, y_t, y_{t−1}; θ)
• With some feature representation f:
u(x, y_1, y_2; θ) = θᵀ f(x, y_1, y_2)
• Prediction (dynamic programming):
h(x; θ) = argmax_y Σ_t θᵀ f(x_t, y_t, y_{t−1})
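The argmax above decomposes along the chain, so dynamic programming (Viterbi) computes it exactly. A minimal sketch, assuming the per-position scores θᵀf have already been collected into `unary` and `pairwise` arrays (names are illustrative, not from the talk):

```python
import numpy as np

def viterbi(unary, pairwise):
    """MAP inference for a chain: argmax_y sum_t [unary[t, y_t] + pairwise[y_{t-1}, y_t]].

    unary:    (n, K) array of per-position scores theta^T f(x_t, y_t)
    pairwise: (K, K) array of transition scores theta^T f(y_{t-1}, y_t)
    Returns the highest-scoring tag sequence (length n).
    """
    n, K = unary.shape
    delta = unary[0].copy()                 # best score of any prefix ending in tag k
    back = np.zeros((n, K), dtype=int)      # backpointers
    for t in range(1, n):
        cand = delta[:, None] + pairwise    # (K_prev, K_cur) candidate scores
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + unary[t]
    # backtrack from the best final tag
    y = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1]
```

Runtime is O(nK²), i.e. proportional to the model complexity, which is exactly the limitation the cascades below attack.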
Feature Representation
Generalizing to Other Structures
• From the last slide: h(x; θ) = argmax_y Σ_t θᵀ f(x_t, y_t, y_{t−1})
• General formulation:
Ψ(x, y) = Σ_t f(x_t, y_t, y_{t−1})
h(x; θ) = argmax_y θᵀ Ψ(x, y)
• Inference procedures: Viterbi, CKY parsing, sorting, belief propagation, integer programming
Learning Setting
argmin_θ (λ/2)‖θ‖² + Σ_{(x,y)} ℓ(y, h(x; θ))
(regularization + loss function, with h(x; θ) = argmax_{y ∈ Y(x)} U(x, y; θ))
• Generalization of conventional settings:
– Hinge loss → Structural SVMs
– Log-loss → Conditional Random Fields
– Optimized via gradient descent, cutting plane, etc.
• Requires running inference during training
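The "inference during training" point can be made concrete with the structured perceptron, one of the simplest instances of this learning setting (shown as a reference point, not necessarily the trainer used in the talk); `psi` and `decode` are placeholder names:

```python
import numpy as np

def structured_perceptron(data, psi, decode, n_feats, epochs=5):
    """Structured-perceptron sketch: inference runs inside the training loop.

    data:   list of (x, y) training pairs
    psi:    joint feature map Psi(x, y) -> feature vector
    decode: decode(x, theta) -> argmax_y theta . Psi(x, y)  (e.g. Viterbi)
    """
    theta = np.zeros(n_feats)
    for _ in range(epochs):
        for x, y in data:
            y_hat = decode(x, theta)            # inference during training
            if y_hat != y:                      # update only on a mistake
                theta += np.asarray(psi(x, y)) - np.asarray(psi(x, y_hat))
    return theta
```

Every update requires a full argmax over Y(x), so expensive inference makes training expensive too.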
Restriction: Increased Complexity
Restriction: Pre-specified Structure
h(x; θ) = argmax_y U(x, y; θ)
• Learn a (linearly) parameterized U such that h(x) gives good predictions
• What if the assumed structure of U is "wrong"?
– Known to not be consistent: infinite training data does not imply convergence to the best model
Summary: Structured Prediction
h(x; θ) = argmax_y U(x, y; θ)
• Conventional approach:
– Specify structure & inference procedure
– Train parameters on a training set {(x, y)}
• Limitations:
– Runtime proportional to model complexity
– Structure mismatch & inconsistency
Structured Prediction Cascades
Classifier Cascades (Face Classifier)
Classifier Cascades
Tradeoffs in Cascaded Learning
• Accuracy: minimize the number of errors incurred by each level
• Efficiency: maximize the number of filtered assignments at each level
Structured Prediction Cascades
Clique Assignments
• Valid assignment for the clique (Y_{k−1}, Y_k): e.g. (Adj, N)
Remember the sum over cliques c ∈ C? U(x, y) = Σ_t u(x_t, y_t, y_{t−1})
• Invalid assignment, e.g. (N, N), will be eliminated/pruned
Clique Assignments
• How do we know whether an assignment for (Y_{k−1}, Y_k), e.g. (Adj, N), is good or bad?
1. Score it
2. Compare against a threshold
• Assignments that fall below the threshold, e.g. (N, N), are eliminated/pruned
Max-marginal score (sequence models)
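For sequence models, the max-marginal of an assignment is the score of the best complete sequence constrained to pass through it, computable with a max-product forward/backward sweep. A sketch, again assuming precomputed `unary`/`pairwise` score arrays (illustrative names); in structured prediction cascades the threshold is commonly set by interpolating between the maximum score and the mean of the max-marginals:

```python
import numpy as np

def max_marginals(unary, pairwise):
    """Max-marginal of each (position, tag): the score of the best complete
    chain sequence that passes through that assignment."""
    n, K = unary.shape
    fwd = np.zeros((n, K))          # best score of a prefix ending in (t, k)
    bwd = np.zeros((n, K))          # best score of a suffix starting after (t, k)
    fwd[0] = unary[0]
    for t in range(1, n):
        fwd[t] = (fwd[t - 1][:, None] + pairwise).max(axis=0) + unary[t]
    for t in range(n - 2, -1, -1):
        bwd[t] = (pairwise + unary[t + 1][None, :] + bwd[t + 1][None, :]).max(axis=1)
    return fwd + bwd                # (n, K); its max equals the MAP score

def prune(mm, threshold):
    """A cascade level keeps only assignments whose max-marginal clears t."""
    return mm >= threshold
```

Pruning with max-marginals is safe in the sense that the MAP sequence is never filtered as long as the threshold stays below the MAP score.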
Threshold (t)
Learning θ at each cascade level
Online learning
Structured Prediction Ensembles
Ensemble Learning
h_1: face   h_2: face   h_3: no face   …   h_p: face
Goal: combine the outputs from multiple models / hypotheses / experts:
1) Majority voting
2) Linear combination of hypotheses/experts
3) Boosting, etc.
Weighted Majority Algorithm
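The classic weighted majority algorithm keeps one weight per expert and shrinks (by a factor β) the weight of every expert that errs, then predicts by weighted vote. A sketch for binary predictions:

```python
def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted majority: multiply the weight of every erring expert by beta;
    predict each round by a weighted vote over {0, 1}.

    expert_preds: list of p prediction sequences, one per expert
    labels:       the true labels, revealed after each prediction
    """
    p = len(expert_preds)
    w = [1.0] * p                    # one weight per expert
    mistakes = 0
    for t, y in enumerate(labels):
        vote1 = sum(w[i] for i in range(p) if expert_preds[i][t] == 1)
        vote0 = sum(w[i] for i in range(p) if expert_preds[i][t] == 0)
        y_hat = 1 if vote1 >= vote0 else 0
        mistakes += (y_hat != y)
        for i in range(p):           # penalize the experts that erred
            if expert_preds[i][t] != y:
                w[i] *= beta
    return w, mistakes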
Ensemble Learning for Structured Prediction
Base hypotheses h^1, h^2, …, h^p, each predicting a full structure:
h^1 = (h^1_1, h^1_2, …, h^1_l), …, h^p = (h^p_1, h^p_2, …, h^p_l)
e.g. h^1: Adv N V V Det
Example: Sequence Model
Weighted Majority Algorithm for Structured Prediction Ensembles
Ensemble Output from the Weighted Majority Algorithm
• Given W_1, W_2, …, W_T
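Given the weights, one simple way to form the structured ensemble output is a position-wise weighted vote over the base hypotheses' tags. This sketch ignores structural constraints between positions, which the talk's algorithm may handle differently; names are illustrative:

```python
from collections import defaultdict

def ensemble_decode(outputs, weights):
    """Combine structured outputs position by position with a weighted vote.

    outputs: list of p tag sequences h^1 ... h^p, all of length l
    weights: list of p nonnegative expert weights
    Returns the weighted-majority tag at each of the l positions."""
    l = len(outputs[0])
    combined = []
    for t in range(l):
        score = defaultdict(float)
        for h, w in zip(outputs, weights):
            score[h[t]] += w             # accumulate weight for this tag
        combined.append(max(score, key=score.get))
    return combined
```

Because each position is voted on independently, the combined sequence can mix tags from different base hypotheses.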
Boosting for Structured Prediction Ensembles
Ensemble Output from Boosting
• Given the base learners h_1, h_2, …, h_T:
• Note: h_1, h_2, …, h_T (one per boosting round) are different from h^1, h^2, …, h^p (the ensemble members above)
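The boosting-style construction itself is not reproduced on the slide; as a reference point, standard AdaBoost reweights training examples each round and combines the base learners h_1 … h_T with weights α_t. A sketch for binary labels (a structured version would replace the 0/1 error with a structured loss such as Hamming loss):

```python
import math

def adaboost(data, learners):
    """AdaBoost sketch: reweight examples each round, then combine the base
    learners h_1 ... h_T with weights alpha_t = 0.5 * ln((1 - e_t) / e_t).

    data:     list of (x, y) with y in {-1, +1}
    learners: list of T callables h_t(x) -> {-1, +1}, one per round
    Returns (alphas, predict), where predict votes sign(sum_t alpha_t h_t(x)).
    """
    m = len(data)
    D = [1.0 / m] * m                                   # example weights
    alphas = []
    for h in learners:
        err = sum(D[i] for i, (x, y) in enumerate(data) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)           # guard the log
        a = 0.5 * math.log((1 - err) / err)
        alphas.append(a)
        # upweight the examples this learner got wrong, then renormalize
        D = [D[i] * math.exp(-a * y * h(x)) for i, (x, y) in enumerate(data)]
        Z = sum(D)
        D = [d / Z for d in D]
    def predict(x):
        return 1 if sum(a * h(x) for a, h in zip(alphas, learners)) >= 0 else -1
    return alphas, predict
```

Each round's learner is pushed to focus on the examples its predecessors got wrong, which is the same intuition the structured variant exploits.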
• THE END