Complex Prediction Problems: A Novel Approach to Multiple Structured Output Prediction
Yasemin Altun, Max-Planck Institute
ECML HLIE08
Information Extraction
Extract structured information from unstructured data.
Typical subtasks:
- Named Entity Recognition: person, location, organization names
- Coreference Identification: noun phrases referring to the same object
- Relation Extraction: e.g. Person works for Organization
Ultimate tasks:
- Document Summarization
- Question Answering
Complex Prediction Problems
Complex tasks consisting of multiple structured subtasks.
Real-world problems are too complicated to solve at once.
Ubiquitous in many domains:
- Natural Language Processing
- Computational Biology
- Computational Vision
Complex Prediction Example
Motion tracking in Computational Vision.
Subtask: identify the joint angles of the human body.
Complex Prediction Example
3-D protein structure prediction in Computational Biology.
Subtask: identify secondary structure.
Prediction from the amino-acid sequence, e.g.
AAYKSHGSGDYGDHDVGHPTPGDPWVEPDYGINVYHSDTYSGQW
Standard Approach to Complex Prediction
Pipeline approach:
- Define intermediate/sub-tasks
- Solve them individually or in a cascaded manner
- Use the output of subtasks as features (input) for the target task
[Figure: cascaded graphical model in which a POS chain over the input x feeds an NER chain whose input is x augmented with the predicted POS tags]
Problems (see the sketch below):
- Error propagation
- No learning across tasks
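Below is a minimal sketch of the cascade, assuming plain per-token classifiers (scikit-learn LogisticRegression) in place of structured models; the toy sentence, tags, and feature dictionaries are hypothetical illustrations, not the setup used in the talk.

```python
# Pipeline sketch: a POS stage feeds its *predictions* into the NER stage.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: tokens with gold POS and NER tags (hypothetical).
tokens = ["John", "works", "for", "Google"]
pos_tags = ["NNP", "VBZ", "IN", "NNP"]
ner_tags = ["PER", "O", "O", "ORG"]

def token_features(tok):
    # Simple per-token features; a real system would add context windows.
    return {"word": tok.lower(), "is_cap": tok[0].isupper()}

# Stage 1: train the POS tagger on the raw input x.
pos_model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
pos_model.fit([token_features(t) for t in tokens], pos_tags)

# Stage 2: train the NER tagger on x augmented with *predicted* POS tags.
pred_pos = pos_model.predict([token_features(t) for t in tokens])
ner_feats = [dict(token_features(t), pos=p) for t, p in zip(tokens, pred_pos)]
ner_model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
ner_model.fit(ner_feats, ner_tags)

# A POS error propagates into the NER features, and the two stages never
# learn jointly: exactly the two weaknesses noted on this slide.
```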
New Approach to Complex Prediction
Proposed approach: solve the tasks jointly and discriminatively.
- Decompose multiple structured tasks
- Use methods from multitask learning: good predictors are smooth, so restrict the search space to smooth functions of all tasks
- Devise targeted approximation methods: standard approximation algorithms do not capture the specifics of the problem, namely that dependencies within tasks are stronger than dependencies across tasks
Advantages:
- Less/no error propagation
- Enables learning across tasks
Structured Output (SO) Prediction
Supervised learning:
- Given input/output pairs (x, y) ∈ X × Y, e.g. Y = {0, ..., m} or Y = ℝ
- Data from an unknown but fixed distribution D over X × Y
- Goal: learn a mapping h : X → Y
- State-of-the-art methods are discriminative, e.g. SVMs, Boosting
In Structured Output prediction:
- Multivariate response variable with structural dependency
- |Y| is exponential in the number of variables
- Sequences, trees, hierarchical classification, ranking
SO Prediction
Generative framework: model P(x, y)
- Advantages: efficient learning and inference algorithms
- Disadvantages: harder problem, questionable independence assumptions, limited representation
Local approaches: e.g. [Roth, 2001]
- Advantages: efficient algorithms
- Disadvantages: ignore long-range dependencies, or handle them poorly
Discriminative learning:
- Advantages: richer representation via kernels, captures dependencies
- Disadvantages: expensive computation (SO prediction involves iteratively computing marginals or the best labeling during training)
Formal Setting
Given S = ((x_1, y_1), ..., (x_n, y_n))
Find h : X → Y, h(x) = argmax_y F(x, y)
Linear discriminant function F : X × Y → ℝ:
$$F_w(x, y) = \langle \psi(x, y), w \rangle$$
Cost function: Δ(y, y') ≥ 0, e.g. 0-1 loss, Hamming loss
Canonical example: label sequence learning, where both x and y are sequences (see the sketch below)
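A minimal sketch of the linear discriminant and the argmax prediction rule, assuming a tiny multiclass setting where the joint feature map ψ stacks the input features into the block indexed by the label; all names and the toy data are hypothetical. In the structured case |Y| is exponential, so the argmax is computed by dynamic programming (see the Viterbi slide below) rather than by enumeration.

```python
# h(x) = argmax_y F_w(x, y) = argmax_y <psi(x, y), w>, enumerated for a toy Y.
import numpy as np

def psi(x, y, num_labels):
    # Joint feature map: place phi(x) (= x here) in the block indexed by y.
    feat = np.zeros(num_labels * len(x))
    feat[y * len(x):(y + 1) * len(x)] = x
    return feat

def predict(x, w, num_labels):
    # Score every candidate label and return the best one.
    scores = [psi(x, y, num_labels) @ w for y in range(num_labels)]
    return int(np.argmax(scores))

num_labels, dim = 3, 4
rng = np.random.default_rng(0)
w = rng.normal(size=num_labels * dim)   # learned weight vector (here: random)
x = rng.normal(size=dim)
print(predict(x, w, num_labels))
```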
Maximum Margin Learning [Altun et al 03]
Define the separation margin [Crammer & Singer 01]:
$$\gamma_i = F_w(x_i, y_i) - \max_{y \neq y_i} F_w(x_i, y)$$
Maximize min_i γ_i with small ‖w‖, or equivalently minimize
$$\sum_i \max_{y \neq y_i} \big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+ + \lambda \|w\|^2$$
Max-Margin Learning (cont.)
$$\min_w \; \sum_i \max_{y \neq y_i} \big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+ + \lambda \|w\|^2$$
A convex, non-quadratic program:
$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_i \xi_i$$
$$\text{s.t.} \;\; \langle w, \psi(x_i, y_i)\rangle - \max_{y \neq y_i} \langle w, \psi(x_i, y)\rangle \geq 1 - \xi_i, \;\; \forall i$$
Max-Margin Learning (cont.)
$$\min_w \; \sum_i \max_{y \neq y_i} \big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+ + \lambda \|w\|^2$$
A convex quadratic program:
$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_i \xi_i$$
$$\text{s.t.} \;\; \langle w, \psi(x_i, y_i)\rangle - \langle w, \psi(x_i, y)\rangle \geq 1 - \xi_i, \;\; \forall i, \; \forall y \neq y_i$$
The number of constraints is exponential.
Sparsity: only a few of the constraints will be active.
Max-Margin Dual Problem
Using Lagrangian techniques, the dual:
$$\max_\alpha \; -\frac{1}{2} \sum_{i,j,y,y'} \alpha_i(y)\,\alpha_j(y')\,\langle \delta\psi(x_i, y), \delta\psi(x_j, y')\rangle + \sum_{i,y} \alpha_i(y)$$
$$\text{s.t.} \;\; 0 \leq \alpha_i(y), \;\; \sum_{y \neq y_i} \alpha_i(y) \leq \frac{C}{n}, \;\; \forall i$$
where δψ(x_i, y) = ψ(x_i, y_i) − ψ(x_i, y).
- Exploit the structure of the constraints
- Replace the inner product with a kernel for an implicit non-linear mapping (see the sketch below)
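A minimal sketch of evaluating this dual objective with the kernel trick on a tiny multiclass problem (so that Y can be enumerated). The joint-feature kernel ⟨ψ(x,y), ψ(x̄,ȳ)⟩ = k(x, x̄)·[y = ȳ] and the RBF base kernel are assumptions for illustration; the data and the dual variables are hypothetical.

```python
# Dual objective: -1/2 * sum alpha_i(y) alpha_j(y') <dpsi_i(y), dpsi_j(y')> + sum alpha.
import numpy as np

rng = np.random.default_rng(0)
n, num_labels = 3, 3
X = rng.normal(size=(n, 4))
Y = np.array([0, 1, 2])

def k(x, xbar):                      # base kernel on inputs (here: RBF)
    return np.exp(-0.5 * np.sum((x - xbar) ** 2))

def joint_k(i, y, j, ybar):          # <psi(x_i, y), psi(x_j, ybar)>
    return k(X[i], X[j]) * (y == ybar)

def delta_k(i, y, j, ybar):          # <delta_psi(x_i, y), delta_psi(x_j, ybar)>
    return (joint_k(i, Y[i], j, Y[j]) - joint_k(i, Y[i], j, ybar)
            - joint_k(i, y, j, Y[j]) + joint_k(i, y, j, ybar))

def dual_objective(alpha):           # alpha[i, y], with alpha[i, Y[i]] = 0
    quad = sum(alpha[i, y] * alpha[j, yb] * delta_k(i, y, j, yb)
               for i in range(n) for y in range(num_labels)
               for j in range(n) for yb in range(num_labels))
    return -0.5 * quad + alpha.sum()

alpha = rng.uniform(0, 0.1, size=(n, num_labels))
alpha[np.arange(n), Y] = 0.0         # no dual variable for the true label
print(dual_objective(alpha))
```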
Max-Margin Optimization
Exploit sparseness and the structure of the constraints by incrementally adding constraints (cutting-plane algorithm):
- Maintain a working set Y_i ⊆ Y for each training instance
- Iterate over training instances, incrementally augmenting (or shrinking) the working set Y_i:
  - Find ŷ = argmax_{y ∈ Y \ {y_i}} F(x_i, y) via dynamic programming
  - Is F(x_i, y_i) − F(x_i, ŷ) ≤ 1 − ε? If so, add ŷ to Y_i
  - Optimize over the Lagrange multipliers α_i of Y_i
A sketch of this loop follows below.
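A minimal sketch of the working-set loop, assuming a tiny multiclass problem so the separation oracle can be brute force (standing in for dynamic programming). The inner "optimize over the working set" step is approximated here by subgradient steps instead of a QP solve; all data and constants are hypothetical.

```python
# Cutting-plane / working-set loop for the structural SVM (simplified sketch).
import numpy as np

rng = np.random.default_rng(0)
n, num_labels, dim = 6, 3, 5
X = rng.normal(size=(n, dim))
Y = rng.integers(0, num_labels, size=n)

def psi(x, y):                            # joint feature map (one block per label)
    f = np.zeros(num_labels * dim)
    f[y * dim:(y + 1) * dim] = x
    return f

def F(w, x, y):
    return psi(x, y) @ w

w = np.zeros(num_labels * dim)
working_sets = [set() for _ in range(n)]  # Y_i, the active constraints
eps, lam, lr = 1e-3, 0.1, 0.1

for epoch in range(20):
    added = 0
    for i in range(n):
        # Separation oracle: most violated label y != y_i (brute force here).
        y_hat = max((y for y in range(num_labels) if y != Y[i]),
                    key=lambda y: F(w, X[i], y))
        if F(w, X[i], Y[i]) - F(w, X[i], y_hat) < 1 - eps:
            working_sets[i].add(y_hat)    # margin violated: add the constraint
            added += 1
        # Re-optimize over the current working sets (subgradient stand-in).
        grad = 2 * lam * w
        for y in working_sets[i]:
            if 1 + F(w, X[i], y) - F(w, X[i], Y[i]) > 0:
                grad += psi(X[i], y) - psi(X[i], Y[i])
        w -= lr * grad
    if added == 0:                        # no violated constraints left: done
        break
```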
Max-Margin Cost Sensitivity
Cost function Δ : Y × Y → ℝ:
- Multiclass: 0/1 loss
- Sequences: Hamming loss
- Parsing: (1 − F1)
Extend the max-margin framework for cost sensitivity (both variants are compared in the sketch below):
- Margin rescaling (Taskar et al. 2004): $\max_{y \neq y_i} \big(\Delta(y_i, y) + F_w(x_i, y) - F_w(x_i, y_i)\big)_+$
- Slack rescaling (Tsochantaridis et al. 2004): $\max_{y \neq y_i} \Delta(y_i, y)\big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+$
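A minimal sketch contrasting the two cost-sensitive hinge losses on a toy label sequence with Hamming cost; the scores F(x, y) are hypothetical numbers and the sequence is short enough to enumerate Y.

```python
# Margin rescaling adds the cost to the required margin; slack rescaling
# multiplies the slack by the cost.
import itertools
import numpy as np

y_true = (0, 1, 1)
labels = (0, 1)
rng = np.random.default_rng(0)
scores = {y: rng.normal() for y in itertools.product(labels, repeat=3)}
scores[y_true] += 1.0                 # nudge the true sequence's score upward

def hamming(y, y_prime):
    return sum(a != b for a, b in zip(y, y_prime))

F_true = scores[y_true]
margin_rescaled = max(hamming(y_true, y) + scores[y] - F_true
                      for y in scores if y != y_true)
slack_rescaled = max(hamming(y_true, y) * (1 + scores[y] - F_true)
                     for y in scores if y != y_true)
print(max(0.0, margin_rescaled), max(0.0, slack_rescaled))
```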
Example: Sequences
Viterbi decoding for the argmax operation (see the sketch below).
Decompose features over time:
$$\psi(x, y) = \sum_t \big(\psi(x_t, y_t) + \psi(y_t, y_{t-1})\big)$$
Two types of features:
- Observation-label: ψ(x_t, y_t) = φ(x_t) ⊗ Λ(y_t)
- Label-label: ψ(y_t, y_{t−1}) = Λ(y_t) ⊗ Λ(y_{t−1})
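A minimal sketch of Viterbi decoding for argmax_y ⟨ψ(x, y), w⟩ with the feature map decomposed over time into observation-label and label-label terms as above; the toy observation features and weights are hypothetical random numbers.

```python
# Viterbi decoding over a linear-chain label sequence.
import numpy as np

rng = np.random.default_rng(0)
T, num_labels, dim = 5, 3, 4
phi = rng.normal(size=(T, dim))            # phi(x_t) for each position
W_obs = rng.normal(size=(num_labels, dim)) # observation-label weights
W_trans = rng.normal(size=(num_labels, num_labels))  # label-label weights

emit = phi @ W_obs.T                       # emit[t, y] = <w, psi(x_t, y)>

# Dynamic programming over the chain.
delta = np.full((T, num_labels), -np.inf)
backp = np.zeros((T, num_labels), dtype=int)
delta[0] = emit[0]
for t in range(1, T):
    for y in range(num_labels):
        scores = delta[t - 1] + W_trans[:, y] + emit[t, y]
        backp[t, y] = np.argmax(scores)
        delta[t, y] = scores[backp[t, y]]

# Backtrack the best label sequence.
y_hat = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    y_hat.append(int(backp[t, y_hat[-1]]))
y_hat.reverse()
print(y_hat)
```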
Example: Sequences (cont.)
The inner product between two joint feature maps decomposes over the two feature types:
$$\langle \psi(x, y), \psi(\bar x, \bar y)\rangle = \sum_{s,t} \langle \phi(x_t), \phi(\bar x_s)\rangle\, \delta(y_t, \bar y_s) + \delta(y_t, \bar y_s)\, \delta(y_{t-1}, \bar y_{s-1}) = \sum_{s,t} k\big((x_t, y_t), (\bar x_s, \bar y_s)\big) + \tilde k\big((y_t, y_{t-1}), (\bar y_s, \bar y_{s-1})\big)$$
- Arbitrary kernels on x
- Linear kernels on y
A minimal computation of this kernel is sketched below.
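A minimal sketch of the joint-feature inner product between two labeled sequences, decomposed as on this slide; the RBF base kernel on the observations and the toy sequences are assumptions for illustration.

```python
# Sequence kernel: arbitrary kernel on x, delta (linear) kernels on labels.
import numpy as np

def k_x(xt, xs):                       # base kernel on observations (RBF here)
    return np.exp(-0.5 * np.sum((xt - xs) ** 2))

def seq_kernel(x, y, xbar, ybar):
    # Observation-label part: k(x_t, xbar_s) * delta(y_t, ybar_s).
    obs = sum(k_x(x[t], xbar[s]) * (y[t] == ybar[s])
              for t in range(len(y)) for s in range(len(ybar)))
    # Label-label part: delta(y_t, ybar_s) * delta(y_{t-1}, ybar_{s-1}).
    trans = sum((y[t] == ybar[s]) * (y[t - 1] == ybar[s - 1])
                for t in range(1, len(y)) for s in range(1, len(ybar)))
    return obs + trans

rng = np.random.default_rng(0)
x, xbar = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
y, ybar = [0, 1, 1, 2], [0, 0, 1, 2, 2]
print(seq_kernel(x, y, xbar, ybar))
```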
Other SO Prediction Methods
Find w to minimize the expected loss $E_{(x,y)\sim D}[\Delta(y, h_f(x))]$ via the regularized empirical objective
$$w^* = \operatorname{argmin}_w \sum_{i=1}^{n} L(x_i, y_i, w) + \lambda \|w\|^2$$
Loss functions (compared numerically in the sketch below):
- Hinge loss
- Log-loss: CRF [Lafferty et al 2001]: $L(x, y, f) = -F(x, y) + \log \sum_{\hat y \in \mathcal Y} \exp(F(x, \hat y))$
- Exp-loss: Structured Boosting [Altun et al 2002]: $L(x, y, f) = \sum_{\hat y \in \mathcal Y} \exp(F(x, \hat y) - F(x, y))$
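A minimal worked comparison of the three surrogate losses on a single toy example, given scores F(x, y) for every y in a small label set; the numbers are hypothetical.

```python
# Hinge, log (CRF), and exp (boosting) losses computed from the same scores.
import numpy as np

scores = {"A": 2.0, "B": 1.2, "C": -0.5}   # F(x, y) for each candidate y
y_true = "A"
F_true = scores[y_true]

hinge = max(0.0, max(1 + F - F_true for y, F in scores.items() if y != y_true))
log_loss = -F_true + np.log(sum(np.exp(F) for F in scores.values()))
exp_loss = sum(np.exp(F - F_true) for F in scores.values())
print(hinge, log_loss, exp_loss)
```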
Complex Prediction via SO Prediction
Possible solution: treat complex prediction as a loopy graph and use standard approximation methods.
Shortcomings:
- The known graph structure is not exploited
- Nor is the fact that the tasks are defined over the same input space
Solution:
- Dependencies within tasks are more important than dependencies across tasks; use this to design the approximation method
- Restrict the function class for each task via learning across tasks
Joint Learning of Multiple SO Prediction [Altun 2008]
Tasks 1, ..., m. Learn a discriminative function T : X → Y_1 × ... × Y_m:
$$T(x, y; w, \bar w) = \sum_\ell F_\ell(x, y_\ell; w_\ell) + \sum_{\ell, \ell'} F_{\ell\ell'}(y_\ell, y_{\ell'}; \bar w)$$
where
- w_ℓ capture dependencies within individual tasks
- w̄_{ℓ,ℓ'} capture dependencies across tasks
- F_ℓ is defined as before
- F_{ℓℓ'} are linear functions with respect to the clique assignments of tasks ℓ, ℓ'
Joint Learning of Multiple SO Prediction
Assume a low-dimensional representation Θ shared across all tasks [Argyriou et al 2007]:
$$F_\ell(x, y_\ell; w_\ell, \Theta) = \langle w_{\ell\sigma}, \Theta^T \psi(x, y_\ell)\rangle$$
Find T by discovering Θ and learning w, w̄:
$$\min_{\Theta, w, \bar w} \;\; \hat r(\Theta) + r(w) + \bar r(\bar w) + \sum_{\ell=1}^{m} \sum_{i=1}^{n} L_\ell(x_i, y_i^\ell; w, \bar w, \Theta)$$
- r, r̄: regularization, e.g. L2 norm
- r̂: e.g. Frobenius norm, trace norm
- L: loss function, e.g. log-loss, hinge loss
The optimization is not jointly convex over Θ and (w, w̄).
Joint Learning of Multiple SO Prediction
Via a reformulation, we get a jointly convex optimization:
$$\min_{A, D, \bar w} \;\; \sum_{\ell\sigma} \big\langle A_{\ell\sigma}, D^{+} A_{\ell\sigma}\big\rangle + \sum_{\ell=1}^{m} \sum_{i=1}^{n} L_\ell(x_i, y_i^\ell; A, \bar w) + \bar r(\bar w)$$
- Optimize iteratively with respect to (A, w̄) and D
- Closed-form solution for D
A sketch of this alternating scheme follows below.
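A minimal sketch of the alternation, under strong simplifying assumptions: the structured hinge/log losses are replaced by a plain squared loss per task, the cross-task weights w̄ are omitted, and the closed-form update used for the shared structure is D = (A Aᵀ)^{1/2} / trace((A Aᵀ)^{1/2}) from Argyriou et al. 2007. All data and constants are hypothetical.

```python
# Alternate between gradient steps on the task parameters A (fixed D) and a
# closed-form update of the shared structure D.
import numpy as np

rng = np.random.default_rng(0)
num_tasks, dim, n = 3, 6, 40
X = [rng.normal(size=(n, dim)) for _ in range(num_tasks)]
shared = rng.normal(size=dim)                      # tasks share a common direction
targets = [X[l] @ (shared + 0.1 * rng.normal(size=dim)) for l in range(num_tasks)]

A = 0.01 * rng.normal(size=(dim, num_tasks))       # column l = parameters of task l
D = np.eye(dim) / dim
lam, lr = 0.1, 0.01

def psd_sqrt(M):
    # Matrix square root of a PSD matrix via its eigendecomposition.
    evals, evecs = np.linalg.eigh(M)
    return (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

for it in range(50):
    # Step 1: optimize A for fixed D; the regularizer is sum_l <a_l, D^+ a_l>.
    D_pinv = np.linalg.pinv(D)
    for l in range(num_tasks):
        residual = X[l] @ A[:, l] - targets[l]
        grad = X[l].T @ residual / n + lam * D_pinv @ A[:, l]
        A[:, l] -= lr * grad
    # Step 2: closed-form update of the shared structure D.
    root = psd_sqrt(A @ A.T)
    D = root / np.trace(root)

print(np.round(D, 2))                              # low-rank structure shared by the tasks
```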