Complex Prediction Problems: A Novel Approach to Multiple Structured Output Prediction
Yasemin Altun, Max-Planck Institute
ECML HLIE08
Information Extraction
Extract structured information from unstructured data.
Typical subtasks:
- Named Entity Recognition: person, location, organization names
- Coreference Identification: noun phrases referring to the same object
- Relation Extraction: e.g. Person works for Organization
Ultimate tasks:
- Document Summarization
- Question Answering
Complex Prediction Problems
Complex tasks consisting of multiple structured subtasks.
Real-world problems are too complicated to solve at once.
Ubiquitous in many domains:
- Natural Language Processing
- Computational Biology
- Computational Vision
Complex Prediction Example
Motion tracking in Computational Vision.
Subtask: identify the joint angles of the human body.
Complex Prediction Example
3-D protein structure prediction in Computational Biology.
Subtask: identify secondary structure.
Prediction from the amino-acid sequence, e.g.
AAYKSHGSGDYGDHDVGHPTPGDPWVEPDYGINVYHSDTYSGQW
Standard Approach to Complex Prediction
Pipeline approach:
- Define intermediate/sub-tasks
- Solve them individually or in a cascaded manner
- Use the output of subtasks as features (input) for the target task
[Figure: cascaded graphical model in which a POS chain over the input x feeds an NER chain whose input is x augmented with the predicted POS tags]
Problems (see the sketch below):
- Error propagation
- No learning across tasks
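Below is a minimal sketch of the cascade, assuming plain per-token classifiers (scikit-learn LogisticRegression) in place of structured models; the toy sentence, tags, and feature dictionaries are hypothetical illustrations, not the setup used in the talk.

```python
# Pipeline sketch: a POS stage feeds its *predictions* into the NER stage.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: tokens with gold POS and NER tags (hypothetical).
tokens = ["John", "works", "for", "Google"]
pos_tags = ["NNP", "VBZ", "IN", "NNP"]
ner_tags = ["PER", "O", "O", "ORG"]

def token_features(tok):
    # Simple per-token features; a real system would add context windows.
    return {"word": tok.lower(), "is_cap": tok[0].isupper()}

# Stage 1: train the POS tagger on the raw input x.
pos_model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
pos_model.fit([token_features(t) for t in tokens], pos_tags)

# Stage 2: train the NER tagger on x augmented with *predicted* POS tags.
pred_pos = pos_model.predict([token_features(t) for t in tokens])
ner_feats = [dict(token_features(t), pos=p) for t, p in zip(tokens, pred_pos)]
ner_model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
ner_model.fit(ner_feats, ner_tags)

# A POS error propagates into the NER features, and the two stages never
# learn jointly: exactly the two weaknesses noted on this slide.
```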
New Approach to Complex Prediction
Proposed approach: solve the tasks jointly and discriminatively.
- Decompose multiple structured tasks
- Use methods from multitask learning: good predictors are smooth, so restrict the search space to smooth functions of all tasks
- Devise targeted approximation methods: standard approximation algorithms do not capture the specifics of the problem, namely that dependencies within tasks are stronger than dependencies across tasks
Advantages:
- Less/no error propagation
- Enables learning across tasks
Structured Output (SO) Prediction
Supervised learning:
- Given input/output pairs (x, y) ∈ X × Y, e.g. Y = {0, ..., m} or Y = ℝ
- Data from an unknown but fixed distribution D over X × Y
- Goal: learn a mapping h : X → Y
- State-of-the-art methods are discriminative, e.g. SVMs, Boosting
In Structured Output prediction:
- Multivariate response variable with structural dependency
- |Y| is exponential in the number of variables
- Sequences, trees, hierarchical classification, ranking
SO Prediction
Generative framework: model P(x, y)
- Advantages: efficient learning and inference algorithms
- Disadvantages: harder problem, questionable independence assumptions, limited representation
Local approaches: e.g. [Roth, 2001]
- Advantages: efficient algorithms
- Disadvantages: ignore long-range dependencies, or handle them poorly
Discriminative learning:
- Advantages: richer representation via kernels, captures dependencies
- Disadvantages: expensive computation (SO prediction involves iteratively computing marginals or the best labeling during training)
Formal Setting
Given S = ((x_1, y_1), ..., (x_n, y_n))
Find h : X → Y, h(x) = argmax_y F(x, y)
Linear discriminant function F : X × Y → ℝ:
$$F_w(x, y) = \langle \psi(x, y), w \rangle$$
Cost function: Δ(y, y') ≥ 0, e.g. 0-1 loss, Hamming loss
Canonical example: label sequence learning, where both x and y are sequences (see the sketch below)
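A minimal sketch of the linear discriminant and the argmax prediction rule, assuming a tiny multiclass setting where the joint feature map ψ stacks the input features into the block indexed by the label; all names and the toy data are hypothetical. In the structured case |Y| is exponential, so the argmax is computed by dynamic programming (see the Viterbi slide below) rather than by enumeration.

```python
# h(x) = argmax_y F_w(x, y) = argmax_y <psi(x, y), w>, enumerated for a toy Y.
import numpy as np

def psi(x, y, num_labels):
    # Joint feature map: place phi(x) (= x here) in the block indexed by y.
    feat = np.zeros(num_labels * len(x))
    feat[y * len(x):(y + 1) * len(x)] = x
    return feat

def predict(x, w, num_labels):
    # Score every candidate label and return the best one.
    scores = [psi(x, y, num_labels) @ w for y in range(num_labels)]
    return int(np.argmax(scores))

num_labels, dim = 3, 4
rng = np.random.default_rng(0)
w = rng.normal(size=num_labels * dim)   # learned weight vector (here: random)
x = rng.normal(size=dim)
print(predict(x, w, num_labels))
```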
Maximum Margin Learning [Altun et al 03]
Define the separation margin [Crammer & Singer 01]:
$$\gamma_i = F_w(x_i, y_i) - \max_{y \neq y_i} F_w(x_i, y)$$
Maximize min_i γ_i with small ‖w‖, or equivalently minimize
$$\sum_i \max_{y \neq y_i} \big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+ + \lambda \|w\|^2$$
Max-Margin Learning (cont.)
$$\min_w \; \sum_i \max_{y \neq y_i} \big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+ + \lambda \|w\|^2$$
A convex, non-quadratic program:
$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_i \xi_i$$
$$\text{s.t.} \;\; \langle w, \psi(x_i, y_i)\rangle - \max_{y \neq y_i} \langle w, \psi(x_i, y)\rangle \geq 1 - \xi_i, \;\; \forall i$$
Max-Margin Learning (cont.)
$$\min_w \; \sum_i \max_{y \neq y_i} \big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+ + \lambda \|w\|^2$$
A convex quadratic program:
$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_i \xi_i$$
$$\text{s.t.} \;\; \langle w, \psi(x_i, y_i)\rangle - \langle w, \psi(x_i, y)\rangle \geq 1 - \xi_i, \;\; \forall i, \; \forall y \neq y_i$$
The number of constraints is exponential.
Sparsity: only a few of the constraints will be active.
Max-Margin Dual Problem
Using Lagrangian techniques, the dual:
$$\max_\alpha \; -\frac{1}{2} \sum_{i,j,y,y'} \alpha_i(y)\,\alpha_j(y')\,\langle \delta\psi(x_i, y), \delta\psi(x_j, y')\rangle + \sum_{i,y} \alpha_i(y)$$
$$\text{s.t.} \;\; 0 \leq \alpha_i(y), \;\; \sum_{y \neq y_i} \alpha_i(y) \leq \frac{C}{n}, \;\; \forall i$$
where δψ(x_i, y) = ψ(x_i, y_i) − ψ(x_i, y).
- Exploit the structure of the constraints
- Replace the inner product with a kernel for an implicit non-linear mapping (see the sketch below)
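A minimal sketch of evaluating this dual objective with the kernel trick on a tiny multiclass problem (so that Y can be enumerated). The joint-feature kernel ⟨ψ(x,y), ψ(x̄,ȳ)⟩ = k(x, x̄)·[y = ȳ] and the RBF base kernel are assumptions for illustration; the data and the dual variables are hypothetical.

```python
# Dual objective: -1/2 * sum alpha_i(y) alpha_j(y') <dpsi_i(y), dpsi_j(y')> + sum alpha.
import numpy as np

rng = np.random.default_rng(0)
n, num_labels = 3, 3
X = rng.normal(size=(n, 4))
Y = np.array([0, 1, 2])

def k(x, xbar):                      # base kernel on inputs (here: RBF)
    return np.exp(-0.5 * np.sum((x - xbar) ** 2))

def joint_k(i, y, j, ybar):          # <psi(x_i, y), psi(x_j, ybar)>
    return k(X[i], X[j]) * (y == ybar)

def delta_k(i, y, j, ybar):          # <delta_psi(x_i, y), delta_psi(x_j, ybar)>
    return (joint_k(i, Y[i], j, Y[j]) - joint_k(i, Y[i], j, ybar)
            - joint_k(i, y, j, Y[j]) + joint_k(i, y, j, ybar))

def dual_objective(alpha):           # alpha[i, y], with alpha[i, Y[i]] = 0
    quad = sum(alpha[i, y] * alpha[j, yb] * delta_k(i, y, j, yb)
               for i in range(n) for y in range(num_labels)
               for j in range(n) for yb in range(num_labels))
    return -0.5 * quad + alpha.sum()

alpha = rng.uniform(0, 0.1, size=(n, num_labels))
alpha[np.arange(n), Y] = 0.0         # no dual variable for the true label
print(dual_objective(alpha))
```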
Max-Margin Optimization
Exploit sparseness and the structure of the constraints by incrementally adding constraints (cutting-plane algorithm):
- Maintain a working set Y_i ⊆ Y for each training instance
- Iterate over training instances, incrementally augmenting (or shrinking) the working set Y_i:
  - Find ŷ = argmax_{y ∈ Y \ {y_i}} F(x_i, y) via dynamic programming
  - Is F(x_i, y_i) − F(x_i, ŷ) ≤ 1 − ε? If so, add ŷ to Y_i
  - Optimize over the Lagrange multipliers α_i of Y_i
A sketch of this loop follows below.
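A minimal sketch of the working-set loop, assuming a tiny multiclass problem so the separation oracle can be brute force (standing in for dynamic programming). The inner "optimize over the working set" step is approximated here by subgradient steps instead of a QP solve; all data and constants are hypothetical.

```python
# Cutting-plane / working-set loop for the structural SVM (simplified sketch).
import numpy as np

rng = np.random.default_rng(0)
n, num_labels, dim = 6, 3, 5
X = rng.normal(size=(n, dim))
Y = rng.integers(0, num_labels, size=n)

def psi(x, y):                            # joint feature map (one block per label)
    f = np.zeros(num_labels * dim)
    f[y * dim:(y + 1) * dim] = x
    return f

def F(w, x, y):
    return psi(x, y) @ w

w = np.zeros(num_labels * dim)
working_sets = [set() for _ in range(n)]  # Y_i, the active constraints
eps, lam, lr = 1e-3, 0.1, 0.1

for epoch in range(20):
    added = 0
    for i in range(n):
        # Separation oracle: most violated label y != y_i (brute force here).
        y_hat = max((y for y in range(num_labels) if y != Y[i]),
                    key=lambda y: F(w, X[i], y))
        if F(w, X[i], Y[i]) - F(w, X[i], y_hat) < 1 - eps:
            working_sets[i].add(y_hat)    # margin violated: add the constraint
            added += 1
        # Re-optimize over the current working sets (subgradient stand-in).
        grad = 2 * lam * w
        for y in working_sets[i]:
            if 1 + F(w, X[i], y) - F(w, X[i], Y[i]) > 0:
                grad += psi(X[i], y) - psi(X[i], Y[i])
        w -= lr * grad
    if added == 0:                        # no violated constraints left: done
        break
```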
Max-Margin Cost Sensitivity
Cost function Δ : Y × Y → ℝ:
- Multiclass: 0/1 loss
- Sequences: Hamming loss
- Parsing: (1 − F1)
Extend the max-margin framework for cost sensitivity (both variants are compared in the sketch below):
- Margin rescaling (Taskar et al. 2004): $\max_{y \neq y_i} \big(\Delta(y_i, y) + F_w(x_i, y) - F_w(x_i, y_i)\big)_+$
- Slack rescaling (Tsochantaridis et al. 2004): $\max_{y \neq y_i} \Delta(y_i, y)\big(1 + F_w(x_i, y) - F_w(x_i, y_i)\big)_+$
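A minimal sketch contrasting the two cost-sensitive hinge losses on a toy label sequence with Hamming cost; the scores F(x, y) are hypothetical numbers and the sequence is short enough to enumerate Y.

```python
# Margin rescaling adds the cost to the required margin; slack rescaling
# multiplies the slack by the cost.
import itertools
import numpy as np

y_true = (0, 1, 1)
labels = (0, 1)
rng = np.random.default_rng(0)
scores = {y: rng.normal() for y in itertools.product(labels, repeat=3)}
scores[y_true] += 1.0                 # nudge the true sequence's score upward

def hamming(y, y_prime):
    return sum(a != b for a, b in zip(y, y_prime))

F_true = scores[y_true]
margin_rescaled = max(hamming(y_true, y) + scores[y] - F_true
                      for y in scores if y != y_true)
slack_rescaled = max(hamming(y_true, y) * (1 + scores[y] - F_true)
                     for y in scores if y != y_true)
print(max(0.0, margin_rescaled), max(0.0, slack_rescaled))
```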
Example: Sequences
Viterbi decoding for the argmax operation (see the sketch below).
Decompose features over time:
$$\psi(x, y) = \sum_t \big(\psi(x_t, y_t) + \psi(y_t, y_{t-1})\big)$$
Two types of features:
- Observation-label: ψ(x_t, y_t) = φ(x_t) ⊗ Λ(y_t)
- Label-label: ψ(y_t, y_{t−1}) = Λ(y_t) ⊗ Λ(y_{t−1})
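A minimal sketch of Viterbi decoding for argmax_y ⟨ψ(x, y), w⟩ with the feature map decomposed over time into observation-label and label-label terms as above; the toy observation features and weights are hypothetical random numbers.

```python
# Viterbi decoding over a linear-chain label sequence.
import numpy as np

rng = np.random.default_rng(0)
T, num_labels, dim = 5, 3, 4
phi = rng.normal(size=(T, dim))            # phi(x_t) for each position
W_obs = rng.normal(size=(num_labels, dim)) # observation-label weights
W_trans = rng.normal(size=(num_labels, num_labels))  # label-label weights

emit = phi @ W_obs.T                       # emit[t, y] = <w, psi(x_t, y)>

# Dynamic programming over the chain.
delta = np.full((T, num_labels), -np.inf)
backp = np.zeros((T, num_labels), dtype=int)
delta[0] = emit[0]
for t in range(1, T):
    for y in range(num_labels):
        scores = delta[t - 1] + W_trans[:, y] + emit[t, y]
        backp[t, y] = np.argmax(scores)
        delta[t, y] = scores[backp[t, y]]

# Backtrack the best label sequence.
y_hat = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    y_hat.append(int(backp[t, y_hat[-1]]))
y_hat.reverse()
print(y_hat)
```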
Example: Sequences (cont.)
The inner product between two joint feature maps decomposes over the two feature types:
$$\langle \psi(x, y), \psi(\bar x, \bar y)\rangle = \sum_{s,t} \langle \phi(x_t), \phi(\bar x_s)\rangle\, \delta(y_t, \bar y_s) + \delta(y_t, \bar y_s)\, \delta(y_{t-1}, \bar y_{s-1}) = \sum_{s,t} k\big((x_t, y_t), (\bar x_s, \bar y_s)\big) + \tilde k\big((y_t, y_{t-1}), (\bar y_s, \bar y_{s-1})\big)$$
- Arbitrary kernels on x
- Linear kernels on y
A minimal computation of this kernel is sketched below.
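A minimal sketch of the joint-feature inner product between two labeled sequences, decomposed as on this slide; the RBF base kernel on the observations and the toy sequences are assumptions for illustration.

```python
# Sequence kernel: arbitrary kernel on x, delta (linear) kernels on labels.
import numpy as np

def k_x(xt, xs):                       # base kernel on observations (RBF here)
    return np.exp(-0.5 * np.sum((xt - xs) ** 2))

def seq_kernel(x, y, xbar, ybar):
    # Observation-label part: k(x_t, xbar_s) * delta(y_t, ybar_s).
    obs = sum(k_x(x[t], xbar[s]) * (y[t] == ybar[s])
              for t in range(len(y)) for s in range(len(ybar)))
    # Label-label part: delta(y_t, ybar_s) * delta(y_{t-1}, ybar_{s-1}).
    trans = sum((y[t] == ybar[s]) * (y[t - 1] == ybar[s - 1])
                for t in range(1, len(y)) for s in range(1, len(ybar)))
    return obs + trans

rng = np.random.default_rng(0)
x, xbar = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
y, ybar = [0, 1, 1, 2], [0, 0, 1, 2, 2]
print(seq_kernel(x, y, xbar, ybar))
```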
Other SO Prediction Methods
Find w to minimize the expected loss $E_{(x,y)\sim D}[\Delta(y, h_f(x))]$ via the regularized empirical objective
$$w^* = \operatorname{argmin}_w \sum_{i=1}^{n} L(x_i, y_i, w) + \lambda \|w\|^2$$
Loss functions (compared numerically in the sketch below):
- Hinge loss
- Log-loss: CRF [Lafferty et al 2001]: $L(x, y, f) = -F(x, y) + \log \sum_{\hat y \in \mathcal Y} \exp(F(x, \hat y))$
- Exp-loss: Structured Boosting [Altun et al 2002]: $L(x, y, f) = \sum_{\hat y \in \mathcal Y} \exp(F(x, \hat y) - F(x, y))$
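A minimal worked comparison of the three surrogate losses on a single toy example, given scores F(x, y) for every y in a small label set; the numbers are hypothetical.

```python
# Hinge, log (CRF), and exp (boosting) losses computed from the same scores.
import numpy as np

scores = {"A": 2.0, "B": 1.2, "C": -0.5}   # F(x, y) for each candidate y
y_true = "A"
F_true = scores[y_true]

hinge = max(0.0, max(1 + F - F_true for y, F in scores.items() if y != y_true))
log_loss = -F_true + np.log(sum(np.exp(F) for F in scores.values()))
exp_loss = sum(np.exp(F - F_true) for F in scores.values())
print(hinge, log_loss, exp_loss)
```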
Complex Prediction via SO Prediction
Possible solution: treat complex prediction as a loopy graph and use standard approximation methods.
Shortcomings:
- The known graph structure is not exploited
- Nor is the fact that the tasks are defined over the same input space
Solution:
- Dependencies within tasks are more important than dependencies across tasks; use this to design the approximation method
- Restrict the function class for each task via learning across tasks
Joint Learning of Multiple SO Prediction [Altun 2008]
Tasks 1, ..., m. Learn a discriminative function T : X → Y_1 × ... × Y_m:
$$T(x, y; w, \bar w) = \sum_\ell F_\ell(x, y_\ell; w_\ell) + \sum_{\ell, \ell'} F_{\ell\ell'}(y_\ell, y_{\ell'}; \bar w)$$
where
- w_ℓ capture dependencies within individual tasks
- w̄_{ℓ,ℓ'} capture dependencies across tasks
- F_ℓ is defined as before
- F_{ℓℓ'} are linear functions with respect to the clique assignments of tasks ℓ, ℓ'
Joint Learning of Multiple SO Prediction
Assume a low-dimensional representation Θ shared across all tasks [Argyriou et al 2007]:
$$F_\ell(x, y_\ell; w_\ell, \Theta) = \langle w_{\ell\sigma}, \Theta^T \psi(x, y_\ell)\rangle$$
Find T by discovering Θ and learning w, w̄:
$$\min_{\Theta, w, \bar w} \;\; \hat r(\Theta) + r(w) + \bar r(\bar w) + \sum_{\ell=1}^{m} \sum_{i=1}^{n} L_\ell(x_i, y_i^\ell; w, \bar w, \Theta)$$
- r, r̄: regularization, e.g. L2 norm
- r̂: e.g. Frobenius norm, trace norm
- L: loss function, e.g. log-loss, hinge loss
The optimization is not jointly convex over Θ and (w, w̄).
Joint Learning of Multiple SO Prediction
Via a reformulation, we get a jointly convex optimization:
$$\min_{A, D, \bar w} \;\; \sum_{\ell\sigma} \big\langle A_{\ell\sigma}, D^{+} A_{\ell\sigma}\big\rangle + \sum_{\ell=1}^{m} \sum_{i=1}^{n} L_\ell(x_i, y_i^\ell; A, \bar w) + \bar r(\bar w)$$
- Optimize iteratively with respect to (A, w̄) and D
- Closed-form solution for D
A sketch of this alternating scheme follows below.
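A minimal sketch of the alternation, under strong simplifying assumptions: the structured hinge/log losses are replaced by a plain squared loss per task, the cross-task weights w̄ are omitted, and the closed-form update used for the shared structure is D = (A Aᵀ)^{1/2} / trace((A Aᵀ)^{1/2}) from Argyriou et al. 2007. All data and constants are hypothetical.

```python
# Alternate between gradient steps on the task parameters A (fixed D) and a
# closed-form update of the shared structure D.
import numpy as np

rng = np.random.default_rng(0)
num_tasks, dim, n = 3, 6, 40
X = [rng.normal(size=(n, dim)) for _ in range(num_tasks)]
shared = rng.normal(size=dim)                      # tasks share a common direction
targets = [X[l] @ (shared + 0.1 * rng.normal(size=dim)) for l in range(num_tasks)]

A = 0.01 * rng.normal(size=(dim, num_tasks))       # column l = parameters of task l
D = np.eye(dim) / dim
lam, lr = 0.1, 0.01

def psd_sqrt(M):
    # Matrix square root of a PSD matrix via its eigendecomposition.
    evals, evecs = np.linalg.eigh(M)
    return (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

for it in range(50):
    # Step 1: optimize A for fixed D; the regularizer is sum_l <a_l, D^+ a_l>.
    D_pinv = np.linalg.pinv(D)
    for l in range(num_tasks):
        residual = X[l] @ A[:, l] - targets[l]
        grad = X[l].T @ residual / n + lam * D_pinv @ A[:, l]
        A[:, l] -= lr * grad
    # Step 2: closed-form update of the shared structure D.
    root = psd_sqrt(A @ A.T)
    D = root / np.trace(root)

print(np.round(D, 2))                              # low-rank structure shared by the tasks
```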