
Structured Output Learning with Indirect Supervision

Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser and Dan Roth
Computer Science Department, University of Illinois at Urbana-Champaign


  1. Geometric Interpretation for SSVM
      Decision function: $\arg\max_{h \in \mathcal{H}(x_i)} w^T \Phi(x_i, h)$
      [Figure: the weight vector $w$ and the candidate feature vectors $\{\Phi(x_1, h) \mid h \in \mathcal{H}(x_1)\}$, with the gold structure $\Phi(x_1, h_1^*)$ and the prediction $\Phi(x_1, \hat{h})$ marked.]
      Training intuition: given an example $(x_i, h_i)$, find a $w$ such that the gold structure $h_i$ has the highest score!
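As a minimal sketch of the decision function above, assuming $\mathcal{H}(x)$ is small enough to enumerate, prediction is a plain argmax over candidate scores. The candidate names (`h_a`, `h_b`, `h_c`), their feature vectors, and the weight vector are hypothetical and not taken from the slides.

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]


def dot(w: Vector, phi: Vector) -> float:
    """Inner product w^T Phi(x, h)."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def predict(w: Vector, candidates: Dict[str, Vector]) -> Tuple[str, float]:
    """Return argmax_h w^T Phi(x, h) and its score over an enumerated H(x)."""
    best = max(candidates, key=lambda h: dot(w, candidates[h]))
    return best, dot(w, candidates[best])


# Hypothetical feature vectors Phi(x_1, h) for three candidate structures.
candidates_x1 = {"h_a": [1.0, 0.0], "h_b": [0.0, 1.0], "h_c": [1.0, 1.0]}
w = [0.5, -1.0]

best_h, best_score = predict(w, candidates_x1)
print(best_h, best_score)
```

In realistic structured problems the argmax is computed by a task-specific inference routine rather than by enumeration; the enumeration here only keeps the example short.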

  5. Structural SVM
      $$\min_{w} \; \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, h_i, w)$$
      Regularization: measures the model complexity.
      Structural loss: $S$ is the set of structured labeled examples. $L_S(x_i, h_i, w)$ measures "the distance" between the current best prediction and the gold structure $h_i$.
      $L_S$ can use hinge, square hinge, or other loss functions.
      A convex optimization problem.
      Now, add supervision from the companion task!
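The structural loss can be illustrated with a short sketch. It follows the form given on the later "Convexity Properties" slide, $L_S = \ell(\max_h [\Delta(h, h_i) - w^T\Phi(x_i, h_i) + w^T\Phi(x_i, h)])$, with $\ell$ taken to be the hinge function. The Hamming distance used for $\Delta$, the toy structures, and the feature vectors are assumptions made for illustration only.

```python
from typing import Dict, Sequence, Tuple

Structure = Tuple[int, ...]


def dot(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * pi for wi, pi in zip(w, phi))


def hamming(h: Structure, gold: Structure) -> float:
    """Assumed distance Delta(h, h_i): number of positions that differ."""
    return float(sum(a != b for a, b in zip(h, gold)))


def structural_hinge_loss(
    w: Sequence[float],
    candidates: Dict[Structure, Sequence[float]],
    gold: Structure,
) -> float:
    """hinge(max_h [Delta(h, gold) - w.Phi(x, gold) + w.Phi(x, h)])."""
    gold_score = dot(w, candidates[gold])
    augmented = max(
        hamming(h, gold) - gold_score + dot(w, phi)
        for h, phi in candidates.items()
    )
    return max(0.0, augmented)


# Hypothetical enumerated output space with one feature vector per structure.
candidates = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 1): [1.0, 1.0]}
gold = (0, 1)

loss_bad = structural_hinge_loss([0.5, -1.0], candidates, gold)   # gold scores lowest
loss_good = structural_hinge_loss([-2.0, 2.0], candidates, gold)  # gold wins by a margin
print(loss_bad, loss_good)
```

The loss is zero only when the gold structure beats every other structure by at least its distance $\Delta$, which is the margin requirement behind the training intuition.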

  7. The role of binary labeled data
      [Figure: the Structured Output Problem next to its Companion Binary Output Problem (Yes/No); example words shown: Israel, I t a l y.]
      Companion task: does this example possess a good structure?
      $x_1$ is positive. There must exist a good structure that justifies the positive label: $\exists h,\; w^T \Phi(x_1, h) \ge 0$
      $x_2$ is negative. No structure is good enough: $\forall h,\; w^T \Phi(x_2, h) \le 0$

  11. Why is binary labeled data useful?
      [Figure: the feature vectors $\{\Phi(x_1, h) \mid h \in \mathcal{H}(x_1)\}$ and $\{\Phi(x_2, h) \mid h \in \mathcal{H}(x_2)\}$, with Gold: $\Phi(x_1, h_1^*)$ and Predict: $\Phi(x_1, \hat{h})$, shown first with the SSVM weight vector $w$ and then with the $w$ learned by SSVM + Indirect Supervision.]
      $x_1$ is positive: there exists a good structure, $\exists h,\; w^T \Phi(x_1, h) \ge 0$, or $\max_h w^T \Phi(x_1, h) \ge 0$
      $x_2$ is negative: no structure is good enough, $\forall h,\; w^T \Phi(x_2, h) \le 0$, or $\max_h w^T \Phi(x_2, h) \le 0$
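The two conditions above reduce to a single check on the best-scoring structure, $y \cdot \max_h w^T\Phi(x, h) \ge 0$. The sketch below shows that check on enumerated candidates; the feature vectors and the two weight vectors are hypothetical values chosen only to show one consistent and one inconsistent case.

```python
from typing import List, Sequence


def dot(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * pi for wi, pi in zip(w, phi))


def max_score(w: Sequence[float], candidates: List[Sequence[float]]) -> float:
    """max_h w^T Phi(x, h) over an enumerated H(x)."""
    return max(dot(w, phi) for phi in candidates)


def satisfies_label(w: Sequence[float], candidates: List[Sequence[float]], y: int) -> bool:
    """Check y * max_h w^T Phi(x, h) >= 0 for a label y in {+1, -1}."""
    return y * max_score(w, candidates) >= 0


# Hypothetical candidate feature vectors for a positive and a negative example.
x1_candidates = [[1.0, 0.0], [0.0, 1.0]]    # x_1, labeled positive
x2_candidates = [[-1.0, 0.5], [0.0, -1.0]]  # x_2, labeled negative

w_ok = [1.0, 0.5]     # some structure of x_1 scores >= 0, all of x_2 score <= 0
w_bad = [-1.0, -1.0]  # violates both conditions

print(satisfies_label(w_ok, x1_candidates, +1), satisfies_label(w_ok, x2_candidates, -1))
print(satisfies_label(w_bad, x1_candidates, +1), satisfies_label(w_bad, x2_candidates, -1))
```

A negative example constrains every structure at once, while a positive example only requires one good structure, which is the asymmetry the optimization procedure later exploits.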

  20. Outline
      1. Motivation
      2. Structured Output Prediction and Its Companion Task
      3. Joint Learning with Indirect Supervision
      4. Optimization
      5. Experiments

  21. Binary and structured labeled data
      Direct Supervision: $S$ (Target Task)
        An example: $(x_i, h_i)$
        Goal: $w^T \Phi(x_i, h_i) \ge \max_{h \in \mathcal{H}(x_i)} w^T \Phi(x_i, h)$
        Structural loss: $L_S$
      Indirect Supervision: $B$ (Companion Task)
        An example: $(x_i, y_i)$
        Goal: $y_i \max_{h \in \mathcal{H}(x_i)} w^T \Phi(x_i, h) \ge 0$
        Binary loss: $L_B$
      Both $L_S$ and $L_B$ can use hinge, square-hinge, logistic, ...

  26. Joint Learning with Indirect Supervision (J-LIS)
      $$\min_{w} \; \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, h_i, w) + C_2 \sum_{i \in B} L_B(x_i, y_i, w)$$
      Regularization: measures the model complexity.
      Direct supervision: structured labeled data $S = \{(x, h)\}$
      Indirect supervision: binary labeled data $B = \{(x, y)\}$
      Share weight vector $w$: use the same weight vector for both structured labeled data and binary labeled data.
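To make the joint objective concrete, the sketch below evaluates it for one weight vector on a toy dataset. It assumes hinge loss for both $L_S$ and $L_B$, a Hamming distance for $\Delta$, an enumerable output space, and the same feature map for both tasks; all data values and the constants `c1`, `c2` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Structure = Tuple[int, ...]
Vector = Sequence[float]


def dot(w: Vector, phi: Vector) -> float:
    return sum(wi * pi for wi, pi in zip(w, phi))


def hinge(z: float) -> float:
    return max(0.0, z)


def loss_s(w: Vector, candidates: Dict[Structure, Vector], gold: Structure) -> float:
    """Structural hinge loss with an assumed Hamming distance as Delta."""
    gold_score = dot(w, candidates[gold])
    return hinge(max(
        sum(a != b for a, b in zip(h, gold)) - gold_score + dot(w, phi)
        for h, phi in candidates.items()
    ))


def loss_b(w: Vector, candidates: List[Vector], y: int) -> float:
    """Binary hinge loss: hinge(1 - y * max_h w^T Phi(x, h))."""
    return hinge(1.0 - y * max(dot(w, phi) for phi in candidates))


def jlis_objective(w: Vector, S, B, c1: float, c2: float) -> float:
    """||w||^2 / 2 + C1 * sum of L_S over S + C2 * sum of L_B over B."""
    reg = 0.5 * dot(w, w)
    direct = sum(loss_s(w, cands, gold) for cands, gold in S)
    indirect = sum(loss_b(w, cands, y) for cands, y in B)
    return reg + c1 * direct + c2 * indirect


# Hypothetical data: one structured example and two binary examples.
S = [({(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 1): [1.0, 1.0]}, (0, 1))]
B = [([[1.0, 0.0], [0.0, 1.0]], +1), ([[-1.0, 0.5], [0.0, -1.0]], -1)]

value = jlis_objective([-2.0, 2.0], S, B, c1=1.0, c2=0.5)
print(value)
```

Both loss terms are computed from the same `w`, which is how the binary labels end up influencing the structured predictor.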

  30. Convexity Properties
      $$\min_{w} \; \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, h_i, w) + C_2 \sum_{i \in B} L_B(x_i, y_i, w)$$
      $$L_S(x_i, h_i, w) = \ell\Big(\max_{h} \big[\Delta(h, h_i) - w^T \Phi(x_i, h_i) + w^T \Phi(x_i, h)\big]\Big) \qquad (1)$$
      $$L_B(x_i, y_i, w) = \ell\Big(1 - y_i \max_{h \in \mathcal{H}(x_i)} w^T \Phi_B(x_i, h)\Big) \qquad (2)$$

  31. Convexity Properties
      $$\min_{w} \; \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, h_i, w) + C_2 \sum_{i \in B^-} L_B(x_i, y_i, w) + C_2 \sum_{i \in B^+} L_B(x_i, y_i, w)$$
      Convex parts: regularization, direct supervision, and the negative data $B^-$ (the first three terms).
      Neither convex nor concave: the positive data $B^+$ (the last term).
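The split into convex and non-convex parts can be read off equation (2) by substituting the two label values. The short derivation below is a sketch of that reasoning, assuming $\ell$ is convex and non-decreasing; the symbol $h_i^*$ for a structure held fixed is introduced here for illustration.

```latex
\begin{align*}
% Negative example, y_i = -1: a max of linear functions of w is convex,
% and a convex non-decreasing \ell preserves convexity.
L_B(x_i, -1, w) &= \ell\Big(1 + \max_{h \in \mathcal{H}(x_i)} w^T \Phi_B(x_i, h)\Big)
  && \text{convex in } w \\
% Positive example, y_i = +1: the argument of \ell is concave in w,
% so the composition is in general neither convex nor concave.
L_B(x_i, +1, w) &= \ell\Big(1 - \max_{h \in \mathcal{H}(x_i)} w^T \Phi_B(x_i, h)\Big)
  && \text{neither convex nor concave in general} \\
% Positive example with the structure fixed to h_i^*: the argument is linear in w.
\ell\big(1 - w^T \Phi_B(x_i, h_i^*)\big) &
  && \text{convex in } w
\end{align*}
```

Fixing one structure per positive example therefore turns the whole objective into a convex problem in $w$.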

  32. JLIS: optimization procedure
      Algorithm
        1: Find the best structures for positive examples.
        2: Find the weight vector using the structures found in Step 1. Still need to do inference for structured examples and negative examples.
        3: Repeat!
      This algorithm converges when $\ell$ is monotonically increasing and convex.
      Properties of the algorithm:
        Asymmetric nature
        Converting a non-convex problem into a series of smaller convex problems
        Inference allows incorporating constraints on the output space.
      (Chang, Goldwasser, Roth, and Srikumar NAACL 2010)
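The alternating procedure above can be sketched on a toy problem. This is an illustration under several assumptions, not the JLIS package's implementation: the structured set $S$ is left out for brevity, $\ell$ is the hinge function, the output spaces are enumerated, and Step 2 uses plain subgradient descent as a stand-in for the cutting plane and dual coordinate descent solvers named in the slides. The names `train`, `solve_convex`, `steps`, `lr`, and `rounds` are hypothetical.

```python
from typing import List

Vector = List[float]
Candidates = List[Vector]  # enumerated Phi(x, h) for every h in H(x)


def dot(w: Vector, phi: Vector) -> float:
    return sum(a * b for a, b in zip(w, phi))


def best_structure(w: Vector, candidates: Candidates) -> Vector:
    """Inference: feature vector of argmax_h w^T Phi(x, h)."""
    return max(candidates, key=lambda phi: dot(w, phi))


def solve_convex(w: Vector, fixed_pos: List[Vector], negatives: List[Candidates],
                 c2: float, steps: int = 200, lr: float = 0.05) -> Vector:
    """Step 2 stand-in: subgradient descent on the convex sub-problem."""
    w = list(w)
    for _ in range(steps):
        grad = list(w)  # gradient of ||w||^2 / 2
        for phi in fixed_pos:  # hinge(1 - w^T phi), structure held fixed
            if 1.0 - dot(w, phi) > 0:
                grad = [g - c2 * p for g, p in zip(grad, phi)]
        for cands in negatives:  # hinge(1 + max_h w^T phi), inference still needed
            phi = best_structure(w, cands)
            if 1.0 + dot(w, phi) > 0:
                grad = [g + c2 * p for g, p in zip(grad, phi)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def train(positives: List[Candidates], negatives: List[Candidates],
          c2: float = 1.0, rounds: int = 5) -> Vector:
    w = [0.0] * len(positives[0][0])
    for _ in range(rounds):
        # Step 1: fix the best structure of every positive example under the current w.
        fixed_pos = [best_structure(w, cands) for cands in positives]
        # Step 2: solve the resulting convex problem for w.
        w = solve_convex(w, fixed_pos, negatives, c2)
    return w


# Hypothetical data: one positive and one negative example, two structures each.
positives = [[[1.0, 0.0], [0.0, 1.0]]]
negatives = [[[-1.0, 0.0], [0.0, -1.0]]]

w = train(positives, negatives)
pos_score = dot(w, best_structure(w, positives[0]))
neg_score = dot(w, best_structure(w, negatives[0]))
print(w, pos_score, neg_score)
```

Only the positive examples have their structures frozen in Step 1; the negative examples keep the inner max during Step 2, which reflects the asymmetric nature noted in the slide.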

  35. Solving the convex sub-problem
      $$\min_{w} \; \frac{\|w\|^2}{2} + C_1 \sum_{i \in S} L_S(x_i, h_i, w) + C_2 \sum_{i \in B^-} L_B(x_i, y_i, w) + C_2 \sum_{i \in B^+} L_B(x_i, y_i, w)$$
      The last term, $C_2 \sum_{i \in B^+} L_B(x_i, y_i, w)$, is evaluated with fixed structures.
      Cutting plane method:
        Find the "best structure" for examples in $S$ and $B^-$ with the current $w$.
        Add the chosen structure into the cache and solve it again!
      Dual coordinate descent method:
        Simple implementation with square (L2) hinge loss.

  40. Experimental Setting
      Tasks
        Task 1: Phonetic alignment
        Task 2: Part-of-speech tagging
        Task 3: Information extraction (citation recognition, advertisement field recognition)
      Companion tasks
        Phonetic alignment: transliteration pair or not
        POS tagging: has a legitimate POS tag sequence or not
        IE: is a legitimate citation/advertisement or not

  41. Experimental Results
      [Figure: accuracy (axis ticks 60 to 80) on the tasks PA, POS, Citation and ADS, comparing Structural SVM with Joint Learning with Indirect Supervision. PA: Phonetic Alignment; ADS: Advertisement field recognition.]

  43. Impact of negative examples
      J-LIS takes advantage of both positively and negatively labeled data.
      [Figure: accuracy (axis ticks 62 to 66) of Structural SVM and JLIS against the number of tokens in the negative examples (100, 200, 400, 800, 1.6k, 3.2k, 6.4k, 12.8k, 25.6k, all).]

  46. Comparison to other learning frameworks
      Generalization over several frameworks:
        $B = \emptyset \Rightarrow$ Structured SVM (Tsochantaridis, Hofmann, Joachims, and Altun 2004)
        $S = \emptyset \Rightarrow$ Latent SVM/LR (Felzenszwalb, Girshick, McAllester, and Ramanan 2009) (Chang, Goldwasser, Roth, and Srikumar NAACL 2010)
      Semi-supervised learning methods:
        (Zien, Brefeld, and Scheffer 2007): Transductive Structural SVM
        (Brefeld and Scheffer 2006): co-Structural SVM
        J-LIS uses "negative" examples.
      Compared to Contrastive Estimation:
        Conceptually related.

  49. Conclusions
      It is possible to use binary labeled data for learning structures!
        J-LIS: gains from both direct and indirect supervision
        Similarly, structured labeled data can help the binary task
        Allows the use of constraints on structures
      Many exciting new directions!
        Using existing labeled datasets as supervision for structured tasks
        How to generate good "negative" examples?
        Other forms of indirect supervision?

  51. Thank you!
      Our learning code is available: the JLIS package
      http://l2r.cs.uiuc.edu/~cogcomp/software.php

  52. Compared to Contrastive Estimation: I
      Contrastive Estimation: performing unsupervised learning with log-linear models.
      Maximize $\log P(x)$.
      Model: $$P(x) = \frac{\sum_{h} \exp(w^T \Phi(x, h))}{\sum_{h, \hat{x}} \exp(w^T \Phi(\hat{x}, h))}$$
      CE: $$P(x) = \frac{\sum_{h} \exp(w^T \Phi(x, h))}{\sum_{h,\, \hat{x} \in N(x)} \exp(w^T \Phi(\hat{x}, h))}$$
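The CE probability above can be computed directly when both the structures and the neighborhood are small enough to enumerate. In the sketch below the feature vectors and the single perturbed neighbor are hypothetical, and the neighborhood $N(x)$ is assumed to contain $x$ itself.

```python
import math
from typing import List

Vector = List[float]
Candidates = List[Vector]  # enumerated Phi(x, h) for every h


def dot(w: Vector, phi: Vector) -> float:
    return sum(a * b for a, b in zip(w, phi))


def partition(w: Vector, candidates: Candidates) -> float:
    """Sum over h of exp(w^T Phi(x, h)) for one input x."""
    return sum(math.exp(dot(w, phi)) for phi in candidates)


def ce_probability(w: Vector, x_candidates: Candidates,
                   neighborhood: List[Candidates]) -> float:
    """P(x) with the denominator restricted to the neighborhood N(x)."""
    numerator = partition(w, x_candidates)
    denominator = sum(partition(w, cands) for cands in neighborhood)
    return numerator / denominator


# Hypothetical input x and one perturbed neighbor, two structures each.
x = [[1.0, 0.0], [0.0, 1.0]]
x_perturbed = [[-1.0, 0.0], [0.0, -1.0]]
neighborhood = [x, x_perturbed]  # assumed to include x itself

p_uniform = ce_probability([0.0, 0.0], x, neighborhood)
p_trained = ce_probability([1.0, 1.0], x, neighborhood)
print(p_uniform, p_trained)
```

Every term is a sum over structures, in contrast to the max over structures used by J-LIS.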

  53. Compared to Contrastive Estimation: II
      CE: $$P(x) = \frac{\sum_{h} \exp(w^T \Phi(x, h))}{\sum_{h,\, \hat{x} \in N(x)} \exp(w^T \Phi(\hat{x}, h))}$$
      Supervision type: "Neighbors" (CE); Structured + Binary (J-LIS)
      Inference problem: sum (CE); max (J-LIS)
      Property: can use existing data. CE needs to know the relationship between "neighbors" of the input $x$. J-LIS can use existing binary labeled data.
      Comparing J-LIS and CE without using labeled data:
        Part-of-speech tagging experiments, same features and dataset.
        Random baseline: 35%
        EM: 60.9% (62.1%), CE: 74.7% (79.0%)
        J-LIS: 70.1%; J-LIS + 5 labeled examples: 79.1%

  58. Joint learning: Results
      [Figure: accuracy on the binary classification task (axis ticks 40 to 95) against the size of the training data $|B|$ (100, 200, 400, 800, 1600), with curves for $|S| = 10$ and $|S| = 20$, each trained as "init. only" and "joint".]
      Impact of structured labeled data when binary classification is our target. Results (for transliteration identification) show that joint training of direct and indirect supervision significantly improves performance, especially when direct supervision is scarce.
