  1. Training Strategies (CS 6355: Structured Prediction)

  2. So far we saw
     • What is structured output prediction?
     • Different ways for modeling structured prediction
       – Conditional random fields, factor graphs, constraints
     • What we only occasionally touched upon:
       – Algorithms for training and inference
         • Viterbi (inference in sequences)
         • Structured perceptron (training in general)
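
Since the structured perceptron is only mentioned in passing here, a minimal sketch of its training loop may help. This is an illustrative sketch, not the course's code: the names (features, argmax_inference) and the data layout are assumptions; argmax_inference stands in for whatever exact inference procedure (e.g., Viterbi for sequences) returns the highest-scoring structure under the current weights.

    def structured_perceptron_epoch(w, data, features, argmax_inference, lr=1.0):
        """One pass of mistake-driven structured perceptron training.
        data: list of (y, z_gold) pairs, where y is an input and z_gold its structure;
        features(y, z): numpy feature vector Phi(y, z);
        argmax_inference(w, y): highest-scoring structure under weights w."""
        for y, z_gold in data:
            z_pred = argmax_inference(w, y)   # inference step (e.g., Viterbi)
            if z_pred != z_gold:              # update only on mistakes
                w = w + lr * (features(y, z_gold) - features(y, z_pred))
        return w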

  3. Rest of the semester
     • Strategies for training
       – Structural SVM
       – Stochastic gradient descent
       – More on local vs. global training
     • Algorithms for inference
       – Exact inference
       – “Approximate” inference
       – Formulating inference problems in general
     • Latent/hidden variables, representations and such

  4. Up next
     • Structural Support Vector Machine
       – How it naturally extends multiclass SVM
     • Empirical Risk Minimization
       – Or: how structural SVM and CRF are solving very similar problems
     • Training Structural SVM via stochastic gradient descent
       – And some tricks

  5. Where are we?
     • Structural Support Vector Machine
       – How it naturally extends multiclass SVM
     • Empirical Risk Minimization
       – Or: how structural SVM and CRF are solving very similar problems
     • Training Structural SVM via stochastic gradient descent
       – And some tricks

  6. Recall: Binary and Multiclass SVM
     • Binary SVM
       – Maximize margin
       – Equivalently, minimize the norm of the weights such that the closest points to the hyperplane have a score of ±1
     • Multiclass SVM
       – Each label has a different weight vector (like one-vs-all)
       – Maximize the multiclass margin
       – Equivalently, minimize the total norm of the weights such that the true label is scored at least 1 more than the second-best one

  7. Multiclass SVM in the separable case
     We have a data set $D = \{(\mathbf{x}_i, y_i)\}$.
     Recall the hard binary SVM:
       $\min_{\mathbf{w}} \tfrac{1}{2}\mathbf{w}^T\mathbf{w}$  such that  $y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1$ for all $i$

  8. Multiclass SVM in the separable case
     We have a data set $D = \{(\mathbf{x}_i, y_i)\}$. Recall the hard binary SVM; the multiclass version is
       $\min_{\mathbf{w}} \tfrac{1}{2}\sum_k \mathbf{w}_k^T\mathbf{w}_k$  such that  $\mathbf{w}_{y_i}^T\mathbf{x}_i \ge \mathbf{w}_k^T\mathbf{x}_i + 1$ for all $i$ and all $k \ne y_i$
       – Objective: the size of the weights; effectively, a regularizer
       – Constraints: the score for the true label is higher than the score for any other label by 1
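
As a concrete reading of the constraints above, here is a small NumPy sketch (the function name and array shapes are assumptions made for this example, not something from the slides) that checks, for each training example, whether the true label is scored at least 1 above the best competing label.

    import numpy as np

    def multiclass_margin_violations(W, X, y):
        """W: (num_labels, num_features) weight matrix, one row per label;
        X: (num_examples, num_features); y: (num_examples,) true label indices.
        Returns a boolean array: True where the separability constraint fails."""
        scores = X @ W.T                        # (num_examples, num_labels)
        idx = np.arange(len(y))
        true_scores = scores[idx, y]            # score of the true label
        others = scores.copy()
        others[idx, y] = -np.inf                # mask out the true label
        runner_up = others.max(axis=1)          # best competing label's score
        return true_scores < runner_up + 1      # violated if not ahead by 1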

  9. Structural SVM: First attempt
     Suppose we have some definition of a structure (a factor graph)
       – And feature definitions for each “part” $q$ as $\Phi_q(\mathbf{y}, \mathbf{z}_q)$
       – Remember: we can talk about the feature vector for the entire structure
         • Features sum over the parts: $\Phi(\mathbf{y}, \mathbf{z}) = \sum_{q \in \mathrm{parts}(\mathbf{y})} \Phi_q(\mathbf{y}, \mathbf{z}_q)$

  10. Structural SVM: First attempt
      Suppose we have some definition of a structure (a factor graph)
        – And feature definitions for each “part” $q$ as $\Phi_q(\mathbf{y}, \mathbf{z}_q)$
        – Remember: we can talk about the feature vector for the entire structure
          • Features sum over the parts: $\Phi(\mathbf{y}, \mathbf{z}) = \sum_{q \in \mathrm{parts}(\mathbf{y})} \Phi_q(\mathbf{y}, \mathbf{z}_q)$
      We also have a data set $E = \{(\mathbf{y}_j, \mathbf{z}_j)\}$

  11. Structural SVM: First attempt
      Suppose we have some definition of a structure (a factor graph)
        – And feature definitions for each “part” $q$ as $\Phi_q(\mathbf{y}, \mathbf{z}_q)$
        – Remember: we can talk about the feature vector for the entire structure
          • Features sum over the parts: $\Phi(\mathbf{y}, \mathbf{z}) = \sum_{q \in \mathrm{parts}(\mathbf{y})} \Phi_q(\mathbf{y}, \mathbf{z}_q)$
      We also have a data set $E = \{(\mathbf{y}_j, \mathbf{z}_j)\}$
      What we want from training (following the multiclass idea): for each training example $(\mathbf{y}_j, \mathbf{z}_j)$,
        – The annotated structure $\mathbf{z}_j$ gets the highest score among all structures
        – Or, to be safe, $\mathbf{z}_j$ gets a score that is at least one more than all other structures:
          $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + 1 \quad \forall\, \mathbf{z} \ne \mathbf{z}_j$
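
To make the decomposition and the constraint concrete, here is an illustrative sketch; the helper names (part_features, parts) and the calling convention are assumptions for this example, not the course's API. The global feature vector sums per-part feature vectors, and the first-attempt constraint asks that the gold structure outscore any competing structure by at least 1.

    def global_features(y, z, part_features, parts):
        """Phi(y, z) = sum over parts q in parts(y) of Phi_q(y, z_q),
        where part_features(y, z, q) returns a numpy vector."""
        return sum(part_features(y, z, q) for q in parts(y))

    def first_attempt_violated(w, y, z_gold, z_other, part_features, parts):
        """True if w . Phi(y, z_gold) < w . Phi(y, z_other) + 1."""
        score_gold = w @ global_features(y, z_gold, part_features, parts)
        score_other = w @ global_features(y, z_other, part_features, parts)
        return score_gold < score_other + 1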

  16. Structural SVM: First attempt
      Maximize margin: for every training example $(\mathbf{y}_j, \mathbf{z}_j)$ and some other structure $\mathbf{z}$, require
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + 1$
      (left side: score for the gold structure; right side: score for the other structure)

  17. Structural SVM: First attempt
      Maximize margin: for an input with gold structure $(\mathbf{y}_j, \mathbf{z}_j)$ and some other structure $\mathbf{z}$, require
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + 1$
      (left side: score for the gold structure; right side: score for the other structure)

  18. Structural SVM: First attempt
      Maximize margin by minimizing the norm of $\mathbf{w}$:
        $\min_{\mathbf{w}} \tfrac{1}{2}\mathbf{w}^T\mathbf{w}$  such that  $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + 1$
      for every input with gold structure $(\mathbf{y}_j, \mathbf{z}_j)$ and every other structure $\mathbf{z}$
      (left side of the constraint: score for the gold structure; right side: score for the other structure)

  19. Structural SVM: First attempt
      Maximize margin by minimizing the norm of $\mathbf{w}$, subject to
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + 1$
      for every input with gold structure $(\mathbf{y}_j, \mathbf{z}_j)$ and every other structure $\mathbf{z}$
      Problem?

  20. Structural SVM: First attempt
      Maximize margin by minimizing the norm of $\mathbf{w}$, subject to
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + 1$
      Problem: (illustration on the slide: a gold structure)

  21. Structural SVM: First attempt
      Maximize margin by minimizing the norm of $\mathbf{w}$, subject to
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + 1$
      Problem: (illustration on the slide: a gold structure, an other structure A with only one mistake, and an other structure B that is fully incorrect)

  22. Structural SVM: First attempt
      Maximize margin by minimizing the norm of $\mathbf{w}$, subject to
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + 1$
      Problem: (illustration on the slide: a gold structure, an other structure A with only one mistake, and an other structure B that is fully incorrect)
      Structure B is more wrong, but this formulation will be happy if both A and B are scored one less than gold! No partial credit!

  23. Structural SVM: Second attempt
      Maximize margin by minimizing the norm of $\mathbf{w}$, subject to
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + \Delta(\mathbf{z}_j, \mathbf{z})$
      for every input with gold structure $(\mathbf{y}_j, \mathbf{z}_j)$ and every other structure $\mathbf{z}$
      (left side of the constraint: score for the gold structure; right side: score for the other structure)

  24. Structural SVM: Second attempt
      Maximize margin by minimizing the norm of $\mathbf{w}$, subject to
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + \Delta(\mathbf{z}_j, \mathbf{z})$
      Here $\Delta(\mathbf{z}_j, \mathbf{z})$ is the Hamming distance between the structures: it counts the number of differences between them
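
A minimal sketch of the Hamming distance for sequence-shaped structures, applied to a toy version of the gold / A / B example from the earlier slides; the label values themselves are made up for illustration.

    def hamming_distance(z1, z2):
        """Number of positions at which two equal-length structures differ."""
        assert len(z1) == len(z2)
        return sum(a != b for a, b in zip(z1, z2))

    gold = ["B-PER", "I-PER", "O", "O", "B-LOC"]
    a    = ["B-PER", "I-PER", "O", "O", "O"]           # only one mistake
    b    = ["O", "B-LOC", "B-PER", "I-PER", "I-PER"]   # fully incorrect
    print(hamming_distance(gold, a))   # 1
    print(hamming_distance(gold, b))   # 5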

  26. Structural SVM: Second attempt
      Intuition
      • It is okay for a structure that is close (in the Hamming sense) to the true one to get a score that is close to the true structure's score
      • Structures that are very different from the true structure should get much lower scores

  27. Structural SVM: Second attempt
      Maximize margin by minimizing the norm of $\mathbf{w}$, subject to
        $\mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}_j) \ge \mathbf{w}^T \Phi(\mathbf{y}_j, \mathbf{z}) + \Delta(\mathbf{z}_j, \mathbf{z})$ for all $j$ and all $\mathbf{z} \ne \mathbf{z}_j$
      Intuition
      • It is okay for a structure that is close (in the Hamming sense) to the true one to get a score that is close to the true structure's score
      • Structures that are very different from the true structure should get much lower scores
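
A small self-contained sketch of what the second attempt asks for, on a toy numeric example (the scores and the function name are assumptions for illustration). With the first attempt's fixed margin of 1, structures A and B could both sit just 1 below the gold score; with the Hamming-rescaled margin, a fully incorrect structure (distance 5) must be scored at least 5 below gold, while a structure with one mistake (distance 1) still only needs to be 1 below.

    def second_attempt_violated(score_gold, score_other, z_gold, z_other):
        """Hamming-rescaled margin constraint: the gold score must exceed the
        other structure's score by at least their Hamming distance."""
        margin = sum(a != b for a, b in zip(z_gold, z_other))
        return score_gold < score_other + margin

    # Gold scored 10.0; a competing structure scored 9.0:
    print(second_attempt_violated(10.0, 9.0, list("xxxxx"), list("xxxxy")))  # False (distance 1: satisfied)
    print(second_attempt_violated(10.0, 9.0, list("xxxxx"), list("yyyyy")))  # True  (distance 5: violated)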
