Search-Guided, Lightly-Supervised Training

Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks - PowerPoint PPT Presentation



  1. Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks. Pedram Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum

  2. Structured Prediction • We are interested in learning a function F: X → Y, where X are the input variables and Y the output variables. • We can define F(x) = argmin_y E(x, y) for an energy function E. • For a Gibbs distribution: P(y | x) ∝ exp(-E(x, y)).

  3. Structured Prediction Energy Networks (SPENs) • If E(x, y) is parameterized by a differentiable model such as a deep neural network, we can find a local minimum of E using gradient descent. • The energy network expresses the correlations among input and output variables. • Traditionally, graphical models are used to represent the correlations among output variables. • Inference is intractable for most expressive graphical models.
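A minimal sketch of gradient-descent inference over a relaxed output, assuming a generic `energy_net(x, y)` that returns one scalar energy per example; the step size, number of steps, and output shape are illustrative choices, not the paper's exact settings:

```python
import torch

def gradient_descent_inference(energy_net, x, seq_len, num_labels, steps=30, lr=0.1):
    # Parameterize y through logits so that a softmax keeps each output
    # variable on the probability simplex during optimization.
    logits = torch.zeros(x.shape[0], seq_len, num_labels, requires_grad=True)
    optimizer = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        y = torch.softmax(logits, dim=-1)   # relaxed (continuous) output
        energy_net(x, y).sum().backward()   # minimize the total energy of the batch
        optimizer.step()
    return torch.softmax(logits, dim=-1).detach()  # approximate local minimum of E(x, .)
```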

  4. Energy Models [picture from Altinel (2018)] [picture from Belanger (2016)]

  5. Training SPENs • Structural SVM (Belanger and McCallum, 2016) • End-to-End (Belanger et al., 2017) • Value-based training (Gygli et al., 2017) • Inference Network (Lifu Tu and Kevin Gimpel, 2018) • Rank-Based Training (Rooshenas et al., 2018)

  6. Indirect Supervision • Data annotation is expensive, especially for structured outputs. • Domain knowledge can serve as a source of supervision. • It can be written as a reward function R(x, y) that evaluates a pair of input and output configurations into a scalar value. • For a given x, we are looking for the best y that maximizes R(x, y).
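As a toy illustration of this objective, the sketch below finds the reward-maximizing structure by brute force. Here `reward` is any user-supplied function from an (input, output) pair to a scalar; enumeration is only feasible for tiny output spaces, which is why the deck later replaces it with search.

```python
from itertools import product

def best_output(reward, x, label_set, length):
    # For a given x, return the label sequence y that maximizes R(x, y).
    # Enumerating all |label_set|**length structures is purely illustrative.
    return max(product(label_set, repeat=length), key=lambda y: reward(x, y))
```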

  7. Search-Guided Training: We have a reward function that provides indirect supervision.

  8. Search-Guided Training: We have a reward function that provides indirect supervision, and we want to learn a smooth version of it so that we can use gradient-descent inference at test time.

  9. Search-Guided Training [figure: sample y0]: We sample a point from the energy function using noisy gradient-descent inference.

  10. Search-Guided Training [figure: samples y0, y1]: We sample a point from the energy function using noisy gradient-descent inference.

  11. Search-Guided Training [figure: samples y0, y1, y2]: We sample a point from the energy function using noisy gradient-descent inference.

  12. Search-Guided Training [figure: samples y0 ... y3]: We sample a point from the energy function using noisy gradient-descent inference.

  13. Search-Guided Training [figure: samples y0 ... y4]: We sample a point from the energy function using noisy gradient-descent inference.

  14. Search-Guided Training [figure: samples y0 ... y5]: We sample a point from the energy function using noisy gradient-descent inference.
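A hedged sketch of the noisy gradient-descent sampler these slides describe: ordinary gradient-descent inference on the energy, with Gaussian noise added to each step so that repeated runs produce different points y0, y1, ... The noise scale and step size are assumptions, not the paper's exact values.

```python
import torch

def noisy_gradient_descent_sample(energy_net, x, logits, steps=10, lr=0.1, noise_std=1.0):
    # Start from the given logits and take noisy descent steps on E(x, softmax(logits)).
    logits = logits.clone().detach().requires_grad_(True)
    for _ in range(steps):
        y = torch.softmax(logits, dim=-1)
        energy = energy_net(x, y).sum()
        (grad,) = torch.autograd.grad(energy, logits)
        step = grad + noise_std * torch.randn_like(grad)   # perturbed gradient
        logits = (logits - lr * step).detach().requires_grad_(True)
    return torch.softmax(logits, dim=-1).detach()          # one sample in the simplex
```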

  15. Search-Guided Training [figure: samples y0 ... y5]: Then we project the sample onto the domain of the reward function (the sample is a point in the simplex, but the domain of the reward function is often discrete, i.e., the vertices of the simplex).
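This projection can be as simple as rounding each output variable's simplex point to its nearest vertex, i.e., taking an argmax per variable; a minimal sketch:

```python
import torch

def project_to_vertices(y_relaxed):
    # y_relaxed: (..., num_labels) points in the simplex; return one-hot vertices.
    return torch.nn.functional.one_hot(
        y_relaxed.argmax(dim=-1), num_classes=y_relaxed.shape[-1]
    ).float()
```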

  16. Search-Guided Training [figure: samples y0 ... y5]: Then the search procedure uses the sample as input and returns an output structure by searching the reward function.
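The search operator itself can be very simple. The sketch below is greedy hill-climbing on the reward, starting from the projected sample and changing one output variable at a time; it is an illustrative stand-in, not the specific search procedure used in the paper.

```python
def local_search(reward, x, y_init, label_set, max_steps=20):
    # Greedy hill-climbing on R(x, y): repeatedly apply the single-variable
    # change that improves the reward until no change helps.
    y, best = list(y_init), reward(x, y_init)
    for _ in range(max_steps):
        improved = False
        for i in range(len(y)):
            for label in label_set:
                candidate = y[:i] + [label] + y[i + 1:]
                score = reward(x, candidate)
                if score > best:
                    y, best, improved = candidate, score, True
        if not improved:
            break
    return y
```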

  17. Search-Guided Training [figure: samples y0 ... y5]: We expect the two points (the projected sample and the search output) to have the same ranking under the reward function and under the negative of the energy function.

  18. Search-Guided Training [figure: samples y0 ... y5; ranking violation]: We expect the two points to have the same ranking under the reward function and under the negative of the energy function.

  19. Search-Guided Training [figure: samples y0 ... y5]: When we find a pair of points that violates the ranking constraint, we update the energy function to reduce the violation.
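A minimal sketch of this update as a margin-based ranking loss on the two energies; the fixed margin is an assumption (the paper's rank-based objective may instead scale the margin with the reward difference):

```python
import torch

def ranking_update(energy_net, optimizer, x, y_better, y_worse, margin=1.0):
    # y_better has higher reward than y_worse, so its energy should be lower
    # by at least `margin`; otherwise take a gradient step on the violation.
    violation = torch.relu(margin + energy_net(x, y_better) - energy_net(x, y_worse)).mean()
    if violation.item() > 0:
        optimizer.zero_grad()
        violation.backward()
        optimizer.step()
    return violation.item()
```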

  20. Task Loss as Reward Function for Multi-Label Classification • The simplest form of indirect supervision is to use the task loss as the reward function, e.g., R(x, y) = -Δ(y, y*) for a task loss Δ and ground-truth output y*.
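Concretely, the reward can be the task metric between a candidate label set and the gold label set. The sketch below uses F1 as one common choice; any task loss or metric would work in its place.

```python
def task_metric_reward(y_pred, y_gold):
    # Reward = F1 between predicted and gold label sets (higher is better).
    pred, gold = set(y_pred), set(y_gold)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```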

  21. Domain Knowledge as Reward Function for Citation Field Extraction

  22. Domain Knowledge as Reward Function for Citation Field Extraction

  23. Domain Knowledge as Reward Function for Citation Field Extraction

  24. Domain Knowledge as Reward Function for Citation Field Extraction
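These slides (whose figures are not reproduced here) illustrate rules over citation strings. A hypothetical rule-based reward in that spirit might look like the sketch below; the concrete rules and field names are illustrative, not the ones used in the paper.

```python
import re

def citation_reward(tokens, tags):
    # Score an (input tokens, output tags) pair by counting satisfied rules.
    score = 0.0
    for tok, tag in zip(tokens, tags):
        # Rule: a four-digit number in a plausible range is probably the year.
        if re.fullmatch(r"(19|20)\d{2}", tok):
            score += 1.0 if tag == "date" else -1.0
        # Rule: markers such as 'vol.' and 'pp.' belong to volume/pages fields.
        if tok.lower() in {"vol.", "pp."}:
            score += 1.0 if tag in {"volume", "pages"} else -1.0
    return score
```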

  25. Energy Model [figure: the citation energy network. Token embeddings (e.g., "Wei", "Li", "Deep", "Energy", "Learning", "for", ...) are concatenated with the tag distribution (author, title, ...), passed through a convolutional layer with multiple filters and different window sizes, max pooling with concatenation, and a multi-layer perceptron that outputs the energy.]
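A hedged PyTorch sketch of the architecture this figure describes; the hyperparameters (embedding size, filter count, window sizes, hidden size) are assumptions rather than the paper's values.

```python
import torch
import torch.nn as nn

class CitationEnergyNet(nn.Module):
    # Token embeddings + tag distribution -> multi-window convolutions ->
    # max pooling -> MLP -> scalar energy per example.
    def __init__(self, vocab_size, num_tags, emb_dim=50, num_filters=50, windows=(2, 3, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim + num_tags, num_filters, kernel_size=w, padding=w // 2)
             for w in windows]
        )
        self.mlp = nn.Sequential(
            nn.Linear(num_filters * len(windows), 100), nn.ReLU(), nn.Linear(100, 1)
        )

    def forward(self, token_ids, tag_dist):
        # token_ids: (batch, seq_len); tag_dist: (batch, seq_len, num_tags)
        h = torch.cat([self.embed(token_ids), tag_dist], dim=-1)   # (B, T, emb + tags)
        h = h.transpose(1, 2)                                      # channels-first for Conv1d
        pooled = [conv(h).relu().max(dim=-1).values for conv in self.convs]
        return self.mlp(torch.cat(pooled, dim=-1)).squeeze(-1)     # (B,) energies
```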

  26. Performance on Citation Field Extraction

  27. Semi-Supervised Setting • Alternate between using the output of search and the ground-truth labels for training.
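A small sketch of that alternation, with the three components passed in as callables; they correspond to the sampling, search, and ranking-update sketches earlier in this deck and are assumptions rather than the paper's exact procedures.

```python
def training_step(x, y_gold, sample_fn, search_fn, update_fn):
    # sample_fn(x): draw a (projected) point from the current energy function.
    # search_fn(x, y): improve y by searching the reward function.
    # update_fn(x, y_better, y_worse): apply the ranking-violation update.
    y_sample = sample_fn(x)
    if y_gold is not None:
        y_better = y_gold                    # labeled example: rank against the ground truth
    else:
        y_better = search_fn(x, y_sample)    # unlabeled example: rank against the search output
    return update_fn(x, y_better, y_sample)
```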

  28. Shape Parser [figure: an input image I is parsed into a program over primitives, e.g., (c(32,32,28) - c(32,32,24)) + t(32,32,20)]

  29. Shape Parser [figure: parsing versus predicting the program (c(32,32,28) - c(32,32,24)) + t(32,32,20) for the input image I]

  30. Shape Parser [figure: the predicted program is executed by a graphics engine to produce an output image O, which can be compared with the input image I]

  31. Shape Parser [figure: the predicted program is executed by a graphics engine to produce an output image O, which can be compared with the input image I]
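The graphics engine gives a natural reward for the shape parser: execute the predicted program, render it, and score the rendered image against the input image. The sketch below uses intersection-over-union of filled pixels; `render` stands in for the (non-differentiable) graphics engine and is an assumption, as is treating invalid programs as zero-reward.

```python
import numpy as np

def render_and_compare_reward(render, program, target_image):
    # Execute the program with the graphics engine and compare the result
    # with the target image using IoU over filled pixels.
    try:
        output_image = render(program)        # binary array, same shape as target_image
    except Exception:                         # e.g., a syntactically invalid program
        return 0.0
    intersection = np.logical_and(output_image, target_image).sum()
    union = np.logical_or(output_image, target_image).sum()
    return float(intersection) / float(union) if union > 0 else 0.0
```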

  32. Shape Parser Energy Model [figure: the input image is encoded by a CNN (convolutional layers), combined with the distribution over program tokens (e.g., circle(16,16,12), triangle(32,48,16), +, -), and passed through a multi-layer perceptron that outputs the energy]

  33. Search Budget vs. Constraints

  34. Performance on Shape Parser

  35. Conclusion and Future Directions • If a reward function exists that evaluates every structured output into a scalar value, we can use unlabeled data to train structured prediction energy networks. • Domain knowledge or non-differentiable pipelines can be used to define the reward functions. • The main ingredient for learning from the reward function is the search operator. • Here we only use simple search operators, but more complex search functions derived from domain knowledge can be used for complicated problems.
