Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks
Pedram Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum
Structured Prediction
• We are interested in learning a function F: X → Y
• X: input variables
• Y: output variables
• We can define F(x) = argmin_y E(x, y)
• For a Gibbs distribution: P(y | x) = exp(-E(x, y)) / Z(x)
Structured Prediction Energy Networks (SPENs)
• If E(x, y) is parameterized by a differentiable model, such as a deep neural network, we can find a local minimum of E using gradient descent.
• Energy networks express the correlation among input and output variables.
• Traditionally, graphical models are used to represent the correlation among output variables.
• Inference is intractable for most expressive graphical models.
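For concreteness, here is a minimal sketch of gradient-descent inference over a relaxed, simplex-valued output; the softmax parameterization, step size, and number of steps are illustrative assumptions rather than the authors' exact settings.

```python
import torch

def gradient_descent_inference(energy_fn, x, num_labels, num_classes,
                               steps=30, lr=0.1, noise_std=0.0):
    """Relaxed inference: descend E(x, y) over soft outputs y on the simplex.

    `energy_fn(x, y)` is assumed to accept a soft output y of shape
    (num_labels, num_classes) whose rows sum to one, and return a scalar
    energy. Setting noise_std > 0 gives a noisy variant that can be used
    to draw samples during training.
    """
    logits = torch.zeros(num_labels, num_classes, requires_grad=True)
    optimizer = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        y = torch.softmax(logits, dim=-1)        # rows stay on the probability simplex
        energy = energy_fn(x, y)
        energy.backward()
        if noise_std > 0:                        # optional noise for sampling
            logits.grad.add_(noise_std * torch.randn_like(logits))
        optimizer.step()
    return torch.softmax(logits, dim=-1).detach()
```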
Energy Models
[Figures from Altinel (2018) and Belanger (2016)]
Training SPENs
• Structural SVM (Belanger and McCallum, 2016)
• End-to-End (Belanger et al., 2017)
• Value-based training (Gygli et al., 2017)
• Inference Network (Tu and Gimpel, 2018)
• Rank-Based Training (Rooshenas et al., 2018)
Indirect Supervision
• Data annotation is expensive, especially for structured outputs.
• Domain knowledge can serve as a source of supervision.
• It can be written as a reward function R(x, y) that evaluates a pair of input and output configurations into a scalar value.
• For a given x, we are looking for the best y that maximizes the reward: y* = argmax_y R(x, y)
Search-Guided Training
We have a reward function that provides indirect supervision. We want to learn a smooth version of the reward function such that we can use gradient-descent inference at test time.
Search-Guided Training
[Figure: a noisy gradient-descent trajectory y0 → y1 → y2 → y3 → y4 → y5 over the energy surface]
We sample a point from the energy function using noisy gradient-descent inference.
Search-Guided Training
Then we project the sample onto the domain of the reward function (the sample is a point in the simplex, but the domain of the reward function is often discrete, i.e., the vertices of the simplex).
Search-Guided Training
Then the search procedure takes the projected sample as input and returns an output structure by searching the reward function.
Search-Guided Training
We expect the two points (the projected sample and the search output) to have the same ranking under the reward function and under the negative of the energy function; otherwise, we have a ranking violation.
Search-Guided Training
When we find a pair of points that violates the ranking constraint, we update the energy function to reduce the violation.
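Putting the preceding slides together, the following is a hedged sketch of one search-guided update. Here `infer_fn` and `reward_search` are assumed helpers (noisy gradient-descent inference and the reward-maximizing search operator), and the margin term is a simplification of the paper's rank-based objective, not its exact form.

```python
import torch

def search_guided_step(energy_net, optimizer, x, reward_fn, reward_search,
                       infer_fn, margin_scale=1.0):
    """One training step: sample, project, search, and fix ranking violations."""
    # 1) Sample a soft output from the current energy with noisy inference.
    y_soft = infer_fn(energy_net, x)
    # 2) Project onto the domain of the reward function (simplex vertices).
    y_hard = torch.nn.functional.one_hot(y_soft.argmax(-1), y_soft.shape[-1]).float()
    # 3) Let the search procedure improve the projected sample w.r.t. the reward.
    y_search = reward_search(x, y_hard, reward_fn)

    reward_gap = reward_fn(x, y_search) - reward_fn(x, y_hard)
    if reward_gap <= 0:
        return 0.0                               # no better point found: nothing to rank
    # 4) Ranking constraint: higher reward should mean lower energy.
    e_search = energy_net(x, y_search)
    e_sample = energy_net(x, y_hard)
    violation = torch.relu(margin_scale * reward_gap - (e_sample - e_search))
    optimizer.zero_grad()
    violation.backward()
    optimizer.step()
    return violation.item()
```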
Task-Loss as Reward Function for Multi-Label Classification
• The simplest form of indirect supervision is to use the task loss as the reward function (an example sketch follows).
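As an illustration (the exact task loss may differ), the F1 score between a candidate label set and the ground truth can play the role of the reward:

```python
def f1_reward(y_pred, y_true):
    """Reward a binary multi-label prediction by its F1 against the ground truth.

    y_pred, y_true: iterables of 0/1 indicators, one per label.
    """
    tp = sum(p and t for p, t in zip(y_pred, y_true))
    pred_pos = sum(y_pred)
    true_pos = sum(y_true)
    if pred_pos == 0 or true_pos == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / true_pos
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```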
Domain Knowledge as Reward Function for Citation Field Extraction
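The rules themselves are not visible in the extracted slides; the sketch below is a hypothetical illustration of how domain knowledge could be encoded as a reward over a candidate tag sequence, not the paper's actual rule set.

```python
import re

def citation_reward(tokens, tags):
    """Score a candidate tag sequence for a citation string with simple rules.

    tokens: list of strings; tags: list of field names, one per token.
    Each satisfied rule adds to the reward; the rules are illustrative only.
    """
    reward = 0.0
    for token, tag in zip(tokens, tags):
        # A four-digit number is very likely the publication year.
        if re.fullmatch(r"(19|20)\d\d", token) and tag == "date":
            reward += 1.0
        # Numeric ranges like 123-130 usually denote pages.
        if re.fullmatch(r"\d+-\d+", token) and tag == "pages":
            reward += 1.0
    # Author names usually appear before the title in a citation.
    if "author" in tags and "title" in tags:
        if tags.index("author") < tags.index("title"):
            reward += 1.0
    return reward
```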
Energy Model
[Figure: the energy network for citation field extraction — each token embedding is concatenated with its tag distribution and fed to a convolutional layer with multiple filters and different window sizes, followed by max pooling and a multi-layer perceptron that outputs the energy. The example shows tokens such as "Wei", "Li", "Deep", "Learning", "for" with their probabilities under the author and title tags.]
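A minimal PyTorch sketch of such an energy network is given below; the embedding size, filter sizes, and hidden dimensions are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CitationEnergyNet(nn.Module):
    """Hypothetical sketch of the CNN energy model described on the slide."""

    def __init__(self, vocab_size, num_tags, emb_dim=50,
                 filter_sizes=(2, 3, 4), num_filters=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        in_channels = emb_dim + num_tags  # token embedding concatenated with tag distribution
        self.convs = nn.ModuleList(
            nn.Conv1d(in_channels, num_filters, kernel_size=k, padding=k // 2)
            for k in filter_sizes
        )
        self.mlp = nn.Sequential(
            nn.Linear(num_filters * len(filter_sizes), 100),
            nn.ReLU(),
            nn.Linear(100, 1),
        )

    def forward(self, token_ids, tag_dist):
        # token_ids: (batch, seq_len); tag_dist: (batch, seq_len, num_tags), rows on the simplex
        x = torch.cat([self.embed(token_ids), tag_dist], dim=-1)   # (batch, seq_len, emb+tags)
        x = x.transpose(1, 2)                                      # (batch, channels, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.mlp(torch.cat(pooled, dim=1)).squeeze(-1)      # scalar energy per example
```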
Performance on Citation Field Extraction
Semi-Supervised Setting
• Alternately use the output of the search procedure (on unlabeled data) and the ground-truth labels (on labeled data) for training.
Shape Parser
[Figure: an input image I is parsed into a postfix shape program, e.g. c(32,32,28) c(32,32,24) - t(32,32,20) +; the predicted program is executed by a graphics engine to render an output image O, which can be compared with the input.]
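A hedged sketch of the corresponding reward, with a toy rasterizer standing in for the real graphics engine and IoU as the comparison, could look like this (the primitive shapes and rendering details are assumptions):

```python
import numpy as np

def render(program, size=64):
    """Execute a postfix shape program such as
    ['c(32,32,28)', 'c(32,32,24)', '-', 't(32,32,20)', '+']
    into a binary image. Only circles and a square stand-in for other
    primitives are supported in this toy engine."""
    yy, xx = np.mgrid[:size, :size]
    stack = []
    for tok in program:
        if tok == '+':
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)                      # union
        elif tok == '-':
            b, a = stack.pop(), stack.pop()
            stack.append(a & ~b)                     # difference
        else:
            kind = tok[0]
            cx, cy, r = map(int, tok[2:-1].split(','))
            if kind == 'c':
                stack.append((xx - cx) ** 2 + (yy - cy) ** 2 <= (r // 2) ** 2)
            else:                                    # toy stand-in for other primitives
                stack.append((abs(xx - cx) <= r // 2) & (abs(yy - cy) <= r // 2))
    return stack.pop()

def shape_reward(program, target_image):
    """Reward = IoU between the rendered program and the input image."""
    pred = render(program, size=target_image.shape[0])
    inter = np.logical_and(pred, target_image).sum()
    union = np.logical_or(pred, target_image).sum()
    return inter / union if union > 0 else 0.0
```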
Shape Parser Energy Model
[Figure: the energy network for the shape parser — a CNN encodes the input image, which is combined with the distribution over program tokens (e.g., circle(16,16,12), triangle(32,48,16), circle(16,24,12), +) and passed through convolutional layers and a multi-layer perceptron to produce the energy.]
Search Budget vs. Constraints
Performance on Shape Parser
Conclusion and Future Directions
• If a reward function exists that evaluates every structured output into a scalar value, we can use unlabeled data for training structured prediction energy networks.
• Domain knowledge or non-differentiable pipelines can be used to define the reward functions.
• The main ingredient for learning from the reward function is the search operator.
• Here we only use simple search operators, but more complex search functions derived from domain knowledge can be used for complicated problems.