Backpropagating through Structured Argmax using a SPIGOT Hao Peng, Sam Thomson, Noah A. Smith @ACL July 17, 2018
Overview
Pipeline: "Shareholders took their money" → Parser → arg max → parse of "Shareholders took their money" → Downstream task → Loss $L$
Ways a downstream task can consume the parse: head token (Yang and Mitchell, 2017); Tree-RNN (Tai et al., 2015); Graph CNN (Kipf and Welling, 2017); …
Question: can the parser's arg max be a layer in the computation graph? The obstacle is that arg max is non-differentiable.
Overview
Pipeline: "Shareholders took their money" → Intermediate parser $\theta$ → arg max → parse → Downstream task → Loss $L$. Open question: how does $\nabla_\theta L$ get through the arg max?
Aim
• Structured prediction as a layer.
Motivation
• Structures help. Ji and Smith, 2017; Oepen et al., 2017
• Linguistic structures may not be universally optimal. Williams, 2017
Challenges
• argmax is non-differentiable.
Method
• SPIGOT (Structured Prediction Intermediate Gradients Optimization Technique): a proxy for the gradient that cannot be backpropagated through the arg max.
Outline ❖ Background: structured prediction as linear programs ❖ Method: SPIGOT algorithm ❖ Experiments
Structured Prediction Reviewed
Input: "Shareholders took their money"
Score: the score of a tree is the sum of the scores of its arcs, $S_\theta(\text{tree}) = \sum_{(\text{head}, \text{mod}) \in \text{arcs}} s_\theta(\text{head}, \text{mod})$.
Arc scores as a vector: $s_\theta = [\, s_\theta(\text{their}, \text{took}),\ s_\theta(\text{took}, \text{their}),\ s_\theta(\text{took}, \text{money}),\ \ldots,\ s_\theta(\text{their}, \text{money}) \,]^\top$
Parse as a binary indicator vector over the same arcs, with each entry still to be decided: $z = [\, 1?,\ 0?,\ 1?,\ 0?,\ \ldots \,]^\top$
Output: $\hat{z} = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree.
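The arg max over trees can be sketched by brute force on the four-word example. This is only an illustration: the arc scores are invented, `ROOT`, `is_tree`, and `argmax_tree` are hypothetical names, exhaustive enumeration stands in for a real parsing algorithm, and the sketch assumes a tree may attach more than one word to the artificial root.

```python
from itertools import product

ROOT = -1  # hypothetical index for an artificial root node


def is_tree(heads):
    """heads[m] is the head of word m (or ROOT).

    Returns True if every word reaches ROOT without revisiting a node.
    """
    for start in range(len(heads)):
        seen = set()
        node = start
        while node != ROOT:
            if node in seen:  # cycle (this also catches self-loops)
                return False
            seen.add(node)
            node = heads[node]
    return True


def argmax_tree(arc_scores, n_words):
    """Enumerate all head assignments and keep the best-scoring tree.

    arc_scores maps (head, modifier) to a score; missing arcs score 0.0.
    """
    best_heads, best_score = None, float("-inf")
    candidates = [ROOT] + list(range(n_words))
    for heads in product(candidates, repeat=n_words):
        if not is_tree(heads):
            continue
        # Tree score = sum of its arc scores, i.e. z^T s for the indicator z.
        total = sum(arc_scores.get((h, m), 0.0) for m, h in enumerate(heads))
        if total > best_score:
            best_heads, best_score = heads, total
    return best_heads, best_score


words = ["Shareholders", "took", "their", "money"]
# Invented scores favouring ROOT->took, took->Shareholders, took->money, money->their.
toy_scores = {(ROOT, 1): 2.0, (1, 0): 1.5, (1, 3): 1.2, (3, 2): 1.0}
heads, total = argmax_tree(toy_scores, len(words))
for m, h in enumerate(heads):
    print(words[m], "<-", "ROOT" if h == ROOT else words[h])
```

Enumeration grows exponentially with sentence length, so it is useful only for checking intuitions on tiny inputs.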
Linear Programming Formulation
$\hat{z} = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree, where $s_\theta = [\, s_\theta(\text{their}, \text{money}),\ s_\theta(\text{took}, \text{their}),\ s_\theta(\text{took}, \text{money}),\ \ldots,\ s_\theta(\text{their}, \text{took}) \,]^\top$
The tree constraint is expressed as linear constraints $Az \le b$ with $z_i \in \{0, 1\}$.
Relaxation: $z_i \in \{0, 1\}$ becomes $z_i \in [0, 1]$.
Roth and Yih, 2004; Martins et al., 2009
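A small numerical check may help build intuition for the relaxation. The sketch below assumes the simplest possible constraint set, a one-of-K choice (entries non-negative and summing to one), rather than the tree constraints; in that special case the linear objective $z^\top s$ is maximized at a 0/1 vertex, so relaxing $z_i$ to $[0, 1]$ does not change the optimum. All names and scores are hypothetical, and the check says nothing about general tree constraints.

```python
import random


def score(z, s):
    """Linear objective z^T s."""
    return sum(zi * si for zi, si in zip(z, s))


def one_hot_argmax(s):
    """Best integral solution for a one-of-K choice: a 0/1 vertex."""
    k = max(range(len(s)), key=lambda i: s[i])
    return [1.0 if i == k else 0.0 for i in range(len(s))]


def random_simplex_point(n, rng):
    """A random fractional point with non-negative entries that sum to one."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [wi / total for wi in w]


s = [0.2, 1.3, -0.4]  # invented scores
z_hat = one_hot_argmax(s)
rng = random.Random(0)
fractional = [random_simplex_point(len(s), rng) for _ in range(1000)]
best_fractional = max(score(z, s) for z in fractional)
print(score(z_hat, s), best_fractional)
```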
Backprop
Forward: arc scores $s_\theta$ → $\hat{z} = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree → Downstream task → Loss $L$
Backward: backprop through the downstream task yields $\nabla_{\hat{z}} L$. Backprop through the parser to obtain $\nabla_\theta L$ needs $\nabla_s L$, which the arg max does not supply, so a proxy for $\nabla_s L$ is used in its place.
Backprop
We have: $\nabla_{\hat{z}} L$. We need: $\nabla_s L$.
Chain rule (Leibniz, 1676): "$\nabla_s L = J\, \nabla_{\hat{z}} L$"
Because $\hat{z} = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree, the Jacobian $J$ is not defined.
Straight-through Estimator (STE) (Hinton, 2012; Bengio et al., 2013): $\nabla_s L \triangleq \nabla_{\hat{z}} L$
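The contrast between the chain rule and the straight-through estimator can be seen on a toy one-of-K arg max. The sketch below uses hypothetical names and invented numbers; it estimates the Jacobian by finite differences, which comes out as all zeros wherever the arg max is stable (and is undefined at ties), and then shows the STE simply copying the downstream gradient.

```python
def argmax_one_hot(s):
    """Hard decision: one-hot vector at the highest score."""
    k = max(range(len(s)), key=lambda i: s[i])
    return [1.0 if i == k else 0.0 for i in range(len(s))]


def finite_difference_jacobian(s, eps=1e-4):
    """Estimate d z_hat[i] / d s[j] by bumping one score at a time."""
    base = argmax_one_hot(s)
    n = len(s)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        bumped = list(s)
        bumped[j] += eps
        out = argmax_one_hot(bumped)
        for i in range(n):
            jac[i][j] = (out[i] - base[i]) / eps
    return jac


def ste_backward(grad_z):
    """Straight-through estimator: treat the arg max as the identity."""
    return list(grad_z)


s = [2.0, 0.5, 1.0]          # invented scores, no near-ties
grad_z = [0.3, -0.5, -0.4]   # invented downstream gradient
jac = finite_difference_jacobian(s)
grad_s_ste = ste_backward(grad_z)
print(jac)         # no learning signal from the true derivative
print(grad_s_ste)  # the STE passes grad_z through unchanged
```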
Some Geometry…
Feasible set: the polytope $Az \le b$. The parse of "Shareholders took their money" is the vertex $\hat{z} = [1, 0, 1, \cdots, 0]^\top$.
Example downstream signal: $-\nabla_{\hat{z}} L = [-0.3, 0.5, 0.4, \ldots, 0.2]$
Straight-through Estimator (STE): $\nabla_s L \triangleq \nabla_{\hat{z}} L$. The corresponding step from $\hat{z}$ lands at $p = \hat{z} - \nabla_{\hat{z}} L$, which need not lie in the polytope.
SPIGOT:
$p = \hat{z} - \nabla_{\hat{z}} L$
$q = \mathrm{proj}(p)$, the projection of $p$ onto the polytope
$\nabla_s L \triangleq \hat{z} - q$
Figure: two configurations, each showing $\hat{z}$, $\hat{z} - \nabla_{\hat{z}} L$, and the resulting $-\nabla_s L$.
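The three SPIGOT equations, $p = \hat{z} - \nabla_{\hat{z}} L$, $q = \mathrm{proj}(p)$, and $\nabla_s L \triangleq \hat{z} - q$, can be sketched for the simplest polytope, the probability simplex (a one-of-K decision). That choice is an assumption made here so that the projection has a short sort-based implementation; projecting onto a general polytope $Az \le b$ needs a dedicated solver. The function names and the step size `eta` are hypothetical additions; `eta=1.0` reproduces the formulas as written on the slide.

```python
def project_to_simplex(p):
    """Euclidean projection of p onto {q : q_i >= 0, sum(q) = 1} (sort-based)."""
    u = sorted(p, reverse=True)
    cumulative = 0.0
    shift = 0.0
    for k, uk in enumerate(u, start=1):
        cumulative += uk
        candidate = (cumulative - 1.0) / k
        if uk - candidate > 0:
            shift = candidate  # keep the shift from the last index that qualifies
    return [max(pi - shift, 0.0) for pi in p]


def spigot_grad(z_hat, grad_z, eta=1.0):
    """Proxy gradient with respect to the scores: z_hat - proj(z_hat - eta * grad_z)."""
    p = [z - eta * g for z, g in zip(z_hat, grad_z)]
    q = project_to_simplex(p)
    return [z - qi for z, qi in zip(z_hat, q)]


z_hat = [1.0, 0.0, 0.0]        # current hard decision (a vertex of the simplex)
grad_z = [0.3, -0.5, -0.4]     # invented downstream gradient
grad_s = spigot_grad(z_hat, grad_z)
print(grad_s)
```

Unlike the STE, the result is a difference of two points in the feasible set, so in this simplex case its entries sum to zero.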
Algorithm
• Forward, parser $\theta$: input "Shareholders took their money" → arc scores $s_\theta$ → $\hat{z} = \arg\max_z z^\top s_\theta$ s.t. $z$ forms a tree.
• Forward, downstream task $\phi$: compute the loss $L$ from $\hat{z}$.
• Backprop through the downstream task: $\nabla_{\hat{z}} L$.
• Project onto the polytope: $p = \hat{z} - \nabla_{\hat{z}} L$, $q = \mathrm{proj}(p)$, $\nabla_s L \triangleq \hat{z} - q$.
• Backprop $\nabla_s L$ through the parser: $\nabla_\theta L$.
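The full loop (forward arg max, downstream gradient, projection, parameter update) can be sketched on a toy problem. Everything here is an illustrative assumption rather than the authors' setup: the "parser" is just a score vector with $s_\theta = \theta$, the decision is one-of-K so the polytope is the simplex, the downstream loss is a fixed linear cost $L = c^\top \hat{z}$, and the names `spigot_step`, `cost`, and `lr` are hypothetical.

```python
def project_to_simplex(p):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = sorted(p, reverse=True)
    cumulative, shift = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumulative += uk
        if uk - (cumulative - 1.0) / k > 0:
            shift = (cumulative - 1.0) / k
    return [max(pi - shift, 0.0) for pi in p]


def argmax_one_hot(s):
    k = max(range(len(s)), key=lambda i: s[i])
    return [1.0 if i == k else 0.0 for i in range(len(s))]


def spigot_step(theta, cost, lr=0.5):
    """One training step on the toy problem."""
    z_hat = argmax_one_hot(theta)                 # forward: hard decision
    grad_z = list(cost)                           # backprop through L = cost . z_hat
    p = [z - g for z, g in zip(z_hat, grad_z)]    # step in the space of z
    q = project_to_simplex(p)                     # pull back into the polytope
    grad_s = [z - qi for z, qi in zip(z_hat, q)]  # proxy gradient w.r.t. scores
    # Here s_theta = theta, so the gradient w.r.t. theta equals grad_s.
    return [t - lr * g for t, g in zip(theta, grad_s)]


theta = [1.0, 0.5, 0.0]   # initial scores prefer option 0
cost = [0.9, 0.1, 0.5]    # invented downstream costs; option 1 is cheapest
initial_decision = argmax_one_hot(theta).index(1.0)
for _ in range(5):
    theta = spigot_step(theta, cost)
decision = argmax_one_hot(theta).index(1.0)
print(initial_decision, decision)
```

In this toy run the scores shift until the hard decision moves to the lowest-cost option, after which the proxy gradient is approximately zero.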
Connections to Related Work
SPIGOT vs. STE: figure showing, for each method, $\hat{z}$, $\hat{z} - \nabla_{\hat{z}} L$, and the resulting $-\nabla_s L$.
SPIGOT vs. Structured Attention (Kim et al., 2017): Structured Attention uses $\hat{z} = \mathrm{softmax}(\ldots)$; SPIGOT uses $\hat{z} = \arg\max(\ldots)$.
Comparison table: columns Pipeline, STE, Structured Att., SPIGOT; labels "Hard decision on $\hat{z}$", "Backprop", "Marginal", "Projection".
Applications
Joint learning (Swayamdipta et al., 2016)
• Parser $\theta$: trained on its own training data for "Shareholders took their money" with loss $L_1$, receiving $\nabla_\theta L_1$.
• Downstream task $\phi$: consumes the arg max parse and has loss $L_2$, which yields $\nabla_\phi L_2$ for the task and $\nabla_\theta L_2$ for the parser.