Learned Prioritization for Trading Off Speed and Accuracy

Jiarong Jiang (University of Maryland, College Park), Adam Teichert (Johns Hopkins University), Hal Daumé III (University of Maryland, College Park), Jason Eisner (Johns Hopkins University)

ICML workshop on Inferning: Interactions between Inference and Learning
Introduction

Fast and accurate structured prediction.

Manual exploration of the speed/accuracy tradeoff:
- Prioritization heuristics: A* [Klein and Manning, 2003], Hierarchical A* [Pauls and Klein, 2010]
- Pruning heuristics: Coarse-to-fine pruning [Charniak et al., 2006; Petrov and Klein, 2007], Classifier-based pruning [Roark and Hollingshead, 2008]

Goal: learn a heuristic for your input distribution, grammar, and speed/accuracy needs.

Objective measure: quality = accuracy − λ × time
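To make the objective concrete, the small sketch below scores a parser configuration under the quality measure; the value of λ and the function name are illustrative assumptions, not from the talk (the recall/pops figures are the baselines reported later).

```python
def quality(accuracy, time, lam):
    """Objective from the talk: quality = accuracy - lambda * time.
    `lam` says how many points of accuracy one unit of time is worth."""
    return accuracy - lam * time

# Hypothetical lam = 5.0 per relative pop count.
print(quality(93.3, 1.00, lam=5.0))   # uniform cost search baseline
print(quality(92.0, 0.33, lam=5.0))   # pruned uniform cost search baseline
```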
Priority-based Inference

Agenda-based Parsing

[Worked example, shown as an animated chart in the slides: parsing "0 Time 1 flies 2 like 3 an 4 arrow 5".]

GRAMMAR (rule weights):
1 S -> NP VP    6 S -> Vst NP    2 S -> S PP
1 VP -> VP PP   2 VP -> V NP     1 NP -> DET N
2 NP -> NP PP   3 NP -> NP NP    0 PP -> P NP

CHART: partial parses built so far, with their Viterbi weights (e.g. DET 1, N 8, V 5, P 2, Vst 3, NP 3, NP 4, VP 4, S 8, and later NP 10 over "an arrow").

AGENDA: pending partial parses, each with a priority (e.g. the NP over "an arrow" at priority 10; after it is popped, PP and VP items over "like an arrow" appear at priorities 10 and 12).

At each step, the highest-priority item is popped from the agenda, added to the chart, and combined with chart items via grammar rules to push new partial parses onto the agenda.
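A minimal sketch of the agenda mechanism described above, assuming a binarized weighted grammar; the data-structure layout and function names are assumptions for illustration, not the authors' implementation.

```python
import heapq

def agenda_parse(words, rules, lexicon, priority):
    """Agenda-based weighted parsing sketch.
    `rules[(B, C)]` maps a pair of child labels to a list of (parent, rule_weight);
    `lexicon[word]` lists (label, weight) preterminal entries;
    `priority(item, weight)` decides pop order (uniform-cost search uses the weight itself)."""
    chart = {}     # (label, i, j) -> weight of the item once it has been popped
    agenda = []    # max-priority queue, stored as a min-heap of (-priority, item, weight)

    def push(label, i, j, weight):
        item = (label, i, j)
        if chart.get(item, float("-inf")) < weight:
            heapq.heappush(agenda, (-priority(item, weight), item, weight))

    for i, word in enumerate(words):               # seed the agenda with preterminals
        for label, weight in lexicon[word]:
            push(label, i, i + 1, weight)

    while agenda:
        _, (B, i, j), weight = heapq.heappop(agenda)
        if chart.get((B, i, j), float("-inf")) >= weight:
            continue                               # stale entry: a better copy was already popped
        chart[(B, i, j)] = weight                  # pop the item into the chart
        for (C, k, m), w2 in list(chart.items()):  # combine with adjacent chart items
            if k == j:                             # B spans [i, j), C spans [j, m)
                for A, rw in rules.get((B, C), []):
                    push(A, i, m, weight + w2 + rw)
            if m == i:                             # C spans [k, i), B spans [i, j)
                for A, rw in rules.get((C, B), []):
                    push(A, k, j, w2 + weight + rw)
    return chart
```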
Priority-based Inference

Speed/Accuracy for Agenda-based Parsing

All experiments are on Penn Treebank WSJ with sentence length ≤ 15.

Preliminary results setup:
- Berkeley latent-variable PCFG trained on sections 2-20
- Training set: 100 sentences from section 21
- Evaluated on the same 100 sentences

Baseline                         Recall    Relative # of pops
Exhaustive search                93.3      3.0x
Uniform cost search (UC)         93.3      1.0x
Pruned uniform cost search       92.0      0.33x
Priority-based Inference

Agenda-based Parsing as a Markov Decision Process

State space: current chart and agenda
Action: pop a partial parse from the agenda
Transition: given the chosen action, deterministically update the chart and push any newly derived partial parses onto the agenda
Policy: computes action priorities from extracted features
    π_θ(s) = argmax_a θ · φ(a, s)

(Delayed) Reward:
    reward = accuracy − λ × time
    accuracy = labeled span recall
    time = # of pops from the agenda

Learning Policy = Learning Prioritization Function
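A sketch of one episode of this MDP under the greedy linear policy; the state interface (method names such as `agenda_items`, `pop`, `is_done`, `labeled_span_recall`) is an assumption used only to make the loop concrete.

```python
import numpy as np

def run_episode(state, theta, phi, lam):
    """Run the parser to completion under pi_theta(s) = argmax_a theta . phi(a, s)
    and return the delayed reward: accuracy - lambda * (# of pops)."""
    num_pops = 0
    while not state.is_done():
        actions = state.agenda_items()                 # candidate pops
        scores = [theta @ phi(a, state) for a in actions]
        best = actions[int(np.argmax(scores))]
        state = state.pop(best)    # deterministic transition: add to chart, push new items
        num_pops += 1
    accuracy = state.labeled_span_recall()             # against the gold parse
    return accuracy - lam * num_pops
```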
Priority-based Inference

Decoding as a Markov Decision Process (MDP)

[Same chart/agenda example as before, but the priorities of the pending agenda items (the PP and VP over "like an arrow") are now shown as "10???" and "12???": the learned policy must decide what these priorities should be.]
Attempt 1: Policy Gradient with Boltzmann Exploration

Boltzmann Exploration

Transition at test time: deterministic
Transition at training time: exploration with a stochastic policy π_θ(a|s)

Boltzmann exploration:
    π_θ(a|s) = (1/Z(s)) exp( θ · φ(a, s) / temp )

As temperature → 0, exploration → exploitation.

A trajectory is τ = ⟨s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T⟩.

Expected future reward:
    R = E_{τ∼π_θ}[R(τ)] = E_{τ∼π_θ}[ Σ_{t=0}^{T} r_t ]
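A small sketch of sampling a pop under the Boltzmann policy above; the max-shift for numerical stability and the returned probability vector are implementation choices, not from the slides.

```python
import numpy as np

def boltzmann_sample(actions, state, theta, phi, temp):
    """Sample a ~ pi_theta(a|s) proportional to exp(theta . phi(a, s) / temp).
    As temp -> 0 this approaches the greedy argmax policy used at test time."""
    scores = np.array([theta @ phi(a, state) for a in actions]) / temp
    scores -= scores.max()            # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()              # the 1/Z(s) normalization
    idx = np.random.choice(len(actions), p=probs)
    return actions[idx], probs        # keep probs around for the gradient later
```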
Attempt 1: Policy Gradient with Boltzmann Exploration

Policy Gradient

Find parameters that maximize the expected reward with respect to the induced distribution over trajectories.

Policy gradient [Sutton et al., 2000]: the gradient of the objective is
    ∇_θ E_τ[R(τ)] = E_τ[ R(τ) Σ_{t=0}^{T} ∇_θ log π(a_t | s_t) ]
where
    ∇_θ log π_θ(a_t | s_t) = (1/temp) ( φ(a_t, s_t) − Σ_{a'∈A} π_θ(a' | s_t) φ(a', s_t) )
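A sketch of a REINFORCE-style update using the gradient above on trajectories collected with the Boltzmann policy; the trajectory record layout is an assumption made to keep the example self-contained.

```python
import numpy as np

def policy_gradient_update(trajectories, theta, phi, temp, step_size):
    """Each trajectory is (steps, total_reward), where every step records
    (state, actions, chosen_index, probs) as returned during sampling."""
    grad = np.zeros_like(theta)
    for steps, total_reward in trajectories:
        for state, actions, chosen, probs in steps:
            feats = np.stack([phi(a, state) for a in actions])
            # grad log pi = (phi(a_t, s_t) - E_{a' ~ pi}[phi(a', s_t)]) / temp
            grad_log_pi = (feats[chosen] - probs @ feats) / temp
            grad += total_reward * grad_log_pi
    return theta + step_size * grad / len(trajectories)
```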
Attempt 1: Policy Gradient with Boltzmann Exploration

Features

1. Width of partial parse
2. Viterbi inside score
3. Touches start of sentence?
4. Touches end of sentence?
5. Ratio of width to sentence length
6. log p(label | prev POS) and log p(label | next POS) (statistics extracted from labeled trees; word POS assumed to be most frequent)
7. Case pattern of first word in partial parse and of previous/next word
8. Punctuation pattern in partial parse (five most frequent)
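To make the feature templates concrete, here is an illustrative φ(a, s) covering the first five features in the list above; the attribute names on the partial-parse and state objects are assumptions.

```python
import numpy as np

def phi(action, state):
    """Illustrative feature vector for a partial parse `action` spanning
    words [action.start, action.end) of a sentence of length state.n."""
    width = action.end - action.start
    return np.array([
        width,                                    # 1. width of partial parse
        action.viterbi_inside_score,              # 2. Viterbi inside score
        1.0 if action.start == 0 else 0.0,        # 3. touches start of sentence?
        1.0 if action.end == state.n else 0.0,    # 4. touches end of sentence?
        width / state.n,                          # 5. ratio of width to sentence length
    ])
```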
Attempt 1: Policy Gradient with Boltzmann Exploration

Policy Gradient with Boltzmann Exploration

Preliminary results:

Method                                        Recall    Relative # of pops
Policy gradient w/ Boltzmann exploration      56.4      0.46x
Uniform cost search                           93.3      1.0x
Pruned uniform cost search                    92.0      0.33x

Main difficulty: which actions were “responsible” for a trajectory’s reward?