Stuff I did in the Spring while not Replying to Email (aka “advances in structured prediction”)
Hal Daumé III | University of Maryland | me@hal3.name | @haldaume3
Examples of structured prediction — joint prediction over the sentence “The monster ate a big sandwich”
Sequence labeling
x = the monster ate the sandwich → y = Dt Nn Vb Dt Nn
x = Yesterday I traveled to Lille → y = - PER - - LOC
(image credit: Richard Padgett)
Natural language parsing
INPUT: NLP algorithms use a kitchen sink of features
OUTPUT: a dependency tree with labeled arcs ([root], subject, object, n-mod, p-mod)
(Bipartite) matching (image credit: Ben Taskar; Liz Jurrus)
Machine translation
Segmentation (image credit: Daniel Muñoz)
Protein secondary structure prediction
Outline
➢ Background: learning to search
➢ Stuff I did in the Spring
   ➢ Imperative DSL/library for learning to search
   ➢ SOTA examples for tagging, parsing, relation extraction, etc.
   ➢ Learning to search under bandit feedback
   ➢ Hardness results for learning to search
   ➢ Active learning for accelerating learning to search
➢ Stuff I'm trying to do now
   ➢ Distant supervision
   ➢ Mashups with recurrent neural networks
(“Isn't this kinda narrow?”)
My experience, 6 months in industry
➢ Standard adage: academia = freedom, industry = time
➢ Number of responsibilities vs. number of bosses
➢ Aspects I didn't anticipate:
   ➢ Breadth (academia) versus depth (industry)
   ➢ Collaborating through students versus directly
   ➢ Security through tenure versus security through $
➢ At the end of the day: who are your colleagues, and what do you have to do to pay the piper?
Major caveat: this is comparing a top-ranked CS dept to a top industry lab, at a time when there's tons of money in this area (more in industry)
Joint prediction via learning to search
Part-of-speech tagging: NLP algorithms use a kitchen sink of features → NN NNS VBP DT NN NN IN NNS
Dependency parsing: a tree of arcs over *ROOT* NLP algorithms use a kitchen sink of features
Joint prediction via learning to search
Joint Prediction Haiku:
   A joint prediction
   Across a single input
   Loss measured jointly
(figure: the dependency tree over “NLP algorithms use a kitchen sink of features”)
Back to the original problem...
● How to optimize a discrete, joint loss?
● Input: x ∈ X
● Truth: y ∈ Y(x)
● Outputs: Y(x)
● Predicted: ŷ ∈ Y(x)
● Loss: loss(y, ŷ)
● Data: (x, y) ~ D
(Running example: x = “I can can a can”; the candidate outputs Y(x) are tag sequences such as Pro Md Vb Dt Nn, Pro Md Md Dt Vb, Pro Md Nn Dt Nn, ...)
Back to the original problem...
● How to optimize a discrete, joint loss?
● Input: x ∈ X; Truth: y ∈ Y(x); Outputs: Y(x); Predicted: ŷ ∈ Y(x); Loss: loss(y, ŷ); Data: (x, y) ~ D
● Goal: find h ∈ H such that h(x) ∈ Y(x), minimizing E_{(x,y)~D}[ loss(y, h(x)) ], based on N samples (x_n, y_n) ~ D
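Spelled out, this is the usual expected-loss objective together with its empirical approximation over the N training samples (a restatement of the goal above, nothing beyond it):

\[
h^\ast \;=\; \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_{(x,y)\sim D}\big[\mathrm{loss}(y, h(x))\big]
\;\approx\; \arg\min_{h \in \mathcal{H}} \; \frac{1}{N}\sum_{n=1}^{N} \mathrm{loss}\big(y_n, h(x_n)\big),
\qquad \text{subject to } h(x) \in \mathcal{Y}(x).
\]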
Search spaces
● When y decomposes in an ordered manner, a sequential decision-making process emerges
● Example: tag “I can can a can” one word at a time; each decision point (I, can, can, ...) takes an action chosen from the tag set {Pro, Md, Vb, Dt, Nn}
Search spaces
● When y decomposes in an ordered manner, a sequential decision-making process emerges
● The end state e encodes an output ŷ = ŷ(e), from which loss(y, ŷ) can be computed (at training time)
Policies
● A policy maps observations to actions: a = π(obs)
● The observation can include: the input x, the timestep t, the partial trajectory τ, ... anything else
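A minimal C++ sketch of this observation → action interface (hypothetical types, not the library's API):

#include <cstddef>
#include <vector>

// Hypothetical types, for illustration only.
struct Observation {
    std::vector<double> input_features;     // features of the input x
    std::size_t         timestep;            // t
    std::vector<int>    partial_trajectory;  // actions taken so far (τ)
    // ... anything else the feature extractor wants to expose
};

// A policy maps an observation to one of K discrete actions.
class Policy {
public:
    virtual ~Policy() = default;
    virtual int act(const Observation& obs) const = 0;
};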
An analogy from playing Mario (from the Mario AI Competition 2009)
● Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell)
● Output: Jump ∈ {0,1}, Right ∈ {0,1}, Left ∈ {0,1}, Speed ∈ {0,1}
● High-level goal: watch an expert play and learn to mimic her behavior
Training (expert) [video] (video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell)
Warm-up: Supervised learning
1. Collect trajectories from expert π_ref
2. Store as dataset D = { (o, π_ref(o,y)) | o ~ π_ref }
3. Train classifier π on D
● Let π play the game!
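A minimal C++ sketch of steps 1-3 (“behavior cloning”), using hypothetical reset/step/done/expert hooks rather than the library's real API:

#include <functional>
#include <vector>

// Hypothetical types; in the Mario example an observation would hold the 27K+ binary features.
using Observation = std::vector<float>;
using Action      = int;
struct Example { Observation obs; Action expert_action; };

// Steps 1-2: roll out the expert and record (observation, expert action) pairs.
//   reset()    -> first observation of a fresh episode
//   step(o,a)  -> next observation after taking action a at o
//   done(o)    -> true when the episode ends
//   expert(o)  -> the expert's action π_ref(o)
std::vector<Example> collect_expert_data(
    const std::function<Observation()>& reset,
    const std::function<Observation(const Observation&, Action)>& step,
    const std::function<bool(const Observation&)>& done,
    const std::function<Action(const Observation&)>& expert,
    int episodes)
{
    std::vector<Example> D;
    for (int e = 0; e < episodes; ++e) {
        Observation o = reset();
        while (!done(o)) {
            Action a = expert(o);      // o ~ π_ref, label = π_ref(o)
            D.push_back({o, a});
            o = step(o, a);            // the EXPERT keeps driving
        }
    }
    return D;                          // step 3: train any multiclass classifier π on D
}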
Test-time execution (supervised learning) [video] (video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell)
What's the (biggest) failure mode?
● The expert never gets stuck next to pipes
● ⇒ the classifier doesn't learn to recover!
Warm-up II: Imitation learning
1. Collect trajectories from expert π_ref
2. Dataset D_0 = { (o, π_ref(o,y)) | o ~ π_ref }
3. Train π_1 on D_0
4. Collect new trajectories from π_1
   ➢ But let the expert steer!
5. Dataset D_1 = { (o, π_ref(o,y)) | o ~ π_1 }
6. Train π_2 on D_0 ∪ D_1
● In general:
   ● D_n = { (o, π_ref(o,y)) | o ~ π_n }
   ● Train π_{n+1} on ∪_{i≤n} D_i
● Guarantee: if N ≈ T log T, then L(π_n) ≤ T ε_N + O(1) for some n, where ε_N is the classifier's error on the aggregated data
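A minimal C++ sketch of the aggregation loop, again with hypothetical rollout/expert/train hooks (nothing here is the library's actual API):

#include <functional>
#include <vector>

using Observation = std::vector<float>;
using Action      = int;
using Policy      = std::function<Action(const Observation&)>;
struct Example { Observation obs; Action expert_action; };

// Hypothetical hooks:
//   rollout(pi) -> the observations visited when pi drives an episode
//   expert(o)   -> π_ref(o), the expert's action at o
//   train(D)    -> a policy trained on the aggregated dataset D
Policy dagger(const std::function<std::vector<Observation>(const Policy&)>& rollout,
              const Policy& expert,
              const std::function<Policy(const std::vector<Example>&)>& train,
              int rounds)
{
    std::vector<Example> D;           // aggregated dataset, ∪_{i≤n} D_i
    Policy pi = expert;               // round 0: states come from the expert itself
    for (int n = 0; n < rounds; ++n) {
        for (const Observation& o : rollout(pi))    // o ~ π_n (the learner steers)
            D.push_back({o, expert(o)});            // ... but the expert labels
        pi = train(D);                              // π_{n+1} trained on all data so far
    }
    return pi;
}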
Test-time execution (DAgger) [video] (video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell)
What's the biggest failure mode?
● The classifier only sees “right” versus “not right”
   ● No notion of better or worse
   ● No partial credit
   ● Must have a single target answer
Learning to search: AggraVaTe
1. Let learned policy π drive for t timesteps to obs. o
2. For each possible action a:
   ● Take action a, and let expert π_ref drive the rest
   ● Record the overall loss, c_a
3. Update π based on example: ( o, ⟨c_1, c_2, ..., c_K⟩ )
4. Goto (1)
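A minimal C++ sketch of steps 1-3, with hypothetical rollin/rollout hooks; the key point is that each action's cost c_a comes from letting the expert finish the episode:

#include <functional>
#include <vector>

using Observation = std::vector<float>;
using Action      = int;
using Policy      = std::function<Action(const Observation&)>;

// One cost-sensitive example: the deviation state plus one cost per action, ⟨c_1, ..., c_K⟩.
struct CSExample { Observation obs; std::vector<float> costs; };

// Hypothetical hooks:
//   rollin(pi, t)            -> observation reached after pi drives for t steps
//   rollout_loss(o, a, ref)  -> total loss if we take action a at o and then let ref drive the rest
CSExample aggravate_example(
    const std::function<Observation(const Policy&, int)>& rollin,
    const std::function<float(const Observation&, Action, const Policy&)>& rollout_loss,
    const Policy& pi, const Policy& expert, int t, int num_actions)
{
    CSExample ex;
    ex.obs = rollin(pi, t);                                    // 1. learned policy drives t steps
    for (Action a = 0; a < num_actions; ++a)                   // 2. try every action a ...
        ex.costs.push_back(rollout_loss(ex.obs, a, expert));   //    ... expert finishes; record c_a
    return ex;                                                 // 3. update π on (o, ⟨c_a⟩); goto 1
}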
Training time versus test accuracy
Training time versus test accuracy
Test time speed
State-of-the-art accuracy in...
● Part-of-speech tagging (1 million words); wc: 3.2 seconds
   ● us: 6 lines of code, 10 seconds to train
   ● CRFsgd: 1,068 lines, 30 minutes
   ● CRF++: 777 lines, hours
● Named entity recognition (200 thousand words); wc: 0.8 seconds
   ● us: 30 lines of code, 5 seconds to train
   ● CRFsgd: 1 minute
   ● CRF++: 10 minutes
   ● SVM^str: 876 lines, 30 minutes (suboptimal accuracy)
The Magic
● You write some greedy “test-time” code
   ● In your favorite imperative language (C++/Python)
   ● It makes arbitrary calls to a Predict function
   ● And you add some minor decoration
● We will automatically:
   ● Perform learning
   ● Generate non-deterministic (beam) search
   ● Run faster than specialized learning software
How to train?
1. Generate an initial trajectory using a rollin policy
2. For each state R on that trajectory:
   a) For each possible action a (one-step deviations):
      i. Take that action
      ii. Complete this trajectory using a rollout policy
      iii. Obtain a final loss
   b) Generate a cost-sensitive classification example: ( Φ(R), ⟨c_a⟩_{a∈A} )
(figure: rollin from start S to state R; one-step deviations with rollouts reach end states E with losses 0, 0.2 and 0.8)
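Any cost-sensitive classifier can consume the ( Φ(R), ⟨c_a⟩ ) examples. One standard, illustrative choice is a one-against-all reduction to squared-loss regression; this is a sketch of that reduction, not necessarily what the library does internally:

#include <cstddef>
#include <vector>

// One linear regressor per action predicts that action's cost; at test time
// we pick the action with the smallest predicted cost.
class CostSensitiveOAA {
public:
    CostSensitiveOAA(std::size_t num_actions, std::size_t num_features, float lr = 0.1f)
        : w_(num_actions, std::vector<float>(num_features, 0.0f)), lr_(lr) {}

    // SGD step on one example (phi, costs): squared loss, one regressor per action.
    void update(const std::vector<float>& phi, const std::vector<float>& costs) {
        for (std::size_t a = 0; a < w_.size(); ++a) {
            float err = predict_cost(a, phi) - costs[a];
            for (std::size_t j = 0; j < phi.size(); ++j)
                w_[a][j] -= lr_ * err * phi[j];
        }
    }

    // Return the action with the lowest predicted cost.
    std::size_t act(const std::vector<float>& phi) const {
        std::size_t best = 0;
        for (std::size_t a = 1; a < w_.size(); ++a)
            if (predict_cost(a, phi) < predict_cost(best, phi)) best = a;
        return best;
    }

private:
    float predict_cost(std::size_t a, const std::vector<float>& phi) const {
        float s = 0.0f;
        for (std::size_t j = 0; j < phi.size(); ++j) s += w_[a][j] * phi[j];
        return s;
    }
    std::vector<std::vector<float>> w_;
    float lr_;
};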
The magic in practice
Training-time pseudocode (the extra y_true argument to Predict is a “hint” about the correct decision, available only at training time; Loss reports how bad the entire sequence of predictions was, also only at training time):

run(vector<example> ec)
  for i = 0 .. ec.size
    y_true = get_example_label(ec[i])
    y_pred = Predict(ec[i], y_true)
    Loss( # of y_true != y_pred )

The actual code (I'm really not hiding anything...):

void run(search& sch, vector<example*> ec) {
  for (size_t i=0; i<ec.size(); i++) {
    uint32_t y_true = get_example_label(ec[i]);
    uint32_t y_pred = sch.predict(ec[i], y_true);
    sch.loss( y_true != y_pred );
    if (sch.output().good())
      sch.output() << y_pred << ' ';
  }
}
The illusion of control
● Execute run O(T × A) times, modifying Predict:
  For each time step myT = 1 .. T:
    For each possible action myA = 1 .. A:
      define Predict(...) = myA if t = myT, otherwise π
      run your code in full
      set cost_myA = result of Loss
    Make a cost-sensitive classification example on x_myT with ⟨cost_a⟩

run(vector<example> ec)
  for i = 0 .. ec.size
    y_true = get_example_label(ec[i])
    y_pred = Predict(ec[i], y_true)
    Loss( # of y_true != y_pred )
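A C++ sketch of how this re-execution could be wired up, with hypothetical types (the library's real plumbing differs, but the slide's idea is exactly this): at timestep myT, Predict returns the forced action myA; everywhere else it follows the current policy.

#include <functional>
#include <vector>

using Features  = std::vector<float>;
using Action    = int;
using PredictFn = std::function<Action(const Features&)>;

struct CSExample { Features phi; std::vector<float> costs; };

// user_run(predict) plays one full episode, calling predict at each step,
// and returns the episode's total loss; policy is the current learned policy;
// T is the number of timesteps, A the number of actions.
std::vector<CSExample> one_step_deviations(
    const std::function<float(const PredictFn&)>& user_run,
    const PredictFn& policy, int T, int A)
{
    std::vector<CSExample> examples;
    for (int myT = 0; myT < T; ++myT) {
        CSExample ex;
        ex.costs.resize(A);
        for (int myA = 0; myA < A; ++myA) {
            int t = 0;
            PredictFn forced = [&](const Features& phi) -> Action {
                if (t == myT) { ex.phi = phi; ++t; return myA; }  // deviate here
                ++t;
                return policy(phi);     // otherwise follow the current policy
            };
            ex.costs[myA] = user_run(forced);   // run the user's code in full
        }
        examples.push_back(ex);                 // one cost-sensitive example per timestep
    }
    return examples;
}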
Entity/relation identification
Dependency parsing
Outline
➢ Background: learning to search
➢ Stuff I did in the Spring
   ➢ Imperative DSL/library for learning to search
   ➢ SOTA examples for tagging, parsing, relation extraction, etc.
   ➢ Learning to search under bandit feedback
   ➢ Hardness results for learning to search
   ➢ Active learning for accelerating learning to search
➢ Stuff I'm trying to do now
   ➢ Distant supervision
   ➢ Mashups with recurrent neural networks