Scalable Large-Margin Structured Learning: Theory and Algorithms
Liang Huang, Kai Zhao, Lemao Liu
The City University of New York (CUNY)
slides at: http://acl.cs.qc.edu/~lhuang/

What is Structured Prediction?
• binary classification: output is binary
• multiclass classification: output is a (small) number
• structured classification: output is a structure (sequence, tree, graph)
  • part-of-speech tagging, parsing, summarization, translation
  • exponentially many classes: search (inference) efficiency is crucial!
• NLP is all about structured prediction!
[figure: "the man bit the dog" mapped to a binary label (y = +1 / y = -1), to the tag sequence DT NN VBD DT NN, to a parse tree (S over NP VP), and to the translation 那 人 咬 了 狗]

Examples of Bad Structured Prediction
[figures omitted]

Learning: Unstructured vs. Structured
                               binary/multiclass                 structured learning
generative (count & divide)    naive Bayes                       HMMs
conditional, discriminative    logistic regression (maxent)      CRFs
(expectations)
online + Viterbi (argmax)      perceptron                        structured perceptron
max margin                     SVM                               structured SVM
(loss-augmented argmax)
Why Perceptron (Online Learning)?
• because we want scalability on big data!
  • learning time has to be linear in the number of examples
  • can make only a constant number of passes over the training data
  • only online learning (perceptron/MIRA) can guarantee this!
  • SVM scales between O(n²) and O(n³); CRFs offer no such guarantee
• and inference on each example must be super fast
  • another advantage of the perceptron: it only needs an argmax

Perceptron: from binary to structured
• binary perceptron (Rosenblatt, 1959): trivial; 2 classes; exact inference
• multiclass perceptron (Freund & Schapire, 1999): easy; constant number of classes; exact inference
• structured perceptron (Collins, 2002): hard; exponentially many classes; exact inference
[figure: in all three cases the loop is the same: run exact inference on x under weights w to get z, then update the weights if y ≠ z; a code sketch of this shared loop follows the Tutorial Outline below]

Scalability Challenges
• inference (on one example) is too slow (even with DP)
  • can we sacrifice search exactness for faster learning?
  • would inexact search interfere with learning?
  • if so, how should we modify learning?
• even the fastest inexact inference is still too slow
  • due to too many training examples
  • can we parallelize online learning?
[figure: training examples 1–16 split across parallel workers, each making online updates, combined with ⨁]

Tutorial Outline
• Overview of Structured Learning
• Challenges in Scalability
• Structured Perceptron
  • convergence proof
• Structured Perceptron with Inexact Search
• Latent-Variable Structured Perceptron
• Parallelizing Online Learning (Perceptron & MIRA)
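The "Perceptron: from binary to structured" slide above makes the point that all three perceptrons share one mistake-driven loop and differ only in the argmax. Below is a minimal sketch of that shared loop, assuming sparse feature dictionaries; the names `perceptron_train`, `phi`, and `argmax_fn` are illustrative, not from the tutorial.

```python
# Generic perceptron loop shared by the binary, multiclass, and structured cases.
# Only the argmax (inference) routine changes: 2 classes, K classes, or an
# exponentially large structured space searched with dynamic programming.

def perceptron_train(data, phi, argmax_fn, epochs=5):
    """data: list of (x, y) pairs; phi(x, y): feature vector as a dict;
    argmax_fn(w, x): best-scoring output under the current weights w."""
    w = {}
    for _ in range(epochs):
        for x, y in data:
            z = argmax_fn(w, x)               # inference under the current model
            if z != y:                        # mistake-driven update
                for f, v in phi(x, y).items():
                    w[f] = w.get(f, 0.0) + v  # reward gold features
                for f, v in phi(x, z).items():
                    w[f] = w.get(f, 0.0) - v  # penalize predicted features
    return w
```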
Generic Perceptron
• the perceptron is the simplest machine learning algorithm
• online learning: one example at a time
• learning by doing
  • find the best output under the current weights
  • update the weights at mistakes
[figure: x_i → inference (under w) → z_i; update weights if z_i ≠ y_i]

Structured Perceptron
[figure: the same loop with structured outputs: gold y_i = DT NN VBD DT NN, prediction z_i = DT NN NN DT NN for "the man bit the dog"]

Example: POS Tagging
• gold standard: y = DT NN VBD DT NN
  • the man bit the dog
• current output: z = DT NN NN DT NN
  • the man bit the dog
• assume only two feature classes
  • tag bigrams: t_{i-1} t_i
  • word/tag pairs: w_i, t_i
• update after this mistake:
  • weights ++ : (NN, VBD), (VBD, DT), (VBD → bit)
  • weights -- : (NN, NN), (NN, DT), (NN → bit)

Inference: Dynamic Programming
• the argmax z = argmax over z in GEN(x) of w · Φ(x, z) is computed exactly with dynamic programming
  • tagging: O(nT³)    CKY parsing: O(n³)
• the perceptron update is then Δw = Φ(x, y) − Φ(x, z) if y ≠ z
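A minimal sketch of the tagging example above, assuming only the two feature classes from the slide (tag bigrams and word/tag pairs) and bigram Viterbi inference; the helper names, the `<s>` boundary symbol, and the dictionary-based weights are illustrative choices, not the tutorial's code.

```python
# Bigram Viterbi inference for the POS-tagging example above.  Features are
# tag bigrams (t_{i-1}, t_i) and word/tag pairs (w_i, t_i).

def tag_features(words, tags):
    """Count features of a (sentence, tag sequence) pair as a dict."""
    feats = {}
    prev = "<s>"
    for w, t in zip(words, tags):
        for f in (("bigram", prev, t), ("word/tag", w, t)):
            feats[f] = feats.get(f, 0.0) + 1.0
        prev = t
    return feats

def viterbi(w, words, tagset):
    # best[i][t] = (score, back-pointers) for the best tagging of words[0..i]
    # whose last tag is t
    best = [{t: (w.get(("bigram", "<s>", t), 0.0)
                 + w.get(("word/tag", words[0], t), 0.0),
                 ["<s>"])
             for t in tagset}]
    for i in range(1, len(words)):
        col = {}
        for t in tagset:
            emit = w.get(("word/tag", words[i], t), 0.0)
            col[t] = max(
                ((best[i - 1][p][0] + w.get(("bigram", p, t), 0.0) + emit,
                  best[i - 1][p][1] + [p])
                 for p in tagset),
                key=lambda cand: cand[0])
        best.append(col)
    _, path = max(((best[-1][t][0], best[-1][t][1] + [t]) for t in tagset),
                  key=lambda cand: cand[0])
    return path[1:]   # drop the <s> boundary symbol
```

Plugging `viterbi` in as the `argmax_fn` of the generic loop sketched earlier, with `tag_features` as `phi`, gives a toy structured perceptron tagger. Note that bigram Viterbi as written runs in O(nT²); the O(nT³) figure on the slide presumably refers to a higher-order (e.g. trigram) tagger.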
Efficiency vs. Expressiveness
• the inference (argmax over y in GEN(x)) must be efficient
  • either the search space GEN(x) is small, or it is factored into local parts
• features must be local to y (but can be global to x)
  • e.g. a bigram tagger, but it can look at all input words (cf. CRFs)

Averaged Perceptron
• much more stable and accurate results
• approximation of the voted perceptron (large margin) (Freund & Schapire, 1999)
[figure: the averaged weights are the average of all intermediate weight vectors w^(j) produced during training]

Averaging => Large Margin
• much more stable and accurate results
• approximation of the voted perceptron (large margin) (Freund & Schapire, 1999)
[figure: test error of the averaged vs. unaveraged perceptron over training]

Efficient Implementation of Averaging
• the naive implementation (a running sum of all weight vectors) doesn’t scale
• very clever trick from Daumé (2006, PhD thesis)
[figure:
  w^(0) = 0
  w^(1) = Δw^(1)
  w^(2) = Δw^(1) + Δw^(2)
  w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
  w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)
  each update Δw^(t) contributes to every later weight vector, so the running sum can be maintained lazily with one extra accumulator]
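The running-sum trick can be sketched in a few lines. This is a common reconstruction of the idea behind Daumé's trick, not code from the tutorial: keep one extra accumulator that scales each update by the number of examples already seen, and recover the average only when needed.

```python
# Lazy averaging: maintain w and w_acc, where w_acc accumulates (examples seen
# so far) * delta for each update delta.  Then
#     sum_{t=1..c} w^(t) = c * w - w_acc,
# so the averaged weights are  w - w_acc / c,  with no per-step copying.

class AveragedWeights:
    def __init__(self):
        self.w = {}       # current weights
        self.w_acc = {}   # sum of (step index - 1) * delta over all updates
        self.c = 0        # number of examples seen

    def step(self, delta=None):
        """Call once per example; delta maps feature -> weight change (or None)."""
        if delta:
            for f, v in delta.items():
                self.w[f] = self.w.get(f, 0.0) + v
                self.w_acc[f] = self.w_acc.get(f, 0.0) + self.c * v
        self.c += 1

    def averaged(self):
        """Average of w^(1), ..., w^(c); call after training."""
        if self.c == 0:
            return dict(self.w)
        return {f: v - self.w_acc.get(f, 0.0) / self.c
                for f, v in self.w.items()}
```

A quick check on two examples: updating by Δ1 then Δ2 gives w = Δ1 + Δ2 and w_acc = Δ2, so the average is Δ1 + Δ2/2, which matches (w^(1) + w^(2)) / 2 directly.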
Perceptron vs. CRFs
• the perceptron is an online, Viterbi approximation of the CRF
  • simpler to code; faster to converge; ~same accuracy
• CRFs (Lafferty et al., 2001) maximize the conditional log-likelihood
      sum over (x,y) in D of  log [ exp(w · Φ(x,y)) / Z(x) ],
      where Z(x) = sum over z in GEN(x) of exp(w · Φ(x,z))
• making training online gives stochastic gradient descent (SGD); replacing the expectation
  with an argmax gives hard/Viterbi CRFs; doing both gives the structured perceptron (Collins, 2002):
      for (x,y) in D:  z = argmax over z in GEN(x) of w · Φ(x,z); update if z ≠ y

Perceptron Convergence Proof
• binary classification: converges iff the data is separable
• structured prediction: converges iff the data is separable
  • there is an oracle vector that correctly labels all examples
  • one vs. the rest (the correct label scores better than all incorrect labels)
• theorem: if separable with margin δ, then # of updates ≤ R²/δ²   (R: diameter)
• Novikoff (1962) => Freund & Schapire (1999) => Collins (2002)
[figure: training examples (x_100, x_111, ..., x_2000, x_3012) lie within a ball of diameter R, with the correct labels separated from the incorrect ones by margin δ]

Geometry of Convergence Proof, part 1
• perceptron update: w^(k+1) = w^(k) + Δ, where Δ = Φ(x, y) − Φ(x, z)
• violation: the incorrect label z (the 1-best) scored higher than the correct label y
• separation: the unit oracle vector u has margin u · Δ ≥ δ (angle with u is < 90˚)
• part 1 (lower bound), by induction: ‖w^(k+1)‖ ≥ kδ
[figure: current model w^(k), update Δ, new model w^(k+1), unit oracle vector u]

Geometry of Convergence Proof, part 2
• summary: the proof uses 3 facts:
  1. separation (margin ≥ δ)
  2. diameter (always finite): ‖Δ‖ ≤ R
  3. violation (guaranteed by exact search)
• part 2 (upper bound), by induction: ‖w^(k+1)‖² ≤ kR², i.e. ‖w^(k+1)‖ ≤ √k R
• combine with the lower bound ‖w^(k+1)‖ ≥ kδ
• bound on # of updates: k ≤ R²/δ²
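The two induction steps from the geometry slides can be written out explicitly; this is the standard argument (Novikoff 1962; Freund & Schapire 1999; Collins 2002), reconstructed from the three facts listed above.

```latex
\begin{align*}
&\text{facts: } u\cdot\Delta^{(k)} \ge \delta \ (\text{separation},\ \|u\|=1),\quad
 \|\Delta^{(k)}\| \le R \ (\text{diameter}),\quad
 w^{(k)}\cdot\Delta^{(k)} \le 0 \ (\text{violation}),\\
&\text{where } \Delta^{(k)} = \Phi(x,y)-\Phi(x,z) \text{ is the } k\text{-th update.}\\[4pt]
&\text{part 1: } u\cdot w^{(k+1)} = u\cdot w^{(k)} + u\cdot\Delta^{(k)}
   \ge u\cdot w^{(k)} + \delta
   \;\Rightarrow\; \|w^{(k+1)}\| \ge u\cdot w^{(k+1)} \ge k\delta.\\
&\text{part 2: } \|w^{(k+1)}\|^2 = \|w^{(k)}\|^2 + 2\,w^{(k)}\cdot\Delta^{(k)} + \|\Delta^{(k)}\|^2
   \le \|w^{(k)}\|^2 + R^2
   \;\Rightarrow\; \|w^{(k+1)}\|^2 \le kR^2.\\
&\text{combining: } k^2\delta^2 \le \|w^{(k+1)}\|^2 \le kR^2
   \;\Rightarrow\; k \le R^2/\delta^2.
\end{align*}
```

Note that fact 3 (the predicted z really did score at least as high as y) is exactly what exact search guarantees; this is the fact that inexact search breaks, which motivates the next part of the tutorial.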
Tutorial Outline
• Overview of Structured Learning
• Challenges in Scalability
• Structured Perceptron
  • convergence proof
• Structured Perceptron with Inexact Search
• Latent-Variable Perceptron
• Parallelizing Online Learning (Perceptron & MIRA)

Scalability Challenge 1: Inference
• binary classification: inference is trivial (constant number of classes)
• structured classification: exponentially many classes, inference is hard
• challenge: search efficiency (exponentially many classes)
  • often use dynamic programming (DP)
  • but DP is still too slow for repeated use, e.g. parsing is O(n³)
• Q: can we sacrifice search exactness for faster learning?

Perceptron w/ Inexact Inference
• routine use of inexact inference in NLP (e.g. beam search, greedy search)
• how does the structured perceptron work with inexact search?
  • so far most structured learning theory assumes exact search
  • would search errors break these learning properties?
  • if so, how do we modify learning to accommodate inexact search?
• Q: does the perceptron still work???
• A: it no longer works as is, but we can make it work by some magic.

Bad News and Good News
• bad news: no more guarantee of convergence
  • in practice the perceptron degrades a lot due to search errors
• good news: new update methods guarantee convergence
  • new perceptron variants that “live with” search errors
  • in practice they work really well with inexact search
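The tutorial goes on to present such convergence-preserving variants. As one concrete illustration (my sketch, not the tutorial's code), here is early update (Collins & Roark, 2004) for a beam-search tagger: decode left to right, and as soon as the gold prefix falls off the beam, update against the current best prefix and stop, so that every update is still a true violation. The beam size, helper names, and feature set are illustrative assumptions.

```python
# Early update for a beam-search tagger (Collins & Roark, 2004): decode left to
# right with a beam; as soon as the gold prefix is pruned, update against the
# current best beam item and skip the rest of the sentence.

def prefix_features(words, tags):
    """Count tag-bigram and word/tag features of a tag prefix (as a dict)."""
    feats = {}
    prev = "<s>"
    for w, t in zip(words, tags):
        for f in (("bigram", prev, t), ("word/tag", w, t)):
            feats[f] = feats.get(f, 0.0) + 1.0
        prev = t
    return feats

def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def early_update(w, words, gold_tags, tagset, beam_size=8):
    """Return the update delta (gold minus predicted features), or None."""
    beam = [()]                                   # partial tag sequences (prefixes)
    for i in range(len(words)):
        cands = [p + (t,) for p in beam for t in tagset]
        cands.sort(key=lambda p: score(w, prefix_features(words[:i + 1], p)),
                   reverse=True)
        beam = cands[:beam_size]
        gold_prefix = tuple(gold_tags[:i + 1])
        if gold_prefix not in beam:               # gold fell off the beam: early update
            delta = prefix_features(words[:i + 1], gold_prefix)
            for f, v in prefix_features(words[:i + 1], beam[0]).items():
                delta[f] = delta.get(f, 0.0) - v
            return delta
    if beam[0] != tuple(gold_tags):               # full-sequence mistake: standard update
        delta = prefix_features(words, gold_tags)
        for f, v in prefix_features(words, beam[0]).items():
            delta[f] = delta.get(f, 0.0) - v
        return delta
    return None                                   # correct: no update
```

The returned `delta` plugs directly into the generic update loop (or `AveragedWeights.step`) sketched earlier; recomputing `prefix_features` from scratch at every position keeps the sketch short but is not how one would implement it efficiently.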