Machine Learning Fall 2017 Structured Prediction (structured perceptron, HMM, structured SVM) Professor Liang Huang (Chap. 17 of CIML)
Structured Prediction
[figure: for the input "the man bit the dog", example outputs — a binary label y = ±1, a POS tag sequence y = DT NN VBD DT NN, a parse tree, and a translation y = 那 人 咬 了 狗]
• binary classification: output is binary
• multiclass classification: output is a number (small # of classes)
• structured classification: output is a structure (seq., tree, graph)
  • part-of-speech tagging, parsing, summarization, translation
• exponentially many classes: search (inference) efficiency is crucial!
Generic Perceptron
[figure: online loop — example (x_i, y_i) → inference with current weights w → prediction z_i → update weights on mistakes]
• online learning: one example at a time
• learning by doing
• find the best output under the current weights
• update weights at mistakes
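A minimal sketch of this online loop in the binary case (assuming numpy feature vectors and ±1 labels; illustrative, not the course code):

```python
import numpy as np

def perceptron(data, epochs=10):
    """Generic online loop: predict with the current weights, update only on mistakes."""
    w = np.zeros(len(data[0][0]))
    for _ in range(epochs):
        for x, y in data:                      # one example at a time
            z = 1 if w.dot(x) >= 0 else -1     # best output under current weights
            if z != y:                         # mistake-driven update
                w += y * x
    return w

# toy usage: two separable points
data = [(np.array([1.0, 2.0]), +1), (np.array([-1.0, -1.5]), -1)]
print(perceptron(data))
```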
Perceptron: from binary to structured — the same loop (exact inference of z, update weights if y ≠ z) at three scales:
• binary classification (y = ±1): 2 classes, inference is trivial
• multiclass classification: constant # of classes, inference is easy
• structured classification (e.g. y = DT NN VBD DT NN for x = "the man bit the dog"): exponential # of classes, inference is hard
From Perceptron to SVM — a timeline (work from AT&T Research, ex-AT&T, and students):
• 1959 Rosenblatt: invention of the perceptron; 1962 Novikoff: convergence proof
• 1964 Aizerman+: kernel perceptron; 1964 Vapnik & Chervonenkis: max margin (same journal!)
• 1995 Cortes/Vapnik: SVM (batch, + soft margin, kernels; after the fall of the USSR)
• 1999 Freund/Schapire: voted/averaged perceptron (revived; inseparable case)
• 2001 Lafferty+: CRF (structured multinomial logistic regression, i.e. max. entropy)
• 2002 Collins: structured perceptron
• 2003 Crammer/Singer: MIRA (online, conservative updates); 2003 Taskar: M3N; 2004 Tsochantaridis: structured SVM
• 2005 McDonald+: structured MIRA; 2006 Singer group: aggressive updates
• 2007–2010 Singer group: Pegasos (subgradient descent, minibatch)
Multiclass Classification
• one weight vector ("prototype") for each class: w(1), ..., w(M)
• multiclass decision rule: ŷ = argmax_{z ∈ 1...M} w(z) · x  (best agreement w/ prototype)
• Q1: what about 2-class? Q2: do we still need augmented space?
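A sketch of this decision rule, assuming the per-class prototypes are stacked as rows of a numpy matrix W (names are illustrative, not from the slides):

```python
import numpy as np

def predict(W, x):
    """Multiclass decision rule: pick the class whose prototype agrees most with x.

    W: (M, d) matrix, one weight vector ("prototype") per class; x: (d,) feature vector.
    """
    return int(np.argmax(W @ x))   # y_hat = argmax_z  w(z) . x

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # 3 classes, 2 features
print(predict(W, np.array([0.2, 0.9])))                 # -> 1
```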
Multiclass Perceptron • on an error, penalize the weight for the wrong class, and reward the weight for the true class
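A sketch of that update (same illustrative (M, d) weight-matrix layout as above):

```python
import numpy as np

def multiclass_perceptron_step(W, x, y):
    """On a mistake, reward the true class's prototype and penalize the predicted one."""
    z = int(np.argmax(W @ x))   # predicted class under the current weights
    if z != y:
        W[y] += x               # reward the weight for the true class
        W[z] -= x               # penalize the weight for the wrong class
    return W
```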
Convergence of Multiclass
• update rule: w ← w + ΔΦ(x, y, z), where ΔΦ(x, y, z) = Φ(x, y) − Φ(x, z)
• separability: ∃ u s.t. ∀ (x, y) ∈ D, ∀ z ≠ y: u · ΔΦ(x, y, z) ≥ δ
Example: POS Tagging
• gold-standard y: DT NN VBD DT NN  (x: the man bit the dog)
• current output z: DT NN NN DT NN  (x: the man bit the dog)
• assume only two feature classes: tag bigrams (t_{i−1}, t_i) and word/tag pairs (t_i, w_i)
• Φ(x,y) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (VBD, bit): 1, ..., (NN, dog): 1}
• Φ(x,z) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (NN, bit): 1, ..., (NN, dog): 1}
• update Φ(x,y) − Φ(x,z):
  • weights ++ : (NN, VBD) (VBD, DT) (VBD, bit)
  • weights −− : (NN, NN) (NN, DT) (NN, bit)
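A sketch of how these sparse feature counts and the update direction could be computed with a Counter (illustrative only, not the course code):

```python
from collections import Counter

def phi(words, tags):
    """Tag-bigram and tag/word features as sparse counts."""
    padded = ["<s>"] + tags + ["</s>"]
    feats = Counter(zip(padded, padded[1:]))   # tag bigrams (t_{i-1}, t_i)
    feats.update(zip(tags, words))             # tag/word pairs (t_i, w_i)
    return feats

words = "the man bit the dog".split()
y = "DT NN VBD DT NN".split()
z = "DT NN NN DT NN".split()

delta = phi(words, y)
delta.subtract(phi(words, z))                  # Phi(x,y) - Phi(x,z)
print({f: c for f, c in delta.items() if c != 0})
# +1: (NN, VBD), (VBD, DT), (VBD, bit);  -1: (NN, NN), (NN, DT), (NN, bit)
```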
Structured Perceptron
• same online loop: inference z_i = argmax_{z ∈ GEN(x_i)} w · Φ(x_i, z) under the current weights
• on a mistake (z_i ≠ y_i): update weights w ← w + Φ(x_i, y_i) − Φ(x_i, z_i)
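A sketch of the whole training loop, assuming a `decode` callable for the argmax/inference step (e.g. the Viterbi sketch two slides below) and the `phi` feature map from the tagging example (illustrative, not the course implementation):

```python
from collections import defaultdict

def train_structured_perceptron(data, decode, phi, epochs=5):
    """data: list of (words, gold_tags); decode(w, words) returns the best tag sequence
    under the current weights w; phi(words, tags) returns a sparse feature Counter."""
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            guess = decode(w, words)                 # inference: argmax over GEN(x)
            if guess != gold:                        # mistake-driven update
                for f, c in phi(words, gold).items():
                    w[f] += c                        # reward gold features
                for f, c in phi(words, guess).items():
                    w[f] -= c                        # penalize predicted features
    return w
```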
Inference: Dynamic Programming
• for the bigram tagger the argmax over tag sequences is computed exactly by dynamic programming (Viterbi): since features are local to tag bigrams, the best score decomposes position by position
Python implementation
• Q: what about top-down recursive + memoization?
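The slide's original code isn't recoverable here; below is a minimal bottom-up Viterbi sketch scoring the tag-bigram and tag/word features used above (illustrative; bind `tagset`, e.g. with functools.partial, to use it as the `decode` callable in the earlier training loop). A top-down recursive version with memoization would fill the same table.

```python
def viterbi_decode(w, words, tagset):
    """Best tag sequence under weights w for tag-bigram + tag/word features."""
    n = len(words)
    best = [{} for _ in range(n)]   # best[i][t] = (score, back-pointer) of best prefix ending in t
    for t in tagset:
        best[0][t] = (w.get(("<s>", t), 0.0) + w.get((t, words[0]), 0.0), None)
    for i in range(1, n):
        for t in tagset:
            score, prev = max((best[i - 1][p][0] + w.get((p, t), 0.0), p) for p in tagset)
            best[i][t] = (score + w.get((t, words[i]), 0.0), prev)
    # add the end-of-sentence transition, then follow back-pointers
    last = max(tagset, key=lambda t: best[n - 1][t][0] + w.get((t, "</s>"), 0.0))
    tags = [last]
    for i in range(n - 1, 0, -1):
        tags.append(best[i][tags[-1]][1])
    return list(reversed(tags))
```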
Efficiency vs. Expressiveness
• the inference step z_i = argmax_{y ∈ GEN(x_i)} w · Φ(x_i, y) must be efficient
• either the search space GEN(x) is small, or factored
• features must be local to y (but can be global to x)
  • e.g. bigram tagger, but look at all input words (cf. CRFs)
Averaged Perceptron
• output the average of all intermediate weight vectors, w̄ = (1/T) Σ_{t=1}^{T} w^(t), instead of the final one
• more stable and accurate results
• approximation of the voted perceptron (Freund & Schapire, 1999)
Averaging Tricks (Daume, 2006, PhD thesis); sparse vectors: defaultdict
• each intermediate weight vector is a running sum of updates:
  w^(0) = 0
  w^(1) = Δw^(1)
  w^(2) = Δw^(1) + Δw^(2)
  w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
  w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)
• so in the sum Σ_t w^(t) each update Δw^(t) is weighted by how many steps it survives; the average can be maintained with one extra sparse accumulator instead of summing full vectors at every step
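A sketch of one common way to implement this trick with defaultdicts (not necessarily the slide's exact code): besides w, keep an accumulator wa of c·Δw, where c counts examples seen; the averaged weights are then w − wa/c.

```python
from collections import defaultdict

def train_averaged(data, decode, phi, epochs=5):
    """Structured perceptron with averaging via a second accumulator (Daume-style trick)."""
    w, wa = defaultdict(float), defaultdict(float)
    c = 1                                             # counts examples seen so far
    for _ in range(epochs):
        for words, gold in data:
            guess = decode(w, words)
            if guess != gold:
                delta = phi(words, gold)
                delta.subtract(phi(words, guess))     # sparse update Phi(x,y) - Phi(x,z)
                for f, v in delta.items():
                    w[f] += v
                    wa[f] += c * v                    # remember *when* the update happened
            c += 1
    return {f: w[f] - wa[f] / c for f in w}           # averaged weights
```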
Do we need smoothing?
• smoothing is much easier in discriminative models
• just make sure for each feature template, its subset templates are also included
• e.g., to include (t₀ w₀ w₋₁) you must also include
  • (t₀ w₀), (t₀ w₋₁), (w₀ w₋₁)
  • and maybe also (t₀ t₋₁) because t is less sparse than w
Convergence with Exact Search
• linear classification: converges iff data is separable
• structured: converges iff data is separable & search is exact
• there is an oracle vector that correctly labels all examples
  • one vs. the rest (correct label better than all incorrect labels)
• theorem: if separable, then # of updates ≤ R²/δ²  (R: diameter of the data; δ: separation margin)
• (Rosenblatt 1957 ⇒ Collins 2002)
Geometry of Convergence Proof, pt 1
• perceptron update at a mistake: w^(k+1) = w^(k) + ΔΦ(x, y, z), where y is the correct label and z the exact 1-best output under the current model w^(k)
• separation: the unit oracle vector u has margin u · ΔΦ(x, y, z) ≥ δ
• so each update grows the projection onto u by at least δ; by induction this lower-bounds ‖w^(k)‖ (part 1)
Geometry of Convergence Proof, pt 2
• violation: the incorrect label z scored at least as high as y, so w^(k) · ΔΦ(x, y, z) ≤ 0
• with R the max diameter (‖ΔΦ‖ ≤ R), each update grows ‖w^(k)‖² by at most R²; by induction this upper-bounds ‖w^(k)‖² (part 2)
• parts 1 + 2 ⇒ update bound: k ≤ R²/δ²
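Spelling out the argument sketched on these two slides (a standard writeup of the Collins-style bound; LaTeX for readability):

```latex
% Part 1 (lower bound): each update gains at least \delta along the unit oracle u
\mathbf{u}\cdot\mathbf{w}^{(k+1)}
  = \mathbf{u}\cdot\mathbf{w}^{(k)} + \mathbf{u}\cdot\Delta\Phi(x,y,z)
  \ \ge\ \mathbf{u}\cdot\mathbf{w}^{(k)} + \delta
  \quad\Rightarrow\quad
  \|\mathbf{w}^{(k)}\| \ \ge\ \mathbf{u}\cdot\mathbf{w}^{(k)} \ \ge\ k\delta .

% Part 2 (upper bound): at a violation w^{(k)}\cdot\Delta\Phi \le 0, and \|\Delta\Phi\| \le R
\|\mathbf{w}^{(k+1)}\|^2
  = \|\mathbf{w}^{(k)}\|^2 + 2\,\mathbf{w}^{(k)}\!\cdot\Delta\Phi(x,y,z) + \|\Delta\Phi(x,y,z)\|^2
  \ \le\ \|\mathbf{w}^{(k)}\|^2 + R^2
  \quad\Rightarrow\quad
  \|\mathbf{w}^{(k)}\|^2 \ \le\ kR^2 .

% Combining: k^2\delta^2 \le \|\mathbf{w}^{(k)}\|^2 \le kR^2, hence k \le R^2/\delta^2.
```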
Experiments
Experiments: Tagging
• (almost) identical features from (Ratnaparkhi, 1996)
• trigram tagger: current tag t_i, previous tags t_{i−1}, t_{i−2}
• current word w_i and its spelling features
• surrounding words w_{i−1}, w_{i+1}, w_{i−2}, w_{i+2}, ...
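A sketch of what such templates could look like at position i (an illustrative, simplified subset; Ratnaparkhi's full set also includes prefix/suffix and other spelling features):

```python
def tag_features(words, tags, i):
    """Trigram-tagger feature templates at position i (simplified subset)."""
    w = lambda j: words[j] if 0 <= j < len(words) else "<pad>"
    t = lambda j: tags[j] if 0 <= j < len(tags) else "<s>"
    return [
        ("t_i t_i-1 t_i-2", tags[i], t(i - 1), t(i - 2)),   # tag trigram
        ("t_i w_i",   tags[i], words[i]),                   # current word
        ("t_i w_i-1", tags[i], w(i - 1)),                   # surrounding words
        ("t_i w_i+1", tags[i], w(i + 1)),
        ("t_i w_i-2", tags[i], w(i - 2)),
        ("t_i w_i+2", tags[i], w(i + 2)),
    ]
```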
Experiments: NP Chunking
• B-I-O scheme:
  Rockwell International Corp.      B I I
  's Tulsa unit said it signed      B I I O B O
  a tentative agreement ...         B I I
• features:
  • unigram model
  • surrounding words and POS tags
Experiments: NP Chunking • results • (Sha and Pereira, 2003) trigram tagger • voted perceptron: 94.09% vs. CRF: 94.38% Structured Prediction 23
Structured SVM • structured perceptron: w · Δ ɸ (x,y,z) > 0 • SVM: for all (x,y), functional margin y(w · x) ≥ 1 • structured SVM version 1: simple loss • for all (x,y), for all z ≠ y, margin w · Δ ɸ (x,y,z) ≥ 1 • correct y has to score higher than any wrong z by 1 • structured SVM version 2: structured loss • for all (x,y), for all z ≠ y, margin w · Δ ɸ (x,y,z) ≥ ℓ (y,z) • correct y has to score higher than any wrong z by ℓ (y,z), a distance metric such as hamming loss Structured Prediction 24
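These hard constraints are usually made soft; one standard way to write the resulting objective (the margin-rescaled structured hinge loss, as in M3N / struct. SVM) is shown below. The inner max is exactly the loss-augmented decoding problem on the next slide, and version 1 ("simple loss") is the special case ℓ(y,z) = 1 for all z ≠ y.

```latex
% structured SVM (version 2) as a regularized hinge objective over the training set D:
\min_{\mathbf{w}}\ \frac{\lambda}{2}\|\mathbf{w}\|^2
  \;+\; \frac{1}{|D|}\sum_{(x,y)\in D}
        \Big(\max_{z}\big[\,\ell(y,z) + \mathbf{w}\cdot\Phi(x,z)\,\big]
             \;-\; \mathbf{w}\cdot\Phi(x,y)\Big)
```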
Loss-Augmented Decoding
• want for all z: w · Φ(x,y) ≥ w · Φ(x,z) + ℓ(y,z)
• same as: w · Φ(x,y) ≥ max_z [w · Φ(x,z) + ℓ(y,z)]
• loss-augmented decoding: argmax_z [w · Φ(x,z) + ℓ(y,z)]
• if ℓ(y,z) factors in z (e.g. Hamming), just modify the DP
• CIML version (with λ = 1/(2C)): a modified DP, but it should have a learning rate! very similar to Pegasos — better to use the Pegasos framework instead (next slide)
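A sketch of loss-augmented decoding with Hamming loss: since the loss factors per position, it just adds a +1 bonus whenever a position's tag differs from gold, inside the same Viterbi DP as before (illustrative scoring conventions as in the earlier sketches, not the CIML code):

```python
def loss_augmented_decode(w, words, gold, tagset):
    """Viterbi over w . Phi(x,z) + Hamming(y,z): +1 whenever a tag differs from gold."""
    n = len(words)
    best = [{} for _ in range(n)]
    emit = lambda t, i: w.get((t, words[i]), 0.0) + (1.0 if t != gold[i] else 0.0)
    for t in tagset:
        best[0][t] = (w.get(("<s>", t), 0.0) + emit(t, 0), None)
    for i in range(1, n):
        for t in tagset:
            score, prev = max((best[i - 1][p][0] + w.get((p, t), 0.0), p) for p in tagset)
            best[i][t] = (score + emit(t, i), prev)
    last = max(tagset, key=lambda t: best[n - 1][t][0] + w.get((t, "</s>"), 0.0))
    tags = [last]
    for i in range(n - 1, 0, -1):
        tags.append(best[i][tags[-1]][1])
    return list(reversed(tags))
```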
Correct Version following Pegasos
• want for all z: w · Φ(x,y) ≥ w · Φ(x,z) + ℓ(y,z)
• same as: w · Φ(x,y) ≥ max_z [w · Φ(x,z) + ℓ(y,z)]
• loss-augmented decoding: argmax_z [w · Φ(x,z) + ℓ(y,z)]
• if ℓ(y,z) factors in z (e.g. Hamming), just modify the DP
• Pegasos-style update, with t += 1 for each example: shrink w by (1 − 1/t), then take a step of size NC/(2t) toward Φ(x,y) − Φ(x,z̃) when the loss-augmented 1-best z̃ beats y (N = |D|, C is from the SVM)
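A sketch of this loop under the above reading of the slide's annotations (shrink factor 1 − 1/t, step size NC/(2t), loss-augmented decoding as sketched earlier; dense numpy weights and illustrative callables only):

```python
import numpy as np

def train_struct_svm(data, loss_aug_decode, phi_vec, dim, C=1.0, epochs=5):
    """Pegasos-style structured SVM: data is [(x, y)]; loss_aug_decode(w, x, y) returns
    the loss-augmented argmax; phi_vec(x, y) returns a dense feature vector of length dim."""
    N = len(data)
    w = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1                                    # one step per example
            z = loss_aug_decode(w, x, y)              # argmax_z  w . Phi(x,z) + loss(y,z)
            w *= (1.0 - 1.0 / t)                      # shrink: the regularizer's part of the step
            if z != y:                                # subgradient of the hinge part
                w += (N * C / (2.0 * t)) * (phi_vec(x, y) - phi_vec(x, z))
    return w
```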