Structured Learning with Inexact Search




  1. Structured Learning with Inexact Search
  Liang Huang, The City University of New York (CUNY); includes joint work with S. Phayong, Y. Guo, and K. Zhao
  [Title-slide examples: binary classification, x = "the man bit the dog" => y = +1, x = "the man hit the dog" => y = -1; structured outputs for "the man bit the dog": the tag sequence DT NN VBD DT NN and the Chinese translation 那 人 咬 了 狗 ("the man bit the dog").]

  2. Structured Perceptron (Collins 02)
  [Diagram: binary classification has a trivial, constant number of classes and exact inference is easy; structured classification (e.g. tagging "the man bit the dog" as DT NN VBD DT NN) is hard, with exponentially many classes. In both, inference finds z for input x, and the weights w are updated if y ≠ z.]
  • challenge: search efficiency (exponentially many classes)
  • often use dynamic programming (DP)
  • but still too slow for repeated use, e.g. parsing is O(n³)
  • and can't use non-local features in DP

  3. Perceptron w/ Inexact Inference
  [Diagram: the same pipeline, but inference is inexact (beam search, greedy search): given x = "the man bit the dog", find z and update weights if y ≠ z. Does it still work?]
  • routine use of inexact inference in NLP (e.g. beam search)
  • how does structured perceptron work with inexact search?
  • so far most structured learning theory assumes exact search
  • would search errors break these learning properties?
  • if so, how to modify learning to accommodate inexact search?

  4. Idea: Search-Error-Robust Model
  [Diagram: the same inexact "search box" produces z in both training (update weights if y ≠ z) and testing.]
  • train a "search-specific" or "search-error-robust" model
  • we assume the same "search box" in training and testing
  • model should "live with" search errors from the search box
  • exact search => convergence; greedy => no convergence
  • how can we make perceptron converge w/ greedy search?

  5. Our Contributions
  [Diagram: greedy or beam search with early update on prefixes y', z'.]
  • theory: a framework for perceptron w/ inexact search
  • explains previous work (early update etc.) as special cases
  • practice: new update methods within the framework
  • converge faster and better than early update
  • real impact on state-of-the-art parsing and tagging
  • more advantageous when search errors are more severe

  6. In this talk...
  • Motivations: Structured Learning and Search Efficiency
  • Structured Perceptron and Inexact Search
    • perceptron does not converge with inexact search
    • early update (Collins/Roark '04) seems to help; but why?
  • New Perceptron Framework for Inexact Search
    • explains early update as a special case
    • convergence theory with arbitrarily inexact search
    • new update methods within this framework
  • Experiments

  7. Structured Perceptron (Collins 02)
  • simple generalization from binary/multiclass perceptron
  • online learning: for each example (x, y) in the data
  • inference: find the best output z given the current weight vector w
  • update weights if y ≠ z
  [Diagram: binary classification (trivial, constant number of classes) vs. structured classification (hard, exponentially many classes, e.g. tagging "the man bit the dog" as DT NN VBD DT NN); both use exact inference and update weights if y ≠ z.]
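To make the training loop concrete, here is a minimal Python sketch of the structured perceptron as described on this slide; `phi` (the joint feature map) and `exact_search` (argmax inference, e.g. Viterbi for tagging) are assumed task-specific stubs, not part of the original slides.

```python
# Minimal sketch of the structured perceptron (Collins 02).
from collections import defaultdict

def phi(x, y):
    """Joint feature map Phi(x, y) -> sparse feature counts (assumed stub)."""
    raise NotImplementedError  # task-specific

def exact_search(x, w):
    """argmax_z w . Phi(x, z), e.g. Viterbi over all tag sequences (assumed stub)."""
    raise NotImplementedError  # task-specific

def perceptron_train(data, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:              # online learning over examples
            z = exact_search(x, w)     # inference under the current weights
            if z != y:                 # update only on mistakes
                for f, v in phi(x, y).items():
                    w[f] += v          # promote features of the correct output
                for f, v in phi(x, z).items():
                    w[f] -= v          # demote features of the predicted output
    return w
```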

  8. Convergence with Exact Search
  • linear classification: converges iff the data is separable
  • structured: converges iff the data is separable & search is exact
  • there is an oracle vector that correctly labels all examples
  • one vs. the rest (correct label better than all incorrect labels)
  • theorem: if separable, then # of updates ≤ R²/δ²  (R: diameter; δ: margin)
  [Diagram: separable examples with margin δ and diameter R; Rosenblatt 1957 => Collins 2002.]
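The theorem can be stated precisely; the following LaTeX formulation uses the standard notation of the perceptron literature (a sketch of the usual statement, not text taken verbatim from the slides):

```latex
\textbf{Separability.} There exist a unit oracle vector $u$ and a margin
$\delta > 0$ such that for every training pair $(x, y)$ and every
incorrect output $z \ne y$:
\[
  u \cdot \Phi(x, y) \;\ge\; u \cdot \Phi(x, z) + \delta .
\]
\textbf{Theorem (Rosenblatt 1957; Collins 2002).} With diameter
$R = \max_{(x, y),\, z} \lVert \Phi(x, y) - \Phi(x, z) \rVert$,
the number of perceptron updates $k$ satisfies $k \le R^2 / \delta^2$.
```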

  9. Convergence with Exact Search
  [Diagram: training example "time flies" with correct label N V; output space {N,V} x {N,V}; each update moves the current model w(k) to the new model w(k+1).]
  standard perceptron converges with exact search

  10. No Convergence w/ Greedy Search
  [Diagram: the same example "time flies" with correct label N V and output space {N,V} x {N,V}; the standard update ∆Φ(x, y, z) moves w(k) to w(k+1), but the new model is still wrong.]
  standard perceptron does not converge with greedy search
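For concreteness, this is roughly what the greedy "search box" on this slide does; the {N, V} tag space comes from the slides, while `score_step` and its feature templates are hypothetical helpers invented for the sketch:

```python
def score_step(w, x, prefix, tag):
    """Local score of extending `prefix` with `tag` (hypothetical helper):
    dot product of w with the features fired by this one transition."""
    feats = {("tag", tag): 1.0,
             ("word-tag", x[len(prefix)], tag): 1.0,
             ("tag-bigram", prefix[-1] if prefix else "<s>", tag): 1.0}
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def greedy_search(x, w, tags=("N", "V")):
    """Greedy decoding: commit to the locally best tag at each position.
    Search errors arise because an early commitment is never revisited,
    even when the full correct sequence scores higher under w."""
    z = []
    for _ in range(len(x)):
        z.append(max(tags, key=lambda t: score_step(w, x, z, t)))
    return z
```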

  11. Early update (Collins/Roark 2004) to the rescue
  [Diagram: the same example; the standard perceptron update ∆Φ(x, y, z) does not converge with greedy search. Early update instead stops and updates at the first mistake, moving w(k) to a new model w(k+1).]

  12. Why?
  • why does inexact search break the convergence property?
  • what is required for convergence? exactness?
  • why does early update (Collins/Roark 04) work?
    • it works well in practice and is now a standard method
    • but there has been no theoretical justification
  • we answer these questions by inspecting the convergence proof

  13. Geometry of Convergence Proof, pt 1
  [Diagram: exact 1-best inference; the perceptron update ∆Φ(x, y, z) moves the current model w(k) to the new model w(k+1).]
  • separation: the unit oracle vector u scores the correct label y above any z by a margin ≥ δ
  • by induction, each update pushes w at least δ further along u  (part 1: lower bound)

  14. Geometry of Convergence Proof, pt 2
  [Diagram: exact 1-best inference; violation: the incorrect label z scored higher than the correct label y under the current model w(k).]
  • violation: incorrect label scored higher, i.e. w(k) · ∆Φ(x, y, z) ≤ 0
  • R: max diameter, so each update grows ‖w‖² by at most R²  (part 2: upper bound)
  • by induction, parts 1+2 => update bound: k ≤ R²/δ²
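Reconstructing the two-part argument of slides 13 and 14 in LaTeX (the standard proof sketch, with $\Delta\Phi(x, y, z) = \Phi(x, y) - \Phi(x, z)$ and $w^{(0)} = 0$):

```latex
\textbf{Part 1 (lower bound).} Separation gives
$u \cdot \Delta\Phi(x, y, z) \ge \delta$ for every update, so by induction
\[
  \lVert w^{(k)} \rVert \;\ge\; u \cdot w^{(k)} \;\ge\; k\delta .
\]
\textbf{Part 2 (upper bound).} Each update is a violation, i.e.\ the
incorrect $z$ scored at least as high as $y$ under $w^{(k)}$, so
$w^{(k)} \cdot \Delta\Phi(x, y, z) \le 0$; with
$\lVert \Delta\Phi(x, y, z) \rVert \le R$ this yields
\[
  \lVert w^{(k+1)} \rVert^2
  \;=\; \lVert w^{(k)} \rVert^2
        + 2\, w^{(k)} \cdot \Delta\Phi(x, y, z)
        + \lVert \Delta\Phi(x, y, z) \rVert^2
  \;\le\; \lVert w^{(k)} \rVert^2 + R^2 ,
\]
hence $\lVert w^{(k)} \rVert^2 \le k R^2$. Combining both parts,
$k^2 \delta^2 \le k R^2$, so $k \le R^2 / \delta^2$.
```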

  15. Violation is All we need!
  • exact search is not really required by the proof
  • rather, it is only used to ensure violation!
  • the proof only uses 3 facts:
    1. separation (margin)
    2. diameter (always finite)
    3. violation (but no need for exact search)
  [Diagram: exact search finds the 1-best z, but any of the many violations (incorrect labels scored higher than the correct y) would do for the update ∆Φ(x, y, z).]

  16. Violation-Fixing Perceptron
  • if we guarantee violation, we don't care about exactness!
  • violation is good b/c we can at least fix a mistake
  • same mistake bound as before!
  [Diagram: standard perceptron (exact inference; update weights if y ≠ z) vs. violation-fixing perceptron (find any violation y', z among all possible updates; update weights if y' ≠ z).]
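A sketch of the generic violation-fixing trainer, reusing `phi` and `defaultdict` from the earlier perceptron sketch; `find_violation` is the pluggable "search box", and the pair it returns may be full structures or same-length prefixes:

```python
def violation_fixing_train(data, find_violation, epochs=5):
    """Violation-fixing perceptron (sketch).  `find_violation(x, y, w)`
    returns a pair (y_part, z_part) -- full outputs or same-length
    prefixes -- such that the incorrect z_part scores at least as high
    as the correct y_part under w: a guaranteed violation.  Exact
    search, early update, and max-violation are all special cases."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            y_part, z_part = find_violation(x, y, w)
            if y_part != z_part:           # the update fixes a real mistake
                for f, v in phi(x, y_part).items():
                    w[f] += v
                for f, v in phi(x, z_part).items():
                    w[f] -= v
    return w
```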

  17. What if we can't guarantee violation
  • this is why perceptron doesn't work well w/ inexact search
  • because not every update is guaranteed to be a violation
  • thus the proof breaks; no convergence guarantee
  • example: beam or greedy search
    • the model might prefer the correct label (under exact search)
    • but the search prunes it away
    • such a non-violation update is "bad" because it doesn't fix any mistake
    • the new model still misguides the search

  18. Standard Update: No Guarantee
  [Diagram: training example "time flies", correct label N V, output space {N,V} x {N,V}. The correct label scores higher under w(k), yet greedy search returns an incorrect z; the standard update ∆Φ(x, y, z) is a non-violation: a bad update!]
  standard update doesn't converge b/c it doesn't guarantee violation

  19. Early Update: Guarantees Violation
  [Diagram: the same example. The standard (full) update doesn't guarantee violation, but early update applies ∆Φ(x, y, z) to prefixes: the incorrect prefix scores higher, so it is a violation!]

  20. Early Update: from Greedy to Beam
  • beam search is a generalization of greedy search (where b=1)
  • at each stage we keep the top b hypotheses
  • widely used: tagging, parsing, translation...
  • early update -- fires when the correct label first falls off the beam; up to this point the incorrect prefix must score higher, so a violation is guaranteed (see the sketch below)
  • standard update (full update) -- no guarantee!
  [Diagram: the correct label falls off the beam (pruned); early update fires at that point, while the standard update waits until the end.]
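A minimal sketch of beam search with early update over the slides' {N, V} tag space (b=1 recovers greedy); `score_prefix` sums the hypothetical `score_step` helper from the earlier greedy sketch:

```python
def score_prefix(w, x, z):
    """Prefix score w . Phi(x, z), summed over local steps (assumed helper)."""
    return sum(score_step(w, x, z[:i], z[i]) for i in range(len(z)))

def early_update_pair(x, y, w, b=2, tags=("N", "V")):
    """Beam search that stops the moment the correct prefix y[:t]
    falls off the beam.  Every surviving prefix then scores at least
    as high as y[:t], so (y[:t], beam-best) is a guaranteed violation."""
    y = tuple(y)
    beam = [()]                                   # start with the empty prefix
    for t in range(1, len(y) + 1):
        cands = [z + (tag,) for z in beam for tag in tags]
        beam = sorted(cands, key=lambda z: score_prefix(w, x, z),
                      reverse=True)[:b]
        if y[:t] not in beam:                     # correct prefix pruned
            return y[:t], beam[0]                 # early update pair
    z = beam[0]
    return (y, z) if z != y else (None, None)     # full update / no mistake
```

Plugged in as `find_violation` in the violation-fixing trainer above (the equal pair (None, None) is skipped by its mistake check), this realizes early update as a violation-fixing perceptron.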

  21. Early Update as Violation-Fixing
  [Diagram: the violation-fixing perceptron over prefix violations: find a violation y', z'; update weights if y' ≠ z'. Early update fires where the correct label falls off the beam; the standard update (bad!) waits until the end.]
  • also a new definition of "beam separability": a correct prefix should score higher than any incorrect prefix of the same length (maybe too strong)
  • cf. Kulesza and Pereira, 2007
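The beam-separability condition can be written out explicitly (a LaTeX rendering of the slide's definition; notation follows the earlier theorem):

```latex
\textbf{Beam separability (maybe too strong).} There exist a unit vector
$u$ and $\delta > 0$ such that for every training pair $(x, y)$, every
length $1 \le t \le |y|$, and every incorrect prefix $z_{1:t} \ne y_{1:t}$:
\[
  u \cdot \Phi(x, y_{1:t}) \;\ge\; u \cdot \Phi(x, z_{1:t}) + \delta .
\]
```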

  22. New Update Methods: max-violation, ...
  [Diagram: candidate update positions along the beam: early, max-violation, latest, and standard (bad!).]
  • we have now established a theory for early update (Collins/Roark)
  • but it learns too slowly due to partial updates
  • max-violation: update at the prefix where the violation is maximum
    • the "worst mistake" in the search space
  • all these update methods are violation-fixing perceptrons
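A sketch of the max-violation variant, reusing `score_prefix` and the {N, V} tag space from the beam sketch above; the choice of where to update is the only difference from early update:

```python
def max_violation_pair(x, y, w, b=2, tags=("N", "V")):
    """Max-violation update (sketch): run the beam to the end, recording
    the beam-best prefix at every length, then update at the length
    where the violation -- beam-best score minus correct-prefix score --
    is largest: the "worst mistake" in the search space."""
    y = tuple(y)
    beam, best_at = [()], []
    for t in range(1, len(y) + 1):
        cands = [z + (tag,) for z in beam for tag in tags]
        beam = sorted(cands, key=lambda z: score_prefix(w, x, z),
                      reverse=True)[:b]
        best_at.append(beam[0])                   # best prefix of length t
    t_star = max(range(1, len(y) + 1),
                 key=lambda t: score_prefix(w, x, best_at[t - 1])
                             - score_prefix(w, x, y[:t]))
    return y[:t_star], best_at[t_star - 1]        # equal pair if no mistake
```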

  23. Experiments
  • trigram part-of-speech tagging: x = "the man bit the dog", y = DT NN VBD DT NN -- local features only; exact search tractable (proof of concept)
  • incremental dependency parsing: x = "the man bit the dog", y = a dependency tree over the words -- non-local features; exact search intractable (real impact)
