
Viterbi Training Improves Unsupervised Dependency Parsing

Viterbi Training Improves Unsupervised Dependency Parsing. Valentin I. Spitkovsky, with Hiyan Alshawi (Google Inc.), Daniel Jurafsky (Stanford University), and Christopher D. Manning (Stanford University).


  5. The Problem / Learning. Learning: EM, via inside-outside re-estimation
     - sentences $\{s\}$, legal parse trees $t \in T(s)$, and a gold $t^*$
     - non-convex objective, very sensitive to initialization
     - maximizing the probability of data (sentence strings):
       $$\hat{\theta}_{\text{UNS}} = \arg\max_{\theta} \prod_{s} \underbrace{\sum_{t \in T(s)} P_{\theta}(t)}_{P_{\theta}(s)}$$
     - supervised objective would be convex (counting):
       $$\hat{\theta}_{\text{SUP}} = \arg\max_{\theta} \prod_{s} P_{\theta}(t^*(s))$$
     Spitkovsky et al. (Stanford & Google), Viterbi EM, CoNLL (2010-07-15), slide 7 / 26
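
To make the two objectives on this slide concrete, here is a minimal Python sketch on a toy problem. Everything in it is an assumption for illustration: each sentence comes with an explicit, tiny list of candidate trees, a tree is just a tuple of event names, and all events share one multinomial `theta`. The real model factors differently and sums over trees with inside-outside rather than by enumeration.

```python
import math
from collections import Counter

def tree_prob(theta, tree):
    # Toy P_theta(t): product of the probabilities of the tree's events.
    return math.prod(theta[event] for event in tree)

def unsupervised_log_likelihood(theta, candidates):
    # log of prod_s sum_{t in T(s)} P_theta(t), with T(s) listed explicitly.
    return sum(
        math.log(sum(tree_prob(theta, t) for t in trees))
        for trees in candidates.values()
    )

def supervised_log_likelihood(theta, gold):
    # log of prod_s P_theta(t*(s)), using one gold tree per sentence.
    return sum(math.log(tree_prob(theta, t)) for t in gold.values())

def supervised_mle(gold):
    # "Counting": relative frequencies of events in the gold trees
    # maximize the supervised objective for this single-multinomial toy.
    counts = Counter(event for tree in gold.values() for event in tree)
    total = sum(counts.values())
    return {event: c / total for event, c in counts.items()}

# Hypothetical data: two sentences, two candidate trees each.
candidates = {"s1": [("a", "b"), ("a", "c")], "s2": [("b",), ("c",)]}
gold = {"s1": ("a", "b"), "s2": ("b",)}
theta = {"a": 0.5, "b": 0.3, "c": 0.2}

print(unsupervised_log_likelihood(theta, candidates))
print(supervised_log_likelihood(theta, gold))
print(supervised_mle(gold))
```

The supervised estimate has a closed form (counts divided by totals), while the unsupervised objective has a sum inside the product and generally needs an iterative procedure such as EM.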

  12. The Data / WSJ. Standard Corpus: WSJk
      - The Wall Street Journal section of the Penn Treebank Project (Marcus et al., 1993)
        - ... stripped of punctuation, etc.
        - ... rid of sentences left with more than k POS tags;
        - ... and converted to reference dependencies $\{t^*\}$, using "head percolation rules" (Collins, 1999).
      - Training: traditionally, WSJ10 (Klein, 2005);
      - Evaluation: Section 23 of WSJ∞ (all sentences).
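
The WSJk construction above can be sketched in a few lines of Python. This is only an illustration: the function name `wsj_k`, the input format (lists of `(word, tag)` pairs), and the exact punctuation tag set are assumptions, since the slide only says "punctuation, etc." The conversion to dependencies with head percolation rules is not shown.

```python
# Assumed punctuation tags (Penn Treebank style); the talk does not list them.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}

def wsj_k(tagged_sentences, k):
    """Strip punctuation, then keep sentences left with at most k POS tags."""
    kept = []
    for sentence in tagged_sentences:
        stripped = [(word, tag) for word, tag in sentence if tag not in PUNCT_TAGS]
        # Dropping sentences that become empty is an extra assumption.
        if 0 < len(stripped) <= k:
            kept.append(stripped)
    return kept

# Hypothetical input.
sentences = [
    [("Stocks", "NNS"), ("fell", "VBD"), (".", ".")],
    [("The", "DT"), ("market", "NN"), ("rallied", "VBD"), (".", ".")],
]
print(wsj_k(sentences, 2))
```

With `k=2` only the first sentence survives, because the length check is applied after punctuation has been removed.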

  14. The Data / WSJ. Standard Corpus: WSJk
      [Figure: size of WSJk for k = 5 to 45, in sentences (1,000s; axis from 5 to 45) and tokens (1,000s; axis from 100 to 900).]

  21. Results / Classic EM. Classic EM: The Lay of the Land
      [Figure: directed dependency accuracy (%) on WSJ40 (y-axis, 10 to 70) plotted against WSJk (x-axis, k = 5 to 40); curves labeled Oracle, Uninformed, and K&M*.]
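
The metric on the y-axis, directed dependency accuracy, is the percentage of tokens whose predicted head matches the reference head. A minimal sketch follows, assuming each parse is given as a list of head indices (one per token, with 0 standing for the root); that encoding and the function name are conventions chosen here, not taken from the slides.

```python
def directed_accuracy(gold_heads, pred_heads):
    """Percentage of tokens whose predicted head equals the gold head."""
    correct = 0
    total = 0
    for gold, pred in zip(gold_heads, pred_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted parses differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return 100.0 * correct / total

# Hypothetical parses: two sentences, four tokens in total, one wrong head.
gold = [[2, 0, 2], [0]]
pred = [[2, 0, 1], [0]]
print(directed_accuracy(gold, pred))
```

The score is pooled over tokens rather than averaged per sentence, so longer sentences carry more weight.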

  24. Results / Viterbi EM. Viterbi EM: Results!
      [Figure: directed dependency accuracy (%) on WSJ40 (y-axis, 10 to 70) plotted against WSJk (x-axis, k = 5 to 40); curves labeled Oracle, Uninformed, and K&M*.]

  32. Results / Viterbi EM. State-of-the-Art

      | System                                            | Section 23 of WSJ∞ | Brown100 |
      |---------------------------------------------------|--------------------|----------|
      | Right-Branching Baseline (Klein and Manning, 2004) | 32%                |          |
      | DMV with Classic EM (Klein and Manning, 2004)      | 34%                |          |
      | DMV with Classic EM (Spitkovsky et al., 2010)      | 45%                | 43%      |
      | DMV with Viterbi EM, with Smoothing                | 45%                | 48%      |
      | DMV with Viterbi EM, + Clever Initialization       | 48%                | 51%      |

      The slide annotates the Viterbi EM rows with (+5%) and (+3%). These match the Brown100 gains from 43% to 48% and from 48% to 51%; the +3% also matches the WSJ gain from 45% to 48%.

  42. Interpretation: Why Does Viterbi EM Work?
      - in theory, Viterbi is a quick-and-dirty approximation
      - in theory, Communism works...
      - in practice, EM emulates supervised learning: $s \to \{t\} = T(s)$
      - Classic EM: $w_t = P_{\theta}(t \mid s)$
      - clearly, this is redistribution of "wealth" (probability mass)
        - also, resembles an omniscient central planner (knows the true value of everything at all times)
        - could work, given a very powerful model $\theta$...
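
The classic EM weighting $w_t = P_{\theta}(t \mid s)$ can be sketched as follows. This is an illustrative toy, not the inside-outside computation used in practice: the candidate trees of one sentence are listed explicitly, a tree is a tuple of event names, and `soft_e_step` is a name made up here.

```python
from collections import defaultdict

def soft_e_step(trees, tree_probs):
    """Classic EM E-step for one sentence on an explicit candidate list.

    Every tree t contributes its events with fractional weight
    w_t = P(t) / sum over t' of P(t'), which equals P(t | s).
    """
    z = sum(tree_probs)
    counts = defaultdict(float)
    for tree, p in zip(trees, tree_probs):
        w = p / z
        for event in tree:
            counts[event] += w
    return dict(counts)

# Hypothetical sentence with two candidate trees.
trees = [("a", "b"), ("a", "c")]
tree_probs = [0.3, 0.1]
print(soft_e_step(trees, tree_probs))
```

Every candidate tree receives some share of the sentence's unit of mass, which is the "redistribution" the slide refers to.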

  52. Interpretation: How Does Classic EM Fail?
      - our model is quite weak (e.g., doesn't handle agreement)
      - reserves a lot of mass for ludicrous parse trees...
        - each entitled to non-trivial support by the distribution
      - at small scales, this is not a problem (short sentences)
        - only so many possible parses → few free-loaders
      - eventually, exponentially many trees (unwashed masses)
      - result: a dog of a probability distribution...
        - ... wagged by its very long tail
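
A rough back-of-the-envelope sketch of the long-tail effect described above. The assumptions are loud ones: every non-best tree is given the same small probability relative to the best tree (`tail_ratio`, a made-up parameter), and the number of trees is taken from Cayley's formula, n^(n-1), which counts all single-rooted trees over n words, including non-projective ones. A projective model has fewer trees, but the count still grows exponentially.

```python
def best_tree_posterior(n_words, tail_ratio):
    """Posterior share of the best tree when each other tree is
    tail_ratio times as probable as the best one (toy assumption)."""
    n_trees = n_words ** (n_words - 1)  # Cayley: rooted labeled trees on n nodes
    return 1.0 / (1.0 + (n_trees - 1) * tail_ratio)

for n in (1, 3, 5, 10):
    print(n, best_tree_posterior(n, tail_ratio=0.01))
```

Even when each individual alternative is a hundred times less likely than the best tree, the best tree's share of the posterior shrinks quickly with sentence length, because the alternatives vastly outnumber it.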

  61. Interpretation: Ideological Difference!
      - Viterbi EM is powered by greed (much like Capitalism)
      - does not require ability to properly value all parse trees
      - so long as it can spot a decent one (winner-take-all)
      - different (weaker?) requirement on models: (like IR)
        - $\theta$ needs to be just discriminative enough! (ranking)
      - at small scales, data are too sparse (markets are illiquid)
      - improves with more data (statistics become efficient)
        - really, what we want from unsupervised learners!
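
The winner-take-all idea can be sketched as a hard-EM loop on a toy problem. As a hedge: this is not the DMV or the authors' implementation. Candidate trees are enumerated explicitly, all events share one multinomial, and the add-one `smoothing` default is an assumption (the talk mentions smoothing without specifying it on these slides).

```python
import math
from collections import Counter

def tree_prob(theta, tree):
    # Toy P_theta(t): product of the probabilities of the tree's events.
    return math.prod(theta[event] for event in tree)

def viterbi_em(candidates, theta, iterations=10, smoothing=1.0):
    """Hard EM: each sentence gives all of its mass to its single best tree."""
    for _ in range(iterations):
        counts = Counter()
        for trees in candidates:
            # E-step: pick the best tree only; the model just has to rank.
            best = max(trees, key=lambda t: tree_prob(theta, t))
            counts.update(best)
        # M-step: smoothed relative frequencies over the known events.
        total = sum(counts.values()) + smoothing * len(theta)
        theta = {e: (counts[e] + smoothing) / total for e in theta}
    return theta

# Hypothetical data: two sentences, two candidate trees each.
candidates = [[("a", "b"), ("a", "c")], [("b",), ("c",)]]
theta0 = {"a": 0.4, "b": 0.35, "c": 0.25}
print(viterbi_em(candidates, theta0))
```

Compared with classic EM, the only change is in the E-step: counts come from one tree per sentence instead of a posterior-weighted mixture of all of them.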

  63. Interpretation: Summary
      - Viterbi EM: focus on the individual best parse trees
