Viterbi Training Improves Unsupervised Dependency Parsing

The Problem
Learning: EM, via inside-outside re-estimation

sentences {s}, legal parse trees t ∈ T(s), and a gold t∗
non-convex objective, very sensitive to initialization
maximizing the probability of data (sentence strings):

\[ \hat{\theta}_{\mathrm{UNS}} = \arg\max_{\theta} \prod_{s} \underbrace{\sum_{t \in T(s)} P_{\theta}(t)}_{P_{\theta}(s)} \]

supervised objective would be convex (counting):

\[ \hat{\theta}_{\mathrm{SUP}} = \arg\max_{\theta} \prod_{s} P_{\theta}(t^{*}(s)) \]

Spitkovsky et al. (Stanford & Google) Viterbi EM CoNLL (2010-07-15) 7 / 26
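As a rough illustration of the two objectives on this slide, the sketch below evaluates both in log space on a toy corpus. The tree names and probabilities are invented for illustration, and each sentence's legal trees are enumerated explicitly, which is only feasible for tiny examples; it is not the inside-outside procedure the talk refers to.

```python
import math

def unsupervised_log_likelihood(corpus):
    """Sum over sentences of log P(s), where P(s) sums P(t) over all legal trees."""
    return sum(math.log(sum(tree_probs.values())) for tree_probs in corpus)

def supervised_log_likelihood(corpus, gold):
    """Sum over sentences of log P(t*), the probability of the gold tree only."""
    return sum(math.log(tree_probs[g]) for tree_probs, g in zip(corpus, gold))

# Hypothetical toy corpus: one dict per sentence, mapping tree id -> P(t).
corpus = [
    {"t1": 0.02, "t2": 0.06},
    {"t1": 0.01, "t2": 0.03, "t3": 0.01},
]
gold = ["t2", "t1"]  # hypothetical gold trees t*

uns = unsupervised_log_likelihood(corpus)
sup = supervised_log_likelihood(corpus, gold)
print(f"unsupervised: {uns:.4f}  supervised: {sup:.4f}")
```

Because the gold tree is one of the legal trees, the supervised value can never exceed the unsupervised one for the same parameters.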

The Data
Standard Corpus: WSJk

The Wall Street Journal section of the Penn Treebank Project (Marcus et al., 1993)
◮ ... stripped of punctuation, etc.
◮ ... rid of sentences left with more than k POS tags;
◮ ... and converted to reference dependencies {t∗}, using “head percolation rules” (Collins, 1999).
Training: traditionally, WSJ10 (Klein, 2005);
Evaluation: Section 23 of WSJ∞ (all sentences).
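The first two preprocessing steps can be sketched as a small filter over POS-tag sequences. This is a minimal sketch under stated assumptions: the punctuation tag set and the function name `wsj_k` are illustrative choices, not the authors' actual pipeline, and the conversion to reference dependencies is not shown.

```python
# Assumed set of Penn Treebank punctuation tags (illustrative, possibly incomplete).
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}

def wsj_k(sentences, k):
    """Strip punctuation tags, then keep sentences left with at most k POS tags."""
    kept = []
    for tags in sentences:
        stripped = [t for t in tags if t not in PUNCT_TAGS]
        if len(stripped) <= k:
            kept.append(stripped)
    return kept

# Hypothetical tag sequences standing in for treebank sentences.
sentences = [
    ["DT", "NN", "VBZ", "."],
    ["NNP", ",", "NNP", "VBD", "DT", "JJ", "NN", "."],
]
wsj3 = wsj_k(sentences, 3)
wsj10 = wsj_k(sentences, 10)
```

With k = 3 only the first sentence survives, because its length is measured after punctuation is removed.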

The Data
Standard Corpus: WSJk

[Figure: sentences (1,000s, left axis) and tokens (1,000s, right axis) in WSJk, for k = 5 to 45.]

Results
Classic EM: The Lay of the Land

[Figure: directed dependency accuracy (%) on WSJ40, plotted against WSJk for k = 5 to 40; curves labeled Oracle, Uninformed, and K&M∗.]

Results
Viterbi EM: Results!

[Figure: directed dependency accuracy (%) on WSJ40, plotted against WSJk for k = 5 to 40; curves labeled Oracle, Uninformed, and K&M∗.]

Results: Viterbi EM
State-of-the-Art

                                                     Section 23 of WSJ∞   Brown100
Right-Branching Baseline (Klein and Manning, 2004)          32%
DMV with Classic EM      (Klein and Manning, 2004)          34%
                         (Spitkovsky et al., 2010)          45%              43%
DMV with Viterbi EM (+5%)
    with Smoothing                                          45%              48%   (+3%)
    + Clever Initialization                                 48%              51%

Interpretation: Why Does Viterbi EM Work?

in theory, Viterbi is a quick-and-dirty approximation
in theory, Communism works...
in practice, EM emulates supervised learning: s → {t} = T(s)
Classic EM: w_t = P_θ(t | s)
clearly, this is redistribution of wealth (mass)
  - also, resembles an omniscient central planner (knows the true value of everything at all times)
  - could work, given a very powerful model θ...
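To make the weights w_t = P_θ(t | s) concrete, here is a minimal sketch of how Classic EM could turn them into fractional counts for a single sentence. The event labels and probabilities are hypothetical, and trees are enumerated explicitly rather than summed with inside-outside.

```python
from collections import Counter

def expected_counts(trees):
    """trees: list of (P(t), events) pairs for one sentence.

    Each tree contributes its events with weight w_t = P(t | s) = P(t) / P(s).
    """
    total = sum(p for p, _ in trees)  # P(s), the sum over all legal trees
    counts = Counter()
    for p, events in trees:
        w = p / total
        for event in events:
            counts[event] += w
    return counts

# Hypothetical trees for one sentence, each a bag of attachment events.
trees = [
    (0.06, ["V->N", "N->D"]),
    (0.02, ["N->V", "N->D"]),
]
counts = expected_counts(trees)
```

Every legal tree receives some share of the count mass, in proportion to how the current model values it.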

Interpretation: How Does Classic EM Fail?

our model is quite weak (e.g., doesn’t handle agreement)
reserves a lot of mass for ludicrous parse trees...
  - each entitled to non-trivial support by the distribution
at small scales, this is not a problem (short sentences)
  - only so many possible parses → few free-loaders
eventually, exponentially many trees (unwashed masses)
result: a dog of a probability distribution...
... wagged by its very long tail
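The growth in the number of trees can be checked by brute force for very short sentences. The sketch below counts dependency trees under two assumptions the slides do not spell out: trees are projective, and exactly one word attaches to the artificial root. The function names are illustrative.

```python
from itertools import product

def is_projective_tree(heads):
    """heads[d - 1] is the head of word d (1-based); 0 is the artificial root."""
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:
        return False  # assumption: exactly one word attaches to the root

    def ancestors(d):
        seen = set()
        while d != 0:
            d = heads[d - 1]
            if d in seen:
                return None  # cycle (also catches a word heading itself)
            seen.add(d)
        return seen

    anc = [ancestors(d) for d in range(1, n + 1)]
    if any(a is None for a in anc):
        return False
    # Projectivity: every word strictly between a head and its dependent
    # must be a descendant of that head.
    for d in range(1, n + 1):
        h = heads[d - 1]
        for k in range(min(h, d) + 1, max(h, d)):
            if h not in anc[k - 1]:
                return False
    return True

def count_projective_trees(n):
    """Count valid head assignments for a sentence of n words."""
    return sum(is_projective_tree(heads)
               for heads in product(range(n + 1), repeat=n))

tree_counts = [count_projective_trees(n) for n in range(1, 6)]
print(tree_counts)
```

Enumeration is already slow beyond a handful of words, which is the point: at realistic sentence lengths the tail of candidate trees is enormous.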

Interpretation: Ideological Difference!

Viterbi EM is powered by greed (much like Capitalism)
does not require ability to properly value all parse trees
so long as it can spot a decent one (winner-take-all)
different (weaker?) requirement on models: (like IR)
  - θ needs to be just discriminative enough! (ranking)
at small scales, data are too sparse (markets are illiquid)
improves with more data (statistics become efficient)
  - really, what we want from unsupervised learners!
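The winner-take-all behavior can be contrasted with a long tail using a toy example. In this sketch (hypothetical probabilities and event labels, with explicit enumeration of trees), one decent tree competes with 100 low-probability trees; Viterbi EM counts only the best tree, while under posterior weighting the tail collectively holds most of the mass.

```python
from collections import Counter

def viterbi_counts(trees):
    """trees: list of (P(t), events) pairs; count events of the single best tree."""
    _, best_events = max(trees, key=lambda pair: pair[0])
    return Counter(best_events)

def tail_mass(trees):
    """Posterior mass held by every tree other than the best one."""
    total = sum(p for p, _ in trees)
    best = max(p for p, _ in trees)
    return 1.0 - best / total

# One decent tree plus a hypothetical tail of 100 implausible trees.
decent = (0.05, ["V->N", "N->D"])
tail = [(0.001, ["D->V", "D->N"])] * 100
trees = [decent] + tail

hard = viterbi_counts(trees)
mass = tail_mass(trees)
print(hard, round(mass, 3))
```

Here the model only has to rank the decent tree first; it does not need to assign sensible probabilities to the other hundred.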

Interpretation: Summary

Viterbi EM: focus on the individual best parse trees

Viterbi Training Improves Unsupervised Dependency Parsing. Valentin I. Spitkovsky, with Hiyan Alshawi (Google Inc.), Daniel Jurafsky (Stanford University), and Christopher D. Manning (Stanford University).
