Optimizing Spectral Learning for Parsing
Shashi Narayan, Shay Cohen
School of Informatics, University of Edinburgh
ACL, August 2016
Probabilistic CFGs with Latent States (Matsuzaki et al., 2005; Prescher, 2005)

[Figure: the parse tree for "the dog saw the cat" next to the same tree with latent-state annotations, e.g. S_1 → NP_3 VP_2, with D_1, N_2, V_4, NP_5 on the preterminals]

Latent states play the role of nonterminal subcategorization, e.g., NP → { NP_1, NP_2, ..., NP_24 }
◮ analogous to syntactic heads as in lexicalization (Charniak, 1997)
◮ they are not part of the observed data in the treebank
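As a minimal sketch of the model (our notation, not the slides'), an L-PCFG scores an observed skeletal tree t by summing over latent-state assignments to its N nonterminal nodes:

\[
p(t) \;=\; \sum_{h_1,\dots,h_N} \pi\bigl(a_1^{h_1}\bigr)
\prod_{\text{binary rules } a^{h} \to b^{h'} c^{h''} \text{ in } t} p\bigl(b^{h'}\, c^{h''} \mid a^{h}\bigr)
\prod_{\text{lexical rules } a^{h} \to w \text{ in } t} p\bigl(w \mid a^{h}\bigr)
\]

where \(\pi\) is the distribution over latent-annotated root symbols.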
Estimating PCFGs with Latent States (L-PCFGs)

EM algorithm (Matsuzaki et al., 2005; Petrov et al., 2006)
⇓ Suffers from local maxima; it provides no consistency guarantee because it does not find the global maximum of the log-likelihood

Spectral algorithms (Cohen et al., 2012, 2014)
⇑ Statistically consistent estimation that makes use of spectral decomposition
⇑ Much faster training than the EM algorithm
⇓ Have lagged behind EM in their empirical results
Overview

Builds on work on spectral algorithms for latent-state PCFGs (L-PCFGs) for parsing (Cohen et al., 2012, 2014; Cohen and Collins, 2014; Narayan and Cohen, 2015)

Conventional approach: the number of latent states for each nonterminal in an L-PCFG is decided in isolation

Contributions:
A. Parsing results improve significantly if the number of latent states for each nonterminal is globally optimized
◮ Petrov et al. (2006) demonstrated that coarse-to-fine techniques that carefully select the number of latent states improve accuracy
B. The optimized spectral method beats coarse-to-fine expectation-maximization on 6 (Basque, Hebrew, Hungarian, Korean, Polish and Swedish) out of 8 SPMRL datasets
Intuition behind the Spectral Algorithm

Inside and outside trees at the VP node of "the dog saw him":
◮ the outside tree o is everything outside the VP node (S spanning NP (D "the", N "dog") and the empty VP slot)
◮ the inside tree t is the subtree rooted at VP (VP → V NP, "saw him")

[Figure: the tree for "the dog saw him" split at the VP node into its outside tree o and inside tree t]

The two are conditionally independent given the label and the hidden state:
p(o, t | VP, h) = p(o | VP, h) × p(t | VP, h)
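This independence is what the SVD step on the next slide exploits. With an outside feature map \(\phi\) and an inside feature map \(\psi\) (our notation), the cross-covariance matrix of a nonterminal \(a\) factors through its \(m\) latent states, so its rank is at most \(m\):

\[
\Omega^{a}_{ij} \;=\; \mathbb{E}\bigl[\phi_i(o)\,\psi_j(t) \mid a\bigr]
\;=\; \sum_{h=1}^{m} p(h \mid a)\;\mathbb{E}\bigl[\phi_i(o) \mid a, h\bigr]\;\mathbb{E}\bigl[\psi_j(t) \mid a, h\bigr],
\qquad \operatorname{rank}\bigl(\Omega^{a}\bigr) \le m .
\]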
Recent Advances in Spectral Estimation

SVD step: singular value decomposition (SVD) of a cross-covariance matrix for each nonterminal

Method of moments (Cohen et al., 2012, 2014)
◮ Averaging with SVD parameters ⇒ dense estimates

Clustering variant (Narayan and Cohen, 2015)
◮ Clusters low-dimensional node representations; each node receives a latent-state (cluster) id, e.g. S[1] → NP[4] VP[3], D[7] N[4] V[1] N[1]
[Figure: a tree with each node tagged with its cluster id, next to the vector representation being clustered, e.g. S[1] ↔ (1, 1, 0, 1, ...)]
⇒ Sparse estimates
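A toy sketch (our own, not the authors' released code) of the two steps for one nonterminal: (1) the SVD of the empirical inside-outside cross-covariance matrix, and (2) the clustering variant, k-means over the low-dimensional inside vectors. The feature dimensions, the latent-state count m, and the random stand-in features are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data: n occurrences of one nonterminal, each with an outside feature
# vector phi(o) and an inside feature vector psi(t) (random stand-ins here).
n, d_out, d_in, m = 1000, 50, 40, 8
Phi = rng.random((n, d_out))   # rows: phi(o) for each occurrence
Psi = rng.random((n, d_in))    # rows: psi(t) for each occurrence

# (1) Empirical cross-covariance and its truncated SVD.
Omega = (Phi.T @ Psi) / n                  # d_out x d_in
U, S, Vt = np.linalg.svd(Omega, full_matrices=False)
U_m, Vt_m = U[:, :m], Vt[:m, :]            # keep the top m singular directions

# (2) Clustering variant: project each inside tree to m dimensions and
# cluster; the cluster id plays the role of the latent state.
Z = Psi @ Vt_m.T                           # n x m low-dimensional inside vectors
states = KMeans(n_clusters=m, n_init=10, random_state=0).fit_predict(Z)
print(states[:10])                         # latent-state id per occurrence
```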
Standard Spectral Estimation and the Number of Latent States

⇑ A natural way to choose the number of latent states: the number of non-zero singular values
⇑ The number of latent states for each nonterminal can be decided in isolation
⇓ This conventional approach fails to take into account interactions between different nonterminals
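Continuing the sketch above, the per-nonterminal choice can be read off the singular values S; the relative tolerance is our assumption:

```python
# Number of latent states for this nonterminal: count singular values that
# are numerically non-zero (tolerance relative to the largest one).
tol = 1e-6 * S[0]
m_a = int(np.sum(S > tol))
```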
Optimizing Latent States for Various Nonterminals

Input:
◮ A treebank divided into a training set and a development set
◮ A basic spectral estimation algorithm
◮ A default mapping from each nonterminal to a fixed number of latent states, f_def: { S → 24, NNP → 24, VP → 24, DT → 24, ... }

Output:
◮ An optimized mapping f_opt: { S → 40, NNP → 81, VP → 35, DT → 4, ... }
Algorithm in a nutshell:
◮ iterate through the nonterminals, changing the number of latent states,
◮ estimate the grammar on the training set, and
◮ optimize parsing accuracy on the development set

A beam search over multidimensional vectors of latent-state counts, optimizing their global interaction (sketched in code after the walkthrough below)
Walkthrough of the beam search:

time 0: start from f_def, with 24 latent states for every nonterminal (DT 24, S 24, NP 24, ...) and dev-set score F1_def

time t: the counts fixed so far are kept (e.g. DT 4, S 37); the beam holds N candidate maps f_{m_1}, ..., f_{m_N} that vary the count for the next nonterminal (NP), with scores F1_{m_1}, ..., F1_{m_N}

The clustering variant of spectral estimation leads to compact models and is relatively fast, which makes this repeated re-estimation practical
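A schematic sketch of the search (our reconstruction, not the released code). The callback `estimate_and_score` stands in for "estimate the grammar with these state counts, parse the dev set, return F1"; it, the candidate counts, and the beam width are all assumptions.

```python
def optimize_states(nonterminals, f_def, estimate_and_score,
                    beam_width=4, candidates=(4, 8, 16, 24, 32, 40)):
    # Each beam entry is (dev-set F1, map from nonterminal to state count).
    beam = [(estimate_and_score(f_def), dict(f_def))]
    for nt in nonterminals:                 # one pass; could be repeated
        expanded = []
        for score, f in beam:
            expanded.append((score, f))     # keep the current setting
            for m in candidates:            # try other counts for nt
                if m == f[nt]:
                    continue
                g = dict(f)
                g[nt] = m
                expanded.append((estimate_and_score(g), g))
        # Keep the top beam_width latent-state maps by dev-set F1.
        expanded.sort(key=lambda x: x[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0]                          # best (F1, state map) found
```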
Experiments: The SPMRL Dataset

8 morphologically rich languages: Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish

Treebanks of varying sizes, from 5,000 sentences (Hebrew and Swedish) to 40,472 sentences (German)
Results on the Swedish Dataset: Dev Set

[Bar chart: F-measures on the Swedish dev set for berkeley (Petrov et al., 2006; numbers from Björkelund et al., 2013), cluster (Narayan and Cohen, 2015) and moments (Cohen et al., 2013), revealed incrementally with and without optimized latent states. Scores shown: 75.50, 73.40 and 71.40 for the baselines, then 75.20 and 75.50 for the optimized spectral models.]
Results on the Swedish Dataset: Final Results on the Test Set

[Bar chart: F-measures on the Swedish test set for the same systems; scores shown: 80.90, 80.60, 79.40, 78.40, 76.40.]
Final Results on the SPMRL Dataset

[Bar chart: test-set F-measures for Berkeley vs. the optimized spectral model on all 8 languages (Basque, French, German, Hebrew, Hungarian, Korean, Polish, Swedish). Bar labels: 91.8, 89.2, 89.0, 87.0, 86.8, 85.2, 81.4, 80.9, 80.6, 80.4, 80.0, 79.1, 78.6, 78.3, 78.2, 74.7; the pairing of scores with languages is not recoverable from the extraction.]

◮ Berkeley results are taken from Björkelund et al., 2013.
Conclusion

Spectral parsing results improve significantly if the number of latent states for each nonterminal is globally optimized

The optimized spectral algorithm beats the coarse-to-fine EM algorithm on 6 (Basque, Hebrew, Hungarian, Korean, Polish and Swedish) out of 8 SPMRL datasets

The Rainbow parser and multilingual models: http://cohort.inf.ed.ac.uk/lpcfg/

Acknowledgments: thanks to David McClosky, Eugene Charniak, DK Choe, Geoff Gordon, Djamé Seddah, Thomas Müller, Anders Björkelund and the anonymous reviewers
Inside Features Used

Consider the VP node in the tree for "the cat saw the dog":
[Figure: S → NP (D "the", N "cat") VP (V "saw", NP (D "the", N "dog"))]

The inside features consist of:
◮ the pairs (VP, V) and (VP, NP)
◮ the rule VP → V NP
◮ the tree fragment (VP (V saw) NP)
◮ the tree fragment (VP V (NP D N))
◮ the pair of the head part-of-speech tag with VP: (VP, V)
Outside Features Used

Consider the D node of "the dog" (marked D*) in the same tree:
[Figure: the tree for "the cat saw the dog" with the second determiner marked D*]

The outside features consist of:
◮ the pairs (D, NP) and (D, NP, VP)
◮ the pair of the head part-of-speech tag with D: (D, N)
◮ the tree fragments (NP D* N), (VP V (NP D* N)) and (S NP (VP V (NP D* N)))
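A toy sketch (our own illustration, not the parser's code) of extracting the inside features listed above from a node; outside features would be read off the surrounding context analogously. Trees are (label, children...) tuples, a preterminal is (tag, word), and the head child index is assumed known.

```python
vp = ("VP", ("V", "saw"), ("NP", ("D", "the"), ("N", "dog")))

def one_level(child):
    # Expand a child one level: keep grandchild labels; for a preterminal
    # the level below the tag is the word itself, e.g. (V saw).
    return (child[0],) + tuple(g[0] if isinstance(g, tuple) else g
                               for g in child[1:])

def inside_features(node, head_index=0):
    label, children = node[0], node[1:]
    feats = []
    for c in children:                      # the pairs (VP, V) and (VP, NP)
        feats.append((label, c[0]))
    feats.append((label,) + tuple(c[0] for c in children))  # rule VP -> V NP
    for i in range(len(children)):          # fragments expanding one child:
        frag = tuple(one_level(children[j]) if j == i else children[j][0]
                     for j in range(len(children)))
        feats.append((label,) + frag)       # (VP (V saw) NP), (VP V (NP D N))
    feats.append((label, children[head_index][0]))  # head POS pair (VP, V)
    return feats

print(inside_features(vp))
```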
Variants of Spectral Estimation

◮ SVD variants: singular value decomposition of empirical count matrices (cross-covariance matrices) to estimate grammar parameters (Cohen et al., 2012, 2014)
◮ Convex EM variant: an "anchor method" that identifies features that uniquely identify latent states (Cohen and Collins, 2014)
◮ Clustering variant: a simplified version of the SVD variant that clusters low-dimensional representations into latent states (Narayan and Cohen, 2015); intuitive to understand and very (computationally) efficient