Spectral Methods for Natural Language Processing. Karl Stratos. Thesis Defense. Committee: David Blei, Michael Collins, Daniel Hsu, Slav Petrov, and Owen Rambow. 1 / 53
Latent-Variable Models in NLP. Models with latent/hidden variables are widely used for unsupervised and semi-supervised NLP tasks. Some examples: 1. Word clustering (Brown et al., 1992) 2. Syntactic parsing (Matsuzaki et al., 2005; Petrov et al., 2006) 3. Label induction (Haghighi and Klein, 2006; Berg-Kirkpatrick et al., 2010) 4. Machine translation (Brown et al., 1993) 2 / 53
Computational Challenge. Latent variables → (generally) intractable computation. ◮ Learning HMMs: intractable (Terwijn, 2002) ◮ Learning topic models: NP-hard (Arora et al., 2012) ◮ Many other hardness results. Common approach: EM, gradient-based search (SGD, L-BFGS). ◮ No global optimality guaranteed! ◮ In this sense, these are heuristics. 3 / 53
Why Not Heuristics? Heuristics are often sufficient for empirical purposes. ◮ EM, SGD, L-BFGS: remarkably successful training methods ◮ Do have weak guarantees (convergence to a local optimum) ◮ Ways to deal with local optima issues (careful initialization, random restarts, ...) “So why not just use heuristics?” At least two downsides: 1. It impedes the development of new theoretical frameworks: no new understanding of problems that could yield better solutions. 2. It leaves only limited guidance from rigorous theory: black-art tricks that are unreliable and difficult to reproduce. 4 / 53
This Thesis. Derives algorithms for latent-variable models in NLP with provable guarantees. Main weapon: SPECTRAL METHODS (i.e., methods that use singular value decomposition (SVD) or other similar factorizations). Stands on the shoulders of many giants: ◮ Guaranteed learning of GMMs (Dasgupta, 1999) ◮ Dimensionality reduction with CCA (Kakade and Foster, 2007) ◮ Guaranteed learning of HMMs (Hsu et al., 2008) ◮ Guaranteed learning of topic models (Arora et al., 2012) 5 / 53
Main Contributions. Novel spectral algorithms for two NLP tasks. Task 1. Learning lexical representations. (UAI 2014) First provably correct algorithm for clustering words under the language model of Brown et al. (“Brown clustering”). (ACL 2015) New model-based interpretation of smoothed CCA for deriving word embeddings. Task 2. Estimating latent-variable models for NLP. (TACL 2016) Consistent estimator of a model for unsupervised part-of-speech (POS) tagging. (CoNLL 2013) Consistent estimator of a model for supervised phoneme recognition. 6 / 53
Overview: Introduction; Learning Lexical Representations (A Spectral Algorithm for Brown Clustering; A Model-Based Approach for CCA Word Embeddings); Estimating Latent-Variable Models for NLP (Unsupervised POS Tagging with Anchor HMMs; Supervised Phoneme Recognition with Refinement HMMs); Concluding Remarks. 7 / 53
Motivation. Brown clustering algorithm (Brown et al., 1992): ◮ An agglomerative word clustering method ◮ Popular for semi-supervised NLP (Miller et al., 2004; Koo et al., 2008) This method assumes an underlying clustering of words, but is not guaranteed to recover the correct clustering. This work: ◮ Derives a spectral algorithm with a guarantee of recovering the underlying clustering. ◮ Also empirically much faster (up to ∼10 times) 8 / 53
Original Clustering Scheme of Brown et al. (1992). BrownAlg. Input: a sequence of words x_1 ... x_N in vocabulary V, number of clusters m. 1. Initialize each w ∈ V to be its own cluster. 2. For |V| − 1 times, merge a pair of clusters that yields the smallest decrease in p(x_1 ... x_N) under the Brown model when merged. 3. Return a pruning of the resulting tree with m leaf clusters. [Example tree with m = 4 leaf clusters: 00 = {coffee, tea}, 01 = {dog, cat}, 10 = {walk, run}, 11 = {walked, ran}.] 9 / 53
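To make the greedy scheme concrete, below is a minimal (deliberately unoptimized) Python sketch of BrownAlg's merge loop. It scores a candidate clustering by class mutual information, which is the only part of log p(x_1 ... x_N) under the Brown model that depends on the clustering, and it stops at m clusters rather than building the full merge tree and pruning. The function names and the bigrams input (a list of adjacent word pairs) are illustrative assumptions; the actual algorithm of Brown et al. uses incremental updates and a restricted merge window to stay tractable.

import numpy as np
from itertools import combinations

def class_bigram_counts(bigrams, clusters):
    # Count how often a word in cluster i is followed by a word in cluster j.
    cid = {w: k for k, c in enumerate(clusters) for w in c}
    counts = np.zeros((len(clusters), len(clusters)))
    for x, y in bigrams:
        counts[cid[x], cid[y]] += 1
    return counts

def clustering_quality(counts):
    # Class mutual information: the clustering-dependent part of the Brown
    # model log-likelihood (the word-level entropy term is constant).
    p = counts / counts.sum()          # assumes a non-empty bigram list
    pl, pr = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / np.outer(pl, pr)[nz])).sum())

def greedy_brown(bigrams, vocab, m):
    # Start with one cluster per word; repeatedly merge the pair whose merge
    # loses the least quality (BrownAlg, step 2), stopping at m clusters.
    clusters = [frozenset([w]) for w in vocab]
    while len(clusters) > m:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged.append(clusters[i] | clusters[j])
            score = clustering_quality(class_bigram_counts(bigrams, merged))
            if best is None or score > best[0]:
                best = (score, merged)
        clusters = best[1]
    return [set(c) for c in clusters]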
Brown Model = Restricted HMM. [Figure: an HMM lattice with an unobserved class sequence (· · · 3 26 7 · · ·) emitting the observed words (· · · Their product was · · ·).] ◮ Hidden states: m word classes {1 ... m} ◮ Observed states: n word types {1 ... n} ◮ Restriction. Word x belongs to exactly one class C(x). p(x_1 ... x_N) = π_{C(x_1)} × ∏_{i=2}^{N} T_{C(x_i), C(x_{i−1})} × ∏_{i=1}^{N} O_{x_i, C(x_i)}. The model assumes a true class C(x) for each word x. BrownAlg is a greedy heuristic with no guarantee of recovering C(x). 10 / 53
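As a sanity check of the factorization above, here is a short Python sketch that evaluates log p(x_1 ... x_N) for a Brown model. The arrays pi, T, O and the class map C are hypothetical inputs indexed as on this slide (words and classes as integer ids), not parameters from the thesis.

import numpy as np

def brown_log_prob(words, C, pi, T, O):
    # words: sequence of word ids; C[x]: the single class of word x;
    # pi[c]: initial class probability; T[c, c_prev]: class transition
    # probability; O[x, c]: emission probability of word x from class c.
    c = [C[x] for x in words]
    logp = np.log(pi[c[0]]) + np.log(O[words[0], c[0]])
    for i in range(1, len(words)):
        logp += np.log(T[c[i], c[i - 1]]) + np.log(O[words[i], c[i]])
    return logp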
Derivation of a Spectral Algorithm. Key observation. Given the emission parameters O_{x,c}, we can trivially recover the true clustering (by the model restriction). Example with two classes: O = [smile: (0.3, 0); grin: (0.7, 0); frown: (0, 0.2); cringe: (0, 0.8)], columns indexed by classes 1 and 2. Algorithm: put words x, x′ in the same cluster iff O_x / ||O_x|| = O_{x′} / ||O_{x′}||. 11 / 53
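A tiny Python illustration of this observation, using the toy emission matrix above (the row-to-word assignment is reconstructed from the slide): words whose normalized emission rows coincide end up in the same cluster.

import numpy as np

words = ["smile", "grin", "frown", "cringe"]
O = np.array([[0.3, 0.0],   # class 1
              [0.7, 0.0],   # class 1
              [0.0, 0.2],   # class 2
              [0.0, 0.8]])  # class 2

# Normalizing each row collapses all words of a class onto one unit vector,
# so exact equality of normalized rows recovers the clustering.
rows = O / np.linalg.norm(O, axis=1, keepdims=True)
clusters = {}
for w, r in zip(words, rows):
    clusters.setdefault(tuple(np.round(r, 12)), []).append(w)
print(list(clusters.values()))   # [['smile', 'grin'], ['frown', 'cringe']]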
SVD Recovers the Emission Parameters. Theorem. Let U Σ V⊤ be a rank-m SVD of Ω defined by Ω_{x,x′} := p(x, x′) / √(p(x) × p(x′)). Then for some orthogonal Q ∈ R^{m×m}, U = √O Q⊤ (element-wise square root of O). Corollary: words x, x′ are in the same cluster iff U_x / ||U_x|| = U_{x′} / ||U_{x′}||. 12 / 53
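The corollary can be checked numerically on a small synthetic Brown model; the parameters below (pi, T, O, and the class assignment C) are made up for illustration. Building the exact Ω, taking its rank-m SVD, and normalizing the rows of U gives vectors that are identical within a class and orthogonal across classes.

import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
C = np.array([0, 0, 0, 1, 1, 1])             # true class of each word
O = np.zeros((n, m))
for x in range(n):                           # each word emits from exactly one class
    O[x, C[x]] = rng.random()
O /= O.sum(axis=0)                           # each column is a distribution over words
pi = np.array([0.4, 0.6])                    # initial class distribution
T = np.array([[0.7, 0.2],                    # T[c_next, c_prev]
              [0.3, 0.8]])

joint = O @ np.diag(pi) @ T.T @ O.T          # p(x, x') for an adjacent word pair
px, py = joint.sum(axis=1), joint.sum(axis=0)
Omega = joint / np.sqrt(np.outer(px, py))

U = np.linalg.svd(Omega)[0][:, :m]           # rank-m left singular vectors
f = U / np.linalg.norm(U, axis=1, keepdims=True)
print(np.round(f @ f.T, 6))                  # ~1 within a class, ~0 across classes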
Clustering with Empirical Estimates. Ω̂ := empirical estimate of Ω from N samples x_1 ... x_N, with Ω̂_{x,x′} := count(x, x′) / √(count(x) × count(x′)). Û Σ̂ V̂⊤ := rank-m SVD of Ω̂. The Guarantee. If N is large enough (polynomial in the condition number of Ω), C(x) is given by some m-pruning of an agglomerative clustering of f̂(x) := Û_x / ||Û_x||. Proof sketch. Large N ensures small ||Ω − Ω̂||, which ensures the strict separation property for the distance between the f̂(x): C(x) = C(x′) ≠ C(x′′) ⇒ ||f̂(x) − f̂(x′)|| < ||f̂(x) − f̂(x′′)||. The claim follows from Balcan et al. (2008). 13 / 53
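A dense Python sketch of this estimation step, assuming corpus is a list of sentences given as lists of integer word ids 0, ..., n−1 (the experiments in the thesis use sparse counts and a truncated SVD instead):

import numpy as np

def normalized_svd_rows(corpus, n, m):
    # Empirical Omega-hat from adjacent word pairs, its rank-m SVD, and the
    # normalized rows f-hat(x) = U_x / ||U_x|| that get clustered.
    # Assumes every word id 0..n-1 occurs at least once in the corpus.
    count = np.zeros(n)
    pair = np.zeros((n, n))
    for sent in corpus:
        for x in sent:
            count[x] += 1
        for x, y in zip(sent, sent[1:]):
            pair[x, y] += 1
    omega_hat = pair / np.sqrt(np.outer(count, count))
    U = np.linalg.svd(omega_hat)[0][:, :m]
    return U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)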
Summary of the Algorithm. ◮ Compute an empirical estimate Ω̂ from unlabeled text: Ω̂_{x,x′} := count(x, x′) / √(count(x) × count(x′)). ◮ Compute a rank-m SVD: Ω̂ ≈ Û Σ̂ V̂⊤. ◮ Agglomeratively cluster the normalized rows Û_x / ||Û_x||. ◮ Return a pruning of the hierarchy into m leaf clusters. [Example tree with m = 4 leaf clusters: 00 = {coffee, tea}, 01 = {dog, cat}, 10 = {walk, run}, 11 = {walked, ran}.] 14 / 53
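Putting the steps together, a sketch of the end-to-end pipeline. SciPy's generic average-linkage hierarchy stands in for the particular agglomerative procedure analyzed in the thesis, and normalized_svd_rows is the sketch from the previous slide; the usage at the end with the toy vocabulary is hypothetical.

from scipy.cluster.hierarchy import linkage, fcluster

def spectral_brown_clusters(corpus, vocab, m):
    # corpus: list of sentences as lists of word ids into vocab (a list of strings).
    f = normalized_svd_rows(corpus, len(vocab), m)       # Omega-hat -> SVD -> normalized rows
    tree = linkage(f, method="average")                  # agglomerative merge hierarchy
    labels = fcluster(tree, t=m, criterion="maxclust")   # prune to m leaf clusters
    clusters = {}
    for word, lab in zip(vocab, labels):
        clusters.setdefault(lab, []).append(word)
    return list(clusters.values())

# Hypothetical usage with the toy vocabulary from the example tree:
# vocab = ["coffee", "tea", "dog", "cat", "walk", "run", "walked", "ran"]
# print(spectral_brown_clusters(corpus, vocab, m=4))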