
Spectral Learning Algorithms for Natural Language Processing
Shay Cohen (1), Michael Collins (1), Dean Foster (2), Karl Stratos (1) and Lyle Ungar (2)
(1) Columbia University, (2) University of Pennsylvania
June 10, 2013


  1. Canonical Correlation Analysis (CCA)
  ◮ Data consists of paired samples: $(x^{(i)}, y^{(i)})$ for $i = 1 \ldots n$
  ◮ As in co-training, $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \mathbb{R}^{d'}$ are two “views” of a sample point
        View 1                              View 2
        $x^{(1)} = (1, 0, 0, 0)$            $y^{(1)} = (1, 0, 0, 1, 0, 1, 0)$
        $x^{(2)} = (0, 0, 1, 0)$            $y^{(2)} = (0, 1, 0, 0, 0, 0, 1)$
        ...                                 ...
        $x^{(100000)} = (0, 1, 0, 0)$       $y^{(100000)} = (0, 0, 1, 0, 1, 1, 1)$

  2. Example of Paired Data: Webpage Classification (Blum and Mitchell, 98)
  ◮ Determine whether a webpage is a course home page
    [Diagram: a course home page linked from an instructor's home page, a TA's home page, and other course home pages; the page contains words such as “Announcements”, “Lectures”, “TAs”, “Information”]
  ◮ View 1. Words on the page: “Announcements”, “Lectures”
  ◮ View 2. Identities of pages pointing to the page: instructor's home page, related course home pages
  ◮ Each view is sufficient for the classification!

  3. Example of Paired Data: Named Entity Recognition (Collins and Singer, 99)
  ◮ Identify an entity's type as either Organization, Person, or Location
      ... , says Mr. Cooper, a vice president of ...
  ◮ View 1. Spelling features: “Mr.”, “Cooper”
  ◮ View 2. Contextual features: appositive = president
  ◮ Each view is sufficient to determine the entity's type!

  4. Example of Paired Data: Bigram Model
  ◮ A latent-variable model of word pairs $(X, Y)$ with hidden state $H$; sample pairs: (the, dog), (I, saw), (ran, to), (John, was), ...
      $p(h, x, y) = p(h) \times p(x \mid h) \times p(y \mid h)$
  ◮ EM can be used to estimate the parameters of the model
  ◮ Alternatively, CCA can be used to derive vectors which can be used in a predictor, e.g.
      the $\Rightarrow (0.3, \ldots, 1.1)^\top$       dog $\Rightarrow (-1.5, \ldots, -0.4)^\top$

  5. Projection Matrices
  ◮ Project samples to a lower dimensional space: $x \in \mathbb{R}^d \Rightarrow x' \in \mathbb{R}^p$
  ◮ If $p$ is small, we can learn with far fewer samples!

  6. Projection Matrices
  ◮ Project samples to a lower dimensional space: $x \in \mathbb{R}^d \Rightarrow x' \in \mathbb{R}^p$
  ◮ If $p$ is small, we can learn with far fewer samples!
  ◮ CCA finds projection matrices $A \in \mathbb{R}^{d \times p}$, $B \in \mathbb{R}^{d' \times p}$
  ◮ The new data points are $a^{(i)} \in \mathbb{R}^p$, $b^{(i)} \in \mathbb{R}^p$ where
      $\underbrace{a^{(i)}}_{p \times 1} = \underbrace{A^\top}_{p \times d} \underbrace{x^{(i)}}_{d \times 1}, \qquad \underbrace{b^{(i)}}_{p \times 1} = \underbrace{B^\top}_{p \times d'} \underbrace{y^{(i)}}_{d' \times 1}$

  7. Mechanics of CCA: Step 1
  ◮ Compute $\hat{C}_{XY} \in \mathbb{R}^{d \times d'}$, $\hat{C}_{XX} \in \mathbb{R}^{d \times d}$, and $\hat{C}_{YY} \in \mathbb{R}^{d' \times d'}$:
      $[\hat{C}_{XY}]_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)}_j - \bar{x}_j)(y^{(i)}_k - \bar{y}_k)$
      $[\hat{C}_{XX}]_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)}_j - \bar{x}_j)(x^{(i)}_k - \bar{x}_k)$
      $[\hat{C}_{YY}]_{jk} = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)}_j - \bar{y}_j)(y^{(i)}_k - \bar{y}_k)$
    where $\bar{x} = \sum_i x^{(i)}/n$ and $\bar{y} = \sum_i y^{(i)}/n$



  10. Mechanics of CCA: Step 2
  ◮ Do an SVD on $\hat{C}_{XX}^{-1/2} \hat{C}_{XY} \hat{C}_{YY}^{-1/2} \in \mathbb{R}^{d \times d'}$:
      $\hat{C}_{XX}^{-1/2} \hat{C}_{XY} \hat{C}_{YY}^{-1/2} \overset{\text{SVD}}{=} U \Sigma V^\top$
  ◮ Let $U_p \in \mathbb{R}^{d \times p}$ be the top $p$ left singular vectors, and $V_p \in \mathbb{R}^{d' \times p}$ the top $p$ right singular vectors.

  11. Mechanics of CCA: Step 3
  ◮ Define projection matrices $A \in \mathbb{R}^{d \times p}$ and $B \in \mathbb{R}^{d' \times p}$:
      $A = \hat{C}_{XX}^{-1/2} U_p, \qquad B = \hat{C}_{YY}^{-1/2} V_p$
  ◮ Use $A$ and $B$ to project each $(x^{(i)}, y^{(i)})$ for $i = 1 \ldots n$:
      $x^{(i)} \in \mathbb{R}^d \Rightarrow A^\top x^{(i)} \in \mathbb{R}^p$
      $y^{(i)} \in \mathbb{R}^{d'} \Rightarrow B^\top y^{(i)} \in \mathbb{R}^p$
  (A short code sketch of these three steps follows this slide.)
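A minimal NumPy sketch of the three CCA steps above (not the authors' code). The small ridge term `reg` added to the covariance matrices is an assumption for numerical stability, not something the slides specify.

```python
import numpy as np

def cca(X, Y, p, reg=1e-8):
    """X is n x d, Y is n x d'; returns projection matrices A (d x p), B (d' x p)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each view
    Yc = Y - Y.mean(axis=0)

    # Step 1: empirical (cross-)covariance matrices
    Cxy = Xc.T @ Yc / n
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])

    # Inverse square roots via eigendecomposition (C is symmetric PSD)
    def inv_sqrt(C):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    Cxx_is, Cyy_is = inv_sqrt(Cxx), inv_sqrt(Cyy)

    # Step 2: SVD of the whitened cross-covariance matrix
    U, S, Vt = np.linalg.svd(Cxx_is @ Cxy @ Cyy_is)
    Up, Vp = U[:, :p], Vt[:p, :].T

    # Step 3: projection matrices
    A = Cxx_is @ Up
    B = Cyy_is @ Vp
    return A, B

# Usage: A, B = cca(X, Y, p=100); a = X @ A; b = Y @ B
```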

  12. Input and Output of CCA
      $x^{(i)} = (0, 0, 0, 1, 0, 0, 0, 0, 0, \ldots, 0) \in \mathbb{R}^{50{,}000}$
        ↓
      $a^{(i)} = (-0.3, \ldots, 0.1) \in \mathbb{R}^{100}$
      $y^{(i)} = (497, 0, 1, 12, 0, 0, 0, 7, 0, 0, 0, 0, \ldots, 0, 58, 0) \in \mathbb{R}^{120{,}000}$
        ↓
      $b^{(i)} = (-0.7, \ldots, -0.2) \in \mathbb{R}^{100}$

  13. Overview
      Basic concepts
        Linear Algebra Refresher
        Singular Value Decomposition
        Canonical Correlation Analysis: Algorithm
        Canonical Correlation Analysis: Justification
      Lexical representations
      Hidden Markov models
      Latent-variable PCFGs
      Conclusion

  14. Justification of CCA: Correlation Coefficients
  ◮ The sample correlation coefficient for $a_1 \ldots a_n \in \mathbb{R}$ and $b_1 \ldots b_n \in \mathbb{R}$ is
      $\mathrm{Corr}(\{a_i\}_{i=1}^n, \{b_i\}_{i=1}^n) = \frac{\sum_{i=1}^n (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^n (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^n (b_i - \bar{b})^2}}$
    where $\bar{a} = \sum_i a_i / n$ and $\bar{b} = \sum_i b_i / n$
  [Scatter plot of (a, b) pairs lying close to a line, illustrating correlation ≈ 1]
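A quick numeric check of the correlation formula above; the toy data here is an assumption used only for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 1.9, 3.2, 3.9])

num = np.sum((a - a.mean()) * (b - b.mean()))
den = np.sqrt(np.sum((a - a.mean()) ** 2)) * np.sqrt(np.sum((b - b.mean()) ** 2))
print(num / den)                  # ~0.99, "correlation ≈ 1"
print(np.corrcoef(a, b)[0, 1])    # same value from NumPy's built-in
```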

  15. Simple Case: p = 1
  ◮ The CCA projection matrices are vectors $u_1 \in \mathbb{R}^d$, $v_1 \in \mathbb{R}^{d'}$
  ◮ Project $x^{(i)}$ and $y^{(i)}$ to scalars $u_1 \cdot x^{(i)}$ and $v_1 \cdot y^{(i)}$

  16. Simple Case: p = 1
  ◮ The CCA projection matrices are vectors $u_1 \in \mathbb{R}^d$, $v_1 \in \mathbb{R}^{d'}$
  ◮ Project $x^{(i)}$ and $y^{(i)}$ to scalars $u_1 \cdot x^{(i)}$ and $v_1 \cdot y^{(i)}$
  ◮ What vectors does CCA find? Answer:
      $u_1, v_1 = \arg\max_{u,v} \mathrm{Corr}\left(\{u \cdot x^{(i)}\}_{i=1}^n,\ \{v \cdot y^{(i)}\}_{i=1}^n\right)$

  17. Finding the Next Projections
  ◮ After finding $u_1$ and $v_1$, what vectors $u_2$ and $v_2$ does CCA find? Answer:
      $u_2, v_2 = \arg\max_{u,v} \mathrm{Corr}\left(\{u \cdot x^{(i)}\}_{i=1}^n,\ \{v \cdot y^{(i)}\}_{i=1}^n\right)$
    subject to the constraints
      $\mathrm{Corr}\left(\{u_2 \cdot x^{(i)}\}_{i=1}^n,\ \{u_1 \cdot x^{(i)}\}_{i=1}^n\right) = 0$
      $\mathrm{Corr}\left(\{v_2 \cdot y^{(i)}\}_{i=1}^n,\ \{v_1 \cdot y^{(i)}\}_{i=1}^n\right) = 0$

  18. CCA as an Optimization Problem
  ◮ For $j = 1 \ldots p$ (each column of $A$ and $B$), CCA finds
      $u_j, v_j = \arg\max_{u,v} \mathrm{Corr}\left(\{u \cdot x^{(i)}\}_{i=1}^n,\ \{v \cdot y^{(i)}\}_{i=1}^n\right)$
    subject to the constraints, for all $k < j$,
      $\mathrm{Corr}\left(\{u_j \cdot x^{(i)}\}_{i=1}^n,\ \{u_k \cdot x^{(i)}\}_{i=1}^n\right) = 0$
      $\mathrm{Corr}\left(\{v_j \cdot y^{(i)}\}_{i=1}^n,\ \{v_k \cdot y^{(i)}\}_{i=1}^n\right) = 0$

  19. Guarantees for CCA
  [Naive Bayes graphical model: hidden variable H with observed children X and Y]
  ◮ Assume the data is generated from a Naive Bayes model
  ◮ The latent variable $H$ has dimension $k$; the variables $X$ and $Y$ have dimensions $d$ and $d'$ (typically $k \ll d$ and $k \ll d'$)
  ◮ Use CCA to project $X$ and $Y$ down to $k$ dimensions (needs $(x, y)$ pairs only!)
  ◮ Theorem: the projected samples are as good as the original samples for prediction of $H$ (Foster, Johnson, Kakade, Zhang, 2009)
  ◮ Because $k \ll d$ and $k \ll d'$ we can learn to predict $H$ with far fewer labeled examples

  20. Guarantees for CCA (continued)
  Kakade and Foster, 2007 - a co-training-style setting:
  ◮ Assume we have a regression problem: predict some value $z$ given two “views” $x$ and $y$
  ◮ Assumption: either view $x$ or $y$ alone is sufficient for prediction
  ◮ Use CCA to project $x$ and $y$ down to a low-dimensional space
  ◮ Theorem: if the correlation coefficients drop off to zero quickly, we will need far fewer samples to learn when using the projected representation
  ◮ Very similar setting to co-training, but:
    ◮ No assumption of independence between the two views
    ◮ CCA is an exact algorithm - no need for heuristics

  21. Summary of the Section
  ◮ SVD is an efficient optimization technique
    ◮ Low-rank matrix approximation
  ◮ CCA derives a new representation of paired data that maximizes correlation
    ◮ SVD as a subroutine
  ◮ Next: use of CCA in deriving vector representations of words (“eigenwords”)

  22. Overview
      Basic concepts
      Lexical representations
      ◮ Eigenwords, found using the thin SVD between words and contexts:
        ◮ capture distributional similarity
        ◮ contain POS and semantic information about words
        ◮ are useful features for supervised learning
      Hidden Markov Models
      Latent-variable PCFGs
      Conclusion

  23. Uses of Spectral Methods in NLP
  ◮ Word sequence labeling
    ◮ Part of Speech tagging (POS)
    ◮ Named Entity Recognition (NER)
    ◮ Word Sense Disambiguation (WSD)
    ◮ Chunking, prepositional phrase attachment, ...
  ◮ Language modeling
    ◮ What is the most likely next word given a sequence of words (or of sounds)?
    ◮ What is the most likely parse given a sequence of words?

  24. Uses of Spectral Methods in NLP
  ◮ Word sequence labeling: semi-supervised learning
    ◮ Use CCA to learn a vector representation of words (eigenwords) on a large unlabeled corpus.
    ◮ Eigenwords map from words to vectors, which are used as features for supervised learning.
  ◮ Language modeling: spectral estimation of probabilistic models
    ◮ Use eigenwords to reduce the dimensionality of generative models (HMMs, ...)
    ◮ Use those models to compute the probability of an observed word sequence

  25. The Eigenword Matrix U
  ◮ U contains the singular vectors from the thin SVD of the bigram count matrix, e.g. for the corpus “I ate ham / You ate cheese / You ate”:

              ate  cheese  ham  I  You
      ate      0     1      1   0   0
      cheese   0     0      0   0   0
      ham      0     0      0   0   0
      I        1     0      0   0   0
      You      2     0      0   0   0

  26. The Eigenword Matrix U
  ◮ U contains the singular vectors from the thin SVD of the bigram matrix (counts of $w_{t-1}, w_t$ pairs); this is analogous to LSA, but uses contexts instead of documents
  ◮ Context can be multiple neighboring words (we often use the words before and after the target)
  ◮ Context can be neighbors in a parse tree
  ◮ Eigenwords can also be computed using the CCA between words and their contexts
  ◮ Words close in the transformed space are distributionally, semantically and syntactically similar
  ◮ We will later use U in HMMs and parse trees to project words to low-dimensional vectors.
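A small sketch (not the authors' code) of computing eigenwords as the left singular vectors of a word-by-context co-occurrence matrix. The toy corpus and the choice of SciPy's sparse thin SVD are assumptions for illustration.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

corpus = [["I", "ate", "ham"], ["You", "ate", "cheese"], ["You", "ate"]]

# Count bigrams (w_{t-1}, w_t): rows = previous word, columns = next word
bigrams = Counter((s[t - 1], s[t]) for s in corpus for t in range(1, len(s)))
vocab = sorted({w for s in corpus for w in s})
index = {w: i for i, w in enumerate(vocab)}

rows = [index[a] for (a, b) in bigrams]
cols = [index[b] for (a, b) in bigrams]
vals = [float(c) for c in bigrams.values()]
C = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(vocab)))

# Thin SVD: keep the top-k left singular vectors as the eigenword matrix U
k = 2
U, S, Vt = svds(C, k=k)
eigenwords = {w: U[index[w]] for w in vocab}   # one k-dimensional vector per word type
print(eigenwords["You"])
```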

  27. Two Kinds of Spectral Models
  ◮ Context oblivious (eigenwords)
    ◮ learn a vector representation of each word type based on its average context
  ◮ Context sensitive (eigentokens or state)
    ◮ estimate a vector representation of each word token based on its particular context, using an HMM or parse tree

  28. Eigenwords in Practice
  ◮ Work well with corpora of 100 million words
  ◮ We often use trigrams from the Google n-gram collection
  ◮ We generally use 30-50 dimensions
  ◮ Compute using fast randomized SVD methods

  29. How Big Should Eigenwords Be?
  ◮ A 40-D cube has $2^{40}$ (about a trillion) vertices.
  ◮ More precisely, in a 40-D space about $1.5^{40} \approx 11$ million vectors can all be approximately orthogonal.
  ◮ So 40 dimensions gives plenty of room for a vocabulary of a million words
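A quick numerical illustration of the approximate-orthogonality point above (this demo is an assumption added for intuition, not from the slides): random unit vectors in 40 dimensions have pairwise cosines concentrated near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 40, 1000
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # n random unit vectors in R^d

cos = V @ V.T                                    # pairwise cosine similarities
off_diag = np.abs(cos[~np.eye(n, dtype=bool)])
print(off_diag.mean())                           # around 0.13: nearly orthogonal on average
```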

  30. Fast SVD: Basic Method
  Problem: find a low-rank approximation to an $n \times m$ matrix $M$.
  Solution: find an $n \times k$ matrix $A$ such that $M \approx A A^\top M$.

  31. Fast SVD: Basic Method
  Problem: find a low-rank approximation to an $n \times m$ matrix $M$.
  Solution: find an $n \times k$ matrix $A$ such that $M \approx A A^\top M$.
  Construction: $A$ is constructed by:
    1. create a random $m \times k$ matrix $\Omega$ (i.i.d. normals)
    2. compute $M \Omega$
    3. compute the thin SVD of the result: $U D V^\top = M \Omega$
    4. set $A = U$
  Better: iterate a couple of times.
  “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions” by N. Halko, P. G. Martinsson, and J. A. Tropp.
  (A short randomized-SVD sketch follows this slide.)
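A minimal randomized low-rank approximation in the spirit of Halko et al. (a sketch, not their reference implementation). The `n_iter` power iterations correspond to the "iterate a couple times" remark above.

```python
import numpy as np

def randomized_range(M, k, n_iter=2, seed=0):
    """Return an n x k orthonormal A with M ~= A @ A.T @ M."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((M.shape[1], k))   # random m x k test matrix
    Y = M @ omega                                  # n x k sample of the range of M
    for _ in range(n_iter):                        # power iterations sharpen the subspace
        Y = M @ (M.T @ Y)
    A, _ = np.linalg.qr(Y)                         # orthonormal basis (the thin SVD's U would also do)
    return A

def randomized_svd(M, k, n_iter=2):
    A = randomized_range(M, k, n_iter)
    # Project M into the small subspace and take an exact SVD there
    Uh, S, Vt = np.linalg.svd(A.T @ M, full_matrices=False)
    return A @ Uh, S, Vt

# Usage: U, S, Vt = randomized_svd(M, k=50)
```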

  32. Eigenwords for ‘Similar’ Words are Close
  [Scatter plot of eigenword projections (PC 1 vs. PC 2): units of measure (miles, inches, acres, pounds, bytes, tons, degrees, meters, barrels), family and person words (wife, sister, brother, daughter, uncle, son, father, husband, mother, boy, girl, boss, farmer, doctor, lawyer, teacher, citizen), and physical quantities (pressure, stress, gravity, tension, temperature, permeability, density, viscosity) each form their own cluster]

  33. Eigenwords Capture Part of Speech
  [Scatter plot (PC 1 vs. PC 2): nouns (river, house, word, boat, truck, cat, car, home, dog) and verbs (disagree, agree, carry, eat, talk, push, listen, sleep, drink) separate into distinct regions]

  34. Eigenwords: Pronouns
  [Scatter plot (PC 1 vs. PC 2): pronouns cluster by type, e.g. us, our, you, i, we near each other and them, they, him, he, she, his, her near each other]

  35. Eigenwords: Numbers
  [Scatter plot (PC 1 vs. PC 2): years (1995-2009), spelled-out numbers (one through ten), and digits (1-10) form three separate clusters]

  36. Eigenwords: Names
  [Scatter plot (PC 1 vs. PC 2): male names (tom, mike, bob, joe, dan, michael, john, david, george, paul, richard, daniel, robert, christopher, charles, thomas, william, donald, joseph) and female names (lisa, liz, jennifer, betty, karen, linda, nancy, susan, mary, helen, dorothy, maria, barbara, patricia, betsy, elizabeth, margaret, tricia) separate into two groups]

  37. CCA has Nice Properties for Computing Eigenwords
  ◮ When computing the SVD of a word × context matrix (as above) we need to decide how to scale the counts
  ◮ Using raw counts gives more emphasis to common words
  ◮ Better: rescale
    ◮ Divide each row by the square root of the total count of the word in that row
    ◮ Rescale the columns to account for their redundancy
  ◮ CCA between words and their contexts does this automatically and optimally
  ◮ CCA ‘whitens’ the word-context covariance matrix

  38. Semi-supervised Learning Problems
  ◮ Sequence labeling (Named Entity Recognition, POS, WSD, ...)
    ◮ X = target word
    ◮ Z = context of the target word
    ◮ label = person / place / organization ...
  ◮ Topic identification
    ◮ X = words in title
    ◮ Z = words in abstract
    ◮ label = topic category
  ◮ Speaker identification
    ◮ X = video
    ◮ Z = audio
    ◮ label = which character is speaking

  39. Semi-supervised Learning using CCA
  ◮ Find the CCA between X and Z
  ◮ Recall: CCA finds projection matrices $A$ and $B$ such that
      $\underbrace{\tilde{x}}_{k \times 1} = \underbrace{A^\top}_{k \times d} \underbrace{x}_{d \times 1}, \qquad \underbrace{\tilde{z}}_{k \times 1} = \underbrace{B^\top}_{k \times d'} \underbrace{z}_{d' \times 1}$
  ◮ Project X and Z to estimate the hidden state: $(\tilde{x}, \tilde{z})$
  ◮ Note: if $x$ is the word and $z$ is its context, then $A$ is the matrix of eigenwords, $\tilde{x}$ is the (context-oblivious) eigenword corresponding to the word $x$, and $\tilde{z}$ gives a context-sensitive “eigentoken”
  ◮ Use supervised learning to predict the label from the hidden state
    ◮ and from the hidden states of neighboring words
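A sketch of the semi-supervised recipe above. The use of scikit-learn's CCA and logistic regression, the one-hot feature matrices, and the small labeled subset are all assumptions added for illustration, not the authors' pipeline.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression

def fit_semi_supervised(X_unlab, Z_unlab, X_lab, Z_lab, y_lab, k=50):
    # Step 1: learn the two projection matrices from unlabeled (x, z) pairs
    cca = CCA(n_components=k).fit(X_unlab, Z_unlab)

    # Step 2: hidden-state features for the labeled examples
    x_proj, z_proj = cca.transform(X_lab, Z_lab)
    features = np.hstack([x_proj, z_proj])

    # Step 3: ordinary supervised learning on the k-dimensional features
    clf = LogisticRegression(max_iter=1000).fit(features, y_lab)
    return cca, clf
```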

  40. Theory: CCA has Nice Properties
  ◮ If one uses CCA to map the target word and its context (two views, X and Z) to a reduced-dimension hidden state, and then uses that hidden state as features in a linear regression to predict a value $y$, then the fit in the reduced dimension (e.g. 40) is provably almost as good as in the original dimension (e.g. a million-word vocabulary).
  ◮ In contrast, Principal Components Regression (PCR: regression based on PCA, which does not “whiten” the covariance matrix) can miss all the signal [Foster and Kakade, '06]

  41. Semi-supervised Results
  ◮ Find spectral features on unlabeled data
    ◮ RCV-1 corpus: newswire
    ◮ 63 million tokens in 3.3 million sentences
    ◮ Vocabulary size: 300k
    ◮ Size of embeddings: k = 50
  ◮ Use in a discriminative model
    ◮ CRF for NER
    ◮ Averaged perceptron for chunking
  ◮ Compare against state-of-the-art embeddings
    ◮ C&W, HLBL, Brown, ASO and Semi-Sup CRF
    ◮ Baseline features based on the identity of the word and its neighbors
  ◮ Benefit
    ◮ Named Entity Recognition (NER): 8% error reduction
    ◮ Chunking: 29% error reduction
    ◮ Adding spectral features to a discriminative parser: 2.6% error reduction

  42. Section Summary
  ◮ Eigenwords, found using the thin SVD between words and contexts:
    ◮ capture distributional similarity
    ◮ contain POS and semantic information about words
    ◮ perform competitively with a wide range of other embeddings
  ◮ The CCA version provides provable guarantees when used as features in supervised learning
  ◮ Next: eigenwords form the basis for fast estimation of HMMs and parse trees

  43. A Spectral Learning Algorithm for HMMs
  ◮ Algorithm due to Hsu, Kakade and Zhang (COLT 2009; JCSS 2012)
  ◮ The algorithm relies on a singular value decomposition followed by very simple matrix operations
  ◮ Close connections to CCA
  ◮ Under assumptions on the singular values arising from the model, it has PAC-learning-style guarantees (contrast with EM, which has problems with local optima)
  ◮ It is a very different algorithm from EM

  44. Hidden Markov Models (HMMs)
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\underbrace{\text{the dog saw him}}_{x_1 \ldots x_4},\ \underbrace{1\,2\,1\,3}_{h_1 \ldots h_4}) = \pi(1) \times t(2 \mid 1) \times t(1 \mid 2) \times t(3 \mid 1)$

  45. Hidden Markov Models (HMMs)
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\underbrace{\text{the dog saw him}}_{x_1 \ldots x_4},\ \underbrace{1\,2\,1\,3}_{h_1 \ldots h_4}) = \pi(1) \times t(2 \mid 1) \times t(1 \mid 2) \times t(3 \mid 1) \times o(\text{the} \mid 1) \times o(\text{dog} \mid 2) \times o(\text{saw} \mid 1) \times o(\text{him} \mid 3)$

  46. Hidden Markov Models (HMMs)
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\underbrace{\text{the dog saw him}}_{x_1 \ldots x_4},\ \underbrace{1\,2\,1\,3}_{h_1 \ldots h_4}) = \pi(1) \times t(2 \mid 1) \times t(1 \mid 2) \times t(3 \mid 1) \times o(\text{the} \mid 1) \times o(\text{dog} \mid 2) \times o(\text{saw} \mid 1) \times o(\text{him} \mid 3)$
  ◮ Initial parameters: $\pi(h)$ for each latent state $h$
  ◮ Transition parameters: $t(h' \mid h)$ for each pair of states $h', h$
  ◮ Observation parameters: $o(x \mid h)$ for each state $h$ and observation $x$

  47. Hidden Markov Models (HMMs)
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
  Throughout this section:
  ◮ We use $m$ to refer to the number of hidden states
  ◮ We use $n$ to refer to the number of possible words (observations)
  ◮ Typically, $m \ll n$ (e.g., $m = 20$, $n = 50{,}000$)

  48. HMMs: the forward algorithm
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\text{the dog saw him}) = \sum_{h_1,h_2,h_3,h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$

  49. HMMs: the forward algorithm
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\text{the dog saw him}) = \sum_{h_1,h_2,h_3,h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$
  The forward algorithm:

  50. HMMs: the forward algorithm
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\text{the dog saw him}) = \sum_{h_1,h_2,h_3,h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$
  The forward algorithm:
      $f^0_h = \pi(h)$

  51. HMMs: the forward algorithm
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\text{the dog saw him}) = \sum_{h_1,h_2,h_3,h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$
  The forward algorithm:
      $f^0_h = \pi(h) \qquad f^1_h = \sum_{h'} t(h \mid h')\, o(\text{the} \mid h')\, f^0_{h'}$

  52. HMMs: the forward algorithm
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\text{the dog saw him}) = \sum_{h_1,h_2,h_3,h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$
  The forward algorithm:
      $f^0_h = \pi(h) \qquad f^1_h = \sum_{h'} t(h \mid h')\, o(\text{the} \mid h')\, f^0_{h'}$
      $f^2_h = \sum_{h'} t(h \mid h')\, o(\text{dog} \mid h')\, f^1_{h'}$

  53. HMMs: the forward algorithm
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\text{the dog saw him}) = \sum_{h_1,h_2,h_3,h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$
  The forward algorithm:
      $f^0_h = \pi(h) \qquad f^1_h = \sum_{h'} t(h \mid h')\, o(\text{the} \mid h')\, f^0_{h'}$
      $f^2_h = \sum_{h'} t(h \mid h')\, o(\text{dog} \mid h')\, f^1_{h'} \qquad f^3_h = \sum_{h'} t(h \mid h')\, o(\text{saw} \mid h')\, f^2_{h'}$

  54. HMMs: the forward algorithm
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\text{the dog saw him}) = \sum_{h_1,h_2,h_3,h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$
  The forward algorithm:
      $f^0_h = \pi(h) \qquad f^1_h = \sum_{h'} t(h \mid h')\, o(\text{the} \mid h')\, f^0_{h'}$
      $f^2_h = \sum_{h'} t(h \mid h')\, o(\text{dog} \mid h')\, f^1_{h'} \qquad f^3_h = \sum_{h'} t(h \mid h')\, o(\text{saw} \mid h')\, f^2_{h'}$
      $f^4_h = \sum_{h'} t(h \mid h')\, o(\text{him} \mid h')\, f^3_{h'}$

  55. HMMs: the forward algorithm
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
      $p(\text{the dog saw him}) = \sum_{h_1,h_2,h_3,h_4} p(\text{the dog saw him},\ h_1 h_2 h_3 h_4)$
  The forward algorithm:
      $f^0_h = \pi(h) \qquad f^1_h = \sum_{h'} t(h \mid h')\, o(\text{the} \mid h')\, f^0_{h'}$
      $f^2_h = \sum_{h'} t(h \mid h')\, o(\text{dog} \mid h')\, f^1_{h'} \qquad f^3_h = \sum_{h'} t(h \mid h')\, o(\text{saw} \mid h')\, f^2_{h'}$
      $f^4_h = \sum_{h'} t(h \mid h')\, o(\text{him} \mid h')\, f^3_{h'} \qquad p(\cdots) = \sum_h f^4_h$
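A direct transcription of the forward recursion above into Python (the dictionary-based parameters `pi`, `t`, `o` are placeholders; only the recursion itself is taken from the slides).

```python
def forward(sentence, states, pi, t, o):
    """Return p(sentence) for an HMM with initial probs pi[h],
    transitions t[h][h'] = t(h | h') and observations o[x][h'] = o(x | h')."""
    f = {h: pi[h] for h in states}                       # f^0_h = pi(h)
    for x in sentence:
        f = {h: sum(t[h][hp] * o[x][hp] * f[hp]          # f^k_h = sum_{h'} t(h|h') o(x|h') f^{k-1}_{h'}
                    for hp in states)
             for h in states}
    return sum(f.values())                               # p(sentence) = sum_h f^L_h

# Usage (with hypothetical parameter dictionaries):
# forward(["the", "dog", "saw", "him"], states=[1, 2, 3], pi=..., t=..., o=...)
```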

  56. HMMs: the forward algorithm in matrix form
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]


  58. HMMs: the forward algorithm in matrix form
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
  ◮ For each word $x$, define the matrix $A_x \in \mathbb{R}^{m \times m}$ as
      $[A_x]_{h',h} = t(h' \mid h)\, o(x \mid h)$

  59. HMMs: the forward algorithm in matrix form
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
  ◮ For each word $x$, define the matrix $A_x \in \mathbb{R}^{m \times m}$ as
      $[A_x]_{h',h} = t(h' \mid h)\, o(x \mid h)$, e.g., $[A_{\text{the}}]_{h',h} = t(h' \mid h)\, o(\text{the} \mid h)$

  60. HMMs: the forward algorithm in matrix form
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
  ◮ For each word $x$, define the matrix $A_x \in \mathbb{R}^{m \times m}$ as
      $[A_x]_{h',h} = t(h' \mid h)\, o(x \mid h)$, e.g., $[A_{\text{the}}]_{h',h} = t(h' \mid h)\, o(\text{the} \mid h)$
  ◮ Define $\pi$ as the vector with elements $\pi_h$, and $1$ as the vector of all ones

  61. HMMs: the forward algorithm in matrix form
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
  ◮ For each word $x$, define the matrix $A_x \in \mathbb{R}^{m \times m}$ as
      $[A_x]_{h',h} = t(h' \mid h)\, o(x \mid h)$, e.g., $[A_{\text{the}}]_{h',h} = t(h' \mid h)\, o(\text{the} \mid h)$
  ◮ Define $\pi$ as the vector with elements $\pi_h$, and $1$ as the vector of all ones
  ◮ Then
      $p(\text{the dog saw him}) = 1^\top \times A_{\text{him}} \times A_{\text{saw}} \times A_{\text{dog}} \times A_{\text{the}} \times \pi$
    Forward algorithm through matrix multiplication!
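A small NumPy sketch of the matrix-form forward algorithm above; the toy parameters are assumptions chosen for illustration, only the shapes and the recursion come from the slides.

```python
import numpy as np

m, words = 2, ["the", "dog", "saw", "him"]
rng = np.random.default_rng(0)

# Toy HMM parameters: pi (m,), transitions t[h', h], observations o[x, h]
pi = np.array([0.6, 0.4])
t = np.array([[0.7, 0.3],
              [0.3, 0.7]])                          # column h sums to 1: t(h' | h)
o = rng.dirichlet(np.ones(len(words)), size=m).T    # o[x, h], each column sums to 1

# [A_x]_{h', h} = t(h' | h) * o(x | h)
A = {x: t * o[i] for i, x in enumerate(words)}

# p(the dog saw him) = 1^T A_him A_saw A_dog A_the pi
f = pi
for x in ["the", "dog", "saw", "him"]:
    f = A[x] @ f                                    # forward update f^k = A_{x_k} f^{k-1}
print(f.sum())                                      # final product with the all-ones vector
```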

  62. The Spectral Algorithm: definitions
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
  Define the following matrix $P_{2,1} \in \mathbb{R}^{n \times n}$:
      $[P_{2,1}]_{i,j} = P(X_2 = i, X_1 = j)$
  It is easy to derive an estimate:
      $[\hat{P}_{2,1}]_{i,j} = \frac{\text{Count}(X_2 = i, X_1 = j)}{N}$

  63. The Spectral Algorithm: definitions
  [HMM: hidden states H1 H2 H3 H4 emitting “the dog saw him”]
  For each word $x$, define the following matrix $P_{3,x,1} \in \mathbb{R}^{n \times n}$:
      $[P_{3,x,1}]_{i,j} = P(X_3 = i, X_2 = x, X_1 = j)$
  It is easy to derive an estimate, e.g.,
      $[\hat{P}_{3,\text{dog},1}]_{i,j} = \frac{\text{Count}(X_3 = i, X_2 = \text{dog}, X_1 = j)}{N}$
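A counting sketch for the two estimates above, using the first three words of each sentence; the toy corpus is an assumption for illustration.

```python
import numpy as np

corpus = [["the", "dog", "saw", "him"], ["the", "dog", "ran"], ["a", "dog", "saw", "them"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

pairs = [(s[1], s[0]) for s in corpus if len(s) >= 2]          # (X2, X1) per sentence
triples = [(s[2], s[1], s[0]) for s in corpus if len(s) >= 3]  # (X3, X2, X1) per sentence

# \hat P_{2,1}[i, j] = Count(X2 = i, X1 = j) / N
P21 = np.zeros((n, n))
for x2, x1 in pairs:
    P21[idx[x2], idx[x1]] += 1.0 / len(pairs)

# \hat P_{3,x,1}[i, j] = Count(X3 = i, X2 = x, X1 = j) / N, one matrix per word x
P3x1 = {w: np.zeros((n, n)) for w in vocab}
for x3, x2, x1 in triples:
    P3x1[x2][idx[x3], idx[x1]] += 1.0 / len(triples)
```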

  64. Main Result Underlying the Spectral Algorithm
  ◮ Define the following matrix $P_{2,1} \in \mathbb{R}^{n \times n}$:
      $[P_{2,1}]_{i,j} = P(X_2 = i, X_1 = j)$
  ◮ For each word $x$, define the following matrix $P_{3,x,1} \in \mathbb{R}^{n \times n}$:
      $[P_{3,x,1}]_{i,j} = P(X_3 = i, X_2 = x, X_1 = j)$

  65. Main Result Underlying the Spectral Algorithm
  ◮ Define the following matrix $P_{2,1} \in \mathbb{R}^{n \times n}$:
      $[P_{2,1}]_{i,j} = P(X_2 = i, X_1 = j)$
  ◮ For each word $x$, define the following matrix $P_{3,x,1} \in \mathbb{R}^{n \times n}$:
      $[P_{3,x,1}]_{i,j} = P(X_3 = i, X_2 = x, X_1 = j)$
  ◮ $\text{SVD}(P_{2,1}) \Rightarrow U \in \mathbb{R}^{n \times m},\ \Sigma \in \mathbb{R}^{m \times m},\ V \in \mathbb{R}^{n \times m}$

  66. Main Result Underlying the Spectral Algorithm
  ◮ Define the following matrix $P_{2,1} \in \mathbb{R}^{n \times n}$:
      $[P_{2,1}]_{i,j} = P(X_2 = i, X_1 = j)$
  ◮ For each word $x$, define the following matrix $P_{3,x,1} \in \mathbb{R}^{n \times n}$:
      $[P_{3,x,1}]_{i,j} = P(X_3 = i, X_2 = x, X_1 = j)$
  ◮ $\text{SVD}(P_{2,1}) \Rightarrow U \in \mathbb{R}^{n \times m},\ \Sigma \in \mathbb{R}^{m \times m},\ V \in \mathbb{R}^{n \times m}$
  ◮ Definition:
      $B_x = U^\top \times P_{3,x,1} \times V \times \Sigma^{-1} \in \mathbb{R}^{m \times m}$

  67. Main Result Underlying the Spectral Algorithm
  ◮ Define the following matrix $P_{2,1} \in \mathbb{R}^{n \times n}$:
      $[P_{2,1}]_{i,j} = P(X_2 = i, X_1 = j)$
  ◮ For each word $x$, define the following matrix $P_{3,x,1} \in \mathbb{R}^{n \times n}$:
      $[P_{3,x,1}]_{i,j} = P(X_3 = i, X_2 = x, X_1 = j)$
  ◮ $\text{SVD}(P_{2,1}) \Rightarrow U \in \mathbb{R}^{n \times m},\ \Sigma \in \mathbb{R}^{m \times m},\ V \in \mathbb{R}^{n \times m}$
  ◮ Definition:
      $B_x = U^\top \times P_{3,x,1} \times V \times \Sigma^{-1} \in \mathbb{R}^{m \times m}$
  ◮ Theorem: if $P_{2,1}$ is of rank $m$, then $B_x = G A_x G^{-1}$ where $G \in \mathbb{R}^{m \times m}$ is invertible
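A sketch of turning the estimated matrices into the observable operators $B_x$, directly following the definition above (this is a simplified fragment, not the full Hsu-Kakade-Zhang algorithm; the argument names are placeholders).

```python
import numpy as np

def spectral_operators(P21_hat, P3x1_hat, m):
    """P21_hat: n x n estimate of P(X2 = i, X1 = j).
    P3x1_hat: dict mapping word x -> n x n estimate of P(X3 = i, X2 = x, X1 = j).
    Returns the m x m operators B_x = U^T P_{3,x,1} V Sigma^{-1}."""
    U_full, s, Vt = np.linalg.svd(P21_hat)
    U, V = U_full[:, :m], Vt[:m, :].T        # top m left/right singular vectors
    Sigma_inv = np.diag(1.0 / s[:m])
    return {x: U.T @ P3x1 @ V @ Sigma_inv for x, P3x1 in P3x1_hat.items()}

# Usage with the counting sketch above: B = spectral_operators(P21, P3x1, m=2)
```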

  68. Why does this matter?
  ◮ Theorem: if $P_{2,1}$ is of rank $m$, then $B_x = G A_x G^{-1}$ where $G \in \mathbb{R}^{m \times m}$ is invertible
  ◮ Recall $p(\text{the dog saw him}) = 1^\top A_{\text{him}} A_{\text{saw}} A_{\text{dog}} A_{\text{the}}\, \pi$. Forward algorithm through matrix multiplication!

  69. Why does this matter?
  ◮ Theorem: if $P_{2,1}$ is of rank $m$, then $B_x = G A_x G^{-1}$ where $G \in \mathbb{R}^{m \times m}$ is invertible
  ◮ Recall $p(\text{the dog saw him}) = 1^\top A_{\text{him}} A_{\text{saw}} A_{\text{dog}} A_{\text{the}}\, \pi$. Forward algorithm through matrix multiplication!
  ◮ Now note that
      $B_{\text{him}} \times B_{\text{saw}} \times B_{\text{dog}} \times B_{\text{the}} = G A_{\text{him}} A_{\text{saw}} A_{\text{dog}} A_{\text{the}} G^{-1}$
    since the inner $G^{-1} G$ factors cancel.
