CS598JHM: Advanced NLP (Spring 2013)
http://courses.engr.illinois.edu/cs598jhm/

Lecture 3: Comparing frequentist and Bayesian estimation techniques

Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Office hours: by appointment
Text classification

The task: binary classification (e.g. sentiment analysis)
Assign a (sentiment) label L_i ∈ {+, −} to a document W_i = (w_i1 ... w_iN).
W_1 = "This is an amazing product: great battery life, amazing features and it's cheap."
W_2 = "How awful. It's buggy, saps power and is way too expensive."

The data: a set D of N documents, with (or without) labels

The model: Naive Bayes

We will use a frequentist model and a Bayesian model, and compare supervised and unsupervised estimation techniques for them.
A Naive Bayes model

The task: Assign a (sentiment) label L_i ∈ {+, −} to document W_i.
W_1 = "This is an amazing product: great battery life, amazing features and it's cheap."
W_2 = "How awful. It's buggy, saps power and is way too expensive."

The model:
L_i = argmax_L P(L | W_i) = argmax_L P(W_i | L) P(L)

Assume W_i is a "bag of words":
W_1 = {an: 1, and: 1, amazing: 2, battery: 1, cheap: 1, features: 1, great: 1, ...}
W_2 = {awful: 1, and: 1, buggy: 1, expensive: 1, ...}

P(W_i | L) is a multinomial distribution: W_i ∼ Multinomial(θ_L)
With a vocabulary of V words, θ_L = (θ_1, ..., θ_V)

P(L) is a Bernoulli distribution: L ∼ Bernoulli(π)
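To make the bag-of-words assumption concrete, here is a minimal Python sketch; the function name bag_of_words and the tokenization are our own illustration, not part of the lecture.

    import re
    from collections import Counter

    def bag_of_words(document):
        """Reduce a document (a string) to word counts, discarding word order."""
        return Counter(re.findall(r"[a-z']+", document.lower()))

    W1 = ("This is an amazing product: great battery life, "
          "amazing features and it's cheap.")
    print(bag_of_words(W1))   # Counter({'amazing': 2, 'this': 1, 'is': 1, ...})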
The frequentist (maximum-likelihood) model
The frequentist model

The frequentist model has specific parameters θ_L and π:
L_i = argmax_L P(W_i | θ_L) P(L | π)

P(W_i | θ_L) is a multinomial over V words with parameter θ_L = (θ_1, ..., θ_V):
W_i ∼ Multinomial(θ_L)

P(L | π) is a Bernoulli distribution with parameter π:
L ∼ Bernoulli(π)
The frequentist model

[Plate diagram: π generates each label L_i, and L_i together with θ_L generates each word w_ij; plates over the N_i words in each document, the N documents, and the 2 label-specific parameter vectors θ_L.]
Supervised MLE

The data is labeled:
We have a set D of D documents W_1 ... W_D with N words.
Each document W_i has N_i words.
D_+ documents (subset D_+) have a positive label and N_+ words.
D_− documents (subset D_−) have a negative label and N_− words.
Each word w_i appears N_+(w_i) times in D_+ and N_−(w_i) times in D_−.
Each word w_i appears N_j(w_i) times in document W_j.

MLE: relative frequency estimation
- Labels: L ∼ Bernoulli(π) with π = D_+ / D
- Words: W_i | + ∼ Multinomial(θ_+) with θ_+i = N_+(w_i) / N_+
- Words: W_i | − ∼ Multinomial(θ_−) with θ_−i = N_−(w_i) / N_−
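A minimal sketch of these relative-frequency estimates, assuming the labeled data is given as a list of (bag-of-words Counter, label) pairs; all names are illustrative.

    from collections import Counter

    def mle_estimate(labeled_docs):
        """labeled_docs: list of (Counter, label) pairs with label in {'+', '-'}.
        Returns pi = P(L=+) and per-class relative word frequencies theta['+'], theta['-']."""
        n_docs = {'+': 0, '-': 0}                 # D_+ and D_-
        word_counts = {'+': Counter(), '-': Counter()}
        for counts, label in labeled_docs:
            n_docs[label] += 1
            word_counts[label].update(counts)
        pi = n_docs['+'] / len(labeled_docs)      # pi = D_+ / D
        theta = {}
        for label in ('+', '-'):
            total = sum(word_counts[label].values())   # N_+ or N_-
            theta[label] = {w: c / total for w, c in word_counts[label].items()}
        return pi, theta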
Inference with MLE

The inference task: Given a new document W_{i+1}, what is its label L_{i+1}?
Recall: the word w_j occurs N_{i+1}(w_j) times in W_{i+1}.

P(L = + | W_{i+1}) ∝ P(+) P(W_{i+1} | +) = π ∏_{j=1}^{V} θ_{+j}^{N_{i+1}(w_j)}
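In practice this product is computed in log space to avoid underflow. A sketch using the pi and theta returned by the estimation sketch above; words with a zero MLE in either class are simply dropped here, which is exactly the problem the Bayesian smoothing below avoids.

    import math

    def classify(counts, pi, theta):
        """counts: bag-of-words Counter for the new document W_{i+1}."""
        scores = {'+': math.log(pi), '-': math.log(1.0 - pi)}   # log P(L)
        for w, n in counts.items():
            # A word unseen in one class has an MLE of zero (log 0 = -inf);
            # this sketch drops such words instead of smoothing them.
            if w in theta['+'] and w in theta['-']:
                for label in scores:
                    scores[label] += n * math.log(theta[label][w])
        return max(scores, key=scores.get)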
Unsupervised MLE

The data is unlabeled:
We have a set D of D documents W_1 ... W_D with N words.
Each document W_i has N_i words.
Each vocabulary word w_i (from w_1 ... w_V) appears N_j(w_i) times in W_j.

EM algorithm: "expected relative frequency estimation"
Initialization: pick initial π^(0), θ_+^(0), θ_−^(0)
Iterate:
- Labels: L ∼ Bernoulli(π) with π^(t) = 〈N_+〉^(t−1) / 〈N〉^(t−1)
- Words: W_i | + ∼ Multinomial(θ_+) with θ_+i^(t) = 〈N_+(w_i)〉^(t−1) / 〈W_+〉^(t−1)
- Words: W_i | − ∼ Multinomial(θ_−) with θ_−i^(t) = 〈N_−(w_i)〉^(t−1) / 〈W_−〉^(t−1)
(〈W_±〉 is the expected total number of word tokens in documents with label ±.)
Maximum Likelihood estimation

With complete (= labeled) data D = {〈X_i, Z_i〉}, maximize the complete likelihood p(X, Z | θ):

θ* = argmax_θ ∏_i p(X_i, Z_i | θ)
or
θ* = argmax_θ ∑_i ln p(X_i, Z_i | θ)
Maximum Likelihood estimation

With incomplete (= unlabeled) data D = {〈X_i, ?〉}, maximize the incomplete (marginal) likelihood p(X | θ):

θ* = argmax_θ ∑_i ln p(X_i | θ)
   = argmax_θ ∑_i ln( ∑_Z [p(X_i, Z | θ) / p(Z | X_i, θ')] p(Z | X_i, θ') )
   = argmax_θ ∑_i ln( E_{Z|X_i,θ'}[ p(X_i, Z | θ) / p(Z | X_i, θ') ] )

p(Z | X, θ): the posterior probability of Z (X = our data)
E_{Z|X_i,θ'}[·]: the expectation w.r.t. p(Z | X_i, θ')

Find parameters θ_new that maximize the expected log-likelihood of the joint p(Z, X | θ_new) under p(Z | X, θ_old).
This requires an iterative approach.
The EM algorithm

1. Initialization: Choose initial parameters θ_old
2. Expectation step: Compute p(Z | X, θ_old) (= posterior of the latent variables Z)
3. Maximization step: Compute θ_new, which maximizes the expected log-likelihood of the joint p(Z, X | θ_new) under p(Z | X, θ_old):

   θ_new = argmax_θ ∑_Z p(Z | X, θ_old) ln p(X, Z | θ)

4. Check for convergence. Stop, or set θ_old := θ_new and go to 2.
The EM algorithm

The classes we find may not correspond to the classes we would be interested in.
Seed knowledge (e.g. a few positive and negative words) may help.

We are not guaranteed to find a global optimum, and may get stuck in a local optimum.
Initialization matters.
In our example...

Initialization: Pick (random) π_A, π_B = (1 − π_A), θ_A, θ_B

E-step:
Set N_A, N_B, N_A(w_1), ..., N_A(w_V), N_B(w_1), ..., N_B(w_V) := 0
For each document W_i:
  Set L_i = A with P(L_i = A | W_i, π_A, π_B, θ_A, θ_B) ∝ π_A ∏_j P(w_ij | θ_A)
  Set L_i = B with P(L_i = B | W_i, π_A, π_B, θ_A, θ_B) ∝ π_B ∏_j P(w_ij | θ_B)
  Update:
    N_A += P(L_i = A | W_i, π_A, π_B, θ_A, θ_B)
    N_B += P(L_i = B | W_i, π_A, π_B, θ_A, θ_B)
    For all words w_ij in W_i:
      N_A(w_ij) += P(L_i = A | W_i, π_A, π_B, θ_A, θ_B)
      N_B(w_ij) += P(L_i = B | W_i, π_A, π_B, θ_A, θ_B)

M-step:
π_A := N_A / (N_A + N_B)
π_B := N_B / (N_A + N_B)
θ_A(w_i) := N_A(w_i) / ∑_j N_A(w_j)
θ_B(w_i) := N_B(w_i) / ∑_j N_B(w_j)
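A compact Python sketch of one such EM iteration over a collection of bag-of-words documents, following the updates above. The names (em_step, docs, pi, theta) are ours; the responsibilities are computed in log space to avoid underflow, and theta is assumed to be non-zero for every word that occurs in the data (e.g. from a random initialization over the full vocabulary).

    import math
    from collections import Counter

    def em_step(docs, pi, theta):
        """One EM iteration for the two-class mixture above.
        docs:  list of bag-of-words Counters
        pi:    dict label -> P(L = label)
        theta: dict label -> {word: probability}, non-zero for every word in docs"""
        labels = ('A', 'B')
        n_label = {l: 0.0 for l in labels}            # expected number of documents per label
        word_count = {l: Counter() for l in labels}   # expected word counts per label
        for counts in docs:
            # E-step: responsibility P(L_i = l | W_i) ∝ pi_l * prod_j P(w_ij | theta_l)
            log_post = {}
            for l in labels:
                log_post[l] = math.log(pi[l]) + sum(
                    n * math.log(theta[l][w]) for w, n in counts.items())
            m = max(log_post.values())
            post = {l: math.exp(log_post[l] - m) for l in labels}
            z = sum(post.values())
            for l in labels:
                r = post[l] / z
                n_label[l] += r
                for w, n in counts.items():
                    word_count[l][w] += r * n
        # M-step: re-normalize the expected counts
        new_pi = {l: n_label[l] / len(docs) for l in labels}
        new_theta = {}
        for l in labels:
            total = sum(word_count[l].values())
            new_theta[l] = {w: c / total for w, c in word_count[l].items()}
        return new_pi, new_theta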
The Bayesian model
The Bayesian model

The Bayesian model has priors Dir(γ) and Beta(α, β) with hyperparameters γ = (γ_1, ..., γ_V) and α, β.
It does not have specific θ_L and π, but integrates them out:

L_i = argmax_L ∫∫ P(W_i | θ_L) P(θ_L; γ_L, D) P(L | π) P(π; α, β, D) dθ_L dπ
    = argmax_L ∫ P(W_i | θ_L) P(θ_L; γ_L, D) dθ_L ∫ P(L | π) P(π; α, β, D) dπ
    = argmax_L P(W_i | γ_L, D) P(L | α, β, D)

P(W_i | θ_L) is a multinomial with parameter θ_L = (θ_1, ..., θ_V);
P(θ_L; γ_L) is a Dirichlet with hyperparameter γ_L = (γ_1, ..., γ_V):
θ_L ∼ Dirichlet(γ_L)
W_i ∼ Multinomial(θ_L)

P(L | π) is a Bernoulli with parameter π, drawn from a Beta prior:
π ∼ Beta(α, β)
L ∼ Bernoulli(π)
The Bayesian model

[Plate diagram: the same structure as the frequentist model, with a Beta(α, β) prior on π and a Dirichlet(γ) prior on each θ_L; plates over the N_i words per document, the N documents, and the 2 labels.]
Bayesian: supervised

The data is labeled:
We have a set D of D documents W_1 ... W_D with N words.
Each document W_i has N_i words.
D_+ documents (subset D_+) have a positive label and N_+ words.
D_− documents (subset D_−) have a negative label and N_− words.
Each word w_i appears N_+(w_i) times in D_+ and N_−(w_i) times in D_−.
Each word w_j appears N_i(w_j) times in W_i.

Bayesian estimation:
P(L = + | D) = (D_+ + α) / (D + α + β)
P(w_i | +, D) = (N_+(w_i) + γ_i) / (N_+ + γ_0), with γ_0 = ∑_j γ_j
P(W_i | +, D) = ∏_j P(w_j | +, D)^{N_i(w_j)}
P(L_i = + | W_i, D) ∝ [(D_+ + α) / (D + α + β)] ∏_j P(w_j | +, D)^{N_i(w_j)}
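Integrating out θ and π turns the MLE relative frequencies into counts plus pseudocounts. A sketch of these predictive probabilities, assuming a symmetric Dirichlet (γ_i = γ for all i, so γ_0 = γ·V); the function and argument names are our own.

    from collections import Counter

    def bayes_predictive(labeled_docs, vocab, alpha=1.0, beta=1.0, gamma=1.0):
        """Posterior predictive probabilities with Beta/Dirichlet pseudocounts.
        labeled_docs: list of (Counter, label) pairs, label in {'+', '-'};
        vocab: set of all V word types; symmetric gamma, so gamma_0 = gamma * V."""
        n_docs = {'+': 0, '-': 0}
        word_counts = {'+': Counter(), '-': Counter()}
        for counts, label in labeled_docs:
            n_docs[label] += 1
            word_counts[label].update(counts)
        D = len(labeled_docs)
        p_pos = (n_docs['+'] + alpha) / (D + alpha + beta)      # P(L=+ | D)
        gamma_0 = gamma * len(vocab)
        p_word = {}
        for label in ('+', '-'):
            total = sum(word_counts[label].values())            # N_+ or N_-
            p_word[label] = {w: (word_counts[label][w] + gamma) / (total + gamma_0)
                             for w in vocab}
        return p_pos, p_word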
Bayesian: unsupervised

We need to approximate an integral/expectation:

p(L_i = + | W_i) ∝ ∫∫ p(W_i | +, θ_+) p(θ_+; γ, D) p(L = + | π) p(π; α, β, D) dθ_+ dπ
                 ∝ ∫ p(W_i | +, θ_+) p(θ_+; γ, D) dθ_+ ∫ p(L = + | π) p(π; α, β, D) dπ
                 ∝ p(W_i | γ, +, D) p(L_i = + | α, β, D)
Approximating expectations

E[f(x)] = ∫ f(x) p(x) dx
        = lim_{N→∞} (1/N) ∑_{i=1}^{N} f(x^(i))   for x^(1) ... x^(i) ... x^(N) drawn from p(x)
        ≈ (1/T) ∑_{i=1}^{T} f(x^(i))             for x^(1) ... x^(i) ... x^(T) drawn from p(x)

We can approximate the expectation of f(x), 〈f(x)〉 = ∫ f(x) p(x) dx, by sampling a finite number of points x^(1), ..., x^(T) according to p(x), evaluating f(x^(i)) for each of them, and computing the average.
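A tiny numerical illustration of this approximation (our own example, not from the slides): estimating E[f(x)] for f(x) = x² under a standard normal p(x), whose true value is 1.

    import random

    def mc_expectation(f, sample, T=100_000):
        """Approximate E[f(x)] = ∫ f(x) p(x) dx by averaging f over T draws from p."""
        return sum(f(sample()) for _ in range(T)) / T

    # f(x) = x^2 under a standard normal: the true expectation is the variance, 1.0.
    estimate = mc_expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
    print(estimate)   # close to 1.0; the sampling error shrinks as T grows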
Markov Chain Monte Carlo

A multivariate distribution p(x) = p(x_1, ..., x_k) with discrete x_i has only a finite number of possible outcomes.

Markov Chain Monte Carlo methods construct a Markov chain whose states are the outcomes of p(x). The probability of visiting state x_j is p(x_j).

We sample from p(x) by visiting a sequence of states from this Markov chain.
Gibbs sampling

Our states: one label assignment L_1, ..., L_N to each of our N documents:
x = (L_1, ..., L_N)

Our transitions: we go from one label assignment x = (+, +, −, +, −, ..., +) to another y = (−, +, +, +, ..., +).

Our intermediate steps: we generate label Y_i conditioned on Y_1 ... Y_{i−1} and X_{i+1} ... X_N.
Call the label assignment Y_1 ... Y_{i−1}, X_{i+1} ... X_N L_(−i).
We need to compute P(Y_i | D, L_(−i), α, β, γ).
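A sketch of the resulting sampler loop, with all names our own. For the conditional P(L_i | D, L_(−i), α, β, γ) it uses simple count-plus-pseudocount predictive probabilities computed from the other documents, treating the words of W_i as independent given those counts; this is a simplification of the exact collapsed conditional, not the lecture's derivation.

    import math
    import random
    from collections import Counter

    def class_counts(docs, labels, exclude):
        """Document and word counts per class over all documents except `exclude`."""
        n_docs = {'+': 0, '-': 0}
        words = {'+': Counter(), '-': Counter()}
        for j, (counts, label) in enumerate(zip(docs, labels)):
            if j != exclude:
                n_docs[label] += 1
                words[label].update(counts)
        return n_docs, words

    def gibbs_sweep(docs, labels, vocab, alpha=1.0, beta=1.0, gamma=1.0):
        """One Gibbs sweep: resample each L_i given the other labels L_(-i).
        docs: list of bag-of-words Counters; labels: current list of '+'/'-' labels."""
        V = len(vocab)
        for i, counts in enumerate(docs):
            n_docs, words = class_counts(docs, labels, exclude=i)
            log_p = {}
            for cand, a in (('+', alpha), ('-', beta)):
                log_p[cand] = math.log(n_docs[cand] + a)   # label term; common denominator dropped
                total = sum(words[cand].values())
                for w, n in counts.items():
                    log_p[cand] += n * math.log((words[cand][w] + gamma) / (total + gamma * V))
            m = max(log_p.values())
            exp_plus = math.exp(log_p['+'] - m)
            exp_minus = math.exp(log_p['-'] - m)
            p_plus = exp_plus / (exp_plus + exp_minus)
            labels[i] = '+' if random.random() < p_plus else '-'
        return labels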