Naive Bayes and Gaussian Bayes Classifier
Ladislav Rampasek
(slides by Mengye Ren and others)
February 22, 2016
Naive Bayes

Bayes' Rule:
$$p(t \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid t)\, p(t)}{p(\mathbf{x})}$$

Naive Bayes assumption:
$$p(\mathbf{x} \mid t) = \prod_{j=1}^{D} p(x_j \mid t)$$

Likelihood function:
$$L(\theta) = p(\mathbf{x}, t \mid \theta) = p(\mathbf{x} \mid t, \theta)\, p(t \mid \theta)$$
Example: Spam Classification

Each vocabulary word is one feature dimension. We encode each email as a binary feature vector x ∈ {0, 1}^{|V|}, where x_j = 1 iff vocabulary word j appears in the email.

We want to model the probability of each word x_j appearing in an email, given that the email is spam or not. Example words: "$10,000", "Toronto", "Piazza", etc.

Idea: use a Bernoulli distribution to model p(x_j | t), e.g. p("$10,000" | spam) = 0.3.
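Not from the original slides: a minimal NumPy sketch of this binary bag-of-words encoding. The vocabulary and the example email below are made-up placeholders, and real preprocessing would need proper tokenization.

```python
import numpy as np

# Hypothetical toy vocabulary; in practice V is built from the training corpus.
vocab = ["$10,000", "toronto", "piazza", "meeting", "viagra"]

def encode(email_text, vocab):
    """Encode an email as x in {0,1}^{|V|}: x_j = 1 iff vocabulary word j appears."""
    tokens = set(email_text.lower().split())
    return np.array([1 if word.lower() in tokens else 0 for word in vocab])

x = encode("Win $10,000 now!!!", vocab)  # -> array([1, 0, 0, 0, 0])
```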
Bernoulli Naive Bayes

Assuming all data points x^{(i)} are i.i.d. samples and p(x_j | t) follows a Bernoulli distribution with parameter µ_{jt}:
$$p(\mathbf{x}^{(i)} \mid t^{(i)}) = \prod_{j=1}^{D} \mu_{j t^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{j t^{(i)}}\right)^{1 - x_j^{(i)}}$$
$$p(\mathbf{t} \mid \mathbf{x}) \propto \prod_{i=1}^{N} p(t^{(i)})\, p(\mathbf{x}^{(i)} \mid t^{(i)}) = \prod_{i=1}^{N} \pi_{t^{(i)}} \prod_{j=1}^{D} \mu_{j t^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{j t^{(i)}}\right)^{1 - x_j^{(i)}}$$
where p(t) = π_t. The parameters π_t, µ_{jt} can be learnt using maximum likelihood.
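As an illustration (not part of the slides), a short sketch of evaluating these quantities in log space; `mu_t` and `pi_t` are assumed to be already-estimated parameters for one class.

```python
import numpy as np

def bernoulli_log_likelihood(x, mu_t):
    """log p(x | t) = sum_j [ x_j log mu_jt + (1 - x_j) log(1 - mu_jt) ]."""
    return np.sum(x * np.log(mu_t) + (1 - x) * np.log(1 - mu_t))

def log_joint(x, mu_t, pi_t):
    """log p(x, t) = log pi_t + log p(x | t); classify by taking argmax over classes."""
    return np.log(pi_t) + bernoulli_log_likelihood(x, mu_t)
```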
Derivation of maximum likelihood estimator (MLE)

θ = [µ, π]
$$\log L(\theta) = \log p(\mathbf{x}, \mathbf{t} \mid \theta) = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} \left( x_j^{(i)} \log \mu_{j t^{(i)}} + (1 - x_j^{(i)}) \log(1 - \mu_{j t^{(i)}}) \right) \right]$$
Want: arg max_θ log L(θ) subject to Σ_k π_k = 1.
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. µ:
$$\frac{\partial \log L(\theta)}{\partial \mu_{jk}} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right] = 0$$
$$\Rightarrow \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ x_j^{(i)} (1 - \mu_{jk}) - (1 - x_j^{(i)})\, \mu_{jk} \right] = 0$$
$$\Rightarrow \mu_{jk} \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, x_j^{(i)}$$
$$\Rightarrow \mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}$$
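The closed-form estimator translates directly into a couple of lines of NumPy; this sketch (not from the slides) assumes `X` is the N×D binary design matrix and `t` is the vector of integer labels.

```python
import numpy as np

def fit_mu(X, t, k):
    """MLE: mu_jk = (# class-k examples containing word j) / (# class-k examples)."""
    mask = (t == k)
    return X[mask].sum(axis=0) / mask.sum()
```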
Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive π:
$$\frac{\partial}{\partial \pi_k} \left[ \log L(\theta) + \lambda \left( \sum_{\kappa} \pi_{\kappa} - 1 \right) \right] = 0 \;\Rightarrow\; \frac{1}{\pi_k} \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) + \lambda = 0$$
$$\Rightarrow \pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}{-\lambda}$$
Apply the constraint Σ_k π_k = 1 ⇒ λ = −N, so
$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}{N}$$
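Likewise for the prior; a hedged sketch assuming integer class labels 0..K−1.

```python
import numpy as np

def fit_pi(t, num_classes):
    """MLE: pi_k = (# class-k examples) / N."""
    return np.bincount(t, minlength=num_classes) / len(t)
```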
Spam Classification Demo
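The original demo is not reproduced in these notes; below is a self-contained substitute on synthetic data, with made-up word probabilities and Laplace smoothing added for numerical safety.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "emails": 200 non-spam (t=0) and 200 spam (t=1) over D=5 vocabulary words.
D = 5
mu_true = np.array([[0.05, 0.4, 0.5, 0.01, 0.3],   # class-0 word probabilities (invented)
                    [0.30, 0.1, 0.1, 0.40, 0.1]])  # class-1 word probabilities (invented)
t = np.repeat([0, 1], 200)
X = (rng.random((400, D)) < mu_true[t]).astype(int)

# Fit: class priors and per-class Bernoulli parameters (with add-one smoothing).
pi = np.bincount(t) / len(t)
mu = np.array([(X[t == k].sum(axis=0) + 1) / ((t == k).sum() + 2) for k in (0, 1)])

# Predict: argmax_k  log pi_k + sum_j [ x_j log mu_jk + (1 - x_j) log(1 - mu_jk) ]
log_post = np.log(pi) + X @ np.log(mu.T) + (1 - X) @ np.log(1 - mu.T)
pred = log_post.argmax(axis=1)
print("training accuracy:", (pred == t).mean())
```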
Gaussian Bayes Classifier

Instead of assuming conditional independence of the x_j, we model p(x | t) as a Gaussian distribution; the dependence among the x_j is encoded in the covariance matrix.

Multivariate Gaussian distribution:
$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$
µ: mean, Σ: covariance matrix, D = dim(x)
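A small sketch (not from the slides) of evaluating this log-density numerically; using `slogdet` and `solve` avoids forming the explicit inverse.

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """log N(x | mu, Sigma) for a D-dimensional Gaussian."""
    D = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
```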
Derivation of maximum likelihood estimator (MLE)

θ = [µ, Σ, π],  Z = √((2π)^D det(Σ))
$$p(\mathbf{x} \mid t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$
$$\log L(\theta) = \log p(\mathbf{x}, \mathbf{t} \mid \theta) = \log p(\mathbf{t} \mid \theta) + \log p(\mathbf{x} \mid \mathbf{t}, \theta) = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z_{t^{(i)}} - \frac{1}{2} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_{t^{(i)}})^T \Sigma_{t^{(i)}}^{-1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_{t^{(i)}}) \right]$$
Want: arg max_θ log L(θ) subject to Σ_k π_k = 1.
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. µ:
$$\frac{\partial \log L}{\partial \boldsymbol{\mu}_k} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, \Sigma_k^{-1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k) = 0$$
$$\Rightarrow \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, \mathbf{x}^{(i)}}{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}$$
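In code this is just a per-class average; a sketch assuming `X` is N×D and `t` holds integer labels.

```python
import numpy as np

def fit_means(X, t, num_classes):
    """MLE: mu_k is the mean of the training points with label k."""
    return np.array([X[t == k].mean(axis=0) for k in range(num_classes)])
```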
Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. Σ^{-1} (not Σ). Note:
$$\frac{\partial \det(A)}{\partial A} = \det(A)\, A^{-T}, \qquad \det(A^{-1}) = \det(A)^{-1}, \qquad \frac{\partial\, \mathbf{x}^T A \mathbf{x}}{\partial A} = \mathbf{x}\mathbf{x}^T, \qquad \Sigma^T = \Sigma$$
$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ -\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} - \frac{1}{2} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T \right] = 0$$
Derivation of maximum likelihood estimator (MLE)

$$Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)} = (2\pi)^{D/2} \det(\Sigma_k^{-1})^{-1/2}$$
$$\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = \frac{(2\pi)^{D/2}}{Z_k} \frac{\partial \det(\Sigma_k^{-1})^{-1/2}}{\partial \Sigma_k^{-1}} = \frac{(2\pi)^{D/2}}{Z_k} \left( -\frac{1}{2} \right) \det(\Sigma_k^{-1})^{-3/2} \det(\Sigma_k^{-1})\, \Sigma_k^T = -\frac{1}{2} \Sigma_k$$
$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k) \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T \right] = 0$$
$$\Rightarrow \Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)\, (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T}{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}$$
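The covariance estimator is the average outer product of centred class members (no Bessel correction, matching the MLE above); a sketch under the same assumptions as before.

```python
import numpy as np

def fit_covariances(X, t, means):
    """MLE: Sigma_k = (1/N_k) sum_i 1(t_i = k) (x_i - mu_k)(x_i - mu_k)^T."""
    Sigmas = []
    for k, mu_k in enumerate(means):
        Xk = X[t == k] - mu_k
        Sigmas.append(Xk.T @ Xk / len(Xk))
    return np.array(Sigmas)
```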
Derivation of maximum likelihood estimator (MLE)

$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}(t^{(i)} = k)}{N}$$
(Same as in the Bernoulli case.)
Gaussian Bayes Classifier Demo
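Again, the original demo is not included here; the following substitute sketch fits and evaluates a two-class Gaussian Bayes classifier on synthetic 2-d data (all parameters invented).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-d data from two Gaussians with different covariances (made-up parameters).
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 2], [[0.5, -0.2], [-0.2, 0.8]], size=200)
X = np.vstack([X0, X1]); t = np.repeat([0, 1], 200)

# MLE fit of pi_k, mu_k, Sigma_k (the formulas from the preceding slides).
pi = np.bincount(t) / len(t)
mu = np.array([X[t == k].mean(axis=0) for k in (0, 1)])
Sigma = np.array([np.cov(X[t == k].T, bias=True) for k in (0, 1)])

# Class scores up to a shared constant: log pi_k + log N(x | mu_k, Sigma_k).
def log_scores(Xq):
    out = []
    for k in (0, 1):
        diff = Xq - mu[k]
        _, logdet = np.linalg.slogdet(Sigma[k])
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma[k]), diff)
        out.append(np.log(pi[k]) - 0.5 * (logdet + quad))
    return np.column_stack(out)

print("training accuracy:", (log_scores(X).argmax(axis=1) == t).mean())
```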
Gaussian Bayes Classifier

If we constrain Σ_t to be diagonal, then p(x | t) factorizes into a product of p(x_j | t):
$$p(\mathbf{x} \mid t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\sum_{j=1}^{D} \frac{(x_j - \mu_{jt})^2}{2\, \Sigma_{t,jj}} \right) = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\, \Sigma_{t,jj}}} \exp\left( -\frac{(x_j - \mu_{jt})^2}{2\, \Sigma_{t,jj}} \right) = \prod_{j=1}^{D} p(x_j \mid t)$$
A diagonal covariance matrix satisfies the naive Bayes assumption.
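A quick numerical check of this factorization (not from the slides), using assumed toy values and SciPy for the reference densities.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# With a diagonal covariance, the joint Gaussian density equals the product of
# univariate Gaussian densities (toy values below are made up).
mu = np.array([0.5, -1.0, 2.0])
var = np.array([1.0, 0.25, 4.0])          # diagonal of Sigma_t
x = np.array([0.3, -0.8, 1.5])

joint = multivariate_normal(mu, np.diag(var)).pdf(x)
factorized = np.prod(norm(mu, np.sqrt(var)).pdf(x))
print(np.allclose(joint, factorized))     # True
```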
Gaussian Bayes Classifier

Case 1: the covariance matrix is shared among classes:
$$p(\mathbf{x} \mid t) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_t, \Sigma)$$
Case 2: each class has its own covariance:
$$p(\mathbf{x} \mid t) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_t, \Sigma_t)$$
Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the boundary satisfies p(x, t = 1) = p(x, t = 0):
$$\log \pi_1 - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) = \log \pi_0 - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_0)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_0)$$
$$\mathbf{x}^T \Sigma^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_1^T \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 = C + \mathbf{x}^T \Sigma^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_0^T \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_0^T \Sigma^{-1} \boldsymbol{\mu}_0, \qquad C = 2 \log \frac{\pi_1}{\pi_0}$$
$$\Rightarrow \left[ 2 (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)^T \Sigma^{-1} \right] \mathbf{x} - \left( \boldsymbol{\mu}_0^T \Sigma^{-1} \boldsymbol{\mu}_0 - \boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 \right) = C \;\Rightarrow\; \mathbf{a}^T \mathbf{x} - b = 0$$
The decision boundary is a linear function (a hyperplane in general).
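To make the hyperplane explicit, here is a hedged sketch that returns the coefficients a and b implied by the algebra above, with a = 2Σ^{-1}(µ₀ − µ₁) and b absorbing the remaining constants; the function name and argument order are my own.

```python
import numpy as np

def shared_cov_boundary(mu0, mu1, Sigma, pi0, pi1):
    """Return (a, b) so that the decision boundary is a^T x - b = 0."""
    Sinv = np.linalg.inv(Sigma)
    a = 2 * Sinv @ (mu0 - mu1)
    b = mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1 + 2 * np.log(pi1 / pi0)
    return a, b
```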
Relation to Logistic Regression

We can write the posterior distribution p(t = 0 | x) as
$$\frac{p(\mathbf{x}, t=0)}{p(\mathbf{x}, t=0) + p(\mathbf{x}, t=1)} = \frac{\pi_0\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_0, \Sigma)}{\pi_0\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_0, \Sigma) + \pi_1\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma)}$$
$$= \left\{ 1 + \frac{\pi_1}{\pi_0} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_0)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_0) \right] \right\}^{-1}$$
$$= \left\{ 1 + \exp\left[ \log \frac{\pi_1}{\pi_0} + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^T \Sigma^{-1} \mathbf{x} - \frac{1}{2} \left( \boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_0^T \Sigma^{-1} \boldsymbol{\mu}_0 \right) \right] \right\}^{-1}$$
$$= \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x} - b)}$$
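The following sketch (not part of the slides) computes p(t = 0 | x) via this logistic form and checks it against the generative definition; all numbers are made up.

```python
import numpy as np

def posterior_class0(x, mu0, mu1, Sigma, pi0, pi1):
    """p(t=0 | x) written as a logistic function of x (shared covariance)."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu0 - mu1)
    b = 0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) - np.log(pi1 / pi0)
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def gauss(x, mu, S):
    """Multivariate normal density N(x | mu, S)."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / np.sqrt(np.linalg.det(2 * np.pi * S))

# Agreement with the generative definition p(x, t=0) / (p(x, t=0) + p(x, t=1)):
mu0, mu1 = np.array([0., 0.]), np.array([1., 2.])
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])
x = np.array([0.5, 0.5])
direct = 0.6 * gauss(x, mu0, Sigma) / (0.6 * gauss(x, mu0, Sigma) + 0.4 * gauss(x, mu1, Sigma))
print(np.isclose(direct, posterior_class0(x, mu0, mu1, Sigma, 0.6, 0.4)))  # True
```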
Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the boundary satisfies p(x, t = 1) = p(x, t = 0):
$$\log \pi_1 - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^T \Sigma_1^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) = \log \pi_0 - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_0)^T \Sigma_0^{-1} (\mathbf{x} - \boldsymbol{\mu}_0)$$
$$\Rightarrow \mathbf{x}^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) \mathbf{x} - 2 \left( \boldsymbol{\mu}_1^T \Sigma_1^{-1} - \boldsymbol{\mu}_0^T \Sigma_0^{-1} \right) \mathbf{x} + \left( \boldsymbol{\mu}_1^T \Sigma_1^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_0^T \Sigma_0^{-1} \boldsymbol{\mu}_0 \right) = C, \qquad C = 2 \log \frac{\pi_1}{\pi_0}$$
$$\Rightarrow \mathbf{x}^T Q \mathbf{x} - 2 \mathbf{b}^T \mathbf{x} + c = 0$$
The decision boundary is a quadratic function; in the 2-d case it looks like an ellipse, a parabola, or a hyperbola.
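For completeness, a sketch (mine, not the slides') that returns the quadratic-form coefficients Q, b, c implied by the equation above, with the constant 2 log(π₁/π₀) folded into c.

```python
import numpy as np

def quadratic_boundary(mu0, mu1, Sigma0, Sigma1, pi0, pi1):
    """Return (Q, b, c) with decision boundary x^T Q x - 2 b^T x + c = 0."""
    S0inv, S1inv = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
    Q = S1inv - S0inv
    b = S1inv @ mu1 - S0inv @ mu0
    c = mu1 @ S1inv @ mu1 - mu0 @ S0inv @ mu0 - 2 * np.log(pi1 / pi0)
    return Q, b, c
```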
Thanks!