Probabilistic modeling
Subhransu Maji
CMPSCI 689: Machine Learning
3 March 2015 / 5 March 2015
Administrivia
Mini-project 1 due Thursday, March 05
Turn in a hard copy:
‣ In the next class
‣ Or in the CS main office reception area by 4:00pm (mention 689 hw)
Clearly write your name and student id on the front page.
Late submissions:
‣ At most 48 hours late, at a 50% deduction (by 4:00pm March 07)
‣ More than 48 hours late gets zero
‣ Submit a pdf via email to the TA: xiaojian@cs.umass.edu
Overview
So far the models and algorithms you have learned about are relatively disconnected.
The probabilistic modeling framework unites the two.
Learning can be viewed as statistical inference.
Two kinds of data models:
‣ Generative
‣ Conditional
Two kinds of probability models:
‣ Parametric
‣ Non-parametric
Classification by density estimation
The data is generated according to a distribution $D$:
$$(x, y) \sim D$$
Suppose you had access to $D$; then classification becomes simple:
$$\hat{y} = \arg\max_{y} D(\hat{x}, y)$$
This is the Bayes optimal classifier, which achieves the smallest expected loss among all classifiers:
$$\epsilon(\hat{y}) = \mathbb{E}_{(x,y)\sim D}\left[\ell(y, \hat{y})\right] \quad \text{(expected loss of a predictor)}$$
$$\ell(y, \hat{y}) = \begin{cases} 1 & \text{if } y \neq \hat{y} \\ 0 & \text{otherwise} \end{cases} \qquad \ell(y, \hat{y}) \in \{0, 1\}$$
Unfortunately, we don't have access to the distribution.
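A minimal sketch of the idea on this slide, not from the lecture: if the joint distribution were known, the Bayes optimal classifier simply picks the label with the largest joint probability at the observed input. The toy distribution `D` below is made up for illustration.

```python
# Hypothetical toy joint distribution: D[(x, y)] = probability of feature value x with label y.
D = {
    ("sunny", +1): 0.30, ("sunny", -1): 0.10,
    ("rainy", +1): 0.15, ("rainy", -1): 0.45,
}

def bayes_optimal(x, labels=(+1, -1)):
    """Return argmax_y D(x, y), the Bayes optimal prediction for input x."""
    return max(labels, key=lambda y: D.get((x, y), 0.0))

print(bayes_optimal("sunny"))  # +1
print(bayes_optimal("rainy"))  # -1
```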
Classification by density estimation
This suggests that one way to learn a classifier is to estimate $D$:
$$\text{Training data } (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \sim D \;\xrightarrow{\;\text{Estimation}\;}\; \hat{D} \quad \text{(a parametric distribution, e.g. Gaussian } \mathcal{N}(\mu, \sigma^2)\text{)}$$
Estimate the parameters of the distribution.
We will assume that each point is independently generated from $D$:
‣ A new point doesn't depend on previous points
‣ Commonly referred to as the i.i.d. assumption (independently and identically distributed)
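A minimal sketch, assuming a Gaussian data model as in the slide's example: for i.i.d. samples, the maximum likelihood estimates of a Gaussian's parameters are the sample mean and the (biased) sample variance. The data below is simulated, not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=1000)  # pretend this is i.i.d. training data from D

mu_hat = samples.mean()            # MLE of the mean
sigma2_hat = samples.var(ddof=0)   # MLE of the variance (divides by n, not n-1)
print(mu_hat, sigma2_hat)          # close to 2.0 and 1.5**2 = 2.25
```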
Statistical estimation
Coin toss: observed sequence {H, T, H, H}
Probability of H: $\beta$
What is the value of $\beta$ that best explains the observed data?
Maximum likelihood principle (MLE): pick the parameters of the distribution that maximize the likelihood of the observed data.
Likelihood of the data (using the i.i.d. assumption):
$$p_\beta(\text{data}) = p_\beta(\text{H,T,H,H}) = p_\beta(\text{H})\, p_\beta(\text{T})\, p_\beta(\text{H})\, p_\beta(\text{H}) = \beta \times (1-\beta) \times \beta \times \beta = \beta^3(1-\beta)$$
Maximize the likelihood:
$$\frac{d\, p_\beta(\text{data})}{d\beta} = \frac{d\, \beta^3(1-\beta)}{d\beta} = 3\beta^2(1-\beta) + \beta^3(-1) = 0 \;\Longrightarrow\; \beta = \frac{3}{4}$$
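A quick numerical check of this result, as an illustration rather than part of the slides: a grid search over $\beta$ for the likelihood $\beta^3(1-\beta)$ recovers the analytical MLE of 3/4.

```python
import numpy as np

beta = np.linspace(0.0, 1.0, 10001)      # candidate values of beta
likelihood = beta**3 * (1 - beta)        # likelihood of the sequence {H, T, H, H}
print(beta[np.argmax(likelihood)])       # 0.75, matching the closed-form MLE
```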
Log-likelihood
It is convenient to maximize the logarithm of the likelihood instead.
Log-likelihood of the observed data:
$$\log p_\beta(\text{data}) = \log p_\beta(\text{H,T,H,H}) = \log p_\beta(\text{H}) + \log p_\beta(\text{T}) + \log p_\beta(\text{H}) + \log p_\beta(\text{H})$$
$$= \log\beta + \log(1-\beta) + \log\beta + \log\beta = 3\log\beta + \log(1-\beta)$$
Maximizing the log-likelihood is equivalent to maximizing the likelihood:
‣ Log is a concave, monotonic function
‣ Products become sums
‣ Numerically stable
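A small sketch of the "numerically stable" point above (the probabilities are arbitrary, chosen only to illustrate): multiplying many small probabilities underflows in floating point, while summing their logarithms does not.

```python
import numpy as np

p = np.full(2000, 0.1)       # 2000 i.i.d. observations, each with probability 0.1
print(np.prod(p))            # 0.0: the raw likelihood underflows
print(np.sum(np.log(p)))     # about -4605.17: the log-likelihood is still usable
```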
Log-likelihood
Log-likelihood of observing H-many heads and T-many tails:
$$\log p_\beta(\text{data}) = H \log\beta + T \log(1-\beta)$$
Maximizing the log-likelihood:
$$\frac{d}{d\beta}\left[H \log\beta + T \log(1-\beta)\right] = \frac{H}{\beta} - \frac{T}{1-\beta} = 0 \;\Longrightarrow\; \beta = \frac{H}{H+T}$$
Rolling a die
Suppose you are rolling a k-sided die with parameters $\theta_1, \theta_2, \ldots, \theta_k$.
You observe counts $x_1, x_2, \ldots, x_k$ (the number of times each face came up).
Log-likelihood of the data:
$$\log p(\text{data}) = \sum_k x_k \log\theta_k$$
Maximizing the log-likelihood by setting the derivative to zero:
$$\frac{d \log p(\text{data})}{d\theta_k} = \frac{x_k}{\theta_k} = 0 \;\Longrightarrow\; \theta_k = \infty$$
We need an additional constraint:
$$\sum_k \theta_k = 1$$
Lagrange multipliers
Constrained optimization:
$$\max_{\theta_1, \theta_2, \ldots, \theta_k} \sum_k x_k \log\theta_k \qquad \text{subject to: } \sum_k \theta_k = 1$$
Unconstrained optimization:
$$\min_\lambda \max_{\{\theta_1, \theta_2, \ldots, \theta_k\}} \sum_k x_k \log\theta_k + \lambda\left(1 - \sum_k \theta_k\right)$$
‣ At optimality: $\dfrac{x_k}{\theta_k} = \lambda \;\Longrightarrow\; \theta_k = \dfrac{x_k}{\lambda}$
‣ The constraint $\sum_k \theta_k = 1$ then gives $\lambda = \sum_k x_k$, so $\theta_k = \dfrac{x_k}{\sum_{k'} x_{k'}}$
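A minimal sketch of the result just derived: under the sum-to-one constraint, the MLE for a k-sided die is simply the normalized counts. The counts below are made up.

```python
import numpy as np

counts = np.array([10, 7, 3, 20, 5, 15])   # x_k: number of times each face was observed
theta_hat = counts / counts.sum()          # theta_k = x_k / sum_k' x_k'
print(theta_hat, theta_hat.sum())          # the estimates sum to 1
```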
Naive Bayes
Consider the binary prediction problem.
Let the data be distributed according to a probability distribution:
$$p_\theta(y, x) = p_\theta(y, x_1, x_2, \ldots, x_D)$$
We can simplify this using the chain rule of probability:
$$p_\theta(y, x) = p_\theta(y)\, p_\theta(x_1 \mid y)\, p_\theta(x_2 \mid x_1, y) \cdots p_\theta(x_D \mid x_1, x_2, \ldots, x_{D-1}, y) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid x_1, x_2, \ldots, x_{d-1}, y)$$
Naive Bayes assumption:
$$p_\theta(x_d \mid x_{d'}, y) = p_\theta(x_d \mid y), \quad \forall\, d' \neq d$$
E.g., the words "free" and "money" are independent given spam.
Naive Bayes
Naive Bayes assumption:
$$p_\theta(x_d \mid x_{d'}, y) = p_\theta(x_d \mid y), \quad \forall\, d' \neq d$$
We can simplify the joint probability distribution as:
$$p_\theta(y, x) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid x_1, x_2, \ldots, x_{d-1}, y) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid y) \quad \text{// simpler distribution}$$
At this point we can start parametrizing the distribution.
Naive Bayes: a simple case
Case: binary labels and binary features
$$p_\theta(y) = \text{Bernoulli}(\theta_0)$$
$$p_\theta(x_d \mid y = +1) = \text{Bernoulli}(\theta_d^+)$$
$$p_\theta(x_d \mid y = -1) = \text{Bernoulli}(\theta_d^-)$$
This gives $1 + 2D$ parameters in total.
Probability of the data:
$$p_\theta(y, x) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid y) = \theta_0^{[y=+1]} (1-\theta_0)^{[y=-1]} \times \underbrace{\prod_{d=1}^{D} (\theta_d^+)^{[x_d=1,\, y=+1]} (1-\theta_d^+)^{[x_d=0,\, y=+1]}}_{\text{label } +1} \times \underbrace{\prod_{d=1}^{D} (\theta_d^-)^{[x_d=1,\, y=-1]} (1-\theta_d^-)^{[x_d=0,\, y=-1]}}_{\text{label } -1}$$
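A minimal sketch of evaluating this joint probability for binary labels and binary features; the parameter values below are hypothetical, not estimated from any data.

```python
import numpy as np

theta0 = 0.4                            # p(y = +1)
theta_pos = np.array([0.8, 0.1, 0.6])   # p(x_d = 1 | y = +1), d = 1..D
theta_neg = np.array([0.2, 0.5, 0.3])   # p(x_d = 1 | y = -1)

def joint(y, x):
    """p_theta(y, x) = p(y) * prod_d p(x_d | y) for x in {0,1}^D, y in {+1,-1}."""
    prior = theta0 if y == +1 else 1 - theta0
    theta = theta_pos if y == +1 else theta_neg
    return prior * np.prod(np.where(x == 1, theta, 1 - theta))

x = np.array([1, 0, 1])
print(joint(+1, x), joint(-1, x))
```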
Naive Bayes: parameter estimation
Given data we can estimate the parameters by maximizing the data likelihood.
Similar to the coin toss example, the maximum likelihood estimates are:
$$\hat\theta_0 = \frac{\sum_n [y_n = +1]}{N} \quad \text{// fraction of the data with label } +1$$
$$\hat\theta_d^+ = \frac{\sum_n [x_{d,n} = 1,\, y_n = +1]}{\sum_n [y_n = +1]} \quad \text{// fraction of instances with } x_d = 1 \text{ among label } +1$$
$$\hat\theta_d^- = \frac{\sum_n [x_{d,n} = 1,\, y_n = -1]}{\sum_n [y_n = -1]} \quad \text{// fraction of instances with } x_d = 1 \text{ among label } -1$$
Other cases (the choice of distribution is an inductive bias):
‣ Nominal features: Multinomial distribution (like rolling a die)
‣ Continuous features: Gaussian distribution
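A minimal sketch of these counting estimates; the toy arrays `X` (binary features) and `y` (labels in {+1, -1}) are made up for illustration.

```python
import numpy as np

def estimate_naive_bayes(X, y):
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.mean()                 # fraction of examples with label +1
    theta_pos = X[pos].mean(axis=0)     # per feature: fraction of 1s among label +1 examples
    theta_neg = X[neg].mean(axis=0)     # per feature: fraction of 1s among label -1 examples
    # (In practice one would add smoothing to avoid zero counts.)
    return theta0, theta_pos, theta_neg

# Toy data: 4 examples, 3 binary features.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([+1, +1, -1, -1])
print(estimate_naive_bayes(X, y))
```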
Naive Bayes: prediction
To make predictions, compute the posterior distribution:
$$\hat{y} = \arg\max_y p_\theta(y \mid x) \quad \text{// Bayes optimal prediction}$$
$$= \arg\max_y \frac{p_\theta(y, x)}{p_\theta(x)} \quad \text{// Bayes rule}$$
$$= \arg\max_y p_\theta(y, x)$$
For binary labels we can also compute the likelihood ratio:
$$\text{LR} = \frac{p_\theta(+1, x)}{p_\theta(-1, x)}, \qquad \hat{y} = \begin{cases} +1 & \text{LR} \geq 1 \\ -1 & \text{otherwise} \end{cases}$$
Or the log-likelihood ratio:
$$\text{LLR} = \log p_\theta(+1, x) - \log p_\theta(-1, x), \qquad \hat{y} = \begin{cases} +1 & \text{LLR} \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
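A minimal sketch of prediction via the log-likelihood ratio, reusing the hypothetical parameter values from the earlier sketches.

```python
import numpy as np

def predict(x, theta0, theta_pos, theta_neg):
    """Return +1 if the log-likelihood ratio is non-negative, else -1."""
    def log_joint(prior, theta):
        return np.log(prior) + np.sum(np.where(x == 1, np.log(theta), np.log(1 - theta)))
    llr = log_joint(theta0, theta_pos) - log_joint(1 - theta0, theta_neg)
    return +1 if llr >= 0 else -1

theta0, theta_pos, theta_neg = 0.4, np.array([0.8, 0.1, 0.6]), np.array([0.2, 0.5, 0.3])
print(predict(np.array([1, 0, 1]), theta0, theta_pos, theta_neg))
```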
Naive Bayes: decision boundary
$$\begin{aligned}
\text{LLR} &= \log p_\theta(+1, x) - \log p_\theta(-1, x) \\
&= \log\left(\theta_0 \prod_{d=1}^{D} (\theta_d^+)^{[x_d=1]} (1-\theta_d^+)^{[x_d=0]}\right) - \log\left((1-\theta_0) \prod_{d=1}^{D} (\theta_d^-)^{[x_d=1]} (1-\theta_d^-)^{[x_d=0]}\right) \\
&= \log\theta_0 - \log(1-\theta_0) + \sum_{d=1}^{D} [x_d=1]\left(\log\theta_d^+ - \log\theta_d^-\right) + \sum_{d=1}^{D} [x_d=0]\left(\log(1-\theta_d^+) - \log(1-\theta_d^-)\right) \\
&= \log\left(\frac{\theta_0}{1-\theta_0}\right) + \sum_{d=1}^{D} [x_d=1]\log\left(\frac{\theta_d^+}{\theta_d^-}\right) + \sum_{d=1}^{D} [x_d=0]\log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right) \\
&= \log\left(\frac{\theta_0}{1-\theta_0}\right) + \sum_{d=1}^{D} x_d \log\left(\frac{\theta_d^+}{\theta_d^-}\right) + \sum_{d=1}^{D} (1-x_d) \log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right) \\
&= \log\left(\frac{\theta_0}{1-\theta_0}\right) + \sum_{d=1}^{D} \log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right) + \sum_{d=1}^{D} x_d \left(\log\left(\frac{\theta_d^+}{\theta_d^-}\right) - \log\left(\frac{1-\theta_d^+}{1-\theta_d^-}\right)\right) \\
&= w^T x + b
\end{aligned}$$
The naive Bayes classifier has a linear decision boundary!
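A sketch checking this derivation numerically: with $w$ and $b$ read off from the last line above, $w^T x + b$ reproduces the log-likelihood ratio. The parameter values are the same hypothetical ones used in the earlier sketches.

```python
import numpy as np

theta0 = 0.4
theta_pos = np.array([0.8, 0.1, 0.6])
theta_neg = np.array([0.2, 0.5, 0.3])

# Weights and bias from the derivation on this slide.
w = np.log(theta_pos / theta_neg) - np.log((1 - theta_pos) / (1 - theta_neg))
b = np.log(theta0 / (1 - theta0)) + np.sum(np.log((1 - theta_pos) / (1 - theta_neg)))

x = np.array([1, 0, 1])
llr = (np.log(theta0) + np.sum(np.where(x == 1, np.log(theta_pos), np.log(1 - theta_pos)))
       - np.log(1 - theta0) - np.sum(np.where(x == 1, np.log(theta_neg), np.log(1 - theta_neg))))
print(llr, w @ x + b)   # the two values agree
```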
Generative and conditional models
Generative models:
‣ Model the joint distribution p(x, y)
‣ Use Bayes rule to compute the label posterior
‣ Need to make simplifying assumptions (e.g., naive Bayes)
In most cases we are given x and are only interested in the labels y.
Conditional models:
‣ Model the distribution p(y | x)
‣ Saves some modeling effort
‣ Can assume a simpler parametrization of the distribution p(y | x)
‣ Most of the ML we have covered so far directly aimed at predicting y from x