Logistic Regression
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
September 1, 2014
Recall: Linear Regression

[Figure: scatter plot of Power (bhp) against Engine displacement (cc)]

§ Assume: the relation is linear
§ Then for a given x (= 1800), predict the value of y
§ Both the dependent and the independent variables are continuous
Scenario: Heart Disease vs. Age

[Figure: training set plotted as Heart disease (Y: Yes/No) against Age (X, 0–100)]

§ Age (numerical): independent variable
§ Heart disease (Yes/No): dependent variable with two classes
§ Task: given a new person's age, predict if (s)he has heart disease
§ The task: calculate P(Y = Yes | X)
Scenario: Heart Disease vs. Age

[Figure: the same training set, with a probability curve over Age (X)]

§ Age (numerical): independent variable
§ Heart disease (Yes/No): dependent variable with two classes
§ Task: given a new person's age, predict if (s)he has heart disease
§ Calculate P(Y = Yes | X) for different ranges of X
§ Fit a curve that estimates the probability P(Y = Yes | X)
The Logistic Function

§ Logistic function on t: takes values between 0 and 1

$$\mathrm{Logistic}(t) = \frac{e^t}{1 + e^t} = \frac{1}{1 + e^{-t}}$$

§ If t is a linear function of x, $t = \beta_0 + \beta_1 x$, the logistic function becomes

$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

§ F(x): the probability of the dependent variable Y taking one value against the other

[Figure: the logistic curve L(t)]
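A minimal Python sketch of this function. The values of β₀ and β₁ below are illustrative assumptions, not fitted parameters:

```python
import numpy as np

def logistic(t):
    """Logistic function: maps any real t into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# With t = beta0 + beta1 * x, F(x) is the class probability.
# beta0 and beta1 here are made-up values for illustration.
beta0, beta1 = -5.0, 0.1
x = np.array([20.0, 50.0, 80.0])       # e.g. ages
print(logistic(beta0 + beta1 * x))     # [0.047..., 0.5, 0.952...]
```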
The Likelihood Function

§ Let a discrete random variable X have a probability distribution $p(x; \theta)$ that depends on a parameter $\theta$
§ For the Bernoulli distribution: $p(x; \theta) = \theta^x (1 - \theta)^{1-x}$
§ Intuitively, likelihood measures "how likely" an outcome is under the parameter $\theta$
  – For x = 1, $p(x; \theta) = \theta$
  – For x = 0, $p(x; \theta) = 1 - \theta$
§ Given a set of data points $x_1, x_2, \ldots, x_n$, the likelihood function is defined as

$$l(\theta) = \prod_{i=1}^{n} p(x_i; \theta)$$
About the Likelihood Function

$$l(\theta) = \prod_{i=1}^{n} p(x_i; \theta)$$

§ The actual value has no meaning on its own; only the relative likelihood matters, since we want to estimate the parameter $\theta$
  – Constant factors do not matter
§ Likelihood is not a probability density function
  – The sum (or integral) does not add up to 1
§ In practice it is often easier to work with the log-likelihood
  – Provides the same relative comparison
  – The expression becomes a sum

$$L(\theta) = \ln l(\theta) = \ln\left( \prod_{i=1}^{n} p(x_i; \theta) \right) = \sum_{i=1}^{n} \ln p(x_i; \theta)$$
Example

§ Experiment: a coin toss; the coin is not known to be unbiased
§ Random variable X takes value 1 for heads and 0 for tails
§ Data: 100 outcomes, 75 heads, 25 tails

$$L(\theta) = 75 \ln(\theta) + 25 \ln(1 - \theta)$$

§ Relative likelihood: if $L(\theta_1) > L(\theta_2)$, then $\theta_1$ explains the observed data better than $\theta_2$
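A small sketch that evaluates this log-likelihood on a grid of θ values; it confirms the maximum sits at θ = 75/100 = 0.75. The grid search is only for illustration, not the estimation method developed in the following slides:

```python
import numpy as np

def log_likelihood(theta, heads=75, tails=25):
    """L(theta) = 75 ln(theta) + 25 ln(1 - theta)."""
    return heads * np.log(theta) + tails * np.log(1.0 - theta)

# Evaluate on a grid of candidate values in (0, 1).
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax(log_likelihood(thetas))]
print(best)   # 0.75
```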
Maximum Likelihood Estimate

§ Maximum likelihood estimation: estimating the set of values of the parameters (for example, $\theta$) that maximizes the likelihood function
§ Estimate:

$$\hat{\theta} = \operatorname{argmax}_\theta L(\theta) = \operatorname{argmax}_\theta \left[ \sum_{i=1}^{n} \ln p(x_i; \theta) \right]$$

§ One method: Newton's method
  – Start with some value of $\theta$ and iteratively improve
  – Converge when the improvement is negligible
§ May not always converge
Taylor's Theorem

§ If f is a
  – real-valued function,
  – k times differentiable at a point a, for an integer k > 0,
then f has a polynomial approximation at a
§ In other words, there exists a function $h_k$ such that

$$f(x) = \underbrace{f(a) + \frac{f'(a)}{1!}(x - a) + \cdots + \frac{f^{(k)}(a)}{k!}(x - a)^k}_{P(x)} + h_k(x)(x - a)^k$$

and $\lim_{x \to a} h_k(x) = 0$
§ P(x) is the polynomial approximation (the k-th order Taylor polynomial)
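A quick numerical illustration of the theorem; the choice f(x) = eˣ around a = 0 is an assumed example, not from the slides. The error of the second-order Taylor polynomial shrinks faster than (x − a)², as the remainder term promises:

```python
import numpy as np

# Second-order Taylor polynomial of f(x) = exp(x) around a = 0:
# P(x) = 1 + x + x**2 / 2, so f(x) - P(x) = h_2(x) * x**2 with h_2 -> 0.
def f(x):
    return np.exp(x)

def taylor2(x):
    return 1.0 + x + x**2 / 2.0

for x in (0.5, 0.1, 0.01):
    print(x, f(x) - taylor2(x))   # error shrinks like x**3 / 6
```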
Newton's Method

§ Goal: find the maximum $w^*$ of a function f of one variable
§ Assumptions:
  1. The function f is smooth
  2. The derivative of f at $w^*$ is 0, and the second derivative is negative
§ Start with a value $w = w_0$
§ Near the maximum, approximate the function using a second-order Taylor polynomial:

$$f(w) \approx f(w_0) + (w - w_0) \left. \frac{df}{dw} \right|_{w = w_0} + \frac{1}{2} (w - w_0)^2 \left. \frac{d^2 f}{dw^2} \right|_{w = w_0}$$
$$\approx f(w_0) + (w - w_0) f'(w_0) + \frac{1}{2} (w - w_0)^2 f''(w_0)$$

§ Iteratively improve the estimate of the maximum of f using this quadratic approximation
Newton's Method

$$f(w) \approx f(w_0) + (w - w_0) f'(w_0) + \frac{1}{2} (w - w_0)^2 f''(w_0)$$

§ Take the derivative w.r.t. w and set it to zero at a point $w_1$:

$$f'(w_1) \approx 0 = f'(w_0) + \frac{1}{2} f''(w_0) \times 2 (w_1 - w_0) \;\Rightarrow\; w_1 = w_0 - \frac{f'(w_0)}{f''(w_0)}$$

§ Iteratively:

$$w_{n+1} = w_n - \frac{f'(w_n)}{f''(w_n)}$$

§ Converges very fast, if it converges at all
§ In practice: use the optim function in R
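A minimal Python sketch of this update rule, applied to the coin-toss log-likelihood from the earlier example (the function names and tolerance below are assumptions for illustration; in R one would call optim, as the slide notes):

```python
def newton_maximize(df, d2f, w0, tol=1e-10, max_iter=100):
    """Newton's method for a 1-D maximum: w_{n+1} = w_n - f'(w_n)/f''(w_n).

    df, d2f: first and second derivatives of f.
    Assumes f is smooth and f'' < 0 near the maximum.
    """
    w = w0
    for _ in range(max_iter):
        step = df(w) / d2f(w)
        w -= step
        if abs(step) < tol:
            return w
    raise RuntimeError("Newton's method did not converge")

# Coin-toss example: L(t) = 75 ln(t) + 25 ln(1 - t)
dL  = lambda t: 75.0 / t - 25.0 / (1.0 - t)
d2L = lambda t: -75.0 / t**2 - 25.0 / (1.0 - t)**2
print(newton_maximize(dL, d2L, w0=0.5))   # converges to 0.75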
Logistic Regression: Estimating β₀ and β₁

§ Logistic function:

$$F(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

§ Log-likelihood function:
  – Say we have n data points $x_1, x_2, \ldots, x_n$
  – Outcomes $y_1, y_2, \ldots, y_n$, each either 0 or 1
  – Each $y_i$ is 1 with probability $p(x_i)$ and 0 with probability $1 - p(x_i)$

$$L(\beta) = \ln l(\beta) = \sum_{i=1}^{n} \left[ y_i \ln p(x_i) + (1 - y_i) \ln(1 - p(x_i)) \right] = \sum_{i=1}^{n} \left[ y_i (\beta_0 + \beta_1 x_i) - \ln\left(1 + e^{\beta_0 + \beta_1 x_i}\right) \right]$$
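A sketch of maximizing this log-likelihood numerically. The toy age/label data are assumptions for illustration, and a general-purpose optimizer (BFGS) is used here in place of a hand-rolled Newton iteration, much as R's optim would be:

```python
import numpy as np
from scipy.optimize import minimize

# Toy training data (assumed, not from the slides): age and disease label.
x = np.array([25., 30., 35., 40., 45., 50., 55., 60., 65., 70.])
y = np.array([0,   0,   0,   0,   1,   0,   1,   1,   1,   1  ])

def neg_log_likelihood(beta):
    """-L(beta) = -sum_i [ y_i*(b0 + b1*x_i) - ln(1 + e^(b0 + b1*x_i)) ]."""
    t = beta[0] + beta[1] * x
    # np.logaddexp(0, t) computes ln(1 + e^t) without overflow
    return -np.sum(y * t - np.logaddexp(0.0, t))

res = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
b0, b1 = res.x
print(b0, b1)   # fitted intercept and slope
```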
Visualization

§ Fit some curve with parameters β₀ and β₁

[Figure: a logistic curve over Age (X), with probability levels 0.25, 0.5, 0.75 marked between No and Yes]
Visualization

§ Fit some curve with parameters β₀ and β₁
§ Iteratively adjust the curve and the probabilities of each point being classified as one class vs. the other

[Figure: the fitted logistic curve over Age (X), with probability levels 0.25, 0.5, 0.75 marked]

§ For a single independent variable x, the separation is a point x = a
Two Independent Variables

§ The separation is a line where the probability becomes 0.5

[Figure: Income (thousand rupees) against Age (years), with probability contours at 0.25, 0.5, 0.75 and a linear decision boundary]
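A sketch of why the boundary is a line. With two predictors, $p = 0.5$ exactly where $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$, which can be solved for income as a linear function of age. The coefficients below are assumed for illustration, not fitted:

```python
import numpy as np

# Illustrative (assumed) coefficients for age and income (thousand rupees).
b0, b1, b2 = -10.0, 0.1, 0.04

def p_yes(age, income):
    """P(Y = Yes | age, income) under a two-variable logistic model."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * age + b2 * income)))

# The p = 0.5 boundary is the line b0 + b1*age + b2*income = 0,
# i.e. income = -(b0 + b1*age) / b2.
for age in (30, 50, 70):
    print(age, -(b0 + b1 * age) / b2)   # boundary income at each age
```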
Classification: Wrapping Up
Binary and Multi-class Classification

§ Binary classification:
  – Target class has two values
  – Example: heart disease Yes / No
§ Multi-class classification:
  – Target class can take more than two values
  – Example: text classification into several labels (topics)
§ Many classifiers are simple to use for binary classification tasks
§ How can they be applied to multi-class problems?
Compound and Monolithic Classifiers

§ Compound models
  – Built by combining binary submodels
  – 1-vs-all: for each class c, determine if an observation belongs to c or to some other class (see the sketch below)
  – 1-vs-last
§ Monolithic models (a single classifier)
  – Examples: decision trees, k-NN
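A sketch of the 1-vs-all scheme. The fit_binary trainer and its predict_proba interface are assumptions for illustration, not a specific library's API; any binary classifier that reports P(class = 1), such as the logistic regression above, could be plugged in:

```python
import numpy as np

class OneVsAll:
    """Compound multi-class classifier built from binary submodels.

    fit_binary(X, y01) is any binary trainer returning a model with
    predict_proba(X) -> P(class = 1); both names are assumed here.
    """
    def __init__(self, fit_binary):
        self.fit_binary = fit_binary
        self.models = {}

    def fit(self, X, y):
        for c in np.unique(y):
            # One binary problem per class: c vs. all other classes
            self.models[c] = self.fit_binary(X, (y == c).astype(int))
        return self

    def predict(self, X):
        classes = list(self.models)
        # Pick the class whose binary submodel is most confident
        probs = np.column_stack([self.models[c].predict_proba(X)
                                 for c in classes])
        return np.array([classes[i] for i in probs.argmax(axis=1)])
```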