Summary

◮ Linearly separable classification problems.
◮ Logistic loss $\ell_{\log}$ and (empirical) risk $\widehat{R}_{\log}$.
◮ Gradient descent.
(Slide from last time) Classification

For now, let's consider binary classification: $\mathcal{Y} = \{-1, +1\}$.

A linear predictor $w \in \mathbb{R}^d$ classifies according to $\operatorname{sign}(w^{\mathsf{T}} x) \in \{-1, +1\}$.

Given $((x_i, y_i))_{i=1}^{n}$ and a predictor $w \in \mathbb{R}^d$, we want $\operatorname{sign}(w^{\mathsf{T}} x_i)$ and $y_i$ to agree.
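As a quick illustration (my own sketch, not from the slides), the snippet below builds synthetic separable data and checks how often $\operatorname{sign}(w^{\mathsf{T}} x_i)$ agrees with $y_i$; the particular data and the vectors w_true and w are arbitrary choices for the example.

    import torch

    torch.manual_seed(0)

    # Synthetic data whose labels in {-1, +1} come from a "true" direction.
    n, d = 100, 2
    X = torch.randn(n, d)
    w_true = torch.tensor([2.0, -1.0])
    y = torch.sign(X @ w_true)

    # A candidate linear predictor classifies via sign(w^T x).
    w = torch.tensor([1.0, -0.5])   # same direction as w_true, so it agrees everywhere
    agreement = (torch.sign(X @ w) == y).float().mean()
    print(f"fraction of examples with sign(w^T x_i) == y_i: {agreement.item():.2f}")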
(Slide from last time) Logistic loss 1

Let's state our classification goal with a generic margin loss $\ell$:
$$\widehat{R}_{\ell}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^{\mathsf{T}} x_i);$$
the key properties we want:
◮ $\ell$ is continuous;
◮ $\ell(z) \ge c\,\mathbf{1}[z \le 0] = c\,\ell_{\mathrm{zo}}(z)$ for some $c > 0$ and every $z \in \mathbb{R}$, which implies $\widehat{R}_{\ell}(w) \ge c\,\widehat{R}_{\mathrm{zo}}(w)$;
◮ $\ell'(0) < 0$ (pushes examples from the wrong side to the right side).

Examples.
◮ Squared loss, written in margin form: $\ell_{\mathrm{ls}}(z) := (1 - z)^2$; note $\ell_{\mathrm{ls}}(y\hat{y}) = (1 - y\hat{y})^2 = y^2 (y - \hat{y})^2 = (y - \hat{y})^2$, using $y^2 = 1$.
◮ Logistic loss: $\ell_{\log}(z) = \ln(1 + \exp(-z))$.
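A small numerical check (my addition, not from the slides) of the second property for the logistic loss: $\ell_{\log}(z) \ge c\,\ell_{\mathrm{zo}}(z)$ with $c = \ln 2 = \ell_{\log}(0)$.

    import torch

    def logistic_loss(z):
        # ln(1 + exp(-z)); softplus(-z) is the numerically stable form.
        return torch.nn.functional.softplus(-z)

    def zero_one_loss(z):
        # 1[z <= 0]
        return (z <= 0).float()

    z = torch.linspace(-5.0, 5.0, steps=1001)
    c = torch.log(torch.tensor(2.0))   # logistic_loss(0) = ln 2
    assert torch.all(logistic_loss(z) + 1e-6 >= c * zero_one_loss(z))
    print("logistic loss >= ln(2) * zero-one loss on the grid")

Since $\ell_{\log}$ is also continuous and has $\ell'_{\log}(0) = -1/2 < 0$, it satisfies all three properties above.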
(Slide from last time) Logistic loss 2

[Figure: contour plots — logistic loss (left), squared loss (right).]

(Slide from last time) Logistic loss 3

[Figure: contour plots — logistic loss (left), squared loss (right).]
(Slide from last time) Gradient descent 1

Given a function $F : \mathbb{R}^d \to \mathbb{R}$, gradient descent is the iteration
$$w_{i+1} := w_i - \eta_i \nabla_w F(w_i),$$
where $w_0$ is given, and $\eta_i$ is a learning rate / step size.

[Figure: contour plot of an example objective on $\mathbb{R}^2$.]

Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity.
(Slide from last time) Gradient descent 2

Gradient descent is the iteration:
$$w_{i+1} := w_i - \eta_i \nabla_w \widehat{R}_{\log}(w_i).$$

◮ Note $\ell'_{\log}(z) = \frac{-1}{1 + \exp(z)}$, and use the chain rule (hw1!).
◮ Or use pytorch:

    import torch

    def GD(X, y, loss, step=0.1, n_iters=10000):
        w = torch.zeros(X.shape[1], requires_grad=True)
        for i in range(n_iters):
            l = loss(X, y, w).mean()
            l.backward()
            with torch.no_grad():
                w -= step * w.grad
                w.grad.zero_()
        return w
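A hedged usage sketch (my addition, not from the slides): a margin-form logistic loss compatible with the GD routine above, run on synthetic separable data. The helper name logistic_loss and the data-generating choices are illustrative assumptions.

    import torch

    def logistic_loss(X, y, w):
        # Per-example losses ln(1 + exp(-y_i w^T x_i)), with labels y_i in {-1, +1}.
        return torch.nn.functional.softplus(-y * (X @ w))

    torch.manual_seed(0)
    n, d = 200, 2
    X = torch.randn(n, d)
    y = torch.sign(X @ torch.tensor([1.0, -2.0]))   # linearly separable labels

    w_hat = GD(X, y, logistic_loss, step=0.1, n_iters=1000)
    with torch.no_grad():
        train_err = (torch.sign(X @ w_hat) != y).float().mean()
    print(f"training error: {train_err.item():.3f}")   # 0 on separable data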
Part 2 of logistic regression...
5. A maximum likelihood derivation
MLE and ERM

We've studied an ERM perspective on logistic regression:
◮ Form the empirical logistic risk $\widehat{R}_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-y_i w^{\mathsf{T}} x_i))$.
◮ Approximately solve $\arg\min_{w \in \mathbb{R}^d} \widehat{R}_{\log}(w)$ via gradient descent (or another convex optimization technique).

We only justified it with "popularity"!

Today we'll derive $\widehat{R}_{\log}$ via Maximum Likelihood Estimation (MLE).
1. We form a model for $\Pr[Y = 1 \mid X = x]$, parameterized by $w$.
2. We form a full-data log-likelihood (equivalent to $\widehat{R}_{\log}$).

Let's first describe the distributions underlying the data.
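Before moving on, here is a small sanity check (my own sketch) that autograd and the chain-rule formula agree on $\nabla_w \widehat{R}_{\log}(w) = \frac{1}{n}\sum_i \ell'_{\log}(y_i w^{\mathsf{T}} x_i)\, y_i x_i$, using $\ell'_{\log}(z) = -1/(1+\exp(z))$ from the earlier slide; the random data here is arbitrary.

    import torch

    torch.manual_seed(0)
    n, d = 50, 3
    X = torch.randn(n, d)
    y = torch.sign(torch.randn(n))               # labels in {-1, +1}
    w = torch.randn(d, requires_grad=True)

    # Empirical logistic risk and its autograd gradient.
    risk = torch.nn.functional.softplus(-y * (X @ w)).mean()
    risk.backward()

    # Hand-derived gradient: (1/n) sum_i ell'(y_i w^T x_i) y_i x_i.
    with torch.no_grad():
        ell_prime = -1.0 / (1.0 + torch.exp(y * (X @ w)))
        grad_manual = ((ell_prime * y).unsqueeze(1) * X).mean(dim=0)

    print(torch.allclose(w.grad, grad_manual, atol=1e-6))   # True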
Learning prediction functions

IID model for supervised learning: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs (i.e., labeled examples).
◮ $X$ takes values in $\mathcal{X}$. E.g., $\mathcal{X} = \mathbb{R}^d$.
◮ $Y$ takes values in $\mathcal{Y}$. E.g., (regression problems) $\mathcal{Y} = \mathbb{R}$; (classification problems) $\mathcal{Y} = \{1, \ldots, K\}$ or $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, +1\}$.

1. We observe $(X_1, Y_1), \ldots, (X_n, Y_n)$, and then choose a prediction function (i.e., predictor) $\hat{f} : \mathcal{X} \to \mathcal{Y}$. This is called "learning" or "training".
2. At prediction time, we observe $X$ and form the prediction $\hat{f}(X)$.
3. The outcome is $Y$, and
   ◮ the squared loss is $(\hat{f}(X) - Y)^2$ (regression problems);
   ◮ the zero-one loss is $\mathbf{1}\{\hat{f}(X) \ne Y\}$ (classification problems).

Note: the expected zero-one loss is $\mathbb{E}[\mathbf{1}\{\hat{f}(X) \ne Y\}] = \Pr(\hat{f}(X) \ne Y)$, which we also call the error rate.
Distributions over labeled examples

$\mathcal{X}$: space of possible side-information (feature space).
$\mathcal{Y}$: space of possible outcomes (label space or output space).

The distribution $P$ of a random pair $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ can be thought of in two parts:
1. The marginal distribution $P_X$ of $X$: $P_X$ is a probability distribution on $\mathcal{X}$.
2. The conditional distribution $P_{Y \mid X = x}$ of $Y$ given $X = x$, for each $x \in \mathcal{X}$: $P_{Y \mid X = x}$ is a probability distribution on $\mathcal{Y}$.
Optimal classifier

For binary classification, what function $f : \mathcal{X} \to \{0, 1\}$ has the smallest risk (i.e., error rate) $R(f) := \Pr(f(X) \ne Y)$?

◮ Conditional on $X = x$, the minimizer $\hat{y}$ of the conditional risk $\hat{y} \mapsto \Pr(\hat{y} \ne Y \mid X = x)$ is
$$\hat{y} := \begin{cases} 1 & \text{if } \Pr(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \Pr(Y = 1 \mid X = x) \le 1/2. \end{cases}$$
◮ Therefore, the function $f^{\star} : \mathcal{X} \to \{0, 1\}$ where
$$f^{\star}(x) = \begin{cases} 1 & \text{if } \Pr(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \Pr(Y = 1 \mid X = x) \le 1/2, \end{cases} \qquad x \in \mathcal{X},$$
has the smallest risk.
◮ $f^{\star}$ is called the Bayes (optimal) classifier. For $\mathcal{Y} = \{1, \ldots, K\}$,
$$f^{\star}(x) = \arg\max_{y \in \mathcal{Y}} \Pr(Y = y \mid X = x), \qquad x \in \mathcal{X}.$$
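A simulation sketch (my addition, not from the slides): for a toy one-dimensional distribution with an assumed conditional probability function eta, the Bayes rule thresholds eta at 1/2, and its estimated error rate is no worse than a mis-thresholded competitor.

    import torch

    torch.manual_seed(0)

    def eta(x):
        # Assumed P(Y = 1 | X = x) for this toy example.
        return torch.sigmoid(3.0 * x)

    def bayes_classifier(x):
        # f*(x) = 1 if P(Y = 1 | X = x) > 1/2, else 0.
        return (eta(x) > 0.5).long()

    # Simulate (X, Y) pairs from this model and estimate error rates.
    n = 100_000
    X = torch.randn(n)
    Y = torch.bernoulli(eta(X)).long()

    bayes_err = (bayes_classifier(X) != Y).float().mean()
    other_err = ((eta(X) > 0.8).long() != Y).float().mean()   # worse threshold, for comparison
    print(f"Bayes error ~ {bayes_err.item():.3f}, threshold-at-0.8 error ~ {other_err.item():.3f}")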
Logistic regression

Suppose $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1\}$.

A logistic regression model is a statistical model where the conditional probability function has a particular form:
$$Y \mid X = x \sim \mathrm{Bern}(\eta_w(x)), \qquad x \in \mathbb{R}^d,$$
with
$$\eta_w(x) := \mathrm{logistic}(x^{\mathsf{T}} w), \qquad x \in \mathbb{R}^d$$
(with parameters $w \in \mathbb{R}^d$), and
$$\mathrm{logistic}(z) := \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}, \qquad z \in \mathbb{R}.$$

[Figure: plot of the logistic (sigmoid) function.]

◮ The conditional distribution of $Y$ given $X$ is Bernoulli; the marginal distribution of $X$ is not specified.
◮ With least squares, $Y \mid X = x$ was $\mathrm{N}(w^{\mathsf{T}} x, \sigma^2)$.
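A small simulation sketch (my addition): draw $X$ from some marginal and then $Y \mid X = x \sim \mathrm{Bern}(\mathrm{logistic}(x^{\mathsf{T}} w))$. The standard Gaussian marginal and the particular $w$ are assumptions for the example, since the model leaves the marginal unspecified.

    import torch

    torch.manual_seed(0)
    d, n = 3, 50_000
    w = torch.tensor([0.5, -1.0, 2.0])

    X = torch.randn(n, d)          # marginal of X: chosen arbitrarily (not part of the model)
    eta = torch.sigmoid(X @ w)     # eta_w(x) = logistic(x^T w)
    Y = torch.bernoulli(eta)       # Y | X = x ~ Bern(eta_w(x)), labels in {0, 1}

    # Sanity check: among examples with eta close to 0.7, about 70% should have Y = 1.
    bucket = (eta - 0.7).abs() < 0.02
    print(f"empirical P(Y=1 | eta ~ 0.7): {Y[bucket].mean().item():.3f}")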
MLE for logistic regression

Log-likelihood of $w$ in the iid logistic regression model, given data $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \ldots, n$:
$$\begin{aligned}
\sum_{i=1}^{n} \ln\Bigl( \eta_w(x_i)^{y_i} \bigl(1 - \eta_w(x_i)\bigr)^{1 - y_i} \Bigr)
&= \sum_{i=1}^{n} \Bigl( y_i \ln \eta_w(x_i) + (1 - y_i) \ln\bigl(1 - \eta_w(x_i)\bigr) \Bigr) \\
&= \sum_{i=1}^{n} \Bigl( -y_i \ln\bigl(1 + \exp(-w^{\mathsf{T}} x_i)\bigr) - (1 - y_i) \ln\bigl(1 + \exp(w^{\mathsf{T}} x_i)\bigr) \Bigr) \\
&= -\sum_{i=1}^{n} \ln\bigl(1 + \exp(-(2y_i - 1) w^{\mathsf{T}} x_i)\bigr),
\end{aligned}$$
and the old form is recovered with labels $\tilde{y}_i := 2y_i - 1 \in \{-1, +1\}$.
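A numerical check (my own sketch) of the final identity: the Bernoulli log-likelihood with $\{0,1\}$ labels matches the negated sum of logistic losses with $\{-1,+1\}$ labels $\tilde{y}_i = 2y_i - 1$. The random data is arbitrary.

    import torch

    torch.manual_seed(0)
    n, d = 100, 4
    X = torch.randn(n, d)
    y01 = torch.randint(0, 2, (n,)).float()    # labels in {0, 1}
    w = torch.randn(d)

    eta = torch.sigmoid(X @ w)
    loglik = (y01 * torch.log(eta) + (1 - y01) * torch.log(1 - eta)).sum()

    y_pm = 2 * y01 - 1                         # labels in {-1, +1}
    neg_logistic_sum = -torch.nn.functional.softplus(-y_pm * (X @ w)).sum()

    print(torch.allclose(loglik, neg_logistic_sum, atol=1e-5))   # True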
Log-odds function and classifier

Equivalent way to characterize the logistic regression model: the log-odds function, given by
$$\text{log-odds}_{\beta}(x) = \ln \frac{\eta_{\beta}(x)}{1 - \eta_{\beta}(x)} = \ln \frac{e^{x^{\mathsf{T}}\beta} / (1 + e^{x^{\mathsf{T}}\beta})}{1 / (1 + e^{x^{\mathsf{T}}\beta})} = x^{\mathsf{T}} \beta,$$
is a linear function¹, parameterized by $\beta \in \mathbb{R}^d$.

Bayes optimal classifier $f_{\beta} : \mathbb{R}^d \to \{0, 1\}$ in the logistic regression model:
$$f_{\beta}(x) = \begin{cases} 0 & \text{if } x^{\mathsf{T}}\beta \le 0, \\ 1 & \text{if } x^{\mathsf{T}}\beta > 0. \end{cases}$$
Such classifiers are called linear classifiers.

¹ Some authors allow an affine function; we can get this using affine expansion.
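A tiny sketch (my addition) confirming that thresholding $\eta_{\beta}$ at $1/2$ is the same classifier as thresholding the linear score $x^{\mathsf{T}}\beta$ at $0$; the random data and $\beta$ are arbitrary.

    import torch

    torch.manual_seed(0)
    n, d = 1000, 5
    X = torch.randn(n, d)
    beta = torch.randn(d)

    scores = X @ beta
    f_linear = (scores > 0).long()                    # f_beta(x) = 1[x^T beta > 0]
    f_prob = (torch.sigmoid(scores) > 0.5).long()     # 1[eta_beta(x) > 1/2]

    print(torch.equal(f_linear, f_prob))              # True: the same linear classifier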