Logistic regression (CS 446)


  1. Logistic regression CS 446

  2. 1. Linear classifiers

  3. Linear regression. Last two lectures, we studied linear regression; the output/label space $\mathcal{Y}$ was $\mathbb{R}$.
  [Figure: scatter plot of delay versus duration.]

  4. Linear classification. Today, the goal is a linear classifier; the output/label space $\mathcal{Y}$ is discrete.
  [Figure: two-class data in the plane.]

  5. Notation. For now, let's consider binary classification: $\mathcal{Y} = \{-1, +1\}$. A linear predictor $w \in \mathbb{R}^d$ classifies according to $\mathrm{sign}(w^\top x) \in \{-1, +1\}$. Given examples $((x_i, y_i))_{i=1}^{n}$ and a predictor $w \in \mathbb{R}^d$, we want $\mathrm{sign}(w^\top x_i)$ and $y_i$ to agree.
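  As a concrete illustration (not part of the original deck), a minimal NumPy sketch of such a predictor; the arrays X, y, and w are hypothetical.

    import numpy as np

    # Hypothetical data: rows of X are examples x_i, labels y_i are in {-1, +1}.
    X = np.array([[1.0, 2.0], [-0.5, 1.0], [2.0, -1.0]])
    y = np.array([+1, -1, +1])
    w = np.array([0.7, -0.3])

    predictions = np.sign(X @ w)            # sign(w^T x_i) for every example
    agreement = np.mean(predictions == y)   # fraction of examples where they agree
    print(predictions, agreement)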

  6. Geometry of linear classifiers. A hyperplane in $\mathbb{R}^d$ is a linear subspace of dimension $d - 1$.
  ◮ A hyperplane in $\mathbb{R}^2$ is a line.
  ◮ A hyperplane in $\mathbb{R}^3$ is a plane.
  ◮ As a linear subspace, a hyperplane always contains the origin.
  A hyperplane $H$ can be specified by a (non-zero) normal vector $w \in \mathbb{R}^d$. The hyperplane with normal vector $w$ is the set of points orthogonal to $w$: $H = \{x \in \mathbb{R}^d : x^\top w = 0\}$. Given $w$ and its corresponding $H$: $H$ splits $\mathbb{R}^d$ into the set labeled positive, $\{x : w^\top x > 0\}$, and the set labeled negative, $\{x : w^\top x < 0\}$.
  [Figure: a hyperplane $H$ in the $(x_1, x_2)$ plane with normal vector $w$.]

  7. Classification with a hyperplane.
  [Figure: the hyperplane $H$, its normal vector $w$, and the line $\mathrm{span}\{w\}$.]

  8. Classification with a hyperplane. The projection of $x$ onto $\mathrm{span}\{w\}$ (a line) has coordinate $\|x\|_2 \cos(\theta)$, where
  $\cos(\theta) = \dfrac{x^\top w}{\|w\|_2 \, \|x\|_2}.$
  (The distance to the hyperplane is $\|x\|_2 \cdot |\cos(\theta)|$.)
  [Figure: the point $x$, the angle $\theta$ between $x$ and $w$, and the projection of $x$ onto $\mathrm{span}\{w\}$.]

  9. Classification with a hyperplane. The projection of $x$ onto $\mathrm{span}\{w\}$ (a line) has coordinate $\|x\|_2 \cos(\theta)$, where
  $\cos(\theta) = \dfrac{x^\top w}{\|w\|_2 \, \|x\|_2}.$
  (The distance to the hyperplane is $\|x\|_2 \cdot |\cos(\theta)|$.)
  The decision boundary is the hyperplane (oriented by $w$):
  $x^\top w > 0 \iff \|x\|_2 \cos(\theta) > 0 \iff x$ lies on the same side of $H$ as $w$.
  [Figure: the point $x$, the angle $\theta$ between $x$ and $w$, and the projection of $x$ onto $\mathrm{span}\{w\}$.]

  10. Classification with a hyperplane. The projection of $x$ onto $\mathrm{span}\{w\}$ (a line) has coordinate $\|x\|_2 \cos(\theta)$, where
  $\cos(\theta) = \dfrac{x^\top w}{\|w\|_2 \, \|x\|_2}.$
  (The distance to the hyperplane is $\|x\|_2 \cdot |\cos(\theta)|$.)
  The decision boundary is the hyperplane (oriented by $w$):
  $x^\top w > 0 \iff \|x\|_2 \cos(\theta) > 0 \iff x$ lies on the same side of $H$ as $w$.
  What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?
  [Figure: the point $x$, the angle $\theta$ between $x$ and $w$, and the projection of $x$ onto $\mathrm{span}\{w\}$.]
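  A small sketch (with made-up vectors, not from the slides) checking these quantities: the signed projection coordinate $x^\top w / \|w\|_2$ determines the side of $H$, and its absolute value is the distance to $H$.

    import numpy as np

    w = np.array([3.0, 4.0])   # normal vector defining H = {x : x^T w = 0}
    x = np.array([1.0, 2.0])   # a point to classify

    coord = (x @ w) / np.linalg.norm(w)   # ||x||_2 * cos(theta), signed
    distance = abs(coord)                 # distance from x to H
    side = np.sign(x @ w)                 # +1: same side as w, -1: opposite side
    print(coord, distance, side)          # 2.2 2.2 1.0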

  11. Linear separability. Is it always possible to find $w$ with $\mathrm{sign}(w^\top x_i) = y_i$? Is it always possible to find a hyperplane separating the data? (Appending a 1 to each $x_i$ means the hyperplane need not go through the origin.)
  [Figures: a linearly separable data set and a data set that is not linearly separable.]
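  The "append a 1" trick from the slide, as a hedged sketch with hypothetical arrays: a predictor on the augmented data realizes $\mathrm{sign}(w^\top x + b)$, i.e. a hyperplane that need not pass through the origin.

    import numpy as np

    X = np.array([[0.2, 0.9], [1.5, 0.1], [-0.7, 0.4]])   # hypothetical inputs
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])      # append a constant 1 feature

    # w_aug = (w, b) in R^{d+1}; the last coordinate acts as a bias/offset b.
    w_aug = np.array([1.0, -2.0, 0.5])
    print(np.sign(X_aug @ w_aug))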

  12. Decision boundary with quadratic feature expansion.
  [Figures: an elliptical decision boundary and a hyperbolic decision boundary.]

  13. Decision boundary with quadratic feature expansion. The same feature expansions we saw for linear regression models can also be used here to "upgrade" linear classifiers.
  [Figures: an elliptical decision boundary and a hyperbolic decision boundary.]
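  One possible quadratic expansion in two dimensions (the slide does not spell out the exact feature map, so this particular map is an assumption): a linear classifier on these features yields elliptical or hyperbolic boundaries in the original plane.

    import numpy as np

    def quadratic_features(X):
        """Map each (x1, x2) to (1, x1, x2, x1^2, x2^2, x1*x2)."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

    X = np.array([[1.0, 2.0], [0.5, -1.0]])
    print(quadratic_features(X))
    # A linear predictor w in R^6 on these features corresponds to a quadratic
    # decision boundary (ellipse, hyperbola, ...) in the original (x1, x2) plane.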

  14. Finding linear classifiers with ERM. Why not feed our goal into an optimization package, in the form
  $\arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathrm{sign}(w^\top x_i) \neq y_i]$?

  15. Finding linear classifiers with ERM. Why not feed our goal into an optimization package, in the form
  $\arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathrm{sign}(w^\top x_i) \neq y_i]$?
  ◮ Discrete/combinatorial search; often NP-hard.

  16. Relaxing the ERM problem. Let's remove one source of discreteness:
  $\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathrm{sign}(w^\top x_i) \neq y_i] \;\longrightarrow\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[y_i (w^\top x_i) \leq 0].$
  Did we lose something in this process? Should it be "$>$" or "$\geq$"?

  17. Relaxing the ERM problem. Let's remove one source of discreteness:
  $\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathrm{sign}(w^\top x_i) \neq y_i] \;\longrightarrow\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[y_i (w^\top x_i) \leq 0].$
  Did we lose something in this process? Should it be "$>$" or "$\geq$"?
  $y_i (w^\top x_i)$ is the (unnormalized) margin of $w$ on example $i$; we have written this problem with a margin loss:
  $\widehat{R}_{\mathrm{zo}}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{zo}}(y_i w^\top x_i),$ where $\ell_{\mathrm{zo}}(z) = \mathbf{1}[z \leq 0].$
  (The remainder of the lecture will use single-parameter margin losses.)
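  The empirical zero-one (margin) risk $\widehat{R}_{\mathrm{zo}}$ from the slide, as a short sketch; the arguments w, X, y are hypothetical placeholders.

    import numpy as np

    def risk_zero_one(w, X, y):
        """Empirical zero-one risk: the average of 1[y_i * (w^T x_i) <= 0]."""
        margins = y * (X @ w)          # unnormalized margins y_i * w^T x_i
        return np.mean(margins <= 0.0)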

  18. 2. Logistic loss and risk

  19. Logistic loss. Let's state our classification goal with a generic margin loss $\ell$:
  $\widehat{R}_{\ell}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^\top x_i);$
  the key properties we want:
  ◮ $\ell$ is continuous;
  ◮ $\ell(z) \geq c \, \mathbf{1}[z \leq 0] = c \, \ell_{\mathrm{zo}}(z)$ for some $c > 0$ and every $z \in \mathbb{R}$, which implies $\widehat{R}_{\ell}(w) \geq c \, \widehat{R}_{\mathrm{zo}}(w)$;
  ◮ $\ell'(0) < 0$ (pushes stuff from wrong to right).

  20. Logistic loss. Let's state our classification goal with a generic margin loss $\ell$:
  $\widehat{R}_{\ell}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^\top x_i);$
  the key properties we want:
  ◮ $\ell$ is continuous;
  ◮ $\ell(z) \geq c \, \mathbf{1}[z \leq 0] = c \, \ell_{\mathrm{zo}}(z)$ for some $c > 0$ and every $z \in \mathbb{R}$, which implies $\widehat{R}_{\ell}(w) \geq c \, \widehat{R}_{\mathrm{zo}}(w)$;
  ◮ $\ell'(0) < 0$ (pushes stuff from wrong to right).
  Examples.
  ◮ Squared loss, written in margin form: $\ell_{\mathrm{ls}}(z) := (1 - z)^2$; note that for $y \in \{-1, +1\}$, $\ell_{\mathrm{ls}}(y\hat{y}) = (1 - y\hat{y})^2 = y^2 (y - \hat{y})^2 = (y - \hat{y})^2$.
  ◮ Logistic loss: $\ell_{\log}(z) = \ln(1 + \exp(-z))$.
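  A sketch of these margin losses, plus a numerical check of the upper-bound property for the logistic loss, where $c = \ln 2$ works since $\ell_{\log}(z) \geq \ln 2$ whenever $z \leq 0$:

    import numpy as np

    def loss_zero_one(z):
        return (z <= 0).astype(float)        # 1[z <= 0]

    def loss_squared(z):
        return (1.0 - z) ** 2                # squared loss in margin form

    def loss_logistic(z):
        return np.log1p(np.exp(-z))          # ln(1 + exp(-z))

    z = np.linspace(-5, 5, 1001)
    print(np.all(loss_logistic(z) >= np.log(2) * loss_zero_one(z)))   # True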

  21. Squared and logistic losses on linearly separable data I.
  [Figures: loss contours and decision boundaries on linearly separable data; left: logistic loss, right: squared loss.]

  22. Squared and logistic losses on linearly separable data II.
  [Figures: loss contours and decision boundaries on a second linearly separable data set; left: logistic loss, right: squared loss.]

  23. Logistic risk and separation. If there exists a perfect linear separator, empirical logistic risk minimization should find it.

  24. Logistic risk and separation. If there exists a perfect linear separator, empirical logistic risk minimization should find it.
  Theorem. If there exists $\bar{w}$ with $y_i \bar{w}^\top x_i > 0$ for all $i$, then every $w$ with $\widehat{R}_{\log}(w) < \ln(2)/(2n) + \inf_{v} \widehat{R}_{\log}(v)$ also satisfies $y_i w^\top x_i > 0$ for all $i$.

  25. Logistic risk and separation. If there exists a perfect linear separator, empirical logistic risk minimization should find it.
  Theorem. If there exists $\bar{w}$ with $y_i \bar{w}^\top x_i > 0$ for all $i$, then every $w$ with $\widehat{R}_{\log}(w) < \ln(2)/(2n) + \inf_{v} \widehat{R}_{\log}(v)$ also satisfies $y_i w^\top x_i > 0$ for all $i$.
  Proof.
  Step 1: low risk implies few mistakes. For any $w$ with $y_j w^\top x_j \leq 0$ for some $j$,
  $\widehat{R}_{\log}(w) \geq \frac{1}{n} \ln(1 + \exp(-y_j w^\top x_j)) \geq \frac{\ln(2)}{n}.$
  By the contrapositive, any $w$ with $\widehat{R}_{\log}(w) < \ln(2)/n$ makes no mistakes.
  Step 2: $\inf_{v} \widehat{R}_{\log}(v) = 0$. Note:
  $0 \leq \inf_{v} \widehat{R}_{\log}(v) \leq \inf_{r > 0} \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-r \, y_i \bar{w}^\top x_i)) = 0.$
  This completes the proof.
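  Step 2 can also be checked numerically: on separable data, scaling any separator $\bar{w}$ by a growing factor $r$ drives the empirical logistic risk toward 0. A hedged sketch with a tiny made-up separable data set:

    import numpy as np

    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
    y = np.array([+1, +1, -1, -1])
    w_bar = np.array([1.0, 1.0])          # separates: y_i * w_bar^T x_i > 0 for all i

    def risk_log(w):
        return np.mean(np.log1p(np.exp(-y * (X @ w))))

    for r in [1, 10, 100]:
        print(r, risk_log(r * w_bar))     # decreases toward 0 as r grows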

  26. 3. Minimizing the empirical logistic risk

  27. Least squares and logistic ERM.
  Least squares:
  ◮ Take the gradient of $\|Aw - b\|^2$, set it to 0; obtain the normal equations $A^\top A w = A^\top b$.
  ◮ One choice is the minimum norm solution $A^{+} b$.

  28. Least squares and logistic ERM.
  Least squares:
  ◮ Take the gradient of $\|Aw - b\|^2$, set it to 0; obtain the normal equations $A^\top A w = A^\top b$.
  ◮ One choice is the minimum norm solution $A^{+} b$.
  Logistic loss:
  ◮ Take the gradient of $\widehat{R}_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-y_i w^\top x_i))$ and set it to 0 ???

  29. Least squares and logistic ERM.
  Least squares:
  ◮ Take the gradient of $\|Aw - b\|^2$, set it to 0; obtain the normal equations $A^\top A w = A^\top b$.
  ◮ One choice is the minimum norm solution $A^{+} b$.
  Logistic loss:
  ◮ Take the gradient of $\widehat{R}_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-y_i w^\top x_i))$ and set it to 0 ???
  Remark. Is $A^{+} b$ a "closed form expression"?
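  For contrast, the least-squares side really is one line: the minimum norm solution $A^{+} b$ via the pseudoinverse. Setting the logistic gradient to zero gives a nonlinear system with no comparable closed form, which is why the lecture turns to iterative methods next. A sketch with hypothetical A and b:

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # hypothetical design matrix
    b = np.array([1.0, 2.0, 2.5])

    w_ls = np.linalg.pinv(A) @ b        # minimum norm least squares solution A^+ b
    print(w_ls)
    # No analogous one-liner exists for the stationarity condition of the
    # empirical logistic risk; it must be solved iteratively.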

  30. Decreasing $\widehat{R}$. We need to move down the contours of $\widehat{R}_{\log}$:
  [Figure: contour plot of the empirical logistic risk.]

  31. Gradient descent. Given a function $F : \mathbb{R}^d \to \mathbb{R}$, gradient descent is the iteration
  $w_{i+1} := w_i - \eta_i \nabla_w F(w_i),$
  where $w_0$ is given, and $\eta_i$ is a learning rate / step size.
  [Figure: contour plot of the empirical logistic risk.]
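  Putting the pieces together, a minimal gradient descent on the empirical logistic risk; the data, constant step size, and iteration count are all hypothetical choices, not taken from the slides.

    import numpy as np

    def grad_risk_log(w, X, y):
        """Gradient of (1/n) * sum_i ln(1 + exp(-y_i w^T x_i))."""
        margins = y * (X @ w)                    # y_i * w^T x_i
        coeffs = -y / (1.0 + np.exp(margins))    # y_i * d/dz ln(1+exp(-z)) at z = margins[i]
        return (X * coeffs[:, None]).mean(axis=0)

    # Hypothetical separable data and settings.
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
    y = np.array([+1, +1, -1, -1])
    w = np.zeros(2)
    eta = 1.0                                    # constant learning rate

    for _ in range(200):
        w = w - eta * grad_risk_log(w, X, y)     # w_{i+1} = w_i - eta * grad F(w_i)

    print(w, np.sign(X @ w) == y)                # learned w classifies every point correctly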
