

  1. CSC 311: Introduction to Machine Learning
     Lecture 3 - Linear Classifiers, Logistic Regression, Multiclass Classification
     Roger Grosse, Chris Maddison, Juhan Bae, Silviu Pitis
     University of Toronto, Fall 2020

  2. Overview
     Classification: predicting a discrete-valued target
     ◮ Binary classification: predicting a binary-valued target
     ◮ Multiclass classification: predicting a discrete (> 2)-valued target
     Examples of binary classification:
     ◮ predict whether a patient has a disease, given the presence or absence of various symptoms
     ◮ classify e-mails as spam or non-spam
     ◮ predict whether a financial transaction is fraudulent

  3. Overview
     Binary linear classification
     classification: given a D-dimensional input x ∈ R^D, predict a discrete-valued target
     binary: predict a binary target t ∈ {0, 1}
     ◮ Training examples with t = 1 are called positive examples, and training examples with t = 0 are called negative examples.
     ◮ Whether we use t ∈ {0, 1} or t ∈ {−1, +1} is a matter of computational convenience.
     linear: the model prediction y is a linear function of x, followed by a threshold r:
         z = w⊤x + b
         y = 1 if z ≥ r, and y = 0 if z < r

  4. Some Simplifications
     Eliminating the threshold
     We can assume without loss of generality (WLOG) that the threshold r = 0:
         w⊤x + b ≥ r  ⟺  w⊤x + (b − r) ≥ 0,
     i.e. we can fold the threshold into the bias.
     Eliminating the bias
     Add a dummy feature x0 which always takes the value 1. The weight w0 = b is then equivalent to a bias (same as in linear regression).
     Simplified model
     Receive input x ∈ R^{D+1} with x0 = 1:
         z = w⊤x
         y = 1 if z ≥ 0, and y = 0 if z < 0
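To make the simplified model concrete, here is a minimal NumPy sketch (not from the slides; the helper names add_dummy_feature and predict are my own) of a binary linear classifier with the dummy feature and a threshold at zero:

```python
import numpy as np

def add_dummy_feature(X):
    """Prepend a column of ones (the dummy feature x0 = 1) to the data matrix."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w, X):
    """Binary linear classifier: y = 1 if w^T x >= 0, else y = 0."""
    z = X @ w                     # one score z per example
    return (z >= 0).astype(int)
```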

  5. Examples
     Let's consider some simple examples to examine the properties of our model.
     Let's focus on minimizing the training set error, and forget about whether our model will generalize to a test set.

  6. Examples: NOT
         x0   x1   t
          1    0   1
          1    1   0
     Suppose this is our training set, with the dummy feature x0 included.
     Which conditions on w0, w1 guarantee perfect classification?
     ◮ When x1 = 0, need: z = w0 x0 + w1 x1 ≥ 0  ⟺  w0 ≥ 0
     ◮ When x1 = 1, need: z = w0 x0 + w1 x1 < 0  ⟺  w0 + w1 < 0
     Example solution: w0 = 1, w1 = −2
     Is this the only solution?
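As a quick sanity check (assuming NumPy; an illustration, not part of the slides), the example solution w0 = 1, w1 = −2 classifies both NOT training cases correctly:

```python
import numpy as np

# NOT training set, with the dummy feature x0 = 1 already included
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
t = np.array([1, 0])

w = np.array([1.0, -2.0])        # the example solution from the slide
y = (X @ w >= 0).astype(int)     # threshold z = w^T x at 0
print(y, np.all(y == t))         # [1 0] True
```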

  7. Examples: AND
         x0   x1   x2   t     z = w0 x0 + w1 x1 + w2 x2
          1    0    0   0     need: w0 < 0
          1    0    1   0     need: w0 + w2 < 0
          1    1    0   0     need: w0 + w1 < 0
          1    1    1   1     need: w0 + w1 + w2 ≥ 0
     Example solution: w0 = −1.5, w1 = 1, w2 = 1
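The same kind of check (again just an illustrative sketch) confirms the example AND solution:

```python
import numpy as np

# AND training set, with the dummy feature x0 = 1 included
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
t = np.array([0, 0, 0, 1])

w = np.array([-1.5, 1.0, 1.0])   # the example solution from the slide
y = (X @ w >= 0).astype(int)
print(y, np.all(y == t))         # [0 0 0 1] True
```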

  8. The Geometric Picture
     Input space, or data space, for the NOT example:
         x0   x1   t
          1    0   1
          1    1   0
     Training examples are points.
     Weights (hypotheses) w can be represented by half-spaces H+ = {x : w⊤x ≥ 0}, H− = {x : w⊤x < 0}
     ◮ The boundaries of these half-spaces pass through the origin (why?)
     The boundary is the decision boundary: {x : w⊤x = 0}
     ◮ In 2-D it's a line, but in high dimensions it is a hyperplane.
     If the training examples can be perfectly separated by a linear decision rule, we say the data is linearly separable.

  9. The Geometric Picture
     Weight space
     Weights (hypotheses) w are points.
     Each training example x specifies a half-space that w must lie in to be correctly classified: w⊤x ≥ 0 if t = 1 (and w⊤x < 0 if t = 0).
     For the NOT example:
     ◮ x0 = 1, x1 = 0, t = 1  ⟹  (w0, w1) ∈ {w : w0 ≥ 0}
     ◮ x0 = 1, x1 = 1, t = 0  ⟹  (w0, w1) ∈ {w : w0 + w1 < 0}
     The region satisfying all the constraints is the feasible region; if this region is nonempty, the problem is feasible, otherwise it is infeasible.

  10. The Geometric Picture
      The AND example requires three dimensions, including the dummy one.
      To visualize data space and weight space for a 3-D example, we can look at a 2-D slice.
      The visualizations are similar.
      ◮ The feasible set will always have a corner at the origin.

  11. The Geometric Picture
      Visualizations of the AND example (two panels: a data-space plot and a weight-space plot)
      Data space (slice for x0 = 1):
      ◮ example solution: w0 = −1.5, w1 = 1, w2 = 1
      ◮ decision boundary: w0 x0 + w1 x1 + w2 x2 = 0  ⟹  −1.5 + x1 + x2 = 0
      Weight space (slice for w0 = −1.5), showing the constraints:
      ◮ w0 < 0
      ◮ w0 + w2 < 0
      ◮ w0 + w1 < 0
      ◮ w0 + w1 + w2 ≥ 0

  12. Summary — Binary Linear Classifiers
      Summary: targets t ∈ {0, 1}, inputs x ∈ R^{D+1} with x0 = 1, and the model is defined by weights w and
          z = w⊤x
          y = 1 if z ≥ 0, and y = 0 if z < 0
      How can we find good values for w?
      If the training set is linearly separable, we could solve for w using linear programming (see the sketch below).
      ◮ We could also apply an iterative procedure known as the perceptron algorithm (but this is primarily of historical interest).
      If it's not linearly separable, the problem is harder.
      ◮ Data is almost never linearly separable in real life.
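As a rough illustration of the linear-programming idea, here is a sketch using scipy.optimize.linprog. An LP cannot express strict inequalities, so the sketch asks for a margin of 1 on each example (an assumption I am adding; any positive margin is equivalent up to rescaling of w) and treats the problem as pure feasibility:

```python
import numpy as np
from scipy.optimize import linprog

def find_separating_weights(X, t, margin=1.0):
    """Look for w with w.x >= margin on positives and w.x <= -margin on negatives.
    Returns w if the LP is feasible, otherwise None (data not strictly separable)."""
    N, D = X.shape
    signs = np.where(t == 1, -1.0, 1.0)        # flip positives so every row reads "<="
    A_ub = signs[:, None] * X                  # constraints: A_ub @ w <= b_ub
    b_ub = -margin * np.ones(N)
    res = linprog(c=np.zeros(D), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * D)   # zero objective: pure feasibility
    return res.x if res.success else None

# AND data from the earlier slide
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
t = np.array([0, 0, 0, 1])
print(find_separating_weights(X, t))
```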

  13. Towards Logistic Regression

  14. Loss Functions
      Instead: define a loss function, then try to minimize the resulting cost function
      ◮ Recall: the cost is the loss averaged (or summed) over the training set
      Seemingly obvious loss function: 0-1 loss
          L_{0−1}(y, t) = 0 if y = t, and 1 if y ≠ t
                        = I[y ≠ t]

  15. Attempt 1: 0-1 loss
      Usually, the cost J is the loss averaged over the training examples; for the 0-1 loss, this is the misclassification rate:
          J = (1/N) Σ_{i=1}^{N} I[y^(i) ≠ t^(i)]
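In code, this cost is just the fraction of mismatched predictions (a small sketch, assuming NumPy):

```python
import numpy as np

def misclassification_rate(y, t):
    """0-1 cost: average of I[y^(i) != t^(i)] over the training set."""
    y = np.asarray(y)
    t = np.asarray(t)
    return np.mean(y != t)

print(misclassification_rate([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.25
```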

  16. Attempt 1: 0-1 loss
      Problem: how do we optimize it?
      In general, this is a hard problem (it can be NP-hard).
      This is because the step function (0-1 loss) is not well behaved (it is not continuous, smooth, or convex).

  17. Attempt 1: 0-1 loss
      The minimum of a function will be at its critical points, so let's try to find the critical points of the 0-1 loss.
      Chain rule:
          ∂L_{0−1}/∂w_j = (∂L_{0−1}/∂z) · (∂z/∂w_j)
      But ∂L_{0−1}/∂z is zero everywhere it's defined!
      ◮ ∂L_{0−1}/∂w_j = 0 means that changing the weights by a very small amount probably has no effect on the loss.
      ◮ Almost any point has zero gradient!
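A quick numerical illustration of this point (a sketch with arbitrary random data, not from the slides): a finite-difference estimate of the gradient of the 0-1 cost is almost always exactly zero, because a tiny change to the weights almost never flips any prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])  # dummy feature + 2 inputs
t = rng.integers(0, 2, size=50)
w = rng.normal(size=3)

def zero_one_cost(w):
    y = (X @ w >= 0).astype(int)
    return np.mean(y != t)

eps = 1e-6
grad_est = np.array([(zero_one_cost(w + eps * e) - zero_one_cost(w)) / eps
                     for e in np.eye(3)])
print(grad_est)   # almost always [0. 0. 0.]
```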

  18. Attempt 2: Linear Regression
      Sometimes we can replace the loss function we care about with one that is easier to optimize. This is known as relaxation with a smooth surrogate loss function.
      One problem with L_{0−1}: it is defined in terms of the final prediction, which inherently involves a discontinuity.
      Instead, define the loss in terms of w⊤x directly.
      ◮ Redo notation for convenience: z = w⊤x

  19. Attempt 2: Linear Regression
      We already know how to fit a linear regression model. Can we use this instead?
          z = w⊤x
          L_SE(z, t) = ½ (z − t)²
      It doesn't matter that the targets are actually binary; treat them as continuous values.
      For this loss function, it makes sense to make final predictions by thresholding z at ½ (why?)
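A minimal sketch of this attempt on the AND data from earlier (assuming NumPy; thresholding at 1/2 follows the slide's suggestion):

```python
import numpy as np

# AND data with the dummy feature x0 = 1
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
t = np.array([0., 0., 0., 1.])

# Least-squares fit of z = w^T x to the binary targets
w, *_ = np.linalg.lstsq(X, t, rcond=None)

z = X @ w
y = (z >= 0.5).astype(int)       # threshold at 1/2
print(w, z, y)                   # y = [0 0 0 1] on this toy set
```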

  20. Attempt 2: Linear Regression
      The problem: the loss function hates when you make correct predictions with high confidence!
      If t = 1, it's more unhappy about z = 10 than about z = 0: L_SE(10, 1) = ½ · 9² = 40.5, while L_SE(0, 1) = ½ · 1² = 0.5.

  21. Attempt 3: Logistic Activation Function
      There's obviously no reason to predict values outside [0, 1]. Let's squash y into this interval.
      The logistic function is a kind of sigmoid, or S-shaped, function:
          σ(z) = 1 / (1 + e^{−z})
      σ^{−1}(y) = log(y / (1 − y)) is called the logit.
      A linear model with a logistic nonlinearity is known as log-linear:
          z = w⊤x
          y = σ(z)
          L_SE(y, t) = ½ (y − t)²
      Used in this way, σ is called an activation function.
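A small sketch of these definitions (helper names are my own):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(y):
    """Inverse of the logistic function: log(y / (1 - y))."""
    return np.log(y / (1.0 - y))

def squared_error(y, t):
    """L_SE(y, t) = 0.5 * (y - t)^2."""
    return 0.5 * (y - t) ** 2

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))                 # approx [0.0067, 0.5, 0.9933]
print(logit(sigmoid(z)))          # recovers z (up to rounding)
```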

  22. Attempt 3: Logistic Activation Function
      The problem (see the plot of L_SE as a function of z, assuming t = 1):
          ∂L/∂w_j = (∂L/∂z) · (∂z/∂w_j)
      For z ≪ 0, we have σ(z) ≈ 0.
          ∂L/∂z ≈ 0 (check!)  ⟹  ∂L/∂w_j ≈ 0  ⟹  derivative w.r.t. w_j is small  ⟹  w_j is like a critical point
      If the prediction is really wrong, you should be far from a critical point (which is your candidate solution).
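A quick numeric check of this vanishing-gradient problem (a sketch; it uses the identity ∂L_SE/∂z = (y − t) σ(z)(1 − σ(z)), which follows from the chain rule): at z = −10 with t = 1 the prediction is badly wrong, yet the derivative is on the order of 1e-5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dLSE_dz(z, t):
    """Chain rule: dL_SE/dz = (y - t) * sigma'(z), with sigma'(z) = y * (1 - y)."""
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)

# Target t = 1: z = -10 is a badly wrong prediction, yet its gradient is tiny
for z in [-10.0, -2.0, 0.0, 2.0]:
    print(z, dLSE_dz(z, t=1))
```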

  23. Logistic Regression
      Because y ∈ [0, 1], we can interpret it as the estimated probability that t = 1.
      If t = 0, then we want to heavily penalize y ≈ 1.
      The pundits who were 99% confident Clinton would win were much more wrong than the ones who were only 90% confident.
      Cross-entropy loss (aka log loss) captures this intuition:
          L_CE(y, t) = −log y       if t = 1
                       −log(1 − y)  if t = 0
                     = −t log y − (1 − t) log(1 − y)
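A small sketch of the cross-entropy loss, with the pundit example in numbers (y is the predicted probability that t = 1; assuming NumPy):

```python
import numpy as np

def cross_entropy(y, t):
    """L_CE(y, t) = -t*log(y) - (1 - t)*log(1 - y)."""
    y = np.asarray(y, dtype=float)
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

# Outcome t = 0, but two pundits predicted t = 1 with 99% and 90% confidence
print(cross_entropy(0.99, t=0))   # about 4.61 -- heavily penalized
print(cross_entropy(0.90, t=0))   # about 2.30 -- wrong, but penalized much less
```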

  24. Logistic Regression
      Logistic regression:
          z = w⊤x
          y = σ(z) = 1 / (1 + e^{−z})
          L_CE(y, t) = −t log y − (1 − t) log(1 − y)
      (The plot is for target t = 1.)
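Putting the pieces together, here is a minimal gradient-descent sketch of logistic regression (not shown on the slides; the learning rate and step count are arbitrary choices, and it uses the standard identity ∂L_CE/∂z = y − t, so the gradient of the average cost is X⊤(y − t)/N):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, t, lr=0.1, num_steps=5000):
    """Minimize the average cross-entropy by full-batch gradient descent.
    Uses dL_CE/dz = y - t, so grad_w J = X^T (y - t) / N."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_steps):
        y = sigmoid(X @ w)
        grad = X.T @ (y - t) / N
        w -= lr * grad
    return w

# Toy example: the AND data with the dummy feature x0 = 1
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
t = np.array([0., 0., 0., 1.])
w = fit_logistic_regression(X, t)
print(np.round(sigmoid(X @ w), 2))   # predictions approach [0, 0, 0, 1] with more training
```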
