
Lecture 4: Linear Classifiers - Prof. Julia Hockenmaier - PowerPoint PPT Presentation



  1. CS446 Introduction to Machine Learning (Spring 2015), University of Illinois at Urbana-Champaign, http://courses.engr.illinois.edu/cs446. Lecture 4: Linear Classifiers. Prof. Julia Hockenmaier, juliahmr@illinois.edu

  2. Announcements: Homework 1 will be out after class: http://courses.engr.illinois.edu/cs446/Homework/HW1.pdf You have two weeks to complete the assignment. Late policy: – You have up to two days of late credit for the whole semester, and we don't give any partial late credit: if you're late on one assignment by up to 24 hours, that uses one of your two late-credit days. – We don't accept assignments that are more than 48 hours late. Is everybody on Compass? https://compass2g.illinois.edu/ Let us know if you can't see our class.

  3. Last lecture's key concepts: Decision trees for (binary) classification (non-linear classifiers). Learning decision trees with the ID3 algorithm: a greedy heuristic based on information gain, originally developed for discrete features. Overfitting: what is it, and how do we deal with it?

  4. Today's key concepts: Learning linear classifiers. Batch algorithms: – gradient descent for least mean squares (LMS). Online algorithms: – stochastic gradient descent.

  5. Linear classifiers

  6. Linear classifiers: f(x) = w_0 + w·x. [Figure: the decision boundary f(x) = 0 in the (x_1, x_2) plane, with f(x) > 0 on one side and f(x) < 0 on the other.] Linear classifiers are defined over vector spaces. Every hypothesis f(x) = w_0 + w·x defines a hyperplane, the decision boundary f(x) = 0. Assign ŷ = +1 to all x where f(x) > 0 and ŷ = -1 to all x where f(x) < 0; that is, ŷ = sgn(f(x)).

  7. Hypothesis space H for linear classifiers. [Figure: a grid of truth tables over two binary features x_1 and x_2, enumerating candidate labelings of the four possible inputs, i.e. Boolean functions of (x_1, x_2).]

  8. Canonical representation. With w = (w_1, …, w_N)^T and x = (x_1, …, x_N)^T: f(x) = w_0 + w·x = w_0 + Σ_{i=1…N} w_i x_i. w_0 is called the bias term. The canonical representation redefines w and x as w = (w_0, w_1, …, w_N)^T and x = (1, x_1, …, x_N)^T, so that f(x) = w·x.
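
To make the canonical representation concrete, here is a minimal numpy sketch (the helper names to_canonical and predict are mine, not from the course code): it prepends a constant 1 to every example so that the bias w_0 becomes an ordinary weight, then classifies with ŷ = sgn(w·x).

    import numpy as np

    def to_canonical(X):
        """Prepend a column of ones to an (n_examples, N) feature matrix,
        so that f(x) = w_0 + w.x becomes the single dot product w.x."""
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def predict(w, X_canonical):
        """Return labels in {-1, +1}: y_hat = sgn(w.x); f(x) = 0 maps to +1 here."""
        scores = X_canonical @ w            # f(x) = w.x for every row x
        return np.where(scores >= 0, 1, -1)

    # Toy example: w = (w_0, w_1, w_2) with bias w_0 = -1
    w = np.array([-1.0, 2.0, 0.5])
    X = np.array([[1.0, 0.0], [0.0, 0.5]])
    print(predict(w, to_canonical(X)))      # prints [ 1 -1]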

  9. Learning a linear classifier. Input: labeled training data D = {(x_1, y_1), …, (x_D, y_D)}, plotted in the sample space X = R^2, with y_i = +1 for one class and y_i = -1 for the other. Output: a decision boundary f(x) = 0 that separates the training data, i.e. y_i·f(x_i) > 0 for all i. [Figure: the labeled points in the (x_1, x_2) plane and a separating line f(x) = 0, with f(x) > 0 on one side and f(x) < 0 on the other.]

  10. Which model should we pick? We need a metric (aka an objective function). We would like to minimize the probability of misclassifying unseen examples, but we can't measure that probability. Instead, we minimize the number of misclassified training examples.

  11. Which model should we pick? We need a more specific metric: there may be many models that are consistent with the training data. Loss functions provide such metrics.

  12. Loss functions for classification

  13. y·f(x) > 0: correct classification. An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0: Case 1 (y = +1 = ŷ): f(x) > 0 ⇒ y·f(x) > 0. Case 2 (y = -1 = ŷ): f(x) < 0 ⇒ y·f(x) > 0. Case 3 (y = +1 ≠ ŷ = -1): f(x) < 0 ⇒ y·f(x) < 0. Case 4 (y = -1 ≠ ŷ = +1): f(x) > 0 ⇒ y·f(x) < 0.
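
A quick numeric check of the four cases, with made-up values of f(x); it simply confirms that y·f(x) > 0 exactly when sgn(f(x)) agrees with y.

    # Illustrative (y, f(x)) pairs for cases 1-4 above
    cases = [(+1, 0.7), (-1, -0.3), (+1, -0.3), (-1, 0.7)]
    for y, fx in cases:
        y_hat = 1 if fx > 0 else -1
        # "correctly classified" and "positive margin" always agree
        print(y == y_hat, y * fx > 0)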

  14. Loss functions for classification. Loss: what penalty do we incur if we misclassify x? L(y, f(x)) is the loss (aka cost) of classifier f on example x when the true label of x is y. We assign label ŷ = sgn(f(x)) to x. In plots of L(y, f(x)), the x-axis is typically y·f(x). Today: 0-1 loss and square loss (more loss functions later).

  15. 0-1 Loss. [Plot: the 0-1 loss as a function of y·f(x).]

  16. 0-1 Loss. [Plot: the 0-1 loss as a function of y·f(x).] L(y, f(x)) = 0 if y = ŷ, and 1 if y ≠ ŷ. Written as a function of the margin: L(y·f(x)) = 0 if y·f(x) > 0 (correctly classified), and 1 if y·f(x) < 0 (misclassified).
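
A one-line numpy version of the 0-1 loss, written as a function of the margin y·f(x) as in the plot (the function name is mine; the boundary case y·f(x) = 0 is counted as a mistake here, which the slide leaves unspecified).

    import numpy as np

    def zero_one_loss(margin):
        """0 if y*f(x) > 0 (correct), 1 otherwise (misclassified)."""
        return np.where(margin > 0, 0.0, 1.0)

    print(zero_one_loss(np.array([-1.5, -0.1, 0.1, 1.5])))   # [1. 1. 0. 0.]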

  17. Square loss: (y - f(x))^2. [Plots: the square loss as a function of f(x), for y = +1 and y = -1, and as a function of y·f(x).] L(y, f(x)) = (y - f(x))^2. Note: L(-1, f(x)) = (-1 - f(x))^2 = (1 + f(x))^2 = L(1, -f(x)), so the loss when y = -1 is the mirror image of the loss when y = +1.
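
A small sketch of the square loss, including a check of the mirror property just noted (assuming labels y ∈ {-1, +1}; function name is mine).

    import numpy as np

    def square_loss(y, fx):
        """Square loss (y - f(x))^2 for a true label y and score f(x)."""
        return (y - fx) ** 2

    fx = np.linspace(-2.0, 2.0, 9)
    # Mirror property from the slide: L(-1, f(x)) == L(1, -f(x))
    print(np.allclose(square_loss(-1.0, fx), square_loss(1.0, -fx)))   # True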

  18. The square loss is a convex upper bound on the 0-1 loss. [Plot: the 0-1 loss and the square loss as functions of y·f(x).]
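
A short derivation of why this holds, assuming y ∈ {-1, +1} (so y^2 = 1 and 1/y = y); it is only implicit in the plot:

    (y - f(x))^2 \;=\; y^2\left(1 - \tfrac{f(x)}{y}\right)^2 \;=\; \bigl(1 - y\,f(x)\bigr)^2

Viewed as a function of the margin y·f(x), this is a convex parabola: it is at least 1 whenever y·f(x) ≤ 0 (where the 0-1 loss is 1), and it is nonnegative whenever y·f(x) > 0 (where the 0-1 loss is 0). Hence it upper-bounds the 0-1 loss everywhere.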

  19. Loss surface. Linear classification: the hypothesis space is parameterized by w. In plain English: each w yields a different classifier. Error, loss, and risk are therefore all functions of w.

  20. The loss surface. Learning = finding the (empirical) global minimum of the loss surface. [Figure: error plotted over the hypothesis space, with the global minimum marked.]

  21. The loss surface. Finding the global (empirical) minimum is, in general, hard. [Figure: error plotted over the hypothesis space, with a plateau, a local minimum, and the global minimum marked.]

  22. Convex loss surfaces. Convex functions have no local minima other than the global minimum. [Figure: a convex (empirical) error surface over the hypothesis space, with a single global minimum.]

  23. The risk of a classifier, R(f). The risk (aka generalization error) of a classifier f(x) = w·x is its expected loss, i.e. the loss averaged over all possible examples under the data distribution: R(f) = ∫ L(y, f(x)) P(x, y) dx dy. Ideal learning objective: find an f that minimizes the risk.

  24. Aside: the i.i.d. assumption. We always assume that training and test items are independently and identically distributed (i.i.d.): – There is a distribution P(X, Y) from which the data D = {(x, y)} is generated. Sometimes it's useful to rewrite P(X, Y) as P(X)P(Y|X). Usually P(X, Y) is unknown to us (we just know it exists). – Training and test data are samples drawn from the same P(X, Y): they are identically distributed. – Each (x, y) is drawn independently from P(X, Y).

  25. The empirical risk of f(x). The empirical risk of a classifier f(x) = w·x on a data set D = {(x_1, y_1), …, (x_D, y_D)} is its average loss on the items in D: R_D(f) = (1/D) Σ_{i=1…D} L(y_i, f(x_i)). Realistic learning objective: find an f that minimizes the empirical risk. (Note that the learner can ignore the constant 1/D.)
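
A minimal sketch of computing the empirical risk for a linear classifier f(x) = w·x (names are illustrative; X is assumed to be in canonical form, with the constant-1 feature in the first column).

    import numpy as np

    def empirical_risk(w, X, y, loss):
        """Average of loss(y_i, f(x_i)) over the data set."""
        scores = X @ w                                    # f(x_i) for every example
        return np.mean([loss(yi, fi) for yi, fi in zip(y, scores)])

    square_loss = lambda yi, fi: (yi - fi) ** 2           # square loss (slide 17)
    zero_one    = lambda yi, fi: float(yi * fi <= 0)      # 0-1 loss (slide 16)
    # e.g. empirical_risk(w, X, y, square_loss) or empirical_risk(w, X, y, zero_one)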

  26. Empirical risk minimization. Learning: given training data D = {(x_1, y_1), …, (x_D, y_D)}, return the classifier f(x) that minimizes the empirical risk R_D(f).

  27. Batch learning: Gradient Descent for Least Mean Squares (LMS)

  28. Gradient descent. An iterative batch learning algorithm: – the learner updates the hypothesis based on the entire training data – the learner has to go over the training data multiple times. Goal: minimize the training error/loss – at each step, move w in the direction of steepest descent along the error/loss surface.

  29. Gradient descent. Error(w): the error of w on the training data; w_i: the weight vector at iteration i. [Figure: Error(w) plotted over w, with successive iterates w_1, w_2, w_3, w_4 moving down toward the minimum.]

  30. Least mean square error. Err(w) = (1/2) Σ_{d∈D} (y_d - ŷ_d)^2. LMS error: the sum of the square loss over all training items (multiplied by 0.5 for convenience). D is fixed, so there is no need to divide by its size. Goal of learning: find w* = argmin_w Err(w).
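
As a code sketch (my own notation): taking ŷ_d to be the real-valued output w·x_d rather than the thresholded label, which is what makes this error differentiable, the LMS error is

    import numpy as np

    def lms_error(w, X, y):
        """Err(w) = 0.5 * sum_d (y_d - w.x_d)^2, with X in canonical form."""
        residuals = y - X @ w
        return 0.5 * np.sum(residuals ** 2)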

  31. Iterative batch learning. Initialization: initialize w_0 (the initial weight vector). For each iteration i = 0…T: determine by how much to change w based on the entire data set D, Δw = computeDelta(D, w_i); then update w, w_{i+1} = update(w_i, Δw).

  32. Gradient descent: update. 1. Compute ∇Err(w_i), the gradient of the training error at w_i. This requires going over the entire training data: ∇Err(w) = (∂Err(w)/∂w_0, ∂Err(w)/∂w_1, …, ∂Err(w)/∂w_N)^T. 2. Update w: w_{i+1} = w_i - α∇Err(w_i), where α > 0 is the learning rate.
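
A minimal batch gradient descent sketch for the LMS error (my own illustrative implementation, not the course code). For Err(w) = 0.5 Σ_d (y_d - w·x_d)^2 the gradient works out to ∇Err(w) = -Σ_d (y_d - w·x_d) x_d = -X^T(y - Xw), so the update w ← w - α∇Err(w) becomes w ← w + α X^T(y - Xw).

    import numpy as np

    def gradient_descent_lms(X, y, alpha=0.1, n_iters=500):
        """Batch gradient descent on Err(w) = 0.5 * sum_d (y_d - w.x_d)^2.
        alpha is a small fixed learning rate; too large a value diverges."""
        w = np.zeros(X.shape[1])              # initial weight vector (the slide's w_0)
        for _ in range(n_iters):
            grad = -X.T @ (y - X @ w)         # gradient over the entire data set
            w = w - alpha * grad              # step against the gradient
        return w

    # Toy usage: X already in canonical form (first column is the constant 1)
    X = np.array([[1., 0., 0.], [1., 1., 0.], [1., 0., 1.], [1., 1., 1.]])
    y = np.array([-1., -1., 1., 1.])
    w = gradient_descent_lms(X, y)
    print(np.sign(X @ w))                     # recovers the labels [-1. -1.  1.  1.]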

  33. What's a gradient? ∇Err(w) = (∂Err(w)/∂w_0, ∂Err(w)/∂w_1, …, ∂Err(w)/∂w_N)^T. The gradient is a vector of partial derivatives. It points in the direction of steepest increase in Err(w), hence the minus sign in the update rule w_{i+1} = w_i - α∇Err(w_i).
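
Since the gradient is just a vector of partial derivatives, it can be sanity-checked numerically by nudging one weight at a time. Below is a small, illustrative finite-difference check against the analytic LMS gradient -X^T(y - Xw) used above; the toy data and helper name are mine.

    import numpy as np

    def numerical_gradient(err, w, eps=1e-6):
        """Central-difference estimate of (dErr/dw_0, ..., dErr/dw_N)."""
        grad = np.zeros_like(w)
        for j in range(len(w)):
            e = np.zeros_like(w)
            e[j] = eps
            grad[j] = (err(w + e) - err(w - e)) / (2 * eps)
        return grad

    X = np.array([[1., 2.], [1., -1.]])
    y = np.array([1., -1.])
    w = np.array([0.5, -0.3])
    err = lambda w_: 0.5 * np.sum((y - X @ w_) ** 2)      # LMS error from slide 30
    print(np.allclose(numerical_gradient(err, w), -X.T @ (y - X @ w)))   # True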
