Department of Computer Science
CSCI 5622: Machine Learning
Chenhao Tan

Lecture 12: Regularization, regression, and multi-class classification
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

HW 2
Learning objectives
• Review the homework and multi-class classification
• Linear regression
• Examine regularization in the regression context
• Recognize the effects of regularization on bias/variance
Outline
• Multi-class classification
• Linear regression
• Regularization
Multi-class classification
• Binary examples
  • Spam classification
  • Sentiment classification
• Multi-class examples
  • Star-ratings classification
  • Part-of-speech tagging
  • Image classification
What we learned so far
• KNN
• Naïve Bayes
• Logistic regression
• Neural networks
• Support vector machines
Binary vs. Multi-class classification

Multi-class logistic regression
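The formulas from this slide are not reproduced in the text above. As a minimal sketch of the standard softmax formulation of multi-class logistic regression (assuming a hypothetical k-by-d weight matrix W, one row per class):

import numpy as np

def softmax(z):
    # subtract the max score for numerical stability before exponentiating
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def predict_proba(W, x):
    # W: (k, d) weight matrix, one row per class; x: (d,) feature vector
    # scores are linear in x; softmax turns them into a distribution over the k classes
    return softmax(W @ x)

# toy usage: 3 classes, 2 features
W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])
x = np.array([0.5, 1.5])
p = predict_proba(W, x)
print(p, p.argmax())  # probabilities sum to 1; argmax gives the predicted class

The binary case is recovered for k = 2, where the softmax reduces to the logistic (sigmoid) function.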
Multi-class Support Vector Machines
• Reduction
  • One-against-all
  • All-pairs
• Modify objective function (SSBD 17.2)
Reduction
How do we use binary classifiers to output categorical labels?
One-against-all
• Break the k-class problem into k binary problems and solve them separately
• Combine predictions: evaluate all the classifiers h and take the class whose classifier reports the highest confidence (see the sketch below)
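A minimal sketch of this reduction, assuming scikit-learn's LogisticRegression as the underlying binary classifier (any binary learner exposing a confidence score would do):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, classes):
    # train one binary classifier per class: class c vs. everything else
    models = {}
    for c in classes:
        clf = LogisticRegression()
        clf.fit(X, (y == c).astype(int))
        models[c] = clf
    return models

def predict_one_vs_all(models, X):
    # evaluate every classifier and take the class with the highest confidence
    labels = list(models.keys())
    scores = np.column_stack([models[c].decision_function(X) for c in labels])
    return np.array(labels)[np.argmax(scores, axis=1)]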
All-pairs
• Break the k-class problem into k(k-1)/2 binary problems, one per pair of classes, and solve them separately
• Combine predictions: evaluate all the pairwise classifiers h and take the class with the highest summed confidence (see the sketch below)
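A minimal sketch of the all-pairs reduction under the same assumption (scikit-learn's LogisticRegression as the base binary classifier); each pairwise classifier contributes its confidence to the class it favors, and the class with the highest summed confidence wins:

import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def train_all_pairs(X, y, classes):
    # one binary classifier per unordered pair of classes,
    # trained only on the examples belonging to that pair
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        clf = LogisticRegression()
        clf.fit(X[mask], (y[mask] == b).astype(int))  # label 1 means "class b"
        models[(a, b)] = clf
    return models

def predict_all_pairs(models, X, classes):
    # accumulate each pair's confidence for the class it favors, then take the argmax
    votes = {c: np.zeros(len(X)) for c in classes}
    for (a, b), clf in models.items():
        s = clf.decision_function(X)   # positive favors b, negative favors a
        votes[b] += np.maximum(s, 0.0)
        votes[a] += np.maximum(-s, 0.0)
    stacked = np.column_stack([votes[c] for c in classes])
    return np.array(classes)[np.argmax(stacked, axis=1)]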
Outline
• Multi-class classification
• Linear regression
• Regularization
Linear regression
• Data are continuous inputs and outputs

Linear regression examples
• Given a person's age and gender, predict their height
• Given the square footage and number of bathrooms in a house, predict its sale price
• Given unemployment, inflation, the number of wars, and economic growth, predict the president's approval rating
• Given a user's browsing history, predict how long they will stay on a product page
• Given the advertising budget expenditures in various markets, predict the number of products sold
Linear regression example (figures)
Derived features
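The content of the derived-features slides is not reproduced above. Assuming (as the later bias-variance slide suggests) that the derived features are polynomial transformations of a raw input, a minimal sketch:

import numpy as np

def polynomial_features(x, degree):
    # map each scalar input x to the derived features [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

# toy usage: fit a cubic to noisy data with ordinary least squares on the derived features
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
Phi = polynomial_features(x, degree=3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

The model stays linear in the weights, so everything that follows about linear regression still applies; only the feature representation changes.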
Objective function
The objective function is called the residual sum of squares:
$\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$
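A minimal sketch of minimizing the RSS via the normal equations (assuming a design matrix X whose columns include a constant feature for the intercept and whose Gram matrix X^T X is invertible):

import numpy as np

def fit_least_squares(X, y):
    # the w minimizing RSS(w) = ||y - Xw||^2 solves the normal equations X^T X w = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def rss(X, y, w):
    residuals = y - X @ w
    return float(residuals @ residuals)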
Probabilistic interpretation
A discriminative model that assumes the response is Gaussian with mean $\mathbf{w}^\top \mathbf{x}$:
$y \mid \mathbf{x} \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}, \sigma^2)$
Probabilistic interpretation
Assuming i.i.d. samples, we can write the likelihood of the data as
$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2)$
Probabilistic interpretation
Negative log likelihood:
$-\log p(\mathbf{y} \mid X, \mathbf{w}) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \frac{n}{2}\log(2\pi\sigma^2)$
Minimizing the negative log likelihood in $\mathbf{w}$ is therefore equivalent to minimizing the RSS.
Revisiting the bias-variance tradeoff
• Consider the case of fitting linear regression with derived polynomial features to a set of training data
• In general, we want a model that explains the training data and can still generalize to unseen test data
Revisiting the bias-variance tradeoff (figure)
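The figure itself is not reproducible here. As a rough sketch of the same idea on hypothetical synthetic data, training error typically keeps dropping as the polynomial degree grows while test error eventually rises:

import numpy as np

rng = np.random.default_rng(0)

def f(t):
    return np.sin(2 * np.pi * t)

x_train = rng.uniform(0.0, 1.0, 20)
x_test = rng.uniform(0.0, 1.0, 200)
y_train = f(x_train) + 0.2 * rng.standard_normal(x_train.size)
y_test = f(x_test) + 0.2 * rng.standard_normal(x_test.size)

for degree in [1, 3, 9, 15]:
    Phi_train = np.vander(x_train, degree + 1, increasing=True)
    Phi_test = np.vander(x_test, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((y_train - Phi_train @ w) ** 2)
    test_mse = np.mean((y_test - Phi_test @ w) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))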
Outline
• Multi-class classification
• Linear regression
• Regularization
High variance
• The model wiggles wildly to get close to the data
• To get big swings, the model coefficients become very large
• Weights can grow to the order of $10^6$
Regularization
• Keep all the features, but force the coefficients to be smaller
• This is called regularization
Regularization
• Add a penalty term to the RSS objective function
• Balance between a small RSS and small coefficients
• HW 2 extra credit question
Regularization (figure)
Ridge regularization
$\min_{\mathbf{w}} \; \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|_2^2$
The penalty $\lambda \ge 0$ controls how strongly large coefficients are discouraged.
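A minimal sketch of the ridge solution in closed form (assuming the intercept is either unpenalized and handled separately or the data are centered):

import numpy as np

def fit_ridge(X, y, lam):
    # the w minimizing ||y - Xw||^2 + lam * ||w||^2 has the closed form
    # w = (X^T X + lam * I)^{-1} X^T y; any lam > 0 also makes the system well conditioned
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

Setting lam = 0 recovers ordinary least squares.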
Bias-variance tradeoff
As the penalty $\lambda$ decreases:
A. the bias increases, the variance increases
B. the bias increases, the variance decreases
C. the bias decreases, the variance increases
D. the bias decreases, the variance decreases
Ridge regularization vs. lasso regularization
• How do the coefficients behave as $\lambda$ increases?
Ridge regularization
• Coefficients shrink smoothly and roughly uniformly toward zero, but typically do not become exactly zero
Lasso regularization
• Some coefficients shrink exactly to zero very fast (see the sketch below)
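A minimal sketch of this contrast using scikit-learn's Ridge and Lasso on hypothetical synthetic data where only a few features matter; as the penalty alpha (the $\lambda$ above) grows, lasso drives coefficients exactly to zero while ridge only shrinks them:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only 3 informative features
y = X @ true_w + 0.5 * rng.standard_normal(100)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge_w = Ridge(alpha=alpha).fit(X, y).coef_
    lasso_w = Lasso(alpha=alpha).fit(X, y).coef_
    # count (near-)zero coefficients: ridge stays dense, lasso gets sparser as alpha grows
    print(alpha, np.sum(np.abs(ridge_w) < 1e-8), np.sum(np.abs(lasso_w) < 1e-8))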
Ridge regularization vs. lasso regularization
• Why does the choice between the two types of regularization lead to very different behavior?
• Several ways to look at it:
  • Constrained minimization
  • A simplified case of the data
  • Prior probabilities on the parameters
Intuition 1: Constrained Minimization
Intuition 1: Constrained Minimization
With lasso, the constraint region is a diamond (the $\ell_1$ ball), so the minimum is more likely to occur at a corner of the diamond, causing some feature weights to be set exactly to zero; the constrained formulation below makes this concrete.
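For reference, the standard constrained form behind this picture (each penalized problem corresponds to some constraint level $t$):

Ridge: $\min_{\mathbf{w}} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 \quad \text{subject to} \quad \|\mathbf{w}\|_2^2 \le t$
Lasso: $\min_{\mathbf{w}} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 \quad \text{subject to} \quad \|\mathbf{w}\|_1 \le t$

The $\ell_2$ ball has a smooth boundary, while the $\ell_1$ ball is a diamond whose corners lie on the coordinate axes; the RSS contours therefore often first touch the lasso region at a corner, where some coordinates are exactly zero.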
Intuition 2: A Simplified Case
Intuition 3: Prior Distribution
Intuition 3: Prior Distribution
• Lasso's prior (a Laplace distribution) is sharply peaked at 0, which means we expect many parameters to be exactly zero
• Ridge's prior (a Gaussian) is flatter and fatter around 0, which means we expect many coefficients to be smallish
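A short worked version of this correspondence (a standard MAP argument, using the Gaussian likelihood from the probabilistic-interpretation slides):

$\hat{\mathbf{w}}_{\mathrm{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 - \log p(\mathbf{w}) \right]$

Gaussian prior $p(w_j) \propto \exp(-w_j^2 / 2\tau^2)$: $-\log p(\mathbf{w}) = \frac{1}{2\tau^2}\|\mathbf{w}\|_2^2 + \text{const}$, which gives ridge with $\lambda = \sigma^2 / \tau^2$ after multiplying through by $2\sigma^2$.
Laplace prior $p(w_j) \propto \exp(-|w_j| / b)$: $-\log p(\mathbf{w}) = \frac{1}{b}\|\mathbf{w}\|_1 + \text{const}$, which gives lasso with $\lambda = 2\sigma^2 / b$.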
Wrap up
• Regularization and the idea behind it are crucial for machine learning
• Always use regularization in some form
• Next: ensemble methods