Linear Models
Subhransu Maji
CMPSCI 689: Machine Learning
24 February 2015 / 26 February 2015
Overview

Linear models
‣ Perceptron: model and learning algorithm combined as one
‣ Is there a better way to learn linear models? We will separate models and learning algorithms

Model design
‣ Learning as optimization
‣ Surrogate loss functions
‣ Regularization

Optimization
‣ Gradient descent
‣ Batch and online gradients
‣ Subgradient descent

‣ Support vector machines

CMPSCI 689 Subhransu Maji (UMASS) 2/29
Learning as optimization

min_w Σ_n 1[y_n w^T x_n < 0] + λ R(w)        (first term: fewest mistakes)

The perceptron algorithm will find an optimal w if the data is separable
‣ efficiency depends on the margin and the norm of the data
However, if the data is not separable, optimizing this is NP-hard
‣ i.e., there is no efficient way to minimize it unless P = NP
Learning as optimization

min_w Σ_n 1[y_n w^T x_n < 0] + λ R(w)        (fewest mistakes + simpler model)

In addition to minimizing training error, we want a simpler model
‣ Remember our goal is to minimize generalization error
‣ Recall the bias and variance tradeoff for learners
We can add a regularization term R(w) that prefers simpler models
‣ For example, we may prefer decision trees of shallow depth
Here λ is a hyperparameter of the optimization problem
Learning as optimization

min_w Σ_n 1[y_n w^T x_n < 0] + λ R(w)        (λ is a hyperparameter)

The questions that remain are:
‣ What are good ways to adjust the optimization problem so that there are efficient algorithms for solving it?
‣ What are good regularizers R(w) for hyperplanes?
‣ Assuming the optimization problem can be adjusted appropriately, what algorithms exist for solving the regularized optimization problem?
Convex surrogate loss functions

Zero/one loss is hard to optimize
‣ Small changes in w can cause large changes in the loss
Surrogate loss: replace the zero/one loss with a smooth function
‣ Easier to optimize if the surrogate loss is convex
Examples, plotted as a function of the prediction ŷ = w^T x for y = +1:
[Figure: loss vs. prediction for the zero/one, hinge, logistic, exponential, and squared losses]
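All of these losses can be written as functions of the margin z = y w^T x. A minimal NumPy sketch (the function name and the test margins are illustrative, not from the slides):

```python
import numpy as np

def surrogate_losses(z):
    """Common surrogate losses as a function of the margin z = y * w.x."""
    return {
        "zero_one":    (z < 0).astype(float),       # hard to optimize, not smooth
        "hinge":       np.maximum(0.0, 1.0 - z),    # convex, non-smooth at z = 1
        "logistic":    np.log(1.0 + np.exp(-z)),    # convex, smooth
        "exponential": np.exp(-z),                  # convex, smooth
        "squared":     (1.0 - z) ** 2,              # convex, also penalizes overconfidence
    }

losses = surrogate_losses(np.array([-1.0, 0.0, 1.0, 2.0]))
```

Note how every convex surrogate upper-bounds the zero/one loss near the decision boundary, which is what makes minimizing it a sensible proxy.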
Weight regularization

What are good regularization functions R(w) for hyperplanes?
We would like the weights:
‣ To be small
➡ Small changes in the features cause small changes to the score
➡ Robustness to noise
‣ To be sparse
➡ Use as few features as possible
➡ Similar to controlling the depth of a decision tree
This is a form of inductive bias
Weight regularization

Just like the surrogate loss function, we would like R(w) to be convex

Small-weights regularization:
R(norm)(w) = sqrt(Σ_d w_d²)        R(sqrd)(w) = Σ_d w_d²

Sparsity regularization (not convex):
R(count)(w) = Σ_d 1[|w_d| > 0]

Family of "p-norm" regularizers:
R(p-norm)(w) = (Σ_d |w_d|^p)^(1/p)
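These regularizers are straightforward to compute; a small NumPy sketch (function names and the example weight vector are illustrative):

```python
import numpy as np

def p_norm_regularizer(w, p):
    """R(w) = (sum_d |w_d|^p)^(1/p); convex for p >= 1, not convex for p < 1."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

def count_regularizer(w):
    """R(w) = number of non-zero weights (the "p = 0" case); not convex."""
    return int(np.sum(np.abs(w) > 0))

w = np.array([3.0, -4.0, 0.0])
l1 = p_norm_regularizer(w, 1)    # sum of absolute values
l2 = p_norm_regularizer(w, 2)    # Euclidean norm
nnz = count_regularizer(w)       # number of non-zero weights
```

The p = 1 case is the smallest p that is still convex, which is why it is the standard convex surrogate for the non-convex count regularizer when sparsity is desired.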
Contours of p-norms

Convex for p ≥ 1
http://en.wikipedia.org/wiki/Lp_space
Contours of p-norms

Not convex for 0 ≤ p < 1 (e.g., p = 2/3)
Counting non-zeros is the p = 0 case:
R(count)(w) = Σ_d 1[|w_d| > 0]
http://en.wikipedia.org/wiki/Lp_space
General optimization framework

min_w Σ_n ℓ(y_n, w^T x_n) + λ R(w)
(surrogate loss + regularization; λ is a hyperparameter)

Select a suitable:
‣ convex surrogate loss
‣ convex regularizer
Select the hyperparameter λ
Minimize the regularized objective with respect to w
This framework for optimization is called Tikhonov regularization or, more generally, Structural Risk Minimization (SRM)
http://en.wikipedia.org/wiki/Tikhonov_regularization
Optimization by gradient descent

Convex function:
‣ Compute the gradient at the current location: g^(k) = ∇_p F(p)|_{p_k}
‣ Take a step down the gradient: p_{k+1} ← p_k − η_k g^(k), with step size η_k
‣ Local optima = global optima
Non-convex function:
‣ Local optima ≠ global optima
[Figure: gradient descent steps p_1, …, p_6 on convex and non-convex functions]
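The two-step recipe (compute the gradient, step down it) can be sketched as follows; the objective F(p) = ||p − 3||² and all names are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(grad, p0, eta=0.1, iters=200):
    """Repeatedly step down the gradient: p_{k+1} = p_k - eta * grad(p_k)."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        p = p - eta * grad(p)
    return p

# Minimize the convex function F(p) = ||p - 3||^2, whose gradient is 2(p - 3);
# since F is convex, the local optimum found here is the global optimum.
p_star = gradient_descent(lambda p: 2.0 * (p - 3.0), p0=[0.0])
```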
Choice of step size

The step size is important:
‣ too small: slow convergence
‣ too large: no convergence
A strategy is to use large step sizes initially and small step sizes later:
η_t ← η_0 / (t_0 + t)
There are methods that converge faster by adapting the step size to the curvature of the function
‣ the field of convex optimization
[Figure: trajectories for a good step size vs. a bad step size]
http://stanford.edu/~boyd/cvxbook/
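The decaying schedule η_t = η_0 / (t_0 + t) plugs directly into the descent loop; a sketch on an illustrative one-dimensional objective (minimizing (p − 3)², gradient 2(p − 3)):

```python
def gd_with_decaying_steps(grad, p, eta0=1.0, t0=10, iters=500):
    """Gradient descent with step size eta_t = eta0 / (t0 + t):
    large steps initially, smaller and smaller steps later."""
    for t in range(iters):
        p = p - (eta0 / (t0 + t)) * grad(p)
    return p

p_star = gd_with_decaying_steps(lambda p: 2.0 * (p - 3.0), p=0.0)
```

With this schedule the early large steps make fast progress while the shrinking later steps prevent the iterates from overshooting and oscillating around the optimum.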
Example: Exponential loss

Objective:   L(w) = Σ_n exp(−y_n w^T x_n) + (λ/2) ||w||²
Gradient:    dL/dw = Σ_n −y_n x_n exp(−y_n w^T x_n) + λ w
Update:      w ← w − η ( Σ_n −y_n x_n exp(−y_n w^T x_n) + λ w )

Regularization term:  w ← (1 − ηλ) w        shrinks the weights towards zero
Loss term:            w ← w + c y_n x_n, with c = η exp(−y_n w^T x_n)        c is high for misclassified points

Similar to the perceptron update rule!
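A sketch of the batch gradient update for the exponential loss on a toy dataset (the data, hyperparameters, and names are made up for illustration):

```python
import numpy as np

def exp_loss_gradient_step(w, X, y, eta, lam):
    """One batch gradient step on L(w) = sum_n exp(-y_n w.x_n) + (lam/2)||w||^2."""
    margins = y * (X @ w)
    grad = -(X * (y * np.exp(-margins))[:, None]).sum(axis=0) + lam * w
    return w - eta * grad

# Toy separable data: labels agree with the sign of the first feature.
X = np.array([[1.0, 0.5], [2.0, -0.3], [-1.5, 0.2], [-0.7, -0.8]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(2)
for _ in range(100):
    w = exp_loss_gradient_step(w, X, y, eta=0.05, lam=0.1)
```

Note the exp(−margin) factor: badly misclassified points (very negative margins) dominate the gradient, which is what makes each step resemble a weighted perceptron update.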
Batch and online gradients

Objective:         L(w) = Σ_n L_n(w)
Gradient descent:  w ← w − η dL/dw

Batch gradient:   w ← w − η Σ_n dL_n/dw   (sum of n gradients; update the weights after you see all points)
Online gradient:  w ← w − η dL_n/dw       (gradient at the n-th point; update the weights after you see each point)

Online gradients are the default method for multi-layer perceptrons
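Taking the squared loss at each point as L_n (so dL_n/dw = −2(y_n − w^T x_n) x_n), the two schedules can be sketched on synthetic data (the data, step sizes, and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless labels, so both methods can recover w_true

def grad_n(w, x_n, y_n):
    """Gradient of the squared loss at a single point: dL_n/dw."""
    return -2.0 * (y_n - w @ x_n) * x_n

# Batch: sum the per-point gradients, then update once per pass over the data.
w_batch = np.zeros(3)
for _ in range(200):
    w_batch -= 0.005 * sum(grad_n(w_batch, x, t) for x, t in zip(X, y))

# Online: update after every single point.
w_online = np.zeros(3)
for _ in range(20):
    for x, t in zip(X, y):
        w_online -= 0.01 * grad_n(w_online, x, t)
```

Both arrive near w_true; the online version makes many cheap, noisy updates per pass while the batch version makes one exact update per pass.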
Subgradient

ℓ(hinge)(y, w^T x) = max(0, 1 − y w^T x)

The hinge loss is not differentiable at z = 1, where z = y w^T x
A subgradient is the slope of any line that touches the function at a point and lies below it everywhere
For the hinge loss a possible subgradient is:
dℓ(hinge)/dw = 0 if y w^T x > 1, and −y x otherwise
Example: Hinge loss

Objective:    L(w) = Σ_n max(0, 1 − y_n w^T x_n) + (λ/2) ||w||²
Subgradient:  dL/dw = Σ_n −1[y_n w^T x_n ≤ 1] y_n x_n + λ w
Update:       w ← w − η ( Σ_n −1[y_n w^T x_n ≤ 1] y_n x_n + λ w )

Regularization term:  w ← (1 − ηλ) w        shrinks the weights towards zero
Loss term:            w ← w + η y_n x_n, but only for points with y_n w^T x_n ≤ 1
                      (the perceptron updates only when y_n w^T x_n ≤ 0)
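The per-point (online) version of this update, shrink then correct margin violations, can be sketched as follows (the toy data and names are illustrative):

```python
import numpy as np

def hinge_subgradient_step(w, x_n, y_n, eta, lam):
    """One online subgradient step on max(0, 1 - y w.x) + (lam/2)||w||^2."""
    w = (1.0 - eta * lam) * w                  # regularization: shrink towards zero
    if y_n * (w @ x_n) <= 1.0:                 # margin violation (not just a mistake)
        w = w + eta * y_n * x_n                # perceptron-like correction
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(2)
for epoch in range(100):
    for x_n, y_n in zip(X, y):
        w = hinge_subgradient_step(w, x_n, y_n, eta=0.1, lam=0.01)
```

Unlike the perceptron, this keeps updating on points that are classified correctly but fall inside the margin (0 < y_n w^T x_n ≤ 1), which pushes w towards a larger-margin solution.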
Example: Squared loss

Objective:   L(w) = Σ_n (y_n − w^T x_n)² + (λ/2) ||w||²

In matrix notation, with X the N×D matrix of inputs and y the vector of labels, the equivalent loss is:
L(w) = ||y − Xw||² + (λ/2) ||w||²
Example: Squared loss

Objective:   L(w) = ||y − Xw||² + (λ/2) ||w||²
Gradient:    dL/dw = −2 X^T (y − Xw) + λ w

At the optimum the gradient = 0, which gives an exact closed-form solution:
2 X^T X w + λ w = 2 X^T y   ⟹   w = (2 X^T X + λ I)⁻¹ 2 X^T y
Matrix inversion vs. gradient descent

Assume we have D features and N points

Overall time via matrix inversion:
‣ The closed-form solution involves computing X^T X in O(D²N), inverting a D×D matrix in O(D³), and computing X^T y in O(DN)
‣ Total time is O(D²N + D³ + DN), assuming O(D³) matrix inversion
‣ If N > D, then the total time is O(D²N)

Overall time via gradient descent:
‣ Gradient: dL/dw = Σ_n −2(y_n − w^T x_n) x_n + λ w
‣ Each iteration costs O(ND); T iterations cost O(TND)

Which one is faster?
‣ Small problems (D < 100): probably faster to run matrix inversion
‣ Large problems (D > 10,000): probably faster to run gradient descent
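A sketch comparing the two approaches on a small synthetic problem with the objective Σ_n (y_n − w^T x_n)² + (λ/2)||w||² (the data, step size, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 200, 5, 0.5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)

# Closed form: setting the gradient to zero gives (2 X^T X + lam I) w = 2 X^T y.
# Cost: O(D^2 N) to form X^T X, O(D^3) to solve the D x D system.
w_exact = np.linalg.solve(2.0 * X.T @ X + lam * np.eye(D), 2.0 * X.T @ y)

# Gradient descent on the same objective. Cost: O(T N D) for T iterations.
w_gd = np.zeros(D)
eta = 0.001
for _ in range(2000):
    grad = -2.0 * X.T @ (y - X @ w_gd) + lam * w_gd
    w_gd -= eta * grad
```

Using np.linalg.solve rather than explicitly inverting the matrix is the standard numerically stable choice; both routes reach the same minimizer because the objective is convex.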
Picking a good hyperplane

Which hyperplane is the best?
Support Vector Machines (SVMs)

Maximize the distance to the nearest point (the margin δ(w)), while correctly classifying all the points
[Figure: separating hyperplane w with margin δ(w)]
Optimization for SVMs

Separable case: hard margin SVM
min_w 1/δ(w)                              maximize the margin
subject to: y_n w^T x_n ≥ 1, ∀n           separate by a non-trivial margin

Non-separable case: soft margin SVM
min_w 1/δ(w) + C Σ_n ξ_n                  maximize the margin, minimize the slack
subject to: y_n w^T x_n ≥ 1 − ξ_n, ∀n     allow some slack
            ξ_n ≥ 0
Margin of a classifier

δ(w) = 1/||w||
The margin boundaries are the hyperplanes w^T x + 1 = 0 and w^T x − 1 = 0

min_w 1/δ(w) ≡ min_w ||w||
maximizing the margin = minimizing the norm
[Figure: hyperplane w with margin δ(w)]
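A quick numeric check of δ(w) = 1/||w||: the distance from the margin boundary w^T x = 1 to the hyperplane w^T x = 0 (the particular w below is arbitrary, chosen for illustration):

```python
import numpy as np

w = np.array([3.0, 4.0])           # ||w|| = 5, so the margin should be 1/5
x_on_boundary = w / (w @ w)        # the closest point on w.x = 1 to the hyperplane w.x = 0
distance = np.abs(w @ x_on_boundary) / np.linalg.norm(w)   # point-to-hyperplane distance
margin = 1.0 / np.linalg.norm(w)
```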
Equivalent optimization for SVMs

Separable case: hard margin SVM
min_w (1/2) ||w||²                        squaring and the factor of one half are for convenience
subject to: y_n w^T x_n ≥ 1, ∀n           separate by a non-trivial margin

Non-separable case: soft margin SVM
min_w (1/2) ||w||² + C Σ_n ξ_n            maximize the margin, minimize the slack
subject to: y_n w^T x_n ≥ 1 − ξ_n, ∀n     allow some slack
            ξ_n ≥ 0