Linear Models
Subhransu Maji
CMPSCI 689: Machine Learning
24 February 2015 / 26 February 2015
Overview

Linear models
‣ Perceptron: model and learning algorithm combined as one
‣ Is there a better way to learn linear models? We will separate models and learning algorithms

Model design
‣ Learning as optimization
‣ Surrogate loss functions
‣ Regularization

Optimization
‣ Gradient descent
‣ Batch and online gradients
‣ Subgradient descent

‣ Support vector machines

CMPSCI 689 Subhransu Maji (UMASS) 2/29
Learning as optimization

min_w Σ_n 1[y_n w^T x_n < 0] + λ R(w)        (first term: fewest mistakes)

The perceptron algorithm will find an optimal w if the data is separable
‣ efficiency depends on the margin and the norm of the data
However, if the data is not separable, optimizing this is NP-hard
‣ i.e., there is no efficient way to minimize it unless P = NP
Learning as optimization

min_w Σ_n 1[y_n w^T x_n < 0] + λ R(w)        (fewest mistakes + simpler model)

In addition to minimizing training error, we want a simpler model
‣ Remember our goal is to minimize generalization error
‣ Recall the bias and variance tradeoff for learners
We can add a regularization term R(w) that prefers simpler models
‣ For example, we may prefer decision trees of shallow depth
Here λ is a hyperparameter of the optimization problem
Learning as optimization

min_w Σ_n 1[y_n w^T x_n < 0] + λ R(w)        (λ is a hyperparameter)

The questions that remain are:
‣ What are good ways to adjust the optimization problem so that there are efficient algorithms for solving it?
‣ What are good regularizers R(w) for hyperplanes?
‣ Assuming the optimization problem can be adjusted appropriately, what algorithms exist for solving the regularized optimization problem?
Convex surrogate loss functions

Zero/one loss is hard to optimize
‣ Small changes in w can cause large changes in the loss
Surrogate loss: replace the zero/one loss with a smooth function
‣ Easier to optimize if the surrogate loss is convex
Examples, plotted as a function of the prediction ŷ = w^T x for y = +1:
[Figure: loss vs. prediction for the zero/one, hinge, logistic, exponential, and squared losses]
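All of these losses can be written as functions of the margin z = y w^T x. A minimal NumPy sketch (the function name and the test margins are illustrative, not from the slides):

```python
import numpy as np

def surrogate_losses(z):
    """Common surrogate losses as a function of the margin z = y * w.x."""
    return {
        "zero_one":    (z < 0).astype(float),       # hard to optimize, not smooth
        "hinge":       np.maximum(0.0, 1.0 - z),    # convex, non-smooth at z = 1
        "logistic":    np.log(1.0 + np.exp(-z)),    # convex, smooth
        "exponential": np.exp(-z),                  # convex, smooth
        "squared":     (1.0 - z) ** 2,              # convex, also penalizes overconfidence
    }

losses = surrogate_losses(np.array([-1.0, 0.0, 1.0, 2.0]))
```

Note how every convex surrogate upper-bounds the zero/one loss near the decision boundary, which is what makes minimizing it a sensible proxy.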
Weight regularization

What are good regularization functions R(w) for hyperplanes?
We would like the weights:
‣ To be small
➡ Small changes in the features cause small changes to the score
➡ Robustness to noise
‣ To be sparse
➡ Use as few features as possible
➡ Similar to controlling the depth of a decision tree
This is a form of inductive bias
Weight regularization

Just like the surrogate loss function, we would like R(w) to be convex

Small-weights regularization:
R(norm)(w) = sqrt(Σ_d w_d²)        R(sqrd)(w) = Σ_d w_d²

Sparsity regularization (not convex):
R(count)(w) = Σ_d 1[|w_d| > 0]

Family of "p-norm" regularizers:
R(p-norm)(w) = (Σ_d |w_d|^p)^(1/p)
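These regularizers are straightforward to compute; a small NumPy sketch (function names and the example weight vector are illustrative):

```python
import numpy as np

def p_norm_regularizer(w, p):
    """R(w) = (sum_d |w_d|^p)^(1/p); convex for p >= 1, not convex for p < 1."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

def count_regularizer(w):
    """R(w) = number of non-zero weights (the "p = 0" case); not convex."""
    return int(np.sum(np.abs(w) > 0))

w = np.array([3.0, -4.0, 0.0])
l1 = p_norm_regularizer(w, 1)    # sum of absolute values
l2 = p_norm_regularizer(w, 2)    # Euclidean norm
nnz = count_regularizer(w)       # number of non-zero weights
```

The p = 1 case is the smallest p that is still convex, which is why it is the standard convex surrogate for the non-convex count regularizer when sparsity is desired.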
Contours of p-norms

Convex for p ≥ 1
http://en.wikipedia.org/wiki/Lp_space
Contours of p-norms

Not convex for 0 ≤ p < 1 (e.g., p = 2/3)
Counting non-zeros is the p = 0 case:
R(count)(w) = Σ_d 1[|w_d| > 0]
http://en.wikipedia.org/wiki/Lp_space
General optimization framework

min_w Σ_n ℓ(y_n, w^T x_n) + λ R(w)
(surrogate loss + regularization; λ is a hyperparameter)

Select a suitable:
‣ convex surrogate loss
‣ convex regularizer
Select the hyperparameter λ
Minimize the regularized objective with respect to w
This framework for optimization is called Tikhonov regularization or, more generally, Structural Risk Minimization (SRM)
http://en.wikipedia.org/wiki/Tikhonov_regularization
Optimization by gradient descent

Convex function:
‣ Compute the gradient at the current location: g^(k) = ∇_p F(p)|_{p_k}
‣ Take a step down the gradient: p_{k+1} ← p_k − η_k g^(k), with step size η_k
‣ Local optima = global optima
Non-convex function:
‣ Local optima ≠ global optima
[Figure: gradient descent steps p_1, …, p_6 on convex and non-convex functions]
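The two-step recipe (compute the gradient, step down it) can be sketched as follows; the objective F(p) = ||p − 3||² and all names are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(grad, p0, eta=0.1, iters=200):
    """Repeatedly step down the gradient: p_{k+1} = p_k - eta * grad(p_k)."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        p = p - eta * grad(p)
    return p

# Minimize the convex function F(p) = ||p - 3||^2, whose gradient is 2(p - 3);
# since F is convex, the local optimum found here is the global optimum.
p_star = gradient_descent(lambda p: 2.0 * (p - 3.0), p0=[0.0])
```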
Choice of step size

The step size is important:
‣ too small: slow convergence
‣ too large: no convergence
A strategy is to use large step sizes initially and small step sizes later:
η_t ← η_0 / (t_0 + t)
There are methods that converge faster by adapting the step size to the curvature of the function
‣ the field of convex optimization
[Figure: trajectories for a good step size vs. a bad step size]
http://stanford.edu/~boyd/cvxbook/
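The decaying schedule η_t = η_0 / (t_0 + t) plugs directly into the descent loop; a sketch on an illustrative one-dimensional objective (minimizing (p − 3)², gradient 2(p − 3)):

```python
def gd_with_decaying_steps(grad, p, eta0=1.0, t0=10, iters=500):
    """Gradient descent with step size eta_t = eta0 / (t0 + t):
    large steps initially, smaller and smaller steps later."""
    for t in range(iters):
        p = p - (eta0 / (t0 + t)) * grad(p)
    return p

p_star = gd_with_decaying_steps(lambda p: 2.0 * (p - 3.0), p=0.0)
```

With this schedule the early large steps make fast progress while the shrinking later steps prevent the iterates from overshooting and oscillating around the optimum.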
Example: Exponential loss

Objective:   L(w) = Σ_n exp(−y_n w^T x_n) + (λ/2) ||w||²
Gradient:    dL/dw = Σ_n −y_n x_n exp(−y_n w^T x_n) + λ w
Update:      w ← w − η ( Σ_n −y_n x_n exp(−y_n w^T x_n) + λ w )

Regularization term:  w ← (1 − ηλ) w        shrinks the weights towards zero
Loss term:            w ← w + c y_n x_n, with c = η exp(−y_n w^T x_n)        c is high for misclassified points

Similar to the perceptron update rule!
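A sketch of the batch gradient update for the exponential loss on a toy dataset (the data, hyperparameters, and names are made up for illustration):

```python
import numpy as np

def exp_loss_gradient_step(w, X, y, eta, lam):
    """One batch gradient step on L(w) = sum_n exp(-y_n w.x_n) + (lam/2)||w||^2."""
    margins = y * (X @ w)
    grad = -(X * (y * np.exp(-margins))[:, None]).sum(axis=0) + lam * w
    return w - eta * grad

# Toy separable data: labels agree with the sign of the first feature.
X = np.array([[1.0, 0.5], [2.0, -0.3], [-1.5, 0.2], [-0.7, -0.8]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(2)
for _ in range(100):
    w = exp_loss_gradient_step(w, X, y, eta=0.05, lam=0.1)
```

Note the exp(−margin) factor: badly misclassified points (very negative margins) dominate the gradient, which is what makes each step resemble a weighted perceptron update.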
Batch and online gradients

Objective:         L(w) = Σ_n L_n(w)
Gradient descent:  w ← w − η dL/dw

Batch gradient:   w ← w − η Σ_n dL_n/dw   (sum of n gradients; update the weights after you see all points)
Online gradient:  w ← w − η dL_n/dw       (gradient at the n-th point; update the weights after you see each point)

Online gradients are the default method for multi-layer perceptrons
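Taking the squared loss at each point as L_n (so dL_n/dw = −2(y_n − w^T x_n) x_n), the two schedules can be sketched on synthetic data (the data, step sizes, and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless labels, so both methods can recover w_true

def grad_n(w, x_n, y_n):
    """Gradient of the squared loss at a single point: dL_n/dw."""
    return -2.0 * (y_n - w @ x_n) * x_n

# Batch: sum the per-point gradients, then update once per pass over the data.
w_batch = np.zeros(3)
for _ in range(200):
    w_batch -= 0.005 * sum(grad_n(w_batch, x, t) for x, t in zip(X, y))

# Online: update after every single point.
w_online = np.zeros(3)
for _ in range(20):
    for x, t in zip(X, y):
        w_online -= 0.01 * grad_n(w_online, x, t)
```

Both arrive near w_true; the online version makes many cheap, noisy updates per pass while the batch version makes one exact update per pass.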
Subgradient

ℓ(hinge)(y, w^T x) = max(0, 1 − y w^T x)

The hinge loss is not differentiable at z = 1, where z = y w^T x
A subgradient is the slope of any line that touches the function at a point and lies below it everywhere
For the hinge loss a possible subgradient is:
dℓ(hinge)/dw = 0 if y w^T x > 1, and −y x otherwise
Example: Hinge loss

Objective:    L(w) = Σ_n max(0, 1 − y_n w^T x_n) + (λ/2) ||w||²
Subgradient:  dL/dw = Σ_n −1[y_n w^T x_n ≤ 1] y_n x_n + λ w
Update:       w ← w − η ( Σ_n −1[y_n w^T x_n ≤ 1] y_n x_n + λ w )

Regularization term:  w ← (1 − ηλ) w        shrinks the weights towards zero
Loss term:            w ← w + η y_n x_n, but only for points with y_n w^T x_n ≤ 1
                      (the perceptron updates only when y_n w^T x_n ≤ 0)
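The per-point (online) version of this update, shrink then correct margin violations, can be sketched as follows (the toy data and names are illustrative):

```python
import numpy as np

def hinge_subgradient_step(w, x_n, y_n, eta, lam):
    """One online subgradient step on max(0, 1 - y w.x) + (lam/2)||w||^2."""
    w = (1.0 - eta * lam) * w                  # regularization: shrink towards zero
    if y_n * (w @ x_n) <= 1.0:                 # margin violation (not just a mistake)
        w = w + eta * y_n * x_n                # perceptron-like correction
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(2)
for epoch in range(100):
    for x_n, y_n in zip(X, y):
        w = hinge_subgradient_step(w, x_n, y_n, eta=0.1, lam=0.01)
```

Unlike the perceptron, this keeps updating on points that are classified correctly but fall inside the margin (0 < y_n w^T x_n ≤ 1), which pushes w towards a larger-margin solution.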
Example: Squared loss

Objective:   L(w) = Σ_n (y_n − w^T x_n)² + (λ/2) ||w||²

In matrix notation, with X the N×D matrix of inputs and y the vector of labels, the equivalent loss is:
L(w) = ||y − Xw||² + (λ/2) ||w||²
Example: Squared loss

Objective:   L(w) = ||y − Xw||² + (λ/2) ||w||²
Gradient:    dL/dw = −2 X^T (y − Xw) + λ w

At the optimum the gradient = 0, which gives an exact closed-form solution:
2 X^T X w + λ w = 2 X^T y   ⟹   w = (2 X^T X + λ I)⁻¹ 2 X^T y
Matrix inversion vs. gradient descent

Assume we have D features and N points

Overall time via matrix inversion:
‣ The closed-form solution involves computing X^T X in O(D²N), inverting a D×D matrix in O(D³), and computing X^T y in O(DN)
‣ Total time is O(D²N + D³ + DN), assuming O(D³) matrix inversion
‣ If N > D, then the total time is O(D²N)

Overall time via gradient descent:
‣ Gradient: dL/dw = Σ_n −2(y_n − w^T x_n) x_n + λ w
‣ Each iteration costs O(ND); T iterations cost O(TND)

Which one is faster?
‣ Small problems (D < 100): probably faster to run matrix inversion
‣ Large problems (D > 10,000): probably faster to run gradient descent
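A sketch comparing the two approaches on a small synthetic problem with the objective Σ_n (y_n − w^T x_n)² + (λ/2)||w||² (the data, step size, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 200, 5, 0.5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)

# Closed form: setting the gradient to zero gives (2 X^T X + lam I) w = 2 X^T y.
# Cost: O(D^2 N) to form X^T X, O(D^3) to solve the D x D system.
w_exact = np.linalg.solve(2.0 * X.T @ X + lam * np.eye(D), 2.0 * X.T @ y)

# Gradient descent on the same objective. Cost: O(T N D) for T iterations.
w_gd = np.zeros(D)
eta = 0.001
for _ in range(2000):
    grad = -2.0 * X.T @ (y - X @ w_gd) + lam * w_gd
    w_gd -= eta * grad
```

Using np.linalg.solve rather than explicitly inverting the matrix is the standard numerically stable choice; both routes reach the same minimizer because the objective is convex.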
Picking a good hyperplane

Which hyperplane is the best?
Support Vector Machines (SVMs)

Maximize the distance to the nearest point (the margin δ(w)), while correctly classifying all the points
[Figure: separating hyperplane w with margin δ(w)]
Optimization for SVMs

Separable case: hard margin SVM
min_w 1/δ(w)                              maximize the margin
subject to: y_n w^T x_n ≥ 1, ∀n           separate by a non-trivial margin

Non-separable case: soft margin SVM
min_w 1/δ(w) + C Σ_n ξ_n                  maximize the margin, minimize the slack
subject to: y_n w^T x_n ≥ 1 − ξ_n, ∀n     allow some slack
            ξ_n ≥ 0
Margin of a classifier

δ(w) = 1/||w||
The margin boundaries are the hyperplanes w^T x + 1 = 0 and w^T x − 1 = 0

min_w 1/δ(w) ≡ min_w ||w||
maximizing the margin = minimizing the norm
[Figure: hyperplane w with margin δ(w)]
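A quick numeric check of δ(w) = 1/||w||: the distance from the margin boundary w^T x = 1 to the hyperplane w^T x = 0 (the particular w below is arbitrary, chosen for illustration):

```python
import numpy as np

w = np.array([3.0, 4.0])           # ||w|| = 5, so the margin should be 1/5
x_on_boundary = w / (w @ w)        # the closest point on w.x = 1 to the hyperplane w.x = 0
distance = np.abs(w @ x_on_boundary) / np.linalg.norm(w)   # point-to-hyperplane distance
margin = 1.0 / np.linalg.norm(w)
```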
Equivalent optimization for SVMs

Separable case: hard margin SVM
min_w (1/2) ||w||²                        squaring and the factor of one half are for convenience
subject to: y_n w^T x_n ≥ 1, ∀n           separate by a non-trivial margin

Non-separable case: soft margin SVM
min_w (1/2) ||w||² + C Σ_n ξ_n            maximize the margin, minimize the slack
subject to: y_n w^T x_n ≥ 1 − ξ_n, ∀n     allow some slack
            ξ_n ≥ 0