Optimization and (Under/over)fitting
EECS 442 - David Fouhey
Fall 2019, University of Michigan
http://web.eecs.umich.edu/~fouhey/teaching/EECS442_F19/
Regularized Least Squares
Add regularization to the objective that prefers some solutions:

Before: $\arg\min_w \|y - Xw\|_2^2$   (loss)
After:  $\arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2$   (loss + regularization, with $\lambda$ trading them off)

Want the model "smaller": pay a penalty for w with a big norm.
Intuitive objective: an accurate model (low loss) but not too complex (low regularization). $\lambda$ controls how much of each.
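To make the trade-off concrete, here is a minimal numpy sketch (the data X, y and the value of lambda are made up for illustration); for this particular objective the minimizer even has a closed form:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # made-up design matrix
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
lam = 0.1                                # lambda: regularization strength

def loss(w):
    # ||y - Xw||^2 + lambda * ||w||^2
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# Closed-form minimizer: w = (X^T X + lambda*I)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

Larger lam shrinks w_hat toward 0; lam = 0 recovers ordinary least squares.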
Nearest Neighbors
Test image $x_T$; known images $x_1, \ldots, x_n$ with labels (cat, dog, ...).
(1) Compute the distance $E(x_i, x_T)$ between feature vectors,
(2) find the nearest $x_i$,
(3) use its label. Cat!
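A minimal sketch of steps (1)-(3), assuming Euclidean distance for E and numpy arrays for the known images' features:

import numpy as np

def nearest_neighbor_label(x_test, X_train, labels):
    # (1) distance E between the test feature vector and each known one
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # (2) find the nearest x_i, (3) use its label
    return labels[np.argmin(dists)]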
Picking Parameters
What distance? What value for k / $\lambda$?
Training: use these data points for lookup.
Validation: evaluate on these points for different k, $\lambda$, distances.
Test: held out for the final evaluation.
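A hedged sketch of the validation loop, assuming a hypothetical predict(x, X_train, y_train, k) helper (e.g., a k-NN classifier) and accuracy as the metric:

import numpy as np

def pick_k(ks, X_train, y_train, X_val, y_val, predict):
    # Try each candidate k on the validation split; keep the best.
    # The test split stays untouched until the very end.
    best_k, best_acc = None, -1.0
    for k in ks:
        preds = np.array([predict(x, X_train, y_train, k) for x in X_val])
        acc = np.mean(preds == y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k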
Linear Models
Example setup: 3 classes.
Model: one weight vector per class: $w_0, w_1, w_2$.
Want: $w_0^T x$ big if cat, $w_1^T x$ big if dog, $w_2^T x$ big if hippo.
Stack the weight vectors together into a matrix W; the model is $Wx$, where $x \in \mathbb{R}^F$.
Linear Models
Cat weight vector   [0.2, -0.5, 0.1, 2.0, 1.1]   → cat score   -96.8
Dog weight vector   [1.5,  1.3, 2.1, 0.0, 3.2]   → dog score   437.9
Hippo weight vector [0.0,  0.3, 0.2, -0.3, -1.2] → hippo score  61.95
(input features include pixel values such as 56, 231, 24)
The prediction $Wx$ is a vector whose jth component is the "score" for the jth class; the weight matrix is a collection of scoring functions, one per class.
Diagram by: Karpathy, Fei-Fei
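As a sketch, the whole diagram is one matrix-vector product. The numbers below are loosely transcribed from the slide's diagram; in particular the final input feature 2.0 is an assumption, so the printed scores are only approximate:

import numpy as np

W = np.array([[0.2, -0.5, 0.1,  2.0],    # cat scoring function
              [1.5,  1.3, 2.1,  0.0],    # dog scoring function
              [0.0,  0.3, 0.2, -0.3]])   # hippo scoring function
b = np.array([1.1, 3.2, -1.2])           # per-class biases
x = np.array([56.0, 231.0, 24.0, 2.0])   # flattened image features

scores = W @ x + b   # scores[j] is the "score" for class j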
Objective 1: Multiclass SVM*
Inference (x): $\arg\max_k (Wx)_k$ (take the class whose weight vector gives the highest score)
Training ($x_i, y_i$):

$\arg\min_W \; \lambda\|W\|_F^2 + \sum_{i=1}^n \sum_{j \neq y_i} \max\left(0,\; (Wx_i)_j - (Wx_i)_{y_i}\right)$

Regularization, plus: over all data points, for every class j that's NOT the correct one ($y_i$), pay no penalty if the prediction for class $y_i$ is bigger than for j; otherwise, pay proportional to the score of the wrong class.
Objective 1: Multiclass SVM
Inference (x): $\arg\max_k (Wx)_k$ (take the class whose weight vector gives the highest score)
Training ($x_i, y_i$):

$\arg\min_W \; \lambda\|W\|_F^2 + \sum_{i=1}^n \sum_{j \neq y_i} \max\left(0,\; (Wx_i)_j - (Wx_i)_{y_i} + m\right)$

Same as before, except: pay no penalty only if the prediction for class $y_i$ is bigger than for j by at least m (the "margin"); otherwise, pay proportional to the score of the wrong class.
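A vectorized numpy sketch of this training loss (the function and argument names are my own; lam and m are illustrative defaults):

import numpy as np

def multiclass_svm_loss(W, X, y, lam=1e-3, m=1.0):
    # lam*||W||^2 + sum_i sum_{j != y_i} max(0, (W x_i)_j - (W x_i)_{y_i} + m)
    scores = X @ W.T                              # (n, C) class scores
    correct = scores[np.arange(len(y)), y]        # score of the true class
    margins = np.maximum(0, scores - correct[:, None] + m)
    margins[np.arange(len(y)), y] = 0             # skip j == y_i
    return lam * np.sum(W ** 2) + margins.sum()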
Objective 1: Called a Support Vector Machine
Lots of great theory as to why this is a sensible thing to do. See this useful (and free!) book:
The Elements of Statistical Learning. Hastie, Tibshirani, Friedman.
https://web.stanford.edu/~hastie/ElemStatLearn/
Objective 2: Making Probabilities
Converting scores to a "probability distribution":

Cat score   -0.9 → exp → 0.41 → normalize → 0.11 = P(cat)
Dog score    0.4 → exp → 1.49 → normalize → 0.40 = P(dog)
Hippo score  0.6 → exp → 1.82 → normalize → 0.49 = P(hippo)
                        Σ = 3.72

Generally, $P(\text{class } j) = \frac{\exp((Wx)_j)}{\sum_k \exp((Wx)_k)}$. Called the softmax function.
Inspired by: Karpathy, Fei-Fei
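A small sketch of softmax; subtracting the max score first is a standard numerical-stability trick (exp overflows for large scores, and a shared shift cancels in the ratio):

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # shift doesn't change the result
    return e / e.sum()

softmax(np.array([-0.9, 0.4, 0.6]))  # ~[0.11, 0.40, 0.49], as on the slide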
Objective 2: Softmax
Inference (x): $\arg\max_k (Wx)_k$ (take the class whose weight vector gives the highest score)

$P(\text{class } j) = \frac{\exp((Wx)_j)}{\sum_k \exp((Wx)_k)}$

Why can we skip the exp / sum-exp thing to make a decision? Because exp is monotonic and the denominator is the same for every class, the class with the highest score also has the highest probability.
Objective 2: Softmax
Inference (x): $\arg\max_k (Wx)_k$ (take the class whose weight vector gives the highest score)
Training ($x_i, y_i$):

$\arg\min_W \; \lambda\|W\|_F^2 + \sum_{i=1}^n -\log \frac{\exp((Wx_i)_{y_i})}{\sum_k \exp((Wx_i)_k)}$

Regularization, plus: over all data points, pay a penalty for not making the correct class likely, i.e. $-\log$ P(correct class). The "negative log-likelihood".
Objective 2: Softmax
P(correct) = 0.05: 3.0 penalty
P(correct) = 0.5: 0.69 penalty
P(correct) = 0.9: 0.11 penalty
P(correct) = 1: no penalty!
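These penalties are just -log of the probability (natural log); a two-line check:

import numpy as np

for p in [0.05, 0.5, 0.9, 1.0]:
    print(p, -np.log(p))   # ~3.00, ~0.69, ~0.11, 0.0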
How Do We Optimize Things?
Goal: find the w minimizing some loss function L: $\arg\min_{w \in \mathbb{R}^N} L(w)$.
Works for lots of different L's:

Softmax:       $L(W) = \lambda\|W\|_F^2 + \sum_{i=1}^n -\log\left(\exp((Wx_i)_{y_i}) \,/\, \sum_k \exp((Wx_i)_k)\right)$
Least squares: $L(w) = \lambda\|w\|_2^2 + \sum_{i=1}^n (y_i - w^T x_i)^2$
SVM (binary):  $L(w) = \lambda\|w\|_2^2 + \sum_{i=1}^n \max(0,\, 1 - y_i w^T x_i)$
Sample Function to Optimize
f(x, y) = (x + 2y - 7)² + (2x + y - 5)²
Global minimum at (1, 3), where f = 0.
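For reference in the later sketches, the Booth function in Python:

def booth(x, y):
    # global minimum: booth(1, 3) == 0
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2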
Sample Function to Optimize
• I'll switch back and forth between this 2D function (called the Booth function) and other, more learning-focused functions.
• The beauty of optimization is that it's all the same in principle.
• But don't draw too many conclusions: 2D space has qualitative differences from 1000-D space.
See the intro of: Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014.
https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf
A Caveat
• Each point in the picture is a function evaluation.
• Here it takes microseconds, so we can easily see the answer.
• Functions we want to optimize may take hours to evaluate.
A Caveat
Model in your head: moving around a landscape with a teleportation device.
Landscape diagram: Karpathy and Fei-Fei
Option #1A - Grid Search

# systematically try things
best, bestScore = None, float('inf')
for dim1Value in dim1Values:
    ...
    for dimNValue in dimNValues:
        w = [dim1Value, ..., dimNValue]
        if L(w) < bestScore:
            best, bestScore = w, L(w)
return best
Option #1A - Grid Search
Option #1A - Grid Search
Pros:
1. Super simple
2. Only requires being able to evaluate the model
Cons:
1. Scales horribly to high-dimensional spaces
Complexity: samplesPerDim^numberOfDims
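A runnable version of the grid-search pseudocode on the 2D Booth function defined earlier (the grid bounds and resolution are arbitrary choices):

import itertools
import numpy as np

values = np.linspace(-10, 10, 41)             # samplesPerDim = 41
best, best_score = None, float('inf')
for w in itertools.product(values, values):   # 41^2 evaluations
    score = booth(w[0], w[1])
    if score < best_score:
        best, best_score = w, score
# best lands on the global minimum (1, 3)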
Option #1B - Random Search

# Do random stuff, RANSAC style
best, bestScore = None, float('inf')
for iter in range(numIters):
    w = random(N, 1)   # sample
    score = L(w)       # evaluate
    if score < bestScore:
        best, bestScore = w, score
return best
Option #1B - Random Search
Option #1B - Random Search
Pros:
1. Super simple
2. Only requires being able to sample the model and evaluate it
Cons:
1. Slow: throwing darts at a high-dimensional dart board
2. Might miss something: if a fraction ε of each parameter's range is good, P(all N parameters land in the good range) = ε^N
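A runnable version of the random-search pseudocode on the Booth function (the sampling box [-10, 10]² and iteration count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
best, best_score = None, float('inf')
for _ in range(1000):
    w = rng.uniform(-10, 10, size=2)   # sample
    score = booth(w[0], w[1])          # evaluate
    if score < best_score:
        best, best_score = w, score
# best is probably close to (1, 3), but with no guarantee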
When Do You Use Options 1A/1B?
Use these when:
• The number of dimensions is small and the space is bounded
• The objective is impossible to analyze (e.g., test accuracy if we use this distance function)
Random search is arguably more effective; grid search makes it easy to systematically test something (people love certainty).
Option 2 - Use The Gradient
Arrows: gradient
Option 2 - Use The Gradient
Arrows: gradient direction (scaled to unit length)
Option 2 - Use The Gradient
Want: $\arg\min_w L(w)$
What's the geometric interpretation of

$\nabla_w L(w) = \left[\, \partial L/\partial w_1, \;\ldots,\; \partial L/\partial w_N \,\right]^T$ ?

Which is bigger (for small α)?

$L(w) \;\; \lessgtr \;\; L(w + \alpha \nabla_w L(w))$
Option 2 - Use The Gradient
Arrows: gradient direction (scaled to unit length). Stepping to $w + \alpha g$ along the gradient moves uphill: $L(w + \alpha g) > L(w)$ for small α.
Option 2 - Use The Gradient
Method: at each step, move in the direction of the negative gradient.

w = initialize()                  # initialize
for iter in range(numIters):
    g = ∇L(w)                     # eval gradient
    w = w + -stepsize(iter) * g   # update w
return w
Gradient Descent
Given the starting point (blue): $w_{i+1} = w_i + (-9.8 \times 10^{-2}) \times \nabla L(w_i)$
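A runnable sketch of gradient descent on the Booth function, using its analytic gradient and the slide's step size (the starting point is an arbitrary choice):

import numpy as np

def booth_grad(x, y):
    # d/dx and d/dy of (x + 2y - 7)^2 + (2x + y - 5)^2
    df_dx = 2 * (x + 2 * y - 7) + 4 * (2 * x + y - 5)
    df_dy = 4 * (x + 2 * y - 7) + 2 * (2 * x + y - 5)
    return np.array([df_dx, df_dy])

w = np.array([-8.0, 8.0])   # arbitrary starting point
for _ in range(200):
    w = w - 9.8e-2 * booth_grad(w[0], w[1])   # move against the gradient
# w converges to roughly (1, 3)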
Computing the Gradient
How do you compute the gradient?
Numerical method:

$\nabla_w L(w) = \left[\, \partial L/\partial w_1, \;\ldots,\; \partial L/\partial w_N \,\right]^T$

How do you compute each entry? By definition,

$\frac{\partial f(x)}{\partial x} = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon}$

In practice, use the central difference:

$\frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$
Computing the Gradient
How do you compute the gradient?
Numerical method: use $\frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$ for each entry of $\nabla_w L(w)$.
How many function evaluations per dimension? (Two: one at $x + \epsilon$, one at $x - \epsilon$.)
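A sketch of the numerical method with central differences; note the two evaluations of f per dimension:

import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    # (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps) for each dimension i
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g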
Computing the Gradient
How do you compute the gradient?
Analytical method: use calculus!

$\nabla_w L(w) = \left[\, \partial L/\partial w_1, \;\ldots,\; \partial L/\partial w_N \,\right]^T$
Computing the Gradient

$L(w) = \lambda\|w\|_2^2 + \sum_{i=1}^n (y_i - w^T x_i)^2$

$\nabla_w L(w) = 2\lambda w + \sum_{i=1}^n -2\,(y_i - w^T x_i)\, x_i$

Note: if you look at other derivations, things are written either as $(y - w^T x)$ or $(w^T x - y)$; the gradients will differ by a minus sign.
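A sketch of this analytic gradient in numpy, with made-up data; a good habit is to check it against the numerical_gradient sketch above, and the two should agree to several decimal places:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))     # made-up data
y = rng.normal(size=50)
lam = 0.1

def L(w):
    return lam * w @ w + np.sum((y - X @ w) ** 2)

def grad_L(w):
    # 2*lam*w + sum_i -2*(y_i - w^T x_i) * x_i, vectorized
    return 2 * lam * w - 2 * X.T @ (y - X @ w)

w = rng.normal(size=4)
# numerical_gradient(L, w) should closely match grad_L(w)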
Interpreting Gradients (1 Sample)
Recall the update: w = w + -α ∇_w L(w)

$\nabla_w L(w) = 2\lambda w - 2\,(y - w^T x)\, x$
$-\nabla_w L(w) = -2\lambda w + 2\,(y - w^T x)\, x$

The $-2\lambda w$ term pushes w towards 0.
If $y > w^T x$ (prediction too low): then $w \leftarrow w + \alpha x$ for some $\alpha > 0$.
Before: $w^T x$. After: $(w + \alpha x)^T x = w^T x + \alpha x^T x$, and since $x^T x \ge 0$, the prediction goes up.
Quick annoying detail: subgradients
What is the derivative of |x|?
Derivatives/gradients are defined everywhere but 0:

$\frac{d}{dx}|x| = \mathrm{sign}(x)$ for $x \neq 0$; undefined at $x = 0$

Oh no! A discontinuity!
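In code this is rarely a problem: any value in [-1, 1] is a valid subgradient of |x| at 0, and a common convention (e.g., numpy's sign) just picks 0:

import numpy as np

def abs_subgradient(x):
    # sign(x) for x != 0; np.sign(0) == 0, which lies in [-1, 1]
    return np.sign(x)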