  1. Optimization and (Under/over)fitting – EECS 442, David Fouhey, Fall 2019, University of Michigan. http://web.eecs.umich.edu/~fouhey/teaching/EECS442_F19/

  2. Regularized Least Squares. Add a regularization term to the objective that prefers some solutions:
Before: argmin_w ||y − Xw||₂²  (loss)
After: argmin_w ||y − Xw||₂² + λ||w||₂²  (loss + regularization, a trade-off)
Want the model "smaller": pay a penalty for w with a big norm. Intuitive objective: an accurate model (low loss) that is not too complex (low regularization). λ controls how much of each.
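A minimal sketch of this trade-off in the 1-D scalar case (the data and λ values below are made up), where setting the derivative of the regularized objective to zero gives a closed form:

```python
# 1-D regularized least squares: minimize sum (y_i - w x_i)^2 + lam * w^2.
# Setting the derivative to zero gives w = sum(x_i y_i) / (sum(x_i^2) + lam).
def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]             # exactly y = 2x
w_unreg = ridge_1d(xs, ys, 0.0)  # recovers w = 2 exactly
w_reg = ridge_1d(xs, ys, 10.0)   # the penalty shrinks w toward 0
```

Increasing λ always shrinks the solution toward zero, which is exactly the "pay for a big norm" intuition above.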

  3. Nearest Neighbors. Known images x_1 … x_N with labels (Cat, …, Dog); test image x_T. (1) Compute the distance D(x_i, x_T) between feature vectors; (2) find the nearest x_i; (3) use its label: Cat!
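The three steps above can be sketched as a 1-nearest-neighbor classifier (the feature vectors and labels below are made up):

```python
# Minimal 1-nearest-neighbor sketch.
def nearest_neighbor(train, labels, query):
    # (1) squared Euclidean distance to every known vector,
    # (2) index of the nearest one, (3) return its label.
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train]
    return labels[dists.index(min(dists))]

train = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
labels = ["cat", "cat", "dog"]
print(nearest_neighbor(train, labels, [0.9, 1.2]))  # nearest is [1, 1] -> cat
```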

  4. Picking Parameters. What distance? What value for k / λ? Split the data: Training (use these data points for lookup), Validation (evaluate on these points for different k, λ, and distances), Test.

  5. Linear Models. Example setup: 3 classes. Model – one weight vector per class: w_0, w_1, w_2. Want: w_0^T x big if cat; w_1^T x big if dog; w_2^T x big if hippo. Stack the weight vectors together into a matrix W, so the model is Wx, where x ∈ R^F and W ∈ R^{3×F}.

  6. Linear Models. Input x = [56, 231, 24, 2]:
Cat weight vector [0.2, -0.5, 0.1, 2.0], bias 1.1 → cat score -96.8
Dog weight vector [1.5, 1.3, 2.1, 0.0], bias 3.2 → dog score 437.9
Hippo weight vector [0.0, 0.3, 0.2, -0.3], bias -1.2 → hippo score 61.95
The prediction is the vector Wx, where the jth component (Wx)_j is the "score" for the jth class; the weight matrix is a collection of scoring functions, one per class. Diagram by: Karpathy, Fei-Fei
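The scoring can be written directly as s = Wx + b; the sketch below uses the rows read off the diagram (treating x = [56, 231, 24, 2] and the last number in each row as a bias, which is an assumption about how the diagram flattens):

```python
# Scores as a matrix-vector product s = Wx + b.
W = [[0.2, -0.5, 0.1, 2.0],    # cat weights
     [1.5,  1.3, 2.1, 0.0],    # dog weights
     [0.0,  0.3, 0.2, -0.3]]   # hippo weights
b = [1.1, 3.2, -1.2]
x = [56.0, 231.0, 24.0, 2.0]

scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + bias
          for row, bias in zip(W, b)]
# scores[0] is the cat score (-96.8), scores[1] the dog score (437.9)
```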

  7. Objective 1: Multiclass SVM*.
Inference (x): argmax_k (Wx)_k (take the class whose weight vector gives the highest score).
Training (x_i, y_i): argmin_W λ||W||₂² + Σ_{i=1}^n Σ_{k≠y_i} max(0, (Wx_i)_k − (Wx_i)_{y_i})
Regularization; sum over all data points; sum over every class k that's NOT the correct one (y_i). Pay no penalty if the prediction for class y_i is bigger than for class k; otherwise, pay proportional to the score of the wrong class.

  8. Objective 1: Multiclass SVM.
Inference (x): argmax_k (Wx)_k (take the class whose weight vector gives the highest score).
Training (x_i, y_i): argmin_W λ||W||₂² + Σ_{i=1}^n Σ_{k≠y_i} max(0, (Wx_i)_k − (Wx_i)_{y_i} + m)
Pay no penalty if the prediction for class y_i is bigger than for class k by m (the "margin"); otherwise, pay proportional to the score of the wrong class.
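A minimal sketch of the per-example hinge term (the score vectors below are made up):

```python
# Multiclass SVM (hinge) loss for one example.
# scores = Wx for one input; y is the index of the correct class; m the margin.
def multiclass_svm_loss(scores, y, m=1.0):
    return sum(max(0.0, s_k - scores[y] + m)
               for k, s_k in enumerate(scores) if k != y)

# Correct class well ahead of every other class: all hinge terms are zero.
print(multiclass_svm_loss([3.0, 0.5, -1.0], y=0))  # -> 0.0
# Correct class barely ahead: pays for the class inside the margin.
print(multiclass_svm_loss([1.2, 1.0, -2.0], y=1))
```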

  9. Objective 1: Called a Support Vector Machine. Lots of great theory as to why this is a sensible thing to do. See this useful book (free, too!): The Elements of Statistical Learning, Hastie, Tibshirani, Friedman. https://web.stanford.edu/~hastie/ElemStatLearn/

  10. Objective 2: Making Probabilities. Converting scores to a "probability distribution": exponentiate, then normalize.
Cat score -0.9 → e^-0.9 = 0.41 → P(cat) = 0.11
Dog score 0.4 → e^0.4 = 1.49 → P(dog) = 0.40
Hippo score 0.6 → e^0.6 = 1.82 → P(hippo) = 0.49
(∑ = 3.72)
Generally, P(class j) = exp((Wx)_j) / Σ_k exp((Wx)_k). Called the softmax function. Inspired by: Karpathy, Fei-Fei
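The exponentiate-then-normalize step, run on the score column from the slide:

```python
import math

# Softmax sketch: exponentiate the scores, then normalize so they sum to 1.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-0.9, 0.4, 0.6])  # cat, dog, hippo scores from the slide
# probs is approximately [0.11, 0.40, 0.49]
```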

  11. Objective 2: Softmax.
Inference (x): argmax_k (Wx)_k (take the class whose weight vector gives the highest score).
P(class j) = exp((Wx)_j) / Σ_k exp((Wx)_k)
Why can we skip the exp / sum-exp step to make a decision?

  12. Objective 2: Softmax.
Inference (x): argmax_k (Wx)_k (take the class whose weight vector gives the highest score).
Training (x_i, y_i): argmin_W λ||W||₂² + Σ_{i=1}^n −log( exp((Wx_i)_{y_i}) / Σ_k exp((Wx_i)_k) )
Regularization; sum over all data points; the −log of P(correct class) pays a penalty for not making the correct class likely. The "negative log-likelihood".
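The per-example negative log-likelihood can be sketched as (score vectors below are made up):

```python
import math

# Softmax / negative log-likelihood loss for one example.
def softmax_nll(scores, y):
    exps = [math.exp(s) for s in scores]
    p_correct = exps[y] / sum(exps)     # P(correct class)
    return -math.log(p_correct)         # small when that probability is high

# Uniform scores give -log(1/3); a confident correct score gives near-zero loss.
print(softmax_nll([0.0, 0.0, 0.0], y=0))
print(softmax_nll([5.0, 0.0, 0.0], y=0))
```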

  13. Objective 2: Softmax. P(correct) = 0.05: 3.0 penalty. P(correct) = 0.5: 0.69 penalty. P(correct) = 0.9: 0.11 penalty. P(correct) = 1: no penalty!

  14. How Do We Optimize Things? Goal: find the w minimizing some loss function L: argmin_{w ∈ R^N} L(w). Works for lots of different Ls:
L(W) = λ||W||₂² + Σ_{i=1}^n −log( exp((Wx_i)_{y_i}) / Σ_k exp((Wx_i)_k) )
L(w) = λ||w||₂² + Σ_{i=1}^n (y_i − w^T x_i)²
L(w) = λ||w||₂² + Σ_{i=1}^n max(0, 1 − y_i w^T x_i)

  15. Sample Function to Optimize: f(x, y) = (x + 2y − 7)² + (2x + y − 5)². Global minimum at (1, 3), where both squared terms are zero.
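The function from the slide, written out directly:

```python
# The Booth function; its global minimum f(1, 3) = 0 follows from
# x + 2y - 7 = 0 and 2x + y - 5 = 0 holding simultaneously at (1, 3).
def booth(x, y):
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

print(booth(1.0, 3.0))  # -> 0.0
print(booth(0.0, 0.0))  # -> 74.0
```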

  16. Sample Function to Optimize
• I'll switch back and forth between this 2D function (called the Booth Function) and other more-learning-focused functions.
• The beauty of optimization is that it's all the same in principle.
• But don't draw too many conclusions: 2D space has qualitative differences from 1000D space. See the intro of: Dauphin et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014. https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf

  17. A Caveat
• Each point in the picture is a function evaluation.
• Here it takes microseconds, so we can easily see the answer.
• Functions we want to optimize may take hours to evaluate.

  18. A Caveat. Model in your head: moving around a landscape with a teleportation device. (Contour plot over roughly [-20, 20] in each dimension.) Landscape diagram: Karpathy and Fei-Fei

  19. Option #1A – Grid Search

# systematically try things
best, bestScore = None, float('inf')
for dim1Value in dim1Values:
    ...
    for dimNValue in dimNValues:
        w = [dim1Value, ..., dimNValue]
        if L(w) < bestScore:
            best, bestScore = w, L(w)
return best

  20. Option #1A – Grid Search

  21. Option #1A – Grid Search
Pros: 1. Super simple. 2. Only requires being able to evaluate the model.
Cons: 1. Scales horribly to high-dimensional spaces. Complexity: samplesPerDim^numberOfDims.
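A runnable version of the grid-search sketch on the 2D Booth function (the search bounds and grid resolution below are made up):

```python
# Grid search over the Booth function on [-10, 10] x [-10, 10].
def booth(x, y):
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

def grid_search(loss, lo=-10.0, hi=10.0, steps=41):
    best, best_score = None, float("inf")
    vals = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    for x in vals:               # one nested loop per dimension:
        for y in vals:           # cost grows as steps ** numberOfDims
            score = loss(x, y)
            if score < best_score:
                best, best_score = (x, y), score
    return best, best_score

best, best_score = grid_search(booth)
# with 41 steps the grid spacing is 0.5, so it lands exactly on (1.0, 3.0)
```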

  22. Option #1B – Random Search

# do random stuff, RANSAC style
best, bestScore = None, float('inf')
for iter in range(numIters):
    w = random(N, 1)   # sample
    score = L(w)       # evaluate
    if score < bestScore:
        best, bestScore = w, score
return best

  23. Option #1B – Random Search

  24. Option #1B – Random Search
Pros: 1. Super simple. 2. Only requires being able to sample the model and evaluate it.
Cons: 1. Slow – throwing darts at a high-dimensional dart board. 2. Might miss something: if the good parameters occupy a fraction ε of the [0, 1] range in each of N dimensions, P(all correct) = ε^N.
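A runnable version of the random-search sketch on the Booth function (the bounds, iteration count, and seed below are made up):

```python
import random

# Random search over the Booth function on [-10, 10] x [-10, 10].
def booth(x, y):
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

def random_search(loss, num_iters=20000, lo=-10.0, hi=10.0, seed=0):
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(num_iters):
        w = (rng.uniform(lo, hi), rng.uniform(lo, hi))  # sample
        score = loss(*w)                                # evaluate
        if score < best_score:
            best, best_score = w, score
    return best, best_score

best, best_score = random_search(booth)
# best lands near the true minimum (1, 3), but rarely exactly on it
```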

  25. When Do You Use Options 1A/1B? Use these when:
• The number of dimensions is small and the space is bounded.
• The objective is impossible to analyze (e.g., test accuracy if we use this distance function).
Random search is arguably more effective; grid search makes it easy to systematically test something (people love certainty).

  26. Option 2 – Use The Gradient. Arrows: gradient

  27. Option 2 – Use The Gradient. Arrows: gradient direction (scaled to unit length)

  28. Option 2 – Use The Gradient. Want: argmin_w L(w).
∇_w L(w) = [∂L/∂w_1, …, ∂L/∂w_N]^T
What's the geometric interpretation of the gradient? Which is bigger (for small α): L(w) or L(w + α∇_w L(w))?

  29. Option 2 – Use The Gradient. Arrows: gradient direction (scaled to unit length). Shown: a step from w to w + αg.

  30. Option 2 – Use The Gradient. Method: at each step, move in the direction of the negative gradient.

w = initialize()                  # initialize
for iter in range(numIters):
    g = ∇_w L(w)                  # eval gradient
    w = w - stepsize(iter) * g    # update w
return w

  31. Gradient Descent. Given a starting point (blue): w_{i+1} = w_i − 9.8×10⁻² × gradient
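The update run on the Booth function with the slide's step size of 9.8×10⁻² (the starting point and iteration count below are made up):

```python
# Gradient descent on booth(x, y) = (x+2y-7)^2 + (2x+y-5)^2.
# The gradient follows from differentiating each squared term.
def booth_grad(x, y):
    gx = 2 * (x + 2 * y - 7) + 4 * (2 * x + y - 5)
    gy = 4 * (x + 2 * y - 7) + 2 * (2 * x + y - 5)
    return gx, gy

x, y = 0.0, 0.0                     # starting point
for _ in range(500):
    gx, gy = booth_grad(x, y)
    x, y = x - 9.8e-2 * gx, y - 9.8e-2 * gy
# (x, y) converges to the global minimum (1, 3)
```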

  32. Computing the Gradient. How do you compute the gradient? Numerical method:
∇_w L(w) = [∂L(w)/∂w_1, …, ∂L(w)/∂w_N]^T
How do you compute each entry? By definition, ∂g(x)/∂x = lim_{ε→0} (g(x + ε) − g(x)) / ε.
In practice, use the central difference: (g(x + ε) − g(x − ε)) / (2ε)

  33. Computing the Gradient. Numerical method: use (g(x + ε) − g(x − ε)) / (2ε) for each entry of ∇_w L(w). How many function evaluations per dimension?
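The central-difference rule, sketched as code and checked on the Booth function (two function evaluations per dimension; ε below is a made-up choice):

```python
# Central-difference numerical gradient.
def booth(w):
    x, y = w
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

def numerical_grad(f, w, eps=1e-5):
    grad = []
    for i in range(len(w)):
        w_plus = list(w);  w_plus[i] += eps     # g(x + eps)
        w_minus = list(w); w_minus[i] -= eps    # g(x - eps)
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad

g = numerical_grad(booth, [0.0, 0.0])
# the analytic gradient at (0, 0) is (-34, -38)
```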

  34. Computing the Gradient. Analytical method: use calculus! ∇_w L(w) = [∂L(w)/∂w_1, …, ∂L(w)/∂w_N]^T

  35. Computing the Gradient.
L(w) = λ||w||₂² + Σ_{i=1}^n (y_i − w^T x_i)²
Take ∂/∂w:
∇_w L(w) = 2λw + Σ_{i=1}^n −2(y_i − w^T x_i) x_i
Note: if you look at other derivations, things are written either as (y − w^T x) or (w^T x − y); the gradients will differ by a minus sign.
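The analytic gradient above can be sanity-checked against a central difference on tiny made-up data (two 2-D points, λ = 0.1, and the test point w are all assumptions for illustration):

```python
lam = 0.1
xs = [[1.0, 2.0], [3.0, -1.0]]
ys = [1.5, 0.5]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def loss(w):
    return lam * dot(w, w) + sum((y - dot(w, x)) ** 2 for x, y in zip(xs, ys))

def analytic_grad(w):
    # 2*lam*w  +  sum_i -2*(y_i - w^T x_i) * x_i
    g = [2 * lam * wj for wj in w]
    for x, y in zip(xs, ys):
        r = y - dot(w, x)
        for j in range(len(g)):
            g[j] += -2 * r * x[j]
    return g

w, eps = [0.5, -0.5], 1e-5
num = [(loss([w[0] + eps, w[1]]) - loss([w[0] - eps, w[1]])) / (2 * eps),
       (loss([w[0], w[1] + eps]) - loss([w[0], w[1] - eps])) / (2 * eps)]
# num agrees with analytic_grad(w) up to floating-point error
```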

  36. Interpreting Gradients (1 Sample). Recall the update: w = w − α ∇_w L(w).
∇_w L(w) = 2λw − 2(y − w^T x) x
−∇_w L(w) = −2λw + 2(y − w^T x) x
The −2λw term pushes w towards 0. If y > w^T x (the prediction is too low): then w = w + αx for some α > 0.
Before: w^T x. After: (w + αx)^T x = w^T x + α x^T x, which is bigger, so the prediction moves up.

  37. Quick annoying detail: subgradients. What is the derivative of |x|? Derivatives/gradients are defined everywhere but 0:
d/dx |x| = sign(x) for x ≠ 0, undefined at x = 0.
Oh no! A discontinuity!
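A common way out in practice is to pick any valid subgradient at the kink; the sketch below uses 0 at x = 0 (any value in [-1, 1] would be valid there):

```python
# A subgradient of |x|: sign(x) away from 0, and a chosen value at 0.
def abs_subgrad(x):
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # picked from the valid subgradient set [-1, 1]
```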
