

  1. Introduction to Machine Learning: 2. Basic Tools. Alex Smola & Geoff Gordon, Carnegie Mellon University, 10-701. http://alex.smola.org/teaching/cmu2013-10-701x

  2. This is not a toy dataset. [Image: http://wadejohnston1962.files.wordpress.com/2012/09/datainoneminute.jpg]

  3. Linear Regression

  4. Linear Regression
  • Observations x_i, labels y_i
  • Linear function f(x) = a x + b
  • Minimize the squared distance: minimize_{a,b} Σ_{i=1}^m (1/2)(a x_i + b - y_i)^2
  • Setting the derivatives to zero:
    ∂_a [...] = 0 = Σ_{i=1}^m x_i (a x_i + b - y_i)
    ∂_b [...] = 0 = Σ_{i=1}^m (a x_i + b - y_i)
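Not on the slide, but useful to complete the derivation: the two conditions above form a 2x2 linear system in a and b, with the familiar closed-form solution
    a = (Σ_{i=1}^m x_i y_i - m x̄ ȳ) / (Σ_{i=1}^m x_i^2 - m x̄^2),   b = ȳ - a x̄,
where x̄ = (1/m) Σ_i x_i and ȳ = (1/m) Σ_i y_i are the sample means of the inputs and labels.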

  5. Linear Regression
  • Optimization problem: f(x) = ⟨a, x⟩ + b = ⟨w, (x, 1)⟩, i.e. with augmented inputs x̄_i = (x_i, 1)
    minimize_w Σ_{i=1}^m (1/2)(⟨w, x̄_i⟩ - y_i)^2
  • Solving it: 0 = Σ_{i=1}^m x̄_i (⟨w, x̄_i⟩ - y_i)  ⟺  [Σ_{i=1}^m x̄_i x̄_i^T] w = Σ_{i=1}^m y_i x̄_i
    only requires a matrix inversion.
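A minimal Octave sketch of this step (the variable names xx and yy are assumptions, chosen to match the pseudocode slide below; the slide says "matrix inversion", and the backslash solver performs the equivalent solve more stably):

    % xx: m x 1 inputs, yy: m x 1 labels (assumed names)
    Xbar = [xx, ones(size(xx))];          % augmented inputs x̄_i = (x_i, 1)
    w = (Xbar' * Xbar) \ (Xbar' * yy);    % solve [Σ x̄_i x̄_i^T] w = Σ y_i x̄_i
    yhat = Xbar * w;                      % fitted values ⟨w, x̄_i⟩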

  6. Nonlinear Regression
  • Linear model    f(x) = ⟨w, (1, x)⟩
  • Quadratic model f(x) = ⟨w, (1, x, x^2)⟩
  • Cubic model     f(x) = ⟨w, (1, x, x^2, x^3)⟩
  • Nonlinear model f(x) = ⟨w, φ(x)⟩

  7. Linear Regression (recap of slide 5: the same optimization problem, normal equations, and matrix-inversion solution)

  8. Nonlinear Regression
  • Optimization problem: f(x) = ⟨w, φ(x)⟩,
    minimize_w Σ_{i=1}^m (1/2)(⟨w, φ(x_i)⟩ - y_i)^2
  • Solving it: 0 = Σ_{i=1}^m φ(x_i)(⟨w, φ(x_i)⟩ - y_i)  ⟺  [Σ_{i=1}^m φ(x_i) φ(x_i)^T] w = Σ_{i=1}^m y_i φ(x_i)
    only requires a matrix inversion.

  9. Pseudocode (degree 4)
  Training:
    phi_xx = [xx.^4, xx.^3, xx.^2, xx, 1.0 + 0.0 * xx];   % monomial features plus a constant column
    w = (yy' * phi_xx) / (phi_xx' * phi_xx);               % solve the normal equations; w is a row vector
  Testing:
    phi_x = [x.^4, x.^3, x.^2, x, 1.0 + 0.0 * x];
    y = phi_x * w';
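The slide's snippet assumes training data xx, yy and test inputs x already exist. A self-contained Octave version with synthetic data (the data-generating function below is only an illustrative assumption, not the course dataset):

    m = 50;
    xx = linspace(-1, 1, m)';                  % training inputs
    yy = sin(3 * xx) + 0.1 * randn(m, 1);      % assumed toy targets with noise
    phi_xx = [xx.^4, xx.^3, xx.^2, xx, 1.0 + 0.0 * xx];
    w = (yy' * phi_xx) / (phi_xx' * phi_xx);   % training: solve the normal equations
    x = linspace(-1, 1, 200)';                 % test inputs
    phi_x = [x.^4, x.^3, x.^2, x, 1.0 + 0.0 * x];
    y = phi_x * w';                            % testing: predictions
    plot(xx, yy, 'o', x, y, '-');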

  10. Regression (d=1)

  11. Regression (d=2)

  12. Regression (d=3)

  13. Regression (d=4)

  14. Regression (d=5)

  15. Regression (d=6)

  16. Regression (d=7)

  17. Regression (d=8)

  18. Regression (d=9)

  19. Nonlinear Regression
    warning: matrix singular to machine precision, rcond = 5.8676e-19
    warning: attempting to find minimum norm solution
    warning: matrix singular to machine precision, rcond = 5.86761e-19
    warning: attempting to find minimum norm solution
    warning: dgelsd: rank deficient 8x8 matrix, rank = 7
    warning: matrix singular to machine precision, rcond = 1.10156e-21
    warning: attempting to find minimum norm solution
    warning: matrix singular to machine precision, rcond = 1.10145e-21
    warning: attempting to find minimum norm solution
    warning: dgelsd: rank deficient 9x9 matrix, rank = 6
    warning: matrix singular to machine precision, rcond = 2.16217e-26
    warning: attempting to find minimum norm solution
    warning: matrix singular to machine precision, rcond = 1.66008e-26
    warning: attempting to find minimum norm solution
    warning: dgelsd: rank deficient 10x10 matrix, rank = 5

  20. Nonlinear Regression
  Why does it fail? (the same solver warning output as on the previous slide)
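One way to see the failure numerically (a hedged sketch; the inputs below are placeholders on roughly the same 40-110 scale as the plots): the Gram matrix Φ^T Φ of the monomial features becomes nearly singular as the degree grows, so its reciprocal condition number collapses and the solver falls back to a minimum-norm solution, exactly as the warnings report.

    xx = (40:5:110)';                 % assumed inputs on the scale of the plots
    for d = 1:10
      Phi = xx .^ (0:d);              % columns 1, x, x^2, ..., x^d (relies on broadcasting)
      printf('degree %2d   rcond = %g\n', d, rcond(Phi' * Phi));
    end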

  21. Model Selection
  • Underfitting: the model is too simple to explain the data
  • Overfitting: the model is too complicated to be learned from the data
    • e.g. too many parameters
    • insufficient confidence to estimate the parameters (failed matrix inverse)
    • often the training error decreases nonetheless
  • Model selection: need to quantify model complexity vs. data
  • This course: algorithms, model selection, questions

  22. Parzen Windows

  23. Density Estimation
  • Observe some data x_i
  • Want to estimate p(x)
  • Find unusual observations (e.g. security)
  • Find typical observations (e.g. prototypes)
  • Classifier via Bayes rule: p(y|x) = p(x, y) / p(x) = p(x|y) p(y) / Σ_{y'} p(x|y') p(y')
  • Need a tool for computing p(x) easily

  24. Bin Counting
  • Discrete random variables, e.g.
    • English, Chinese, German, French, ...
    • Male, Female
  • Bin counting (record # of occurrences), 25 observations in total:
              English  Chinese  German  French  Spanish
    male         5        2        3       1       0
    female       6        3        2       2       1

  25. Bin Counting
  • Discrete random variables, e.g.
    • English, Chinese, German, French, ...
    • Male, Female
  • Bin counting (record # of occurrences), counts divided by the 25 observations:
              English  Chinese  German  French  Spanish
    male        0.2      0.08     0.12    0.04    0
    female      0.24     0.12     0.08    0.08    0.04
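To connect the table back to the Bayes-rule classifier on the density-estimation slide: the normalized table estimates the joint p(gender, language), and conditionals follow by dividing by a marginal. A small Octave sketch using the counts above:

    counts = [5 2 3 1 0;               % male:   English Chinese German French Spanish
              6 3 2 2 1];              % female
    P = counts / sum(counts(:));       % joint estimate: the relative-frequency table on this slide
    p_lang = sum(P, 1);                % marginal p(language)
    p_gender_given_lang = P ./ p_lang; % Bayes rule: p(gender | language), column by column

The empty male/Spanish cell already hints at the "not enough data" problem raised on the following slides.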

  26. Bin Counting (same relative-frequency table as the previous slide)

  27. Bin Counting
  • Same relative-frequency table as above, with the annotation "not enough data": several cells (e.g. male/Spanish) rest on very few or no observations.

  28. Curse of dimensionality (lite)
  • Discrete random variables, e.g.
    • English, Chinese, German, French, ...
    • Male, Female
    • ZIP code
    • Day of the week
    • Operating system
    • ...
    → the number of bins grows exponentially with the number of variables
  • Continuous random variables, e.g.
    • Income
    • Bandwidth
    • Time
    → need many bins per dimension

  29. Curse of dimensionality (lite) (repeat of the previous slide)

  30. Density Estimation
  [Figure: histogram of a sample next to the underlying density, on the range 40-110]
  • Continuous domain = infinite number of bins
  • Curse of dimensionality
    • 10 bins on [0, 1] is probably good
    • 10^10 bins on [0, 1]^10 requires high accuracy in the estimate: the probability mass per cell also decreases by a factor of 10^10.

  31. Bin Counting

  32. Bin Counting

  33. Bin Counting

  34. Bin Counting: can’t we just go and smooth this out?

  35. Parzen Windows
  • Naive approach: use the empirical density (delta distributions)
    p_emp(x) = (1/m) Σ_{i=1}^m δ_{x_i}(x)
  • This breaks if we see slightly different instances
  • Kernel density estimate: smear out the empirical density with a nonnegative smoothing kernel k_x(x') satisfying
    ∫_X k_x(x') dx' = 1 for all x

  36. Parzen Windows
  • Density estimate: smooth the empirical density
    p_emp(x) = (1/m) Σ_{i=1}^m δ_{x_i}(x)   →   p̂(x) = (1/m) Σ_{i=1}^m k_{x_i}(x)
  • Smoothing kernels (plotted on [-2, 2]):
    Gauss:         (2π)^(-1/2) exp(-x^2/2)
    Laplace:       (1/2) exp(-|x|)
    Epanechnikov:  (3/4) max(0, 1 - x^2)
    Uniform:       (1/2) χ_{[-1,1]}(x)
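A quick numerical check (a sketch, not from the slides) that these four kernels satisfy the normalization condition ∫ k(x) dx = 1 from the previous slide:

    x = linspace(-10, 10, 20001);
    gauss   = (2*pi)^(-1/2) * exp(-x.^2 / 2);
    laplace = 0.5 * exp(-abs(x));
    epan    = 0.75 * max(0, 1 - x.^2);         % Epanechnikov
    unif    = 0.5 * (abs(x) <= 1);
    disp([trapz(x, gauss), trapz(x, laplace), trapz(x, epan), trapz(x, unif)])   % each approximately 1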

  37. Smoothing

  38. Smoothing
    dist = norm(X - x * ones(1,m), 'columns');                    % distance from query point x to each of the m samples in X
    p = (1/m) * ((2 * pi)**(-d/2)) * sum(exp(-0.5 * dist.**2))    % Gaussian Parzen window estimate (bandwidth 1)
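The snippet assumes a d x m data matrix X, a query point x, and bandwidth 1. A self-contained sketch under those assumptions, with a placeholder one-dimensional sample (not the slide's data), evaluating the estimate on a grid of query points:

    X = [55 62 64 68 70 71 75 80 83 91];        % assumed 1-d sample, so d = 1
    [d, m] = size(X);
    xs = linspace(40, 110, 200);                % query points
    p = zeros(size(xs));
    for j = 1:numel(xs)
      dist = norm(X - xs(j) * ones(1, m), 'columns');              % distance to every sample, as on the slide
      p(j) = (1/m) * ((2*pi)^(-d/2)) * sum(exp(-0.5 * dist.^2));   % Gaussian Parzen estimate, bandwidth 1
    end
    plot(xs, p);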

  39. Smoothing

  40. Smoothing

  41. Size matters
  [Figure: Parzen window estimates of the same sample with kernel widths 0.3, 1, 3, and 10 on the range 40-100, next to the sample histogram and underlying density from the earlier density-estimation slide]

  42. Size matters (shape matters mostly in theory)
  [Figure: the four kernel-width estimates from the previous slide]
  • Kernel width: k_{x_i}(x) = r^{-d} h((x - x_i)/r)
  • Too narrow: overfits
  • Too wide: smoothes towards a constant distribution
  • How to choose?
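A sketch of the r^{-d} h((x - x_i)/r) scaling with a Gaussian h, sweeping the widths shown on the previous slide (0.3, 1, 3, 10); the sample X is again an assumed placeholder:

    X = [55 62 64 68 70 71 75 80 83 91];        % assumed 1-d sample (d = 1)
    m = numel(X);
    xs = linspace(40, 110, 200);
    for r = [0.3 1 3 10]
      P = (1/m) * sum(r^(-1) * (2*pi)^(-1/2) * exp(-0.5 * ((xs - X') / r).^2), 1);
      plot(xs, P); hold on;                     % narrow r overfits, wide r flattens the estimate
    end
    hold off;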

  43. Model Selection

  44. Maximum Likelihood
  • Need to measure how well we do
  • For density estimation we care about Pr{X} = Π_{i=1}^m p(x_i)
  • Finding a density estimate that maximizes Pr{X} will peak at all data points, since x_i explains x_i best ...
  • The maxima are delta functions on the data.
  • Overfitting!
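One standard way to operationalize this measure without rewarding the delta-function solution, consistent with the train-vs-typical comparison on the next slides, is to compare the average log-likelihood on held-out data when choosing the kernel width. A hedged sketch with synthetic data:

    X = sort(randn(1, 100) * 10 + 70);          % assumed synthetic sample
    Xtrain = X(1:2:end);  Xval = X(2:2:end);    % split into fit and held-out halves
    for r = [0.3 1 3 10]
      P = mean(r^(-1) * (2*pi)^(-1/2) * exp(-0.5 * ((Xval - Xtrain') / r).^2), 1);   % Parzen estimate at held-out points
      printf('r = %4.1f   held-out mean log-likelihood = %g\n', r, mean(log(max(P, realmin))));
    end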

  45. Overfitting
  [Figure: density estimate on the range 40-100]
  • Likelihood on the training set is much higher than typical.

  46. Overfitting
  [Figure: the same estimate, annotated "density ≫ 0" at the sample points and "density 0" away from them]
  • Likelihood on the training set is much higher than typical.

  47. Underfitting
  [Figure: a heavily smoothed density estimate on the range 40-100]
  • Likelihood on the training set is very similar to the typical one. Too simple.
