Supervised Learning II

  1. Supervised Learning II Cameron Allen csal@brown.edu Fall 2019

  2. Machine Learning Subfield of AI concerned with learning from data. Broadly, using: • Experience • To Improve Performance • On Some Task (Tom Mitchell, 1997)

  3. Supervised Learning Input: inputs X = {x_1, …, x_n} (training data), Y = {y_1, …, y_n} (labels). Learn to predict new labels. Given x: y?

  4. Supervised Learning Input: inputs X = {x_1, …, x_n} (training data), Y = {y_1, …, y_n} (labels). Learn to predict new labels. Given x: y? “Not Hotdog”, SeeFood Technologies Inc.

  5. Supervised Learning Input: inputs X = {x_1, …, x_n} (training data), Y = {y_1, …, y_n} (labels). Learn to predict new labels. Given x: y?

  6. Supervised Learning Formal definition: Given training data: inputs X = {x_1, …, x_n}, labels Y = {y_1, …, y_n}. Produce: decision function f : X → Y that minimizes error: Σ_i err(f(x_i), y_i)

  7. Neural Networks σ(w · x + c) (logistic regression)

  8. Deep Neural Networks (diagram: inputs x_1, x_2; hidden layers h_11, h_12, h_13 through h_n1, h_n2, h_n3; outputs o_1, o_2)

  9. Nonparametric Methods Most ML methods are parametric: • Characterized by setting a few parameters. • y = f(x, w) Alternative approach: • Let the data speak for itself. • No finite-sized parameter vector. • Usually more interesting decision boundaries.

  10. K-Nearest Neighbors Given training data: X = {x_1, …, x_n}, Y = {y_1, …, y_n}, and a distance metric D(x_i, x_j). For a new data point x_new: find the k nearest points in X (measured via D); set y_new to the majority label.
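
A minimal sketch of this procedure (assuming numpy and a Euclidean distance metric; the names knn_predict, X_train, and y_train are illustrative, not from the slides):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Distance metric D: Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest points in X
    nearest = np.argsort(dists)[:k]
    # Majority label among those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: two small clusters labeled '-' and '+'
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(['-', '-', '+', '+'])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # likely '+'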

  11. K-Nearest Neighbors (figure: scatter of + and − training points)

  12. K-Nearest Neighbors Decision boundary … what if k=1? (figure: scatter of + and − training points)

  13. K-Nearest Neighbors Properties: • No learning phase. • Must store all the data. • log(n) computation per sample (grows with data). Decision boundary: • any function, given enough data. Classic trade-off: memory and compute time for flexibility.

  14. Applications • Fraud detection • Internet advertising • Friend or link prediction • Sentiment analysis • Face recognition • Spam filtering

  15. Applications MNIST Data Set Training set: 60k digits Test set: 10k digits

  16. Classification vs. Regression If the set of labels Y is discrete: • Classification • Minimize number of errors If Y is real-valued: • Regression • Minimize sum squared error Let’s look at regression.

  17. Regression with Decision Trees Start with decision trees with real-valued inputs. (tree: test a > 3.1; false → y=1; true → test b < 0.6; false → y=2; true → y=1)

  18. Regression with Decision Trees … now real-valued outputs. (tree: test a > 3.1; false → y=0.6; true → test b < 0.6; false → y=0.3; true → y=1.1)

  19. Regression with Decision Trees Training procedure: fix a depth, k. If k=1, fit the average. If k > 1: consider all variables to split on, find the one that minimizes SSE, and recurse with depth k−1. What happens if k = N?
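
A from-scratch sketch of this recursion (assuming numpy; fit_tree and predict_tree are illustrative names, and candidate thresholds are simply the observed values of each variable):

import numpy as np

def fit_tree(X, y, depth):
    # Depth k=1 (or nothing left to split): fit the average of the labels
    if depth <= 1 or len(y) < 2:
        return {"leaf": True, "value": y.mean()}
    best = None
    # Consider all variables to split on; keep the split that minimizes SSE
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if left.all() or right.all():
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, left, right)
    if best is None:
        return {"leaf": True, "value": y.mean()}
    _, j, t, left, right = best
    # Recurse with depth k-1 on each side of the best split
    return {"leaf": False, "var": j, "thresh": t,
            "left": fit_tree(X[left], y[left], depth - 1),
            "right": fit_tree(X[right], y[right], depth - 1)}

def predict_tree(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["var"]] <= node["thresh"] else node["right"]
    return node["value"]

The figure on the next slide is from the scikit-learn docs; its DecisionTreeRegressor(max_depth=k) implements the same depth-limited idea.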

  20. Regression with Decision Trees (via scikit-learn docs)

  21. Linear Regression Alternatively, an explicit equation for prediction. Recall the Perceptron. If x = [x(1), …, x(n)]: • Create an n-d line • Slope for each x(i) • Constant offset f(x) = sign(w · x + c), where w is the gradient and c the offset. (figure: + and − points separated by a line)

  22. Linear Regression Directly represent f as a linear function: • f(x, w) = w · x + c What can be represented this way? (figure: plot with axes x_1, x_2, y)

  23. Linear Regression How to train? Given inputs: • x = [x_1, …, x_n] (each x_i is a vector, first element = 1) • y = [y_1, …, y_n] (each y_i is a real number) Define an error function: minimize the summed squared error Σ_{i=1}^n (w · x_i − y_i)²

  24. Linear Regression The usual story: • Set the derivative of the error function to zero: d/dw Σ_{i=1}^n (w · x_i − y_i)² = 0, so 2 Σ_{i=1}^n (w · x_i − y_i) x_iᵀ = 0, giving (Σ_{i=1}^n x_iᵀ x_i) w = Σ_{i=1}^n x_iᵀ y_i. With the matrix A = Σ_{i=1}^n x_iᵀ x_i and the vector b = Σ_{i=1}^n x_iᵀ y_i, the solution is w = A⁻¹ b
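
A numeric sketch of this closed-form solution (assuming numpy; X is the matrix whose rows are the x_i, with a leading column of ones for the constant term, and the variable names are illustrative):

import numpy as np

# Toy data: y is roughly 2*x + 1
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])   # rows are the x_i (first element = 1)
y = 2 * x + 1 + 0.05 * np.random.randn(20)

A = X.T @ X                                 # A = sum_i x_i^T x_i
b = X.T @ y                                 # b = sum_i x_i^T y_i
w = np.linalg.solve(A, b)                   # w = A^-1 b (solve is more stable than inverting)
print(w)                                    # approximately [1, 2]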

  25. Polynomial Regression More powerful: • Polynomials in state variables. • 1st order: [1, x, y, xy] • 2nd order: [1, x, y, xy, x², y², x²y, y²x, x²y²] • y_i = w · Φ(x_i) What can be represented?

  26. Polynomial Regression As before … set d/dw Σ_{i=1}^n (w · Φ(x_i) − y_i)² = 0, with A = Σ_{i=1}^n Φ(x_i)ᵀ Φ(x_i), b = Σ_{i=1}^n Φ(x_i)ᵀ y_i, and w = A⁻¹ b
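
The same normal-equation sketch with a feature map (phi below is a hypothetical 1-D cubic map [1, x, x², x³] rather than the two-variable map on the slide; names are illustrative):

import numpy as np

def phi(x, order=3):
    # Polynomial feature map Phi(x) = [1, x, x^2, x^3]
    return np.array([x ** p for p in range(order + 1)])

x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + 0.05 * np.random.randn(30)

Phi = np.array([phi(xi) for xi in x])   # rows are Phi(x_i)
A = Phi.T @ Phi                         # A = sum_i Phi(x_i)^T Phi(x_i)
b = Phi.T @ y                           # b = sum_i Phi(x_i)^T y_i
w = np.linalg.solve(A, b)               # w = A^-1 b
y_hat = Phi @ w                         # predictions w · Phi(x_i)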

  27. Polynomial Regression (wikipedia)

  28. Overfitting

  29. Overfitting

  30. Ridge Regression A characteristic of overfitting: • Very large weights. Modify the objective function to discourage this: min_w Σ_{i=1}^n (w · x_i − y_i)² + λ‖w‖² (an error term plus a regularization term). The solution becomes w = (AᵀA + ΛᵀΛ)⁻¹ Aᵀ b
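
A sketch of that closed form (assuming numpy, a design matrix A whose rows are the x_i, and the common choice Λ = sqrt(λ)·I; the data and names are illustrative):

import numpy as np

lam = 0.1
x = np.linspace(0, 1, 20)
A = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])   # design matrix, rows are x_i
y = 2 * x + 1 + 0.05 * np.random.randn(20)                  # targets y_i

L = np.sqrt(lam) * np.eye(A.shape[1])                       # Lambda, so L.T @ L = lam * I
w_ridge = np.linalg.solve(A.T @ A + L.T @ L, A.T @ y)       # w = (A^T A + Λ^T Λ)^-1 A^T b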

  31. Neural Network Regression σ(w · x + c) (classification)

  32. Neural Network Regression (diagram: input layer x_1, x_2; hidden layer h_1, h_2, h_3; output layer o_1, o_2)

  33. Neural Network Regression Feed forward, computing each layer's value from the one below. Hidden layer: h_1 = σ(w_h1,1 x_1 + w_h1,2 x_2 + w_h1,3), and similarly h_2 and h_3. Output layer: o_1 = σ(w_o1,1 h_1 + w_o1,2 h_2 + w_o1,3 h_3 + w_o1,4) and o_2 = σ(w_o2,1 h_1 + w_o2,2 h_2 + w_o2,3 h_3 + w_o2,4). Input layer: x_1, x_2 ∈ [0, 1]
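
A minimal forward-pass sketch of this 2-3-2 network (assuming numpy; the weight matrices W_h and W_o and the sigmoid helper are illustrative, with each unit's bias folded in as its last weight):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_h, W_o):
    # Hidden layer: h_j = sigma(w_hj,1 * x_1 + w_hj,2 * x_2 + w_hj,3)
    h = sigmoid(W_h @ np.append(x, 1.0))
    # Output layer: o_j = sigma(w_oj,1 * h_1 + ... + w_oj,4)
    return sigmoid(W_o @ np.append(h, 1.0))

W_h = np.random.randn(3, 3)   # 3 hidden units, each with 2 input weights + bias
W_o = np.random.randn(2, 4)   # 2 output units, each with 3 hidden weights + bias
print(forward(np.array([0.2, 0.7]), W_h, W_o))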

  34. Neural Network Regression A neural network is just a parametrized function: y = f(x, w). How to train it? Write down an error function, (y_i − f(x_i, w))², and minimize it (w.r.t. w). There is no closed-form solution to setting the gradient to zero. Hence, stochastic gradient descent: • Compute d/dw (y_i − f(x_i, w))² • Descend
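
A stochastic gradient descent sketch reusing the sigmoid and forward helpers above (assuming numpy; the gradient is estimated by finite differences purely for illustration, where a real implementation would use backpropagation, and the data and names are made up):

import numpy as np

X_data = np.random.rand(50, 2)                                         # inputs in [0, 1]
Y_data = np.column_stack([X_data.mean(axis=1), X_data.prod(axis=1)])   # toy targets

def sq_error(params, x, y):
    # Unpack the flat parameter vector w into the two weight matrices
    W_h, W_o = params[:9].reshape(3, 3), params[9:].reshape(2, 4)
    return np.sum((y - forward(x, W_h, W_o)) ** 2)

w = 0.1 * np.random.randn(17)                   # all weights in one vector
lr, eps = 0.5, 1e-5
for step in range(1000):
    i = np.random.randint(len(X_data))          # pick one training example
    grad = np.zeros_like(w)
    for j in range(len(w)):                     # estimate d/dw (y_i - f(x_i, w))^2
        dw = np.zeros_like(w); dw[j] = eps
        grad[j] = (sq_error(w + dw, X_data[i], Y_data[i]) -
                   sq_error(w - dw, X_data[i], Y_data[i])) / (2 * eps)
    w -= lr * grad                              # descend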

  35. Image Colorization (Zhang, Isola, Efros, 2016)

  36. Nonparametric Regression Most ML methods are parametric: • Characterized by setting a few parameters. • y = f(x, w) Alternative approach: • Let the data speak for itself. • No finite-sized parameter vector. • Usually more interesting decision boundaries.

  37. Nonparametric Regression What’s the regression equivalent of k-Nearest Neighbors? Given training data: X = {x_1, …, x_n}, Y = {y_1, …, y_n}, and a distance metric D(x_i, x_j). For a new data point x_new: find the k nearest points in X (measured via D); set y_new to the average of their y_i labels (weighted by D).
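
A sketch of this weighted average (assuming numpy, Euclidean distance, and inverse-distance weights as one common choice for the "weighted by D" average; names are illustrative):

import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # Distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    # Weight each neighbor's label by inverse distance, so closer points count more
    weights = 1.0 / (dists[nearest] + 1e-8)
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

X_train = np.random.rand(100, 1)
y_train = np.sin(3 * X_train[:, 0])
print(knn_regress(X_train, y_train, np.array([0.5]), k=5))   # roughly sin(1.5)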

  38. Nonparametric Regression As k increases, f gets smoother.

  39. Gaussian Processes

  40. Applications model and predict variations in pH, clay, and sand content in the topsoil (Gonzalez et al., 2007)
