Supervised Learning II Cameron Allen csal@brown.edu Fall 2019
Machine Learning Subfield of AI concerned with learning from data. Broadly, using: • Experience • To Improve Performance • On Some Task (Tom Mitchell, 1997)
Supervised Learning Input: training data consisting of inputs X = {x_1, …, x_n} and labels Y = {y_1, …, y_n}. Learn to predict new labels. Given x: y?
Supervised Learning [figure: “Not Hotdog”, SeeFood Technologies Inc.]
Supervised Learning Formal definition: Given training data: inputs X = {x_1, …, x_n} and labels Y = {y_1, …, y_n}. Produce: a decision function f : X → Y that minimizes the error Σ_i err(f(x_i), y_i).
Neural Networks σ(w·x + c) (logistic regression)
Deep Neural Networks [network diagram: inputs x_1, x_2; hidden layers h_11, h_12, h_13 through h_n1, h_n2, h_n3; outputs o_1, o_2]
Nonparametric Methods Most ML methods are parametric: • Characterized by setting a few parameters. • y = f ( x, w ) Alternative approach: • Let the data speak for itself. • No finite-sized parameter vector. • Usually more interesting decision boundaries.
K-Nearest Neighbors Given training data: X = {x_1, …, x_n}, Y = {y_1, …, y_n}, and a distance metric D(x_i, x_j). For a new data point x_new: find the k nearest points in X (measured via D) and set y_new to the majority label.
K-Nearest Neighbors [figure: scatter plot of + and − labeled training points]
K-Nearest Neighbors Decision boundary … what if k=1? [figure: scatter plot of + and − labeled training points]
K-Nearest Neighbors Properties: • No learning phase. • Must store all the data. • log(n) computation per sample, which grows with the data. Decision boundary: • can be any function, given enough data. Classic trade-off: memory and compute time for flexibility.
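To make the procedure above concrete, here is a minimal Python sketch of k-NN classification. It assumes a Euclidean distance for D and numpy arrays; the function name and toy data are illustrative, not from the lecture.

```python
# Minimal k-NN classifier sketch (assumed Euclidean distance D, numpy arrays).
import numpy as np
from collections import Counter

def knn_classify(X, Y, x_new, k=3):
    """Predict a label for x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X - x_new, axis=1)   # D(x_i, x_new) for every training point
    nearest = np.argsort(dists)[:k]             # indices of the k closest points
    return Counter(Y[i] for i in nearest).most_common(1)[0][0]

# Tiny usage example with made-up data
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
Y = np.array(['-', '-', '+', '+'])
print(knn_classify(X, Y, np.array([0.95, 0.9]), k=3))   # likely '+'
```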
Applications • Fraud detection • Internet advertising • Friend or link prediction • Sentiment analysis • Face recognition • Spam filtering
Applications MNIST Data Set Training set: 60k digits Test set: 10k digits
Classification vs. Regression If the set of labels Y is discrete: • Classification • Minimize number of errors If Y is real-valued: • Regression • Minimize sum squared error Let’s look at regression.
Regression with Decision Trees Start with decision trees with real-valued inputs. [tree diagram: a split on a > 3.1, then on b < 0.6, with discrete leaf predictions y=1, y=2, y=1]
Regression with Decision Trees … now real-valued outputs. [tree diagram: a split on a > 3.1, then on b < 0.6, with real-valued leaf predictions y=0.6, y=0.3, y=1.1]
Regression with Decision Trees Training procedure: fix a depth, k. If k = 1, fit the average. If k > 1: consider all variables to split on, find the one that minimizes SSE, then recurse with depth k−1. What happens if k = N?
Regression with Decision Trees (via scikit-learn docs)
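A sketch of the depth-limited training procedure above using scikit-learn's DecisionTreeRegressor, in the spirit of the docs example. The noisy-sine data is made up, and max_depth plays the role of k.

```python
# Fit a depth-limited regression tree with scikit-learn (assumes scikit-learn installed).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)        # 80 points of a single feature
y = np.sin(X).ravel() + 0.1 * rng.randn(80)     # noisy sine targets

tree = DecisionTreeRegressor(max_depth=3)       # fix the depth k = 3
tree.fit(X, y)                                  # greedy splits minimizing squared error
print(tree.predict([[2.5]]))                    # piecewise-constant prediction
```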
Linear Regression Alternatively, an explicit equation for prediction. Recall the Perceptron. If x = [x(1), …, x(n)]: • Create an n-d line • A slope for each x(i) • A constant offset f(x) = sign(w·x + c), with w the gradient and c the offset. [figure: + and − points separated by a line]
Linear Regression Directly represent f as a linear function: • f(x, w) = w·x + c What can be represented this way? [figure: a plane over inputs x_1, x_2 predicting y]
Linear Regression How to train? Given inputs: • x = [x_1, …, x_n] (each x_i is a vector whose first element = 1) • y = [y_1, …, y_n] (each y_i is a real number). Define an error function: minimize the summed squared error Σ_{i=1}^{n} (w·x_i − y_i)².
Linear Regression The usual story: • Set the derivative of the error function to zero: d/dw Σ_{i=1}^{n} (w·x_i − y_i)² = 0, which gives 2 Σ_{i=1}^{n} (w·x_i − y_i) x_i^T = 0. Define the matrix A = Σ_{i=1}^{n} x_i^T x_i and the vector b = Σ_{i=1}^{n} x_i^T y_i. Then w = A⁻¹ b.
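A minimal numpy sketch of this closed-form solution. The helper name and toy data are assumptions, and np.linalg.solve is used instead of an explicit inverse for numerical stability.

```python
# Normal-equation solve from the slide: w = A^{-1} b, with
# A = sum_i x_i^T x_i and b = sum_i x_i^T y_i (first element of each x_i is 1).
import numpy as np

def fit_linear(X, y):
    A = X.T @ X                      # with the x_i stacked as rows, equals sum_i x_i^T x_i
    b = X.T @ y                      # equals sum_i x_i^T y_i
    return np.linalg.solve(A, b)     # solve A w = b rather than forming A^{-1} explicitly

# Toy usage (made-up data): recover w ≈ [2, 3] from noisy 1-D inputs.
rng = np.random.RandomState(0)
x = rng.rand(50)
X = np.column_stack([np.ones(50), x])            # prepend the constant-1 feature
y = 2.0 + 3.0 * x + 0.01 * rng.randn(50)
print(fit_linear(X, y))                          # approximately [2, 3]
```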
Polynomial Regression More powerful: • Polynomials in the state variables. • 1st order: [1, x, y, xy] • 2nd order: [1, x, y, xy, x², y², x²y, y²x, x²y²] • y_i = w · Φ(x_i) What can be represented?
Polynomial Regression As before: set d/dw Σ_{i=1}^{n} (w·Φ(x_i) − y_i)² = 0, with A = Σ_{i=1}^{n} Φ(x_i)^T Φ(x_i), b = Σ_{i=1}^{n} Φ(x_i)^T y_i, and w = A⁻¹ b.
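The same normal-equation solve works on polynomial features Φ(x). The feature function below is a hypothetical implementation of the 2nd-order basis listed above for a 2-D input; the training points and target function are made up.

```python
# Polynomial regression by solving w = A^{-1} b on expanded features Φ(x).
import numpy as np

def phi(p):
    # Hypothetical 2nd-order feature map for a 2-D input (x, y),
    # matching the slide's basis [1, x, y, xy, x^2, y^2, x^2 y, y^2 x, x^2 y^2].
    x, y = p
    return np.array([1, x, y, x*y, x**2, y**2, x**2 * y, y**2 * x, x**2 * y**2])

rng = np.random.RandomState(0)
points = rng.rand(30, 2)                          # made-up training inputs
targets = points[:, 0] ** 2 + points[:, 1]        # assumed target function: x^2 + y

Phi = np.array([phi(p) for p in points])
A = Phi.T @ Phi                                   # sum_i Φ(x_i)^T Φ(x_i)
b = Phi.T @ targets                               # sum_i Φ(x_i)^T y_i
w = np.linalg.solve(A, b)                         # w = A^{-1} b

print(w @ phi([0.5, 0.5]))                        # ≈ 0.75 = 0.5^2 + 0.5
```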
Polynomial Regression (wikipedia)
Overfitting
Ridge Regression A characteristic of overfitting: • Very large weights. Modify the objective function to discourage this: min_w Σ_{i=1}^{n} (w·x_i − y_i)² + λ‖w‖² (an error term plus a regularization term). Closed form: w = (AᵀA + ΛᵀΛ)⁻¹ Aᵀ b, where A stacks the inputs as rows and b is the vector of targets.
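A sketch of this closed form, assuming the common choice Λ = √λ·I (so ΛᵀΛ = λI) and made-up data. Note the constant-1 column is penalized here too, which a more careful implementation might exclude.

```python
# Ridge regression sketch: w = (A^T A + Λ^T Λ)^{-1} A^T b with Λ = sqrt(λ)·I (assumed).
import numpy as np

def fit_ridge(A, b, lam=0.1):
    n_features = A.shape[1]
    penalty = lam * np.eye(n_features)               # Λ^T Λ = λ I
    return np.linalg.solve(A.T @ A + penalty, A.T @ b)

# Usage on made-up data: the penalty shrinks the weights toward zero.
rng = np.random.RandomState(0)
A = np.column_stack([np.ones(40), rng.rand(40, 3)])
b = A @ np.array([1.0, 2.0, -3.0, 0.5]) + 0.05 * rng.randn(40)
print(fit_ridge(A, b, lam=0.0))                      # ordinary least squares
print(fit_ridge(A, b, lam=1.0))                      # smaller-magnitude weights
```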
Neural Network Regression σ(w·x + c) (classification)
Neural Network Regression [network diagram: input layer x_1, x_2; hidden layer h_1, h_2, h_3; output layer o_1, o_2]
Neural Network Regression Feed-forward computation: input layer x_1, x_2 ∈ [0, 1]; hidden values h_j = σ(w^{h_j}_1 x_1 + w^{h_j}_2 x_2 + w^{h_j}_3) for j = 1, 2, 3; output values o_k = σ(w^{o_k}_1 h_1 + w^{o_k}_2 h_2 + w^{o_k}_3 h_3 + w^{o_k}_4) for k = 1, 2.
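A minimal numpy sketch of this feed-forward pass for the 2-3-2 network, with sigmoid activations and random, made-up weights; each layer's constant weight is folded in by appending a 1 to its input.

```python
# Feed-forward pass for a 2-3-2 network with sigmoid units (made-up weights).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_h, W_o):
    h = sigmoid(W_h @ np.append(x, 1.0))     # h_j = σ(w_j1 x_1 + w_j2 x_2 + w_j3)
    o = sigmoid(W_o @ np.append(h, 1.0))     # o_k = σ(w_k1 h_1 + w_k2 h_2 + w_k3 h_3 + w_k4)
    return o

rng = np.random.RandomState(0)
W_h = rng.randn(3, 3)                        # 3 hidden units: 2 input weights + bias each
W_o = rng.randn(2, 4)                        # 2 outputs: 3 hidden weights + bias each
print(forward(np.array([0.2, 0.7]), W_h, W_o))
```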
Neural Network Regression A neural network is just a parametrized function: y = f(x, w). How to train it? Write down an error function: (y_i − f(x_i, w))². Minimize it! (w.r.t. w) There is no closed-form solution to gradient = 0. Hence, stochastic gradient descent: • Compute d/dw (y_i − f(x_i, w))² • Descend
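A bare-bones stochastic gradient descent sketch. For brevity it uses a linear model f(x, w) = w·x, for which d/dw (y_i − f(x_i, w))² = −2(y_i − w·x_i)x_i; a real network would obtain the same per-sample gradient through its layers via backpropagation. The data and learning rate are made up.

```python
# Stochastic gradient descent on the per-sample squared error (linear f for brevity).
import numpy as np

rng = np.random.RandomState(0)
X = np.column_stack([np.ones(200), rng.rand(200)])
y = X @ np.array([1.5, -2.0]) + 0.05 * rng.randn(200)

w = np.zeros(2)
lr = 0.1                                          # step size
for epoch in range(100):
    for i in rng.permutation(len(X)):             # visit samples in random order
        grad = -2.0 * (y[i] - w @ X[i]) * X[i]    # d/dw (y_i - w·x_i)^2
        w -= lr * grad                            # descend
print(w)                                          # ≈ [1.5, -2.0]
```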
Image Colorization (Zhang, Isola, Efros, 2016)
Nonparametric Regression Most ML methods are parametric: • Characterized by setting a few parameters. • y = f ( x, w ) Alternative approach: • Let the data speak for itself. • No finite-sized parameter vector. • Usually more interesting decision boundaries.
Nonparametric Regression What’s the regression equivalent of k-Nearest Neighbors? Given training data: X = {x_1, …, x_n}, Y = {y_1, …, y_n}, and a distance metric D(x_i, x_j). For a new data point x_new: find the k nearest points in X (measured via D) and set y_new to the D-weighted average of their y_i labels.
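A minimal Python sketch of this distance-weighted k-NN regression, assuming Euclidean distance and inverse-distance weights; the function name and toy data are illustrative.

```python
# Distance-weighted k-NN regression sketch (assumed Euclidean D; weights = 1/D).
import numpy as np

def knn_regress(X, Y, x_new, k=3, eps=1e-9):
    dists = np.linalg.norm(X - x_new, axis=1)
    nearest = np.argsort(dists)[:k]                  # k closest training points
    weights = 1.0 / (dists[nearest] + eps)           # closer points count more
    return np.sum(weights * Y[nearest]) / np.sum(weights)

# Toy usage on a noisy sine curve (made-up data)
rng = np.random.RandomState(0)
X = np.sort(rng.rand(50, 1) * 6, axis=0)
Y = np.sin(X).ravel() + 0.1 * rng.randn(50)
print(knn_regress(X, Y, np.array([3.0]), k=5))       # ≈ sin(3.0) ≈ 0.14
```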
Nonparametric Regression As k increases, f gets smoother.
Gaussian Processes
Applications Modeling and predicting variations in pH, clay, and sand content in the topsoil (Gonzalez et al., 2007)