Machine Learning (CSE 446): Learning as Minimizing Loss; Least Squares
Sham M. Kakade © 2018 University of Washington cse446-staff@cs.washington.edu
Review
Alternate View of PCA: Minimizing Reconstruction Error
Assume that the data are centered. Find a line which minimizes the squared reconstruction error.
Alternate View: Minimizing Reconstruction Error with a K-dim subspace
Equivalent (“dual”) formulation of PCA: find an “orthonormal basis” u_1, u_2, ..., u_K which minimizes the total reconstruction error on the data:

\operatorname*{argmin}_{\text{orthonormal basis } u_1, \dots, u_K} \; \frac{1}{N} \sum_{i} \big\| x_i - \mathrm{Proj}_{u_1, \dots, u_K}(x_i) \big\|^2

Recall that the projection of x onto a K-orthonormal basis is:

\mathrm{Proj}_{u_1, \dots, u_K}(x) = \sum_{j=1}^{K} (u_j \cdot x)\, u_j

The SVD “simultaneously” finds all of u_1, u_2, ..., u_K.
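As a concrete illustration (not from the slides; the synthetic data and numpy usage are my own assumptions), here is a minimal sketch of the K-dimensional statement: take the SVD of the centered data matrix, keep the top-K right singular vectors as the orthonormal basis, and measure the average squared reconstruction error.

```python
# Minimal sketch: K-dim PCA reconstruction error via the SVD (numpy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # 200 synthetic points in 5 dimensions
X = X - X.mean(axis=0)               # center the data, as the slide assumes

K = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_K = Vt[:K]                         # rows are the orthonormal u_1, ..., u_K

proj = X @ V_K.T @ V_K               # Proj_{u_1..u_K}(x_i) for every row x_i
recon_error = np.mean(np.sum((X - proj) ** 2, axis=1))
print(recon_error)                   # same value as np.sum(S[K:] ** 2) / len(X)
```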
Projection and Reconstruction: the one-dimensional case
◮ Take out the mean µ: work with the centered data X = [\tilde{x}_1 \mid \tilde{x}_2 \mid \cdots \mid \tilde{x}_N]^\top, where \tilde{x}_i = x_i - \mu.
◮ Find the “top” eigenvector u of the covariance matrix.
◮ What are your projections?
◮ What are your reconstructions \hat{x}_i?
◮ What is your reconstruction error from doing nothing (K = 0) and from using K = 1? For K = 0 the reconstruction is just the mean, so the error is

\frac{1}{N} \sum_{i} \| x_i - \hat{x}_i \|^2 = \frac{1}{N} \sum_{i} \| x_i - \mu \|^2

◮ Reduction in error by using a k-dim PCA projection:
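The same idea for the one-dimensional case above, again as a minimal sketch on toy data of my own: find the top eigenvector of the covariance, project, reconstruct, and compare the K = 0 and K = 1 errors.

```python
# Minimal sketch: 1-D projection/reconstruction with the top covariance eigenvector.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([3.0, -1.0], [[2.0, 1.2], [1.2, 1.0]], size=500)

mu = X.mean(axis=0)
Xc = X - mu                                    # "take out the mean"
cov = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
u = eigvecs[:, -1]                             # top eigenvector

scores = Xc @ u                                # projections: one number per point
recon = mu + np.outer(scores, u)               # reconstructions for K = 1

err_K0 = np.mean(np.sum((X - mu) ** 2, axis=1))      # do nothing: reconstruct with the mean
err_K1 = np.mean(np.sum((X - recon) ** 2, axis=1))   # use the 1-D projection
print(err_K0, err_K1, err_K0 - err_K1)         # reduction is (about) the top eigenvalue
```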
PCA vs. Clustering
Summarize your data with fewer points or fewer dimensions?
Loss functions
Today
Perceptron
The Perceptron Algorithm: a model and an algorithm, rolled into one. Isn’t there a more principled methodology for deriving algorithms?
What we (“naively”) want
“Minimize the training-set error rate”:

\min_{w, b} \; \frac{1}{N} \sum_{n=1}^{N} \underbrace{\mathbf{1}\big[\, y_n (w \cdot x_n + b) \le 0 \,\big]}_{\text{zero-one loss on a point } n}

where the margin is y \cdot (w \cdot x + b).
This problem is NP-hard, even for a (multiplicative) approximation. Why is this loss function so unwieldy?
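To make the objective concrete, here is a minimal sketch (toy data of my own) that just evaluates the zero-one training loss for a fixed (w, b). Evaluating it is easy; the hardness is in the minimization over (w, b), since the loss is piecewise constant and gives no gradient signal to follow.

```python
# Minimal sketch: the zero-one training loss for a fixed linear classifier (w, b).
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([+1, +1, -1, -1])                 # labels in {-1, +1}

def zero_one_loss(w, b):
    margins = y * (X @ w + b)                  # margin of each training point
    return np.mean(margins <= 0.0)             # fraction of points misclassified

print(zero_one_loss(np.array([1.0, 1.0]), 0.0))
```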
Relax!
◮ The misclassification optimization problem:

\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\big[\, y_n (w \cdot x_n) \le 0 \,\big]

◮ Instead, let’s try to choose a “reasonable” loss function \ell(y_n, w \cdot x_n) and then solve the relaxation (sketched below):

\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)
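The sketch referenced above (the helper and loss names are mine, not the course's): the relaxation template just averages whatever surrogate loss we choose over the training set.

```python
# Minimal sketch: empirical risk with a pluggable surrogate loss.
import numpy as np

def empirical_risk(loss, w, X, y):
    """(1/N) * sum_n loss(y_n, w·x_n) for a surrogate loss of our choosing."""
    return np.mean([loss(y_n, x_n @ w) for x_n, y_n in zip(X, y)])

square_loss = lambda y, t: (y - t) ** 2
logistic_loss = lambda y, t: np.log1p(np.exp(-y * t))

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -2.0]])
y = np.array([+1, -1, -1])
w = np.array([0.3, -0.1])
print(empirical_risk(square_loss, w, X, y), empirical_risk(logistic_loss, w, X, y))
```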
What is a good “relaxation”?
◮ We want minimizing our surrogate loss to help with minimizing the misclassification loss.
◮ Idea: try to use a (sharp) upper bound of the zero-one loss by \ell:

\mathbf{1}\big[\, y (w \cdot x) \le 0 \,\big] \le \ell(y, w \cdot x)

◮ We want our relaxed optimization problem to be easy to solve.
What properties might we want for \ell(\cdot)?
◮ Differentiable? Sensitive to changes in w?
◮ Convex?
The square loss! (and linear regression)
◮ The square loss: \ell(y, w \cdot x) = (y - w \cdot x)^2.
◮ The relaxed optimization problem:

\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2

◮ Nice properties:
◮ For binary classification, it is an upper bound on the zero-one loss.
◮ It makes sense more generally, e.g. if we want to predict real-valued y.
◮ We have a convex optimization problem.
◮ For classification, what is your decision rule using a w?
The square loss as an upper bound
◮ We have:

\mathbf{1}\big[\, y (w \cdot x) \le 0 \,\big] \le (y - w \cdot x)^2

◮ Easy to see by plotting (a sketch follows below):
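The plot referenced above, as a minimal sketch (matplotlib assumed available): for labels y in {-1, +1}, write z = y(w · x); then (y - w · x)^2 = (1 - z)^2, and the curve (1 - z)^2 sits above the zero-one loss for every z.

```python
# Minimal sketch: plot the zero-one loss and the square loss against the margin z.
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 600)
plt.plot(z, (z <= 0).astype(float), label="zero-one loss")
plt.plot(z, (1 - z) ** 2, label="square loss (1 - z)^2")
plt.xlabel("margin z = y(w · x)")
plt.ylim(-0.1, 4)
plt.legend()
plt.show()
```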
Remember this problem?
Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG
mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin
Input: a row in this table. Goal: predict whether mpg is < 23 (“bad” = 0) or above (“good” = 1) given the input row.
Predicting a real y (often) makes more sense.
A better (convex) upper bound
◮ The logistic loss: \ell_{\text{logistic}}(y, w \cdot x) = \log\big(1 + \exp(-y \, w \cdot x)\big).
◮ We have:

\mathbf{1}\big[\, y (w \cdot x) \le 0 \,\big] \le \text{constant} \cdot \ell_{\text{logistic}}(y, w \cdot x)

◮ Again, easy to see by plotting (see the check below):
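The check referenced above, as a minimal sketch: with z = y(w · x), the logistic loss log(1 + exp(-z)) is at least log 2 whenever z ≤ 0, so one valid choice of the constant is 1/log 2 ≈ 1.44 (my reading; the slide does not pin the constant down).

```python
# Minimal sketch: numerically verify the scaled logistic loss dominates the zero-one loss.
import numpy as np

z = np.linspace(-5, 5, 1001)
zero_one = (z <= 0).astype(float)
logistic = np.log1p(np.exp(-z))               # log(1 + exp(-z))
assert np.all(zero_one <= logistic / np.log(2) + 1e-12)
print("bound holds on this grid")
```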
Least squares: let’s minimize it!
◮ The optimization problem:

\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_{w} \; \frac{1}{N} \| Y - X w \|^2

where Y is an N-vector and X is our N \times d data matrix.
◮ How do we interpret X w?
The solution is the least squares estimator:

w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y
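A minimal sketch (synthetic data of my own) of computing the estimator. It solves the normal equations X^\top X w = X^\top Y with np.linalg.solve rather than forming the inverse explicitly, which computes the same quantity the formula describes but is the more stable way to do it; np.linalg.lstsq gives the same answer.

```python
# Minimal sketch: the least squares estimator on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 4
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X w = X^T Y instead of computing (X^T X)^{-1} directly.
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# The library routine agrees (up to numerical error).
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(w_ls, w_lstsq), w_ls)
```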
Matrix calculus proof: scratch space
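A sketch of the derivation this scratch space is meant for (standard matrix calculus, not transcribed from the lecture): expand the squared norm and differentiate with respect to w,

\nabla_w \, \| Y - X w \|^2
  = \nabla_w \big( Y^\top Y - 2\, Y^\top X w + w^\top X^\top X w \big)
  = -2\, X^\top Y + 2\, X^\top X w .

Setting the gradient to zero gives the normal equations X^\top X \, w = X^\top Y, and when X^\top X is invertible this yields w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y, matching the previous slide.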
Remember your linear system solving!
Lots of questions:
◮ What could go wrong with least squares?
◮ Suppose we are in “high dimensions”: more dimensions than data points.
◮ Inductive bias: we need a way to control the complexity of the model.
◮ How do we minimize the (sum of the) logistic loss?
◮ Optimization: how do we do this all quickly?