Machine Learning (CSE 446): Learning as Minimizing Loss; Least Squares
Sham M. Kakade © 2018 University of Washington cse446-staff@cs.washington.edu
Review
Alternate View of PCA: Minimizing Reconstruction Error
Assume that the data are centered. Find a line which minimizes the squared reconstruction error.
Alternate View: Minimizing Reconstruction Error with a K-dim subspace
Equivalent (“dual”) formulation of PCA: find an “orthonormal basis” u_1, u_2, ..., u_K which minimizes the total reconstruction error on the data:

\operatorname*{argmin}_{\text{orthonormal basis } u_1, \dots, u_K} \; \frac{1}{N} \sum_{i} \big\| x_i - \mathrm{Proj}_{u_1, \dots, u_K}(x_i) \big\|^2

Recall that the projection of x onto a K-orthonormal basis is:

\mathrm{Proj}_{u_1, \dots, u_K}(x) = \sum_{j=1}^{K} (u_j \cdot x)\, u_j

The SVD “simultaneously” finds all of u_1, u_2, ..., u_K.
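As a concrete illustration (not from the slides; the synthetic data and numpy usage are my own assumptions), here is a minimal sketch of the K-dimensional statement: take the SVD of the centered data matrix, keep the top-K right singular vectors as the orthonormal basis, and measure the average squared reconstruction error.

```python
# Minimal sketch: K-dim PCA reconstruction error via the SVD (numpy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # 200 synthetic points in 5 dimensions
X = X - X.mean(axis=0)               # center the data, as the slide assumes

K = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_K = Vt[:K]                         # rows are the orthonormal u_1, ..., u_K

proj = X @ V_K.T @ V_K               # Proj_{u_1..u_K}(x_i) for every row x_i
recon_error = np.mean(np.sum((X - proj) ** 2, axis=1))
print(recon_error)                   # same value as np.sum(S[K:] ** 2) / len(X)
```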
Projection and Reconstruction: the one-dimensional case
◮ Take out the mean µ: work with the centered data X = [\tilde{x}_1 \mid \tilde{x}_2 \mid \cdots \mid \tilde{x}_N]^\top, where \tilde{x}_i = x_i - \mu.
◮ Find the “top” eigenvector u of the covariance matrix.
◮ What are your projections?
◮ What are your reconstructions \hat{x}_i?
◮ What is your reconstruction error from doing nothing (K = 0) and from using K = 1? For K = 0 the reconstruction is just the mean, so the error is

\frac{1}{N} \sum_{i} \| x_i - \hat{x}_i \|^2 = \frac{1}{N} \sum_{i} \| x_i - \mu \|^2

◮ Reduction in error by using a k-dim PCA projection:
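The same idea for the one-dimensional case above, again as a minimal sketch on toy data of my own: find the top eigenvector of the covariance, project, reconstruct, and compare the K = 0 and K = 1 errors.

```python
# Minimal sketch: 1-D projection/reconstruction with the top covariance eigenvector.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([3.0, -1.0], [[2.0, 1.2], [1.2, 1.0]], size=500)

mu = X.mean(axis=0)
Xc = X - mu                                    # "take out the mean"
cov = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
u = eigvecs[:, -1]                             # top eigenvector

scores = Xc @ u                                # projections: one number per point
recon = mu + np.outer(scores, u)               # reconstructions for K = 1

err_K0 = np.mean(np.sum((X - mu) ** 2, axis=1))      # do nothing: reconstruct with the mean
err_K1 = np.mean(np.sum((X - recon) ** 2, axis=1))   # use the 1-D projection
print(err_K0, err_K1, err_K0 - err_K1)         # reduction is (about) the top eigenvalue
```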
PCA vs. Clustering
Summarize your data with fewer points or fewer dimensions?
Loss functions
Today
Perceptron
The Perceptron Algorithm: a model and an algorithm, rolled into one. Isn’t there a more principled methodology for deriving algorithms?
What we (“naively”) want
“Minimize the training-set error rate”:

\min_{w, b} \; \frac{1}{N} \sum_{n=1}^{N} \underbrace{\mathbf{1}\big[\, y_n (w \cdot x_n + b) \le 0 \,\big]}_{\text{zero-one loss on a point } n}

where the margin is y \cdot (w \cdot x + b).
This problem is NP-hard, even for a (multiplicative) approximation. Why is this loss function so unwieldy?
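To make the objective concrete, here is a minimal sketch (toy data of my own) that just evaluates the zero-one training loss for a fixed (w, b). Evaluating it is easy; the hardness is in the minimization over (w, b), since the loss is piecewise constant and gives no gradient signal to follow.

```python
# Minimal sketch: the zero-one training loss for a fixed linear classifier (w, b).
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([+1, +1, -1, -1])                 # labels in {-1, +1}

def zero_one_loss(w, b):
    margins = y * (X @ w + b)                  # margin of each training point
    return np.mean(margins <= 0.0)             # fraction of points misclassified

print(zero_one_loss(np.array([1.0, 1.0]), 0.0))
```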
Relax!
◮ The misclassification optimization problem:

\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\big[\, y_n (w \cdot x_n) \le 0 \,\big]

◮ Instead, let’s try to choose a “reasonable” loss function \ell(y_n, w \cdot x_n) and then solve the relaxation (sketched below):

\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)
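The sketch referenced above (the helper and loss names are mine, not the course's): the relaxation template just averages whatever surrogate loss we choose over the training set.

```python
# Minimal sketch: empirical risk with a pluggable surrogate loss.
import numpy as np

def empirical_risk(loss, w, X, y):
    """(1/N) * sum_n loss(y_n, w·x_n) for a surrogate loss of our choosing."""
    return np.mean([loss(y_n, x_n @ w) for x_n, y_n in zip(X, y)])

square_loss = lambda y, t: (y - t) ** 2
logistic_loss = lambda y, t: np.log1p(np.exp(-y * t))

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -2.0]])
y = np.array([+1, -1, -1])
w = np.array([0.3, -0.1])
print(empirical_risk(square_loss, w, X, y), empirical_risk(logistic_loss, w, X, y))
```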
What is a good “relaxation”?
◮ We want minimizing our surrogate loss to help with minimizing the misclassification loss.
◮ Idea: try to use a (sharp) upper bound of the zero-one loss by \ell:

\mathbf{1}\big[\, y (w \cdot x) \le 0 \,\big] \le \ell(y, w \cdot x)

◮ We want our relaxed optimization problem to be easy to solve.
What properties might we want for \ell(\cdot)?
◮ Differentiable? Sensitive to changes in w?
◮ Convex?
The square loss! (and linear regression)
◮ The square loss: \ell(y, w \cdot x) = (y - w \cdot x)^2.
◮ The relaxed optimization problem:

\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2

◮ Nice properties:
◮ For binary classification, it is an upper bound on the zero-one loss.
◮ It makes sense more generally, e.g. if we want to predict real-valued y.
◮ We have a convex optimization problem.
◮ For classification, what is your decision rule using a w?
The square loss as an upper bound
◮ We have:

\mathbf{1}\big[\, y (w \cdot x) \le 0 \,\big] \le (y - w \cdot x)^2

◮ Easy to see by plotting (a sketch follows below):
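The plot referenced above, as a minimal sketch (matplotlib assumed available): for labels y in {-1, +1}, write z = y(w · x); then (y - w · x)^2 = (1 - z)^2, and the curve (1 - z)^2 sits above the zero-one loss for every z.

```python
# Minimal sketch: plot the zero-one loss and the square loss against the margin z.
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 600)
plt.plot(z, (z <= 0).astype(float), label="zero-one loss")
plt.plot(z, (1 - z) ** 2, label="square loss (1 - z)^2")
plt.xlabel("margin z = y(w · x)")
plt.ylim(-0.1, 4)
plt.legend()
plt.show()
```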
Remember this problem?
Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG
mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin
Input: a row in this table. Goal: predict whether mpg is < 23 (“bad” = 0) or above (“good” = 1) given the input row.
Predicting a real y (often) makes more sense.
A better (convex) upper bound
◮ The logistic loss: \ell_{\text{logistic}}(y, w \cdot x) = \log\big(1 + \exp(-y \, w \cdot x)\big).
◮ We have:

\mathbf{1}\big[\, y (w \cdot x) \le 0 \,\big] \le \text{constant} \cdot \ell_{\text{logistic}}(y, w \cdot x)

◮ Again, easy to see by plotting (see the check below):
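The check referenced above, as a minimal sketch: with z = y(w · x), the logistic loss log(1 + exp(-z)) is at least log 2 whenever z ≤ 0, so one valid choice of the constant is 1/log 2 ≈ 1.44 (my reading; the slide does not pin the constant down).

```python
# Minimal sketch: numerically verify the scaled logistic loss dominates the zero-one loss.
import numpy as np

z = np.linspace(-5, 5, 1001)
zero_one = (z <= 0).astype(float)
logistic = np.log1p(np.exp(-z))               # log(1 + exp(-z))
assert np.all(zero_one <= logistic / np.log(2) + 1e-12)
print("bound holds on this grid")
```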
Least squares: let’s minimize it!
◮ The optimization problem:

\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_{w} \; \frac{1}{N} \| Y - X w \|^2

where Y is an N-vector and X is our N \times d data matrix.
◮ How do we interpret X w?
The solution is the least squares estimator:

w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y
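A minimal sketch (synthetic data of my own) of computing the estimator. It solves the normal equations X^\top X w = X^\top Y with np.linalg.solve rather than forming the inverse explicitly, which computes the same quantity the formula describes but is the more stable way to do it; np.linalg.lstsq gives the same answer.

```python
# Minimal sketch: the least squares estimator on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 4
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X w = X^T Y instead of computing (X^T X)^{-1} directly.
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# The library routine agrees (up to numerical error).
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(w_ls, w_lstsq), w_ls)
```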
Matrix calculus proof: scratch space
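A sketch of the derivation this scratch space is meant for (standard matrix calculus, not transcribed from the lecture): expand the squared norm and differentiate with respect to w,

\nabla_w \, \| Y - X w \|^2
  = \nabla_w \big( Y^\top Y - 2\, Y^\top X w + w^\top X^\top X w \big)
  = -2\, X^\top Y + 2\, X^\top X w .

Setting the gradient to zero gives the normal equations X^\top X \, w = X^\top Y, and when X^\top X is invertible this yields w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y, matching the previous slide.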
Remember your linear system solving!
Lots of questions:
◮ What could go wrong with least squares?
◮ Suppose we are in “high dimensions”: more dimensions than data points.
◮ Inductive bias: we need a way to control the complexity of the model.
◮ How do we minimize the (sum of the) logistic loss?
◮ Optimization: how do we do this all quickly?