Machine Learning Theory CS 446
1. SVM risk
Consider the empirical and true/population risk of SVM: given f,

\[
R(f) = \mathbb{E}\,\ell(Y f(X)), \qquad \widehat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i f(x_i)),
\]

and furthermore define the excess risk R(f) − R̂(f).

What's going on here? (I just tricked you into caring about theory.)
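As a concrete illustration of these two quantities, here is a minimal numpy sketch (not from the lecture; the hinge loss, the fixed predictor w, and the data distribution are all assumptions made for the example). It computes R̂(f) on a small sample and a large-sample Monte Carlo stand-in for R(f):

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge(z):
    """SVM surrogate loss: l(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

w = np.array([1.0, -0.5])  # a fixed predictor f(x) = w^T x (made up for the example)

def sample(n):
    """Draw n (x, y) pairs from an assumed distribution."""
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1, -1)
    return X, y

# Empirical risk on n = 100 points.
Xn, yn = sample(100)
R_hat = np.mean(hinge(yn * (Xn @ w)))

# Monte Carlo stand-in for the population risk R(f) = E l(Y f(X)).
Xbig, ybig = sample(1_000_000)
R_pop = np.mean(hinge(ybig * (Xbig @ w)))

print(f"R_hat(f) = {R_hat:.3f},  R(f) ~= {R_pop:.3f},  excess risk ~= {R_pop - R_hat:.3f}")
```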
Decomposing excess risk

f̂ is an approximate ERM: R̂(f̂) ≈ min_{f∈F} R̂(f). Let's also define the true/population risk minimizer f̄ := arg min_{f∈F} R(f).

(Question: what is F? Answer: depends on the kernel!)

(Question: is f̄ = arg min_{f∈F} R̂(f)? Answer: no; in general R̂(f̄) ≥ R̂(f̂)!)

Nature labels according to some g (not necessarily inside F!):

\[
\begin{aligned}
R(\hat f) ={}& R(g) && \text{(inherent unpredictability)} \\
&+ R(\bar f) - R(g) && \text{(approximation gap)} \\
&+ \widehat{R}(\bar f) - R(\bar f) && \text{(estimation gap)} \\
&+ \widehat{R}(\hat f) - \widehat{R}(\bar f) && \text{(optimization gap)} \\
&+ R(\hat f) - \widehat{R}(\hat f) && \text{(generalization gap).}
\end{aligned}
\]

Let's go through this step by step.
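A quick sanity check that the decomposition is exact: the right-hand side telescopes, each intermediate risk appearing once with each sign, leaving only R(f̂):

\[
R(g) + \bigl(R(\bar f) - R(g)\bigr) + \bigl(\widehat{R}(\bar f) - R(\bar f)\bigr)
+ \bigl(\widehat{R}(\hat f) - \widehat{R}(\bar f)\bigr) + \bigl(R(\hat f) - \widehat{R}(\hat f)\bigr) = R(\hat f).
\]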
Inherent unpredictability

Nature labels according to some g (not necessarily inside F!):

  R(g)   (inherent unpredictability)

- If g is the function with lowest classification error, we can write down an explicit form: g(x) := sign(Pr[Y = +1 | X = x] − 1/2).
- If g minimizes R with a convex ℓ, we can again write down g pointwise via Pr[Y = +1 | X = x].
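For intuition, here is a small numpy sketch (an assumption-laden example, not from the lecture: the conditional probability η(x) = Pr[Y = +1 | X = x] and the marginal of X are both made up) computing the misclassification-optimal g and a Monte Carlo estimate of its risk R(g):

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Assumed conditional probability Pr[Y = +1 | X = x] (made up for illustration)."""
    return 1.0 / (1.0 + np.exp(-3.0 * x))

def g(x):
    """Misclassification-optimal labeling: g(x) = sign(Pr[Y = +1 | X = x] - 1/2)."""
    return np.where(eta(x) >= 0.5, 1, -1)

# R(g) = E[ min(eta(X), 1 - eta(X)) ]: even the best g errs wherever eta(x) is not 0 or 1.
x = rng.normal(size=1_000_000)  # assume X ~ N(0, 1)
print("inherent unpredictability R(g) ~=", np.mean(np.minimum(eta(x), 1.0 - eta(x))))
```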
Approximation gap

f̄ minimizes R over F, and g is chosen by nature; consider

  R(f̄) − R(g).   (approximation gap)

- We've shown that if R is misclassification risk, F is the class of affine classifiers, and g is quadratic, the gap can be 1/4.
- We can make this gap arbitrarily small if F is: a wide 2-layer network, an RBF kernel SVM, a polynomial classifier of arbitrary degree, ... (a small sketch follows below).
- What is F for SVM?
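The first two bullets can be seen empirically. The sketch below (an illustration using scikit-learn, with a made-up quadratic g and uniform inputs; it is not the construction from lecture) trains an affine classifier and an RBF kernel SVM on data labeled by a quadratic boundary:

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(2000, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 <= 2.0, 1, -1)  # nature's g: a quadratic boundary

Xtr, ytr, Xte, yte = X[:1000], y[:1000], X[1000:], y[1000:]

affine = LinearSVC(C=1.0).fit(Xtr, ytr)          # F = affine classifiers: large gap here
rbf = SVC(kernel="rbf", C=1.0).fit(Xtr, ytr)     # RBF kernel: gap can be driven toward zero

print("affine test error:", np.mean(affine.predict(Xte) != yte))
print("RBF    test error:", np.mean(rbf.predict(Xte) != yte))
```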
Approximation gap

Consider SVM with no kernel. Can we only say

\[
\mathcal{F} := \left\{ x \mapsto w^\top x : w \in \mathbb{R}^d \right\}?
\]

Note, for ŵ := arg min_w R̂(w) + (λ/2)‖w‖²,

\[
\frac{\lambda}{2}\|\hat w\|^2
\;\le\; \widehat{R}(\hat w) + \frac{\lambda}{2}\|\hat w\|^2
\;\le\; \widehat{R}(0) + \frac{\lambda}{2}\|0\|^2
\;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(1 - y_i\, 0^\top x_i\bigr)_+ \;=\; 1,
\]

and so SVM is working with the finer set

\[
\mathcal{F}_\lambda := \left\{ x \mapsto w^\top x : \|w\|^2 \le \tfrac{2}{\lambda} \right\}.
\]
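The norm bound can be checked numerically. A minimal numpy sketch (synthetic data; subgradient descent with the standard 1/(λt) step size is just one way to approximate ŵ, not a method prescribed by the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) + 0.3 * rng.normal(size=n) > 0, 1, -1)

w = np.zeros(d)
for t in range(1, 20_001):
    margins = y * (X @ w)
    # subgradient of the averaged hinge term, plus gradient of (lam/2)||w||^2
    grad = -(X.T @ (y * (margins < 1))) / n + lam * w
    w -= grad / (lam * t)

print("||w_hat||^2 =", w @ w, "   bound 2/lambda =", 2.0 / lam)
```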
Approximation gap

What about kernel SVM? Now working with

\[
\mathcal{F}_k := \left\{ x \mapsto \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) : \alpha \in \mathbb{R}^n \right\},
\]

which is a random function class, since the sample ((x_i, y_i))_{i=1}^n is given by the data!

This function class is called a reproducing kernel Hilbert space (RKHS). We can use it to develop a refined notion F_{k,λ}.

Going forward: we always try to work with the tightest possible function class defined by the data and the algorithm.
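To make the data dependence visible, here is a small numpy sketch (the Gaussian kernel, its bandwidth, and the coefficients α are all arbitrary choices for illustration) that evaluates one member of F_k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(n, d))          # training inputs x_1, ..., x_n (the class depends on them!)
y = rng.choice([-1, 1], size=n)      # training labels
alpha = rng.normal(size=n)           # some coefficient vector alpha in R^n

def k(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2, axis=-1) / (2.0 * sigma ** 2))

def f(x):
    """One element of F_k: f(x) = sum_i alpha_i y_i k(x_i, x)."""
    return float(np.sum(alpha * y * k(X, x)))

print(f(np.zeros(d)))  # redrawing (x_i, y_i) would change the whole function class
```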
Estimation gap

f̄ minimizes R over F.

  R̂(f̄) − R(f̄)   (estimation gap)

- If ((x_i, y_i))_{i=1}^n are drawn IID from the same distribution as the expectation in R, then by the central limit theorem, R̂(f̄) → R(f̄) as n → ∞.
- Next week, we'll discuss high-probability bounds for finite n.
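Here is a numpy sketch of the first bullet (the fixed predictor and the distribution are made up, and the "population" risk is itself approximated by a very large sample): the fluctuation of R̂(f̄) around R(f̄) shrinks roughly like 1/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
w_bar = np.array([1.0, -1.0])  # a FIXED predictor, chosen before seeing any data

def sample(n):
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] + 0.5 * rng.normal(size=n) > 0, 1, -1)
    return X, y

def emp_risk(X, y):
    """Empirical hinge risk of the fixed predictor w_bar."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w_bar)))

R_pop = emp_risk(*sample(2_000_000))  # stand-in for the population risk R(f_bar)
for n in (100, 1_000, 10_000, 100_000):
    gaps = [abs(emp_risk(*sample(n)) - R_pop) for _ in range(20)]
    print(f"n = {n:>6}:  typical |R_hat(f_bar) - R(f_bar)| ~= {np.mean(gaps):.4f}")
```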
Optimization gap

f̂ ∈ F minimizes R̂, and f̄ ∈ F minimizes R.

  R̂(f̂) − R̂(f̄)   (optimization gap)

- This is algorithmic: we reduce this number by optimizing better.
- We've advocated the use of gradient descent.
- Many of these problems are NP-hard even in trivial cases. (Learning a linear separator with noise, and learning a 3-node neural net, are NP-hard.)
- If R̂ uses a convex loss and f̂ makes at least one training mistake, relating R̂ to test-set misclassifications can be painful.

Specifically considering SVM:

- This is a convex optimization problem.
- We can solve it in many ways (primal, dual, projected gradient descent, coordinate descent, Newton, etc.); it doesn't really matter so long as we end up close: the primal solution is unique. (A small sketch follows below.)
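A toy illustration that the gap is purely algorithmic (synthetic data; plain subgradient descent is just one of the many valid solvers, and the long run is only a stand-in for the true minimum): running the optimizer longer shrinks the SVM training objective toward its minimum, i.e. shrinks the optimization gap.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 300, 10, 0.05
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) > 0, 1, -1)

def obj(w):
    """SVM training objective: empirical hinge risk plus (lam/2)||w||^2."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)

def subgradient_descent(steps):
    w = np.zeros(d)
    for t in range(1, steps + 1):
        g = -(X.T @ (y * (y * (X @ w) < 1))) / n + lam * w
        w -= g / (lam * t)
    return w

ref = obj(subgradient_descent(200_000))  # long run: a stand-in for the true minimum value
for steps in (100, 1_000, 10_000):
    print(f"{steps:>6} steps: optimization gap ~= {obj(subgradient_descent(steps)) - ref:.5f}")
```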
Generalization

f̂ is returned by ERM.

  R(f̂) − R̂(f̂)

This quantity is the excess risk; when it is small, we say we generalize, otherwise we overfit.

- Before, we said "by the CLT, R̂(f̄) → R(f̄) as n → ∞". Is this quantity the same?
- No! f̂ is a random variable!
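A contrived numpy sketch of the last point (pure-noise labels and a memorizing "classifier", both made up to make the effect stark): for a fixed f the gap would shrink like 1/√n, but for this data-dependent f̂ it never shrinks at all.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (100, 1_000, 10_000):
    X = rng.normal(size=n)
    y = rng.choice([-1, 1], size=n)  # labels are pure noise: R(f) = 1/2 for EVERY f

    # "ERM" over a huge class: memorize the training labels, predict +1 elsewhere.
    table = dict(zip(X.tolist(), y.tolist()))
    def f_hat(x):
        return table.get(x, 1)

    train_err = np.mean([f_hat(x) != label for x, label in zip(X.tolist(), y.tolist())])
    X_te = rng.normal(size=n)
    y_te = rng.choice([-1, 1], size=n)
    test_err = np.mean([f_hat(x) != label for x, label in zip(X_te.tolist(), y_te.tolist())])
    print(f"n = {n:>6}:  R_hat(f_hat) = {train_err:.2f},  R(f_hat) ~= {test_err:.2f}")
```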