Machine Learning Theory CS 446
1. SVM risk
Consider the empirical and true/population risk of SVM: given f,

\[
R(f) = \mathbb{E}\,\ell(Y f(X)), \qquad \widehat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i f(x_i)),
\]

and furthermore define the excess risk R(f) − R̂(f).

What's going on here? (I just tricked you into caring about theory.)
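As a concrete illustration of these two quantities, here is a minimal numpy sketch (not from the lecture; the hinge loss, the fixed predictor w, and the data distribution are all assumptions made for the example). It computes R̂(f) on a small sample and a large-sample Monte Carlo stand-in for R(f):

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge(z):
    """SVM surrogate loss: l(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

w = np.array([1.0, -0.5])  # a fixed predictor f(x) = w^T x (made up for the example)

def sample(n):
    """Draw n (x, y) pairs from an assumed distribution."""
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1, -1)
    return X, y

# Empirical risk on n = 100 points.
Xn, yn = sample(100)
R_hat = np.mean(hinge(yn * (Xn @ w)))

# Monte Carlo stand-in for the population risk R(f) = E l(Y f(X)).
Xbig, ybig = sample(1_000_000)
R_pop = np.mean(hinge(ybig * (Xbig @ w)))

print(f"R_hat(f) = {R_hat:.3f},  R(f) ~= {R_pop:.3f},  excess risk ~= {R_pop - R_hat:.3f}")
```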
Decomposing excess risk

f̂ is an approximate ERM: R̂(f̂) ≈ min_{f∈F} R̂(f). Let's also define the true/population risk minimizer f̄ := arg min_{f∈F} R(f).

(Question: what is F? Answer: depends on the kernel!)

(Question: is f̄ = arg min_{f∈F} R̂(f)? Answer: no; in general R̂(f̄) ≥ R̂(f̂)!)

Nature labels according to some g (not necessarily inside F!):

\[
\begin{aligned}
R(\hat f) ={}& R(g) && \text{(inherent unpredictability)} \\
&+ R(\bar f) - R(g) && \text{(approximation gap)} \\
&+ \widehat{R}(\bar f) - R(\bar f) && \text{(estimation gap)} \\
&+ \widehat{R}(\hat f) - \widehat{R}(\bar f) && \text{(optimization gap)} \\
&+ R(\hat f) - \widehat{R}(\hat f) && \text{(generalization gap).}
\end{aligned}
\]

Let's go through this step by step.
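A quick sanity check that the decomposition is exact: the right-hand side telescopes, each intermediate risk appearing once with each sign, leaving only R(f̂):

\[
R(g) + \bigl(R(\bar f) - R(g)\bigr) + \bigl(\widehat{R}(\bar f) - R(\bar f)\bigr)
+ \bigl(\widehat{R}(\hat f) - \widehat{R}(\bar f)\bigr) + \bigl(R(\hat f) - \widehat{R}(\hat f)\bigr) = R(\hat f).
\]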
Inherent unpredictability

Nature labels according to some g (not necessarily inside F!):

  R(g)   (inherent unpredictability)

- If g is the function with lowest classification error, we can write down an explicit form: g(x) := sign(Pr[Y = +1 | X = x] − 1/2).
- If g minimizes R with a convex ℓ, we can again write down g pointwise via Pr[Y = +1 | X = x].
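For intuition, here is a small numpy sketch (an assumption-laden example, not from the lecture: the conditional probability η(x) = Pr[Y = +1 | X = x] and the marginal of X are both made up) computing the misclassification-optimal g and a Monte Carlo estimate of its risk R(g):

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Assumed conditional probability Pr[Y = +1 | X = x] (made up for illustration)."""
    return 1.0 / (1.0 + np.exp(-3.0 * x))

def g(x):
    """Misclassification-optimal labeling: g(x) = sign(Pr[Y = +1 | X = x] - 1/2)."""
    return np.where(eta(x) >= 0.5, 1, -1)

# R(g) = E[ min(eta(X), 1 - eta(X)) ]: even the best g errs wherever eta(x) is not 0 or 1.
x = rng.normal(size=1_000_000)  # assume X ~ N(0, 1)
print("inherent unpredictability R(g) ~=", np.mean(np.minimum(eta(x), 1.0 - eta(x))))
```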
Approximation gap

f̄ minimizes R over F, and g is chosen by nature; consider

  R(f̄) − R(g).   (approximation gap)

- We've shown that if R is misclassification risk, F is the class of affine classifiers, and g is quadratic, the gap can be 1/4.
- We can make this gap arbitrarily small if F is: a wide 2-layer network, an RBF kernel SVM, a polynomial classifier of arbitrary degree, ... (a small sketch follows below).
- What is F for SVM?
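The first two bullets can be seen empirically. The sketch below (an illustration using scikit-learn, with a made-up quadratic g and uniform inputs; it is not the construction from lecture) trains an affine classifier and an RBF kernel SVM on data labeled by a quadratic boundary:

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(2000, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 <= 2.0, 1, -1)  # nature's g: a quadratic boundary

Xtr, ytr, Xte, yte = X[:1000], y[:1000], X[1000:], y[1000:]

affine = LinearSVC(C=1.0).fit(Xtr, ytr)          # F = affine classifiers: large gap here
rbf = SVC(kernel="rbf", C=1.0).fit(Xtr, ytr)     # RBF kernel: gap can be driven toward zero

print("affine test error:", np.mean(affine.predict(Xte) != yte))
print("RBF    test error:", np.mean(rbf.predict(Xte) != yte))
```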
Approximation gap

Consider SVM with no kernel. Can we only say

\[
\mathcal{F} := \left\{ x \mapsto w^\top x : w \in \mathbb{R}^d \right\}?
\]

Note, for ŵ := arg min_w R̂(w) + (λ/2)‖w‖²,

\[
\frac{\lambda}{2}\|\hat w\|^2
\;\le\; \widehat{R}(\hat w) + \frac{\lambda}{2}\|\hat w\|^2
\;\le\; \widehat{R}(0) + \frac{\lambda}{2}\|0\|^2
\;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(1 - y_i\, 0^\top x_i\bigr)_+ \;=\; 1,
\]

and so SVM is working with the finer set

\[
\mathcal{F}_\lambda := \left\{ x \mapsto w^\top x : \|w\|^2 \le \tfrac{2}{\lambda} \right\}.
\]
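The norm bound can be checked numerically. A minimal numpy sketch (synthetic data; subgradient descent with the standard 1/(λt) step size is just one way to approximate ŵ, not a method prescribed by the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) + 0.3 * rng.normal(size=n) > 0, 1, -1)

w = np.zeros(d)
for t in range(1, 20_001):
    margins = y * (X @ w)
    # subgradient of the averaged hinge term, plus gradient of (lam/2)||w||^2
    grad = -(X.T @ (y * (margins < 1))) / n + lam * w
    w -= grad / (lam * t)

print("||w_hat||^2 =", w @ w, "   bound 2/lambda =", 2.0 / lam)
```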
Approximation gap

What about kernel SVM? Now working with

\[
\mathcal{F}_k := \left\{ x \mapsto \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) : \alpha \in \mathbb{R}^n \right\},
\]

which is a random function class, since the sample ((x_i, y_i))_{i=1}^n is given by the data!

This function class is called a reproducing kernel Hilbert space (RKHS). We can use it to develop a refined notion F_{k,λ}.

Going forward: we always try to work with the tightest possible function class defined by the data and the algorithm.
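To make the data dependence visible, here is a small numpy sketch (the Gaussian kernel, its bandwidth, and the coefficients α are all arbitrary choices for illustration) that evaluates one member of F_k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(n, d))          # training inputs x_1, ..., x_n (the class depends on them!)
y = rng.choice([-1, 1], size=n)      # training labels
alpha = rng.normal(size=n)           # some coefficient vector alpha in R^n

def k(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2, axis=-1) / (2.0 * sigma ** 2))

def f(x):
    """One element of F_k: f(x) = sum_i alpha_i y_i k(x_i, x)."""
    return float(np.sum(alpha * y * k(X, x)))

print(f(np.zeros(d)))  # redrawing (x_i, y_i) would change the whole function class
```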
Estimation gap

f̄ minimizes R over F.

  R̂(f̄) − R(f̄)   (estimation gap)

- If ((x_i, y_i))_{i=1}^n are drawn IID from the same distribution as the expectation in R, then by the central limit theorem, R̂(f̄) → R(f̄) as n → ∞.
- Next week, we'll discuss high-probability bounds for finite n.
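Here is a numpy sketch of the first bullet (the fixed predictor and the distribution are made up, and the "population" risk is itself approximated by a very large sample): the fluctuation of R̂(f̄) around R(f̄) shrinks roughly like 1/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
w_bar = np.array([1.0, -1.0])  # a FIXED predictor, chosen before seeing any data

def sample(n):
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] + 0.5 * rng.normal(size=n) > 0, 1, -1)
    return X, y

def emp_risk(X, y):
    """Empirical hinge risk of the fixed predictor w_bar."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w_bar)))

R_pop = emp_risk(*sample(2_000_000))  # stand-in for the population risk R(f_bar)
for n in (100, 1_000, 10_000, 100_000):
    gaps = [abs(emp_risk(*sample(n)) - R_pop) for _ in range(20)]
    print(f"n = {n:>6}:  typical |R_hat(f_bar) - R(f_bar)| ~= {np.mean(gaps):.4f}")
```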
Optimization gap

f̂ ∈ F minimizes R̂, and f̄ ∈ F minimizes R.

  R̂(f̂) − R̂(f̄)   (optimization gap)

- This is algorithmic: we reduce this number by optimizing better.
- We've advocated the use of gradient descent.
- Many of these problems are NP-hard even in trivial cases. (Learning a linear separator with noise, and learning a 3-node neural net, are NP-hard.)
- If R̂ uses a convex loss and f̂ makes at least one training mistake, relating R̂ to test-set misclassifications can be painful.

Specifically considering SVM:

- This is a convex optimization problem.
- We can solve it in many ways (primal, dual, projected gradient descent, coordinate descent, Newton, etc.); it doesn't really matter so long as we end up close: the primal solution is unique. (A small sketch follows below.)
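A toy illustration that the gap is purely algorithmic (synthetic data; plain subgradient descent is just one of the many valid solvers, and the long run is only a stand-in for the true minimum): running the optimizer longer shrinks the SVM training objective toward its minimum, i.e. shrinks the optimization gap.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 300, 10, 0.05
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) > 0, 1, -1)

def obj(w):
    """SVM training objective: empirical hinge risk plus (lam/2)||w||^2."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)

def subgradient_descent(steps):
    w = np.zeros(d)
    for t in range(1, steps + 1):
        g = -(X.T @ (y * (y * (X @ w) < 1))) / n + lam * w
        w -= g / (lam * t)
    return w

ref = obj(subgradient_descent(200_000))  # long run: a stand-in for the true minimum value
for steps in (100, 1_000, 10_000):
    print(f"{steps:>6} steps: optimization gap ~= {obj(subgradient_descent(steps)) - ref:.5f}")
```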
Generalization

f̂ is returned by ERM.

  R(f̂) − R̂(f̂)

This quantity is the excess risk; when it is small, we say we generalize, otherwise we overfit.

- Before, we said "by the CLT, R̂(f̄) → R(f̄) as n → ∞". Is this quantity the same?
- No! f̂ is a random variable!
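A contrived numpy sketch of the last point (pure-noise labels and a memorizing "classifier", both made up to make the effect stark): for a fixed f the gap would shrink like 1/√n, but for this data-dependent f̂ it never shrinks at all.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (100, 1_000, 10_000):
    X = rng.normal(size=n)
    y = rng.choice([-1, 1], size=n)  # labels are pure noise: R(f) = 1/2 for EVERY f

    # "ERM" over a huge class: memorize the training labels, predict +1 elsewhere.
    table = dict(zip(X.tolist(), y.tolist()))
    def f_hat(x):
        return table.get(x, 1)

    train_err = np.mean([f_hat(x) != label for x, label in zip(X.tolist(), y.tolist())])
    X_te = rng.normal(size=n)
    y_te = rng.choice([-1, 1], size=n)
    test_err = np.mean([f_hat(x) != label for x, label in zip(X_te.tolist(), y_te.tolist())])
    print(f"n = {n:>6}:  R_hat(f_hat) = {train_err:.2f},  R(f_hat) ~= {test_err:.2f}")
```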