  1. Machine Learning Theory CS 446

  2. 1. SVM risk

  3. SVM risk
     [Figure: misclassification rate (0.0–0.6) versus C (10⁻¹ to 10¹, log scale); training ("tr") and test ("te") curves for the aff, quad, poly10, rbf1, and rbf01 kernels.]
     Consider the empirical and true/population risk of SVM: given f,
         R(f) = E ℓ(Y f(X)),        R̂(f) = (1/n) Σ_{i=1}^n ℓ(y_i f(x_i)),
     and furthermore define the excess risk R(f) − R̂(f).
     What's going on here? (I just tricked you into caring about theory.)
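
A minimal sketch of how a plot like this can be produced, assuming scikit-learn and a synthetic dataset; the data, kernel settings, and C grid below are illustrative stand-ins, not the ones behind the slide's figure.

```python
# Sketch: train/test misclassification rate vs. C for several SVM kernels.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] - 0.5)            # a nonlinear labeling rule (assumption)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

kernels = {
    "aff":    dict(kernel="linear"),
    "quad":   dict(kernel="poly", degree=2),
    "poly10": dict(kernel="poly", degree=10),
    "rbf1":   dict(kernel="rbf", gamma=1.0),
    "rbf01":  dict(kernel="rbf", gamma=0.1),
}
for name, kw in kernels.items():
    for C in np.logspace(-1, 1, 7):
        clf = SVC(C=C, **kw).fit(X_tr, y_tr)
        tr = np.mean(clf.predict(X_tr) != y_tr)      # empirical (training) misclassification
        te = np.mean(clf.predict(X_te) != y_te)      # held-out proxy for population risk
        print(f"{name:7s} C={C:5.2f}  train={tr:.3f}  test={te:.3f}")
```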

  4. Decomposing excess risk
     f̂ is an approximate ERM: R̂(f̂) ≈ min_{f∈F} R̂(f). Let's also define the true/population risk minimizer f̄ := arg min_{f∈F} R(f).
     (Question: what is F? Answer: it depends on the kernel!)
     (Question: is f̄ = arg min_{f∈F} R̂(f)? Answer: no; in general R̂(f̄) ≥ R̂(f̂)!)
     Nature labels according to some g (not necessarily inside F!):
         R(f̂) =  R(g)              (inherent unpredictability)
               + R(f̄) − R(g)       (approximation gap)
               + R̂(f̄) − R(f̄)       (estimation gap)
               + R̂(f̂) − R̂(f̄)       (optimization gap)
               + R(f̂) − R̂(f̂)       (generalization gap)
     Let's go through this step by step.
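
As a quick sanity check (mine, not on the slide), the five terms form a telescoping sum that collapses back to R(f̂); written out in LaTeX:

```latex
% Telescoping check of the excess-risk decomposition: every intermediate term cancels.
\begin{align*}
R(\hat f)
  &= R(g)
   + \bigl(R(\bar f) - R(g)\bigr)
   + \bigl(\widehat R(\bar f) - R(\bar f)\bigr) \\
  &\qquad
   + \bigl(\widehat R(\hat f) - \widehat R(\bar f)\bigr)
   + \bigl(R(\hat f) - \widehat R(\hat f)\bigr).
\end{align*}
```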

  5. Inherent unpredictability
     Nature labels according to some g (not necessarily inside F!):
         R(g)    (inherent unpredictability)
     ◮ If g is the function with the lowest classification error, we can write down an explicit form: g(x) := sign(Pr[Y = +1 | X = x] − 1/2).
     ◮ If g minimizes R with a convex ℓ, we can again write g down pointwise via Pr[Y = +1 | X = x].
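
A worked instance of the second bullet (my example, not the slide's): take the logistic loss ℓ(z) = ln(1 + e⁻ᶻ) and write η(x) := Pr[Y = +1 | X = x]; minimizing the conditional risk pointwise over the predicted value gives

```latex
% Pointwise minimizer of the conditional logistic risk; its sign matches sign(eta(x) - 1/2).
\begin{align*}
g(x)
  = \arg\min_{z \in \mathbb{R}} \Bigl[ \eta(x)\,\ell(z) + (1-\eta(x))\,\ell(-z) \Bigr]
  = \ln \frac{\eta(x)}{1 - \eta(x)}.
\end{align*}
```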

  6. Approximation gap
     f̄ minimizes R over F, and g is chosen by nature; consider
         R(f̄) − R(g).    (approximation gap)
     ◮ We've shown that if R is misclassification error, F is the class of affine classifiers, and g is quadratic, the gap can be 1/4 (a numerical version of this appears in the sketch below).
     ◮ We can make this gap arbitrarily small if F is: a wide 2-layer network, an RBF-kernel SVM, a polynomial classifier of arbitrary degree, ...
     ◮ What is F for SVM?
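
A sketch of the first bullet's phenomenon under my own synthetic setup (quadratic labels, linear versus RBF SVM); this is not the course's exact construction, and the specific error values depend on the assumed distribution.

```python
# Sketch: the approximation gap of affine classifiers on quadratically labeled data.
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(4000, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 0.5)              # quadratic labeling rule g (assumption)
X_tr, y_tr, X_te, y_te = X[:2000], y[:2000], X[2000:], y[2000:]

affine = LinearSVC(C=1.0).fit(X_tr, y_tr)                   # F = affine/linear classifiers
rbf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X_tr, y_tr)   # a far richer class

# g itself has zero misclassification error, so the affine test error estimates the
# approximation gap; the RBF class can drive that gap toward zero.
print("affine test error:", np.mean(affine.predict(X_te) != y_te))
print("rbf    test error:", np.mean(rbf.predict(X_te) != y_te))
```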

  7. Approximation gap
     Consider SVM with no kernel. Can we only say
         F := { x ↦ wᵀx : w ∈ Rᵈ }?
     Note that for ŵ := arg min_w R̂(w) + (λ/2)‖w‖²,
         (λ/2)‖ŵ‖² ≤ R̂(ŵ) + (λ/2)‖ŵ‖² ≤ R̂(0) + (λ/2)‖0‖² = (1/n) Σ_{i=1}^n max(0, 1 − y_i·0ᵀx_i) = 1,
     and so SVM is working with the finer set
         F_λ := { x ↦ wᵀx : ‖w‖² ≤ 2/λ }.
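
A quick numerical check of the bound ‖ŵ‖² ≤ 2/λ using plain subgradient descent on the primal objective; the data, λ, and step-size schedule are my own illustrative choices.

```python
# Sketch: minimize (1/n) sum_i hinge(y_i w.x_i) + (lam/2)||w||^2, then compare ||w||^2 to 2/lam.
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.3 * rng.normal(size=n))

w = np.zeros(d)
for t in range(1, 5001):
    viol = y * (X @ w) < 1                                      # margin violators
    grad = -(X * y[:, None])[viol].sum(axis=0) / n + lam * w    # a subgradient of the objective
    w -= grad / (lam * t)                                       # 1/(lambda t) step size
print("||w||^2 =", w @ w, "  bound 2/lambda =", 2 / lam)
```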

  8. Approximation gap
     What about kernel SVM? Now we are working with
         F_k := { x ↦ Σ_{i=1}^n α_i y_i k(x_i, x) : α ∈ Rⁿ },
     which is a random variable! ((x_i, y_i))_{i=1}^n is given by the data.
     This function class is called a reproducing kernel Hilbert space (RKHS). We can use it to develop a refined notion F_{k,λ}.
     Going forward: we always try to work with the tightest possible function class defined by the data and the algorithm.
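
A sketch making the form of F_k concrete with scikit-learn: a fitted RBF SVC's decision function can be rebuilt from its stored dual coefficients (which hold α_i y_i for the support vectors). The dataset and γ below are arbitrary choices for illustration.

```python
# Sketch: a trained kernel SVM is exactly x -> sum_i alpha_i y_i k(x_i, x) + b,
# with nonzero alpha_i only on the support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])                            # an arbitrary nonlinear rule (assumption)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

X_new = rng.normal(size=(5, 2))
K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)  # entries k(x, x_i)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_      # sum_i (alpha_i y_i) k(x_i, x) + b
print(np.allclose(manual, clf.decision_function(X_new)))  # True
```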

  9. Estimation gap
     f̄ minimizes R over F.
         R̂(f̄) − R(f̄)    (estimation gap)
     ◮ If ((x_i, y_i))_{i=1}^n is drawn IID from the same distribution as the expectation in R, then by the central limit theorem, R̂(f̄) → R(f̄) as n → ∞.
     ◮ Next week, we'll discuss high-probability bounds for finite n.
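
A small simulation of this concentration for a fixed classifier; the distribution, the fixed f̄, and the sample sizes are all made up for illustration.

```python
# Sketch: for a FIXED classifier, the empirical risk concentrates around the population
# risk as n grows (fluctuations shrink on the order of 1/sqrt(n)).
import numpy as np

rng = np.random.default_rng(4)

def sample(n):
    x = rng.normal(size=n)
    y = np.where(rng.random(n) < 0.8, np.sign(x), -np.sign(x))   # 20% label noise
    return x, y

f_bar = lambda x: np.sign(x)      # fixed classifier, chosen independently of the sample
R_true = 0.2                      # its population 0/1 risk under the model above

for n in [100, 1_000, 10_000, 100_000]:
    x, y = sample(n)
    R_hat = np.mean(f_bar(x) != y)
    print(f"n={n:6d}  R_hat={R_hat:.4f}  |R_hat - R|={abs(R_hat - R_true):.4f}")
```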

  10. Optimization gap
     f̂ ∈ F minimizes R̂, and f̄ ∈ F minimizes R.
         R̂(f̂) − R̂(f̄)    (optimization gap)
     ◮ This is algorithmic: we reduce this number by optimizing better.
     ◮ We've advocated the use of gradient descent.
     ◮ Many of these problems are NP-hard even in trivial cases. (Linear separators with noise and 3-node neural networks are NP-hard.)
     ◮ If R̂ uses a convex loss and f̂ has at least one training mistake, relating R̂ to test-set misclassifications can be painful.
     Specifically considering SVM:
     ◮ This is a convex optimization problem.
     ◮ We can solve it in many ways (primal, dual, projected gradient descent, coordinate descent, Newton, etc.); it doesn't really matter so long as we end up close, and the primal solution is unique.
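
A sketch of driving the optimization gap down with (sub)gradient descent on the SVM primal; the long-run objective value stands in for the optimum of the regularized empirical objective, and all settings are illustrative.

```python
# Sketch: the optimization gap is purely algorithmic -- more subgradient steps push
# the regularized empirical hinge objective toward its minimum value.
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 300, 10, 0.05
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.5 * rng.normal(size=n))

def objective(w):
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)

def run(steps):
    w = np.zeros(d)
    for t in range(1, steps + 1):
        viol = y * (X @ w) < 1
        grad = -(X * y[:, None])[viol].sum(axis=0) / n + lam * w
        w -= grad / (lam * t)                 # 1/(lambda t) step size
    return w

ref = objective(run(20000))                   # long run: a stand-in for the optimal value
for steps in [10, 100, 1000, 10000]:
    print(f"steps={steps:5d}  objective gap ~ {objective(run(steps)) - ref:.5f}")
```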

  11. Generalization
     f̂ is returned by ERM.
         R(f̂) − R̂(f̂)    (generalization gap)
     This quantity is the excess risk; when it is small, we say we generalize; otherwise we overfit.
     ◮ Before, we said "by the CLT, R̂(f̄) → R(f̄) as n → ∞." Is this quantity the same?
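
A sketch of why the answer is not an immediate "yes": f̂ is chosen using the same sample that defines R̂, so R̂(f̂) is biased downward in a way the CLT argument for a fixed f̄ never sees. The distribution, model, and sizes below are illustrative assumptions.

```python
# Sketch: R_hat(f_hat) systematically underestimates R(f_hat) when f_hat is fit on the
# same sample, while for a fixed f_bar the empirical risk is unbiased.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)

def sample(n):
    x = rng.normal(size=(n, 5))
    y = np.where(rng.random(n) < 0.85, np.sign(x[:, 0]), -np.sign(x[:, 0]))
    return x, y

X_big, y_big = sample(50_000)                             # large sample ~ population risk
f_bar = lambda x: np.sign(x[:, 0])                        # fixed classifier
gaps_hat, gaps_bar = [], []
for _ in range(50):
    X_tr, y_tr = sample(50)                               # small training sample
    f_hat = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X_tr, y_tr)
    gaps_hat.append(np.mean(f_hat.predict(X_big) != y_big)
                    - np.mean(f_hat.predict(X_tr) != y_tr))    # R(f_hat) - R_hat(f_hat)
    gaps_bar.append(np.mean(f_bar(X_big) != y_big)
                    - np.mean(f_bar(X_tr) != y_tr))            # R(f_bar) - R_hat(f_bar)
print("mean gap, data-dependent f_hat:", np.mean(gaps_hat))    # noticeably positive
print("mean gap, fixed f_bar:         ", np.mean(gaps_bar))    # approximately zero
```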
