Statistical Learning Theory
Machine Learning Summer School, Kyoto, Japan
Alexander (Sasha) Rakhlin
University of Pennsylvania, The Wharton School
Penn Research in Machine Learning (PRiML)
August 27-28, 2012
1 / 130


Statistical Learning Theory
Upon observing the training data {(x_1, y_1), ..., (x_n, y_n)}, the learner is asked to summarize what she has learned about the relationship between x and y. The learner’s summary takes the form of a function f̂_n : X → Y. The hat indicates that this function depends on the training data.
Learning algorithm: a mapping {(x_1, y_1), ..., (x_n, y_n)} ↦ f̂_n.
The quality of the learned relationship is given by comparing the response f̂_n(x) to y for a pair (x, y) independently drawn from the same distribution P:
E_{(x,y)} ℓ(f̂_n(x), y)
where ℓ : Y × Y → R is a loss function. This is our measure of performance.
21 / 130

Loss Functions
▸ Indicator loss (classification): ℓ(y, y′) = I{y ≠ y′}
▸ Square loss: ℓ(y, y′) = (y − y′)²
▸ Absolute loss: ℓ(y, y′) = |y − y′|
22 / 130

Examples
Probably the simplest learning algorithm you are familiar with is linear least squares: given (x_1, y_1), ..., (x_n, y_n), let
β̂ = argmin_{β∈R^d} (1/n) ∑_{i=1}^n (y_i − ⟨β, x_i⟩)²
and define f̂_n(x) = ⟨β̂, x⟩.
Another basic method is regularized least squares:
β̂ = argmin_{β∈R^d} (1/n) ∑_{i=1}^n (y_i − ⟨β, x_i⟩)² + λ∥β∥²
23 / 130
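
Both estimators above have closed-form solutions. The following is a minimal numpy sketch on synthetic data (the variable names and data-generating choices are ours, not from the slides):

```python
# A sketch of linear and regularized least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Linear least squares: beta_hat = argmin (1/n) sum_i (y_i - <beta, x_i>)^2
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Regularized least squares adds lam * ||beta||^2; setting the gradient to
# zero gives the closed form (X'X/n + lam*I) beta_hat = X'y/n.
lam = 0.1
beta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

f_n = lambda x: x @ beta_ols  # the learned predictor f_n(x) = <beta_hat, x>
```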

Methods vs Problems
[Figure: algorithms f̂_n on one side, distributions P on the other.]
24 / 130

Expected Loss and Empirical Loss
The expected loss of any function f : X → Y is
L(f) = E ℓ(f(x), y)
Since P is unknown, we cannot calculate L(f). However, we can calculate the empirical loss of f : X → Y:
L̂(f) = (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i)
25 / 130
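
A few lines of Python make the distinction concrete (a sketch; the helper names are ours): L̂(f) is computable from the sample, while L(f) can only be approximated by a fresh draw from P.

```python
# Empirical loss L_hat(f) under the losses from the “Loss Functions” slide.
import numpy as np

indicator_loss = lambda yhat, y: (yhat != y).astype(float)
square_loss    = lambda yhat, y: (yhat - y) ** 2
absolute_loss  = lambda yhat, y: np.abs(yhat - y)

def empirical_loss(f, xs, ys, loss):
    """L_hat(f) = (1/n) sum_i loss(f(x_i), y_i): computable from data alone."""
    return float(np.mean(loss(f(xs), ys)))

# L(f) = E loss(f(x), y) is not computable since P is unknown, but for a
# large *fresh* i.i.d. sample from P, empirical_loss(f, ...) approximates it.
```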

...again, what is random here?
Since the data (x_1, y_1), ..., (x_n, y_n) are a random i.i.d. draw from P,
▸ L̂(f) is a random quantity
▸ f̂_n is a random quantity (a random function, the output of our learning procedure after seeing data)
▸ hence, L(f̂_n) is also a random quantity
▸ for a given f : X → Y, the quantity L(f) is not random!
It is important that these are understood before we proceed further.
26 / 130

The Gold Standard
Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function
f* = argmin_f L(f)
where the minimization is over all (measurable) prediction rules f : X → Y. The value of the lowest expected loss is called the Bayes error:
L(f*) = inf_f L(f)
Of course, we cannot calculate any of these quantities since P is unknown.
27 / 130

Bayes Optimal Function
The Bayes optimal function f* takes the following forms in two particular cases:
▸ Binary classification (Y = {0, 1}) with the indicator loss: f*(x) = I{η(x) ≥ 1/2}, where η(x) = E[Y | X = x]
▸ Regression (Y = R) with square loss: f*(x) = η(x), where η(x) = E[Y | X = x]
28 / 130
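
In a simulation, where the distribution (and hence η) is known, both forms can be written down directly. A small sketch (our construction; η and the marginal of x are illustrative choices):

```python
# Bayes optimal predictions when eta(x) = E[Y | X = x] is known; possible
# only in simulation, since in learning problems P is unknown.
import numpy as np

eta = lambda x: 1 / (1 + np.exp(-4 * (x - 0.5)))   # an illustrative eta

bayes_classifier = lambda x: (eta(x) >= 0.5).astype(int)  # indicator loss
bayes_regressor  = eta                                    # square loss

# Bayes error under the indicator loss is E[min(eta(x), 1 - eta(x))];
# here estimated by Monte Carlo with x ~ Unif[0, 1].
x = np.random.default_rng(0).uniform(size=100_000)
print(np.mean(np.minimum(eta(x), 1 - eta(x))))
```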

The big question: is there a way to construct a learning algorithm with a guarantee that L(f̂_n) − L(f*) is small for a large enough sample size n?
29 / 130

Outline
Introduction
Statistical Learning Theory
  The Setting of SLT
  Consistency, No Free Lunch Theorems, Bias-Variance Tradeoff
  Tools from Probability, Empirical Processes
  From Finite to Infinite Classes
  Uniform Convergence, Symmetrization, and Rademacher Complexity
  Large Margin Theory for Classification
  Properties of Rademacher Complexity
  Covering Numbers and Scale-Sensitive Dimensions
  Faster Rates
  Model Selection
Sequential Prediction / Online Learning
  Motivation
  Supervised Learning
  Online Convex and Linear Optimization
  Online-to-Batch Conversion, SVM optimization
30 / 130

Consistency
An algorithm that ensures
lim_{n→∞} L(f̂_n) = L(f*) almost surely
is called consistent. Consistency ensures that our algorithm approaches the best possible prediction performance as the sample size increases.
The good news: consistency is possible to achieve.
▸ easy if X is a finite or countable set
▸ not too hard if X is infinite and the underlying relationship between x and y is “continuous”
31 / 130

The bad news...
In general, we cannot prove anything “interesting” about L(f̂_n) − L(f*) unless we make further assumptions (incorporate prior knowledge). What do we mean by “nothing interesting”? This is the subject of the so-called “No Free Lunch” theorems. Unless we posit further assumptions,
▸ For any algorithm f̂_n, any n, and any ε > 0, there exists a distribution P such that L(f*) = 0 and E L(f̂_n) ≥ 1/2 − ε
▸ For any algorithm f̂_n and any sequence a_n converging to 0, there exists a probability distribution P such that L(f*) = 0 and, for all n, E L(f̂_n) ≥ a_n
Reference: (Devroye, Györfi, Lugosi: A Probabilistic Theory of Pattern Recognition), (Bousquet, Boucheron, Lugosi, 2004)
32 / 130

Is this really “bad news”?
Not really. We always have some domain knowledge. Two ways of incorporating prior knowledge:
▸ Direct way: assume that the distribution P is not arbitrary (also known as a modeling approach, generative approach, statistical modeling)
▸ Indirect way: redefine the goal to perform as well as a reference set F of predictors:
L(f̂_n) − inf_{f∈F} L(f)
This is known as a discriminative approach. F encapsulates our inductive bias.
33 / 130

Pros/Cons of the two approaches
Pros of the discriminative approach: we never assume that P takes some particular form; rather, we put our prior knowledge into “what types of predictors will do well”. Cons: cannot really interpret f̂_n.
Pros of the generative approach: can estimate the model / parameters of the distribution (inference). Cons: it is not clear what the analysis says if the assumption is actually violated.
Both approaches have their advantages. A machine learning researcher or practitioner should ideally know both and understand their strengths and weaknesses. In this tutorial we only focus on the discriminative approach.
34 / 130

Example: Linear Discriminant Analysis
Consider the classification problem with Y = {0, 1}. Suppose the class-conditional densities are multivariate Gaussian with the same covariance Σ = I:
p(x | y = 0) = (2π)^{−k/2} exp{−(1/2)∥x − µ₀∥²}
and
p(x | y = 1) = (2π)^{−k/2} exp{−(1/2)∥x − µ₁∥²}
The “best” (Bayes) classifier is f* = I{P(y = 1 | x) ≥ 1/2}, which corresponds to the half-space defined by the decision boundary p(x | y = 1) ≥ p(x | y = 0). This boundary is linear.
35 / 130

Example: Linear Discriminant Analysis
The (linear) optimal decision boundary comes from our generative assumption on the form of the underlying distribution.
Alternatively, we could have indirectly postulated that we will look for a linear discriminant between the two classes, without making distributional assumptions. Such linear discriminant (classification) functions are I{⟨w, x⟩ ≥ b} for a unit-norm w and some bias b ∈ R.
Quadratic Discriminant Analysis: if unequal covariance matrices Σ₁ and Σ₂ are assumed, the resulting boundary is quadratic. We can then define the classification function as I{q(x) ≥ 0}, where q(x) is a quadratic function.
36 / 130
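
With Σ = I and equal class priors, the Bayes rule reduces to classifying by the nearest mean, which is exactly a linear rule I{⟨w, x⟩ ≥ b} with w = µ₁ − µ₀. A sketch (the means and test points are our own example):

```python
# LDA decision rule for the Gaussian model above (Sigma = I, equal priors).
# Comparing ||x - mu1||^2 <= ||x - mu0||^2 yields the linear rule below.
import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])

w = mu1 - mu0                                # normal vector of the boundary
b = 0.5 * (mu1 @ mu1 - mu0 @ mu0)            # threshold determined by the means

f_star = lambda x: (x @ w >= b).astype(int)  # the Bayes classifier

x = np.array([[1.5, 0.2], [-0.5, 0.3]])
print(f_star(x))                             # -> [1 0]
```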

Bias-Variance Tradeoff
How do we choose the inductive bias F?
L(f̂_n) − L(f*) = [L(f̂_n) − inf_{f∈F} L(f)] + [inf_{f∈F} L(f) − L(f*)]
where the first bracket is the Estimation Error and the second is the Approximation Error.
[Figure: the Bayes rule f*, the best in class f_F, and the learned f̂_n relative to the class F.]
Clearly, the two terms are at odds with each other:
▸ Making F larger means smaller approximation error but (as we will see) larger estimation error
▸ Taking a larger sample n means smaller estimation error and has no effect on the approximation error
▸ Thus, it makes sense to trade off the size of F and n. This is called Structural Risk Minimization, or the Method of Sieves, or Model Selection.
37 / 130

Bias-Variance Tradeoff
We will only focus on the estimation error, yet the ideas we develop will make it possible to read about model selection on your own.
Note: if we guessed correctly and f* ∈ F, then
L(f̂_n) − L(f*) = L(f̂_n) − inf_{f∈F} L(f)
For a particular problem, one hopes that prior knowledge about the problem can ensure that the approximation error inf_{f∈F} L(f) − L(f*) is small.
38 / 130

Occam’s Razor
Occam’s Razor is often quoted as a principle for choosing the simplest theory or explanation out of the possible ones. However, this is a rather philosophical argument, since simplicity is not uniquely defined. We will discuss this issue later.
What we will do is try to understand “complexity” when it comes to the behavior of certain stochastic processes. Such a question is well-defined mathematically.
39 / 130

Looking Ahead
So far: we represented prior knowledge by means of the class F. Looking forward, we can find an algorithm that, after looking at a dataset of size n, produces f̂_n such that
L(f̂_n) − inf_{f∈F} L(f)
decreases (in a certain sense which we will make precise) at a non-trivial rate which depends on the “richness” of F. This will give a sample complexity guarantee: how many samples are needed to make the error smaller than a desired accuracy.
40 / 130

Outline (repeated; see 30 / 130). Next: Tools from Probability, Empirical Processes.
41 / 130

Types of Bounds
In expectation vs in probability (control the mean vs control the tails):
E{L(f̂_n) − inf_{f∈F} L(f)} < ψ(n)   vs   P(L(f̂_n) − inf_{f∈F} L(f) ≥ ε) < ψ(n, ε)
The in-probability bound can be inverted as
P(L(f̂_n) − inf_{f∈F} L(f) ≥ φ(δ, n)) < δ
by setting δ := ψ(ε, n) and solving for ε. In this lecture, we are after the function φ(δ, n). We will call it “the rate”.
“With high probability” typically means logarithmic dependence of φ(δ, n) on 1/δ. Very desirable: the bound grows only modestly even for high-confidence bounds.
42 / 130

Sample Complexity
Sample complexity is the sample size required by the algorithm f̂_n to guarantee L(f̂_n) − inf_{f∈F} L(f) ≤ ε with probability at least 1 − δ. Of course, we just need to invert a bound
P(L(f̂_n) − inf_{f∈F} L(f) ≥ φ(δ, n)) < δ
by setting ε := φ(δ, n) and solving for n. In other words, n(ε, δ) is the sample complexity of the algorithm f̂_n if
P(L(f̂_n) − inf_{f∈F} L(f) ≥ ε) ≤ δ as soon as n ≥ n(ε, δ).
Hence, a “rate” can be translated into a “sample complexity” and vice versa. Easy to remember: rate O(1/√n) means O(1/ε²) sample complexity, whereas rate O(1/n) is a smaller O(1/ε) sample complexity.
43 / 130

Types of Bounds
Other distinctions to keep in mind: we can ask for bounds (either in expectation or in probability) on the following random variables:
(A) L(f̂_n) − L(f*)
(B) L(f̂_n) − inf_{f∈F} L(f)
(C) L(f̂_n) − L̂(f̂_n)
(D) sup_{f∈F} {L(f) − L̂(f)}
(E) sup_{f∈F} {L(f) − L̂(f) − pen_n(f)}
Let’s make sure we understand the differences between these random quantities!
44 / 130

Types of Bounds
Upper bounds on (D) and (E) are used as tools for achieving the other bounds. Let’s see why. Obviously, for any algorithm that outputs f̂_n ∈ F,
L(f̂_n) − L̂(f̂_n) ≤ sup_{f∈F} {L(f) − L̂(f)}
and so a bound on (D) implies a bound on (C).
How about a bound on (B)? Is it implied by (C) or (D)? It depends on what the algorithm does! Denote f_F = argmin_{f∈F} L(f). Suppose (D) is small. It then makes sense to ask the learning algorithm to minimize (or approximately minimize) the empirical error (why?)
45 / 130

Canonical Algorithms
Empirical Risk Minimization (ERM) algorithm:
f̂_n = argmin_{f∈F} L̂(f)
Regularized Empirical Risk Minimization algorithm:
f̂_n = argmin_{f∈F} L̂(f) + pen_n(f)
We will deal with regularized ERM a bit later. For now, let’s focus on ERM.
Remark: to actually compute the f ∈ F minimizing the above objectives, one needs to employ some optimization method. In practice, the objective might be optimized only approximately.
46 / 130

Performance of ERM
If f̂_n is an ERM,
L(f̂_n) − L(f_F) ≤ {L(f̂_n) − L̂(f̂_n)} + {L̂(f̂_n) − L̂(f_F)} + {L̂(f_F) − L(f_F)}
≤ {L(f̂_n) − L̂(f̂_n)} + {L̂(f_F) − L(f_F)}   [the middle term is non-positive, since f̂_n minimizes L̂]
≤ sup_{f∈F} {L(f) − L̂(f)} + {L̂(f_F) − L(f_F)}
The first term in the second line is (C); the supremum in the last line is (D). So (C) also implies a bound on (B) when f̂_n is ERM (or “close” to ERM), and (D) likewise implies a bound on (B).
What about the extra term L̂(f_F) − L(f_F)? The Central Limit Theorem says that for i.i.d. random variables with bounded second moment, the average converges to the expectation. Let’s quantify this.
47 / 130

Hoeffding’s Inequality
Let W, W_1, ..., W_n be i.i.d. such that P(a ≤ W ≤ b) = 1. Then
P(E W − (1/n) ∑_{i=1}^n W_i > ε) ≤ exp(−2nε² / (b − a)²)
and
P((1/n) ∑_{i=1}^n W_i − E W > ε) ≤ exp(−2nε² / (b − a)²)
Let W_i = ℓ(f_F(x_i), y_i). Clearly, W_1, ..., W_n are i.i.d. Then,
P(|L(f_F) − L̂(f_F)| > ε) ≤ 2 exp(−2nε² / (b − a)²)
assuming a ≤ ℓ(f_F(x), y) ≤ b for all x ∈ X, y ∈ Y.
48 / 130
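
A quick simulation (our own sketch, with W ~ Unif[0, 1], so a = 0 and b = 1) shows the empirical tail probability sitting well below the Hoeffding bound:

```python
# Monte Carlo check of Hoeffding's inequality.
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 20_000

W = rng.uniform(size=(trials, n))   # i.i.d. W_i ~ Unif[0, 1], E W = 1/2
deviations = 0.5 - W.mean(axis=1)   # E W - (1/n) sum_i W_i, per trial

tail = np.mean(deviations > eps)    # empirical P(E W - mean > eps)
bound = np.exp(-2 * n * eps**2)     # exp(-2 n eps^2 / (b - a)^2), b - a = 1
print(f"empirical {tail:.4f} <= bound {bound:.4f}")
```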

Wait, Are We Done?
Can’t we conclude directly that (C) is small? That is,
P(E ℓ(f̂_n(x), y) − (1/n) ∑_{i=1}^n ℓ(f̂_n(x_i), y_i) > ε) ≤ 2 exp(−2nε² / (b − a)²) ?
No! The random variables ℓ(f̂_n(x_i), y_i) are not necessarily independent, and it is possible that
E ℓ(f̂_n(x), y) = E W ≠ E ℓ(f̂_n(x_i), y_i) = E W_i
The expected loss is “out-of-sample performance” while the second term is “in-sample”. We say that ℓ(f̂_n(x_i), y_i) is a biased estimate of E ℓ(f̂_n(x), y). How bad can this bias be?
49 / 130

Example
▸ X = [0, 1], Y = {0, 1}
▸ ℓ(f(x_i), y_i) = I{f(x_i) ≠ y_i}
▸ distribution P = P_x × P_{y|x} with P_x = Unif[0, 1] and P_{y|x} = δ_{y=1}
▸ function class F = ∪_{n∈N} {f_S : S ⊂ X, |S| = n, f_S(x) = I{x ∈ S}}
The ERM f̂_n memorizes (perfectly fits) the data but has no ability to generalize. Observe that
0 = E ℓ(f̂_n(x_i), y_i) ≠ E ℓ(f̂_n(x), y) = 1
This phenomenon is called overfitting.
50 / 130
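
A direct simulation of this memorizing ERM (our sketch) shows the gap: zero empirical loss, expected loss essentially 1 on fresh data.

```python
# The memorizing ERM from the example: predict 1 exactly on the training
# points and 0 elsewhere; zero loss in sample, loss 1 out of sample.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x_train = rng.uniform(size=n)
y_train = np.ones(n)                    # P(y = 1 | x) = 1

S = set(x_train)                        # the memorized set
f_hat = lambda xs: np.array([1.0 if x in S else 0.0 for x in xs])

train_err = np.mean(f_hat(x_train) != y_train)  # = 0.0 (in sample)
x_test = rng.uniform(size=10_000)               # fresh draws miss S a.s.
test_err = np.mean(f_hat(x_test) != 1.0)        # ~ 1.0 (out of sample)
print(train_err, test_err)
```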

Example
Not only is (C) large in this example; the uniform deviations (D) also do not converge to zero. For any n ∈ N and any (x_1, y_1), ..., (x_n, y_n) ∼ P,
sup_{f∈F} {E_{x,y} ℓ(f(x), y) − (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i)} = 1
Where do we go from here? Two approaches:
1. understand how to upper bound the uniform deviations (D)
2. find properties of algorithms that limit in some way the bias of ℓ(f̂_n(x_i), y_i). Stability and compression are two such approaches.
51 / 130

Uniform Deviations
We first focus on understanding
sup_{f∈F} {E_{x,y} ℓ(f(x), y) − (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i)}
If F = {f₀} consists of a single function, then clearly
sup_{f∈F} {E ℓ(f(x), y) − (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i)} = E ℓ(f₀(x), y) − (1/n) ∑_{i=1}^n ℓ(f₀(x_i), y_i)
This quantity is O_P(1/√n) by Hoeffding’s inequality, assuming a ≤ ℓ(f₀(x), y) ≤ b.
Moral: for “simple” classes F the uniform deviations (D) can be bounded, while for “rich” classes they cannot. We will see how far we can push the size of F.
52 / 130

A bit of notation to simplify things...
To ease the notation,
▸ let z_i = (x_i, y_i), so that the training data is {z_1, ..., z_n}
▸ g(z) = ℓ(f(x), y) for z = (x, y)
▸ loss class G = {g : g(z) = ℓ(f(x), y), f ∈ F} = ℓ ∘ F
▸ ĝ_n = ℓ(f̂_n(⋅), ⋅), g_G = ℓ(f_F(⋅), ⋅)
▸ g* = argmin_g E g(z) = ℓ(f*(⋅), ⋅) is the Bayes optimal (loss) function
We can now work with the set G, but keep in mind that each g ∈ G corresponds to an f ∈ F. Once again, the quantity of interest is
sup_{g∈G} {E g(z) − (1/n) ∑_{i=1}^n g(z_i)}
On the next slide, we visualize the deviations E g(z) − (1/n) ∑_{i=1}^n g(z_i) for all possible functions g and discuss all the concepts introduced so far.
53 / 130

Empirical Process Viewpoint
[Figure sequence: for each function g (x-axis: all functions), the plot shows the empirical average (1/n) ∑_{i=1}^n g(z_i) against the expectation E g, successively marking the Bayes loss g*, the ERM ĝ_n, the class G, and the within-class minimizer g_G.]
54 / 130

Empirical Process Viewpoint
A stochastic process is a collection of random variables indexed by some set. An empirical process is a stochastic process
{E g(z) − (1/n) ∑_{i=1}^n g(z_i)}_{g∈G}
indexed by a function class G.
Uniform Law of Large Numbers:
sup_{g∈G} |E g − (1/n) ∑_{i=1}^n g(z_i)| → 0
in probability.
Key question: how “big” can G be for the supremum of the empirical process to still be manageable?
55 / 130

Union Bound (Boole’s Inequality)
Boole’s inequality: for a finite or countable set of events,
P(∪_j A_j) ≤ ∑_j P(A_j)
Let G = {g_1, ..., g_N}. Then
P(∃ g ∈ G : E g − (1/n) ∑_{i=1}^n g(z_i) > ε) ≤ ∑_{j=1}^N P(E g_j − (1/n) ∑_{i=1}^n g_j(z_i) > ε)
Assuming P(a ≤ g(z_i) ≤ b) = 1 for every g ∈ G,
P(sup_{g∈G} {E g − (1/n) ∑_{i=1}^n g(z_i)} > ε) ≤ N exp(−2nε² / (b − a)²)
56 / 130

Finite Class
Alternatively, set δ = N exp(−2nε² / (b − a)²) and write
P(sup_{g∈G} {E g − (1/n) ∑_{i=1}^n g(z_i)} > (b − a) √((log N + log(1/δ)) / (2n))) ≤ δ
Another way to write it: with probability at least 1 − δ,
sup_{g∈G} {E g − (1/n) ∑_{i=1}^n g(z_i)} ≤ (b − a) √((log N + log(1/δ)) / (2n))
Hence, with probability at least 1 − δ, the ERM algorithm f̂_n for a class F of cardinality N satisfies
L(f̂_n) − inf_{f∈F} L(f) ≤ 2(b − a) √((log N + log(1/δ)) / (2n))
assuming a ≤ ℓ(f(x), y) ≤ b for all f ∈ F, x ∈ X, y ∈ Y. The constant 2 is due to the L(f_F) − L̂(f_F) term. This is a loose upper bound.
57 / 130
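
The log N cost of the union bound can be seen numerically. A stylized sketch (ours): treat the values g_j(z_i) as independent Bernoulli(1/2) coins, so E g_j = 1/2 with a = 0, b = 1, and compare the supremum of the deviations to the bound.

```python
# Uniform deviations over a finite class vs. sqrt((log N + log(1/delta))/(2n)).
import numpy as np

rng = np.random.default_rng(0)
n, N, delta, trials = 200, 1000, 0.05, 200

sup_dev = np.empty(trials)
for t in range(trials):
    G = rng.integers(0, 2, size=(N, n))        # values g_j(z_i), E g_j = 1/2
    sup_dev[t] = np.max(0.5 - G.mean(axis=1))  # sup_j {E g_j - (1/n) sum_i g_j(z_i)}

bound = np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))
print(np.quantile(sup_dev, 1 - delta), "<=", bound)  # holds w.p. >= 1 - delta
```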

Once again...
A take-away message is that the following two statements are worlds apart:
with probability at least 1 − δ, for any g ∈ G: E g − (1/n) ∑_{i=1}^n g(z_i) ≤ ε
vs
for any g ∈ G, with probability at least 1 − δ: E g − (1/n) ∑_{i=1}^n g(z_i) ≤ ε
The second statement follows from the CLT, while the first statement is often difficult to obtain and only holds for some G.
58 / 130

Outline (repeated; see 30 / 130). Next: From Finite to Infinite Classes.
59 / 130

Countable Class: Weighted Union Bound
Let G be countable, and fix a distribution w on G such that ∑_{g∈G} w(g) ≤ 1. For any δ > 0, for any g ∈ G,
P(E g − (1/n) ∑_{i=1}^n g(z_i) ≥ (b − a) √((log(1/w(g)) + log(1/δ)) / (2n))) ≤ δ · w(g)
by Hoeffding’s inequality (easy to verify!). By the union bound,
P(∃ g ∈ G : E g − (1/n) ∑_{i=1}^n g(z_i) ≥ (b − a) √((log(1/w(g)) + log(1/δ)) / (2n))) ≤ δ ∑_{g∈G} w(g) ≤ δ
Therefore, with probability at least 1 − δ, for all f ∈ F,
L(f) − L̂(f) ≤ (b − a) √((log(1/w(f)) + log(1/δ)) / (2n)) =: pen_n(f)
60 / 130
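
For instance, indexing the functions by k = 1, 2, ... and choosing the prior w(g_k) = 2^{−k} (which sums to 1) makes log(1/w) linear in k, so the penalty grows only like √(k/n). A small sketch (the prior is our illustrative choice; a = 0, b = 1):

```python
# Penalty pen_n(f) from the weighted union bound with prior w(g_k) = 2^{-k}.
import numpy as np

def pen(k, n, delta):
    log_inv_w = k * np.log(2.0)   # log(1/w(g_k)) = k log 2
    return np.sqrt((log_inv_w + np.log(1 / delta)) / (2 * n))

n, delta = 1000, 0.05
for k in (1, 10, 100):
    print(k, round(pen(k, n, delta), 4))
# More "complex" (large-k) functions pay a larger, but only sqrt(k/n), penalty.
```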

Countable Class: Weighted Union Bound
If f̂_n is a regularized ERM,
L(f̂_n) − L(f_F) ≤ {L(f̂_n) − L̂(f̂_n) − pen_n(f̂_n)} + {L̂(f̂_n) + pen_n(f̂_n) − L̂(f_F) − pen_n(f_F)} + {L̂(f_F) − L(f_F)} + pen_n(f_F)
≤ sup_{f∈F} {L(f) − L̂(f) − pen_n(f)} + {L̂(f_F) − L(f_F)} + pen_n(f_F)
So, (E) implies a bound on (B) when f̂_n is regularized ERM. From the weighted union bound for a countable class, with probability at least 1 − δ,
L(f̂_n) − L(f_F) ≤ {L̂(f_F) − L(f_F)} + pen_n(f_F) ≤ 2(b − a) √((log(1/w(f_F)) + log(1/δ)) / (2n))
61 / 130

Uncountable Class: Compression Bounds
Let us make the dependence of the algorithm f̂_n on the training set S = {(x_1, y_1), ..., (x_n, y_n)} explicit: f̂_n = f̂_n[S]. Suppose F has the property that there exists a “compression function” C_k which selects from any dataset S of any size n a subset of k labeled examples C_k(S) ⊆ S, such that the algorithm can be written as
f̂_n[S] = f̂_k[C_k(S)]
Then,
L(f̂_n) − L̂(f̂_n) = E ℓ(f̂_k[C_k(S)](x), y) − (1/n) ∑_{i=1}^n ℓ(f̂_k[C_k(S)](x_i), y_i)
≤ max_{I ⊂ {1,...,n}, |I| ≤ k} {E ℓ(f̂_k[S_I](x), y) − (1/n) ∑_{i=1}^n ℓ(f̂_k[S_I](x_i), y_i)}
62 / 130

Uncountable Class: Compression Bounds
Since f̂_k[S_I] only depends on k out of the n points, the empirical average is “mostly out of sample”. Adding and subtracting
(1/n) ∑_{(x′,y′)∈W} ℓ(f̂_k[S_I](x′), y′)
for an additional set of i.i.d. random variables W = {(x′_1, y′_1), ..., (x′_k, y′_k)} results in the upper bound
max_{I ⊂ {1,...,n}, |I| ≤ k} {E ℓ(f̂_k[S_I](x), y) − (1/n) ∑_{(x,y)∈(S∖S_I)∪W} ℓ(f̂_k[S_I](x), y)} + (b − a) k/n
We appeal to the union bound over the (n choose k) possibilities, with a Hoeffding bound for each. Then with probability at least 1 − δ,
L(f̂_n) − inf_{f∈F} L(f) ≤ 2(b − a) √((k log(en/k) + log(1/δ)) / (2n)) + (b − a) k/n
assuming a ≤ ℓ(f(x), y) ≤ b for all f ∈ F, x ∈ X, y ∈ Y.
63 / 130

Example: Classification with Thresholds in 1D
▸ X = [0, 1], Y = {0, 1}
▸ F = {f_θ : f_θ(x) = I{x ≥ θ}, θ ∈ [0, 1]}
▸ ℓ(f_θ(x), y) = I{f_θ(x) ≠ y}
For any set of data (x_1, y_1), ..., (x_n, y_n), the ERM solution f̂_n has the property that the first occurrence x_l to the left of the threshold has label y_l = 0, while the first occurrence x_r to the right has label y_r = 1. It is enough to take k = 2 and define f̂_n[S] = f̂_2[(x_l, 0), (x_r, 1)].
64 / 130
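
ERM over this class is a one-dimensional search: the empirical loss is piecewise constant in θ with breakpoints at the data, so it suffices to scan thresholds at the sorted points. A sketch (our implementation, with synthetic noisy labels):

```python
# ERM over F = { x -> I{x >= theta} } by scanning candidate thresholds.
import numpy as np

def erm_threshold(x, y):
    """Return theta minimizing the empirical 0-1 loss of I{x >= theta}."""
    candidates = np.concatenate(([0.0], np.sort(x), [1.0]))
    errs = [np.mean((x >= t).astype(int) != y) for t in candidates]
    return candidates[int(np.argmin(errs))]

rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = (x >= 0.3).astype(int)             # true threshold at 0.3
y[rng.uniform(size=100) < 0.1] ^= 1    # flip 10% of the labels (noise)

print(erm_threshold(x, y))             # close to 0.3
```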

Stability
Yet another way to limit the bias of ℓ(f̂_n(x_i), y_i) as an estimate of L(f̂_n) is through a notion of stability. An algorithm f̂_n is stable if a change (or removal) of a single data point does not change (in a certain mathematical sense) the function f̂_n by much.
Of course, a dumb algorithm which outputs f̂_n = f₀ without even looking at the data is very stable, and the ℓ(f̂_n(x_i), y_i) are independent random variables... But it is not a good algorithm! We would like an algorithm that both approximately minimizes the empirical error and is stable. It turns out that certain types of regularization methods are stable. Example:
f̂_n = argmin_{f∈F} (1/n) ∑_{i=1}^n (f(x_i) − y_i)² + λ∥f∥²_K
where ∥⋅∥_K is the norm induced by the kernel of a reproducing kernel Hilbert space (RKHS) F.
65 / 130
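
By the representer theorem, this regularized ERM has a finite-dimensional solution f̂_n(x) = ∑_i α_i K(x_i, x) with α = (K + λnI)^{−1} y. A short sketch (the Gaussian kernel and all parameter values are our illustrative choices):

```python
# Regularized ERM in an RKHS (kernel ridge regression).
import numpy as np

gauss_K = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / 0.1)

rng = np.random.default_rng(0)
n, lam = 100, 1e-3
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

K = gauss_K(x, x)
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)  # (K + lam*n*I) alpha = y
f_hat = lambda xs: gauss_K(xs, x) @ alpha

print(f_hat(np.array([0.25, 0.75])))  # approximately [1, -1], the sin values
```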

Summary so far
We proved upper bounds on L(f̂_n) − L(f_F) for
▸ ERM over a finite class
▸ regularized ERM over a countable class (weighted union bound)
▸ ERM over classes F with the compression property
▸ ERM or regularized ERM that are stable (only sketched)
What about a more general situation? Is there a way to measure the complexity of F that tells us whether ERM will succeed?
66 / 130

Outline (repeated; see 30 / 130). Next: Uniform Convergence, Symmetrization, and Rademacher Complexity.
67 / 130

Uniform Convergence and Symmetrization
Let z′_1, ..., z′_n be another set of n i.i.d. random variables from P. Let ε_1, ..., ε_n be i.i.d. Rademacher random variables: P(ε_i = −1) = P(ε_i = +1) = 1/2. Let’s get through a few manipulations:
E sup_{g∈G} {E g(z) − (1/n) ∑_{i=1}^n g(z_i)} = E_{z_{1:n}} sup_{g∈G} {E_{z′_{1:n}} [(1/n) ∑_{i=1}^n g(z′_i)] − (1/n) ∑_{i=1}^n g(z_i)}
By Jensen’s inequality, this is upper bounded by
E_{z_{1:n}, z′_{1:n}} sup_{g∈G} {(1/n) ∑_{i=1}^n g(z′_i) − (1/n) ∑_{i=1}^n g(z_i)}
which is equal to
E_{z_{1:n}, z′_{1:n}} E_{ε_{1:n}} sup_{g∈G} {(1/n) ∑_{i=1}^n ε_i (g(z′_i) − g(z_i))}
68 / 130

Uniform Convergence and Symmetrization
E_{z_{1:n}, z′_{1:n}} E_{ε_{1:n}} sup_{g∈G} {(1/n) ∑_{i=1}^n ε_i (g(z′_i) − g(z_i))}
≤ E sup_{g∈G} {(1/n) ∑_{i=1}^n ε_i g(z′_i)} + E sup_{g∈G} {(1/n) ∑_{i=1}^n −ε_i g(z_i)}
= 2 E sup_{g∈G} {(1/n) ∑_{i=1}^n ε_i g(z_i)}
The empirical Rademacher averages of G are defined as
R̂_n(G) = E[sup_{g∈G} {(1/n) ∑_{i=1}^n ε_i g(z_i)} | z_1, ..., z_n]
The Rademacher average (or Rademacher complexity) of G is
R_n(G) = E_{z_{1:n}} R̂_n(G)
69 / 130
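
Once a class is finite, or has been projected onto the sample (as on the following slides), R̂_n is easy to estimate by Monte Carlo over the signs. A sketch (ours), with the class given by its value matrix on the data:

```python
# Monte Carlo estimate of R_hat_n(G) = E_eps sup_g (1/n) sum_i eps_i g(z_i)
# for a finite class represented by its values G_vals[j, i] = g_j(z_i).
import numpy as np

def empirical_rademacher(G_vals, draws=5000, rng=np.random.default_rng(0)):
    n = G_vals.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(draws, n))      # Rademacher signs
    return np.mean(np.max(eps @ G_vals.T / n, axis=1))  # average of the sup

# Example: 1D thresholds projected onto n points give n + 1 distinct
# 0/1 patterns, a "simple" class with a small Rademacher average.
n = 100
G_vals = np.tril(np.ones((n + 1, n)), k=-1)  # rows with 0, 1, ..., n ones
print(empirical_rademacher(G_vals))          # small; cf. sqrt(2 log card / n)
```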

Classification: Loss Function Disappears
Let us focus on binary classification with the indicator loss, and let F be a class of {0, 1}-valued functions. We have
ℓ(f(x), y) = I{f(x) ≠ y} = (1 − 2y) f(x) + y
and thus
R̂_n(G) = E[sup_{f∈F} {(1/n) ∑_{i=1}^n ε_i (f(x_i)(1 − 2y_i) + y_i)} | (x_1, y_1), ..., (x_n, y_n)]
= E[sup_{f∈F} {(1/n) ∑_{i=1}^n ε_i f(x_i)} | x_1, ..., x_n] = R̂_n(F)
because, given y_1, ..., y_n, the distribution of ε_i (1 − 2y_i) is the same as that of ε_i.
70 / 130

Vapnik-Chervonenkis Theory for Classification
We are now left examining
E[sup_{f∈F} {(1/n) ∑_{i=1}^n ε_i f(x_i)} | x_1, ..., x_n]
Given x_1, ..., x_n, define the projection of F onto the sample:
F|_{x_{1:n}} = {(f(x_1), ..., f(x_n)) : f ∈ F} ⊆ {0, 1}^n
Clearly, this is a finite set, and
R̂_n(F) = E_{ε_{1:n}} max_{v ∈ F|_{x_{1:n}}} (1/n) ∑_{i=1}^n ε_i v_i ≤ √(2 log card(F|_{x_{1:n}}) / n)
This is because a maximum of N (sub)Gaussian random variables is ∼ √(log N). The bound is nontrivial as long as log card(F|_{x_{1:n}}) = o(n).
71 / 130

Vapnik-Chervonenkis Theory for Classification
The growth function is defined as
Π_F(n) = max{card(F|_{x_1,...,x_n}) : x_1, ..., x_n ∈ X}
The growth function measures the expressiveness of F. In particular, if F can produce all possible sign patterns (that is, Π_F(n) = 2^n), the bound becomes useless. We say that F shatters a set x_1, ..., x_n if F|_{x_{1:n}} = {0, 1}^n. The Vapnik-Chervonenkis (VC) dimension of the class F is defined as
vc(F) = max{d : Π_F(d) = 2^d}
Vapnik-Chervonenkis-Sauer-Shelah Lemma: if d = vc(F) < ∞, then
Π_F(n) ≤ ∑_{i=0}^d (n choose i) ≤ (en/d)^d
72 / 130
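
For the 1D threshold class from earlier, the projection onto any n distinct points has exactly n + 1 patterns, so Π_F(n) = n + 1 and vc(F) = 1 (no two points can be shattered). A quick check (our sketch):

```python
# Size of the projection F|_{x_1..x_n} for F = { x -> I{x >= theta} }.
import numpy as np

def projection_size(x):
    x = np.sort(x)
    thetas = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    return len({tuple((x >= t).astype(int)) for t in thetas})

x = np.random.default_rng(0).uniform(size=10)
print(projection_size(x))   # -> 11 = n + 1, far below 2^10: no shattering
```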

Vapnik-Chervonenkis Theory for Classification
Conclusion: for any F with d = vc(F) < ∞, the ERM algorithm satisfies
E{L(f̂_n) − inf_{f∈F} L(f)} ≤ 2 √(2d log(en/d) / n)
While we proved the result in expectation, the same type of bound holds with high probability.
The VC dimension is a combinatorial dimension of a binary-valued function class. Its finiteness is necessary and sufficient for learnability if we place no assumptions on the distribution P.
Remark: the bound is similar to that obtained through compression. In fact, the exact relationship between compression and VC dimension is still an open question.
73 / 130

Vapnik-Chervonenkis Theory for Classification
Examples of VC classes:
▸ Half-spaces F = {I{⟨w, x⟩ + b ≥ 0} : w ∈ R^d, ∥w∥ = 1, b ∈ R} have vc(F) = d + 1
▸ For a vector space H of dimension d, the VC dimension of F = {I{h(x) ≥ 0} : h ∈ H} is at most d
▸ The set of Euclidean balls F = {I{∑_{i=1}^d (x_i − a_i)² ≤ b} : a ∈ R^d, b ∈ R} has VC dimension at most d + 2
▸ Functions that can be computed using a finite number of arithmetic operations (see Goldberg and Jerrum, 1995)
However, F = {f_α(x) = I{sin(αx) ≥ 0} : α ∈ R} has infinite VC dimension, so it is not correct to think of the VC dimension as the number of parameters!
Unfortunately, VC theory is unable to explain the good performance of neural networks and Support Vector Machines! This prompted the development of a margin-based theory.
74 / 130

Outline (repeated; see 30 / 130). Next: Large Margin Theory for Classification.
75 / 130

Classification with Real-Valued Functions
Many methods use I(F) = {I{f ≥ 0} : f ∈ F} for classification. The VC dimension can be very large, yet in practice the methods work well.
Example: f(x) = f_w(x) = ⟨w, ψ(x)⟩, where ψ is a mapping to a high-dimensional feature space (see Kernel Methods). The VC dimension of this set is typically huge (equal to the dimensionality of ψ(x)) or infinite, yet the methods perform well! Is there an explanation beyond VC theory?
76 / 130

Margins
Hard margin: ∃ f ∈ F : ∀ i, y_i f(x_i) ≥ γ
More generally, we hope to have
∃ f ∈ F : card({i : y_i f(x_i) < γ}) / n is small
77 / 130

Surrogate Loss
Define
φ(s) = 1 if s ≤ 0;   = 1 − s/γ if 0 < s < γ;   = 0 if s ≥ γ
Then:
I{y ≠ sign(f(x))} = I{y f(x) ≤ 0} ≤ φ(y f(x)) ≤ ψ(y f(x)) = I{y f(x) ≤ γ}
The function φ is an example of a surrogate loss function.
[Figure: φ(yf(x)) sandwiched between I{yf(x) ≤ 0} and ψ(yf(x)) = I{yf(x) ≤ γ}, as functions of the margin yf(x).]
Let
L_φ(f) = E φ(y f(x))   and   L̂_φ(f) = (1/n) ∑_{i=1}^n φ(y_i f(x_i))
Then L(f) ≤ L_φ(f) and L̂_φ(f) ≤ L̂_ψ(f).
78 / 130
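
The ramp function φ is one line of code; this sketch (ours) checks the sandwich I{s ≤ 0} ≤ φ(s) ≤ I{s ≤ γ} numerically over a grid of margins s = yf(x):

```python
# Ramp surrogate phi and the sandwich 1{s <= 0} <= phi(s) <= 1{s <= gamma}.
import numpy as np

gamma = 0.5
phi = lambda s: np.clip(1 - s / gamma, 0.0, 1.0)  # 1 for s<=0, 0 for s>=gamma

s = np.linspace(-1, 1, 201)             # margins y * f(x)
lower = (s <= 0).astype(float)          # 0-1 loss
upper = (s <= gamma).astype(float)      # margin 0-1 loss (psi)
assert np.all(lower <= phi(s)) and np.all(phi(s) <= upper)
```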

Surrogate Loss
Now consider uniform deviations for the surrogate loss:
E sup_{f∈F} {L_φ(f) − L̂_φ(f)}
We have shown that this quantity is at most 2 R_n(φ(F)) for
φ(F) = {g(z) = φ(y f(x)) : f ∈ F}
A useful property of Rademacher averages: R_n(φ(F)) ≤ L R_n(F) if φ is L-Lipschitz. Observe that in our example φ is (1/γ)-Lipschitz. Hence,
E sup_{f∈F} {L_φ(f) − L̂_φ(f)} ≤ (2/γ) R_n(F)
79 / 130

Margin Bound
The same result in high probability: with probability at least 1 − δ,
sup_{f∈F} {L_φ(f) − L̂_φ(f)} ≤ (2/γ) R_n(F) + √(log(1/δ) / (2n))
With probability at least 1 − δ, for all f ∈ F,
L(f) ≤ L̂_ψ(f) + (2/γ) R_n(F) + √(log(1/δ) / (2n))
If f̂_n minimizes the margin loss,
f̂_n = argmin_{f∈F} (1/n) ∑_{i=1}^n φ(y_i f(x_i))
then with probability at least 1 − δ,
L(f̂_n) ≤ inf_{f∈F} L_ψ(f) + (4/γ) R_n(F) + 2 √(log(1/δ) / (2n))
Note: φ assumes knowledge of γ, but this assumption can be removed.
80 / 130

Outline (repeated; see 30 / 130). Next: Properties of Rademacher Complexity.
81 / 130

Useful Properties
1. If F ⊆ G, then R̂_n(F) ≤ R̂_n(G)
2. R̂_n(F) = R̂_n(conv(F))
3. For any c ∈ R, R̂_n(cF) = |c| R̂_n(F)
4. If φ : R → R is L-Lipschitz (that is, |φ(a) − φ(b)| ≤ L|a − b| for all a, b ∈ R), then R̂_n(φ ∘ F) ≤ L R̂_n(F)
82 / 130

Rademacher Complexity of Kernel Classes
▸ Feature map φ : X → ℓ₂ and p.d. kernel K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩
▸ The set F_B = {f(x) = ⟨w, φ(x)⟩ : ∥w∥ ≤ B} is a ball in H
▸ Reproducing property: f(x) = ⟨f, K(x, ⋅)⟩
An easy calculation shows that the empirical Rademacher averages are upper bounded as
R̂_n(F_B) = E sup_{f∈F_B} (1/n) ∑_{i=1}^n ε_i f(x_i) = E sup_{f∈F_B} (1/n) ∑_{i=1}^n ε_i ⟨f, K(x_i, ⋅)⟩
= E sup_{f∈F_B} ⟨f, (1/n) ∑_{i=1}^n ε_i K(x_i, ⋅)⟩ = (B/n) E ∥∑_{i=1}^n ε_i K(x_i, ⋅)∥
≤ (B/n) (E ∑_{i,j=1}^n ε_i ε_j ⟨K(x_i, ⋅), K(x_j, ⋅)⟩)^{1/2} = (B/n) (∑_{i=1}^n K(x_i, x_i))^{1/2}
A data-independent bound of O(Bκ/√n) can be obtained if sup_{x∈X} K(x, x) ≤ κ². Then κ and B are the effective “dimensions”.
83 / 130
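
The final bound is one line of code given the kernel. A sketch (ours) with an RBF kernel, for which K(x, x) = 1 and hence κ = 1:

```python
# Kernel-ball bound: R_hat_n(F_B) <= (B/n) * sqrt(sum_i K(x_i, x_i)).
import numpy as np

rbf = lambda a, b: np.exp(-(a - b) ** 2)   # our kernel choice; K(x, x) = 1

B, n = 1.0, 500
x = np.random.default_rng(0).normal(size=n)
k_diag = rbf(x, x)                         # diagonal entries K(x_i, x_i)

bound = (B / n) * np.sqrt(k_diag.sum())    # = B / sqrt(n), since kappa = 1
print(bound)                               # -> 0.0447...
```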

  78. Other Examples Using properties of Rademacher averages, we may establish guarantees for learning with neural networks, decision trees, and so on. Powerful technique, typically requires only a few lines of algebra. Occasionally, covering numbers and scale-sensitive dimensions can be easier to deal with. 84 / 130

Outline (repeated; see 30 / 130). Next: Covering Numbers and Scale-Sensitive Dimensions.
85 / 130

Real-Valued Functions: Covering Numbers
Consider
▸ a class F of [−1, 1]-valued functions
▸ Y = [−1, 1] and ℓ(f(x), y) = |f(x) − y|
We have
E sup_{f∈F} {L(f) − L̂(f)} ≤ 2 E_{x_{1:n}} R̂_n(F)
For real-valued functions the cardinality of F|_{x_{1:n}} is infinite. However, similar functions f and f′ with (f(x_1), ..., f(x_n)) ≈ (f′(x_1), ..., f′(x_n)) should be treated as the same.
86 / 130
