  1. Theoretical Implications CS 535: Deep Learning

  2. Machine Learning Theory: Basic setup
  • Generic supervised learning setup:
    • For $(x_i, y_i)$, $i = 1, \dots, n$, drawn i.i.d. from the joint distribution $P(x, y)$, find the best function $f \in F$ that minimizes the error $E_{x,y}[L(f(x), y)]$
  • $L$ is a loss function, e.g.
    • Classification: $L(f(x), y) = 1$ if $f(x) \neq y$, and $0$ if $f(x) = y$
    • Regression: $L(f(x), y) = (f(x) - y)^2$
  • $F$ is a function class (consisting of many functions, e.g. all linear functions, all quadratic functions, all smooth functions, etc.)
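To make the setup concrete, here is a minimal sketch (hypothetical code, not from the course; assumes NumPy) of the two losses above and of the empirical average of the loss over a sample, which is the quantity we can actually compute; the true objective $E_{x,y}[L(f(x), y)]$ is an expectation over the distribution and can only be approximated by such averages.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """Classification loss: 1 if the prediction is wrong, 0 if it is right."""
    return (y_pred != y_true).astype(float)

def squared_loss(y_pred, y_true):
    """Regression loss: (f(x) - y)^2."""
    return (y_pred - y_true) ** 2

def empirical_risk(f, X, Y, loss):
    """Average loss of a candidate f over the sample (x_i, y_i), i = 1..n."""
    return np.mean(loss(f(X), Y))

# Toy usage: a linear predictor evaluated under the squared loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # n = 100 samples, 3 features
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
f = lambda X: X @ np.array([1.0, -2.0, 0.5])       # a candidate linear function
print("empirical risk:", empirical_risk(f, X, Y, squared_loss))
```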

  3. Machine Learning Theory: Generalization
  • Machine learning theory is about generalizing to unseen examples
  • Not the training set error!
  • And these theorems don't always hold (they hold with probability less than 1)
  • A generic machine learning generalization bound: for $(x_i, y_i)$, $i = 1, \dots, n$, drawn from the joint distribution $P(x, y)$, with probability $1 - \delta$,
    $E_{x,y}[f(x) \neq y] \le \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \Omega(F, \delta)$
    (error on the whole distribution ≤ error on the training set + "flexibility" of the function class)
  • How to represent "flexibility"? That's a course on ML theory

  4. What is "flexibility"?
  • Roughly, the more functions in $F$, the more flexible it is
  • Function class: all linear functions, $F = \{f(x) \mid f(x) = w^\top x + b\}$
    • Not very flexible, cannot even solve XOR
    • Small "flexibility" term, testing error not much more than training error
  • Function class: all 9th-degree polynomials, $F = \{f(x) \mid f(x) = w_1^\top x^9 + \cdots\}$
    • Super flexible
    • Big "flexibility" term, testing error can be much more than training error
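A quick numeric illustration of this contrast on a hypothetical 1-D toy problem (assumes NumPy; not an example from the slides): with only 15 training points, the degree-9 fit typically drives the training error far below the test error, while the linear fit keeps the two close, mirroring the small versus big "flexibility" term.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Noisy 1-D regression data from a smooth target (toy problem)."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(15)        # small training set
x_test, y_test = sample(1000)        # large "unseen" test set

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)        # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```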

  5. Flexibility and overfitting
  • For a very flexible function class:
    • Training error is NOT a good measure of testing error
  • Therefore, out-of-sample error estimates are needed:
    • Separate validation set to measure the error
    • Cross-validation
      • K-fold
      • Leave-one-out
  • With a flexible function class, these estimates will often turn out to be worse than the training error
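A minimal K-fold cross-validation sketch on the same kind of toy data (hypothetical code assuming NumPy; `kfold_mse` and the toy problem are illustrative choices, not from the course). Leave-one-out is the special case k = n.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5):
    """K-fold cross-validation estimate of out-of-sample MSE for a polynomial fit."""
    idx = np.random.default_rng(1).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                             # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        errors.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + 0.3 * rng.normal(size=30)
for degree in (1, 3, 9):
    print(f"degree {degree}: 5-fold CV MSE {kfold_mse(x, y, degree):.3f}")
```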

  6. Another twist of the generalization inequality
  • Nevertheless, you still want the training error to be small
  • So you don't always want to use linear classifiers/regressors
  • If the training-error term is already, say, 60%, the bound is of little use no matter how small the add-on ("flexibility") term is:
    $E_{x,y}[f(x) \neq y] \le \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \Omega(F, \delta)$
    (error on the whole distribution ≤ error on the training set + flexibility of the function class)

  7. How to deal with it when you do use a flexible function class
  • Regularization
    • To make the chance of choosing a highly flexible function low
  • Examples:
    • Ridge regression: $\min_w \|w^\top X - Y\|^2 + \lambda \|w\|^2$
      In order to choose a $w$ with a big $\|w\|^2$, you need to overcome the $\lambda \|w\|^2$ term
    • Kernel SVM: $\min_f \sum_i L(f(x_i), y_i) + \lambda \|f\|^2$
      In order to choose a very unsmooth function $f$, you need to overcome the $\lambda \|f\|^2$ term
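A sketch of the ridge case (hypothetical code assuming NumPy; it writes the objective with samples as rows, i.e. $\min_w \|Xw - y\|^2 + \lambda \|w\|^2$, which is the same problem): the closed-form solution is $w = (X^\top X + \lambda I)^{-1} X^\top y$, and increasing $\lambda$ visibly shrinks $\|w\|^2$, which is exactly the "overcome this term" effect described above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: argmin_w ||X w - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                  # 50 samples, 20-dimensional inputs
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=50)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||w||^2 = {np.sum(w ** 2):.3f}")
```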

  8. Bayesian Interpretation of Regularization
  • Assume that a certain prior on the parameters exists, and optimize for the MAP estimate
  • Examples:
    • Ridge regression: Gaussian prior on $w$: $P(w) = C \exp(-\lambda \|w\|^2)$, which leads to $\min_w \|w^\top X - Y\|^2 + \lambda \|w\|^2$
    • Kernel SVM: Gaussian process prior on $f$ (too complicated to explain simply), which leads to $\min_f \sum_i L(f(x_i), y_i) + \lambda \|f\|^2$
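Spelling out the ridge case as a worked calculation (standard, but with the extra assumption here of unit-variance Gaussian noise, so the likelihood is proportional to $\exp(-\tfrac{1}{2}\|w^\top X - Y\|^2)$; the factor $\tfrac{1}{2}$ can be absorbed into $\lambda$ by rescaling):

$\hat{w}_{\mathrm{MAP}} = \arg\max_w P(w \mid X, Y) = \arg\max_w P(Y \mid X, w)\, P(w)$
$\phantom{\hat{w}_{\mathrm{MAP}}} = \arg\max_w \exp\!\big(-\tfrac{1}{2}\|w^\top X - Y\|^2\big) \cdot C \exp\!\big(-\lambda \|w\|^2\big)$
$\phantom{\hat{w}_{\mathrm{MAP}}} = \arg\min_w \tfrac{1}{2}\|w^\top X - Y\|^2 + \lambda \|w\|^2$

Taking the negative logarithm turns the product of likelihood and prior into the regularized least-squares objective, so the MAP estimate coincides with the ridge solution.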

  9. Universal Approximators
  • Universal approximators (Barron 1994, Anthony and Bartlett 1999): they can approximate (learn) any smooth function efficiently (i.e., using a polynomial number of hidden units)
    • Kernel SVM
    • Neural networks
    • Boosted decision trees
  • Machine learning cannot do much better
    • No free lunch theorem
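The sketch below is not the construction behind these theorems; it is a quick random-feature illustration in the spirit of (Rahimi and Recht 2007), assuming NumPy (the target function, the width m = 50, and the small ridge term are arbitrary choices): a single hidden layer of fixed random tanh units, with only the output weights fit by least squares, already approximates a smooth 1-D function well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a smooth 1-D function we pretend is unknown.
x_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=200)

def hidden_features(x, W, b):
    """One hidden layer of m tanh units with fixed random weights."""
    return np.tanh(x @ W + b)

m = 50                                         # number of hidden units
W = rng.normal(size=(1, m))
b = rng.normal(size=m)

# Fit only the output weights by (slightly regularized) least squares.
H = hidden_features(x_train, W, b)             # shape (200, m)
w_out = np.linalg.solve(H.T @ H + 1e-3 * np.eye(m), H.T @ y_train)

x_grid = np.linspace(-3, 3, 500).reshape(-1, 1)
pred = hidden_features(x_grid, W, b) @ w_out
print("max |approximation error| on the grid:",
      np.max(np.abs(pred - np.sin(x_grid[:, 0]))))
```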

  10. No Free Lunch
  • (Wolpert 1996, Wolpert 2001) For any two learning algorithms, averaged over any training set $d$ and over all possible distributions $P$, their average error is the same
  • Practical machine learning only works because of certain correct assumptions about the data
  • SVM succeeds by successfully representing the general smoothness assumption as a convex optimization problem (with a global optimum)
  • However, if one goes for more complex assumptions, convexity is very hard to achieve!

  11. High-dimensionality
  • Philosophical discussion about high-dimensional spaces

  12. Distance-based Algorithms
  • K-Nearest Neighbors: weighted average of the k nearest neighbors
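A minimal sketch of one such algorithm (hypothetical code assuming NumPy; the inverse-distance weighting is an illustrative choice, since the slide does not fix the weights):

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, k=5, eps=1e-12):
    """Distance-weighted k-nearest-neighbor regression for a single query point."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]                   # indices of the k closest points
    weights = 1.0 / (dists[nn] + eps)            # closer neighbors weigh more
    return np.sum(weights * y_train[nn]) / np.sum(weights)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
print(knn_predict(np.array([0.2, -0.1]), X, y, k=5))
```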

  13. Curse of Dimensionality
  • Dimensionality brings interesting effects:
    • In a 10-dimensional space, to cover 10% of the data in a unit cube, one needs a box covering 80% of the range in each dimension
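The arithmetic behind this figure: for data uniform in the $p$-dimensional unit cube, a sub-cube of edge length $e$ captures a fraction $e^p$ of the data, so capturing a fraction $r$ requires edge length

$e_p(r) = r^{1/p}, \qquad e_{10}(0.1) = 0.1^{1/10} \approx 0.79,$

i.e. roughly 80% of the range along every coordinate.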

  14. High Dimensionality Facts
  • Every point is on the boundary
    • With $N$ points uniformly distributed in a $p$-dimensional unit ball, the closest point to the origin has a median distance of $d(p, N) = \big(1 - (1/2)^{1/N}\big)^{1/p}$
  • Every vector is almost always orthogonal to every other vector
    • Pick 2 random unit vectors $x_1$ and $x_2$; then with probability at least $1 - 1/p$, $\cos(x_1, x_2) = |x_1^\top x_2|$ is at most on the order of $\sqrt{\log p / p}$
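A quick Monte Carlo check of the near-orthogonality claim (hypothetical code assuming NumPy; the sample sizes are arbitrary): the typical $|\cos|$ between independent random unit vectors shrinks roughly like $1/\sqrt{p}$, comfortably below the $\sqrt{\log p / p}$ scale above.

```python
import numpy as np

rng = np.random.default_rng(0)

for p in (10, 100, 1000):
    # Pairs of independent random unit vectors in p dimensions.
    u = rng.normal(size=(5000, p))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v = rng.normal(size=(5000, p))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.abs(np.sum(u * v, axis=1))          # |cos| of each pair
    print(f"p = {p:5d}   median |cos| = {np.median(cos):.4f}   "
          f"sqrt(log p / p) = {np.sqrt(np.log(p) / p):.4f}")
```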

  15. Avoiding the Curse
  • Regularization helps us with the curse
  • Smoothness constraints also grow stronger with the dimensionality!
    • In 1-D: $\int |f'(x)| \, dx \le C$
    • In $p$ dimensions: $\int \left|\frac{\partial f}{\partial x_1}\right| dx_1 + \int \left|\frac{\partial f}{\partial x_2}\right| dx_2 + \cdots + \int \left|\frac{\partial f}{\partial x_p}\right| dx_p \le C$
  • We do not suffer from the curse if we ONLY estimate sufficiently smooth functions!

  16. Rademacher and Gaussian Complexity
  • Why would a CNN make sense?

  17. Rademacher and Gaussian Complexity

  18. Risk Bound

  19. Complexity Bound for NN

  20. References
  • (Barron 1994) Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14, 113-143.
  • (Anthony and Bartlett 1999) Anthony, M. and Bartlett, P. (1999). Neural Network Learning: Theoretical Foundations, 1st Edition.
  • (Wolpert 1996) Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390.
  • (Wolpert 2001) Wolpert, D. H. (2001). The supervised learning no-free-lunch theorems. In: Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications.
  • (Rahimi and Recht 2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. NIPS 2007.
