Theoretical Implications
CS 535: Deep Learning
Machine Learning Theory: Basic Setup
• Generic supervised learning setup: for $(x_i, y_i)_{i=1 \ldots n}$ drawn i.i.d. from the joint distribution $P(x, y)$, find the best function $f \in \mathcal{F}$ that minimizes the error $E_{x,y}[L(f(x), y)]$
• $L$ is a loss function, e.g.
  • Classification (0-1 loss): $L(f(x), y) = 1$ if $f(x) \neq y$, and $0$ if $f(x) = y$
  • Regression (squared loss): $L(f(x), y) = (f(x) - y)^2$
• $\mathcal{F}$ is a function class (consisting of many functions, e.g. all linear functions, all quadratic functions, all smooth functions, etc.)
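As a concrete complement to the setup above, here is a minimal Python sketch (my own illustration, not from the slides) of the two empirical losses applied to a hypothetical linear predictor; the names and the synthetic data are made up for the example:

```python
import numpy as np

def zero_one_risk(f, X, Y):
    """Empirical classification error: fraction of examples where f(x) != y."""
    preds = np.array([f(x) for x in X])
    return np.mean(preds != Y)

def squared_risk(f, X, Y):
    """Empirical regression error: average of (f(x) - y)^2."""
    preds = np.array([f(x) for x in X])
    return np.mean((preds - Y) ** 2)

# Hypothetical linear predictor f(x) = w.x + b evaluated on synthetic data
rng = np.random.default_rng(0)
w, b = np.array([1.0, -2.0]), 0.5
f_reg = lambda x: w @ x + b            # regression predictor
f_cls = lambda x: np.sign(w @ x + b)   # thresholded classification predictor
X = rng.standard_normal((100, 2))
Y_reg = X @ w + b + 0.1 * rng.standard_normal(100)
Y_cls = np.sign(X @ w + b)
print(squared_risk(f_reg, X, Y_reg), zero_one_risk(f_cls, X, Y_cls))
```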
Machine Learning Theory: Generalization
• Machine learning theory is about generalizing to unseen examples
  • Not the training set error!
  • And the theory doesn't always hold (it holds with probability less than 1)
• A generic machine learning generalization bound: for $(x_i, y_i)_{i=1 \ldots n}$ drawn from the joint distribution $P(x, y)$, with probability $1 - \delta$,
  $E_{x,y}[L(f(x), y)] \le \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \Omega(\mathcal{F}, n)$
  (error on the whole distribution $\le$ error on the training set $+$ "flexibility" of the function class)
• How to represent "flexibility"? That's a course on ML theory
What is "flexibility"?
• Roughly, the more functions in $\mathcal{F}$, the more flexible it is
• Function class: all linear functions, $\mathcal{F} = \{f(x) \mid f(x) = w^\top x + b\}$
  • Not very flexible, cannot even solve XOR
  • Small "flexibility" term, so the testing error is not much more than the training error
• Function class: all 9th-degree polynomials, $\mathcal{F} = \{f(x) \mid f(x) = w_1^\top x^9 + \cdots\}$
  • Super flexible
  • Big "flexibility" term, so the testing error can be much more than the training error (see the sketch below)
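A small numerical illustration of this flexibility gap, under my own assumptions (10 noisy training points from a sine curve, least-squares polynomial fits via numpy): the degree-9 polynomial drives the training error to essentially zero but generalizes much worse than the inflexible linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(10)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(200)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)             # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```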
Flexibility and Overfitting
• For a very flexible function class, the training error is NOT a good measure of the testing error
• Therefore, out-of-sample error estimates are needed:
  • A separate validation set to measure the error
  • Cross-validation (a k-fold sketch follows below)
    • K-fold
    • Leave-one-out
• With a flexible function class, these estimates will often turn out to be worse than the training error
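A minimal k-fold cross-validation sketch using only numpy; the learner here is ordinary least squares, standing in for whatever model is actually being validated, and the function names are illustrative:

```python
import numpy as np

def k_fold_cv_error(X, Y, fit, loss, k=5):
    """Average held-out error over k folds: train on k-1 folds, evaluate on the remaining one."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], Y[train])
        errors.append(loss(model, X[val], Y[val]))
    return float(np.mean(errors))
    # Leave-one-out is the special case k = len(X)

# Ordinary least squares as the stand-in learner
fit = lambda X, Y: np.linalg.lstsq(X, Y, rcond=None)[0]
loss = lambda w, X, Y: np.mean((X @ w - Y) ** 2)
X = np.random.randn(100, 3)
Y = X @ np.array([1.0, 0.5, -2.0]) + 0.1 * np.random.randn(100)
print(k_fold_cv_error(X, Y, fit, loss, k=5))
```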
Another Twist of the Generalization Inequality
• Nevertheless, you still want the training error to be small
  • So you don't always want to use linear classifiers/regressors
• $E_{x,y}[L(f(x), y)] \le \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \Omega(\mathcal{F}, n)$
  (error on the whole distribution $\le$ error on the training set $+$ flexibility of the function class)
• If the training-set error term is already 60%, a small add-on flexibility term does not help
How to Deal With It When You Do Use a Flexible Function Class
• Regularization
  • To make the chance of choosing a highly flexible function low
• Example: Ridge regression (a closed-form sketch follows below):
  $\min_w \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|^2$
  • In order to choose a $w$ with a big $\|w\|^2$, you need to overcome this term
• Example: Kernel SVM:
  $\min_f \sum_i L(f(x_i), y_i) + \lambda \|f\|^2$
  • In order to choose a very unsmooth function $f$, you need to overcome this term
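A sketch of ridge regression in closed form, using the matrix formulation $\min_w \|Xw - y\|^2 + \lambda \|w\|^2$ (equivalent to the per-example sum above); the synthetic data and lambda values are my own, and the point to observe is that $\|w\|^2$ shrinks as $\lambda$ grows, which is exactly the "overcome this term" effect:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 via the normal equations (X^T X + lam I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)
for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||w||^2 = {np.sum(w ** 2):.3f}")
```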
Bayesian Interpretation of Regularization
• Assume that a certain prior on the parameters exists, and optimize for the MAP estimate (a short derivation follows below)
• Example: Ridge regression: Gaussian prior on $w$: $P(w) = C \exp(-\lambda \|w\|^2)$
  $\min_w \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|^2$
• Example: Kernel SVM: Gaussian process prior on $f$ (too complicated to explain simply...)
  $\min_f \sum_i L(f(x_i), y_i) + \lambda \|f\|^2$
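A short worked derivation (standard material, not spelled out on the slide) of why the Gaussian prior $P(w) = C\exp(-\lambda\|w\|^2)$ turns the MAP estimate into the ridge objective, assuming Gaussian observation noise with variance $\sigma^2$:

```latex
% MAP with Gaussian likelihood y | X, w ~ N(Xw, sigma^2 I) and Gaussian prior P(w) = C exp(-lambda ||w||^2)
\begin{align*}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_w \; p(w \mid X, y)
   = \arg\max_w \; p(y \mid X, w)\, P(w) \\
  &= \arg\max_w \; \exp\!\Big(-\tfrac{1}{2\sigma^2}\,\|Xw - y\|^2\Big)\, C \exp\!\big(-\lambda \|w\|^2\big) \\
  &= \arg\min_w \; \tfrac{1}{2\sigma^2}\,\|Xw - y\|^2 + \lambda \|w\|^2 ,
\end{align*}
% i.e. the ridge-regression objective, up to a rescaling of lambda by 2 sigma^2.
```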
Universal Approximators
• Universal approximators (Barron 1994, Bartlett et al. 1999): they can approximate (learn) any smooth function efficiently, i.e. using a polynomial number of hidden units
  • Kernel SVM
  • Neural networks
  • Boosted decision trees
• Machine learning cannot do much better
  • No free lunch theorem
No Free Lunch
• (Wolpert 1996, Wolpert 2001) For any two learning algorithms, averaged over any training set $d$ and over all possible distributions $P$, their average error is the same
• Practical machine learning only works because of certain correct assumptions about the data
  • SVM succeeds by representing the general smoothness assumption as a convex optimization problem (with a global optimum)
  • However, if one goes for more complex assumptions, convexity is very hard to achieve!
High Dimensionality
Philosophical discussion about high-dimensional spaces
Distance-based Algorithms
• K-nearest neighbors: predict with the weighted average of the k nearest neighbors (a minimal sketch follows below)
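A minimal distance-weighted k-NN regression sketch; the inverse-distance weighting is my assumption, since the slide does not specify a particular weighting scheme:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5, eps=1e-8):
    """Predict y at x as the inverse-distance-weighted average of the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)   # closer neighbors get larger weights
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 2))
y_train = np.sin(X_train[:, 0]) + X_train[:, 1]
print(knn_predict(X_train, y_train, np.array([0.5, -0.5]), k=5))
```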
Curse of Dimensionality
• Dimensionality brings interesting effects:
  • In a 10-dimensional unit cube, to cover 10% of the data, one needs a box covering about 80% of the range in each dimension (the arithmetic is worked out below)
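The arithmetic behind that claim: a sub-cube capturing a fraction $r$ of uniformly distributed data in $p$ dimensions needs edge length $r^{1/p}$, and $0.1^{1/10} \approx 0.79$, i.e. roughly 80% of the range:

```python
# Edge length of a sub-cube needed to capture a fraction r of uniformly distributed data in p dimensions
r = 0.10
for p in (1, 2, 10, 100):
    print(f"p = {p:3d}: edge length = {r ** (1.0 / p):.3f} of the full range")
```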
High Dimensionality Facts
• Every point is on the boundary
  • With $N$ uniformly distributed points in a $p$-dimensional unit ball, the closest point to the origin has a median distance of $d(p, N) = \bigl(1 - (\tfrac{1}{2})^{1/N}\bigr)^{1/p}$
• Every vector is almost always orthogonal to every other
  • Pick 2 random unit vectors $x_1$ and $x_2$ in $d$ dimensions; the probability that $|\cos(x_1, x_2)| = |x_1^\top x_2| \ge \sqrt{\log d / d}$ is less than $1/d$
• (Both facts are checked numerically below.)
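A quick Monte Carlo check of both facts (my own sketch; the median-distance formula is the reconstruction given above, which matches the standard textbook result, and the trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fact 1: median distance from the origin to the closest of N uniform points in the unit p-ball.
p, N, trials = 10, 500, 200
dirs = rng.standard_normal((trials, N, p))
dirs /= np.linalg.norm(dirs, axis=2, keepdims=True)        # random directions on the unit sphere
radii = rng.uniform(size=(trials, N, 1)) ** (1.0 / p)      # radius density proportional to r^(p-1)
nearest = np.linalg.norm(dirs * radii, axis=2).min(axis=1)
print("simulated median:", np.median(nearest))
print("formula value:   ", (1 - 0.5 ** (1 / N)) ** (1 / p))

# Fact 2: two random unit vectors in high dimension are nearly orthogonal.
d = 1000
u, v = rng.standard_normal(d), rng.standard_normal(d)
cos = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print("|cos| in d=1000:", cos, "   sqrt(log d / d):", np.sqrt(np.log(d) / d))
```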
Avoiding the Curse
• Regularization helps us with the curse
• Smoothness constraints also grow stronger with the dimensionality!
  • In 1-D: $\int |f'(x)| \, dx \le B$
  • In $p$ dimensions: $\int \bigl|\tfrac{\partial f}{\partial x_1}\bigr| \, dx_1 + \int \bigl|\tfrac{\partial f}{\partial x_2}\bigr| \, dx_2 + \cdots + \int \bigl|\tfrac{\partial f}{\partial x_p}\bigr| \, dx_p \le B$
• We do not suffer from the curse if we ONLY estimate sufficiently smooth functions!
Rademacher and Gaussian Complexity
Why would a CNN make sense?
Rademacher and Gaussian Complexity
Risk Bound
Complexity Bound for NN
References
• (Barron 1994) Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14, 113-143.
• (Bartlett et al. 1999) Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
• (Wolpert 1996) Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390.
• (Wolpert 2001) Wolpert, D. H. (2001). The supervised learning no-free-lunch theorems. In Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications.
• (Rahimi and Recht 2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. NIPS 2007.