Theoretical Implications
CS 535: Deep Learning
Machine Learning Theory: Basic Setup
• Generic supervised learning setup: for $(x_i, y_i)_{i=1 \ldots n}$ drawn i.i.d. from the joint distribution $P(x, y)$, find the best function $f \in \mathcal{F}$ that minimizes the error $E_{x,y}[L(f(x), y)]$
• $L$ is a loss function, e.g.
  • Classification (0-1 loss): $L(f(x), y) = 1$ if $f(x) \neq y$, and $0$ if $f(x) = y$
  • Regression (squared loss): $L(f(x), y) = (f(x) - y)^2$
• $\mathcal{F}$ is a function class (consisting of many functions, e.g. all linear functions, all quadratic functions, all smooth functions, etc.)
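As a concrete complement to the setup above, here is a minimal Python sketch (my own illustration, not from the slides) of the two empirical losses applied to a hypothetical linear predictor; the names and the synthetic data are made up for the example:

```python
import numpy as np

def zero_one_risk(f, X, Y):
    """Empirical classification error: fraction of examples where f(x) != y."""
    preds = np.array([f(x) for x in X])
    return np.mean(preds != Y)

def squared_risk(f, X, Y):
    """Empirical regression error: average of (f(x) - y)^2."""
    preds = np.array([f(x) for x in X])
    return np.mean((preds - Y) ** 2)

# Hypothetical linear predictor f(x) = w.x + b evaluated on synthetic data
rng = np.random.default_rng(0)
w, b = np.array([1.0, -2.0]), 0.5
f_reg = lambda x: w @ x + b            # regression predictor
f_cls = lambda x: np.sign(w @ x + b)   # thresholded classification predictor
X = rng.standard_normal((100, 2))
Y_reg = X @ w + b + 0.1 * rng.standard_normal(100)
Y_cls = np.sign(X @ w + b)
print(squared_risk(f_reg, X, Y_reg), zero_one_risk(f_cls, X, Y_cls))
```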
Machine Learning Theory: Generalization
• Machine learning theory is about generalizing to unseen examples
  • Not the training set error!
  • And the theory doesn't always hold (it holds with probability less than 1)
• A generic machine learning generalization bound: for $(x_i, y_i)_{i=1 \ldots n}$ drawn from the joint distribution $P(x, y)$, with probability $1 - \delta$,
  $E_{x,y}[L(f(x), y)] \le \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \Omega(\mathcal{F}, n)$
  (error on the whole distribution $\le$ error on the training set $+$ "flexibility" of the function class)
• How to represent "flexibility"? That's a course on ML theory
What is "flexibility"?
• Roughly, the more functions in $\mathcal{F}$, the more flexible it is
• Function class: all linear functions, $\mathcal{F} = \{f(x) \mid f(x) = w^\top x + b\}$
  • Not very flexible, cannot even solve XOR
  • Small "flexibility" term, so the testing error is not much more than the training error
• Function class: all 9th-degree polynomials, $\mathcal{F} = \{f(x) \mid f(x) = w_1^\top x^9 + \cdots\}$
  • Super flexible
  • Big "flexibility" term, so the testing error can be much more than the training error (see the sketch below)
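A small numerical illustration of this flexibility gap, under my own assumptions (10 noisy training points from a sine curve, least-squares polynomial fits via numpy): the degree-9 polynomial drives the training error to essentially zero but generalizes much worse than the inflexible linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(10)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.standard_normal(200)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)             # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```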
Flexibility and Overfitting
• For a very flexible function class, the training error is NOT a good measure of the testing error
• Therefore, out-of-sample error estimates are needed:
  • A separate validation set to measure the error
  • Cross-validation (a k-fold sketch follows below)
    • K-fold
    • Leave-one-out
• With a flexible function class, these estimates will often turn out to be worse than the training error
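A minimal k-fold cross-validation sketch using only numpy; the learner here is ordinary least squares, standing in for whatever model is actually being validated, and the function names are illustrative:

```python
import numpy as np

def k_fold_cv_error(X, Y, fit, loss, k=5):
    """Average held-out error over k folds: train on k-1 folds, evaluate on the remaining one."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], Y[train])
        errors.append(loss(model, X[val], Y[val]))
    return float(np.mean(errors))
    # Leave-one-out is the special case k = len(X)

# Ordinary least squares as the stand-in learner
fit = lambda X, Y: np.linalg.lstsq(X, Y, rcond=None)[0]
loss = lambda w, X, Y: np.mean((X @ w - Y) ** 2)
X = np.random.randn(100, 3)
Y = X @ np.array([1.0, 0.5, -2.0]) + 0.1 * np.random.randn(100)
print(k_fold_cv_error(X, Y, fit, loss, k=5))
```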
Another Twist of the Generalization Inequality
• Nevertheless, you still want the training error to be small
  • So you don't always want to use linear classifiers/regressors
• $E_{x,y}[L(f(x), y)] \le \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \Omega(\mathcal{F}, n)$
  (error on the whole distribution $\le$ error on the training set $+$ flexibility of the function class)
• If the training-set error term is already 60%, a small add-on flexibility term does not help
How to Deal With It When You Do Use a Flexible Function Class
• Regularization
  • To make the chance of choosing a highly flexible function low
• Example: Ridge regression (a closed-form sketch follows below):
  $\min_w \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|^2$
  • In order to choose a $w$ with a big $\|w\|^2$, you need to overcome this term
• Example: Kernel SVM:
  $\min_f \sum_i L(f(x_i), y_i) + \lambda \|f\|^2$
  • In order to choose a very unsmooth function $f$, you need to overcome this term
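A sketch of ridge regression in closed form, using the matrix formulation $\min_w \|Xw - y\|^2 + \lambda \|w\|^2$ (equivalent to the per-example sum above); the synthetic data and lambda values are my own, and the point to observe is that $\|w\|^2$ shrinks as $\lambda$ grows, which is exactly the "overcome this term" effect:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 via the normal equations (X^T X + lam I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)
for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||w||^2 = {np.sum(w ** 2):.3f}")
```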
Bayesian Interpretation of Regularization
• Assume that a certain prior on the parameters exists, and optimize for the MAP estimate (a short derivation follows below)
• Example: Ridge regression: Gaussian prior on $w$: $P(w) = C \exp(-\lambda \|w\|^2)$
  $\min_w \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|^2$
• Example: Kernel SVM: Gaussian process prior on $f$ (too complicated to explain simply...)
  $\min_f \sum_i L(f(x_i), y_i) + \lambda \|f\|^2$
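A short worked derivation (standard material, not spelled out on the slide) of why the Gaussian prior $P(w) = C\exp(-\lambda\|w\|^2)$ turns the MAP estimate into the ridge objective, assuming Gaussian observation noise with variance $\sigma^2$:

```latex
% MAP with Gaussian likelihood y | X, w ~ N(Xw, sigma^2 I) and Gaussian prior P(w) = C exp(-lambda ||w||^2)
\begin{align*}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_w \; p(w \mid X, y)
   = \arg\max_w \; p(y \mid X, w)\, P(w) \\
  &= \arg\max_w \; \exp\!\Big(-\tfrac{1}{2\sigma^2}\,\|Xw - y\|^2\Big)\, C \exp\!\big(-\lambda \|w\|^2\big) \\
  &= \arg\min_w \; \tfrac{1}{2\sigma^2}\,\|Xw - y\|^2 + \lambda \|w\|^2 ,
\end{align*}
% i.e. the ridge-regression objective, up to a rescaling of lambda by 2 sigma^2.
```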
Universal Approximators
• Universal approximators (Barron 1994, Bartlett et al. 1999): they can approximate (learn) any smooth function efficiently, i.e. using a polynomial number of hidden units
  • Kernel SVM
  • Neural networks
  • Boosted decision trees
• Machine learning cannot do much better
  • No free lunch theorem
No Free Lunch
• (Wolpert 1996, Wolpert 2001) For any two learning algorithms, averaged over any training set $d$ and over all possible distributions $P$, their average error is the same
• Practical machine learning only works because of certain correct assumptions about the data
  • SVM succeeds by representing the general smoothness assumption as a convex optimization problem (with a global optimum)
  • However, if one goes for more complex assumptions, convexity is very hard to achieve!
High Dimensionality
Philosophical discussion about high-dimensional spaces
Distance-based Algorithms
• K-nearest neighbors: predict with the weighted average of the k nearest neighbors (a minimal sketch follows below)
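A minimal distance-weighted k-NN regression sketch; the inverse-distance weighting is my assumption, since the slide does not specify a particular weighting scheme:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5, eps=1e-8):
    """Predict y at x as the inverse-distance-weighted average of the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)   # closer neighbors get larger weights
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 2))
y_train = np.sin(X_train[:, 0]) + X_train[:, 1]
print(knn_predict(X_train, y_train, np.array([0.5, -0.5]), k=5))
```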
Curse of Dimensionality
• Dimensionality brings interesting effects:
  • In a 10-dimensional unit cube, to cover 10% of the data, one needs a box covering about 80% of the range in each dimension (the arithmetic is worked out below)
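The arithmetic behind that claim: a sub-cube capturing a fraction $r$ of uniformly distributed data in $p$ dimensions needs edge length $r^{1/p}$, and $0.1^{1/10} \approx 0.79$, i.e. roughly 80% of the range:

```python
# Edge length of a sub-cube needed to capture a fraction r of uniformly distributed data in p dimensions
r = 0.10
for p in (1, 2, 10, 100):
    print(f"p = {p:3d}: edge length = {r ** (1.0 / p):.3f} of the full range")
```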
High Dimensionality Facts
• Every point is on the boundary
  • With $N$ uniformly distributed points in a $p$-dimensional unit ball, the closest point to the origin has a median distance of $d(p, N) = \bigl(1 - (\tfrac{1}{2})^{1/N}\bigr)^{1/p}$
• Every vector is almost always orthogonal to every other
  • Pick 2 random unit vectors $x_1$ and $x_2$ in $d$ dimensions; the probability that $|\cos(x_1, x_2)| = |x_1^\top x_2| \ge \sqrt{\log d / d}$ is less than $1/d$
• (Both facts are checked numerically below.)
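A quick Monte Carlo check of both facts (my own sketch; the median-distance formula is the reconstruction given above, which matches the standard textbook result, and the trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fact 1: median distance from the origin to the closest of N uniform points in the unit p-ball.
p, N, trials = 10, 500, 200
dirs = rng.standard_normal((trials, N, p))
dirs /= np.linalg.norm(dirs, axis=2, keepdims=True)        # random directions on the unit sphere
radii = rng.uniform(size=(trials, N, 1)) ** (1.0 / p)      # radius density proportional to r^(p-1)
nearest = np.linalg.norm(dirs * radii, axis=2).min(axis=1)
print("simulated median:", np.median(nearest))
print("formula value:   ", (1 - 0.5 ** (1 / N)) ** (1 / p))

# Fact 2: two random unit vectors in high dimension are nearly orthogonal.
d = 1000
u, v = rng.standard_normal(d), rng.standard_normal(d)
cos = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print("|cos| in d=1000:", cos, "   sqrt(log d / d):", np.sqrt(np.log(d) / d))
```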
Avoiding the Curse
• Regularization helps us with the curse
• Smoothness constraints also grow stronger with the dimensionality!
  • In 1-D: $\int |f'(x)| \, dx \le B$
  • In $p$ dimensions: $\int \bigl|\tfrac{\partial f}{\partial x_1}\bigr| \, dx_1 + \int \bigl|\tfrac{\partial f}{\partial x_2}\bigr| \, dx_2 + \cdots + \int \bigl|\tfrac{\partial f}{\partial x_p}\bigr| \, dx_p \le B$
• We do not suffer from the curse if we ONLY estimate sufficiently smooth functions!
Rademacher and Gaussian Complexity
Why would a CNN make sense?
Rademacher and Gaussian Complexity
Risk Bound
Complexity Bound for NN
References
• (Barron 1994) Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14, 113-143.
• (Bartlett et al. 1999) Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
• (Wolpert 1996) Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390.
• (Wolpert 2001) Wolpert, D. H. (2001). The supervised learning no-free-lunch theorems. In Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications.
• (Rahimi and Recht 2007) Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. NIPS 2007.