
Introduction to Machine Learning: Vapnik-Chervonenkis Theory (Barnabás Póczos)



  1. Introduction to Machine Learning Vapnik – Chervonenkis Theory Barnabás Póczos

  2. Empirical Risk and True Risk

  3. Empirical Risk. Shorthand: true risk of f (deterministic), Bayes risk. Let us use the empirical counterpart: the empirical risk.
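For reference, the standard definitions under the 0-1 loss (the loss used from slide 12 onward); the symbols R, R-hat and R* are notation introduced here, not copied from the slides:

```latex
R(f)          \;=\; \mathbb{P}\bigl(f(X)\neq Y\bigr)                                     % true risk of a fixed classifier f
R^{*}         \;=\; \inf_{f} R(f)                                                        % Bayes risk
\hat{R}_n(f)  \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{f(X_i)\neq Y_i\}               % empirical risk on n samples
```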

  4. Empirical Risk Minimization. Law of Large Numbers: the empirical risk converges to the true risk.
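In the notation above, a standard way to write this: for any fixed classifier f, the strong law of large numbers gives

```latex
\hat{R}_n(f) \;=\; \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{f(X_i)\neq Y_i\}
\;\xrightarrow{\ \text{a.s.}\ }\;
\mathbb{E}\,\mathbf{1}\{f(X)\neq Y\} \;=\; R(f)
\qquad (n\to\infty).
```

Note that this holds for each fixed f; making the statement uniform over a whole class of classifiers is exactly what the rest of the lecture is about.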

  5. Overfitting in Classification with ERM. Generative model, Bayes classifier, Bayes risk. Picture from David Pal.

  6. Overfitting in Classification with ERM. n-th order thresholded polynomials: empirical risk vs. Bayes risk. Picture from David Pal.

  7. Overfitting in Regression. If we allow very complicated predictors, we could overfit the training data. Example: regression with a polynomial of degree k-1 (k coefficients). [Plots: example fits for k=1 (constant), k=2 (linear), k=3 (quadratic), and k=7 (6th-order polynomial).]
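To reproduce the same effect, here is a minimal numpy sketch (the sine target, noise level, and sample sizes are illustrative choices, not taken from the slides): it fits polynomials of increasing degree to a small training set and prints training vs. test error.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1D regression problem: smooth target plus noise (illustrative choice)
def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test = 10, 200
x_train = rng.uniform(0.0, 1.0, n_train)
y_train = target(x_train) + 0.2 * rng.normal(size=n_train)
x_test = np.linspace(0.0, 1.0, n_test)
y_test = target(x_test) + 0.2 * rng.normal(size=n_test)

for degree in (0, 1, 2, 6):                        # k = 1, 2, 3, 7 coefficients, as on the slide
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training error keeps shrinking as the degree grows, while the test error eventually blows up: the empirical risk stops being a good proxy for the true risk.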

  8. Solutions to Overfitting

  9. Solutions to Overfitting: Structural Risk Minimization. Notation: empirical risk, risk. Goal: control the model (approximation) error. Solution: Structural Risk Minimization (SRM).
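A generic form of the SRM idea (a sketch only; the specific penalty on the slide is not reproduced here): work with nested classes and minimize empirical risk plus a complexity penalty,

```latex
\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots, \qquad
\hat{f}_n \;=\; \arg\min_{k}\ \min_{f\in\mathcal{F}_k}
\Bigl[\ \hat{R}_n(f) \;+\; \mathrm{pen}(\mathcal{F}_k, n)\ \Bigr],
```

where pen(F_k, n) is an upper bound on the deviation between empirical and true risk over F_k, of the kind derived in the rest of the lecture.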

  10. Big Picture. Ultimate goal: control the gap to the Bayes risk by splitting it into estimation error and approximation error. [Figure: decomposition of the excess risk over the Bayes risk into approximation error plus estimation error.]
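In the notation introduced above, the decomposition the picture illustrates is

```latex
R(\hat{f}_n) - R^{*}
\;=\;
\underbrace{\Bigl(R(\hat{f}_n) - \inf_{f\in\mathcal{F}} R(f)\Bigr)}_{\text{estimation error}}
\;+\;
\underbrace{\Bigl(\inf_{f\in\mathcal{F}} R(f) - R^{*}\Bigr)}_{\text{approximation error}} .
```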

  11. Effect of Model Complexity. If we allow very complicated predictors, we could overfit the training data. [Plot: prediction error vs. model complexity for a fixed # of training data, including the prediction error on the training data.] The empirical risk is no longer a good indicator of the true risk.

  12. Classification using the 0-1 loss

  13. The Bayes Classifier. Lemma I and Lemma II. Proofs: Lemma I is trivial from the definition; Lemma II is a surprisingly long calculation.
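The lemmas themselves are not reproduced here; in standard treatments the corresponding facts are that the Bayes classifier thresholds the regression function at 1/2 and that its risk can be written in closed form:

```latex
g^{*}(x) \;=\; \mathbf{1}\{\eta(x)\ge 1/2\}, \qquad \eta(x) \;=\; \mathbb{P}(Y=1\mid X=x),
\qquad
R^{*} \;=\; R(g^{*}) \;=\; \mathbb{E}\bigl[\min\{\eta(X),\,1-\eta(X)\}\bigr].
```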

  14. The Bayes Classifier. This is what the learning algorithm produces. We will need these definitions, please copy them!
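The definitions in question are presumably the ERM output and the best classifier in the class; in the notation used here (assuming the minimizers exist):

```latex
\hat{f}_n \;=\; \arg\min_{f\in\mathcal{F}} \hat{R}_n(f)
\quad\text{(what the learning algorithm, ERM, produces)},
\qquad
f^{*}_{\mathcal{F}} \;=\; \arg\min_{f\in\mathcal{F}} R(f)
\quad\text{(best classifier in the class } \mathcal{F}\text{)}.
```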

  15. The Bayes Classifier. Theorem I: a bound on the estimation error, i.e. on how far the true risk of what the learning algorithm produces is from the best risk in the class.

  16. Proof of Theorem I (bound on the estimation error of what the learning algorithm produces). Proof:
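A reconstruction of the standard statement and argument, in the notation above (the slide's own derivation is not reproduced verbatim):

```latex
% Theorem I (estimation error bound):
R(\hat{f}_n) \;-\; \inf_{f\in\mathcal{F}} R(f)
\;\le\;
2 \sup_{f\in\mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|.

% Proof sketch: for any f in F,
R(\hat{f}_n) - R(f)
\;=\;
\bigl(R(\hat{f}_n)-\hat{R}_n(\hat{f}_n)\bigr)
+\bigl(\hat{R}_n(\hat{f}_n)-\hat{R}_n(f)\bigr)
+\bigl(\hat{R}_n(f)-R(f)\bigr);
% the middle term is <= 0 because ERM minimizes the empirical risk,
% each of the other two terms is <= sup_f |R_n-hat(f) - R(f)|,
% and taking the infimum over f in F finishes the proof.
```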

  17. The Bayes Classifier. Theorem II (for what the learning algorithm produces). Proof: trivial.

  18. Corollary. Main message: it's enough to derive upper bounds for the following quantity.
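In the notation above, the quantity to bound is the worst-case deviation between empirical and true risk over the class:

```latex
\sup_{f\in\mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|.
```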

  19. Illustration of the Risks

  20. It's enough to derive upper bounds for this worst-case deviation. It is a random variable that we need to bound! We will bound it with tail bounds!

  21. Hoeffding's inequality (1963), and a special case.
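A standard statement (general bounded case plus the Bernoulli special case used next):

```latex
% Hoeffding (1963): X_1,...,X_n independent, a_i <= X_i <= b_i, S_n = X_1 + ... + X_n. For t > 0:
\mathbb{P}\bigl(|S_n - \mathbb{E}S_n| \ge t\bigr)
\;\le\;
2\exp\!\Bigl(-\tfrac{2t^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\Bigr).

% Special case: X_i in [0,1] (e.g. Bernoulli) and the sample mean:
\mathbb{P}\bigl(|\bar{X}_n - \mathbb{E}\bar{X}_n| \ge \varepsilon\bigr)
\;\le\; 2e^{-2n\varepsilon^2}.
```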

  22. Binomial distributions. Our goal is to bound the deviation of the empirical risk for a fixed classifier; the per-example 0-1 losses are Bernoulli(p) random variables. Therefore, from Hoeffding we get the bound below. Yuppie!!!
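Concretely, since each indicator 1{f(X_i) != Y_i} is Bernoulli with mean R(f), the special case of Hoeffding gives, for each fixed f,

```latex
\mathbb{P}\bigl(|\hat{R}_n(f) - R(f)| \ge \varepsilon\bigr) \;\le\; 2e^{-2n\varepsilon^2}.
```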

  23. Inversion. From Hoeffding we have the tail bound above; inverting it for a given confidence level gives the bound below.
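Setting the right-hand side equal to a confidence level delta and solving for epsilon: with probability at least 1 - delta,

```latex
\bigl|\hat{R}_n(f) - R(f)\bigr| \;\le\; \sqrt{\frac{\log(2/\delta)}{2n}}
\qquad\text{(for a fixed } f\text{)}.
```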

  24. Union Bound. Our goal is to bound the worst-case deviation over a finite class; we already know the bound for a single classifier. Theorem [tail bound on the 'deviation' in the worst case]. Note: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier f is the furthest from its true risk. Proof: apply the union bound to the single-classifier tail bounds.
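For a finite class F with N classifiers, the union bound gives

```latex
\mathbb{P}\Bigl(\max_{f\in\mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \varepsilon\Bigr)
\;\le\;
\sum_{f\in\mathcal{F}} \mathbb{P}\bigl(|\hat{R}_n(f) - R(f)| \ge \varepsilon\bigr)
\;\le\;
2N e^{-2n\varepsilon^2}.
```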

  25. Inversion of the Union Bound. We already know the worst-case tail bound; therefore, inverting it gives the uniform bound below.
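Inverting the finite-class tail bound: with probability at least 1 - delta,

```latex
\bigl|\hat{R}_n(f) - R(f)\bigr|
\;\le\;
\sqrt{\frac{\log N + \log(2/\delta)}{2n}}
\qquad\text{for all } f\in\mathcal{F} \text{ simultaneously}.
```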

  26. Inversion of the Union Bound. • The larger N is, the looser the bound. • This result is distribution free: true for all P(X,Y) distributions. • It is useless if N is big, or infinite... (e.g. all possible hyperplanes). This can be fixed with the McDiarmid inequality and the VC dimension...

  27. Concentration and Expected Value

  28. The Expected Error. Our goal is to bound the expected worst-case deviation; we already know a tail bound (concentration inequality). Theorem [expected 'deviation' in the worst case]. Again, worst case does not mean the worst classifier in terms of classification accuracy; it means the classifier f whose empirical risk is furthest from its true risk. Proof: we already know a tail bound. (From that we actually get a slightly weaker inequality... oh well.)
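One standard version of such a bound, obtained via the sub-Gaussian maximal inequality rather than by integrating the tail bound (so the constant may differ from the one on the slide): for a finite class of N classifiers,

```latex
\mathbb{E}\Bigl[\max_{f\in\mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|\Bigr]
\;\le\;
\sqrt{\frac{\log(2N)}{2n}}.
```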

  29. Function classes with infinitely many elements

  30. McDiarmid's Bounded Difference Inequality. If the bounded-difference condition holds, it follows that the function concentrates around its expectation.
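A standard statement of the inequality:

```latex
% McDiarmid: Z_1,...,Z_n independent, and g satisfies the bounded-difference condition
% |g(z_1,...,z_i,...,z_n) - g(z_1,...,z_i',...,z_n)| <= c_i   for all i and all arguments. Then for eps > 0:
\mathbb{P}\bigl(g(Z_1,\dots,Z_n) - \mathbb{E}\,g(Z_1,\dots,Z_n) \ge \varepsilon\bigr)
\;\le\;
\exp\!\Bigl(-\tfrac{2\varepsilon^2}{\sum_{i=1}^{n} c_i^2}\Bigr),
% and the same bound holds for the other tail.
```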

  31. Bounded Difference Condition. Our main goal is to bound the worst-case deviation. Lemma. Proof: let g denote the worst-case deviation as a function of the sample. Observation: g satisfies the bounded difference condition => McDiarmid can be applied to g!
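A sketch of this step in the notation above (the function g is presumably the worst-case deviation):

```latex
g(Z_1,\dots,Z_n) \;=\; \sup_{f\in\mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|,
\qquad Z_i = (X_i, Y_i).
% Changing a single sample Z_i changes every empirical risk by at most 1/n,
% hence changes g by at most c_i = 1/n; McDiarmid then gives
\mathbb{P}\bigl(g - \mathbb{E}\,g \ge \varepsilon\bigr) \;\le\; e^{-2n\varepsilon^2}.
```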

  32. Bounded Difference Condition. Corollary: it remains to bound the expected worst-case deviation; the Vapnik-Chervonenkis inequality does that with the shatter coefficient (and the VC dimension)!

  33. Vapnik-Chervonenkis inequality. Our main goal is to bound the worst-case deviation; we already know the reduction via McDiarmid. Vapnik-Chervonenkis inequality, corollary, and Vapnik-Chervonenkis theorem.
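One standard expectation form of the VC inequality (via symmetrization and Massart's finite-class lemma; constants vary between textbooks, so the slide's version may differ), where S_F(n) is the shatter coefficient defined on slides 38-40:

```latex
\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|\Bigr]
\;\le\;
2\sqrt{\frac{2\log\bigl(2\,S_{\mathcal{F}}(n)\bigr)}{n}}.
```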

  34. Shattering

  35. How many points can a linear boundary classify exactly in 1D? For 2 points there exists a placement such that all labelings can be classified; for 3 points the alternating labeling (+, -, +) cannot be realized by a single threshold. The answer is 2.

  36. How many points can a linear boundary classify exactly in 2D? For 3 points there exists a placement such that all labelings can be classified; no matter how we place 4 points, there is a labeling that cannot be classified. The answer is 3.

  37. How many points can a linear boundary classify exactly in 3D? The answer is 4 (place the points as a tetrahedron). How many points can a linear boundary classify exactly in d dimensions? The answer is d+1.

  38. Growth function, shatter coefficient. Definition: the maximum number of behaviors (distinct label patterns) the class can produce on n points. [Table: the distinct binary behavior vectors on a sample; 5 in this example.]
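Written out in the notation used here:

```latex
S_{\mathcal{F}}(n)
\;=\;
\max_{x_1,\dots,x_n}\;
\Bigl|\bigl\{\,(f(x_1),\dots,f(x_n)) \;:\; f\in\mathcal{F}\,\bigr\}\Bigr|
\qquad\text{(maximum number of behaviors on } n \text{ points)}.
```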

  39. Growth function, shatter coefficient. Definition: the maximum number of behaviors on n points. Example: half-spaces in 2D. [Figure: labelings of a few points by half-planes.]

  40. VC-dimension. Growth function, shatter coefficient: the maximum number of behaviors (# behaviors) on n points. Definition: shattering. Definition: VC-dimension. Note: see below.
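The standard definitions, in the notation above:

```latex
% Shattering: F shatters {x_1,...,x_n} if it realizes all 2^n labelings of these points,
% i.e.  |{ (f(x_1),...,f(x_n)) : f in F }| = 2^n.
% VC-dimension: the size of the largest shatterable point set,
d_{VC}(\mathcal{F}) \;=\; \max\{\, n \;:\; S_{\mathcal{F}}(n) = 2^{n} \,\}.
% Note: S_F(n) <= 2^n always.
```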

  41. VC-dimension. Definition: # behaviors.

  42. VC-dimension. When lower-bounding the VC dimension, you may choose the placement of the points (such that you want to maximize the # of different behaviors). [Figure: example placements with + and - labels.]

  43. Examples

  44. VC dim of decision stumps (axis-aligned linear separators) in 2D. What's the VC dim of decision stumps in 2D? There is a placement of 3 points that can be shattered => VC dim ≥ 3. [Figure: 3 points and stump labelings.]
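As a sanity check of the "VC dim ≥ 3" claim, a small brute-force sketch (the 3-point placement X3 below is an arbitrary illustrative choice, not taken from the slides) enumerates every labeling realizable by axis-aligned decision stumps on a given point set and tests whether the set is shattered:

```python
import numpy as np

def stump_labelings(X):
    """All labelings of the rows of X realizable by axis-aligned decision stumps."""
    n, d = X.shape
    labelings = set()
    for dim in range(d):
        vals = np.sort(X[:, dim])
        # candidate thresholds: below all points, between consecutive points, above all points
        thresholds = np.concatenate(([vals[0] - 1.0],
                                     (vals[:-1] + vals[1:]) / 2.0,
                                     [vals[-1] + 1.0]))
        for t in thresholds:
            for sign in (1, -1):
                labels = np.where(X[:, dim] > t, sign, -sign)
                labelings.add(tuple(labels))
    return labelings

def is_shattered(X):
    """True iff decision stumps realize all 2^n labelings of the points in X."""
    return len(stump_labelings(X)) == 2 ** len(X)

# a placement of 3 points in 2D (illustrative choice)
X3 = np.array([[0.0, 0.0],
               [1.0, 2.0],
               [2.0, 1.0]])
print(is_shattered(X3))   # True => VC dimension of stumps in 2D is at least 3
```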

  45. VC dim of decision stumps (axis-aligned linear separators) in 2D. To show that the VC dim equals 3: for every placement of 4 points there exists a labeling that cannot be realized. Cases: the 4 points form a convex quadrilateral, 3 of them are collinear, or 1 lies in the convex hull of the other 3. => VC dim = 3.

  46. VC dim of axis-parallel rectangles in 2D. What's the VC dim of axis-parallel rectangles in 2D? There is a placement of 3 points that can be shattered => VC dim ≥ 3. [Figure: 3 points and rectangle labelings.]

  47. VC dim of axis-parallel rectangles in 2D. There is a placement of 4 points that can be shattered => VC dim ≥ 4.

  48. VC dim of axis-parallel rectangles in 2D. To show that the VC dim equals 4: for every placement of 5 points there exists a labeling that cannot be realized. Cases: the 5 points form a convex pentagon, 4 are collinear, 2 lie in the convex hull of the others, or 1 lies in the convex hull of the others. => VC dim = 4.

  49. Sauer's Lemma. We already know that the shatter coefficient is at most 2^n [exponential in n]. Sauer's lemma: the VC dimension can be used to upper bound the shatter coefficient [polynomial in n]. Corollary: see below.
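The standard statement and the usual corollaries, with d the VC dimension:

```latex
S_{\mathcal{F}}(n) \;\le\; \sum_{i=0}^{d}\binom{n}{i} \;\le\; (n+1)^{d},
\qquad\text{and for } n\ge d:\quad
S_{\mathcal{F}}(n) \;\le\; \Bigl(\frac{en}{d}\Bigr)^{d}.
```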

  50. Vapnik-Chervonenkis inequality [we don't prove this]. From Sauer's lemma the shatter coefficient in the VC inequality can be replaced by a polynomial in n; therefore we obtain a bound on the estimation error.
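Combining Theorem I, the expectation form of the VC inequality given above, and Sauer's lemma yields, up to constants, the familiar rate (with d the VC dimension of F):

```latex
\mathbb{E}\bigl[R(\hat{f}_n)\bigr] - \inf_{f\in\mathcal{F}} R(f)
\;\le\;
2\,\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} |\hat{R}_n(f) - R(f)|\Bigr]
\;=\;
O\!\left(\sqrt{\frac{d\,\log n}{n}}\right).
```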

  51. Linear (hyperplane) classifiers. We already know the VC dimension of hyperplanes (d+1 in d dimensions, slide 37), so the estimation-error bounds above apply directly to hyperplane classifiers.

  52. Vapnik-Chervonenkis Theorem. We already know the reduction from McDiarmid and the Vapnik-Chervonenkis inequality [we don't prove them]. Corollary: the Vapnik-Chervonenkis theorem. Compare this with what we already know from Hoeffding + the union bound for a finite function class.
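A classical tail-bound form of the VC theorem, shown next to the finite-class bound for comparison (the constants follow one common textbook version and may differ from the slide's):

```latex
% finite class of size N (Hoeffding + union bound):
\mathbb{P}\Bigl(\sup_{f\in\mathcal{F}} |\hat{R}_n(f)-R(f)| > \varepsilon\Bigr) \;\le\; 2N\,e^{-2n\varepsilon^2}
% class with shatter coefficient S_F(n) (Vapnik-Chervonenkis theorem):
\mathbb{P}\Bigl(\sup_{f\in\mathcal{F}} |\hat{R}_n(f)-R(f)| > \varepsilon\Bigr) \;\le\; 8\,S_{\mathcal{F}}(n)\,e^{-n\varepsilon^2/32}
```

The shatter coefficient plays the role that the cardinality N played in the finite case.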

  53. PAC Bound for the Estimation Error. The VC theorem plus inversion gives a high-probability bound on the estimation error.
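Inverting the tail-bound form stated above (set the right-hand side equal to delta and solve for epsilon): with probability at least 1 - delta,

```latex
\sup_{f\in\mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|
\;\le\;
\sqrt{\frac{32\bigl(\log S_{\mathcal{F}}(n) + \log(8/\delta)\bigr)}{n}},
% and by Theorem I the estimation error is at most twice this quantity.
```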

  54. What you need to know. The complexity of a classifier class is measured by the number of points it can classify exactly. Finite case: the number of hypotheses. Infinite case: the shatter coefficient and the VC dimension.

  55. Thanks for your attention ☺

  56. Attic

  57. Proof of Sauer's Lemma. Write all the different behaviors on a sample (x_1, x_2, ..., x_n) as the rows of a binary matrix. [Example: a behavior matrix on 3 points with VC dim = 2.]

  58. Proof of Sauer's Lemma. Shattered subsets of columns [example behavior matrix]. We will prove that the number of behaviors (rows) is at most the number of shattered column subsets, which in turn is at most the binomial sum. Therefore, in this example: 5 ≤ 1+3+3 = 7, since VC = 2 and n = 3.

  59. Proof of Sauer's Lemma. Shattered subsets of columns [example behavior matrix]. Lemma 1: the number of shattered column subsets is at most the binomial sum; in this example: 6 ≤ 1+3+3 = 7. Lemma 2: the number of rows is at most the number of shattered column subsets, for any binary matrix with no repeated rows; in this example: 5 ≤ 6.

  60. Proof of Lemma 1. Shattered subsets of columns [example behavior matrix]; in this example: 6 ≤ 1+3+3 = 7. Lemma 1, proof. Q.E.D.
