Introduction to Machine Learning 13. Learning Theory Geoff Gordon - PowerPoint PPT Presentation

Introduction to Machine Learning
13. Learning Theory
Geoff Gordon and Alex Smola
Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701x
10-701

The Problem
• Training
  • Data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn iid from $p(x, y)$
  • Loss function $l(x, y, f(x))$
  • Function class $\mathcal{F} = \{f : \Omega[f] \le c\}$
  • Empirical risk minimization problem
    $$\mathop{\mathrm{minimize}}_{f \in \mathcal{F}} \; \frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i))$$
• Testing
  $$\mathbf{E}_{(x,y) \sim p(x,y)}\left[l(x, y, f(x))\right]$$
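The empirical risk minimization problem above can be sketched for a small finite function class. This is a minimal illustration, not the lecture's own code: the threshold classifiers, the toy data, and the names `zero_one_loss`, `make_threshold_classifier`, `empirical_risk`, and `erm` are all assumptions made for the example.

```python
def zero_one_loss(x, y, prediction):
    """Binary loss l(x, y, f(x)): 1 for a mistake, 0 otherwise."""
    return 0.0 if y == prediction else 1.0


def make_threshold_classifier(t):
    """Hypothetical 1-D classifier: predict 1 when x > t, else 0."""
    return lambda x: 1 if x > t else 0


def empirical_risk(f, data, loss):
    """Average loss over the sample: (1/m) * sum_i l(x_i, y_i, f(x_i))."""
    return sum(loss(x, y, f(x)) for x, y in data) / len(data)


def erm(function_class, data, loss):
    """Return (f, risk) for the member of a finite class with the smallest empirical risk."""
    best = min(function_class, key=lambda f: empirical_risk(f, data, loss))
    return best, empirical_risk(best, data, loss)


if __name__ == "__main__":
    # Toy sample (assumed): label is 1 for the larger x values.
    data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
    function_class = [make_threshold_classifier(t) for t in (0.0, 0.5, 1.0)]
    f_star, risk = erm(function_class, data, zero_one_loss)
    print("empirical risk of the minimizer:", risk)
```

The test-time quantity on the slide is the expectation of the same loss under $p(x, y)$, which this sketch does not estimate.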

classifier (polynomial regression)

linear classifier (underfitting)

quadratic classifier

Typical behavior
[Figure: error plotted against model complexity, with training error and test error curves; annotation: "How do we find this?"]

A broken reasoning
• Hoeffding bound for a bounded random variable
  $$\Pr\left(|\hat{\mu}_m - \mu| > \epsilon\right) \le 2 \exp\left(-\frac{2 m \epsilon^2}{c^2}\right)$$
• Function that minimizes empirical risk: $f^*$
• Risk bounded by $L$
• Apply the bound to get, with high probability,
  $$\epsilon \le L \sqrt{\frac{\log(2/\delta)}{2m}}$$
• Why does our bound diverge in reality?
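To make the single-function bound concrete, the sketch below solves $2\exp(-2m\epsilon^2/L^2) = \delta$ for $\epsilon$ and evaluates it for a few sample sizes. The function names and the example values of `m` and `delta` are assumptions chosen for illustration.

```python
import math


def hoeffding_tail(m, eps, L=1.0):
    """Right-hand side of the Hoeffding bound: 2 * exp(-2 m eps^2 / L^2)."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2 / L ** 2)


def hoeffding_epsilon(m, delta, L=1.0):
    """Deviation eps at which the Hoeffding tail equals delta."""
    return L * math.sqrt(math.log(2.0 / delta) / (2.0 * m))


if __name__ == "__main__":
    # The deviation shrinks like 1/sqrt(m) for a single, fixed function.
    for m in (100, 1000, 10000):
        print(m, hoeffding_epsilon(m, delta=0.05))
```

This bound holds for one function fixed before seeing the data, which is exactly the assumption that breaks once $f^*$ is selected using the same sample.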

Multiple testing
• Tossing an unbiased coin 10 times
[Figure: bar chart with x-axis 1 to 20 and y-axis 0 to 7; annotation: best 'strategy']

Multiple testing
• Tossing an unbiased coin 100 times
[Figure: bar chart with x-axis 1 to 20 and y-axis 0 to 70; annotation: best 'strategy']
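The coin-tossing experiment can be simulated in a few lines. This is a sketch under assumptions: 20 independent fair coins stand in for the 20 'strategies', and the function names, seed, and repeat count are hypothetical choices, not taken from the slides.

```python
import random


def best_of_n_heads(n_coins, n_tosses, rng):
    """Toss n_coins fair coins n_tosses times each; return the largest head fraction."""
    best = 0.0
    for _ in range(n_coins):
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        best = max(best, heads / n_tosses)
    return best


def average_best(n_coins, n_tosses, repeats=500, seed=0):
    """Average, over repeated experiments, of the best coin's head fraction."""
    rng = random.Random(seed)
    total = sum(best_of_n_heads(n_coins, n_tosses, rng) for _ in range(repeats))
    return total / repeats


if __name__ == "__main__":
    # Every coin has true head probability 0.5, yet the best of 20 looks better.
    print("1 coin, 10 tosses:   ", average_best(1, 10))
    print("20 coins, 10 tosses: ", average_best(20, 10))
    print("20 coins, 100 tosses:", average_best(20, 100))
```

Selecting the best of several coins inflates the apparent head rate even though no coin is biased, and the inflation shrinks as the number of tosses grows.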

Multiple testing
• We invoke the bound each time we test
• Picking the best out of N gives us N opportunities to get it wrong!
• Union bound
  $$\Pr\left\{|R_{\mathrm{emp}}[f] - R[f]| > \epsilon\right\} \le \sum_{f' \in \mathcal{F}} \Pr\left\{|R_{\mathrm{emp}}[f'] - R[f']| > \epsilon\right\}$$
• Testing over all functions in function class
• Split error probability up among all functions
• Take supremum over all terms

Multiple testing
• Our first generalization bound
  $$\epsilon \le L \sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2m}}$$
• Putting it all together
  $$R[f^*] \le \inf_{f \in \mathcal{F}} R_{\mathrm{emp}}[f] + L \sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2m}}$$
• What if function class is not discrete?
• What if we have binary loss?
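The finite-class bound follows from setting the union-bound failure probability $2|\mathcal{F}|\exp(-2m\epsilon^2/L^2)$ equal to $\delta$. The sketch below evaluates both sides; the function names and the example class sizes are assumptions for illustration only.

```python
import math


def union_bound_failure(num_functions, m, eps, L=1.0):
    """Union bound over a finite class: |F| * 2 * exp(-2 m eps^2 / L^2)."""
    return num_functions * 2.0 * math.exp(-2.0 * m * eps ** 2 / L ** 2)


def finite_class_epsilon(num_functions, m, delta, L=1.0):
    """eps = L * sqrt((log|F| + log(2/delta)) / (2 m))."""
    return L * math.sqrt((math.log(num_functions) + math.log(2.0 / delta)) / (2.0 * m))


if __name__ == "__main__":
    # The price of searching over |F| functions grows only logarithmically.
    for size in (1, 10, 1000, 10 ** 6):
        print(size, finite_class_epsilon(size, m=5000, delta=0.05))
```

With `num_functions=1` the expression reduces to the single-function Hoeffding deviation.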

Covering Numbers
• What if we have an uncountable function class?
• Approximate by finite cover
• Now bound depends on discretization, too

Covering Numbers
• Approximation error $\epsilon$
• Covering number $N(\mathcal{F}, \epsilon)$ (actually need metric)
  $$R[f^*] \le \inf_{f \in \mathcal{F}} R_{\mathrm{emp}}[f] + L \sqrt{\frac{\log N(\mathcal{F}, \epsilon) + \log(2/\delta)}{2m}} + L' \epsilon$$

VC Dimension
• Binary classification problem
• Given locations, enumerate all possible ways these points can be separated
• Example: linear separation

VC Dimension
• Binary classification problem
• Given locations, enumerate all possible ways these points can be separated
• Exponential growth up to the VC dimension $h$, then polynomial growth
  $$R[f^*] \le \inf_{f \in \mathcal{F}} R_{\mathrm{emp}}[f] + \sqrt{\frac{h\left(\log(2m/h) + 1\right) + \log(4/\delta)}{m}}$$
• Examples
  • d-dimensional linear functions have $h = d$
  • $\sin(x/w)$ has infinite $h$
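The VC confidence term can be evaluated numerically to see how it trades off sample size $m$ against VC dimension $h$. This is a small sketch of the formula as written on the slide; the function names and the example values are assumptions.

```python
import math


def vc_epsilon(h, m, delta):
    """Confidence term sqrt((h * (log(2m/h) + 1) + log(4/delta)) / m)."""
    return math.sqrt((h * (math.log(2.0 * m / h) + 1.0) + math.log(4.0 / delta)) / m)


def vc_bound(empirical_risk, h, m, delta):
    """Upper bound on the risk: empirical risk plus the VC confidence term."""
    return empirical_risk + vc_epsilon(h, m, delta)


if __name__ == "__main__":
    # Example values (assumed): a class with h = 10 and a range of sample sizes.
    for m in (1000, 10000, 100000):
        print(m, vc_epsilon(h=10, m=m, delta=0.05))
```

The term is only informative when $m$ is well above $h$; for small samples it can exceed 1, which is vacuous for binary loss.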

Rademacher Averages
• Nontrivial bound (state of the art)
• Reasonably easy to compute
• Recall McDiarmid's inequality
  $$\Pr\left(\left|f(x_1, \ldots, x_m) - \mathbf{E}_{X_1, \ldots, X_m}\left[f(x_1, \ldots, x_m)\right]\right| > \epsilon\right) \le 2 \exp\left(-2 \epsilon^2 C^{-2}\right)$$
  where $|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)| \le c_i$ and $C^2 = \sum_{i=1}^{m} c_i^2$
• Bound worst case deviation
  $$\Pr\left\{\sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i)) - \mathbf{E}_{(x,y)}\left[l(x, y, f(x))\right]\right| > \epsilon\right\}$$

Rademacher Averages
• Worst case deviation
  $$\Xi(X, Y) := \sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i)) - \mathbf{E}_{(x,y)}\left[l(x, y, f(x))\right]\right|$$
• If we change a single observation pair
  $$\left|\Xi(X, Y) - \Xi\left(X \setminus i \cup \{x_i'\}, Y \setminus i \cup \{y_i'\}\right)\right| \le L/m$$
• Apply McDiarmid's bound with $c_i = L/m$, so that $C^2 = L^2/m$, to get
  $$\Pr\left\{\left|\Xi(X, Y) - \mathbf{E}_{X,Y}\left[\Xi(X, Y)\right]\right| > \epsilon\right\} \le 2 \exp\left(-2 m \epsilon^2 L^{-2}\right)$$
• Worst case deviation not far from typical case

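Rademacher Averages
$$\begin{aligned}
& \mathbf{E}_{X,Y}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i)) - \mathbf{E}_{(x,y)}\left[l(x, y, f(x))\right]\right|\right] \\
={}& \mathbf{E}_{X,Y}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f(x_i)) - \mathbf{E}_{X',Y'}\left[\frac{1}{m} \sum_{i=1}^{m} l(x_i', y_i', f(x_i'))\right]\right|\right] \\
\le{}& \mathbf{E}_{X,Y,X',Y'}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^{m} \left[l(x_i, y_i, f(x_i)) - l(x_i', y_i', f(x_i'))\right]\right|\right] \\
={}& \mathbf{E}_{X,Y,X',Y'} \mathbf{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{m} \sum_{i=1}^{m} \sigma_i \left[l(x_i, y_i, f(x_i)) - l(x_i', y_i', f(x_i'))\right]\right|\right] \\
\le{}& \frac{2}{m} \mathbf{E}_{X,Y} \mathbf{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i \, l(x_i, y_i, f(x_i))\right]
\end{aligned}$$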

Rademacher Averages
• Putting it all together
  $$R[f] \le R_{\mathrm{emp}}[f] + 2 R[\mathcal{F}, m] + L \sqrt{\frac{\log(2/\delta)}{2m}}$$
  where $R[\mathcal{F}, m]$ captures the averaging behavior for random labels
• Rademacher average can be bounded easily for linear function classes by solving a convex optimization problem.
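For a finite set of functions evaluated on a fixed sample, the expectation over the random signs $\sigma$ can be estimated by Monte Carlo. The sketch below is an illustration under assumptions: it computes an empirical (sample-conditional) version of $\frac{1}{m}\mathbf{E}_{\sigma}\left[\sup_f \sum_i \sigma_i l_i\right]$ from a precomputed table of losses, and the names `empirical_rademacher`, `rademacher_bound`, and `loss_table` are hypothetical.

```python
import math
import random


def empirical_rademacher(loss_table, n_draws=2000, seed=0):
    """Monte Carlo estimate of (1/m) E_sigma[max_f sum_i sigma_i * loss_table[f][i]].

    loss_table[f][i] is the loss of function f on example i (finite class, fixed sample).
    """
    rng = random.Random(seed)
    m = len(loss_table[0])
    total = 0.0
    for _ in range(n_draws):
        # Rademacher variables: independent uniform signs in {-1, +1}.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        total += max(sum(s * l for s, l in zip(sigma, row)) for row in loss_table)
    return total / (n_draws * m)


def rademacher_bound(empirical_risk, rademacher_average, m, delta, L=1.0):
    """R_emp[f] + 2 R[F, m] + L * sqrt(log(2/delta) / (2 m))."""
    return empirical_risk + 2.0 * rademacher_average + L * math.sqrt(math.log(2.0 / delta) / (2.0 * m))


if __name__ == "__main__":
    # A class that can realize every 0/1 loss pattern on 4 points fits random signs well.
    rich = [[float((b >> i) & 1) for i in range(4)] for b in range(16)]
    poor = [[1.0, 0.0, 1.0, 0.0]]
    print("rich class:", empirical_rademacher(rich))
    print("poor class:", empirical_rademacher(poor))
```

A class that can match any sign pattern gets a large average, while a single fixed function averages out to roughly zero, which matches the reading of the term as behavior on random labels.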

Some Alternatives
• Validation set
  • Train on training set (e.g. 90% of the data)
  • Check performance on remaining 10%
  • Use only if dataset is huge and few tests
• Cross-validation
  • Average over validation sets (e.g. 10-fold)
  • Nested cross-validation for model selection (e.g. 10-fold in each fold to find parameters)
• Bayesian statistics
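A k-fold cross-validation loop can be sketched with the standard library alone. This is a minimal illustration, assuming a generic `fit` callable that returns a predictor; the mean predictor, the squared loss, and all function names here are hypothetical stand-ins rather than anything prescribed by the slides.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, validation_indices) for k shuffled folds; assumes n >= k."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for t, fold in enumerate(folds) if t != i for j in fold]
        yield train, folds[i]


def cross_val_error(xs, ys, fit, loss, k=10, seed=0):
    """Average validation loss over k folds."""
    fold_errors = []
    for train, val in k_fold_splits(len(xs), k, seed):
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        fold_errors.append(sum(loss(ys[j], model(xs[j])) for j in val) / len(val))
    return sum(fold_errors) / k


def fit_mean(xs, ys):
    """Hypothetical baseline model: always predict the training mean of y."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def squared_loss(y, prediction):
    return (y - prediction) ** 2


if __name__ == "__main__":
    xs = [i / 50.0 for i in range(50)]
    ys = [2.0 * x for x in xs]
    print("10-fold error of the mean predictor:", cross_val_error(xs, ys, fit_mean, squared_loss))
```

Nested cross-validation repeats the same loop inside each training fold to choose parameters, so that the outer validation folds are never used for model selection.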
