
Learnability Beyond Uniform Convergence
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
“Algorithmic Learning Theory”, Lyon 2012
Joint work with:
N. Srebro, O. Shamir, K. Sridharan (COLT’09, JMLR’11)
A. Daniely, S. Sabato, S. Ben-David (COLT’11)
A. Daniely, S. Sabato (NIPS’12)
Shai Shalev-Shwartz (Hebrew U), Learnability Beyond Uniform Convergence, Oct’12

The Fundamental Theorem of Learning Theory
For Binary Classification
Finite VC ⇒ Uniform Convergence (VC’71)
Uniform Convergence ⇒ Learnable with ERM (trivial)
Learnable with ERM ⇒ Learnable (trivial)
Learnable ⇒ Finite VC (NFL, W’96)
VC = Vapnik and Chervonenkis, W = Wolpert

The Fundamental Theorem of Learning Theory
For Regression
Finite fat-shattering ⇒ Uniform Convergence (BLW’96, ABCH’97)
Uniform Convergence ⇒ Learnable with ERM (trivial)
Learnable with ERM ⇒ Learnable (trivial)
Learnable ⇒ Finite fat-shattering (KS’94, BLW’96, ABCH’97)
BLW = Bartlett, Long, Williamson. ABCH = Alon, Ben-David, Cesa-Bianchi, Haussler. KS = Kearns and Schapire

For general learning problems?
Uniform Convergence ⇒ Learnable with ERM (trivial)
Learnable with ERM ⇒ Learnable (trivial)
Learnable ⇒ Uniform Convergence: not true
Not true in “Convex learning problems”!
Not true even in “multiclass categorization”!
What is learnable? How to learn?

Outline
1 Definitions
2 Learnability without uniform convergence
3 Characterizing Learnability using Stability
4 Characterizing Multiclass Learnability
5 Analyzing specific, practically relevant, classes
6 Open Questions

The General Learning Setting (Vapnik)
Hypothesis class $\mathcal{H}$
Examples domain $Z$ with unknown distribution $\mathcal{D}$
Loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}$
Given: Training set $S \sim \mathcal{D}^m$
Goal: Solve $\min_{h \in \mathcal{H}} L(h)$, where $L(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$, in the Probably (w.p. $\ge 1 - \delta$) Approximately Correct (up to $\epsilon$) sense
Training loss: $L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i)$
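The training loss and the idea of minimizing it over a hypothesis class can be sketched in a few lines of Python. This is only an illustration: the threshold classifiers, the finite grid standing in for $\mathcal{H}$, and the names `empirical_risk` and `erm` are assumptions made for the example, not part of the talk.

```python
import random

def empirical_risk(h, sample, loss):
    """Training loss L_S(h) = (1/m) * sum_i loss(h, z_i)."""
    return sum(loss(h, z) for z in sample) / len(sample)

def erm(hypotheses, sample, loss):
    """Return a hypothesis of a finite class that minimizes the training loss."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample, loss))

# Hypothetical toy problem: threshold classifiers h_t(x) = 1[x >= t] on [0, 1].
def zero_one_loss(t, z):
    x, y = z
    prediction = 1 if x >= t else 0
    return float(prediction != y)

rng = random.Random(0)
true_threshold = 0.3
sample = [(x, int(x >= true_threshold)) for x in (rng.random() for _ in range(200))]
thresholds = [i / 100 for i in range(101)]  # finite grid standing in for H

best = erm(thresholds, sample, zero_one_loss)
train_loss = empirical_risk(best, sample, zero_one_loss)
print(best, train_loss)
```

Because the true threshold lies on the grid, some hypothesis has zero training loss, so the returned one does as well.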

Examples
Binary classification:
  $Z = \mathcal{X} \times \{0, 1\}$
  $h \in \mathcal{H}$ is a predictor $h : \mathcal{X} \to \{0, 1\}$
  $\ell(h, (x, y)) = \mathbb{1}[h(x) \ne y]$
Multiclass categorization:
  $Z = \mathcal{X} \times \mathcal{Y}$
  $h \in \mathcal{H}$ is a predictor $h : \mathcal{X} \to \mathcal{Y}$
  $\ell(h, (x, y)) = \mathbb{1}[h(x) \ne y]$
k-means clustering:
  $Z = \mathbb{R}^d$
  $\mathcal{H} \subset (\mathbb{R}^d)^k$ specifies $k$ cluster centers
  $\ell((\mu_1, \ldots, \mu_k), z) = \min_j \|\mu_j - z\|$
Density Estimation:
  $h$ is a parameter of a density $p_h(z)$
  $\ell(h, z) = -\log p_h(z)$
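The loss functions in these examples translate directly into code. The sketch below is illustrative only; in particular, the one-dimensional Gaussian parametrization `(mean, std)` used for density estimation is an assumption chosen for the example, since the slide leaves the density family unspecified.

```python
import math

def zero_one_loss(h, z):
    """Binary or multiclass loss: 1 if the predictor h errs on (x, y), else 0."""
    x, y = z
    return 1.0 if h(x) != y else 0.0

def k_means_loss(centers, z):
    """Distance from the point z to the nearest of the k cluster centers."""
    return min(math.dist(mu, z) for mu in centers)

def gaussian_neg_log_likelihood(h, z):
    """-log p_h(z) for a 1-D Gaussian with h = (mean, std); hypothetical family."""
    mean, std = h
    return 0.5 * math.log(2 * math.pi * std ** 2) + (z - mean) ** 2 / (2 * std ** 2)

print(zero_one_loss(lambda x: x > 0, (1.5, True)))
print(k_means_loss([(0.0, 0.0), (3.0, 4.0)], (3.0, 0.0)))
print(gaussian_neg_log_likelihood((0.0, 1.0), 0.0))
```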

Learnability, ERM, Uniform convergence
Uniform Convergence: For $m \ge m_{UC}(\epsilon, \delta)$,
  $\mathbb{P}_{S \sim \mathcal{D}^m}\left[\forall h \in \mathcal{H},\ |L_S(h) - L(h)| \le \epsilon\right] \ge 1 - \delta$
Learnable: $\exists \mathcal{A}$ s.t. for $m \ge m_{PAC}(\epsilon, \delta)$,
  $\mathbb{P}_{S \sim \mathcal{D}^m}\left[L(\mathcal{A}(S)) \le \min_{h \in \mathcal{H}} L(h) + \epsilon\right] \ge 1 - \delta$
ERM: An algorithm that returns $\mathcal{A}(S) \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$
Learnable by arbitrary ERM (with rate $m_{ERM}(\epsilon, \delta)$): Like “Learnable”, but $\mathcal{A}$ should be an ERM.
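The implication from uniform convergence to learnability by an arbitrary ERM follows from these definitions in one chain of inequalities. The sketch below assumes the minimum of $L$ over $\mathcal{H}$ is attained at some $h^\star$ (a name introduced here only for the derivation) and that $m \ge m_{UC}(\epsilon/2, \delta)$.

```latex
% With probability at least 1 - \delta over S \sim \mathcal{D}^m,
% |L_S(h) - L(h)| \le \epsilon/2 holds for all h \in \mathcal{H}. On that event,
% for any ERM output \mathcal{A}(S):
\begin{align*}
L(\mathcal{A}(S))
  &\le L_S(\mathcal{A}(S)) + \tfrac{\epsilon}{2}
     && \text{(uniform convergence at } \mathcal{A}(S)\text{)} \\
  &\le L_S(h^\star) + \tfrac{\epsilon}{2}
     && \text{(} \mathcal{A}(S) \text{ minimizes } L_S \text{)} \\
  &\le L(h^\star) + \epsilon
     && \text{(uniform convergence at } h^\star\text{)} .
\end{align*}
% Hence m_{ERM}(\epsilon, \delta) \le m_{UC}(\epsilon/2, \delta).
```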

For Binary Classification
Finite VC ⇒ Uniform Convergence (VC’71)
Uniform Convergence ⇒ Learnable with ERM (trivial)
Learnable with ERM ⇒ Learnable (trivial)
Learnable ⇒ Finite VC (NFL, W’96)
$m_{UC}(\epsilon, \delta) \approx m_{ERM}(\epsilon, \delta) \approx m_{PAC}(\epsilon, \delta) \approx \frac{VC(\mathcal{H}) \log(1/\delta)}{\epsilon^2}$

Counter Example: Stochastic Convex Optimization
Consider the family of problems:
  $\mathcal{H}$ is a convex set with $\max_{h \in \mathcal{H}} \|h\| \le 1$
  For all $z$, $\ell(h, z)$ is convex and Lipschitz w.r.t. $h$
Claim:
  Problem is learnable by the rule:
    $\operatorname{argmin}_{h \in \mathcal{H}} \ \frac{\lambda_m}{2} \|h\|^2 + \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i)$
  No uniform convergence
  Not learnable by ERM
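A minimal sketch of the regularized rule follows, under several assumptions that are not taken from the talk: $\mathcal{H}$ is the Euclidean unit ball, the loss is the Euclidean distance $\ell(h, z) = \|h - z\|$ (convex and 1-Lipschitz in $h$), $\lambda_m = 1/\sqrt{m}$, and the minimizer is approximated by projected subgradient descent while keeping the best iterate seen.

```python
import math
import random

def project_unit_ball(h):
    """Euclidean projection onto H = {h : ||h|| <= 1}."""
    norm = math.sqrt(sum(v * v for v in h))
    return list(h) if norm <= 1.0 else [v / norm for v in h]

def loss(h, z):
    """Assumed example loss: Euclidean distance, convex and 1-Lipschitz in h."""
    return math.dist(h, z)

def loss_subgradient(h, z):
    """A subgradient of ||h - z|| with respect to h (zero vector at h == z)."""
    dist = math.dist(h, z)
    if dist == 0.0:
        return [0.0] * len(h)
    return [(a - b) / dist for a, b in zip(h, z)]

def objective(h, sample, lam):
    """(lam / 2) * ||h||^2 + (1/m) * sum_i loss(h, z_i)."""
    reg = 0.5 * lam * sum(v * v for v in h)
    return reg + sum(loss(h, z) for z in sample) / len(sample)

def regularized_rule(sample, lam, steps=500):
    """Approximate the regularized minimizer by projected subgradient descent."""
    h = [0.0] * len(sample[0])
    best, best_val = h, objective(h, sample, lam)
    for t in range(1, steps + 1):
        grad = [lam * v for v in h]  # gradient of the regularizer
        for z in sample:
            g = loss_subgradient(h, z)
            grad = [a + b / len(sample) for a, b in zip(grad, g)]
        step = 1.0 / (lam * t)  # standard step size for a lam-strongly convex objective
        h = project_unit_ball([v - step * g for v, g in zip(h, grad)])
        val = objective(h, sample, lam)
        if val < best_val:  # keep the best iterate seen so far
            best, best_val = h, val
    return best

rng = random.Random(0)
sample = [project_unit_ball([rng.uniform(-1, 1) for _ in range(3)]) for _ in range(50)]
lam = 1.0 / math.sqrt(len(sample))  # assumed choice of lambda_m
h_hat = regularized_rule(sample, lam)
print(h_hat, objective(h_hat, sample, lam))
```

The regularizer makes the objective strongly convex, so its minimizer is unique; this is the property the stability-based analysis relies on, in contrast to an arbitrary ERM.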

Counter Example: Stochastic Convex Optimization
Proof (of “not learnable by arbitrary ERM”)
1-Mean + missing features
  $z = (\alpha, x)$, $\alpha \in \{0, 1\}^d$, $x \in \mathbb{R}^d$, $\|x\| \le 1$
  $\ell(h, (\alpha, x)) = \sqrt{\sum_i \alpha_i (h_i - x_i)^2}$
Take $\mathbb{P}[\alpha_i = 1] = 1/2$, $\mathbb{P}[x = \mu] = 1$
Let $h^{(i)}$ be s.t. $h^{(i)}_j = 1 - \mu_j$ if $j = i$, and $h^{(i)}_j = \mu_j$ otherwise
If $d$ is large enough, there exists $i$ such that $h^{(i)}$ is an ERM
But $L(h^{(i)}) \ge 1/\sqrt{2}$
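The step “if $d$ is large enough, there exists $i$ such that $h^{(i)}$ is an ERM” can be checked numerically. A coordinate is never observed in $m$ examples with probability $2^{-m}$, so some coordinate is unobserved with probability $1 - (1 - 2^{-m})^d$. The sketch below assumes independent coordinates of $\alpha$ and takes $\mu = 0$, $d = 4096$, $m = 8$ purely for illustration; it only verifies the zero training loss, not the lower bound on the true loss.

```python
import random

def prob_some_coordinate_unobserved(d, m):
    """P[exists i with alpha_i = 0 in all m examples], for independent fair masks."""
    return 1.0 - (1.0 - 2.0 ** (-m)) ** d

def loss(h, z):
    """sqrt(sum_i alpha_i * (h_i - x_i)^2), the loss from the slide."""
    alpha, x = z
    return sum(a * (hi - xi) ** 2 for a, hi, xi in zip(alpha, h, x)) ** 0.5

def find_unobserved_coordinate(sample, d):
    """Index of a coordinate masked out in every example, or None."""
    for i in range(d):
        if all(alpha[i] == 0 for alpha, _ in sample):
            return i
    return None

d, m = 4096, 8                 # illustrative sizes (assumption)
mu = [0.0] * d                 # illustrative choice of mu (assumption)
rng = random.Random(0)
sample = [([rng.randint(0, 1) for _ in range(d)], mu) for _ in range(m)]

i = find_unobserved_coordinate(sample, d)
train_loss = None
if i is not None:
    h_i = list(mu)
    h_i[i] = 1.0 - mu[i]       # h^(i): differs from mu only on coordinate i
    train_loss = sum(loss(h_i, z) for z in sample) / m
print(prob_some_coordinate_unobserved(d, m), i, train_loss)
```

Since the loss is nonnegative, a training loss of zero means $h^{(i)}$ is an empirical risk minimizer even though it is wrong on coordinate $i$.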

Counter Example: Stochastic Convex Optimization
Proof (of “not even learnable by a unique ERM”)
Perturb the loss a little bit:
  $\ell(h, (\alpha, x)) = \sqrt{\sum_i \alpha_i (h_i - x_i)^2} + \epsilon \sum_i 2^{-i} (h_i - 1)^2$
Now the loss is strictly convex, so the ERM is unique
But the unique ERM does not generalize (as before)

For general learning problems?
Uniform Convergence ⇒ Learnable with ERM (trivial)
Learnable with ERM ⇒ Learnable (trivial)
Learnable ⇒ Uniform Convergence: not true
Not true in “Convex learning problems”! ✓
Not true even in “multiclass categorization”!

Counter Example: Multiclass
$\mathcal{X}$ is a set, $\mathcal{Y} = \{0, 1, 2, \ldots, 2^{|\mathcal{X}|} - 1\}$
Let $n : 2^{\mathcal{X}} \to \mathcal{Y}$ be defined by binary encoding
$\mathcal{H} = \{h_T : T \subset \mathcal{X}\}$ where $h_T(x) = 0$ if $x \notin T$, and $h_T(x) = n(T)$ if $x \in T$
Claim: No uniform convergence: $m_{UC} \ge |\mathcal{X}| / \epsilon$
  Target function is $h_\emptyset$
  For any training set $S$, take $T = \mathcal{X} \setminus S$
  $L_S(h_T) = 0$ but $L(h_T) = \mathbb{P}[T]$
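The failure of uniform convergence can be reproduced on a small finite domain. In the sketch below, the ten-point domain, the uniform distribution over it, and the names `encode` and `make_h` are assumptions chosen for illustration.

```python
def encode(T, domain):
    """Binary encoding n(T): bit i is set iff the i-th domain element is in T."""
    return sum(1 << i for i, x in enumerate(domain) if x in T)

def make_h(T, domain):
    """The hypothesis h_T: n(T) on T and 0 elsewhere."""
    code = encode(T, domain)
    return lambda x: code if x in T else 0

domain = list(range(10))                   # hypothetical finite X
S = [(x, 0) for x in (1, 3, 3, 7)]         # sample labeled by the target h_emptyset
T = set(domain) - {x for x, _ in S}        # T = X \ S
h_T = make_h(T, domain)

train_loss = sum(h_T(x) != y for x, y in S) / len(S)
# Under the uniform distribution on the domain, L(h_T) = P[T] = |T| / |X|.
true_loss = sum(h_T(x) != 0 for x in domain) / len(domain)
print(train_loss, true_loss)
```

The hypothesis fits the sample perfectly yet errs on every unseen point, so the gap between training and true loss stays large until the sample covers most of the domain.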

Counter Example: Multiclass (continued)
Claim: $\mathcal{H}$ is Learnable: $m_{PAC} \le 1/\epsilon$
  Let $T$ be the target
  $\mathcal{A}(S) = h_T$ if $(x, n(T)) \in S$
  $\mathcal{A}(S) = h_\emptyset$ if $S = \{(x_1, 0), \ldots, (x_m, 0)\}$
  In the 1st case, $L(\mathcal{A}(S)) = 0$.
  In the 2nd case, $L(\mathcal{A}(S)) = \mathbb{P}[T]$
  With high probability, if $\mathbb{P}[T] > \epsilon$ then we’ll be in the 1st case
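The learning rule $\mathcal{A}$ works because a single nonzero label reveals the whole target set. A minimal sketch, assuming a finite ordered domain and the hypothetical helper names `encode`, `decode`, and `learner`:

```python
def encode(T, domain):
    """Binary encoding n(T): bit i is set iff the i-th domain element is in T."""
    return sum(1 << i for i, x in enumerate(domain) if x in T)

def decode(code, domain):
    """Inverse of encode: recover the set T from the label n(T)."""
    return {x for i, x in enumerate(domain) if (code >> i) & 1}

def make_h(T, domain):
    """The hypothesis h_T: n(T) on T and 0 elsewhere."""
    code = encode(T, domain)
    return lambda x: code if x in T else 0

def learner(S, domain):
    """A(S): return h_T if some label reveals T, otherwise h_emptyset."""
    for _, y in S:
        if y != 0:
            return make_h(decode(y, domain), domain)
    return make_h(set(), domain)

domain = list(range(10))                       # hypothetical finite X
target_set = {2, 5}
target = make_h(target_set, domain)

revealing = [(0, target(0)), (5, target(5))]   # contains a point of T
silent = [(0, target(0)), (1, target(1))]      # all labels are 0

h_first = learner(revealing, domain)           # 1st case: recovers h_T exactly
h_second = learner(silent, domain)             # 2nd case: returns h_emptyset
print(all(h_first(x) == target(x) for x in domain))
print(sum(h_second(x) != target(x) for x in domain) / len(domain))
```

In the second case the error under a uniform distribution equals $\mathbb{P}[T]$, matching the slide.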
