Learnability Beyond Uniform Convergence
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
“Mathematical and Computational Foundations of Learning Theory”, Dagstuhl 2011
Joint work with: N. Srebro, O. Shamir, K. Sridharan (COLT’09, JMLR’11); A. Daniely, S. Sabato, S. Ben-David (COLT’11)
Shai Shalev-Shwartz (Hebrew U) Learnability Beyond Uniform Convergence Jul’11 1 / 34

The Fundamental Theorem of Learning Theory
For Binary Classification
- Uniform Convergence ⇒ Learnable with ERM (trivial)
- Learnable with ERM ⇒ Learnable (trivial)
- Learnable ⇒ Finite VC (NFL, W’96)
- Finite VC ⇒ Uniform Convergence (VC’71)

The Fundamental Theorem of Learning Theory
For Regression
- Uniform Convergence ⇒ Learnable with ERM (trivial)
- Learnable with ERM ⇒ Learnable (trivial)
- Learnable ⇒ Finite fat-shattering (KS’94, BLW’96, ABCH’97)
- Finite fat-shattering ⇒ Uniform Convergence (BLW’96, ABCH’97)

For general learning problems?
- Uniform Convergence ⇒ Learnable with ERM (trivial)
- Learnable with ERM ⇒ Learnable (trivial)
- Learnable ⇒ Uniform Convergence? No: not true even in multiclass classification!
What is learnable? How to learn?

Outline
1. Definitions
2. Learnability without uniform convergence
3. Characterizing Learnability using Stability
4. Characterizing Multiclass Learnability
5. Open Questions

The General Learning Setting
Vapnik’s General Learning Setting
- Hypothesis class $\mathcal{H}$
- Instance space $\mathcal{Z}$ with unknown distribution $\mathcal{D}$
- Loss function $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$
Given: training set $S \sim \mathcal{D}^m$
Goal: probably approximately solve $\min_{h \in \mathcal{H}} L(h)$, where $L(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$

Examples
- Binary classification: $\mathcal{Z} = \mathcal{X} \times \{0, 1\}$; $h \in \mathcal{H}$ is a predictor $h : \mathcal{X} \to \{0, 1\}$; $\ell(h, (x, y)) = \mathbf{1}[h(x) \neq y]$
- Multiclass categorization: $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$; $h \in \mathcal{H}$ is a predictor $h : \mathcal{X} \to \mathcal{Y}$; $\ell(h, (x, y)) = \mathbf{1}[h(x) \neq y]$
- $k$-means clustering: $\mathcal{Z} = \mathbb{R}^d$; $\mathcal{H} \subset (\mathbb{R}^d)^k$ specifies $k$ cluster centers; $\ell((\mu_1, \ldots, \mu_k), z) = \min_j \|\mu_j - z\|$
- Density estimation: $h$ is a parameter of a density $p_h(z)$; $\ell(h, z) = -\log p_h(z)$

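As a rough illustration of the four losses above, the sketch below writes each one as a plain Python function. The function names and the way hypotheses are represented (callables, a list of centers, a density callable) are assumptions made for this example, not notation from the talk. `empirical_risk` is the sample average that the following slides denote $L_S(h)$.

```python
import math

def zero_one_loss(h, z):
    """Binary or multiclass 0-1 loss: 1 if h mispredicts the label, else 0."""
    x, y = z
    return 0.0 if h(x) == y else 1.0

def k_means_loss(centers, z):
    """Distance from the point z to the nearest of the k centers."""
    return min(math.dist(mu, z) for mu in centers)

def log_loss(density, z):
    """Negative log-likelihood of z under the density p_h (a callable here)."""
    return -math.log(density(z))

def empirical_risk(loss, h, sample):
    """Average loss of h over the training set S."""
    return sum(loss(h, z) for z in sample) / len(sample)

# Tiny usage example with made-up data.
always_zero = lambda x: 0
sample = [(1, 0), (2, 1)]
risk = empirical_risk(zero_one_loss, always_zero, sample)
nearest = k_means_loss([(0.0, 0.0), (3.0, 4.0)], (3.0, 0.0))
```

The same `empirical_risk` works for every example because the general setting only needs a loss on pairs (hypothesis, instance).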
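Learnability, ERM, Uniform convergence
Throughout, $L_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h, z_i)$ is the empirical risk on $S = (z_1, \ldots, z_m)$.
- Uniform Convergence: for $m \ge m_{\mathrm{UC}}(\epsilon, \delta)$, $\mathbb{P}_{S \sim \mathcal{D}^m}\left[\forall h \in \mathcal{H},\ |L_S(h) - L(h)| \le \epsilon\right] \ge 1 - \delta$
- Learnable: $\exists \mathcal{A}$ s.t. for $m \ge m_{\mathrm{PAC}}(\epsilon, \delta)$, $\mathbb{P}_{S \sim \mathcal{D}^m}\left[L(\mathcal{A}(S)) \le \min_{h \in \mathcal{H}} L(h) + \epsilon\right] \ge 1 - \delta$
- ERM: an algorithm that returns $\mathcal{A}(S) \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$
- Learnable by arbitrary ERM: like “Learnable”, but $\mathcal{A}$ should be an ERM. Denote the sample complexity by $m_{\mathrm{ERM}}(\epsilon, \delta)$
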
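To make the three notions concrete, here is a minimal sketch on a toy problem that is assumed purely for illustration: threshold classifiers on the domain {0, ..., 9} with a uniform distribution and a noiseless target. It computes an ERM and the uniform-convergence gap $\max_h |L_S(h) - L(h)|$ for one fixed sample. None of the names or numbers come from the talk.

```python
def threshold(t):
    """Hypothesis h_t(x) = 1 if x >= t else 0."""
    return lambda x: int(x >= t)

DOMAIN = range(10)                      # assumed finite instance domain

def target(x):
    return int(x >= 5)                  # assumed noiseless labeling rule

H = {t: threshold(t) for t in range(11)}  # finite hypothesis class

def true_risk(h):
    """L(h) under the uniform distribution on DOMAIN."""
    return sum(h(x) != target(x) for x in DOMAIN) / len(DOMAIN)

def empirical_risk(h, sample):
    """L_S(h): fraction of training examples that h mislabels."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def erm(sample):
    """Return the index of one empirical risk minimizer (ties: smallest t)."""
    return min(H, key=lambda t: empirical_risk(H[t], sample))

def uniform_gap(sample):
    """max over h in H of |L_S(h) - L(h)| for this particular sample."""
    return max(abs(empirical_risk(h, sample) - true_risk(h)) for h in H.values())

sample = [(x, target(x)) for x in (1, 3, 4, 6, 8, 8)]
t_hat = erm(sample)
gap = uniform_gap(sample)
```

When the gap is at most $\epsilon$ for every hypothesis at once, any ERM is automatically within $2\epsilon$ of the best risk in the class, which is the "trivial" arrow from uniform convergence to learnability with ERM.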
For Binary Classification
- Uniform Convergence ⇒ Learnable with ERM (trivial)
- Learnable with ERM ⇒ Learnable (trivial)
- Learnable ⇒ Finite VC (NFL, W’96)
- Finite VC ⇒ Uniform Convergence (VC’71)
$$m_{\mathrm{UC}}(\epsilon, \delta) \approx m_{\mathrm{ERM}}(\epsilon, \delta) \approx m_{\mathrm{PAC}}(\epsilon, \delta) \approx \frac{\mathrm{VC}(\mathcal{H}) \log(1/\delta)}{\epsilon^2}$$

First (trivial) Counter Example
Minorizing function:
- Let $\mathcal{H}'$ be a class of binary classifiers with infinite VC dimension
- Let $\mathcal{H} = \mathcal{H}' \cup \{h_0\}$
- Let
$$\ell(h, (x, y)) = \begin{cases} 1 & \text{if } h \neq h_0 \wedge h(x) \neq y \\ 1/2 & \text{if } h \neq h_0 \wedge h(x) = y \\ 0 & \text{if } h = h_0 \end{cases}$$
- No uniform convergence ($m_{\mathrm{UC}} = \infty$)
- Learnable by ERM ($m_{\mathrm{ERM}} = 0$)

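A small sketch of why every ERM picks $h_0$ here: its loss is 0 on every example, while any other hypothesis pays at least 1/2. The finite list `classifiers` is only a stand-in for $\mathcal{H}'$ (a class of infinite VC dimension cannot be enumerated in code), and all names are assumptions for this illustration.

```python
H0 = object()  # sentinel standing in for the extra hypothesis h_0

def loss(h, z):
    """The three-valued loss from the slide."""
    if h is H0:
        return 0.0
    x, y = z
    return 1.0 if h(x) != y else 0.5

def empirical_risk(h, sample):
    return sum(loss(h, z) for z in sample) / len(sample)

def erm(hypotheses, sample):
    """Return one minimizer of the empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

# Toy finite stand-in for H' (assumption: the real H' has infinite VC dimension).
classifiers = [lambda x: 0, lambda x: 1, lambda x: x % 2]
hypotheses = classifiers + [H0]

sample = [(1, 1), (2, 0), (3, 1)]
chosen = erm(hypotheses, sample)
```

Since $h_0$ also minimizes the true risk, the ERM output is optimal regardless of the sample, even though the risks of the other hypotheses do not converge uniformly.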
From Vapnik’s book ...

Second Counter Example: Multiclass
$\mathcal{X}$ is a set, $\mathcal{Y} = 2^{\mathcal{X}} \cup \{*\}$, and $\mathcal{H} = \{h_T : T \subset \mathcal{X}\}$, where
$$h_T(x) = \begin{cases} * & x \notin T \\ T & x \in T \end{cases}$$
Claim: no uniform convergence, $m_{\mathrm{UC}} \ge |\mathcal{X}| / \epsilon$
- Target function is $h_\emptyset$
- For any training set $S$, take $T = \mathcal{X} \setminus S$
- $L_S(h_T) = 0$ but $L(h_T) = \mathbb{P}[T]$
Claim: $\mathcal{H}$ is learnable, $m_{\mathrm{PAC}} \le 1/\epsilon$
- Let $T$ be the target
- $\mathcal{A}(S) = h_T$ if $(x, T) \in S$ for some $x$
- $\mathcal{A}(S) = h_\emptyset$ if $S = \{(x_1, *), \ldots, (x_m, *)\}$
- In the 1st case, $L(\mathcal{A}(S)) = 0$. In the 2nd case, $L(\mathcal{A}(S)) = \mathbb{P}[T]$
- With high probability, if $\mathbb{P}[T] > \epsilon$ then we’ll be in the 1st case

Second Counter Example: Multiclass
Corollary: $\frac{m_{\mathrm{UC}}}{m_{\mathrm{PAC}}} \approx |\mathcal{X}|$. If $|\mathcal{X}| \to \infty$ then the problem is learnable but there is no uniform convergence!

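The multiclass construction can be played out on a small finite domain. The sketch below assumes a uniform distribution over 100 points and represents $h_T$ by the frozenset $T$ itself; these choices and all identifiers are illustrative assumptions. It shows a "bad" ERM with zero training error but large true error, next to the learner $\mathcal{A}$ from the slide.

```python
STAR = "*"

def predict(T, x):
    """h_T(x): reveal the whole set T on its members, '*' elsewhere."""
    return T if x in T else STAR

def learner(sample):
    """The rule A(S): any label other than '*' is the target set itself."""
    for _, y in sample:
        if y != STAR:
            return y
    return frozenset()  # only '*' labels seen: output h_emptyset

def risk(T, target, domain):
    """0-1 risk of h_T against h_target under the uniform distribution."""
    return sum(predict(T, x) != predict(target, x) for x in domain) / len(domain)

domain = range(100)
empty = frozenset()

# Target h_emptyset: every training label is '*'.
sample_xs = [3, 17, 17, 42]
sample = [(x, predict(empty, x)) for x in sample_xs]

# An ERM that fits the sample perfectly but is wrong everywhere else: T = X \ S.
bad_T = frozenset(domain) - frozenset(sample_xs)
bad_empirical = sum(predict(bad_T, x) != y for x, y in sample) / len(sample)
bad_true = risk(bad_T, empty, domain)

# Non-empty target: a single training point inside T identifies it exactly.
target = frozenset({1, 2, 3})
learned = learner([(x, predict(target, x)) for x in (3, 50)])
```

The gap between `bad_empirical` and `bad_true` grows with the size of the domain, while the learner needs only one informative example, mirroring the ratio in the corollary.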
Third Counter Example: Stochastic Convex Optimization
Consider the family of problems:
- $\mathcal{H}$ is a convex set with $\max_{h \in \mathcal{H}} \|h\| \le 1$
- For all $z$, $\ell(h, z)$ is convex and Lipschitz w.r.t. $h$
Claim:
- Problem is learnable by the rule
$$\operatorname{argmin}_{h \in \mathcal{H}} \ \frac{\lambda_m}{2} \|h\|^2 + \frac{1}{m} \sum_{i=1}^m \ell(h, z_i)$$
- No uniform convergence
- Not learnable by ERM

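A minimal sketch of the regularized rule on a one-dimensional toy instance, assuming $\mathcal{H} = [-1, 1]$ and the convex, 1-Lipschitz loss $\ell(h, z) = |h - z|$. The solver is plain projected subgradient descent with step size $1/(\lambda t)$ and iterate averaging. That solver is a choice made for this example, not something the talk specifies, and the sample and the value of `lam` are made up.

```python
def objective(h, sample, lam):
    """lam/2 * h^2 + (1/m) * sum_i |h - z_i|."""
    return 0.5 * lam * h * h + sum(abs(h - z) for z in sample) / len(sample)

def regularized_erm(sample, lam, steps=20000):
    """Approximately minimize the objective over H = [-1, 1]."""
    h, h_sum = 0.0, 0.0
    for t in range(1, steps + 1):
        h_sum += h
        # Subgradient of the objective at h; (h > z) - (h < z) is sign(h - z).
        g = lam * h + sum((h > z) - (h < z) for z in sample) / len(sample)
        h -= g / (lam * t)                # step size for a lam-strongly convex objective
        h = max(-1.0, min(1.0, h))        # project back onto H
    return h_sum / steps                  # average of the iterates

sample = [0.2, 0.4, 0.4, 0.9, -0.3]       # assumed toy data
lam = 0.1
h_hat = regularized_erm(sample, lam)

# Brute-force reference value on a fine grid, for comparison only.
grid_min = min(objective(-1.0 + k / 1000.0, sample, lam) for k in range(2001))
```

The quadratic term makes the objective strongly convex, so its minimizer is unique and changes little when one training example is replaced. That stability, rather than uniform convergence, is what the talk credits for learnability here.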
Third Counter Example: Stochastic Convex Optimization
Proof (of “not learnable by arbitrary ERM”)
- 1-Mean + missing features
- $z = (\alpha, x)$, $\alpha \in \{0, 1\}^d$, $x \in \mathbb{R}^d$, $\|x\| \le 1$
- $\ell(h, (\alpha, x)) = \sqrt{\sum_i \alpha_i (h_i - x_i)^2}$
- Take $\mathbb{P}[\alpha_i = 1] = 1/2$, $\mathbb{P}[x = \mu] = 1$
- Let $h^{(i)}$ be s.t.
$$h^{(i)}_j = \begin{cases} 1 - \mu_j & \text{if } j = i \\ \mu_j & \text{o.w.} \end{cases}$$
- If $d$ is large enough, there exists $i$ with $\alpha_i = 0$ in every training example, so $L_S(h^{(i)}) = 0$ and $h^{(i)}$ is an ERM
- But $L(h^{(i)}) \ge 1/2$

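The argument can be checked numerically on one instance. The sketch assumes $\mu = 0$, $m = 5$ training examples and $d = 1000$ coordinates, so that with overwhelming probability some coordinate is never observed. The seed and all identifiers are arbitrary choices for this illustration.

```python
import math
import random

def loss(h, alpha, x):
    """sqrt(sum_i alpha_i * (h_i - x_i)^2): distance on the observed coordinates."""
    return math.sqrt(sum(a * (hi - xi) ** 2 for a, hi, xi in zip(alpha, h, x)))

rng = random.Random(0)
m, d = 5, 1000                      # few examples, many coordinates (d >> 2^m)
mu = [0.0] * d                      # x = mu with probability 1 (assumed mu = 0)

# Each training example carries a mask alpha with independent fair bits.
masks = [[rng.randint(0, 1) for _ in range(d)] for _ in range(m)]

# Find a coordinate that is hidden (alpha_i = 0) in every training example.
hidden = next(i for i in range(d) if all(mask[i] == 0 for mask in masks))

# h^(i): equal to mu except on the hidden coordinate.
h_bad = list(mu)
h_bad[hidden] = 1.0 - mu[hidden]

empirical = sum(loss(h_bad, mask, mu) for mask in masks) / m

# Exact true risk: only the hidden coordinate differs from mu, and it is
# observed with probability 1/2.
true_risk = 0.5 * abs(h_bad[hidden] - mu[hidden])
```

With $\mu = 0$ the vector `h_bad` is a unit vector, so it lies in $\mathcal{H}$; it has zero empirical risk and is therefore an ERM, yet its true risk stays at 1/2 no matter how `m` and `d` are scaled, as long as a hidden coordinate exists.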
Third Counter Example: Stochastic Convex Optimization
Proof (of “not even learnable by a unique ERM”)
- Perturb the loss a little bit:
$$\ell(h, (\alpha, x)) = \sqrt{\sum_i \alpha_i (h_i - x_i)^2} + \epsilon \sum_i 2^{-i} (h_i - 1)^2$$
- Now the loss is strictly convex, so the ERM is unique
- But the unique ERM does not generalize (as before)

Characterizing Learnability using Stability
Theorem: A sufficient and necessary condition for learnability is the existence of an Asymptotic ERM (AERM) which is stable.
- Uniform Convergence ⇒ ERM is stable (RMP’05, MNPR’06)
- ERM is stable ⇒ ∃ stable AERM (trivial)
- ∃ stable AERM ⇔ Learnable
