The Sample-Computational Tradeoff
Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University of Jerusalem
Optimization and Statistical Learning Workshop, Les Houches, January 2013
Collaborators:
- Nati Srebro
- Ohad Shamir and Eran Tromer (AISTATS'2012)
- Satyen Kale and Elad Hazan (COLT'2012)
- Aharon Birnbaum (NIPS'2012)
- Amit Daniely and Nati Linial (on arXiv)
What else can we do with more data?
- Traditional answer: reduce error.
- With big data we can also: compensate for missing information, speed up training runtime, and speed up prediction runtime.
Agnostic PAC Learning
- Hypothesis class $H \subseteq Y^X$
- Loss function $\ell : H \times (X \times Y) \to \mathbb{R}$
- $D$ -- unknown distribution over $X \times Y$
- True risk: $L_D(h) = \mathbb{E}_{(x,y) \sim D}[\ell(h,(x,y))]$
- Training set: $S = (x_1,y_1),\ldots,(x_m,y_m)$ drawn i.i.d. from $D^m$
- Goal: use $S$ to find $h_S$ such that, with high probability, $L_D(h_S) \le \min_{h \in H} L_D(h) + \epsilon$
- ERM rule: $\mathrm{ERM}(S) \in \operatorname{argmin}_{h \in H} L_S(h) := \frac{1}{m}\sum_{i=1}^m \ell(h,(x_i,y_i))$
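To make the ERM rule concrete, here is a minimal sketch of empirical risk minimization over a finite hypothesis class. The threshold class, the zero-one loss, and the synthetic data are hypothetical placeholders for illustration, not the settings used later in the talk.

```python
import random

def zero_one_loss(h, example):
    """Zero-one loss of hypothesis h on a single (x, y) example."""
    x, y = example
    return 0.0 if h(x) == y else 1.0

def empirical_risk(h, sample, loss):
    """L_S(h): average loss of h on the training sample S."""
    return sum(loss(h, ex) for ex in sample) / len(sample)

def erm(hypotheses, sample, loss=zero_one_loss):
    """ERM(S): any hypothesis minimizing the empirical risk over the class."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample, loss))

# Toy illustration: threshold classifiers on [0, 1] (a hypothetical finite class).
hypotheses = [lambda x, t=t: 1 if x >= t else 0 for t in [i / 10 for i in range(11)]]
xs = [random.random() for _ in range(100)]
# Labels follow a planted threshold at 0.4, flipped with probability 0.1.
sample = [(x, int(x >= 0.4) if random.random() < 0.9 else 1 - int(x >= 0.4)) for x in xs]
h_S = erm(hypotheses, sample)
print("empirical risk of ERM:", empirical_risk(h_S, sample, zero_one_loss))
```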
Error Decomposition
- $h^\star = \operatorname{argmin}_{h \in H} L_D(h)$;  $\mathrm{ERM}(S) = \operatorname{argmin}_{h \in H} L_S(h)$
- Taking $h_S = \mathrm{ERM}(S)$:
  $L_D(h_S) = \underbrace{L_D(h^\star)}_{\text{approximation}} + \underbrace{L_D(\mathrm{ERM}(S)) - L_D(h^\star)}_{\text{estimation}}$
- Bias-Complexity tradeoff: a larger $H$ decreases the approximation error but increases the estimation error
3-term Error Decomposition (Bottou & Bousquet '08)
- $h^\star = \operatorname{argmin}_{h \in H} L_D(h)$;  $\mathrm{ERM}(S) = \operatorname{argmin}_{h \in H} L_S(h)$
- $L_D(h_S) = \underbrace{L_D(h^\star)}_{\text{approximation}} + \underbrace{L_D(\mathrm{ERM}(S)) - L_D(h^\star)}_{\text{estimation}} + \underbrace{L_D(h_S) - L_D(\mathrm{ERM}(S))}_{\text{optimization}}$
- Bias-Complexity tradeoff: a larger $H$ decreases the approximation error but increases the estimation error
- What about the optimization error?
- Two resources: samples and runtime
- Sample-Computational complexity (Decatur, Goldreich, Ron '98)
Joint Time-Sample Complexity
- Goal: $L_D(h_S) \le \min_{h \in H} L_D(h) + \epsilon$
- Sample complexity: how many examples are needed?
- Time complexity: how much time is needed?
- Time-sample complexity: $T_{H,\epsilon}(m)$ = how much time is needed when $|S| = m$?
[Figure: $T_{H,\epsilon}(m)$ plotted against $m$; the curve starts at the sample complexity and decreases toward the data-laden regime.]
Outline
- The Sample-Computational tradeoff:
  - Agnostic learning of preferences
  - Learning margin-based halfspaces
  - Formally establishing the tradeoff
  - More data in partial information settings
- Other things we can do with more data:
  - Missing information
  - Testing time
Agnostic Learning of Preferences
- The Learning Problem: $X = [d] \times [d]$, $Y = \{0,1\}$
- Given $(i,j) \in X$, predict whether $i$ is preferable over $j$
- $H$ is the set of all permutations over $[d]$
- Loss function: zero-one loss
- Method I: ERM over $H$. Sample complexity: $\frac{d}{\epsilon^2}$
- Varun Kanade and Thomas Steinke (2011): if RP $\ne$ NP, it is not possible to efficiently find an $\epsilon$-accurate permutation
- Claim: if $m \ge d^2/\epsilon^2$, it is possible to find a predictor with error $\le \epsilon$ in polynomial time
Agnostic Learning of Preferences
- Let $H^{(n)}$ be the set of all functions from $X$ to $Y$
- ERM over $H^{(n)}$ can be computed efficiently
- Sample complexity: $\mathrm{VC}(H^{(n)})/\epsilon^2 = d^2/\epsilon^2$
- Improper learning: $H \subseteq H^{(n)}$
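A minimal sketch of the improper ERM over $H^{(n)}$: since $X$ is the finite set $[d] \times [d]$, ERM over all functions simply predicts, for each pair, the majority label observed for that pair in the sample. The tie-breaking and the default label for unseen pairs are arbitrary illustrative choices.

```python
from collections import Counter, defaultdict

def erm_all_functions(sample, default_label=0):
    """ERM over H^(n), the class of all functions [d] x [d] -> {0, 1}.

    Each pair (i, j) is treated independently: the empirical risk is
    minimized by predicting the majority label observed for that pair.
    Pairs never seen in the sample get an arbitrary default label.
    """
    labels = defaultdict(Counter)
    for (i, j), y in sample:
        labels[(i, j)][y] += 1

    def h(pair):
        counts = labels.get(pair)
        if not counts:
            return default_label
        return counts.most_common(1)[0][0]

    return h

# Tiny illustration with d = 3 (hypothetical data).
sample = [((0, 1), 1), ((0, 1), 1), ((0, 1), 0), ((1, 2), 0), ((0, 2), 1)]
h = erm_all_functions(sample)
print(h((0, 1)), h((1, 2)), h((2, 0)))  # -> 1 0 0
```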
Sample-Computational Tradeoff

                    Samples   Time
  ERM over H           d        d!
  ERM over H^(n)       d^2      d^2

What lies in between?
[Figure: Time vs. Samples, with ERM over H and ERM over H^(n) as the two extreme points and a "?" between them.]
Is this the best we can do?
- The analysis is based on upper bounds
- Is it possible to (improperly) learn efficiently with $d\log(d)$ examples?
- Posed as an open problem by:
  - Jacob Abernethy (COLT'10)
  - Kleinberg, Niculescu-Mizil, Sharma (Machine Learning 2010)
- Hazan, Kale, S. (COLT'12): can learn efficiently with $\frac{d\log^3(d)}{\epsilon^2}$ examples
Sample-Computational Tradeoff

                    Samples         Time
  ERM over H           d              d!
  HKS                  d log^3(d)     d^4 log^3(d)
  ERM over H^(n)       d^2            d^2

[Figure: Time vs. Samples, with HKS as an intermediate point between ERM over H and ERM over H^(n).]
HKS: Proof Idea
- Each permutation $\pi$ can be written as a matrix, s.t. $W(i,j) = 1$ if $\pi(i) < \pi(j)$, and $0$ otherwise
- Definition: a matrix is $(\beta,\tau)$-decomposable if its symmetrization can be written as $P - N$ where $P, N$ are PSD, have trace bounded by $\tau$, and diagonal entries bounded by $\beta$
- Theorem: there is an efficient online algorithm with regret $\sqrt{\tau\beta\log(d)\,T}$ for predicting the elements of $(\beta,\tau)$-decomposable matrices
- Lemma: permutation matrices are $(\log(d),\, d\log(d))$-decomposable
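A small sketch of the matrix encoding only (the actual $(\beta,\tau)$-decomposition and the online matrix-prediction algorithm from HKS are not reproduced here): it builds the comparison matrix $W$ of a permutation and checks the basic antisymmetry $W(i,j) + W(j,i) = 1$ for $i \ne j$.

```python
import numpy as np

def comparison_matrix(pi):
    """W(i, j) = 1 if pi(i) < pi(j), else 0, for a permutation pi of [d]."""
    d = len(pi)
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if pi[i] < pi[j]:
                W[i, j] = 1.0
    return W

d = 5
rng = np.random.default_rng(0)
pi = rng.permutation(d)
W = comparison_matrix(pi)

# Off the diagonal, exactly one of W(i, j) and W(j, i) equals 1.
J, I = np.ones((d, d)), np.eye(d)
assert np.allclose(W + W.T, J - I)
print(W)
```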
Outline
- The Sample-Computational tradeoff:
  - Agnostic learning of preferences ✓
  - Learning margin-based halfspaces
  - Formally establishing the tradeoff
- Other things we can do with more data:
  - Missing information
  - Testing time
Learning Margin-Based Halfspaces
- Prior assumption: $\min_{w : \|w\|=1} \Pr[\, y\langle w, x\rangle \le \gamma \,]$ is small
[Figure: linearly separated data with a margin of width $\gamma$ around the separating hyperplane.]
Learning Margin-Based Halfspaces
- Goal: find $h_S : X \to \{\pm 1\}$ such that
  $\Pr[h_S(x) \ne y] \le (1+\alpha)\, \min_{w : \|w\|=1} \Pr[\, y\langle w, x\rangle \le \gamma \,] + \epsilon$
- Known results:

                          alpha      Samples                Time
  Ben-David and Simon       0        1/(gamma^2 eps^2)      exp(1/gamma^2)
  SVM (hinge loss)        1/gamma    1/(gamma^2 eps^2)      poly(1/gamma)

- Trading approximation factor for runtime
- What if $\alpha \in (0, 1/\gamma)$?
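A minimal sketch of the hinge-loss route in the table above: stochastic subgradient descent on the margin-scaled hinge loss $[1 - y\langle w, x\rangle/\gamma]_+$ over the unit ball. The step size, the projection scheme, and the synthetic data are illustrative choices, not the tuned algorithm behind the stated bounds.

```python
import numpy as np

def hinge_sgd(examples, gamma, epochs=5, lr=0.1):
    """Stochastic subgradient descent on [1 - y<w,x>/gamma]_+ with ||w|| <= 1."""
    dim = len(examples[0][0])
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            if y * np.dot(w, x) / gamma < 1.0:   # hinge term is active
                w += lr * y * x / gamma          # subgradient step
            norm = np.linalg.norm(w)
            if norm > 1.0:                       # project back onto the unit ball
                w /= norm
    return w

# Hypothetical data: labels given by a planted unit vector (no margin enforced;
# purely illustrative).
rng = np.random.default_rng(1)
w_star = rng.normal(size=20); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(500, 20))
y = np.sign(X @ w_star)
gamma = 0.1
w = hinge_sgd(list(zip(X, y)), gamma)
print("training error:", np.mean(np.sign(X @ w) != y))
```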
Learning Margin-Based Halfspaces

Theorem (Birnbaum and S., NIPS'12)
Can achieve an $\alpha$-approximation using time and sample complexity of
$\mathrm{poly}(1/\gamma) \cdot \exp\!\left(\frac{4}{(\gamma\alpha)^2}\right)$

Corollary
Can achieve $\alpha = \frac{1}{\gamma\sqrt{\log(1/\gamma)}}$ in polynomial time
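A one-line check of the corollary, filling in the substitution the slide leaves implicit (assuming the bound as stated above):

```latex
\[
\alpha = \frac{1}{\gamma\sqrt{\log(1/\gamma)}}
\;\Longrightarrow\;
(\gamma\alpha)^2 = \frac{1}{\log(1/\gamma)}
\;\Longrightarrow\;
\exp\!\left(\frac{4}{(\gamma\alpha)^2}\right)
= \exp\!\big(4\log(1/\gamma)\big)
= (1/\gamma)^4 = \mathrm{poly}(1/\gamma).
\]
```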
Proof Idea
- SVM relies on the hinge loss as a convex surrogate:
  $\ell(w,(x,y)) = \left[1 - \frac{y\langle w, x\rangle}{\gamma}\right]_+$
- Instead, compose the hinge loss over a polynomial:
  $\left[1 - y\, p(\langle w, x\rangle)\right]_+$
[Figure: the hinge-loss surrogate plotted against $y\langle w, x\rangle$.]
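A sketch of the composed surrogate only, with an arbitrary odd polynomial standing in for $p$; the actual polynomial in the Birnbaum and S. construction is chosen carefully and is not reproduced here.

```python
import numpy as np

def hinge(z):
    """[1 - z]_+"""
    return np.maximum(0.0, 1.0 - z)

def composed_loss(w, x, y, poly_coeffs):
    """Hinge loss composed with a polynomial p of the inner product <w, x>:
    [1 - y * p(<w, x>)]_+ .  poly_coeffs lists the coefficients of p in
    increasing degree."""
    p_val = np.polynomial.polynomial.polyval(np.dot(w, x), poly_coeffs)
    return hinge(y * p_val)

# Illustration: p(a) = 3a - 4a^3 is an arbitrary odd cubic, not the polynomial
# actually used in the construction.
coeffs = [0.0, 3.0, 0.0, -4.0]
w = np.array([0.6, 0.8])
x = np.array([0.3, -0.1])
gamma = 0.1
print(composed_loss(w, x, +1, coeffs))        # composed surrogate
print(hinge(+1 * np.dot(w, x) / gamma))       # plain margin-scaled hinge, for comparison
```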