The Sample-Computational Tradeoff
Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University of Jerusalem
Optimization and Statistical Learning Workshop, Les Houches, January 2013
Collaborators:
- Nati Srebro
- Ohad Shamir and Eran Tromer (AISTATS'2012)
- Satyen Kale and Elad Hazan (COLT'2012)
- Aharon Birnbaum (NIPS'2012)
- Amit Daniely and Nati Linial (on arXiv)
What else can we do with more data?
- Traditional answer: reduce error.
- With big data we can also: compensate for missing information, speed up training runtime, and speed up prediction runtime.
Agnostic PAC Learning
- Hypothesis class $H \subseteq Y^X$
- Loss function $\ell : H \times (X \times Y) \to \mathbb{R}$
- $D$ -- unknown distribution over $X \times Y$
- True risk: $L_D(h) = \mathbb{E}_{(x,y) \sim D}[\ell(h,(x,y))]$
- Training set: $S = (x_1,y_1),\ldots,(x_m,y_m)$ drawn i.i.d. from $D^m$
- Goal: use $S$ to find $h_S$ such that, with high probability, $L_D(h_S) \le \min_{h \in H} L_D(h) + \epsilon$
- ERM rule: $\mathrm{ERM}(S) \in \operatorname{argmin}_{h \in H} L_S(h) := \frac{1}{m}\sum_{i=1}^m \ell(h,(x_i,y_i))$
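To make the ERM rule concrete, here is a minimal sketch of empirical risk minimization over a finite hypothesis class. The threshold class, the zero-one loss, and the synthetic data are hypothetical placeholders for illustration, not the settings used later in the talk.

```python
import random

def zero_one_loss(h, example):
    """Zero-one loss of hypothesis h on a single (x, y) example."""
    x, y = example
    return 0.0 if h(x) == y else 1.0

def empirical_risk(h, sample, loss):
    """L_S(h): average loss of h on the training sample S."""
    return sum(loss(h, ex) for ex in sample) / len(sample)

def erm(hypotheses, sample, loss=zero_one_loss):
    """ERM(S): any hypothesis minimizing the empirical risk over the class."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample, loss))

# Toy illustration: threshold classifiers on [0, 1] (a hypothetical finite class).
hypotheses = [lambda x, t=t: 1 if x >= t else 0 for t in [i / 10 for i in range(11)]]
xs = [random.random() for _ in range(100)]
# Labels follow a planted threshold at 0.4, flipped with probability 0.1.
sample = [(x, int(x >= 0.4) if random.random() < 0.9 else 1 - int(x >= 0.4)) for x in xs]
h_S = erm(hypotheses, sample)
print("empirical risk of ERM:", empirical_risk(h_S, sample, zero_one_loss))
```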
Error Decomposition
- $h^\star = \operatorname{argmin}_{h \in H} L_D(h)$;  $\mathrm{ERM}(S) = \operatorname{argmin}_{h \in H} L_S(h)$
- Taking $h_S = \mathrm{ERM}(S)$:
  $L_D(h_S) = \underbrace{L_D(h^\star)}_{\text{approximation}} + \underbrace{L_D(\mathrm{ERM}(S)) - L_D(h^\star)}_{\text{estimation}}$
- Bias-Complexity tradeoff: a larger $H$ decreases the approximation error but increases the estimation error
3-term Error Decomposition (Bottou & Bousquet '08)
- $h^\star = \operatorname{argmin}_{h \in H} L_D(h)$;  $\mathrm{ERM}(S) = \operatorname{argmin}_{h \in H} L_S(h)$
- $L_D(h_S) = \underbrace{L_D(h^\star)}_{\text{approximation}} + \underbrace{L_D(\mathrm{ERM}(S)) - L_D(h^\star)}_{\text{estimation}} + \underbrace{L_D(h_S) - L_D(\mathrm{ERM}(S))}_{\text{optimization}}$
- Bias-Complexity tradeoff: a larger $H$ decreases the approximation error but increases the estimation error
- What about the optimization error?
- Two resources: samples and runtime
- Sample-Computational complexity (Decatur, Goldreich, Ron '98)
Joint Time-Sample Complexity
- Goal: $L_D(h_S) \le \min_{h \in H} L_D(h) + \epsilon$
- Sample complexity: how many examples are needed?
- Time complexity: how much time is needed?
- Time-sample complexity: $T_{H,\epsilon}(m)$ = how much time is needed when $|S| = m$?
[Figure: $T_{H,\epsilon}(m)$ plotted against $m$; the curve starts at the sample complexity and decreases toward the data-laden regime.]
Outline
- The Sample-Computational tradeoff:
  - Agnostic learning of preferences
  - Learning margin-based halfspaces
  - Formally establishing the tradeoff
  - More data in partial information settings
- Other things we can do with more data:
  - Missing information
  - Testing time
Agnostic Learning of Preferences
- The Learning Problem: $X = [d] \times [d]$, $Y = \{0,1\}$
- Given $(i,j) \in X$, predict whether $i$ is preferable over $j$
- $H$ is the set of all permutations over $[d]$
- Loss function: zero-one loss
- Method I: ERM over $H$. Sample complexity: $\frac{d}{\epsilon^2}$
- Varun Kanade and Thomas Steinke (2011): if RP $\ne$ NP, it is not possible to efficiently find an $\epsilon$-accurate permutation
- Claim: if $m \ge d^2/\epsilon^2$, it is possible to find a predictor with error $\le \epsilon$ in polynomial time
Agnostic Learning of Preferences
- Let $H^{(n)}$ be the set of all functions from $X$ to $Y$
- ERM over $H^{(n)}$ can be computed efficiently
- Sample complexity: $\mathrm{VC}(H^{(n)})/\epsilon^2 = d^2/\epsilon^2$
- Improper learning: $H \subseteq H^{(n)}$
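A minimal sketch of the improper ERM over $H^{(n)}$: since $X$ is the finite set $[d] \times [d]$, ERM over all functions simply predicts, for each pair, the majority label observed for that pair in the sample. The tie-breaking and the default label for unseen pairs are arbitrary illustrative choices.

```python
from collections import Counter, defaultdict

def erm_all_functions(sample, default_label=0):
    """ERM over H^(n), the class of all functions [d] x [d] -> {0, 1}.

    Each pair (i, j) is treated independently: the empirical risk is
    minimized by predicting the majority label observed for that pair.
    Pairs never seen in the sample get an arbitrary default label.
    """
    labels = defaultdict(Counter)
    for (i, j), y in sample:
        labels[(i, j)][y] += 1

    def h(pair):
        counts = labels.get(pair)
        if not counts:
            return default_label
        return counts.most_common(1)[0][0]

    return h

# Tiny illustration with d = 3 (hypothetical data).
sample = [((0, 1), 1), ((0, 1), 1), ((0, 1), 0), ((1, 2), 0), ((0, 2), 1)]
h = erm_all_functions(sample)
print(h((0, 1)), h((1, 2)), h((2, 0)))  # -> 1 0 0
```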
Sample-Computational Tradeoff

                    Samples   Time
  ERM over H           d        d!
  ERM over H^(n)       d^2      d^2

What lies in between?
[Figure: Time vs. Samples, with ERM over H and ERM over H^(n) as the two extreme points and a "?" between them.]
Is this the best we can do?
- The analysis is based on upper bounds
- Is it possible to (improperly) learn efficiently with $d\log(d)$ examples?
- Posed as an open problem by:
  - Jacob Abernethy (COLT'10)
  - Kleinberg, Niculescu-Mizil, Sharma (Machine Learning 2010)
- Hazan, Kale, S. (COLT'12): can learn efficiently with $\frac{d\log^3(d)}{\epsilon^2}$ examples
Sample-Computational Tradeoff

                    Samples         Time
  ERM over H           d              d!
  HKS                  d log^3(d)     d^4 log^3(d)
  ERM over H^(n)       d^2            d^2

[Figure: Time vs. Samples, with HKS as an intermediate point between ERM over H and ERM over H^(n).]
HKS: Proof Idea
- Each permutation $\pi$ can be written as a matrix, s.t. $W(i,j) = 1$ if $\pi(i) < \pi(j)$, and $0$ otherwise
- Definition: a matrix is $(\beta,\tau)$-decomposable if its symmetrization can be written as $P - N$ where $P, N$ are PSD, have trace bounded by $\tau$, and diagonal entries bounded by $\beta$
- Theorem: there is an efficient online algorithm with regret $\sqrt{\tau\beta\log(d)\,T}$ for predicting the elements of $(\beta,\tau)$-decomposable matrices
- Lemma: permutation matrices are $(\log(d),\, d\log(d))$-decomposable
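A small sketch of the matrix encoding only (the actual $(\beta,\tau)$-decomposition and the online matrix-prediction algorithm from HKS are not reproduced here): it builds the comparison matrix $W$ of a permutation and checks the basic antisymmetry $W(i,j) + W(j,i) = 1$ for $i \ne j$.

```python
import numpy as np

def comparison_matrix(pi):
    """W(i, j) = 1 if pi(i) < pi(j), else 0, for a permutation pi of [d]."""
    d = len(pi)
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if pi[i] < pi[j]:
                W[i, j] = 1.0
    return W

d = 5
rng = np.random.default_rng(0)
pi = rng.permutation(d)
W = comparison_matrix(pi)

# Off the diagonal, exactly one of W(i, j) and W(j, i) equals 1.
J, I = np.ones((d, d)), np.eye(d)
assert np.allclose(W + W.T, J - I)
print(W)
```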
Outline
- The Sample-Computational tradeoff:
  - Agnostic learning of preferences ✓
  - Learning margin-based halfspaces
  - Formally establishing the tradeoff
- Other things we can do with more data:
  - Missing information
  - Testing time
Learning Margin-Based Halfspaces
- Prior assumption: $\min_{w : \|w\|=1} \Pr[\, y\langle w, x\rangle \le \gamma \,]$ is small
[Figure: linearly separated data with a margin of width $\gamma$ around the separating hyperplane.]
Learning Margin-Based Halfspaces
- Goal: find $h_S : X \to \{\pm 1\}$ such that
  $\Pr[h_S(x) \ne y] \le (1+\alpha)\, \min_{w : \|w\|=1} \Pr[\, y\langle w, x\rangle \le \gamma \,] + \epsilon$
- Known results:

                          alpha      Samples                Time
  Ben-David and Simon       0        1/(gamma^2 eps^2)      exp(1/gamma^2)
  SVM (hinge loss)        1/gamma    1/(gamma^2 eps^2)      poly(1/gamma)

- Trading approximation factor for runtime
- What if $\alpha \in (0, 1/\gamma)$?
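A minimal sketch of the hinge-loss route in the table above: stochastic subgradient descent on the margin-scaled hinge loss $[1 - y\langle w, x\rangle/\gamma]_+$ over the unit ball. The step size, the projection scheme, and the synthetic data are illustrative choices, not the tuned algorithm behind the stated bounds.

```python
import numpy as np

def hinge_sgd(examples, gamma, epochs=5, lr=0.1):
    """Stochastic subgradient descent on [1 - y<w,x>/gamma]_+ with ||w|| <= 1."""
    dim = len(examples[0][0])
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            if y * np.dot(w, x) / gamma < 1.0:   # hinge term is active
                w += lr * y * x / gamma          # subgradient step
            norm = np.linalg.norm(w)
            if norm > 1.0:                       # project back onto the unit ball
                w /= norm
    return w

# Hypothetical data: labels given by a planted unit vector (no margin enforced;
# purely illustrative).
rng = np.random.default_rng(1)
w_star = rng.normal(size=20); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(500, 20))
y = np.sign(X @ w_star)
gamma = 0.1
w = hinge_sgd(list(zip(X, y)), gamma)
print("training error:", np.mean(np.sign(X @ w) != y))
```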
Learning Margin-Based Halfspaces

Theorem (Birnbaum and S., NIPS'12)
Can achieve an $\alpha$-approximation using time and sample complexity of
$\mathrm{poly}(1/\gamma) \cdot \exp\!\left(\frac{4}{(\gamma\alpha)^2}\right)$

Corollary
Can achieve $\alpha = \frac{1}{\gamma\sqrt{\log(1/\gamma)}}$ in polynomial time
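A one-line check of the corollary, filling in the substitution the slide leaves implicit (assuming the bound as stated above):

```latex
\[
\alpha = \frac{1}{\gamma\sqrt{\log(1/\gamma)}}
\;\Longrightarrow\;
(\gamma\alpha)^2 = \frac{1}{\log(1/\gamma)}
\;\Longrightarrow\;
\exp\!\left(\frac{4}{(\gamma\alpha)^2}\right)
= \exp\!\big(4\log(1/\gamma)\big)
= (1/\gamma)^4 = \mathrm{poly}(1/\gamma).
\]
```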
Proof Idea
- SVM relies on the hinge loss as a convex surrogate:
  $\ell(w,(x,y)) = \left[1 - \frac{y\langle w, x\rangle}{\gamma}\right]_+$
- Instead, compose the hinge loss over a polynomial:
  $\left[1 - y\, p(\langle w, x\rangle)\right]_+$
[Figure: the hinge-loss surrogate plotted against $y\langle w, x\rangle$.]
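A sketch of the composed surrogate only, with an arbitrary odd polynomial standing in for $p$; the actual polynomial in the Birnbaum and S. construction is chosen carefully and is not reproduced here.

```python
import numpy as np

def hinge(z):
    """[1 - z]_+"""
    return np.maximum(0.0, 1.0 - z)

def composed_loss(w, x, y, poly_coeffs):
    """Hinge loss composed with a polynomial p of the inner product <w, x>:
    [1 - y * p(<w, x>)]_+ .  poly_coeffs lists the coefficients of p in
    increasing degree."""
    p_val = np.polynomial.polynomial.polyval(np.dot(w, x), poly_coeffs)
    return hinge(y * p_val)

# Illustration: p(a) = 3a - 4a^3 is an arbitrary odd cubic, not the polynomial
# actually used in the construction.
coeffs = [0.0, 3.0, 0.0, -4.0]
w = np.array([0.6, 0.8])
x = np.array([0.3, -0.1])
gamma = 0.1
print(composed_loss(w, x, +1, coeffs))        # composed surrogate
print(hinge(+1 * np.dot(w, x) / gamma))       # plain margin-scaled hinge, for comparison
```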