On the Computational Complexity of Deep Learning
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
“Optimization and Statistical Learning”, Les Houches, January 2014
Based on joint work with: Roi Livni and Ohad Shamir, Amit Daniely and Nati Linial, Tong Zhang
Shalev-Shwartz (HU) DL OSL’15 1 / 35
PAC Learning
Goal (informal): Learn an accurate mapping $h : X \to Y$ based on examples $((x_1, y_1), \ldots, (x_n, y_n)) \in (X \times Y)^n$.
PAC learning: Given $\mathcal{H} \subset Y^X$, probably approximately solve
$$\min_{h \in \mathcal{H}} \; \mathbb{P}_{(x,y) \sim \mathcal{D}}[h(x) \neq y],$$
where $\mathcal{D}$ is unknown but the learner can sample $(x, y) \sim \mathcal{D}$.
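A minimal sketch of the objective above, under loud assumptions: the toy distribution, the finite class of threshold functions, and the names `sample`, `error`, and `threshold` are all hypothetical and chosen only for illustration. The learner never sees $\mathcal{D}$ directly; it estimates $\mathbb{P}[h(x) \neq y]$ from samples and picks the best $h$ in the class.

```python
import random

def sample(n, rng):
    """Draw n examples from a toy distribution D (an assumption):
    x is uniform on [0, 1) and y = 1 exactly when x >= 0.3."""
    examples = []
    for _ in range(n):
        x = rng.random()
        examples.append((x, 1 if x >= 0.3 else 0))
    return examples

def error(h, examples):
    """Fraction of examples on which h(x) != y (estimate of the 0-1 risk)."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

def threshold(t):
    """Hypothesis x -> 1[x >= t]."""
    return lambda x: 1 if x >= t else 0

# Hypothetical finite hypothesis class: thresholds 0.0, 0.1, ..., 1.0.
H = [threshold(k / 10) for k in range(11)]

rng = random.Random(0)
S = sample(200, rng)

# Pick the hypothesis in H with the smallest estimated risk on S.
best = min(H, key=lambda h: error(h, S))
```

Because the labeling threshold 0.3 belongs to this toy class, some hypothesis in `H` makes no mistakes on any sample drawn this way.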
What should $\mathcal{H}$ be?
$$\min_{h \in \mathcal{H}} \; \mathbb{P}_{(x,y) \sim \mathcal{D}}[h(x) \neq y]$$
1. Expressiveness: a larger $\mathcal{H}$ gives a smaller minimum.
2. Sample complexity: how many samples are needed to be $\epsilon$-accurate?
3. Computational complexity: how much computation time is needed to be $\epsilon$-accurate?
No Free Lunch: If $\mathcal{H} = Y^X$, then the sample complexity is $\Omega(|X|)$.
Prior Knowledge: We must choose a smaller $\mathcal{H}$ based on prior knowledge about $\mathcal{D}$.
Prior Knowledge
SVM and AdaBoost learn a halfspace on top of features, and most of the practical work goes into finding good features.
This is very strong prior knowledge.
Weaker prior knowledge
Let $\mathcal{H}_T$ be the class of all functions from $\{0,1\}^p$ to $\{0,1\}$ that can be implemented by a Turing machine using at most $T$ operations.
Very expressive class
Sample complexity?
Theorem:
$\mathcal{H}_T$ is contained in the class of neural networks of depth $O(T)$ and size $O(T^2)$.
The sample complexity of this class is $O(T^2)$.
The ultimate hypothesis class
[Figure: hypothesis classes placed along an axis labeled "less prior knowledge, more data": expert system; SVM (use prior knowledge to construct $\phi(x)$ and learn $\langle w, \phi(x) \rangle$); deep neural networks; No Free Lunch.]
Neural Networks
A single neuron with activation function $\sigma : \mathbb{R} \to \mathbb{R}$
[Figure: inputs $x_1, \ldots, x_5$ with weights $v_1, \ldots, v_5$ feed a single neuron whose output is $\sigma(\langle v, x \rangle)$.]
E.g., $\sigma$ is a sigmoidal function.
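The single neuron can be sketched in a few lines. The logistic function is used here as one example of a sigmoidal $\sigma$; that choice, and the names `sigmoid` and `neuron`, are assumptions for illustration.

```python
import math

def sigmoid(a):
    """Logistic function, one common sigmoidal activation (an assumption)."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron(v, x, sigma=sigmoid):
    """Output of a single neuron: sigma applied to the inner product <v, x>."""
    return sigma(sum(vi * xi for vi, xi in zip(v, x)))

# Five inputs, as in the figure; the weights are arbitrary illustrative values.
out = neuron([0.5, -1.0, 0.25, 0.0, 2.0], [1.0, 1.0, 0.0, 3.0, 0.5])
```

With all weights set to zero the inner product is 0, so the logistic neuron outputs exactly 0.5 regardless of the input.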
Neural Networks
A multilayer neural network of depth 3 and size 6
[Figure: a layered network with an input layer ($x_1, \ldots, x_5$), two hidden layers, and an output layer.]
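A layered network simply composes neurons layer by layer. The sketch below assumes layer widths 3, 2, 1 so that the non-input neurons total 6 across 3 layers; the actual widths in the figure are not recoverable from the text, and the names `forward` and `layers` are hypothetical.

```python
import math

def sigmoid(a):
    """Logistic activation (an assumed choice of sigma)."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(layers, x, sigma=sigmoid):
    """Forward pass. layers[t][j] is the weight vector of neuron j in layer t+1;
    each neuron outputs sigma(<weights, previous layer outputs>)."""
    out = list(x)
    for W in layers:
        out = [sigma(sum(w * o for w, o in zip(row, out))) for row in W]
    return out

# Assumed architecture: 5 inputs -> 3 neurons -> 2 neurons -> 1 output neuron.
layers = [
    [[0.0] * 5 for _ in range(3)],
    [[0.0] * 3 for _ in range(2)],
    [[0.0] * 2 for _ in range(1)],
]
depth = len(layers)                # number of layers after the input layer
size = sum(len(W) for W in layers) # number of non-input neurons
y = forward(layers, [1.0, 2.0, 3.0, 4.0, 5.0])
```

Fixing the graph and the activation leaves only the weights free, which is the parameterization used in the complexity discussion.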
Brief history
Neural networks were popular in the 1970s and 1980s.
Then they were displaced by SVM and AdaBoost in the 1990s.
In 2006, several deep architectures with unsupervised pre-training were proposed.
In 2012, Krizhevsky, Sutskever, and Hinton significantly improved the state of the art without unsupervised pre-training.
Since 2012, deep networks have been the state of the art in vision, speech, and more.
Computational Complexity of Deep Learning
By fixing an architecture of a network (underlying graph and activation functions), each network is parameterized by a weight vector $w \in \mathbb{R}^d$, so our goal is to learn the vector $w$.
Empirical Risk Minimization (ERM): Sample $S = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \mathcal{D}^n$ and approximately solve
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell_i(w)$$
Realizable sample: $\exists w^*$ s.t. $\forall i,\ h_{w^*}(x_i) = y_i$
Blum and Rivest 1992: Distinguishing between realizable and unrealizable $S$ is NP-hard even for depth-2 networks with 3 hidden neurons (reduction from $k$-coloring).
Hence, solving the ERM problem is NP-hard even under realizability.
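To make "realizable" concrete, here is a brute-force sketch under strong assumptions: a tiny depth-2 network with two hidden threshold neurons whose outputs are ANDed, and weights restricted to the grid $\{-1, 0, 1\}$. This is not the Blum and Rivest construction, and the hardness result concerns real-valued weights; the names `GRID`, `hidden`, `h`, and `is_realizable` are hypothetical.

```python
from itertools import product

GRID = (-1, 0, 1)  # assumed discretization of the weights, illustration only

def hidden(w, x):
    """Threshold neuron with bias: fires when w1*x1 + w2*x2 + b >= 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] + w[2] >= 0 else 0

def h(w, x):
    """Depth-2 network: two hidden threshold neurons, output is their AND."""
    return hidden(w[0:3], x) & hidden(w[3:6], x)

def empirical_risk(w, S):
    """(1/n) * sum of 0-1 losses of h_w on the sample S."""
    return sum(1 for x, y in S if h(w, x) != y) / len(S)

def is_realizable(S):
    """Exhaustive search: is there a w on the grid with zero empirical risk?"""
    return any(empirical_risk(w, S) == 0 for w in product(GRID, repeat=6))

xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
clash = [((0, 0), 0), ((0, 0), 1)]  # same point with two labels

print(is_realizable(xor), is_realizable(clash))
```

The search visits up to $3^d$ weight vectors for $d$ weights (729 here), which is why exhaustive search stops being an option as the network grows.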
Computational Complexity of Deep Learning
The argument of Pitt and Valiant (1988)
If it is NP-hard to distinguish realizable from unrealizable samples, then properly learning $\mathcal{H}$ is hard (unless RP = NP).
Proof: Run the learning algorithm on the empirical distribution over the sample to get $h \in \mathcal{H}$ with empirical error $< 1/n$. An empirical error below $1/n$ on $n$ examples means $h$ makes no mistakes on the sample.
If $\forall i,\ h(x_i) = y_i$, return “realizable”.
Otherwise, return “unrealizable”.
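The reduction in the proof can be sketched as a wrapper around any proper learner. The toy learner below (ERM over one-dimensional thresholds) is an assumption made only so the sketch runs; the argument itself applies to a randomized learner that succeeds with high probability, which is where RP enters. All names are hypothetical.

```python
import math

def threshold(t):
    """Hypothesis x -> 1[x >= t]."""
    return lambda x: 1 if x >= t else 0

def proper_erm_thresholds(S):
    """Toy proper learner (an assumption): ERM over H = {x -> 1[x >= t]}.
    It always returns a member of H, which is what 'proper' means."""
    candidates = [x for x, _ in S] + [math.inf]

    def mistakes(t):
        h = threshold(t)
        return sum(1 for x, y in S if h(x) != y)

    return threshold(min(candidates, key=mistakes))

def decide_realizable(proper_learner, S):
    """Pitt-Valiant style decision procedure: learn on S, then check consistency."""
    h = proper_learner(S)
    if all(h(x) == y for x, y in S):
        return "realizable"
    return "unrealizable"

print(decide_realizable(proper_erm_thresholds, [(0.1, 0), (0.5, 1), (0.9, 1)]))
print(decide_realizable(proper_erm_thresholds, [(0.1, 1), (0.5, 0)]))
```

The check is sound only because the learner is proper: a consistent $h \in \mathcal{H}$ is itself a witness that the sample is realizable by $\mathcal{H}$.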
Improper Learning
[Figure: the original search space and the new, enlarged search space.]
Allow the learner to output $h \notin \mathcal{H}$.
The argument of Pitt and Valiant fails because the algorithm may return a consistent $h$ even though $S$ is unrealizable by $\mathcal{H}$.
Is deep learning still hard in the improper model?
Hope ...
Generated examples in $\mathbb{R}^{150}$ and passed them through a random depth-2 network that contains 60 hidden neurons with the ReLU activation function.
Tried to fit a new network to this data with over-specification factors of 1, 2, 4, 8.
[Figure: MSE (0 to 4) versus #iterations (0 to $1 \cdot 10^5$), one curve for each over-specification factor 1, 2, 4, 8.]
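A rough sketch of this kind of experiment, not the authors' code. The slide does not state the optimizer, initialization, sample size, or step size, so plain SGD on the squared loss, Gaussian inputs and weights, and every default below are assumptions; the defaults are also far smaller than the 150 dimensions and 60 hidden neurons on the slide so that pure Python stays fast.

```python
import math
import random

def relu(a):
    return a if a > 0.0 else 0.0

def init_net(d, m, rng):
    """Random depth-2 ReLU network: m hidden neurons, one linear output."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(m)]
    u = [rng.gauss(0.0, 1.0 / math.sqrt(m)) for _ in range(m)]
    return W, u

def predict(net, x):
    W, u = net
    return sum(uj * relu(sum(w * xi for w, xi in zip(row, x))) for row, uj in zip(W, u))

def mse(net, data):
    return sum((predict(net, x) - y) ** 2 for x, y in data) / len(data)

def fit(data, d, m, steps, lr, rng, log_every):
    """SGD on 0.5 * (prediction - y)^2; returns the logged MSE curve."""
    net = init_net(d, m, rng)
    W, u = net
    curve = [mse(net, data)]
    for t in range(1, steps + 1):
        x, y = rng.choice(data)
        pre = [sum(w * xi for w, xi in zip(row, x)) for row in W]
        err = sum(uj * relu(a) for uj, a in zip(u, pre)) - y
        for j, a in enumerate(pre):
            if a > 0.0:  # gradients vanish for inactive ReLU units
                uj = u[j]
                u[j] -= lr * err * a
                for i in range(d):
                    W[j][i] -= lr * err * uj * x[i]
        if t % log_every == 0:
            curve.append(mse(net, data))
    return curve

def experiment(d=5, k=3, n=50, steps=200, lr=0.01, factors=(1, 2, 4, 8), seed=0):
    """Label random inputs with a random 'teacher' net of k hidden neurons,
    then fit 'student' nets with factor * k hidden neurons."""
    rng = random.Random(seed)
    teacher = init_net(d, k, rng)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    data = [(x, predict(teacher, x)) for x in xs]
    return {f: fit(data, d, f * k, steps, lr, rng, log_every=50) for f in factors}

curves = experiment()
```

Calling `experiment(d=150, k=60, ...)` with many more steps would be closer in scale to the setting on the slide, at a much higher running time.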
How to show hardness of improper learning?
The argument of Pitt and Valiant fails for improper learning because improper algorithms might perform well on unrealizable samples.
Key Observation:
If a learning algorithm is computationally efficient, its output must come from a class of “small” VC dimension.
Hence, it cannot perform well on “very random” samples.