On the Computational Complexity of Deep Learning
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
"Optimization and Statistical Learning", Les Houches, January 2014
Based on joint work with: Roi Livni and Ohad Shamir, Amit Daniely and Nati Linial, Tong Zhang
Shalev-Shwartz (HU) DL OSL'15 1 / 35

PAC Learning
Goal (informal): Learn an accurate mapping $h : \mathcal{X} \to \mathcal{Y}$ based on examples $((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$.
PAC learning: Given $\mathcal{H} \subset \mathcal{Y}^{\mathcal{X}}$, probably approximately solve
$$\min_{h \in \mathcal{H}} \; \mathbb{P}_{(x,y) \sim \mathcal{D}} \left[ h(x) \neq y \right],$$
where $\mathcal{D}$ is unknown but the learner can sample $(x, y) \sim \mathcal{D}$.

What should be $\mathcal{H}$?
$$\min_{h \in \mathcal{H}} \; \mathbb{P}_{(x,y) \sim \mathcal{D}} \left[ h(x) \neq y \right]$$
1. Expressiveness: larger $\mathcal{H}$ ⇒ smaller minimum.
2. Sample complexity: how many samples are needed to be $\epsilon$-accurate?
3. Computational complexity: how much computational time is needed to be $\epsilon$-accurate?
No Free Lunch: If $\mathcal{H} = \mathcal{Y}^{\mathcal{X}}$ then the sample complexity is $\Omega(|\mathcal{X}|)$.
Prior Knowledge: We must choose a smaller $\mathcal{H}$ based on prior knowledge on $\mathcal{D}$.

Prior Knowledge
SVM and AdaBoost learn a halfspace on top of features, and most of the practical work is on finding good features.
Very strong prior knowledge.

Weaker prior knowledge
Let $\mathcal{H}_T$ be all functions from $\{0,1\}^p \to \{0,1\}$ that can be implemented by a Turing machine using at most $T$ operations.
Very expressive class.
Sample complexity?
Theorem:
$\mathcal{H}_T$ is contained in the class of neural networks of depth $O(T)$ and size $O(T^2)$.
The sample complexity of this class is $O(T^2)$.

The ultimate hypothesis class
[Figure: a spectrum running from "expert system", through "SVM: use prior knowledge to construct $\phi(x)$ and learn $\langle w, \phi(x) \rangle$" and "deep neural networks", to "No Free Lunch"; the direction of the axis is labelled "less prior knowledge, more data".]

Neural Networks
A single neuron with activation function $\sigma : \mathbb{R} \to \mathbb{R}$
[Figure: inputs $x_1, \ldots, x_5$ with weights $v_1, \ldots, v_5$ feeding a single neuron that outputs $\sigma(\langle v, x \rangle)$.]
E.g., $\sigma$ is a sigmoidal function.
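As a small illustration of the single-neuron slide, the sketch below computes $\sigma(\langle v, x \rangle)$ in plain Python. The logistic sigmoid is just one possible sigmoidal choice, and the function names and input values are made up for the example.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, one common choice of sigmoidal activation."""
    return 1.0 / (1.0 + math.exp(-a))


def neuron(v, x, sigma=sigmoid):
    """Return sigma(<v, x>) for weight vector v and input vector x."""
    if len(v) != len(x):
        raise ValueError("v and x must have the same length")
    return sigma(sum(vi * xi for vi, xi in zip(v, x)))


# Five inputs, as in the slide's figure; the numbers are made up.
x = [1.0, 0.0, -1.0, 2.0, 0.5]
zero_weights = [0.0] * 5
out = neuron(zero_weights, x)  # <v, x> = 0, so the output is sigma(0)
```

Passing a different `sigma` (for instance a ReLU) changes the activation without touching the inner product.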

Neural Networks
A multilayer neural network of depth 3 and size 6
[Figure: inputs $x_1, \ldots, x_5$ in the input layer, followed by two hidden layers and an output layer.]
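A rough sketch of a forward pass through such a network follows. The slide does not give the layer widths, so the layout 5 → 3 → 2 → 1 is an assumption, chosen so that there are three layers of neurons and six non-input neurons; reading "size" as the number of non-input neurons is also my assumption. The constant weights are placeholders.

```python
import math


def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(layers, x, sigma=sigmoid):
    """Feed x through the layers; each layer is a list of weight rows, one row per neuron."""
    for W in layers:
        x = [sigma(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return x


def make_layers(widths, value=0.1):
    """Build constant-weight layers for consecutive widths, e.g. [5, 3, 2, 1]."""
    return [[[value] * n_in for _ in range(n_out)]
            for n_in, n_out in zip(widths, widths[1:])]


widths = [5, 3, 2, 1]  # assumed layout: 5 inputs, hidden layers of 3 and 2, 1 output
layers = make_layers(widths)
depth = len(layers)                              # layers of neurons
size = sum(len(W) for W in layers)               # non-input neurons
d = sum(len(row) for W in layers for row in W)   # number of weights (no biases here)
output = forward(layers, [1.0, 2.0, 3.0, 4.0, 5.0])
```

Under this layout the weight count `d` is 5·3 + 3·2 + 2·1 = 23, which is the dimension of the vector a learner would have to fit once the architecture is fixed.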

Brief history
Neural networks were popular in the 70's and 80's.
Then they were superseded by SVM and AdaBoost in the 90's.
In 2006, several deep architectures with unsupervised pre-training were proposed.
In 2012, Krizhevsky, Sutskever, and Hinton significantly improved the state of the art without unsupervised pre-training.
Since 2012, state of the art in vision, speech, and more.

Computational Complexity of Deep Learning
By fixing an architecture of a network (underlying graph and activation functions), each network is parameterized by a weight vector $w \in \mathbb{R}^d$, so our goal is to learn the vector $w$.
Empirical Risk Minimization (ERM): Sample $S = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \mathcal{D}^n$ and approximately solve
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell_i(w)$$
Realizable sample: $\exists w^*$ s.t. $\forall i,\ h_{w^*}(x_i) = y_i$.
Blum and Rivest 1992: Distinguishing between realizable and unrealizable $S$ is NP-hard even for depth-2 networks with 3 hidden neurons (reduction from $k$-coloring).
Hence, solving the ERM problem is NP-hard even under realizability.
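To make the realizability question concrete, here is a toy brute-force check. It is not the Blum and Rivest reduction; it only illustrates what "does some $w$ achieve zero empirical risk?" means, and why naive search scales badly. The sign activation, the weight grid $\{-1, 0, 1\}$, the tiny architectures, and the XOR-style sample are all assumptions made for the example.

```python
from itertools import product


def sign(a):
    """Threshold activation with the convention sign(0) = -1."""
    return 1 if a > 0 else -1


def single_neuron(w, x):
    """Halfspace with bias: sign(w[0]*x[0] + w[1]*x[1] + w[2]), so d = 3."""
    return sign(w[0] * x[0] + w[1] * x[1] + w[2])


def two_hidden_net(w, x):
    """Depth-2 network: two hidden sign neurons and one sign output neuron, d = 9."""
    h1 = sign(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = sign(w[3] * x[0] + w[4] * x[1] + w[5])
    return sign(w[6] * h1 + w[7] * h2 + w[8])


def empirical_risk(predict, w, sample):
    """(1/n) * sum_i l_i(w) with the 0-1 loss l_i(w) = 1[h_w(x_i) != y_i]."""
    return sum(predict(w, x) != y for x, y in sample) / len(sample)


def find_consistent(predict, d, sample, grid=(-1, 0, 1)):
    """Search all len(grid)**d weight vectors for one with zero empirical risk."""
    for w in product(grid, repeat=d):
        if empirical_risk(predict, w, sample) == 0:
            return w
    return None


# XOR-style sample on {-1, +1}^2: label +1 exactly when the coordinates differ.
xor_sample = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]

w_single = find_consistent(single_neuron, 3, xor_sample)  # 3**3 = 27 candidates
w_net = find_consistent(two_hidden_net, 9, xor_sample)    # 3**9 = 19683 candidates
```

The search finds no consistent single neuron but does find consistent weights for the two-hidden-neuron network. The number of candidates grows as `len(grid)**d`, and a grid search can only certify realizability, never rule it out over all of $\mathbb{R}^d$.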

Computational Complexity of Deep Learning
The argument of Pitt and Valiant (1988): If it is NP-hard to distinguish realizable from unrealizable samples, then properly learning $\mathcal{H}$ is hard (unless RP = NP).
Proof: Run the learning algorithm on the empirical distribution over the sample to get $h \in \mathcal{H}$ with empirical error $< 1/n$:
If $\forall i,\ h(x_i) = y_i$, return "realizable".
Otherwise, return "unrealizable".
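The proof above can be sketched as a decision procedure wrapped around any proper learner. In the sketch below the hypothesis class (one-dimensional thresholds) and the toy learner are stand-ins chosen so the code runs; the argument itself concerns an arbitrary class $\mathcal{H}$ and a possibly randomized learner, which this deterministic toy does not capture.

```python
def threshold(t):
    """Hypothesis h_t(x) = 1 if x >= t else 0; the toy class H is all such h_t."""
    return lambda x: 1 if x >= t else 0


def toy_proper_learner(sample):
    """Proper learner for thresholds: ERM over candidate thresholds, always returns some h_t in H."""
    candidates = [x for x, _ in sample] + [float("inf")]

    def errors(t):
        h = threshold(t)
        return sum(h(x) != y for x, y in sample)

    return threshold(min(candidates, key=errors))


def decide_realizable(proper_learner, sample):
    """Pitt-Valiant style test: a proper learner with empirical error < 1/n must be
    consistent on a realizable sample, so consistency of its output decides realizability."""
    h = proper_learner(sample)
    consistent = all(h(x) == y for x, y in sample)
    return "realizable" if consistent else "unrealizable"


realizable_sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
unrealizable_sample = [(0.1, 1), (0.4, 0), (0.6, 1)]

answer_a = decide_realizable(toy_proper_learner, realizable_sample)
answer_b = decide_realizable(toy_proper_learner, unrealizable_sample)
```

The key step is that error below $1/n$ on the uniform distribution over $n$ sample points means zero mistakes, and because the learner is proper its output lies in $\mathcal{H}$, so a consistent output proves the sample is realizable.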

Improper Learning
[Figure: the original search space and the new, larger search space.]
Allow the learner to output $h \notin \mathcal{H}$.
The argument of Pitt and Valiant fails because the algorithm may return a consistent $h$ even though $S$ is unrealizable by $\mathcal{H}$.
Is deep learning still hard in the improper model?

Hope ...
Generated examples in $\mathbb{R}^{150}$ and passed them through a random depth-2 network that contains 60 hidden neurons with the ReLU activation function.
Tried to fit a new network to this data with over-specification factors of 1, 2, 4, 8.
[Plot: MSE (0 to 4) against #iterations (0 to $1 \cdot 10^5$), one curve per over-specification factor 1, 2, 4, 8.]
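A scaled-down sketch of this kind of experiment is given below. The slide does not state the training algorithm, sample size, step size, or initialization, so plain SGD on the squared loss, Gaussian inputs, and every constant here are assumptions. The dimensions are also reduced (10 inputs and 4 teacher neurons instead of 150 and 60) so that the pure-Python code finishes quickly.

```python
import random


def relu(a):
    return a if a > 0 else 0.0


def pre_activations(W, x):
    """Inner products <W[j], x> for every hidden neuron j."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def predict(W, u, x):
    """Depth-2 ReLU network: sum_j u[j] * relu(<W[j], x>)."""
    return sum(uj * relu(p) for uj, p in zip(u, pre_activations(W, x)))


def random_net(rng, dim, hidden):
    """Random Gaussian weights, scaled so outputs stay of order one."""
    W = [[rng.gauss(0, 1) / dim ** 0.5 for _ in range(dim)] for _ in range(hidden)]
    u = [rng.gauss(0, 1) / hidden ** 0.5 for _ in range(hidden)]
    return W, u


def mse(W, u, data):
    return sum((predict(W, u, x) - y) ** 2 for x, y in data) / len(data)


def sgd_fit(data, dim, hidden, steps, lr, rng):
    """Plain SGD on the squared loss 0.5 * err**2, starting from a random network."""
    W, u = random_net(rng, dim, hidden)
    for _ in range(steps):
        x, y = rng.choice(data)
        pre = pre_activations(W, x)
        err = sum(uj * relu(p) for uj, p in zip(u, pre)) - y
        for j, p in enumerate(pre):
            if p > 0:  # inactive ReLU units receive zero gradient
                grad_row = err * u[j]  # uses u[j] before it is updated
                u[j] -= lr * err * p
                W[j] = [w - lr * grad_row * xi for w, xi in zip(W[j], x)]
    return W, u


rng = random.Random(0)
DIM, HIDDEN, N = 10, 4, 200  # scaled down; the slide used 150 inputs and 60 hidden neurons
teacher_W, teacher_u = random_net(rng, DIM, HIDDEN)
inputs = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
data = [(x, predict(teacher_W, teacher_u, x)) for x in inputs]

results = {}
for factor in (1, 2, 4, 8):  # over-specification factors from the slide
    W, u = sgd_fit(data, DIM, factor * HIDDEN, steps=2000, lr=0.01, rng=rng)
    results[factor] = mse(W, u, data)
```

`results` maps each over-specification factor to the final training MSE of the fitted network. The sketch makes no claim about reproducing the curves on the slide; that would need the original dimensions, far more iterations, and the author's actual training setup.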

How to show hardness of improper learning?
The argument of Pitt and Valiant fails for improper learning because improper algorithms might perform well on unrealizable samples.
Key Observation:
If a learning algorithm is computationally efficient, its output must come from a class of "small" VC dimension.
Hence, it cannot perform well on "very random" samples.
