  1. MLCC 2017 Deep Learning Lorenzo Rosasco UNIGE-MIT-IIT June 29, 2017

  2. What? Classification. Object classification: what's in this image? Note: beyond vision, one can classify graphs, strings, networks, time-series, ...

  3. What makes the problem hard? ◮ Viewpoint ◮ Semantic variability. Note: identification vs. categorization ...

  4. Categorization: a learning approach. [Figure: training images labeled mug / remote; test images to be labeled mug or remote.]

  5. Supervised learning. Given $(x_1, y_1), \ldots, (x_n, y_n)$, find $f$ such that $\mathrm{sign}\, f(x_{\mathrm{new}}) = y_{\mathrm{new}}$. ◮ $x \in \mathbb{R}^D$ a vectorization of an image ◮ $y = \pm 1$ a label (mug/remote)

  6. Learning and data representation. Consider $f(x) = w^\top \Phi(x)$. A two-step learning scheme is often considered: ◮ supervised learning of $w$ ◮ expert design or unsupervised learning of the data representation $\Phi$

  7. Data representation: $\Phi : \mathbb{R}^D \to \mathbb{R}^p$, a mapping of the data into a new format better suited for further processing.

  8. Data representation by design. Dictionaries of features: ◮ Wavelets & friends ◮ SIFT, HOG, etc. Kernels: ◮ Classic off-the-shelf: Gaussian $K(x, x') = e^{-\gamma \|x - x'\|^2}$ ◮ Structured input: kernels on histograms, graphs, etc.
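
As a hypothetical illustration of the Gaussian kernel above (not part of the slides), here is a short NumPy sketch that evaluates the kernel between all pairs of points; the bandwidth gamma and the random data are assumptions.

```python
import numpy as np

def gaussian_kernel(X, Xp, gamma=1.0):
    """Gaussian kernel K(x, x') = exp(-gamma * ||x - x'||^2) between all row pairs."""
    # Pairwise squared Euclidean distances between rows of X and rows of Xp.
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Xp**2, axis=1)[None, :]
                - 2 * X @ Xp.T)
    return np.exp(-gamma * sq_dists)

# Example: Gram matrix of 5 random points in R^3 (5 x 5, ones on the diagonal).
X = np.random.randn(5, 3)
K = gaussian_kernel(X, X, gamma=0.5)
```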

  9. In practice all is multi-layer! (an old slide) Data representation schemes, e.g. in vision and speech, involve multiple layers. Pipeline: raw data are often processed by ◮ first computing some low-level features, ◮ then learning some mid-level representation, ◮ ... ◮ finally using supervised learning. These stages are often done separately: ◮ a good way to exploit unlabelled data... ◮ but is it possible to design end-to-end learning systems?

  10. In practice all is deep learning! (updated slide) Data representation schemes, e.g. in vision and speech, involve deep learning. Pipeline: ◮ Design some wild but "differentiable" hierarchical architecture. ◮ Proceed with end-to-end learning!! Architecture (rather than feature) engineering.

  11. Road Map. Part I: Basic neural networks ◮ Neural network definition ◮ Optimization, approximation and statistics. Part II: One step beyond ◮ Auto-encoders ◮ Convolutional neural networks ◮ Tips and tricks

  12. Part I: Basic Neural Networks

  13. Shallow nets: $f(x) = w^\top \Phi(x)$, with $x \mapsto \Phi(x)$ fixed. Examples: ◮ Dictionaries $\Phi(x) = \cos(B^\top x) = (\cos(\beta_1^\top x), \ldots, \cos(\beta_p^\top x))$ with $B = (\beta_1, \ldots, \beta_p)$ fixed frequencies. ◮ Kernel methods $\Phi(x) = (e^{-\|\beta_1 - x\|^2}, \ldots, e^{-\|\beta_n - x\|^2})$ with $\beta_1 = x_1, \ldots, \beta_n = x_n$ the input points.
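
A minimal sketch of the two fixed feature maps above (my own illustration, not from the slides; the random frequencies B and the choice of the training points as centers are assumptions):

```python
import numpy as np

def cosine_dictionary_features(X, B):
    """Dictionary features Phi(x) = cos(B^T x) for a fixed frequency matrix B (D x p)."""
    return np.cos(X @ B)

def gaussian_features(X, centers):
    """Kernel-style features Phi(x) = (exp(-||beta_1 - x||^2), ..., exp(-||beta_n - x||^2))."""
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists)

D, p = 10, 50
X = np.random.randn(100, D)                 # n = 100 inputs in R^D
B = np.random.randn(D, p)                   # fixed random frequencies beta_1, ..., beta_p
Phi_cos = cosine_dictionary_features(X, B)  # shape (100, p)
Phi_ker = gaussian_features(X, X)           # centers = the inputs themselves, shape (100, n)
```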

  14. Shallow nets (cont.): $f(x) = w^\top \Phi(x)$, with $x \mapsto \Phi(x)$ fixed. Empirical Risk Minimization (ERM): $\min_w \sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2$. Note: the function $f$ depends linearly on $w$, so the ERM problem is convex!
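
Since $f$ is linear in $w$, this ERM problem is ordinary least squares and can be solved directly. A minimal sketch (the features Phi and labels y below are random placeholders standing in for the fixed representation and the data):

```python
import numpy as np

Phi = np.random.randn(100, 50)      # (n, p) matrix with rows Phi(x_i); placeholder features
y = np.sign(np.random.randn(100))   # labels in {-1, +1}; placeholder

# Minimize sum_i (y_i - w^T Phi(x_i))^2 via least squares.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

predictions = np.sign(Phi @ w)      # sign(f(x)) gives the predicted class
```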

  15. Interlude: optimization by Gradient Descent (GD). Batch gradient descent: $w_{t+1} = w_t - \gamma \nabla_w \widehat{E}(w_t)$, where $\widehat{E}(w) = \sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2$, so that $\nabla_w \widehat{E}(w) = -2 \sum_{i=1}^n \Phi(x_i)^\top (y_i - w^\top \Phi(x_i))$. ◮ Constant step-size depending on the curvature (Hessian norm) ◮ It is a descent method.
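
A minimal batch gradient descent sketch for the least-squares objective above (my own; the constant step-size is hand-picked here rather than derived from the Hessian norm):

```python
import numpy as np

def gradient_descent(Phi, y, gamma=1e-3, n_iter=500):
    """Batch GD on E(w) = sum_i (y_i - w^T Phi(x_i))^2."""
    n, p = Phi.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        residuals = y - Phi @ w         # y_i - w^T Phi(x_i), for all i
        grad = -2 * Phi.T @ residuals   # gradient of the empirical risk
        w = w - gamma * grad            # descent step with constant step-size
    return w

Phi = np.random.randn(100, 50)
y = np.sign(np.random.randn(100))
w_gd = gradient_descent(Phi, y)
```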

  16. Gradient descent illustrated

  17. Stochastic gradient descent (SGD): $w_{t+1} = w_t + 2 \gamma_t \Phi(x_t)^\top (y_t - w_t^\top \Phi(x_t))$. Compare to $w_{t+1} = w_t + 2 \gamma_t \sum_{i=1}^n \Phi(x_i)^\top (y_i - w_t^\top \Phi(x_i))$. ◮ Decaying step-size $\gamma_t = 1/\sqrt{t}$ ◮ Lower iteration cost ◮ It is not a descent method (SG"D"?) ◮ Multiple passes (epochs) over the data are needed.
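
The corresponding SGD sketch under the same placeholder assumptions: one example per update, the data swept in random order over several epochs, step-size decaying as $1/\sqrt{t}$.

```python
import numpy as np

def sgd(Phi, y, n_epochs=20):
    """SGD on E(w) = sum_i (y_i - w^T Phi(x_i))^2, one example per update."""
    n, p = Phi.shape
    w = np.zeros(p)
    t = 1
    for _ in range(n_epochs):                  # multiple passes (epochs) over the data
        for i in np.random.permutation(n):
            gamma_t = 1.0 / np.sqrt(t)         # decaying step-size
            w = w + 2 * gamma_t * Phi[i] * (y[i] - Phi[i] @ w)
            t += 1
    return w

Phi = np.random.randn(100, 50)
y = np.sign(np.random.randn(100))
w_sgd = sgd(Phi, y)
```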

  18. SGD vs GD

  19. Summary so far. Given data $(x_1, y_1), \ldots, (x_n, y_n)$ and a fixed representation $\Phi$: ◮ Consider $f(x) = w^\top \Phi(x)$ ◮ Find $w$ by SGD: $w_{t+1} = w_t + 2 \gamma_t \Phi(x_t)^\top (y_t - w_t^\top \Phi(x_t))$. Can we jointly learn $\Phi$?

  20. Neural Nets. Basic idea: compose simply parameterized representations $\Phi = \Phi_L \circ \cdots \circ \Phi_2 \circ \Phi_1$. Let $d_0 = D$ and $\Phi_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$, $\ell = 1, \ldots, L$, and in particular $\Phi_\ell = \sigma \circ W_\ell$, $\ell = 1, \ldots, L$, where $W_\ell : \mathbb{R}^{d_{\ell-1}} \to \mathbb{R}^{d_\ell}$ is linear/affine and $\sigma : \mathbb{R} \to \mathbb{R}$ is a non-linear map acting component-wise.
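
A minimal sketch of this composition (my own illustration; the layer widths, the random weights and the choice of ReLU as $\sigma$ are assumptions):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)            # one possible component-wise sigma

def representation(x, weights, sigma=relu):
    """Phi(x) = Phi_L(...(Phi_1(x))...), with Phi_l = sigma o W_l."""
    h = x
    for W in weights:                    # W_l has shape (d_l, d_{l-1})
        h = sigma(W @ h)
    return h

dims = [10, 32, 32, 16]                  # d_0 = D = 10, then d_1, d_2, d_3
weights = [np.random.randn(dims[l + 1], dims[l]) * 0.1 for l in range(len(dims) - 1)]
x = np.random.randn(dims[0])
phi_x = representation(x, weights)       # Phi(x) in R^{16}
```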

  21. Deep neural nets: $f(x) = w^\top \Phi^L(x)$, with $\Phi^L = \Phi_L \circ \cdots \circ \Phi_1$ the compositional representation, $\Phi_1 = \sigma \circ W_1, \ldots, \Phi_L = \sigma \circ W_L$. ERM: $\min_{w, (W_j)_j} \frac{1}{n} \sum_{i=1}^n (y_i - w^\top \Phi^L(x_i))^2$.

  22. Neural network jargon: $\Phi^L(x) = \sigma(W_L \cdots \sigma(W_2 \sigma(W_1 x)))$. ◮ Each intermediate representation corresponds to a (hidden) layer ◮ The dimensionalities $(d_\ell)_\ell$ correspond to the numbers of hidden units ◮ The non-linearity $\sigma$ is called the activation function.

  23. Neural networks & neurons. [Figure: a single neuron computing $W_j^\top x = \sum_{t=1}^{3} W_j^t x_t$ from inputs $x_1, x_2, x_3$ with weights $W_j^1, W_j^2, W_j^3$; "hi, i am a neuron".] ◮ Each neuron computes an inner product based on a column of a weight matrix $W$ ◮ The non-linearity $\sigma$ is the neuron activation function.

  24. Deep neural networks. [Figure: a multi-layer network built from such neurons, each computing $W_j^\top x = \sum_{t=1}^{3} W_j^t x_t$ followed by the activation.]

  25. Activation functions. For $\alpha \in \mathbb{R}$ consider: ◮ sigmoid $s(\alpha) = 1/(1 + e^{-\alpha})$, ◮ hyperbolic tangent $s(\alpha) = (e^\alpha - e^{-\alpha})/(e^\alpha + e^{-\alpha})$, ◮ ReLU $s(\alpha) = |\alpha|_+$ (aka ramp, hinge), ◮ softplus $s(\alpha) = \log(1 + e^\alpha)$.
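
The four activations above as NumPy one-liners (a small illustrative sketch; softplus uses log1p for numerical stability):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)                    # (e^a - e^-a) / (e^a + e^-a)

def relu(a):
    return np.maximum(a, 0.0)            # |a|_+ , the ramp / hinge

def softplus(a):
    return np.log1p(np.exp(a))           # log(1 + e^a)
```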

  26. Some questions. $f_{w,(W_\ell)_\ell}(x) = w^\top \Phi_{(W_\ell)_\ell}(x)$, $\Phi_{(W_\ell)_\ell}(x) = \sigma(W_L \cdots \sigma(W_2 \sigma(W_1 x)))$. We have our model, but: ◮ Optimization: can we train efficiently? ◮ Approximation: are we dealing with rich models? ◮ Statistics: how hard is it to generalize from finite data?

  27. Neural network function spaces. Consider the non-linear space of functions of the form $f_{w,(W_\ell)_\ell} : \mathbb{R}^D \to \mathbb{R}$, $f_{w,(W_\ell)_\ell}(x) = w^\top \Phi_{(W_\ell)_\ell}(x)$, $\Phi_{(W_\ell)_\ell}(x) = \sigma(W_L \cdots \sigma(W_2 \sigma(W_1 x)))$, where $w, (W_\ell)_\ell$ may vary. Very little structure... but we can: ◮ train by gradient descent (next) ◮ get (some) approximation/statistical guarantees (later)

  28. One-layer neural networks. Consider only one hidden layer: $f_{w,W}(x) = w^\top \sigma(Wx) = \sum_{j=1}^u w_j \sigma(x^\top W_j)$, and ERM again: $\sum_{i=1}^n (y_i - f_{w,W}(x_i))^2$.
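
A minimal sketch of the one-hidden-layer model (my own; the number of hidden units u, the sigmoid activation and the random weights are assumptions, just to make it runnable):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def f(x, w, W, sigma=sigmoid):
    """One-hidden-layer net: f(x) = w^T sigma(W x) = sum_j w_j sigma(W_j^T x)."""
    return w @ sigma(W @ x)

D, u = 10, 20                        # input dimension and number of hidden units
W = np.random.randn(u, D) * 0.1      # hidden-layer weights, one row W_j per hidden unit
w = np.random.randn(u) * 0.1         # output weights
x = np.random.randn(D)
print(f(x, w, W))                    # a scalar; sign(f(x)) would give the predicted label
```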

  29. Computations. Consider $\min_{w,W} \widehat{E}(w, W)$, $\widehat{E}(w, W) = \sum_{i=1}^n (y_i - f_{(w,W)}(x_i))^2$. The problem is non-convex! (possibly smooth, depending on $\sigma$)

  30. Back-propagation & GD. Empirical risk minimization: $\min_{w,W} \widehat{E}(w, W)$, $\widehat{E}(w, W) = \sum_{i=1}^n (y_i - f_{(w,W)}(x_i))^2$. An approximate minimizer is computed via the following gradient method: $w_j^{t+1} = w_j^t - \gamma_t \frac{\partial \widehat{E}}{\partial w_j}(w^t, W^t)$, $W_{j,k}^{t+1} = W_{j,k}^t - \gamma_t \frac{\partial \widehat{E}}{\partial W_{j,k}}(w^{t+1}, W^t)$, where the step-size $(\gamma_t)_t$ is often called the learning rate.

  31. Back-propagation & chain rule. Direct computations show that: $\frac{\partial \widehat{E}}{\partial w_j}(w, W) = -2 \sum_{i=1}^n \underbrace{(y_i - f_{(w,W)}(x_i))}_{\Delta_{j,i}} h_{j,i}$ and $\frac{\partial \widehat{E}}{\partial W_{j,k}}(w, W) = -2 \sum_{i=1}^n \underbrace{(y_i - f_{(w,W)}(x_i))\, w_j \sigma'(W_j^\top x_i)}_{\eta_{i,k}} x_i^k$, where $h_{j,i} = \sigma(W_j^\top x_i)$ is the activation of hidden unit $j$ on input $x_i$. Back-prop equation: $\eta_{i,k} = \Delta_{j,i}\, w_j \sigma'(W_j^\top x_i)$. Using these equations, the updates are performed in two steps: ◮ Forward pass: compute the function values keeping the weights fixed ◮ Backward pass: compute the errors and propagate them ◮ Then the weights are updated.
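
A sketch translating these gradient formulas into NumPy for the one-hidden-layer net (my own rendering; full-batch, sigmoid activation assumed so that $\sigma' = \sigma(1 - \sigma)$):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradients(X, y, w, W):
    """Gradients of E(w, W) = sum_i (y_i - w^T sigma(W x_i))^2 for a one-hidden-layer net."""
    A = X @ W.T                          # pre-activations W_j^T x_i, shape (n, u)
    H = sigmoid(A)                       # h_{j,i} = sigma(W_j^T x_i), shape (n, u)
    residuals = y - H @ w                # Delta_i = y_i - f(x_i), shape (n,)
    grad_w = -2 * H.T @ residuals        # dE/dw_j = -2 sum_i Delta_i h_{j,i}
    sigma_prime = H * (1 - H)            # sigma'(W_j^T x_i) for the sigmoid
    # dE/dW_{j,k} = -2 sum_i Delta_i w_j sigma'(W_j^T x_i) x_i^k
    grad_W = -2 * (residuals[:, None] * sigma_prime * w[None, :]).T @ X
    return grad_w, grad_W
```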

  32. SGD is typically preferred: $w_j^{t+1} = w_j^t + 2\gamma_t (y_t - f_{(w^t,W^t)}(x_t))\, h_{j,t}$, $W_{j,k}^{t+1} = W_{j,k}^t + 2\gamma_t (y_t - f_{(w^{t+1},W^t)}(x_t))\, w_j \sigma'(W_j^\top x_t)\, x_t^k$.
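
Putting the pieces together, a sketch of an SGD training loop for the one-hidden-layer net (my own; single-example updates, step-size decaying as $1/\sqrt{t}$, sigmoid activation, random data only to make it runnable):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_sgd(X, y, u=20, n_epochs=50):
    n, D = X.shape
    W = np.random.randn(u, D) * 0.1      # hidden weights
    w = np.random.randn(u) * 0.1         # output weights
    t = 1
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            gamma = 1.0 / np.sqrt(t)
            h = sigmoid(W @ X[i])                    # hidden activations h_{j,t}
            delta = y[i] - w @ h                     # error on the current example
            w = w + 2 * gamma * delta * h            # update output weights
            W = W + 2 * gamma * delta * np.outer(w * h * (1 - h), X[i])  # update hidden weights
            t += 1
    return w, W

X = np.random.randn(200, 10)
y = np.sign(np.random.randn(200))
w, W = train_sgd(X, y)
```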

  33. Non-convexity and SGD

  34. A few remarks ◮ Optimization by gradient methods, typically SGD ◮ Online update rules are potentially biologically plausible (Hebbian learning rules describing neuron plasticity) ◮ Multiple layers can be handled analogously ◮ Multiple step-sizes, one per layer, can be considered ◮ Initialization is tricky (more later) ◮ NO convergence guarantees ◮ More tricks later

  35. Some questions ◮ What is the benefit of multiple layers? ◮ Why does stochastic gradient seem to work?

  36. Wrapping up Part I ◮ Learning a classifier and a representation ◮ From shallow to deep learning ◮ SGD and backpropagation

  37. Coming up ◮ Autoencoders and unsupervised data? ◮ Convolutional neural networks ◮ Tricks and tips

  38. Part II: One Step Beyond

  39. Unsupervised learning with neural networks ◮ Because unlabeled data abound ◮ Because the obtained weights can be used to initialize supervised learning (pre-training)

  40. Auto-encoders. [Figure: input $x$, hidden representation $Wx$, reconstructed output $x$.] ◮ A neural network with one input layer, one output layer and one (or more) hidden layers connecting them. ◮ The output layer has as many nodes as the input layer. ◮ It is trained to reconstruct the input rather than predict some target output.
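
A minimal one-hidden-layer auto-encoder sketch (my own illustration; untied encoder/decoder weights, sigmoid activation and full-batch gradient descent on the mean reconstruction error are all assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, k=5, gamma=0.1, n_iter=1000):
    """Minimize (1/n) sum_i ||x_i - W2 sigma(W1 x_i)||^2: encode to R^k, decode back to R^D."""
    n, D = X.shape
    W1 = np.random.randn(k, D) * 0.1      # encoder weights
    W2 = np.random.randn(D, k) * 0.1      # decoder weights
    for _ in range(n_iter):
        H = sigmoid(X @ W1.T)             # hidden codes, shape (n, k)
        R = H @ W2.T                      # reconstructions, shape (n, D)
        err = R - X                       # the network predicts its own input
        grad_W2 = (2.0 / n) * err.T @ H
        grad_W1 = (2.0 / n) * ((err @ W2) * H * (1 - H)).T @ X
        W1 -= gamma * grad_W1
        W2 -= gamma * grad_W2
    return W1, W2

X = np.random.randn(200, 20)
W1, W2 = train_autoencoder(X)             # W1 could then initialize a supervised net (pre-training)
```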
