

  1. Neural Networks Hugo Larochelle ( @hugo_larochelle ) Google Brain

  2. 2 NEURAL NETWORK ONLINE COURSE http://info.usherbrooke.ca/hlarochelle/neural_networks Topics: online videos ‣ for a more detailed description of neural networks… ‣ … and much more!


  4. 3 NEURAL NETWORKS • What we'll cover:
‣ how neural networks take an input x and make a prediction f(x)
- forward propagation
- types of units
‣ how to train neural nets (classifiers) on data
- loss function
- backpropagation
- gradient descent algorithms
- tricks of the trade
‣ deep learning
- unsupervised pre-training
- dropout
- batch normalization

  5. Neural Networks: Making predictions with feedforward neural networks

  6. 5 ARTIFICIAL NEURON Topics: connection weights, bias, activation function
• Neuron pre-activation (or input activation): a(x) = b + Σ_i w_i x_i = b + w^T x
• Neuron (output) activation: h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
• w are the connection weights
• b is the neuron bias
• g(·) is called the activation function
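As an illustration of the pre-activation and activation above, here is a minimal sketch of a single artificial neuron in NumPy; the weight, bias, and input values are made up for the example, and the sigmoid is just one possible choice of g(·):

```python
import numpy as np

def sigm(a):
    # sigmoid activation g(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

# example values (chosen arbitrarily for illustration)
w = np.array([0.5, -1.2, 0.3])   # connection weights
b = 0.1                          # neuron bias
x = np.array([1.0, 0.0, 2.0])    # input

a = b + w @ x        # pre-activation a(x) = b + w^T x
h = sigm(a)          # output activation h(x) = g(a(x))
print(a, h)
```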

  7. 6 ARTIFICIAL NEURON Topics: connection weights, bias, activation function
[Figure: output y of a single neuron over a 2D input (x1, x2) with weights w and bias b; the range of the output is determined by g(·), and the bias b only changes the position of the ridge] (from Pascal Vincent's slides)

  8. 7 CAPACITY OF NEURAL NETWORK Topics: single hidden layer neural network
[Figure: a single hidden layer neural network (input layer x1, x2 with bias units, hidden units y1, y2, output z_k) with weights w_ji and w_kj, and the decision regions it produces in the (x1, x2) plane] (from Pascal Vincent's slides)

  9. 8 CAPACITY OF NEURAL NETWORK Topics: single hidden layer neural network
[Figure: a single hidden layer network with hidden units y1, ..., y4 combining into output z1, and the resulting decision region in the (x1, x2) plane] (from Pascal Vincent's slides)

  10. 9 CAPACITY OF NEURAL NETWORK Topics: single hidden layer neural network
[Figure: a three-layer network and the decision regions R1, R2 it carves out in the (x1, x2) plane] (from Pascal Vincent's slides)

  11. 10 CAPACITY OF NEURAL NETWORK Topics: universal approximation • Universal approximation theorem (Hornik, 1991) : ‣ ‘‘a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units’’ • The result applies for sigmoid, tanh and many other hidden layer activation functions • This is a good result, but it doesn’t mean there is a learning algorithm that can find the necessary parameter values!

  12. 11 NEURAL NETWORK Topics: multilayer neural network
• Could have L hidden layers:
‣ layer pre-activation for k > 0 (with h^(0)(x) = x): a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)
‣ hidden layer activation (k from 1 to L): h^(k)(x) = g(a^(k)(x))
‣ output layer activation (k = L+1): h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
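A minimal sketch of these recursions in NumPy, assuming sigmoid hidden activations and a softmax output (that combination is an assumption; the parameter lists are placeholders):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))   # shift by the max for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    """Forward propagation through L hidden layers plus the output layer.

    weights = [W^(1), ..., W^(L+1)], biases = [b^(1), ..., b^(L+1)].
    """
    h = x                                   # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h                       # a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)
        h = sigm(a)                         # h^(k)(x) = g(a^(k)(x))
    a_out = biases[-1] + weights[-1] @ h    # a^(L+1)(x)
    return softmax(a_out)                   # f(x) = o(a^(L+1)(x))
```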

  13. 12 ACTIVATION FUNCTION Topics: sigmoid activation function
• Squashes the neuron's pre-activation between 0 and 1
• Always positive
• Bounded
• Strictly increasing
• g(a) = sigm(a) = 1 / (1 + exp(-a))

  14. 13 ACTIVATION FUNCTION Topics: hyperbolic tangent (‘‘tanh’’) activation function
• Squashes the neuron's pre-activation between -1 and 1
• Can be positive or negative
• Bounded
• Strictly increasing
• g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)

  15. 14 ACTIVATION FUNCTION Topics: rectified linear activation function
• Bounded below by 0 (always non-negative)
• Not upper bounded
• Strictly increasing
• Tends to give neurons with sparse activities
• g(a) = reclin(a) = max(0, a)
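The three activation functions from the preceding slides, written out as a small NumPy sketch:

```python
import numpy as np

def sigm(a):
    # sigmoid: squashes the pre-activation between 0 and 1
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # hyperbolic tangent: squashes the pre-activation between -1 and 1
    return np.tanh(a)

def reclin(a):
    # rectified linear: max(0, a), not upper bounded, gives sparse activities
    return np.maximum(0.0, a)
```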

  16. 15 ACTIVATION FUNCTION Topics: softmax activation function
• For multi-class classification:
‣ we need multiple outputs (1 output per class)
‣ we would like to estimate the conditional probability p(y = c | x)
• We use the softmax activation function at the output:
o(a) = softmax(a) = [ exp(a_1) / Σ_c exp(a_c), ..., exp(a_C) / Σ_c exp(a_c) ]^T
‣ strictly positive
‣ sums to one
• Predicted class is the one with highest estimated probability
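A minimal sketch of the softmax in NumPy; subtracting the maximum before exponentiating is a common numerical-stability trick, not something stated on the slide:

```python
import numpy as np

def softmax(a):
    # subtracting the max keeps exp() from overflowing and does not change the result
    e = np.exp(a - np.max(a))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 0.5]))
print(p, p.sum())                 # strictly positive, sums to one
print(int(np.argmax(p)))          # predicted class = highest estimated probability
```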

  17. 16 FLOW GRAPH Topics: flow graph
• Forward propagation can be represented as an acyclic flow graph
• It's a nice way of implementing forward propagation in a modular way
‣ each box could be an object with an fprop method, that computes the value of the box given its parents
‣ calling the fprop method of each box in the right order yields forward propagation
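As a sketch of this modular view: the class names and the exact fprop interface below are assumptions, meant only to illustrate "each box computes its value from its parents":

```python
import numpy as np

class Linear:
    """Box computing a(x) = b + W x from its parent's value."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, parent_value):
        return self.b + self.W @ parent_value

class Sigmoid:
    """Box computing h = sigm(a) from its parent's value."""
    def fprop(self, parent_value):
        return 1.0 / (1.0 + np.exp(-parent_value))

# calling fprop on each box in the right (topological) order = forward propagation
def run_graph(x, boxes):
    value = x
    for box in boxes:
        value = box.fprop(value)
    return value
```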

  18. Neural Networks: Training feedforward neural networks

  19. 18 MACHINE LEARNING Topics: empirical risk minimization, regularization
• Empirical (structural) risk minimization
‣ framework to design learning algorithms:
arg min_θ (1/T) Σ_t l(f(x^(t); θ), y^(t)) + λ Ω(θ)
‣ l(f(x^(t); θ), y^(t)) is a loss function
‣ Ω(θ) is a regularizer (penalizes certain values of θ)
• Learning is cast as optimization
‣ ideally, we'd optimize classification error, but it's not smooth
‣ the loss function is a surrogate for what we truly should optimize (e.g. an upper bound)

  20. 19 MACHINE LEARNING Topics: stochastic gradient descent (SGD)
• Algorithm that performs updates after each example
‣ initialize θ ≡ { W^(1), b^(1), ..., W^(L+1), b^(L+1) }
‣ for N epochs (training epoch = iteration over all examples)
- for each training example (x^(t), y^(t)):
Δ = -∇_θ l(f(x^(t); θ), y^(t)) - λ ∇_θ Ω(θ)
θ ← θ + α Δ
• To apply this algorithm to neural network training, we need
‣ the loss function l(f(x^(t); θ), y^(t))
‣ a procedure to compute the parameter gradients ∇_θ l(f(x^(t); θ), y^(t))
‣ the regularizer Ω(θ) (and the gradient ∇_θ Ω(θ))
‣ an initialization method for θ
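A minimal sketch of this update loop in NumPy; the `grad_loss` and `grad_regularizer` callables are placeholders standing in for the gradient procedures the slide says we need:

```python
def sgd(theta, data, grad_loss, grad_regularizer, alpha=0.01, lam=0.0, n_epochs=10):
    """theta: list of parameter arrays; data: list of (x, y) training examples."""
    for epoch in range(n_epochs):                 # one epoch = pass over all examples
        for x, y in data:                         # update after each example
            g_loss = grad_loss(theta, x, y)       # ∇_θ l(f(x; θ), y)
            g_reg = grad_regularizer(theta)       # ∇_θ Ω(θ)
            for p, gl, gr in zip(theta, g_loss, g_reg):
                p += alpha * (-gl - lam * gr)     # θ ← θ + α Δ
    return theta
```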

  21. 20 LOSS FUNCTION Topics: loss function for classification
• Neural network estimates f(x)_c = p(y = c | x)
‣ we could maximize the probability of y^(t) given x^(t) in the training set
• To frame learning as minimization, we minimize the negative log-likelihood
l(f(x), y) = -Σ_c 1_(y = c) log f(x)_c = -log f(x)_y
‣ we take the (natural) log for numerical stability and mathematical simplicity
‣ sometimes referred to as the cross-entropy loss
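The negative log-likelihood written out as a small sketch, applied to a softmax output f(x); the probability values are made up for the example:

```python
import numpy as np

def nll_loss(f_x, y):
    """l(f(x), y) = -log f(x)_y, where f_x is the vector of class probabilities
    and y is the index of the correct class."""
    return -np.log(f_x[y])

f_x = np.array([0.1, 0.7, 0.2])   # example softmax output (made-up values)
print(nll_loss(f_x, y=1))         # -log 0.7
```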

  22. 21 BACKPROPAGATION Topics: backpropagation algorithm
• Use the chain rule to efficiently compute gradients, top to bottom
‣ compute the output gradient (before activation):
∇_{a^(L+1)(x)} [-log f(x)_y] = -(e(y) - f(x))
‣ for k from L+1 to 1:
- compute the gradients of the hidden layer parameters:
∇_{W^(k)} [-log f(x)_y] = ( ∇_{a^(k)(x)} [-log f(x)_y] ) h^(k-1)(x)^T
∇_{b^(k)} [-log f(x)_y] = ∇_{a^(k)(x)} [-log f(x)_y]
- compute the gradient of the hidden layer below:
∇_{h^(k-1)(x)} [-log f(x)_y] = W^(k)^T ( ∇_{a^(k)(x)} [-log f(x)_y] )
- compute the gradient of the hidden layer below (before activation):
∇_{a^(k-1)(x)} [-log f(x)_y] = ( ∇_{h^(k-1)(x)} [-log f(x)_y] ) ⊙ [ ..., g'(a^(k-1)(x)_j), ... ]
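A compact sketch of these updates for a network with sigmoid hidden layers and a softmax output; that choice of activations is an assumption, and the gradient expressions follow the formulas above:

```python
import numpy as np

def backprop(x, y, weights, biases):
    """Return gradients of -log f(x)_y with respect to each W^(k) and b^(k)."""
    # forward pass, keeping the hidden activations h^(k)(x)
    hs = [x]
    for W, b in zip(weights[:-1], biases[:-1]):
        hs.append(1.0 / (1.0 + np.exp(-(b + W @ hs[-1]))))   # sigmoid hidden units
    a_out = biases[-1] + weights[-1] @ hs[-1]
    e = np.exp(a_out - a_out.max())
    f_x = e / e.sum()                                        # softmax output f(x)

    # output gradient (before activation): -(e(y) - f(x))
    grad_a = f_x.copy()
    grad_a[y] -= 1.0

    grads_W, grads_b = [], []
    for k in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(grad_a, hs[k]))           # grad_a h^(k-1)(x)^T
        grads_b.insert(0, grad_a.copy())
        if k > 0:
            grad_h = weights[k].T @ grad_a                   # W^(k)^T grad_a
            grad_a = grad_h * hs[k] * (1.0 - hs[k])          # ⊙ g'(a^(k-1)(x))
    return grads_W, grads_b
```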

  23. 22 ACTIVATION FUNCTION Topics: sigmoid activation function gradient
• Partial derivative: g'(a) = g(a) (1 - g(a))
• g(a) = sigm(a) = 1 / (1 + exp(-a))

  24. 23 ACTIVATION FUNCTION Topics: tanh activation function gradient
• Partial derivative: g'(a) = 1 - g(a)^2
• g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)

  25. 24 ACTIVATION FUNCTION Topics: rectified linear activation function gradient
• Partial derivative: g'(a) = 1_{a > 0}
• g(a) = reclin(a) = max(0, a)
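The three activation gradients as a small sketch, with a finite-difference check confirming the closed forms above (the check itself is not part of the slides):

```python
import numpy as np

def sigm(a):      return 1.0 / (1.0 + np.exp(-a))
def d_sigm(a):    return sigm(a) * (1.0 - sigm(a))       # g'(a) = g(a)(1 - g(a))
def d_tanh(a):    return 1.0 - np.tanh(a) ** 2           # g'(a) = 1 - g(a)^2
def d_reclin(a):  return np.where(a > 0, 1.0, 0.0)       # g'(a) = 1_{a > 0}

a, eps = 0.3, 1e-6
for g, dg in [(sigm, d_sigm), (np.tanh, d_tanh), (lambda a: max(0.0, a), d_reclin)]:
    print(dg(a), (g(a + eps) - g(a - eps)) / (2 * eps))  # analytic vs. numerical
```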

  26. 25 FLOW GRAPH Topics: automatic differentiation
• Each object also has a bprop method
‣ it computes the gradient of the loss with respect to each parent
‣ fprop depends on the fprop of a box's parents, while bprop depends on the bprop of a box's children
• By calling bprop in the reverse order, we get backpropagation
‣ only need to reach the parameters
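Extending the earlier fprop sketch with a bprop method; the interface is again an assumption, meant only to illustrate that calling bprop in reverse order yields backpropagation:

```python
import numpy as np

class Linear:
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, x):
        self.x = x                                 # cache the parent's value for bprop
        return self.b + self.W @ x
    def bprop(self, grad_out):
        self.grad_W = np.outer(grad_out, self.x)   # gradient w.r.t. the parameters
        self.grad_b = grad_out
        return self.W.T @ grad_out                 # gradient w.r.t. the parent

class Sigmoid:
    def fprop(self, a):
        self.h = 1.0 / (1.0 + np.exp(-a))
        return self.h
    def bprop(self, grad_out):
        return grad_out * self.h * (1.0 - self.h)

# forward: call fprop in order; backward: call bprop in reverse order
def fprop_bprop(x, boxes, grad_loss):
    for box in boxes:
        x = box.fprop(x)
    g = grad_loss(x)
    for box in reversed(boxes):
        g = box.bprop(g)
    return x, g
```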
