Advanced Machine Learning Dense Neural Networks Amit Sethi Electrical Engineering, IIT Bombay
Learning objectives
• Learn the motivations behind neural networks
• Become familiar with neural network terms
• Understand the working of neural networks
• Understand behind-the-scenes training of neural networks
Neural networks are inspired by the mammalian brain
• Each unit (neuron) is simple
• But the human brain has about 100 billion neurons with 100 trillion connections
• The strength and nature of the connections store memories and the “program” that makes us human
• A neural network is a web of artificial neurons
Artificial neurons are inspired by biological neurons
• Neural networks are made up of artificial neurons
• Artificial neurons are only loosely based on real neurons, just like neural networks are only loosely based on the human brain
[Figure: a single artificial neuron computes g(w1·x1 + w2·x2 + w3·x3 + b)]
Activation function is the secret sauce of neural networks
• Neural network training is all about tuning the weights and biases
• If there were no activation function, the output of the entire neural network would be a linear function of the inputs
• The earliest models used a step function
[Figure: the same neuron diagram, with the activation g applied to w1·x1 + w2·x2 + w3·x3 + b]
Types of activation functions
• Step: original concept behind classification and region bifurcation. Not used anymore
• Sigmoid and tanh: trainable approximations of the step function
• ReLU: currently preferred due to fast convergence
• Softmax: currently preferred for the output of a classification net. Generalized sigmoid
• Linear: good for modeling a range in the output of a regression net
Formulas for activation functions
• Step: $y = \frac{\operatorname{sign}(x) + 1}{2}$
• Sigmoid: $y = \frac{1}{1 + e^{-x}}$
• Tanh: $y = \tanh(x)$
• ReLU: $y = \max(0, x)$
• Softmax: $y_j = \frac{e^{x_j}}{\sum_k e^{x_k}}$
• Linear: $y = x$
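As a minimal sketch (not part of the original slides), these formulas could be implemented in NumPy as follows; the function names are illustrative assumptions:

```python
import numpy as np

def step(x):
    # Step: y = (sign(x) + 1) / 2
    return (np.sign(x) + 1) / 2

def sigmoid(x):
    # Sigmoid: y = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Tanh: y = tanh(x)
    return np.tanh(x)

def relu(x):
    # ReLU: y = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def softmax(x):
    # Softmax: y_j = e^{x_j} / sum_k e^{x_k}, shifted by max(x) for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def linear(x):
    # Linear: y = x
    return x
```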
Step function divides the input space into two halves, 0 and 1
• In a single neuron, the step function is a linear binary classifier
• The weights and biases determine where the step will be in n dimensions
• But, as we shall see later, it gives little information about how to change the weights if we make a mistake
• So, we need a smoother version of the step function
• Enter: the sigmoid function
The sigmoid function is a smoother step function
• Smoothness ensures that there is more information about the direction in which to change the weights if there are errors
• The sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier
The problem with sigmoid is (near) zero gradient at both extremes
• For both large positive and large negative input values, the sigmoid output barely changes with a change of input
• ReLU has a constant gradient for almost half of the inputs
• But ReLU cannot give a meaningful final output
Output activation functions can only be of the following kinds
• Sigmoid gives binary classification output
• Tanh can also do that, provided the desired output is in {-1, +1}
• Softmax generalizes sigmoid to n-ary classification
• Linear is used for regression
• ReLU is only used in internal (non-output) nodes
Contents
• Introduction to neural networks
• Feed-forward neural networks
• Gradient descent and backpropagation
• Learning rate setting and tuning
Basic structure of a neural network
• It is feed forward
– Connections go from inputs towards outputs
– No connection comes backwards
• It consists of layers
– The current layer’s input is the previous layer’s output
– No lateral (intra-layer) connections
• That’s it!
[Figure: layered network with inputs x1 … xd at the bottom, hidden units h11 … h1n in the middle, and outputs y1 … yn at the top]
Basic structure of a neural network
• Output layer
– Represents the output of the neural network
– For a two-class problem or regression with a 1-d output, we need only one output node
• Hidden layer(s)
– Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
– These usually form a hidden layer; usually there is only one such layer
– Given enough hidden nodes, we can model an arbitrary input-output relation
• Input layer
– Represents the dimensions of the input vector (one node for each dimension)
– These usually form an input layer, and usually there is only one such layer
Importance of hidden layers
• The first hidden layer extracts features
• The second hidden layer extracts features of features
• …
• The output layer gives the desired output
[Figure: two plots of + and − samples; a single sigmoid gives only a linear boundary, while sigmoid hidden layers with a sigmoid output carve out a more complex decision region]
Overall function of a neural network
• $f(\mathbf{x}_i) = g_L(\mathbf{W}_L \, g_{L-1}(\mathbf{W}_{L-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
• The weights of each layer form a matrix
• The outputs of the previous layer form a vector
• The activation (nonlinear) function is applied point-wise to the weight matrix times the input
• Design questions (hyperparameters):
– Number of layers
– Number of neurons in each layer (rows of the weight matrices)
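A rough sketch of this layer-by-layer computation in NumPy is shown below; the layer sizes, random initialization, and the choice of ReLU and softmax are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

def forward(x, weights, biases, activations):
    # Each layer: multiply by the weight matrix, add the bias,
    # then apply the layer's activation function point-wise.
    a = x
    for W, b, g in zip(weights, biases, activations):
        a = g(W @ a + b)
    return a

# Example: 3 inputs -> 4 hidden units (ReLU) -> 2 outputs (softmax)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
relu = lambda z: np.maximum(0.0, z)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
y_hat = forward(np.array([0.5, -1.0, 2.0]), weights, biases, [relu, softmax])
```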
Training the neural network
• Given training samples $\mathbf{x}_i$ and targets $y_i$
• Think of what hyperparameters and neural network design might work
• Form a neural network: $f(\mathbf{x}_i) = g_L(\mathbf{W}_L \, g_{L-1}(\mathbf{W}_{L-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
• Compute $f_{\mathbf{w}}(\mathbf{x}_i)$ as an estimate of $y_i$ for all samples
• Compute the loss: $\frac{1}{N} \sum_{i=1}^{N} l_i(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(f_{\mathbf{w}}(\mathbf{x}_i), y_i)$
• Tweak the weights $\mathbf{w}$ to reduce the loss (optimization algorithm)
• Repeat the last three steps
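A hypothetical sketch of this loop is given below; `forward`, `loss_fn`, and `gradient` stand in for pieces developed elsewhere in the lecture, and `eta` (the learning rate) is an assumed hyperparameter:

```python
def train(weights, samples, targets, forward, loss_fn, gradient, eta, steps):
    for _ in range(steps):
        # Average loss over all N samples (useful for monitoring training)
        avg_loss = sum(loss_fn(forward(x_i, weights), y_i)
                       for x_i, y_i in zip(samples, targets)) / len(samples)
        # Tweak the weights to reduce the loss: step against the gradient
        grads = gradient(weights, samples, targets)
        weights = [W - eta * dW for W, dW in zip(weights, grads)]
    return weights
```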
Loss function choice
• There are positive and negative errors in regression, for which MSE is the most common loss function
• There is an error in the probability of the correct class in classification, for which cross entropy is the most common loss function
Some loss functions and their derivatives
• Terminology
– $y$ is the output
– $t$ is the target output
• Mean square error
– Loss: $(y - t)^2$
– Derivative of the loss: $2(y - t)$
• Cross entropy
– Loss: $-\sum_{c=1}^{C} t_c \log y_c$
– Derivative of the loss: $-\frac{1}{y_c}$ for the target class $c$ (zero for the other classes)
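A small sketch of these two losses and their derivatives, using the slide's terminology ($y$ is the output, $t$ the target); the one-hot encoding of $t$ for cross entropy is an assumption:

```python
import numpy as np

def mse_loss(y, t):
    # Mean square error: (y - t)^2
    return (y - t) ** 2

def mse_grad(y, t):
    # Derivative with respect to the output y: 2(y - t)
    return 2 * (y - t)

def cross_entropy_loss(y, t):
    # y: predicted class probabilities, t: one-hot target vector
    return -np.sum(t * np.log(y))

def cross_entropy_grad(y, t):
    # Non-zero only at the target class c: -1 / y_c
    return -t / y
```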
Computational graph of a single hidden layer NN
[Figure: computational graph — input x with W1 and b1 produces Z1; ReLU turns Z1 into A1; A1 with W2 and b2 produces Z2; Softmax turns Z2 into A2; A2 and the target feed the CE (cross-entropy) loss]
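An assumed reconstruction of this graph as code is sketched below; variable names follow the figure, and caching the intermediate values anticipates backpropagation:

```python
import numpy as np

def forward_graph(x, W1, b1, W2, b2, target):
    Z1 = W1 @ x + b1            # pre-activation of the hidden layer
    A1 = np.maximum(0.0, Z1)    # ReLU
    Z2 = W2 @ A1 + b2           # pre-activation of the output layer
    e = np.exp(Z2 - Z2.max())
    A2 = e / e.sum()            # Softmax probabilities
    loss = -np.sum(target * np.log(A2))  # cross-entropy against one-hot target
    # Intermediate values are returned because backpropagation reuses them.
    return loss, (Z1, A1, Z2, A2)
```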
Advanced Machine Learning Backpropagation Amit Sethi Electrical Engineering, IIT Bombay
Learning objectives
• Write the derivative of a nested function using the chain rule
• Articulate how storage of partial derivatives leads to efficient gradient descent for neural networks
• Write gradient descent as matrix operations
Overall function of a neural network
• $f(\mathbf{x}_i) = g_L(\mathbf{W}_L \, g_{L-1}(\mathbf{W}_{L-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
• The weights of each layer form a matrix
• The outputs of the previous layer form a vector
• The activation (nonlinear) function is applied point-wise to the weight matrix times the input
• Design questions (hyperparameters):
– Number of layers
– Number of neurons in each layer (rows of the weight matrices)
Training the neural network
• Given training samples $\mathbf{x}_i$ and targets $y_i$
• Think of what hyperparameters and neural network design might work
• Form a neural network: $f(\mathbf{x}_i) = g_L(\mathbf{W}_L \, g_{L-1}(\mathbf{W}_{L-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
• Compute $f_{\mathbf{w}}(\mathbf{x}_i)$ as an estimate of $y_i$ for all samples
• Compute the loss: $\frac{1}{N} \sum_{i=1}^{N} l_i(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L(f_{\mathbf{w}}(\mathbf{x}_i), y_i)$
• Tweak the weights $\mathbf{w}$ to reduce the loss (optimization algorithm)
• Repeat the last three steps
Gradient ascent
• If you didn’t know the shape of a mountain
• But at every step you knew the slope
• Can you reach the top of the mountain?
Gradient descent minimizes the loss function
• At every point, compute
– the loss (scalar): $l_i(\mathbf{w})$
– the gradient of the loss with respect to the weights (vector): $\nabla_{\mathbf{w}} l_i(\mathbf{w})$
• Take a step along the negative gradient: $\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^{N} l_i(\mathbf{w})$
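A minimal sketch of this update rule, assuming a per-sample gradient function `grad_loss` (a placeholder for what backpropagation will compute) and a learning rate `eta`:

```python
import numpy as np

def gradient_descent_step(w, samples, targets, grad_loss, eta=0.1):
    # Average the per-sample gradients over the N samples
    g = np.zeros_like(w)
    for x_i, y_i in zip(samples, targets):
        g += grad_loss(w, x_i, y_i)
    # Step against the gradient: w <- w - eta * (1/N) * sum_i grad l_i(w)
    return w - eta * g / len(samples)
```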
Derivative of a function of a scalar
• E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$
• The derivative $f'(x) = \frac{d f(x)}{d x}$ is the rate of change of $f(x)$ with $x$
• It is zero when the function is flat (horizontal), such as at a minimum or maximum of $f(x)$
• It is positive when $f(x)$ is sloping up, and negative when $f(x)$ is sloping down
• To move towards a maximum, take a small step in the direction of the derivative
Gradient of a function of a vector
• The gradient is the derivative with respect to each dimension, holding the other dimensions constant
• $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix}$
• At a minimum or a maximum the gradient is a zero vector: the function is flat in every direction
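An illustrative check of this definition (my own example, not from the slides): approximate each partial derivative numerically by perturbing one dimension at a time.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    # One partial derivative per dimension, holding the others constant
    g = np.zeros_like(x, dtype=float)
    for k in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[k] = h
        g[k] = (f(x + step) - f(x - step)) / (2 * h)  # central difference
    return g

# f(x1, x2) = x1^2 + 3*x2 has gradient (2*x1, 3); at (1, 2) this is (2, 3)
print(numerical_gradient(lambda x: x[0] ** 2 + 3 * x[1], np.array([1.0, 2.0])))
```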
Gradient of a function of a vector
• The gradient gives a direction for moving towards the minimum
• Take a small step towards the negative of the gradient
[Figure: surface plot of f(x1, x2) with a step taken along the negative gradient]