
Advanced Machine Learning: Dense Neural Networks, by Amit Sethi (PowerPoint presentation)



  1. Advanced Machine Learning: Dense Neural Networks. Amit Sethi, Electrical Engineering, IIT Bombay

  2. Learning objectives • Learn the motivations behind neural networks • Become familiar with neural network terms • Understand the working of neural networks • Understand behind-the-scenes training of neural networks

  3. Neural networks are inspired by the mammalian brain • Each unit (neuron) is simple • But, the human brain has 100 billion neurons with 100 trillion connections • The strength and nature of the connections store memories and the “program” that makes us human • A neural network is a web of artificial neurons

  4. Artificial neurons are inspired by biological neurons • Neural networks are made up of artificial neurons • Artificial neurons are only loosely based on real neurons, just like neural networks are only loosely based on the human brain [Diagram: inputs x1, x2, x3 enter with weights w1, w2, w3; a summation Σ adds them together with a bias b; an activation function g produces the output]
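
A minimal sketch of one such neuron in Python (the variable names, values, and the sigmoid choice are illustrative assumptions, not taken from the slides): a weighted sum of the inputs plus a bias, passed through an activation function g.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, g):
    """One artificial neuron: weighted sum of the inputs plus a bias, passed through activation g."""
    return g(np.dot(w, x) + b)

# Example: three inputs x1, x2, x3 with weights w1, w2, w3 and bias b
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
print(neuron(x, w, b, sigmoid))
```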

  5. Activation function is the secret sauce of neural networks • Neural network training is all about tuning weights and biases • If there were no activation function g, the output of the entire neural network would be a linear function of the inputs • The earliest models used a step function [Diagram: the same neuron as before, with inputs x1, x2, x3, weights w1, w2, w3, bias b, summation Σ, and activation g]

  6. Types of activation functions • Step: the original concept behind classification and region bifurcation; not used anymore • Sigmoid and tanh: trainable approximations of the step function • ReLU: currently preferred due to fast convergence • Softmax: currently preferred for the output of a classification net; a generalization of the sigmoid • Linear: good for modeling a range in the output of a regression net

  7. Formulas for activation functions • Step: g(x) = (sign(x) + 1) / 2 • Sigmoid: g(x) = 1 / (1 + e^(−x)) • Tanh: g(x) = tanh(x) • ReLU: g(x) = max(0, x) • Softmax: g(x_j) = e^(x_j) / Σ_k e^(x_k) • Linear: g(x) = x
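
A sketch of these activations in Python (the max-subtraction inside softmax is a standard numerical-stability trick added here, not something stated on the slide):

```python
import numpy as np

def step(x):
    return (np.sign(x) + 1) / 2

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow; result is unchanged
    return e / e.sum()

def linear(x):
    return x
```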

  8. Step function divides the input space into two halves: 0 and 1 • In a single neuron, the step function acts as a linear binary classifier • The weights and biases determine where the step will be in n dimensions • But, as we shall see later, it gives little information about how to change the weights if we make a mistake • So, we need a smoother version of a step function • Enter: the sigmoid function

  9. The sigmoid function is a smoother step function • Smoothness ensures that there is more information about the direction in which to change the weights if there are errors • Sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier

  10. The problem with sigmoid is (near) zero gradient on both extremes • For both large positive and negative input values, sigmoid doesn’t change much with change of input • ReLU has a constant gradient for almost half of the inputs • But, ReLU cannot give a meaningful final output
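
A small illustration of this point in Python, comparing the sigmoid's derivative with ReLU's at a few inputs (the derivative formulas are standard results, not stated on the slide):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    return sigmoid(x) * (1 - sigmoid(x))   # near zero for large |x|

def relu_grad(x):
    return (x > 0).astype(float)           # constant 1 for every positive input

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(xs))   # ~[0.00005, 0.105, 0.25, 0.105, 0.00005]
print(relu_grad(xs))      # [0, 0, 0, 1, 1]
```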

  11. Output activation functions can only be of the following kinds • Sigmoid gives a binary classification output • Tanh can also do that, provided the desired output is in {-1, +1} • Softmax generalizes sigmoid to n-ary classification • Linear is used for regression • ReLU is only used in internal (non-output) nodes

  12. Contents • Introduction to neural networks • Feed forward neural networks • Gradient descent and backpropagation • Learning rate setting and tuning

  13. Basic structure of a neural network • It is feed forward – Connections go from the inputs towards the outputs – No connection comes backwards • It consists of layers – The current layer’s input is the previous layer’s output – No lateral (intra-layer) connections • That’s it! [Diagram: input nodes x1 … xd at the bottom, hidden nodes h11, h12, … h1n in the middle, output nodes y1 … yn at the top]

  14. Basic structure of a neural network • Output layer – Represents the output of the neural network – For a two-class problem, or regression with a 1-d output, we need only one output node • Hidden layer(s) – Represent the intermediary nodes that divide the input space into regions with (soft) boundaries – These usually form a hidden layer – Usually, there is only one such layer – Given enough hidden nodes, we can model an arbitrary input-output relation • Input layer – Represents the dimensions of the input vector (one node for each dimension) – These usually form an input layer – Usually there is only one such layer [Diagram: the same layered network, with output nodes y1 … yn, hidden nodes h11 … h1n, and input nodes x1 … xd]

  15. Importance of hidden layers • First hidden layer extracts features • Second hidden layer extracts features of features • … • Output layer gives the desired output [Figure: a two-class (+/−) dataset, comparing the decision boundary of a single sigmoid with that of a network with sigmoid hidden layers and a sigmoid output]

  16. Overall function of a neural network • f(x_i) = g_L(W_L · g_{L−1}(W_{L−1} · … g_1(W_1 · x_i) … )) • The weights of each layer form a matrix • The outputs of the previous layer form a vector • The activation (nonlinear) function is applied point-wise to the weight matrix times the input vector • Design questions (hyper-parameters): – Number of layers – Number of neurons in each layer (rows of the weight matrices)
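
A sketch of this nested function written as a loop in Python (names are illustrative; biases are included here even though the slide's formula shows only the weight matrices):

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Compute g_L(W_L · ... g_1(W_1 · x + b_1) ... + b_L), one layer at a time."""
    a = x
    for W, b, g in zip(weights, biases, activations):
        a = g(W @ a + b)   # affine map, then point-wise activation
    return a
```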

  17. Training the neural network • Given samples x_i and targets y_i • Think of what hyper-parameters and neural network design might work • Form a neural network: f(x_i) = g_L(W_L · g_{L−1}(W_{L−1} · … g_1(W_1 · x_i) … )) • Compute f_w(x_i) as an estimate of y_i for all samples • Compute the loss: l(w) = (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) • Tweak the weights w to reduce the loss (optimization algorithm) • Repeat the last three steps
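
A sketch of the loss-computation step in Python; here `net(x, w)` stands for the network f_w and `per_sample_loss` for L, both hypothetical placeholders supplied by the caller:

```python
import numpy as np

def mean_loss(w, xs, ys, net, per_sample_loss):
    """l(w) = (1/N) * sum over i of L(f_w(x_i), y_i): the quantity the optimizer tries to reduce."""
    return np.mean([per_sample_loss(net(x, w), y) for x, y in zip(xs, ys)])
```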

  18. Loss function choice • For regression, errors can be positive or negative, and MSE is the most common loss function • For classification, the error depends on the probability assigned to the correct class, and cross-entropy is the most common loss function

  19. Some loss functions and their derivatives • Terminology – y is the output – t is the target output • Mean square error – Loss: (y − t)^2 – Derivative of the loss: 2(y − t) • Cross-entropy – Loss: − Σ_{c=1}^{C} t_c log y_c – Derivative of the loss: −1/y_c at the target class c (zero elsewhere, for a one-hot target)
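
A sketch of these two losses and their derivatives in Python, assuming y is the network output and t the target (one-hot in the cross-entropy case):

```python
import numpy as np

def mse(y, t):
    return (y - t) ** 2

def mse_grad(y, t):
    return 2 * (y - t)

def cross_entropy(y, t):
    return -np.sum(t * np.log(y))   # t is one-hot, y is a probability vector

def cross_entropy_grad(y, t):
    return -t / y                   # -1/y_c at the target class, 0 elsewhere
```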

  20. Computational graph of a single hidden layer NN [Diagram: input x → Z1 = W1 · x + b1 → ReLU → A1 → Z2 = W2 · A1 + b2 → Softmax → A2 → cross-entropy (CE) loss against the target]
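
A sketch of this graph's forward pass in Python (names and shapes are illustrative; the stability shift inside the softmax is an added detail):

```python
import numpy as np

def one_hidden_layer_forward(x, W1, b1, W2, b2, target):
    """Affine -> ReLU -> affine -> softmax -> cross-entropy, following the graph above."""
    Z1 = W1 @ x + b1
    A1 = np.maximum(0, Z1)                # ReLU
    Z2 = W2 @ A1 + b2
    e = np.exp(Z2 - Z2.max())             # stability shift
    A2 = e / e.sum()                      # softmax
    loss = -np.sum(target * np.log(A2))   # cross-entropy against a one-hot target
    return A2, loss
```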

  21. Advanced Machine Learning: Backpropagation. Amit Sethi, Electrical Engineering, IIT Bombay

  22. Learning objectives • Write derivative of a nested function using chain rule • Articulate how storage of partial derivatives leads to an efficient gradient descent for neural networks • Write gradient descent as matrix operations

  23. Overall function of a neural network • f(x_i) = g_L(W_L · g_{L−1}(W_{L−1} · … g_1(W_1 · x_i) … )) • The weights of each layer form a matrix • The outputs of the previous layer form a vector • The activation (nonlinear) function is applied point-wise to the weight matrix times the input vector • Design questions (hyper-parameters): – Number of layers – Number of neurons in each layer (rows of the weight matrices)

  24. Training the neural network • Given samples x_i and targets y_i • Think of what hyper-parameters and neural network design might work • Form a neural network: f(x_i) = g_L(W_L · g_{L−1}(W_{L−1} · … g_1(W_1 · x_i) … )) • Compute f_w(x_i) as an estimate of y_i for all samples • Compute the loss: l(w) = (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) • Tweak the weights w to reduce the loss (optimization algorithm) • Repeat the last three steps

  25. Gradient ascent • If you didn’t know the shape of a mountain • But at every step you knew the slope • Can you reach the top of the mountain?

  26. Gradient descent minimizes the loss function • At every point, compute: – Loss (scalar): l_i(w) – Gradient of the loss with respect to the weights (vector): ∇_w l_i(w) • Take a step towards the negative gradient: w ← w − η · ∇_w (1/N) Σ_{i=1}^{N} l_i(w)
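
A sketch of this update rule in Python (here `grad_loss(w, x, y)` stands for ∇_w l_i(w) and is an assumed placeholder; `lr` is the step size η):

```python
import numpy as np

def gradient_descent_step(w, xs, ys, grad_loss, lr):
    """w <- w - lr * (average over samples of the per-sample loss gradient)."""
    g = np.mean([grad_loss(w, x, y) for x, y in zip(xs, ys)], axis=0)
    return w - lr * g
```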

  27. Derivative of a function of a scalar • E.g. f(x) = ax^2 + bx + c, f′(x) = 2ax + b, f′′(x) = 2a • The derivative f′(x) = df(x)/dx is the rate of change of f(x) with x • It is zero when the function is flat (horizontal), such as at the minimum or maximum of f(x) • It is positive when f(x) is sloping up, and negative when f(x) is sloping down • To move towards the maximum, take a small step in the direction of the derivative
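
A small check of this in Python, using a central finite difference (the particular coefficients are arbitrary example values):

```python
def f(x, a=1.0, b=-2.0, c=0.5):
    return a * x**2 + b * x + c

def f_prime(x, a=1.0, b=-2.0):
    return 2 * a * x + b

# The central finite difference approximates the derivative at x0
x0, h = 1.5, 1e-6
print(f_prime(x0), (f(x0 + h) - f(x0 - h)) / (2 * h))  # both close to 1.0
```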

  28. Gradient of a function of a vector • The gradient is the derivative with respect to each dimension, holding the other dimensions constant • ∇f(x) = ∇f(x_1, x_2) = [∂f/∂x_1, ∂f/∂x_2] • At a minimum or a maximum, the gradient is the zero vector: the function is flat in every direction [Figure: surface plot of f(x_1, x_2)]
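
A sketch of this definition as a numerical gradient in Python (central differences along each dimension; the example function is an arbitrary choice):

```python
import numpy as np

def grad(f, x, h=1e-6):
    """Partial derivative along each dimension, holding the others fixed."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Example: f(x1, x2) = x1^2 + 3*x2^2 has gradient [2*x1, 6*x2]
print(grad(lambda x: x[0]**2 + 3 * x[1]**2, np.array([1.0, -2.0])))  # ~[2.0, -12.0]
```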

  29. Gradient of a function of a vector • The gradient gives a direction for moving towards the minimum • Take a small step towards the negative of the gradient [Figure: surface plot of f(x_1, x_2)]
