Neural Networks II


  1. Neural Networks II Chen Gao Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Neural Networks
     • Origins: Algorithms that try to mimic the brain.
     • What is this?

  3. A single neuron in the brain [figure of a biological neuron, labeled Input and Output] Slide credit: Andrew Ng

  4. An artificial neuron: Logistic unit
     "Input": $x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$, where $x_0$ is the "bias unit"
     "Weights" / "Parameters": $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix}$
     "Output": $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
     • Sigmoid (logistic) activation function
     Slide credit: Andrew Ng
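A minimal NumPy sketch of the logistic unit above (the example values of x and theta are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) activation: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, theta):
    """Single artificial neuron: h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(theta @ x)

# Illustrative values; x[0] = 1 is the bias unit, theta[0] its weight.
x = np.array([1.0, 0.5, -1.2, 2.0])
theta = np.array([0.1, -0.4, 0.3, 0.8])
print(logistic_unit(x, theta))
```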

  5. Visualization of weights, bias, activation function
     • The range of the output is determined by the activation function g(·)
     • The bias b only changes the position of the hyperplane
     Slide credit: Hugo Larochelle

  6. Activation: sigmoid
     • Squashes the neuron's pre-activation between 0 and 1
     • Always positive
     • Bounded
     • Strictly increasing
     $g(x) = \dfrac{1}{1 + e^{-x}}$
     Slide credit: Hugo Larochelle

  7. Activation: hyperbolic tangent (tanh)
     • Squashes the neuron's pre-activation between -1 and 1
     • Can be positive or negative
     • Bounded
     • Strictly increasing
     $g(x) = \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
     Slide credit: Hugo Larochelle

  8. Activation: rectified linear (ReLU)
     • Bounded below by 0 (always non-negative)
     • Not upper bounded
     • Tends to give neurons with sparse activities
     $g(x) = \mathrm{relu}(x) = \max(0, x)$
     Slide credit: Hugo Larochelle
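The three activations on slides 6-8 as a short NumPy sketch (the function names are mine):

```python
import numpy as np

def sigmoid(x):
    """Squashes the pre-activation into (0, 1); always positive, bounded, increasing."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Squashes the pre-activation into (-1, 1); bounded, increasing."""
    return np.tanh(x)

def relu(x):
    """Clips negative pre-activations to 0; not upper bounded, gives sparse activities."""
    return np.maximum(0.0, x)
```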

  9. Activation: softmax
     • For multi-class classification:
       • we need multiple outputs (1 output per class)
       • we would like to estimate the conditional probability $p(y = c \mid x)$
     • We use the softmax activation function at the output:
     $g(x) = \mathrm{softmax}(x) = \left[ \dfrac{e^{x_1}}{\sum_c e^{x_c}} \;\cdots\; \dfrac{e^{x_C}}{\sum_c e^{x_c}} \right]$
     Slide credit: Hugo Larochelle
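A softmax sketch in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick I have added, not something stated on the slide:

```python
import numpy as np

def softmax(x):
    """Turns a vector of scores into class probabilities p(y = c | x)."""
    shifted = x - np.max(x)      # stability trick: does not change the result
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x)

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1
```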

  10. Universal approximation theorem
      "A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units." (Hornik, 1991)
      Slide credit: Hugo Larochelle

  11. Neural network: Multilayer
      [figure: inputs $x_0, x_1, x_2, x_3$ (Layer 1) feed hidden units $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ (Layer 2, hidden), which feed the output $h_\Theta(x)$ (Layer 3)]
      Slide credit: Andrew Ng

  12. Neural network
      $a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
      $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
      $a_1^{(2)} = g\big(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\big)$
      $a_2^{(2)} = g\big(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\big)$
      $a_3^{(2)} = g\big(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\big)$
      $h_\Theta(x) = g\big(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\big)$
      With $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$: size of $\Theta^{(j)}$? $\;s_{j+1} \times (s_j + 1)$
      Slide credit: Andrew Ng

  13. Neural network
      "Pre-activation": $x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$, $\quad z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$
      Why do we need g(·)?
      $a_1^{(2)} = g\big(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\big) = g(z_1^{(2)})$
      $a_2^{(2)} = g\big(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\big) = g(z_2^{(2)})$
      $a_3^{(2)} = g\big(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\big) = g(z_3^{(2)})$
      $h_\Theta(x) = g\big(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\big) = g(z^{(3)})$
      Slide credit: Andrew Ng

  14. Neural network
      "Pre-activation": $z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
      $a_1^{(2)} = g(z_1^{(2)}), \; a_2^{(2)} = g(z_2^{(2)}), \; a_3^{(2)} = g(z_3^{(2)})$, i.e. $a^{(2)} = g(z^{(2)})$
      Add $a_0^{(2)} = 1$
      $z^{(3)} = \Theta^{(2)} a^{(2)}$
      $h_\Theta(x) = a^{(3)} = g(z^{(3)})$
      Slide credit: Andrew Ng

  15. Flow graph: Forward propagation
      $x \;\xrightarrow{W^{(1)},\, b^{(1)}}\; z^{(2)} \;\rightarrow\; a^{(2)} \;\xrightarrow{W^{(2)},\, b^{(2)}}\; z^{(3)} \;\rightarrow\; a^{(3)} = h_\Theta(x)$
      $z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
      $a^{(2)} = g(z^{(2)})$; add $a_0^{(2)} = 1$
      $z^{(3)} = \Theta^{(2)} a^{(2)}$
      $h_\Theta(x) = a^{(3)} = g(z^{(3)})$
      How do we evaluate our prediction?
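A vectorized forward pass matching the equations above, as a NumPy sketch (the sigmoid activation and the layer sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation through one hidden layer.

    Theta1 has shape (s2, s1 + 1) and Theta2 has shape (s3, s2 + 1),
    matching the s_{j+1} x (s_j + 1) sizing from slide 12.
    """
    a1 = np.concatenate(([1.0], x))            # add bias unit a_0^(1) = 1
    z2 = Theta1 @ a1                           # pre-activation of layer 2
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # add bias unit a_0^(2) = 1
    z3 = Theta2 @ a2                           # pre-activation of the output layer
    a3 = sigmoid(z3)                           # h_Theta(x)
    return a3

# Illustrative shapes: 3 inputs, 4 hidden units, 1 output.
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(4, 4))   # 4 x (3 + 1)
Theta2 = rng.normal(size=(1, 5))   # 1 x (4 + 1)
print(forward(np.array([0.2, -0.5, 1.0]), Theta1, Theta2))
```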

  16. Cost function
      Logistic regression:
      $J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \dfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
      Neural network ($K$ output units):
      $J(\Theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \Big] + \dfrac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$
      Slide credit: Andrew Ng
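A NumPy sketch of the neural-network cost above (the vectorized form, the variable names, and the one-hot label matrix Y are my assumptions, not the slide's code):

```python
import numpy as np

def nn_cost(predictions, Y, Thetas, lam):
    """Cross-entropy cost with L2 regularization.

    predictions -- (m, K) matrix whose i-th row is h_Theta(x^(i))
    Y           -- (m, K) one-hot labels
    Thetas      -- list of weight matrices Theta^(l)
    lam         -- regularization strength lambda
    """
    m = Y.shape[0]
    eps = 1e-12   # avoid log(0)
    data_term = -np.sum(Y * np.log(predictions + eps)
                        + (1 - Y) * np.log(1 - predictions + eps)) / m
    # The bias column (first column) of each Theta is not regularized.
    reg_term = (lam / (2 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return data_term + reg_term
```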

  17. Gradient computation
      Need to compute: the cost $J(\Theta)$ and its partial derivatives $\dfrac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
      Slide credit: Andrew Ng

  18. Gradient computation
      Given one training example $(x, y)$:
      $a^{(1)} = x$
      $z^{(2)} = \Theta^{(1)} a^{(1)}$
      $a^{(2)} = g(z^{(2)})$  (add $a_0^{(2)}$)
      $z^{(3)} = \Theta^{(2)} a^{(2)}$
      $a^{(3)} = g(z^{(3)})$  (add $a_0^{(3)}$)
      $z^{(4)} = \Theta^{(3)} a^{(3)}$
      $a^{(4)} = g(z^{(4)}) = h_\Theta(x)$
      Slide credit: Andrew Ng

  19. Gradient computation: Backpropagation
      Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$
      For each output unit (layer $L = 4$): $\delta^{(4)} = a^{(4)} - y$
      Backing up through $z^{(3)} = \Theta^{(2)} a^{(2)}$, $a^{(3)} = g(z^{(3)})$, $z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = g(z^{(4)})$, the chain rule gives
      $\delta^{(3)} = \dfrac{\partial \delta^{(4)}}{\partial a^{(4)}} \dfrac{\partial a^{(4)}}{\partial z^{(4)}} \dfrac{\partial z^{(4)}}{\partial a^{(3)}} \dfrac{\partial a^{(3)}}{\partial z^{(3)}}\, \delta^{(4)} = 1 \cdot \big(\Theta^{(3)}\big)^{\top} \delta^{(4)} \;.\ast\; g'(z^{(4)}) \;.\ast\; g'(z^{(3)})$
      Slide credit: Andrew Ng
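A NumPy sketch of these error terms for one example in the 4-layer network of slide 18, assuming sigmoid activations. It uses the common shortcut in which $\delta^{(4)} = a^{(4)} - y$ already absorbs the output nonlinearity (sigmoid output with cross-entropy loss), so only the hidden layers' $g'(z)$ appear:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)

def deltas_one_example(x, y, Theta1, Theta2, Theta3):
    """Forward pass, then the "error" delta of each layer for one (x, y)."""
    a1 = np.concatenate(([1.0], x))
    z2 = Theta1 @ a1; a2 = np.concatenate(([1.0], sigmoid(z2)))
    z3 = Theta2 @ a2; a3 = np.concatenate(([1.0], sigmoid(z3)))
    z4 = Theta3 @ a3; a4 = sigmoid(z4)             # h_Theta(x)

    d4 = a4 - y                                    # output layer error
    d3 = (Theta3.T @ d4)[1:] * sigmoid_grad(z3)    # [1:] drops the bias row
    d2 = (Theta2.T @ d3)[1:] * sigmoid_grad(z2)
    return d2, d3, d4
```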

  20. Backpropagation algorithm
      Training set: $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
      Initialize the weights $\Theta^{(l)}$
      For $i = 1$ to $m$:
        Set $a^{(1)} = x^{(i)}$
        Perform forward propagation to compute $a^{(l)}$ for $l = 2, \ldots, L$
        Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
        Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
        Update $\Theta^{(l)} := \Theta^{(l)} - \alpha\, \delta^{(l+1)} \big(a^{(l)}\big)^{\top}$
      Slide credit: Andrew Ng
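Putting slides 18-20 together, a per-example gradient-descent sketch in NumPy. The learning rate alpha, the epoch loop, and the sigmoid activations are my assumptions; the gradient of the unregularized cost with respect to $\Theta^{(l)}$ is the outer product $\delta^{(l+1)} (a^{(l)})^\top$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def train_backprop(X, Y, Theta1, Theta2, Theta3, alpha=0.1, epochs=10):
    """Loop over examples: forward pass, backprop the errors, update the weights."""
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # Forward propagation (slide 18).
            a1 = np.concatenate(([1.0], x))
            z2 = Theta1 @ a1; a2 = np.concatenate(([1.0], sigmoid(z2)))
            z3 = Theta2 @ a2; a3 = np.concatenate(([1.0], sigmoid(z3)))
            z4 = Theta3 @ a3; a4 = sigmoid(z4)
            # Error terms (slide 19).
            d4 = a4 - y
            d3 = (Theta3.T @ d4)[1:] * sigmoid_grad(z3)
            d2 = (Theta2.T @ d3)[1:] * sigmoid_grad(z2)
            # Gradient step: Theta^(l) -= alpha * delta^(l+1) (a^(l))^T.
            Theta3 -= alpha * np.outer(d4, a3)
            Theta2 -= alpha * np.outer(d3, a2)
            Theta1 -= alpha * np.outer(d2, a1)
    return Theta1, Theta2, Theta3
```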

  21. Activation: sigmoid
      • Partial derivative: $g'(x) = g(x)\,\big(1 - g(x)\big)$
      $g(x) = \dfrac{1}{1 + e^{-x}}$
      Slide credit: Hugo Larochelle

  22. Activation: hyperbolic tangent (tanh)
      • Partial derivative: $g'(x) = 1 - g(x)^2$
      $g(x) = \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
      Slide credit: Hugo Larochelle

  23. Activation: rectified linear (ReLU)
      • Partial derivative: $g'(x) = 1_{x > 0}$
      $g(x) = \mathrm{relu}(x) = \max(0, x)$
      Slide credit: Hugo Larochelle
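The three derivatives on slides 21-23 as a NumPy sketch (function names are mine; taking relu's derivative at exactly 0 to be 0 is a common convention, not stated on the slide):

```python
import numpy as np

def sigmoid_grad(x):
    """g'(x) = g(x) (1 - g(x)) for the sigmoid."""
    g = 1.0 / (1.0 + np.exp(-x))
    return g * (1.0 - g)

def tanh_grad(x):
    """g'(x) = 1 - g(x)^2 for tanh."""
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    """g'(x) = 1 for x > 0, else 0."""
    return (x > 0).astype(float)
```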

  24. Initialization
      • For the biases
        • Initialize all to 0
      • For the weights
        • Can't initialize all weights to the same value
          • we can show that all hidden units in a layer would then always behave the same
          • need to break symmetry
        • Recipe: sample from U[-b, b]
          • the idea is to sample around 0 but break symmetry
      Slide credit: Hugo Larochelle
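A sketch of this recipe for one layer. The slide only says U[-b, b]; the particular choice b = sqrt(6 / (fan_in + fan_out)) is my assumption (a common default), not something the slide specifies:

```python
import numpy as np

def init_layer(fan_in, fan_out, rng=None):
    """Biases start at 0; weights are sampled from U[-b, b] to break symmetry."""
    rng = np.random.default_rng() if rng is None else rng
    b = np.sqrt(6.0 / (fan_in + fan_out))   # assumed choice of b, see above
    W = rng.uniform(-b, b, size=(fan_out, fan_in))
    bias = np.zeros(fan_out)
    return W, bias
```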

  25. Putting it together
      Pick a network architecture:
      • No. of input units: dimension of features
      • No. of output units: number of classes
      • Reasonable default: 1 hidden layer; or, if >1 hidden layer, use the same number of hidden units in every layer (usually the more the better)
      • Grid search
      Slide credit: Hugo Larochelle

  26. Putting it together
      Early stopping:
      • Use performance on a validation set to select the best configuration
      • To select the number of epochs, stop training when the validation set error increases
      Slide credit: Hugo Larochelle
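A minimal early-stopping loop as a sketch. train_epoch() and validation_error() are hypothetical helpers (one epoch of training, error on the held-out validation set); stopping at the first increase follows the slide, without the patience counters often used in practice:

```python
def train_with_early_stopping(model, train_data, val_data, max_epochs=100):
    """Keep the configuration with the lowest validation error; stop when it rises."""
    best_error = float("inf")
    best_model = model
    for _ in range(max_epochs):
        model = train_epoch(model, train_data)      # hypothetical helper
        error = validation_error(model, val_data)   # hypothetical helper
        if error > best_error:
            break                                   # validation error increased: stop
        best_error, best_model = error, model
    return best_model
```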

  27. Other tricks of the trade
      • Normalizing your (real-valued) data
      • Decaying the learning rate
        • as we get closer to the optimum, it makes sense to take smaller update steps
      • Mini-batches
        • can give a more accurate estimate of the risk gradient
      • Momentum
        • can use an exponential average of previous gradients
      Slide credit: Hugo Larochelle
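A sketch combining the last three tricks into one update rule (the decay schedule, the momentum coefficient, and the hyperparameter values are illustrative choices, not from the slide):

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, epoch, lr0=0.1, decay=0.01, beta=0.9):
    """One mini-batch update with learning-rate decay and momentum.

    grad     -- gradient estimated on a mini-batch
    velocity -- exponential average of previous gradients (momentum)
    """
    lr = lr0 / (1.0 + decay * epoch)                  # smaller steps as training proceeds
    velocity = beta * velocity + (1.0 - beta) * grad  # average of past gradients
    return theta - lr * velocity, velocity
```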

  28. Dropout
      • Idea: "cripple" the neural network by removing hidden units
        • each hidden unit is set to 0 with probability 0.5
        • hidden units cannot co-adapt to other units
        • hidden units must be more generally useful
      Slide credit: Hugo Larochelle
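A dropout mask applied to a layer's hidden activations, as a NumPy sketch. Scaling the surviving units by 1 / (1 - drop_prob) ("inverted dropout") is a common convention I have added; the slide only describes zeroing units with probability 0.5 during training:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, rng=None):
    """Set each hidden unit to 0 with probability drop_prob (training time only)."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)  # keeps the expected activation unchanged
```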
