Neural Networks II
Chen Gao
Virginia Tech, Spring 2019
ECE-5424G / CS-5824

Neural Networks
• Origins: algorithms that try to mimic the brain.
What is this?

A single neuron in the brain
[Figure: a biological neuron, with its input and output labeled]
Slide credit: Andrew Ng

An artificial neuron: Logistic unit
"Input": $x = [x_0, x_1, x_2, x_3]^\top$, with $x_0$ the "bias unit"
"Weights" / "parameters": $\theta = [\theta_0, \theta_1, \theta_2, \theta_3]^\top$
"Output": $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
• Sigmoid (logistic) activation function
Slide credit: Andrew Ng

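A minimal NumPy sketch of this unit (the bias-unit convention follows the slide; the concrete weights and inputs are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, theta):
    """Single artificial neuron: prepend the bias unit x0 = 1, then apply g(theta^T x)."""
    x = np.concatenate(([1.0], x))        # add bias unit x0 = 1
    return sigmoid(theta @ x)

# Illustrative numbers (not from the slide)
theta = np.array([-1.0, 2.0, 0.5, -0.3])  # [theta0 (bias weight), theta1, theta2, theta3]
x = np.array([0.2, 0.4, 0.6])             # [x1, x2, x3]
print(logistic_unit(x, theta))            # a value in (0, 1)
```
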
Visualization of weights, bias, activation function
• The range of the activation is determined by $g(\cdot)$
• The bias $b$ only changes the position of the hyperplane
Slide credit: Hugo Larochelle

Activation: sigmoid
• Squashes the neuron's pre-activation between 0 and 1
• Always positive
• Bounded
• Strictly increasing
$g(a) = \dfrac{1}{1 + e^{-a}}$
Slide credit: Hugo Larochelle

Activation: hyperbolic tangent (tanh)
• Squashes the neuron's pre-activation between -1 and 1
• Can be positive or negative
• Bounded
• Strictly increasing
$g(a) = \tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
Slide credit: Hugo Larochelle

Activation: rectified linear (ReLU)
• Bounded below by 0 (always non-negative)
• Not upper bounded
• Tends to give neurons with sparse activities
$g(a) = \mathrm{relu}(a) = \max(0, a)$
Slide credit: Hugo Larochelle

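The three activations above written out in NumPy (a sketch; function names are mine):

```python
import numpy as np

def sigmoid(a):
    """Squashes the pre-activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """Squashes the pre-activation into (-1, 1)."""
    return np.tanh(a)

def relu(a):
    """Non-negative, unbounded above; zeroes out negative pre-activations."""
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), tanh(a), relu(a))
```
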
Activation: softmax
• For multi-class classification:
  • we need multiple outputs (one output per class)
  • we would like to estimate the conditional probability $p(y = c \mid x)$
• We use the softmax activation function at the output:
$o(a) = \mathrm{softmax}(a) = \left[\dfrac{e^{a_1}}{\sum_c e^{a_c}}, \ldots, \dfrac{e^{a_C}}{\sum_c e^{a_c}}\right]^\top$
Slide credit: Hugo Larochelle

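A NumPy sketch of the softmax output; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated on the slide:

```python
import numpy as np

def softmax(a):
    """o(a)_c = exp(a_c) / sum_c' exp(a_c'), one probability per class."""
    a = a - np.max(a)            # stability: softmax is invariant to a constant shift
    e = np.exp(a)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])   # made-up class scores
p = softmax(scores)
print(p, p.sum())                    # probabilities, summing to 1
```
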
Universal approximation theorem
"A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units" (Hornik, 1991)
Slide credit: Hugo Larochelle

Neural network: Multilayer
[Figure: inputs $x_1, x_2, x_3$ (plus bias $x_0$), hidden units $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ (plus bias $a_0^{(2)}$), output $h_\Theta(x)$]
Layer 1 (input), Layer 2 (hidden), Layer 3 (output)
Slide credit: Andrew Ng

Neural network
• $a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
• $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
$a_1^{(2)} = g\big(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\big)$
$a_2^{(2)} = g\big(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\big)$
$a_3^{(2)} = g\big(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\big)$
$h_\Theta(x) = g\big(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\big)$
Size of $\Theta^{(j)}$? With $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, $\Theta^{(j)}$ is $s_{j+1} \times (s_j + 1)$.
Slide credit: Andrew Ng

Neural network: "Pre-activation"
$z^{(2)} = \big[z_1^{(2)}, z_2^{(2)}, z_3^{(2)}\big]^\top$, $\quad x = \big[x_0, x_1, x_2, x_3\big]^\top$
Why do we need $g(\cdot)$?
$z_1^{(2)} = \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3$, $\quad a_1^{(2)} = g\big(z_1^{(2)}\big)$
$z_2^{(2)} = \Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3$, $\quad a_2^{(2)} = g\big(z_2^{(2)}\big)$
$z_3^{(2)} = \Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3$, $\quad a_3^{(2)} = g\big(z_3^{(2)}\big)$
$h_\Theta(x) = g\big(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\big) = g\big(z^{(3)}\big)$
Slide credit: Andrew Ng

Neural network: "Pre-activation" (vectorized)
$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g\big(z^{(2)}\big)$
Add $a_0^{(2)} = 1$
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$h_\Theta(x) = a^{(3)} = g\big(z^{(3)}\big)$
Slide credit: Andrew Ng

Flow graph: Forward propagation
$x \;\rightarrow\; z^{(2)} \;\rightarrow\; a^{(2)} \;\rightarrow\; z^{(3)} \;\rightarrow\; a^{(3)} = h_\Theta(x)$
$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g\big(z^{(2)}\big)$ (add $a_0^{(2)} = 1$)
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$h_\Theta(x) = a^{(3)} = g\big(z^{(3)}\big)$
How do we evaluate our prediction?

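A vectorized forward pass for a network of this shape, as a sketch (weight shapes follow the $s_{j+1} \times (s_j + 1)$ convention; the layer sizes and random weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation: x -> z2 -> a2 -> z3 -> a3 = h_Theta(x)."""
    a1 = np.concatenate(([1.0], x))            # a(1) = x, with bias unit
    z2 = Theta1 @ a1                           # z(2) = Theta(1) a(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a(2) = g(z(2)), add a0(2) = 1
    z3 = Theta2 @ a2                           # z(3) = Theta(2) a(2)
    return sigmoid(z3)                         # h_Theta(x) = a(3) = g(z(3))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))  # s2 x (s1 + 1): 3 hidden units, 3 inputs + bias
Theta2 = rng.normal(size=(1, 4))  # s3 x (s2 + 1): 1 output, 3 hidden units + bias
x = np.array([0.5, -1.0, 2.0])
print(forward(x, Theta1, Theta2))
```
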
Cost function
Logistic regression:
$J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \dfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
Neural network ($K$ output units):
$J(\Theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \Big[ y_k^{(i)} \log \big(h_\Theta(x^{(i)})\big)_k + \big(1 - y_k^{(i)}\big) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \Big] + \dfrac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$
Slide credit: Andrew Ng

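A sketch of the (unregularized) data term of the neural-network cost above in NumPy; `preds` would come from forward propagation, and the function and variable names are mine:

```python
import numpy as np

def nn_cost(preds, labels):
    """Cross-entropy part of J(Theta), without the regularization term.

    preds:  (m, K) array of h_Theta(x^(i)) values in (0, 1)
    labels: (m, K) array of one-hot / binary targets y^(i)
    """
    m = preds.shape[0]
    eps = 1e-12                      # guard against log(0); numerical detail, not on the slide
    return -np.sum(labels * np.log(preds + eps)
                   + (1 - labels) * np.log(1 - preds + eps)) / m

# Tiny made-up example: m = 2 examples, K = 3 classes
preds = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
labels = np.array([[1, 0, 0], [0, 1, 0]])
print(nn_cost(preds, labels))
```
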
Gradient computation
Need to compute: $J(\Theta)$ and $\dfrac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
Slide credit: Andrew Ng

Gradient computation
Given one training example $(x, y)$:
$a^{(1)} = x$
$z^{(2)} = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g\big(z^{(2)}\big)$ (add $a_0^{(2)}$)
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$a^{(3)} = g\big(z^{(3)}\big)$ (add $a_0^{(3)}$)
$z^{(4)} = \Theta^{(3)} a^{(3)}$
$a^{(4)} = g\big(z^{(4)}\big) = h_\Theta(x)$
Slide credit: Andrew Ng

Gradient computation: Backpropagation
Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$
For each output unit (layer $L = 4$): $\delta_j^{(4)} = a_j^{(4)} - y_j$
With $a^{(3)} = g\big(z^{(3)}\big)$, $z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = g\big(z^{(4)}\big) = h_\Theta(x)$, the chain rule gives
$\delta^{(3)} = \dfrac{\partial J}{\partial z^{(3)}} = \dfrac{\partial J}{\partial a^{(4)}} \dfrac{\partial a^{(4)}}{\partial z^{(4)}} \dfrac{\partial z^{(4)}}{\partial a^{(3)}} \dfrac{\partial a^{(3)}}{\partial z^{(3)}} = \big(\Theta^{(3)}\big)^\top \delta^{(4)} \, .\!* \, g'\big(z^{(3)}\big)$
Slide credit: Andrew Ng

Backpropagation algorithm
Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$)
For $i = 1$ to $m$:
  Set $a^{(1)} = x^{(i)}$
  Perform forward propagation to compute $a^{(l)}$ for $l = 2, \ldots, L$
  Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
  Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
  $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \big(a^{(l)}\big)^\top$
Slide credit: Andrew Ng

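A sketch of one pass over the training set for a sigmoid network with a single hidden layer (so $L = 3$ rather than the $L = 4$ example above); the variable names and the toy data are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, Theta1, Theta2):
    """Accumulate Delta(l) := Delta(l) + delta(l+1) a(l)^T over the training set."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # Forward propagation
        a1 = np.concatenate(([1.0], X[i]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)                          # h_Theta(x); here layer L = 3
        # Backward pass
        delta3 = a3 - Y[i]                        # delta(L) = a(L) - y
        delta2 = (Theta2[:, 1:].T @ delta3) * sigmoid(z2) * (1 - sigmoid(z2))
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)
    return Delta1 / m, Delta2 / m                 # unregularized gradients

rng = np.random.default_rng(0)
Theta1, Theta2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
X, Y = rng.normal(size=(5, 3)), rng.integers(0, 2, size=(5, 1)).astype(float)
print([d.shape for d in backprop(X, Y, Theta1, Theta2)])
```
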
Activation: sigmoid
• Partial derivative: $g'(a) = g(a)\,\big(1 - g(a)\big)$
$g(a) = \dfrac{1}{1 + e^{-a}}$
Slide credit: Hugo Larochelle

Activation: hyperbolic tangent (tanh)
• Partial derivative: $g'(a) = 1 - g(a)^2$
$g(a) = \tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
Slide credit: Hugo Larochelle

Activation: rectified linear (ReLU)
• Partial derivative: $g'(a) = \mathbf{1}_{a > 0}$
$g(a) = \mathrm{relu}(a) = \max(0, a)$
Slide credit: Hugo Larochelle

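The same three derivatives in NumPy, as they would be used for the $g'(\cdot)$ factor in backpropagation (a sketch; function names are mine):

```python
import numpy as np

def d_sigmoid(a):
    """g'(a) = g(a) (1 - g(a))."""
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

def d_tanh(a):
    """g'(a) = 1 - g(a)^2."""
    return 1.0 - np.tanh(a) ** 2

def d_relu(a):
    """g'(a) = 1 if a > 0 else 0."""
    return (a > 0).astype(float)

a = np.array([-1.0, 0.5, 2.0])
print(d_sigmoid(a), d_tanh(a), d_relu(a))
```
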
Initialization
• For biases
  • initialize all to 0
• For weights
  • can't initialize all weights to the same value
    • we can show that all hidden units in a layer would then always behave the same
    • need to break symmetry
  • Recipe: sample from $U[-b, b]$
    • the idea is to sample around 0 while breaking symmetry
Slide credit: Hugo Larochelle

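A sketch of this recipe: zero biases and weights drawn from $U[-b, b]$. The specific choice $b = \sqrt{6} / \sqrt{n_\mathrm{in} + n_\mathrm{out}}$ is a common Glorot-style heuristic and is my assumption; the slide only says $U[-b, b]$:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Zero biases; weights sampled from U[-b, b] to break symmetry."""
    b = np.sqrt(6.0) / np.sqrt(n_in + n_out)   # assumed choice of b
    W = rng.uniform(-b, b, size=(n_out, n_in))
    bias = np.zeros(n_out)
    return W, bias

rng = np.random.default_rng(0)
W, bias = init_layer(n_in=3, n_out=4, rng=rng)
print(W.shape, bias)
```
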
Putting it together
Pick a network architecture:
• No. of input units: dimension of the features
• No. of output units: number of classes
• Reasonable default: 1 hidden layer; if >1 hidden layer, use the same no. of hidden units in every layer (usually the more the better)
• Grid search over these choices
Slide credit: Hugo Larochelle

Putting it together
Early stopping:
• Use performance on a validation set to select the best configuration
• To select the number of epochs, stop training when the validation set error starts increasing
Slide credit: Hugo Larochelle

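A sketch of epoch selection by early stopping; `train_one_epoch` and `validation_error` are placeholders for whatever training loop and validation metric are in use, and the `patience` counter is a common refinement rather than something stated on the slide:

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=5):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                      # validation error keeps increasing: stop
    return best_epoch, best_err

# Dummy usage with stand-in callables (a real setup would train a network here)
errs = iter([0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.8, 0.9, 1.0])
print(train_with_early_stopping(model=None,
                                train_one_epoch=lambda m: None,
                                validation_error=lambda m: next(errs)))
```
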
Other tricks of the trade
• Normalizing your (real-valued) data
• Decaying the learning rate
  • as we get closer to the optimum, it makes sense to take smaller update steps
• Mini-batches
  • can give a more accurate estimate of the risk gradient
• Momentum
  • can use an exponential average of previous gradients
Slide credit: Hugo Larochelle

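A sketch combining a mini-batch-style noisy gradient, a decaying learning rate, and momentum as an exponential average of past gradients; the schedule and coefficients are made-up defaults, not values from the slide:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, beta=0.9):
    """velocity is an exponential average of past gradients; w moves along it."""
    velocity = beta * velocity + (1.0 - beta) * grad
    w = w - lr * velocity
    return w, velocity

def decayed_lr(lr0, t, decay=0.01):
    """1/t-style decay: smaller update steps as we approach the optimum (assumed schedule)."""
    return lr0 / (1.0 + decay * t)

# Toy quadratic objective f(w) = ||w||^2 / 2, so grad = w; mini-batch noise is simulated
rng = np.random.default_rng(0)
w, velocity = np.array([5.0, -3.0]), np.zeros(2)
for t in range(100):
    grad = w + 0.1 * rng.normal(size=2)          # noisy "mini-batch" gradient
    w, velocity = sgd_momentum_step(w, grad, velocity, lr=decayed_lr(0.5, t))
print(w)                                         # approaches the optimum at [0, 0]
```
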
Dropout
• Idea: «cripple» the neural network by removing hidden units
  • each hidden unit is set to 0 with probability 0.5
  • hidden units cannot co-adapt to other units
  • hidden units must be more generally useful
Slide credit: Hugo Larochelle

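A sketch of dropout applied to a vector of hidden activations at training time; the $1/(1-p)$ rescaling ("inverted dropout") is a common convention so that no rescaling is needed at test time, and is my addition rather than something on the slide:

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """Zero each hidden unit with probability p during training."""
    if not train:
        return h                                  # use all units at test time
    if rng is None:
        rng = np.random.default_rng(0)
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)                   # inverted-dropout rescaling (assumed convention)

h = np.array([0.3, 1.2, -0.7, 0.9])
print(dropout(h))                 # roughly half of the units are zeroed
print(dropout(h, train=False))    # unchanged at test time
```
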