Neural Networks II
Chen Gao
Virginia Tech, Spring 2019
ECE-5424G / CS-5824

Neural Networks
• Origins: algorithms that try to mimic the brain.
What is this?

A single neuron in the brain
[Figure: a biological neuron, with its input and output labeled]
Slide credit: Andrew Ng

An artificial neuron: Logistic unit
"Input": $x = [x_0, x_1, x_2, x_3]^\top$, with $x_0$ the "bias unit"
"Weights" / "parameters": $\theta = [\theta_0, \theta_1, \theta_2, \theta_3]^\top$
"Output": $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
• Sigmoid (logistic) activation function
Slide credit: Andrew Ng

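A minimal NumPy sketch of this unit (the bias-unit convention follows the slide; the concrete weights and inputs are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, theta):
    """Single artificial neuron: prepend the bias unit x0 = 1, then apply g(theta^T x)."""
    x = np.concatenate(([1.0], x))        # add bias unit x0 = 1
    return sigmoid(theta @ x)

# Illustrative numbers (not from the slide)
theta = np.array([-1.0, 2.0, 0.5, -0.3])  # [theta0 (bias weight), theta1, theta2, theta3]
x = np.array([0.2, 0.4, 0.6])             # [x1, x2, x3]
print(logistic_unit(x, theta))            # a value in (0, 1)
```
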
Visualization of weights, bias, activation function
• The range of the activation is determined by $g(\cdot)$
• The bias $b$ only changes the position of the hyperplane
Slide credit: Hugo Larochelle

Activation: sigmoid
• Squashes the neuron's pre-activation between 0 and 1
• Always positive
• Bounded
• Strictly increasing
$g(a) = \dfrac{1}{1 + e^{-a}}$
Slide credit: Hugo Larochelle

Activation: hyperbolic tangent (tanh)
• Squashes the neuron's pre-activation between -1 and 1
• Can be positive or negative
• Bounded
• Strictly increasing
$g(a) = \tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
Slide credit: Hugo Larochelle

Activation: rectified linear (ReLU)
• Bounded below by 0 (always non-negative)
• Not upper bounded
• Tends to give neurons with sparse activities
$g(a) = \mathrm{relu}(a) = \max(0, a)$
Slide credit: Hugo Larochelle

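The three activations above written out in NumPy (a sketch; function names are mine):

```python
import numpy as np

def sigmoid(a):
    """Squashes the pre-activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """Squashes the pre-activation into (-1, 1)."""
    return np.tanh(a)

def relu(a):
    """Non-negative, unbounded above; zeroes out negative pre-activations."""
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), tanh(a), relu(a))
```
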
Activation: softmax
• For multi-class classification:
  • we need multiple outputs (one output per class)
  • we would like to estimate the conditional probability $p(y = c \mid x)$
• We use the softmax activation function at the output:
$o(a) = \mathrm{softmax}(a) = \left[\dfrac{e^{a_1}}{\sum_c e^{a_c}}, \ldots, \dfrac{e^{a_C}}{\sum_c e^{a_c}}\right]^\top$
Slide credit: Hugo Larochelle

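A NumPy sketch of the softmax output; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated on the slide:

```python
import numpy as np

def softmax(a):
    """o(a)_c = exp(a_c) / sum_c' exp(a_c'), one probability per class."""
    a = a - np.max(a)            # stability: softmax is invariant to a constant shift
    e = np.exp(a)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])   # made-up class scores
p = softmax(scores)
print(p, p.sum())                    # probabilities, summing to 1
```
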
Universal approximation theorem
"A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units" (Hornik, 1991)
Slide credit: Hugo Larochelle

Neural network: Multilayer
[Figure: inputs $x_1, x_2, x_3$ (plus bias $x_0$), hidden units $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ (plus bias $a_0^{(2)}$), output $h_\Theta(x)$]
Layer 1 (input), Layer 2 (hidden), Layer 3 (output)
Slide credit: Andrew Ng

Neural network
• $a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
• $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
$a_1^{(2)} = g\big(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\big)$
$a_2^{(2)} = g\big(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\big)$
$a_3^{(2)} = g\big(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\big)$
$h_\Theta(x) = g\big(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\big)$
Size of $\Theta^{(j)}$? With $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, $\Theta^{(j)}$ is $s_{j+1} \times (s_j + 1)$.
Slide credit: Andrew Ng

Neural network: "Pre-activation"
$z^{(2)} = \big[z_1^{(2)}, z_2^{(2)}, z_3^{(2)}\big]^\top$, $\quad x = \big[x_0, x_1, x_2, x_3\big]^\top$
Why do we need $g(\cdot)$?
$z_1^{(2)} = \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3$, $\quad a_1^{(2)} = g\big(z_1^{(2)}\big)$
$z_2^{(2)} = \Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3$, $\quad a_2^{(2)} = g\big(z_2^{(2)}\big)$
$z_3^{(2)} = \Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3$, $\quad a_3^{(2)} = g\big(z_3^{(2)}\big)$
$h_\Theta(x) = g\big(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\big) = g\big(z^{(3)}\big)$
Slide credit: Andrew Ng

Neural network: "Pre-activation" (vectorized)
$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g\big(z^{(2)}\big)$
Add $a_0^{(2)} = 1$
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$h_\Theta(x) = a^{(3)} = g\big(z^{(3)}\big)$
Slide credit: Andrew Ng

Flow graph: Forward propagation
$x \;\rightarrow\; z^{(2)} \;\rightarrow\; a^{(2)} \;\rightarrow\; z^{(3)} \;\rightarrow\; a^{(3)} = h_\Theta(x)$
$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g\big(z^{(2)}\big)$ (add $a_0^{(2)} = 1$)
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$h_\Theta(x) = a^{(3)} = g\big(z^{(3)}\big)$
How do we evaluate our prediction?

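A vectorized forward pass for a network of this shape, as a sketch (weight shapes follow the $s_{j+1} \times (s_j + 1)$ convention; the layer sizes and random weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation: x -> z2 -> a2 -> z3 -> a3 = h_Theta(x)."""
    a1 = np.concatenate(([1.0], x))            # a(1) = x, with bias unit
    z2 = Theta1 @ a1                           # z(2) = Theta(1) a(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a(2) = g(z(2)), add a0(2) = 1
    z3 = Theta2 @ a2                           # z(3) = Theta(2) a(2)
    return sigmoid(z3)                         # h_Theta(x) = a(3) = g(z(3))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))  # s2 x (s1 + 1): 3 hidden units, 3 inputs + bias
Theta2 = rng.normal(size=(1, 4))  # s3 x (s2 + 1): 1 output, 3 hidden units + bias
x = np.array([0.5, -1.0, 2.0])
print(forward(x, Theta1, Theta2))
```
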
Cost function
Logistic regression:
$J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \dfrac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
Neural network ($K$ output units):
$J(\Theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \Big[ y_k^{(i)} \log \big(h_\Theta(x^{(i)})\big)_k + \big(1 - y_k^{(i)}\big) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \Big] + \dfrac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$
Slide credit: Andrew Ng

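A sketch of the (unregularized) data term of the neural-network cost above in NumPy; `preds` would come from forward propagation, and the function and variable names are mine:

```python
import numpy as np

def nn_cost(preds, labels):
    """Cross-entropy part of J(Theta), without the regularization term.

    preds:  (m, K) array of h_Theta(x^(i)) values in (0, 1)
    labels: (m, K) array of one-hot / binary targets y^(i)
    """
    m = preds.shape[0]
    eps = 1e-12                      # guard against log(0); numerical detail, not on the slide
    return -np.sum(labels * np.log(preds + eps)
                   + (1 - labels) * np.log(1 - preds + eps)) / m

# Tiny made-up example: m = 2 examples, K = 3 classes
preds = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
labels = np.array([[1, 0, 0], [0, 1, 0]])
print(nn_cost(preds, labels))
```
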
Gradient computation
Need to compute: $J(\Theta)$ and $\dfrac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
Slide credit: Andrew Ng

Gradient computation
Given one training example $(x, y)$:
$a^{(1)} = x$
$z^{(2)} = \Theta^{(1)} a^{(1)}$
$a^{(2)} = g\big(z^{(2)}\big)$ (add $a_0^{(2)}$)
$z^{(3)} = \Theta^{(2)} a^{(2)}$
$a^{(3)} = g\big(z^{(3)}\big)$ (add $a_0^{(3)}$)
$z^{(4)} = \Theta^{(3)} a^{(3)}$
$a^{(4)} = g\big(z^{(4)}\big) = h_\Theta(x)$
Slide credit: Andrew Ng

Gradient computation: Backpropagation
Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$
For each output unit (layer $L = 4$): $\delta_j^{(4)} = a_j^{(4)} - y_j$
With $a^{(3)} = g\big(z^{(3)}\big)$, $z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = g\big(z^{(4)}\big) = h_\Theta(x)$, the chain rule gives
$\delta^{(3)} = \dfrac{\partial J}{\partial z^{(3)}} = \dfrac{\partial J}{\partial a^{(4)}} \dfrac{\partial a^{(4)}}{\partial z^{(4)}} \dfrac{\partial z^{(4)}}{\partial a^{(3)}} \dfrac{\partial a^{(3)}}{\partial z^{(3)}} = \big(\Theta^{(3)}\big)^\top \delta^{(4)} \, .\!* \, g'\big(z^{(3)}\big)$
Slide credit: Andrew Ng

Backpropagation algorithm
Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$)
For $i = 1$ to $m$:
  Set $a^{(1)} = x^{(i)}$
  Perform forward propagation to compute $a^{(l)}$ for $l = 2, \ldots, L$
  Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
  Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
  $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \big(a^{(l)}\big)^\top$
Slide credit: Andrew Ng

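A sketch of one pass over the training set for a sigmoid network with a single hidden layer (so $L = 3$ rather than the $L = 4$ example above); the variable names and the toy data are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, Theta1, Theta2):
    """Accumulate Delta(l) := Delta(l) + delta(l+1) a(l)^T over the training set."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # Forward propagation
        a1 = np.concatenate(([1.0], X[i]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)                          # h_Theta(x); here layer L = 3
        # Backward pass
        delta3 = a3 - Y[i]                        # delta(L) = a(L) - y
        delta2 = (Theta2[:, 1:].T @ delta3) * sigmoid(z2) * (1 - sigmoid(z2))
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)
    return Delta1 / m, Delta2 / m                 # unregularized gradients

rng = np.random.default_rng(0)
Theta1, Theta2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
X, Y = rng.normal(size=(5, 3)), rng.integers(0, 2, size=(5, 1)).astype(float)
print([d.shape for d in backprop(X, Y, Theta1, Theta2)])
```
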
Activation: sigmoid
• Partial derivative: $g'(a) = g(a)\,\big(1 - g(a)\big)$
$g(a) = \dfrac{1}{1 + e^{-a}}$
Slide credit: Hugo Larochelle

Activation: hyperbolic tangent (tanh)
• Partial derivative: $g'(a) = 1 - g(a)^2$
$g(a) = \tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
Slide credit: Hugo Larochelle

Activation: rectified linear (ReLU)
• Partial derivative: $g'(a) = \mathbf{1}_{a > 0}$
$g(a) = \mathrm{relu}(a) = \max(0, a)$
Slide credit: Hugo Larochelle

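The same three derivatives in NumPy, as they would be used for the $g'(\cdot)$ factor in backpropagation (a sketch; function names are mine):

```python
import numpy as np

def d_sigmoid(a):
    """g'(a) = g(a) (1 - g(a))."""
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

def d_tanh(a):
    """g'(a) = 1 - g(a)^2."""
    return 1.0 - np.tanh(a) ** 2

def d_relu(a):
    """g'(a) = 1 if a > 0 else 0."""
    return (a > 0).astype(float)

a = np.array([-1.0, 0.5, 2.0])
print(d_sigmoid(a), d_tanh(a), d_relu(a))
```
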
Initialization
• For biases
  • initialize all to 0
• For weights
  • can't initialize all weights to the same value
    • we can show that all hidden units in a layer would then always behave the same
    • need to break symmetry
  • Recipe: sample from $U[-b, b]$
    • the idea is to sample around 0 while breaking symmetry
Slide credit: Hugo Larochelle

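A sketch of this recipe: zero biases and weights drawn from $U[-b, b]$. The specific choice $b = \sqrt{6} / \sqrt{n_\mathrm{in} + n_\mathrm{out}}$ is a common Glorot-style heuristic and is my assumption; the slide only says $U[-b, b]$:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Zero biases; weights sampled from U[-b, b] to break symmetry."""
    b = np.sqrt(6.0) / np.sqrt(n_in + n_out)   # assumed choice of b
    W = rng.uniform(-b, b, size=(n_out, n_in))
    bias = np.zeros(n_out)
    return W, bias

rng = np.random.default_rng(0)
W, bias = init_layer(n_in=3, n_out=4, rng=rng)
print(W.shape, bias)
```
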
Putting it together
Pick a network architecture:
• No. of input units: dimension of the features
• No. of output units: number of classes
• Reasonable default: 1 hidden layer; if >1 hidden layer, use the same no. of hidden units in every layer (usually the more the better)
• Grid search over these choices
Slide credit: Hugo Larochelle

Putting it together
Early stopping:
• Use performance on a validation set to select the best configuration
• To select the number of epochs, stop training when the validation set error starts increasing
Slide credit: Hugo Larochelle

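A sketch of epoch selection by early stopping; `train_one_epoch` and `validation_error` are placeholders for whatever training loop and validation metric are in use, and the `patience` counter is a common refinement rather than something stated on the slide:

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=5):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                      # validation error keeps increasing: stop
    return best_epoch, best_err

# Dummy usage with stand-in callables (a real setup would train a network here)
errs = iter([0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.8, 0.9, 1.0])
print(train_with_early_stopping(model=None,
                                train_one_epoch=lambda m: None,
                                validation_error=lambda m: next(errs)))
```
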
Other tricks of the trade
• Normalizing your (real-valued) data
• Decaying the learning rate
  • as we get closer to the optimum, it makes sense to take smaller update steps
• Mini-batches
  • can give a more accurate estimate of the risk gradient
• Momentum
  • can use an exponential average of previous gradients
Slide credit: Hugo Larochelle

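A sketch combining a mini-batch-style noisy gradient, a decaying learning rate, and momentum as an exponential average of past gradients; the schedule and coefficients are made-up defaults, not values from the slide:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, beta=0.9):
    """velocity is an exponential average of past gradients; w moves along it."""
    velocity = beta * velocity + (1.0 - beta) * grad
    w = w - lr * velocity
    return w, velocity

def decayed_lr(lr0, t, decay=0.01):
    """1/t-style decay: smaller update steps as we approach the optimum (assumed schedule)."""
    return lr0 / (1.0 + decay * t)

# Toy quadratic objective f(w) = ||w||^2 / 2, so grad = w; mini-batch noise is simulated
rng = np.random.default_rng(0)
w, velocity = np.array([5.0, -3.0]), np.zeros(2)
for t in range(100):
    grad = w + 0.1 * rng.normal(size=2)          # noisy "mini-batch" gradient
    w, velocity = sgd_momentum_step(w, grad, velocity, lr=decayed_lr(0.5, t))
print(w)                                         # approaches the optimum at [0, 0]
```
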
Dropout
• Idea: «cripple» the neural network by removing hidden units
  • each hidden unit is set to 0 with probability 0.5
  • hidden units cannot co-adapt to other units
  • hidden units must be more generally useful
Slide credit: Hugo Larochelle

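A sketch of dropout applied to a vector of hidden activations at training time; the $1/(1-p)$ rescaling ("inverted dropout") is a common convention so that no rescaling is needed at test time, and is my addition rather than something on the slide:

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """Zero each hidden unit with probability p during training."""
    if not train:
        return h                                  # use all units at test time
    if rng is None:
        rng = np.random.default_rng(0)
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)                   # inverted-dropout rescaling (assumed convention)

h = np.array([0.3, 1.2, -0.7, 0.9])
print(dropout(h))                 # roughly half of the units are zeroed
print(dropout(h, train=False))    # unchanged at test time
```
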