Lecture 18: Anatomy of NN
CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader
Outline
Anatomy of a NN
Design choices:
• Activation function
• Loss function
• Output units
• Architecture
Anatomy of an artificial neural network (ANN)
A single neuron (node): input X, output Y.
Anatomy of an artificial neural network (ANN)
Inside the neuron: an affine transformation of the input X, followed by a non-linear activation g, giving the output Y = g(WᵀX + b).
We will talk later about the choice of activation function. So far we have only talked about the sigmoid as an activation function, but there are other choices.
Anatomy of an artificial neural network (ANN)
Input layer, hidden layer, output layer.
Hidden layer: affine transformations ℓ₁ = W₁₁X₁ + W₁₂X₂ + W₁₀ and ℓ₂ = W₂₁X₁ + W₂₂X₂ + W₂₀, with activations h₁ = g(ℓ₁) and h₂ = g(ℓ₂).
Output function: Ẑ = f(h₁, h₂). Loss function: J = ℒ(Ẑ).
We will talk later about the choice of the output layer and the loss function. So far we have considered the sigmoid as the output and the log-Bernoulli (binary cross-entropy) loss.
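To make the slide above concrete, here is a minimal NumPy sketch of the forward pass for this small network, assuming (as on the slide so far) a sigmoid activation g, a sigmoid output unit, and the log-Bernoulli (binary cross-entropy) loss; all weight values below are invented for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Invented parameters for a network with 2 inputs, 2 hidden nodes, 1 output
W_hidden = np.array([[0.5, -0.3],   # row i holds the weights of hidden node i
                     [0.8,  0.2]])
b_hidden = np.array([0.1, -0.1])
w_out = np.array([1.2, -0.7])
b_out = 0.05

def forward(x, y):
    ell = W_hidden @ x + b_hidden      # affine transformations l1, l2
    h = sigmoid(ell)                   # activations h1 = g(l1), h2 = g(l2)
    z = sigmoid(w_out @ h + b_out)     # output unit Z-hat
    loss = -(y * np.log(z) + (1 - y) * np.log(1 - z))  # J: binary cross-entropy
    return z, loss

z, loss = forward(np.array([1.0, 2.0]), y=1)
print(z, loss)
```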
Anatomy of an artificial neural network (ANN)
Input layer, hidden layer 1, hidden layer 2, output layer (a network with two hidden layers).
Anatomy of an artificial neural network (ANN)
Input layer, hidden layers 1 through n, output layer.
We will talk later about the choice of the number of layers.
Anatomy of an artificial neural network (ANN)
Input layer, hidden layers 1 through n with 3 nodes each, output layer.
Anatomy of an artificial neural network (ANN)
Input layer, hidden layers 1 through n with m nodes each, output layer.
We will talk later about the choice of the number of nodes.
Anatomy of an artificial neural network (ANN)
Input layer with d inputs, hidden layers 1 through n with m nodes each, output layer.
The number of inputs is specified by the data.
Why layers? Representation matters! For example, data that are hard to separate in Cartesian coordinates can become linearly separable after a change to polar coordinates.
Learning Multiple Components
Depth = Repeated Compositions
Neural Networks
Hand-written digit recognition: MNIST data
Depth = Repeated Compositions
Outline
Anatomy of a NN
Design choices:
• Activation function
• Loss function
• Output units
• Architecture
Activation function
h = g(Wᵀx + c)
The activation function should:
• Ensure non-linearity
• Ensure gradients remain large through the hidden units
Common choices:
• Sigmoid
• ReLU, leaky ReLU, generalized ReLU, Maxout
• Softplus
• Tanh
• Swish
Beyond Linear Models
Linear models:
• Can be fit efficiently (via convex optimization)
• Limited model capacity
Alternative: f(x) = wᵀφ(x), where φ is a non-linear transform.
Traditional ML
Manually engineer φ:
• Domain specific, enormous human effort
Use a generic transform:
• Maps to a higher-dimensional space
• Kernel methods, e.g. RBF kernels
• Overfitting: does not generalize well to the test set
• Cannot encode enough prior information
Deep Learning
Directly learn φ:
• f(x; θ, w) = wᵀφ(x; θ), where θ are the parameters of the transform
• φ defines the hidden layers
• Non-convex optimization
• Can encode prior beliefs, generalizes well
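To see what "the hidden layers are φ" means in code, here is a minimal NumPy sketch in which φ(x; θ) is a single hidden layer of tanh units and the final prediction is just a linear readout wᵀφ(x; θ); all weights are invented for illustration.

```python
import numpy as np

def phi(x, W, b):
    """Learned non-linear transform: one hidden layer of tanh units."""
    return np.tanh(W @ x + b)

# theta = (W, b) are the parameters of the transform; w is the linear readout
W = np.array([[0.4, -0.2], [0.1, 0.9], [-0.5, 0.3]])  # 3 hidden units, 2 inputs
b = np.zeros(3)
w = np.array([0.7, -1.1, 0.2])

x = np.array([1.0, -0.5])
f_x = w @ phi(x, W, b)   # f(x; theta, w) = w^T phi(x; theta)
print(f_x)
```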
Activation function
h = g(Wᵀx + c)
The activation function should:
• Ensure non-linearity
• Ensure gradients remain large through the hidden units
Common choices:
• Sigmoid
• ReLU, leaky ReLU, generalized ReLU, Maxout
• Softplus
• Tanh
• Swish
ReLU and Softplus
(The ReLU of an affine input Wx + b is zero to the left of the point x = -b/W, where the affine term changes sign; the softplus is a smooth approximation to the ReLU.)
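A minimal NumPy sketch of these two activations, for reference:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softplus(a):
    return np.log1p(np.exp(a))  # smooth approximation to the ReLU

a = np.linspace(-4, 4, 9)
print(relu(a))
print(softplus(a))
```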
Generalized ReLU
Generalization: for αᵢ > 0,
g(zᵢ, αᵢ) = max{0, zᵢ} + αᵢ min{0, zᵢ}
(A small fixed αᵢ gives the leaky ReLU; a learned αᵢ gives the parametric ReLU.)
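A minimal NumPy sketch of the generalized ReLU above; with a small fixed alpha it behaves as a leaky ReLU, and with per-unit alpha values it behaves as a parametric ReLU.

```python
import numpy as np

def generalized_relu(z, alpha):
    # g(z_i, alpha_i) = max{0, z_i} + alpha_i * min{0, z_i}
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(generalized_relu(z, alpha=0.01))                              # leaky ReLU
print(generalized_relu(z, alpha=np.array([0.1, 0.2, 0.3, 0.4, 0.5])))  # per-unit alphas
```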
Maxout
Max of k linear functions. Directly learns the activation function:
g(x) = max over i ∈ {1, …, k} of (wᵢᵀx + bᵢ)
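A minimal NumPy sketch of a single maxout unit, taking the max of k invented affine functions of the input:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the max of k affine functions of the input x."""
    return np.max(W @ x + b)   # W has shape (k, d), b has shape (k,)

W = np.array([[ 1.0, -0.5],    # k = 3 linear pieces, d = 2 inputs (invented values)
              [-0.3,  0.8],
              [ 0.2,  0.2]])
b = np.array([0.0, 0.1, -0.2])
print(maxout(np.array([1.0, 2.0]), W, b))
```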
Swish: A Self-Gated Activation Function
Currently, the most successful and widely used activation function is the ReLU. Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
f(x) = x · σ(x)
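A minimal NumPy sketch of swish:

```python
import numpy as np

def swish(x):
    # f(x) = x * sigmoid(x): smooth, non-monotonic, self-gated
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])))
```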
Outline
Anatomy of a NN
Design choices:
• Activation function
• Loss function
• Output units
• Architecture
Loss Function
Cross-entropy between the training data and the model distribution (i.e. the negative log-likelihood):
J(W) = −E_{x,y ~ p̂_data} [ log p_model(y | x) ]
• We do not need to design a separate loss function for each model: the likelihood supplies one.
• The gradient of the cost function must be large enough to guide learning.
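For a binary classifier this negative log-likelihood is the familiar binary cross-entropy. A minimal NumPy sketch, assuming the model outputs the predicted probability that y = 1 (the numbers are invented):

```python
import numpy as np

def nll_binary(y_true, p_model):
    """Average negative log-likelihood (binary cross-entropy) of the data under the model."""
    return -np.mean(y_true * np.log(p_model) + (1 - y_true) * np.log(1 - p_model))

y_true = np.array([1, 0, 1, 1])
p_model = np.array([0.9, 0.2, 0.7, 0.6])  # model's predicted P(y = 1 | x)
print(nll_binary(y_true, p_model))
```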
Loss Function
Example: sigmoid output + squared loss
L_MSE = (y − ŷ)² = (y − σ(a))², where a is the pre-activation of the output unit.
Problem: flat surfaces (the gradient vanishes when the sigmoid saturates).
Cost Function
Example: sigmoid output + cross-entropy loss
L_CE(y, ŷ) = −{ y log ŷ + (1 − y) log(1 − ŷ) }
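The contrast between the two previous slides can be checked numerically: with a sigmoid output ŷ = σ(a), the squared-loss gradient with respect to the pre-activation a carries a factor σ′(a) and vanishes when the unit saturates, while the cross-entropy gradient reduces to σ(a) − y and stays informative. A small NumPy sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

y = 1.0
a = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])  # pre-activation of the output unit
z = sigmoid(a)

grad_mse = 2 * (z - y) * z * (1 - z)  # d/da of (y - sigmoid(a))^2
grad_ce  = z - y                      # d/da of the cross-entropy loss

print(grad_mse)  # ~0 when the unit saturates at the wrong answer (a = -10)
print(grad_ce)   # stays close to -1 there, so learning can still proceed
```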
Design Choices
• Activation function
• Loss function
• Output units
• Architecture
• Optimizer
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | ? | ? | ?
Link function
X ⟹ φ(X) = WᵀX ⟹ P(y = 0) = 1 / (1 + e^{−φ(X)})
Output unit: Ẑ = σ(φ(X)) = P(y = 0)
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary cross-entropy
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary cross-entropy
Discrete | ? | ? | ?
Link function: multi-class problem
Output unit (SoftMax):
Ẑ_k = e^{φ_k(X)} / Σ_{k'=1}^{K} e^{φ_{k'}(X)}
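A minimal NumPy sketch of the softmax output unit (the max is subtracted before exponentiating purely for numerical stability; it does not change the probabilities):

```python
import numpy as np

def softmax(phi):
    e = np.exp(phi - np.max(phi))
    return e / np.sum(e)

phi = np.array([2.0, 1.0, 0.1])  # scores phi_k(X) for K = 3 classes
print(softmax(phi))              # non-negative, sums to 1
```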
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary cross-entropy
Discrete | Multinoulli | Softmax | Cross-entropy
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary cross-entropy
Discrete | Multinoulli | Softmax | Cross-entropy
Continuous | Gaussian | Linear | MSE
Output Units
Output Type | Output Distribution | Output Layer | Cost Function
Binary | Bernoulli | Sigmoid | Binary cross-entropy
Discrete | Multinoulli | Softmax | Cross-entropy
Continuous | Gaussian | Linear | MSE
Continuous | Arbitrary | - | GANs
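The table maps directly onto how the output layer and loss are typically paired in a framework such as Keras. The sketch below is only an illustration: the hidden layer, sizes, and optimizer are arbitrary choices, and only the output-activation / loss pairing follows the table (the GAN row requires a very different setup and is omitted).

```python
from tensorflow import keras
from tensorflow.keras import layers

def make_model(output_type, n_inputs=10, n_classes=5):
    """Choose the output layer and loss according to the table above."""
    base = [keras.Input(shape=(n_inputs,)),
            layers.Dense(32, activation="relu")]  # arbitrary hidden layer
    if output_type == "binary":      # Bernoulli -> sigmoid + binary cross-entropy
        out, loss = layers.Dense(1, activation="sigmoid"), "binary_crossentropy"
    elif output_type == "discrete":  # Multinoulli -> softmax + cross-entropy
        out, loss = layers.Dense(n_classes, activation="softmax"), "categorical_crossentropy"
    else:                            # continuous, Gaussian -> linear + MSE
        out, loss = layers.Dense(1, activation="linear"), "mse"
    model = keras.Sequential(base + [out])
    model.compile(optimizer="adam", loss=loss)
    return model

make_model("binary").summary()
```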
Design Choices
• Activation function
• Loss function
• Output units
• Architecture
• Optimizer
NN in action
Universal Approximation Theorem
Think of a neural network as function approximation: Y = f(x) + ε, and the network provides an estimate Ŷ = f̂(x).
• One hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy.
So why go deeper?
• A shallow net may need (exponentially) more width
• A shallow net may overfit more
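One way to get a feel for the width vs. depth trade-off is to compare the parameter counts of a shallow-wide and a deep-narrow Keras model; the widths and depths below are arbitrary and chosen only to illustrate the comparison, not to prove anything about accuracy.

```python
from tensorflow import keras
from tensorflow.keras import layers

def shallow(width, n_inputs=2):
    # One very wide hidden layer
    return keras.Sequential([keras.Input(shape=(n_inputs,)),
                             layers.Dense(width, activation="relu"),
                             layers.Dense(1, activation="sigmoid")])

def deep(width, depth, n_inputs=2):
    # Several narrow hidden layers: more compositions of non-linearities
    hidden = [layers.Dense(width, activation="relu") for _ in range(depth)]
    return keras.Sequential([keras.Input(shape=(n_inputs,))] + hidden +
                            [layers.Dense(1, activation="sigmoid")])

print(shallow(1024).count_params())  # single wide hidden layer
print(deep(32, 4).count_params())    # four narrow layers, here with fewer parameters
```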
Better Generalization with Depth (Goodfellow 2017)
Large, Shallow Nets Overfit More (Goodfellow 2017)