
Lecture 19: Anatomy of NN (CS109A Introduction to Data Science) - PowerPoint PPT Presentation



1. Lecture 19: Anatomy of NN. CS109A Introduction to Data Science. Pavlos Protopapas, Kevin Rader and Chris Tanner.

2. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

3. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

4. Anatomy of artificial neural network (ANN). (Diagram: a single neuron, a node with input X, weight W, and output Y.)

5. Anatomy of artificial neural network (ANN). (Diagram: a single neuron; the input X and weights W enter an affine transformation giving h, and the activation then produces the output $Y = g(h)$.) We will talk later about the choice of activation function. So far we have only talked about the sigmoid as an activation function, but there are other choices.

6. Anatomy of artificial neural network (ANN): input layer, hidden layer, output layer.
$z_1 = W_{11} X_1 + W_{12} X_2 + W_{10}$, $h_1 = g(z_1)$
$z_2 = W_{21} X_1 + W_{22} X_2 + W_{20}$, $h_2 = g(z_2)$
$\hat{Y} = f(h_1, h_2)$ (output function), $L = \mathcal{L}(Y, \hat{Y})$ (loss function)
We will talk later about the choice of the output layer and the loss function. So far we have considered the sigmoid as the output and the log-Bernoulli (cross-entropy) loss.
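As a concrete illustration of this forward pass, here is a minimal NumPy sketch; the input values, weights, and label below are made-up placeholders, not numbers from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up inputs, weights and label: placeholders, not values from the lecture.
X = np.array([0.5, -1.2])                 # inputs X1, X2
W = np.array([[0.1, 0.4],                 # W[i, j]: weight from input j to hidden unit i
              [-0.3, 0.2]])
b = np.array([0.05, -0.1])                # bias terms W10, W20

z = W @ X + b                             # affine transformation: z_i = W_i1 X1 + W_i2 X2 + W_i0
h = sigmoid(z)                            # activation: h_i = g(z_i)

w_out = np.array([0.7, -0.5])             # output-layer weights (also made up)
b_out = 0.2
y_hat = sigmoid(w_out @ h + b_out)        # output function: sigmoid, as used so far in the course

y = 1.0                                   # true label
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # log-Bernoulli (cross-entropy) loss
print(f"y_hat = {y_hat:.3f}, loss = {loss:.3f}")
```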

7. Anatomy of artificial neural network (ANN): input layer, hidden layer 1, hidden layer 2, output layer. (Diagram: inputs $X_1, X_2$ with weights $W_{11}, W_{12}, W_{21}, W_{22}$ and output $\hat{Y}$.)

8. Anatomy of artificial neural network (ANN): input layer, hidden layer 1, ..., hidden layer n, output layer. We will talk later about the choice of the number of layers.

9. Anatomy of artificial neural network (ANN): input layer; hidden layer 1 through hidden layer n, each with 3 nodes; output layer.

10. Anatomy of artificial neural network (ANN): input layer; hidden layer 1 through hidden layer n, each with m nodes; output layer. We will talk later about the choice of the number of nodes.

11. Anatomy of artificial neural network (ANN): input layer with d inputs; hidden layer 1 through hidden layer n, each with m nodes; output layer. The number of inputs is specified by the data.
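To make these design choices concrete (number of inputs d fixed by the data, number of hidden layers n, nodes per layer m), here is a hedged Keras sketch; the specific values of d, n, and m are arbitrary placeholders, and Keras is just one possible framework, not one prescribed by the lecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

d = 10   # number of inputs: fixed by the data (placeholder value here)
n = 3    # number of hidden layers: a design choice
m = 16   # nodes per hidden layer: a design choice

model = keras.Sequential()
model.add(keras.Input(shape=(d,)))
for _ in range(n):
    model.add(layers.Dense(m, activation="sigmoid"))  # hidden layers
model.add(layers.Dense(1, activation="sigmoid"))      # output layer
model.summary()
```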

12. Anatomy of artificial neural network (ANN): input layer, hidden layer 1, hidden layer 2, output layer.

13. Anatomy of artificial neural network (ANN): input layer, hidden layer 1, hidden layer 2, output layer.

14. Why layers? Representation matters!

15. Learning Multiple Components

16. Depth = Repeated Compositions

17. Neural Networks: hand-written digit recognition (MNIST data)

18. Depth = Repeated Compositions

19. Beyond Linear Models. Linear models: • can be fit efficiently (via convex optimization) • limited model capacity. Alternative: $f(x) = w^T \phi(x)$, where $\phi$ is a non-linear transform.
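As a small illustration of a linear model on a fixed non-linear transform $\phi$ (here RBF features, one of the generic transforms mentioned on the next slide), consider the following sketch; the toy data, the centres, and the bandwidth gamma are all made up.

```python
import numpy as np

# Fixed, hand-chosen RBF feature map phi(x): one Gaussian bump per centre.
centres = np.linspace(-3, 3, 10)

def phi(x, gamma=1.0):
    # x: shape (n,) -> features: shape (n, number of centres)
    return np.exp(-gamma * (x[:, None] - centres[None, :]) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)   # toy non-linear target

Phi = phi(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # fit f(x) = w^T phi(x) by least squares (convex)
print("training MSE:", np.mean((Phi @ w - y) ** 2))
```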

20. Traditional ML. Manually engineer $\phi$: • domain specific, enormous human effort. Generic transform: • maps to a higher-dimensional space • kernel methods, e.g. RBF kernels • overfitting: does not generalize well to the test set • cannot encode enough prior information.

21. Deep Learning. Directly learn $\phi$: • $f(x; \theta) = w^T \phi(x; \theta)$ • $\phi(x; \theta)$ is an automatically learned representation of x • for deep networks, $\phi$ is the function learned by the hidden layers of the network • $\theta$ are the learned weights • non-convex optimization • can encode prior beliefs, generalizes well.
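A hedged Keras sketch of the same idea, where the hidden layers play the role of the learned representation $\phi(x; \theta)$ and the final linear layer supplies $w^T \phi(x; \theta)$; the layer sizes and optimizer are illustrative choices, not prescribed by the lecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

# phi(x; theta): learned by the hidden layers; the final Dense(1) plays the role of w^T phi.
phi = keras.Sequential([
    keras.Input(shape=(1,)),
    layers.Dense(32, activation="relu"),   # hidden layers = learned representation
    layers.Dense(32, activation="relu"),
])
model = keras.Sequential([phi, layers.Dense(1)])   # f(x; theta) = w^T phi(x; theta)
model.compile(optimizer="adam", loss="mse")        # fit by non-convex optimization over theta and w
```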

22. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

23. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

24. Activation function: $h = g(W^T X + b)$. The activation function should: • provide non-linearity • ensure gradients remain large through the hidden units. Common choices are: • sigmoid • ReLU, leaky ReLU, generalized ReLU, maxout • softplus • tanh • swish

25. Activation function: $h = g(W^T X + b)$. The activation function should: • provide non-linearity • ensure gradients remain large through the hidden units. Common choices are: • sigmoid • tanh • ReLU, leaky ReLU, generalized ReLU, maxout • softplus • swish

26. Activation function: $h = g(W^T X + b)$. The activation function should: • provide non-linearity • ensure gradients remain large through the hidden units. Common choices are: • sigmoid • tanh • ReLU, leaky ReLU, generalized ReLU, maxout • softplus • swish

27. Sigmoid (aka logistic): $y = \frac{1}{1 + e^{-x}}$. The derivative is close to zero for much of the domain. This leads to "vanishing gradients" in backpropagation.

28. Hyperbolic Tangent (tanh): $y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Same problem of "vanishing gradients" as the sigmoid.
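A quick numerical check of the vanishing-gradient point for both the sigmoid and tanh; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.0, 2.0, 5.0, 10.0])
d_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))   # sigma'(x) = sigma(x) * (1 - sigma(x))
d_tanh = 1.0 - np.tanh(x) ** 2                # tanh'(x) = 1 - tanh(x)^2

print(d_sigmoid)   # roughly [0.25, 0.105, 0.0066, 0.000045]: shrinks fast as |x| grows
print(d_tanh)      # roughly [1.0, 0.071, 0.00018, 8e-09]: same vanishing-gradient behaviour
```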

29. Rectified Linear Unit (ReLU): $y = \max(0, x)$. Two major advantages: 1. no vanishing gradient when $x > 0$; 2. provides sparsity (regularization), since $y = 0$ when $x < 0$.

30. Leaky ReLU: $y = \max(0, x) + \alpha \min(0, x)$, where $\alpha$ takes a small value. • Tries to fix the "dying ReLU" problem: the derivative is non-zero everywhere. • Some people report success with this form of activation function, but the results are not always consistent.

31. Generalized ReLU. Generalization: for $\alpha_i > 0$, $g(x_i, \alpha) = \max(0, x_i) + \alpha_i \min(0, x_i)$.

32. Softplus: $y = \log(1 + e^{x})$. The logistic sigmoid function is a smooth approximation of the derivative of the rectifier.

33. Maxout: the max of k linear functions, which directly learns the activation function: $g(x) = \max_{i \in \{1, \dots, k\}} (\alpha_i x + \beta_i)$.

34. Swish: A Self-Gated Activation Function, $g(x) = x\,\sigma(x)$. Currently, the most successful and widely used activation function is the ReLU. Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
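The activation functions listed above can be written in a few lines of NumPy. This is an illustrative sketch; the leaky-ReLU slope alpha and the maxout weights are made-up values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(0, x) + alpha * min(0, x), with a small fixed slope alpha
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

def softplus(x):
    return np.log1p(np.exp(x))

def swish(x):
    # self-gated: x * sigmoid(x)
    return x * sigmoid(x)

def maxout(x, W, b):
    # max over k linear functions w_i^T x + b_i; W has shape (k, d), b has shape (k,)
    return np.max(W @ x + b)

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), softplus(x), swish(x), sep="\n")
print(maxout(np.array([1.0, -2.0]),
             np.array([[0.5, 0.1], [-0.2, 0.3]]),   # made-up maxout weights
             np.array([0.0, 0.1])))
```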

35. Outline: Anatomy of a NN; Design choices • Activation function • Loss function • Output units • Architecture

36. Loss Function.
Likelihood for a given point: $p(y_i \mid W; x_i)$.
Assuming independence, the likelihood for all measurements is $L(W; X, Y) = p(Y \mid W; X) = \prod_i p(y_i \mid W; x_i)$.
Maximize the likelihood, or equivalently maximize the log-likelihood: $\log L(W; X, Y) = \sum_i \log p(y_i \mid W; x_i)$.
Turn this into a loss function: $\mathcal{L}(W; X, Y) = -\log L(W; X, Y)$.

37. Loss Function. We do not need to design separate loss functions if we follow this simple procedure. Examples:
• If the distribution is Normal, the likelihood is $p(y_i \mid W; x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}}$, giving $\mathcal{L}(W; X, Y) = \sum_i (y_i - \hat{y}_i)^2$ (MSE).
• If the distribution is Bernoulli, the likelihood is $p(y_i \mid W; x_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$, giving $\mathcal{L}(W; X, Y) = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$ (cross-entropy).
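A small numerical check, with made-up values, that minimizing these negative log-likelihoods matches the familiar losses: the Normal case reduces to the sum of squared errors up to constants that do not depend on the predictions, and the Bernoulli case is exactly the binary cross-entropy.

```python
import numpy as np

# Regression: Normal likelihood. Up to an additive constant and the 1/(2*sigma^2)
# factor (neither depends on y_hat), the negative log-likelihood is the SSE.
y = np.array([1.2, -0.3, 0.8])        # made-up targets
y_hat = np.array([1.0, 0.1, 0.5])     # made-up predictions
sigma = 1.0
nll_normal = (0.5 * len(y) * np.log(2 * np.pi * sigma**2)
              + np.sum((y - y_hat) ** 2) / (2 * sigma**2))
sse = np.sum((y - y_hat) ** 2)
print(nll_normal, sse)

# Classification: Bernoulli likelihood. The negative log-likelihood is exactly
# the binary cross-entropy.
y_bin = np.array([1, 0, 1])           # made-up labels
p = np.array([0.9, 0.2, 0.6])         # made-up predicted probabilities
nll_bernoulli = -np.sum(y_bin * np.log(p) + (1 - y_bin) * np.log(1 - p))
print(nll_bernoulli)
```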

38. Design Choices: • Activation function • Loss function • Output units • Architecture • Optimizer

39. Output Units (table columns: Output Type, Output Distribution, Output layer, Loss Function). Output Type: Binary.

40. Output Units (table columns: Output Type, Output Distribution, Output layer, Loss Function). Output Type: Binary; Output Distribution: Bernoulli.

41. Output Units (table columns: Output Type, Output Distribution, Output layer, Loss Function). Output Type: Binary; Output Distribution: Bernoulli; Loss Function: Binary Cross Entropy.

42. Output Units (table columns: Output Type, Output Distribution, Output layer, Loss Function). Output Type: Binary; Output Distribution: Bernoulli; Output layer: ?; Loss Function: Binary Cross Entropy.
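In code, the binary row of this table might look as follows in Keras. This is a hedged sketch: the slide leaves the output layer as a question mark, and the sigmoid unit used here is simply the output mentioned earlier in the lecture; the layer sizes are placeholders.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Binary output type, Bernoulli output distribution -> binary cross-entropy loss.
# The sigmoid output unit below is an assumption consistent with the earlier slides.
model = keras.Sequential([
    keras.Input(shape=(10,)),                  # 10 input features: placeholder
    layers.Dense(16, activation="relu"),       # hidden layer: placeholder size
    layers.Dense(1, activation="sigmoid"),     # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```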
