ECE 417 Fall 2018 Lecture 17: Neural Networks


  1. ECE 417 Fall 2018 Lecture 17: Neural Networks. Mark Hasegawa-Johnson, University of Illinois. October 23, 2018

  2. Outline
     1. What is a Neural Net?
     2. Knowledge-Based Design
     3. Nonlinearities
     4. Error Metric
     5. Gradient Descent

  3. Two-Layer Feedforward Neural Network: $\vec{z} = h(\vec{x}, U, V)$, which is decomposed as
     $z_\ell = g(b_\ell)$, where $b_\ell = v_{\ell 0} + \sum_{k=1}^{q} v_{\ell k} y_k$; in vector form, $\vec{z} = g(\vec{b})$, $\vec{b} = V \vec{y}$
     $y_k = f(a_k)$, where $a_k = u_{k0} + \sum_{j=1}^{p} u_{kj} x_j$; in vector form, $\vec{y} = f(\vec{a})$, $\vec{a} = U \vec{x}$
     $\vec{x} = (x_1, \ldots, x_p)$ is the input vector; there are $q$ hidden nodes $y_k$ and $r$ output nodes $z_\ell$. A constant node 1 in each layer carries the bias terms $u_{k0}$ and $v_{\ell 0}$.
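
To make the notation concrete, here is a minimal NumPy sketch of this two-layer forward pass. The weight shapes, the choice of tanh for $f(\cdot)$ and the identity for $g(\cdot)$, and all numerical values are illustrative assumptions, not part of the slides.

```python
import numpy as np

def forward(x, U, V, f=np.tanh, g=lambda b: b):
    """Two-layer forward pass: z = g(V @ [1; f(U @ [1; x])]).

    U has shape (q, p+1) and V has shape (r, q+1); the first column of
    each matrix holds the bias terms u_k0 and v_l0.
    """
    x_aug = np.concatenate(([1.0], x))   # prepend the constant-1 input node
    a = U @ x_aug                        # a_k = u_k0 + sum_j u_kj x_j
    y = f(a)                             # y_k = f(a_k)
    y_aug = np.concatenate(([1.0], y))   # prepend the constant-1 hidden node
    b = V @ y_aug                        # b_l = v_l0 + sum_k v_lk y_k
    return g(b)                          # z_l = g(b_l)

# Example with p=2 inputs, q=2 hidden nodes, r=1 output (arbitrary weights)
U = np.array([[ 0.1, 1.0, -1.0],
              [-0.2, 0.5,  0.5]])
V = np.array([[ 0.0, 1.0, -1.0]])
print(forward(np.array([0.3, -0.7]), U, V))
```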

  4. A Neural Net is Made Of...
     Linear transformations: $\vec{a} = U\vec{x}$ and $\vec{b} = V\vec{y}$, one per layer.
     Scalar nonlinearities: $\vec{y} = f(\vec{a})$ means that, element-by-element, $y_k = f(a_k)$ for some nonlinear function $f(\cdot)$. The nonlinearities can all be different, if you want. For today, I'll assume that all nodes in the first layer use one function $f(\cdot)$, and all nodes in the second layer use some other function $g(\cdot)$.
     Networks with more than two layers are called "Deep Neural Networks" (DNNs). I won't talk about them today. Andrew Barron (1993) proved that combining two layers of linear transforms, with one scalar nonlinearity between them, is enough to model any multivariate nonlinear function $\vec{z} = h(\vec{x})$.

  5. Neural Network = Universal Approximator. Assume...
     Linear output nodes: $g(b) = b$
     Smoothly nonlinear hidden nodes: $f'(a) = \frac{df}{da}$ is finite
     Smooth target function: $\vec{z} = h(\vec{x}, U, V)$ approximates $\vec{\zeta} = h^*(\vec{x}) \in \mathcal{H}$, where $\mathcal{H}$ is some class of sufficiently smooth functions of $\vec{x}$ (functions whose Fourier transform has a first moment less than some finite number $C$)
     There are $q$ hidden nodes, $y_k$, $1 \le k \le q$
     The input vectors are distributed with some probability density function, $p(\vec{x})$, over which we can compute expected values.
     Then Barron (1993) showed that
     $$\max_{h^*(\vec{x}) \in \mathcal{H}} \min_{U, V} E\left[\left| h(\vec{x}, U, V) - h^*(\vec{x}) \right|^2\right] \le O\!\left(\frac{1}{q}\right)$$

  6. Neural Network Problems: Outline of the Remainder of this Talk
     1. Knowledge-Based Design. Given $U$, $V$, $f$, $g$, what kind of function is $h(\vec{x}, U, V)$? Can we draw $\vec{z}$ as a function of $\vec{x}$? Can we heuristically choose $U$ and $V$ so that $\vec{z}$ looks roughly like $\vec{\zeta}$?
     2. Nonlinearities. They come in pairs: the test-time nonlinearity and the training-time nonlinearity.
     3. Error Metric. In what way should $\vec{z} = h(\vec{x})$ be "similar to" $\vec{\zeta} = h^*(\vec{x})$?
     4. Training: Gradient Descent with Back-Propagation. Given an initial $U$, $V$, how do I find $\hat{U}$, $\hat{V}$ such that $h(\vec{x}, \hat{U}, \hat{V})$ more closely approximates $\vec{\zeta}$?

  7. Outline
     1. What is a Neural Net?
     2. Knowledge-Based Design
     3. Nonlinearities
     4. Error Metric
     5. Gradient Descent

  8. Synapse, First Layer: $a_k = u_{k0} + \sum_{j=1}^{2} u_{kj} x_j$

  9. Axon, First Layer: $y_k = \tanh(a_k)$

  10. Synapse, Second Layer: $b_\ell = v_{\ell 0} + \sum_{k=1}^{2} v_{\ell k} y_k$

  11. Axon, Second Layer: $z_\ell = \mathrm{sign}(b_\ell)$
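
As a concrete illustration of knowledge-based design with exactly this architecture (two inputs, tanh hidden nodes, a sign output), here is a hand-chosen weight assignment that computes XOR on 0/1 inputs. The specific weight values are my own hedged example, not taken from the slides.

```python
import numpy as np

# Hand-designed two-layer net: y = tanh(U @ [1; x]), z = sign(V @ [1; y]).
# Hidden node 1 approximates OR(x1, x2); hidden node 2 approximates AND(x1, x2);
# the output fires when OR is true but AND is not, i.e. XOR.
# The factor of 10 pushes tanh into saturation so each y_k is close to +/-1.
U = 10.0 * np.array([[-0.5, 1.0, 1.0],    # OR:  positive iff x1 + x2 > 0.5
                     [-1.5, 1.0, 1.0]])   # AND: positive iff x1 + x2 > 1.5
V = np.array([[-1.0, 1.0, -1.0]])         # fire iff (OR) and (not AND)

def predict(x):
    y = np.tanh(U @ np.concatenate(([1.0], x)))
    return np.sign(V @ np.concatenate(([1.0], y)))

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, predict(np.array(x, dtype=float)))   # -1, +1, +1, -1
```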

  12. Outline
     1. What is a Neural Net?
     2. Knowledge-Based Design
     3. Nonlinearities
     4. Error Metric
     5. Gradient Descent

  13. Differentiable and Non-Differentiable Nonlinearities. The nonlinearities come in pairs: (1) the test-time nonlinearity is the one that you use in the output layer of your learned classifier, e.g., in the app on your cell phone; (2) the training-time nonlinearity is used in the output layer during training, and in the hidden layers during both training and test.

     Application                 | Test-Time Output Nonlinearity | Training-Time Output & Hidden Nonlinearity
     {0, 1} classification       | step                          | logistic or ReLU
     {-1, +1} classification     | signum                        | tanh
     multinomial classification  | argmax                        | softmax
     regression                  | linear                        | linear (hidden nodes must be nonlinear)
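
Here is a sketch of the scalar nonlinearity pairs from this table, written as NumPy functions. The pairing of each test-time function with its smooth training-time counterpart follows the table; the function names themselves are just illustrative.

```python
import numpy as np

# Test-time nonlinearities (non-differentiable or piecewise-linear)
step   = lambda b: (b > 0).astype(float)        # {0,1} classification output
signum = np.sign                                # {-1,+1} classification output
linear = lambda b: b                            # regression output

# Training-time counterparts (smooth, usable with gradient descent)
logistic = lambda a: 1.0 / (1.0 + np.exp(-a))   # smooth version of step
tanh     = np.tanh                              # smooth version of signum
relu     = lambda a: np.maximum(0.0, a)         # common hidden-layer choice

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(a), logistic(a))
print(signum(a), tanh(a), relu(a))
```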

  14. [Plots: step vs. logistic nonlinearities; signum vs. tanh nonlinearities]

  15. "Linear Nonlinearity" and ReLU; Argmax and Softmax
     Argmax: $z_\ell = \begin{cases} 1 & b_\ell = \max_m b_m \\ 0 & \text{otherwise} \end{cases}$
     Softmax: $z_\ell = \dfrac{e^{b_\ell}}{\sum_m e^{b_m}}$
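
A minimal sketch of the argmax and softmax vector nonlinearities defined above. The max-subtraction inside softmax is a standard numerical-stability trick I have added; it is not stated on the slide.

```python
import numpy as np

def argmax_onehot(b):
    """z_l = 1 where b_l is the maximum, 0 elsewhere (test-time)."""
    z = np.zeros_like(b)
    z[np.argmax(b)] = 1.0
    return z

def softmax(b):
    """z_l = exp(b_l) / sum_m exp(b_m) (training-time)."""
    e = np.exp(b - np.max(b))   # subtracting max(b) avoids overflow, same result
    return e / e.sum()

b = np.array([1.0, 3.0, 0.5])
print(argmax_onehot(b))              # [0. 1. 0.]
print(softmax(b), softmax(b).sum())  # probabilities summing to 1
```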

  16. Outline
     1. What is a Neural Net?
     2. Knowledge-Based Design
     3. Nonlinearities
     4. Error Metric
     5. Gradient Descent

  17. Error Metric: MMSE for Linear Output Nodes. Minimum Mean Squared Error (MMSE):
     $$U^*, V^* = \arg\min_{U,V} E, \qquad E = \frac{1}{2n} \sum_{i=1}^{n} \left| \vec{\zeta}_i - \vec{z}(\vec{x}_i) \right|^2$$
     Why would we want to use this metric? If the training samples $(\vec{x}_i, \vec{\zeta}_i)$ are i.i.d., then in the limit as the number of training tokens goes to infinity, $h(\vec{x}) \rightarrow E[\vec{\zeta} \mid \vec{x}]$.
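
A short sketch of this error metric computed over a batch of training pairs; the array shapes and toy numbers are assumptions for illustration.

```python
import numpy as np

def mmse_error(Z, Zeta):
    """E = (1/2n) * sum_i ||zeta_i - z_i||^2, rows of Z are outputs, rows of Zeta are targets."""
    n = Z.shape[0]
    return np.sum((Zeta - Z) ** 2) / (2 * n)

# Toy batch: n=4 tokens, r=2 outputs
Zeta = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Z    = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.3], [0.2, 0.6]])
print(mmse_error(Z, Zeta))
```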

  18. Error Metric: MMSE for a Binary Target Vector. Suppose
     $$\zeta_\ell = \begin{cases} 1 & \text{with probability } P_\ell(\vec{x}) \\ 0 & \text{with probability } 1 - P_\ell(\vec{x}) \end{cases}$$
     and suppose $0 \le z_\ell \le 1$, e.g., logistic output nodes. Why does MMSE make sense for binary targets? Because $E[\zeta_\ell \mid \vec{x}] = 1 \cdot P_\ell(\vec{x}) + 0 \cdot (1 - P_\ell(\vec{x})) = P_\ell(\vec{x})$, so the MMSE neural network solution approaches $z_\ell(\vec{x}) \rightarrow E[\zeta_\ell \mid \vec{x}] = P_\ell(\vec{x})$.

  19. Softmax versus Logistic Output Nodes: Encoding the Neural Net Output using a "One-Hot Vector". Suppose $\vec{\zeta}_i$ is a "one-hot" vector, i.e., only one element is "hot" ($\zeta_{\ell(i),i} = 1$) and all others are "cold" ($\zeta_{m,i} = 0$ for $m \ne \ell(i)$). Training logistic output nodes with MMSE training will approach the solution $z_\ell = \Pr\{\zeta_\ell = 1 \mid \vec{x}\}$, but there is no guarantee that it is a correctly normalized pmf ($\sum_\ell z_\ell = 1$) until it has fully converged. Softmax output nodes guarantee that $\sum_\ell z_\ell = 1$:
     $$z_\ell = \frac{e^{b_\ell}}{\sum_m e^{b_m}}$$

  20. Cross-Entropy. The softmax nonlinearity is "matched" to an error criterion called "cross-entropy," in the sense that its derivative can be simplified to a very, very simple form.
     $\zeta_{\ell,i}$ is the true reference probability that observation $\vec{x}_i$ is of class $\ell$. In most cases, this "reference probability" is either 0 or 1 (one-hot).
     $z_{\ell,i}$ is the neural network's hypothesis about the probability that $\vec{x}_i$ is of class $\ell$. The softmax function constrains this to satisfy $0 \le z_{\ell,i} \le 1$ and $\sum_\ell z_{\ell,i} = 1$.
     The average cross-entropy between these two distributions is
     $$E = -\frac{1}{n} \sum_{i=1}^{n} \sum_\ell \zeta_{\ell,i} \log z_{\ell,i}$$
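
A minimal sketch of this average cross-entropy for one-hot reference probabilities; the toy numbers and the small epsilon guarding log(0) are illustrative assumptions.

```python
import numpy as np

def average_cross_entropy(Z, Zeta, eps=1e-12):
    """E = -(1/n) * sum_i sum_l zeta_{l,i} * log z_{l,i}.

    Z and Zeta are (n, r): each row of Z is a softmax output,
    each row of Zeta a reference distribution (typically one-hot).
    """
    n = Z.shape[0]
    return -np.sum(Zeta * np.log(Z + eps)) / n

Zeta = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 0.0]])
Z    = np.array([[0.2, 0.7, 0.1],
                 [0.5, 0.3, 0.2]])
print(average_cross_entropy(Z, Zeta))   # = -(log 0.7 + log 0.5) / 2
```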

  21. Cross-Entropy = Log Probability. Suppose token $\vec{x}_i$ is of class $\ell^*$, meaning that $\zeta_{\ell^*,i} = 1$ and all other elements are zero. Then the cross-entropy is just the neural net's estimate of the negative log probability of the correct class:
     $$E = -\frac{1}{n} \sum_{i=1}^{n} \log z_{\ell^*,i}$$
     In other words, $E$ is the average of the negative log probability of each training token:
     $$E = \frac{1}{n} \sum_{i=1}^{n} E_i, \qquad E_i = -\log z_{\ell^*,i}$$

  22. Cross-Entropy is Matched to Softmax. Now let's plug in the softmax:
     $$E_i = -\log z_{\ell^*,i}, \qquad z_{\ell^*,i} = \frac{e^{b_{\ell^*,i}}}{\sum_k e^{b_{k,i}}}$$
     Its gradient with respect to the softmax inputs, $b_{m,i}$, is
     $$\frac{\partial E_i}{\partial b_{m,i}} = -\frac{1}{z_{\ell^*,i}} \frac{\partial z_{\ell^*,i}}{\partial b_{m,i}}
     = \begin{cases}
       -\dfrac{1}{z_{\ell^*,i}} \left( \dfrac{e^{b_{\ell^*,i}}}{\sum_k e^{b_{k,i}}} - \dfrac{\left(e^{b_{\ell^*,i}}\right)^2}{\left(\sum_k e^{b_{k,i}}\right)^2} \right) & m = \ell^* \\[2ex]
       -\dfrac{1}{z_{\ell^*,i}} \left( -\dfrac{e^{b_{\ell^*,i}}\, e^{b_{m,i}}}{\left(\sum_k e^{b_{k,i}}\right)^2} \right) & m \ne \ell^*
     \end{cases}
     = z_{m,i} - \zeta_{m,i}$$
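
As a sanity check on the result above, here is a small sketch comparing the closed-form gradient $z_m - \zeta_m$ against a numerical finite-difference gradient of $E_i = -\log z_{\ell^*}$; the test values are arbitrary.

```python
import numpy as np

def softmax(b):
    e = np.exp(b - np.max(b))
    return e / e.sum()

def cross_entropy_loss(b, l_star):
    return -np.log(softmax(b)[l_star])

b, l_star = np.array([0.3, -1.2, 2.0]), 1
zeta = np.zeros_like(b); zeta[l_star] = 1.0

analytic = softmax(b) - zeta             # dE_i/db_m = z_m - zeta_m

# Central finite differences over each component of b
eps, numeric = 1e-6, np.zeros_like(b)
for m in range(len(b)):
    bp, bm = b.copy(), b.copy()
    bp[m] += eps; bm[m] -= eps
    numeric[m] = (cross_entropy_loss(bp, l_star) - cross_entropy_loss(bm, l_star)) / (2 * eps)

print(analytic)
print(numeric)   # should match the analytic gradient to ~1e-6
```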

  23. Error Metrics Summarized
     Use MMSE to achieve $\vec{z} = E[\vec{\zeta} \mid \vec{x}]$. That's almost always what you want.
     If $\vec{\zeta}$ is a one-hot vector, then use cross-entropy (with a softmax nonlinearity on the output nodes) to guarantee that $\vec{z}$ is a properly normalized probability mass function, and because it gives you the amazingly easy formula $\frac{\partial E_i}{\partial b_{m,i}} = z_{m,i} - \zeta_{m,i}$.
     If $\zeta_\ell$ is binary, but not necessarily one-hot, then use MMSE (with a logistic nonlinearity) to achieve $z_\ell = \Pr\{\zeta_\ell = 1 \mid \vec{x}\}$.

  24. Outline
     1. What is a Neural Net?
     2. Knowledge-Based Design
     3. Nonlinearities
     4. Error Metric
     5. Gradient Descent

  25. Gradient Descent = Local Optimization
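
Tying the pieces together, here is a hedged sketch of gradient descent with back-propagation for the two-layer network, using softmax outputs and cross-entropy so that $\partial E_i / \partial b_\ell = z_\ell - \zeta_\ell$ as derived on slide 22. The toy data, learning rate, and initialization are all illustrative assumptions, not the lecture's own example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(b):
    e = np.exp(b - np.max(b))
    return e / e.sum()

# Toy data: n tokens, p=2 inputs, r=3 classes (one-hot targets). Purely illustrative.
n, p, q, r = 200, 2, 8, 3
X = rng.normal(size=(n, p))
labels = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 0] - X[:, 1] > 1).astype(int)
Zeta = np.eye(r)[labels]

U = 0.1 * rng.normal(size=(q, p + 1))   # first column holds the biases u_k0
V = 0.1 * rng.normal(size=(r, q + 1))   # first column holds the biases v_l0
eta = 0.1                               # learning rate

for epoch in range(50):
    for i in rng.permutation(n):
        x_aug = np.concatenate(([1.0], X[i]))
        a = U @ x_aug                    # first-layer synapse
        y = np.tanh(a)                   # first-layer axon
        y_aug = np.concatenate(([1.0], y))
        b = V @ y_aug                    # second-layer synapse
        z = softmax(b)                   # training-time output nonlinearity

        db = z - Zeta[i]                          # dE_i/db_l = z_l - zeta_l
        dV = np.outer(db, y_aug)                  # dE_i/dV
        da = (V[:, 1:].T @ db) * (1 - y ** 2)     # back-propagate through tanh
        dU = np.outer(da, x_aug)                  # dE_i/dU

        V -= eta * dV                    # gradient-descent updates
        U -= eta * dU

# Training accuracy of the learned classifier (argmax output at test time)
y_all = np.tanh(np.hstack([np.ones((n, 1)), X]) @ U.T)
Z_all = np.hstack([np.ones((n, 1)), y_all]) @ V.T
print("training accuracy:", np.mean(np.argmax(Z_all, axis=1) == labels))
```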
