
Lecture 7: Neural Nets. Mark Hasegawa-Johnson, ECE 417: Multimedia Signal Processing, Fall 2020.



  1. Lecture 7: Neural Nets. Mark Hasegawa-Johnson. ECE 417: Multimedia Signal Processing, Fall 2020.

  2. Outline

     1. Intro
     2. Example #1: Neural Net as Universal Approximator
     3. Example #2: Semicircle → Parabola
     4. Learning: Gradient Descent and Back-Propagation
     5. Backprop Example: Semicircle → Parabola
     6. Summary

  3. Outline (Part 1: Intro)

  4. What is a Neural Network?

     Computation in biological neural networks is performed by billions of simple cells (neurons), each of which performs one very simple computation. Biological neural networks learn by strengthening the connections between some pairs of neurons, and weakening other connections.

  5. What is an Artificial Neural Network?

     Computation in an artificial neural network is performed by thousands of simple cells (nodes), each of which performs one very simple computation. Artificial neural networks learn by strengthening the connections between some pairs of nodes, and weakening other connections.

  6. Two-Layer Feedforward Neural Network

     \hat{y} = h(\vec{x}, W^{(1)}, \vec{b}^{(1)}, W^{(2)}, \vec{b}^{(2)})

     Output layer (linear): \hat{y}_k = e^{(2)}_k, where e^{(2)}_k = b^{(2)}_k + \sum_{j=1}^{N} w^{(2)}_{kj} h_j
     Hidden layer: h_k = \sigma(e^{(1)}_k), where e^{(1)}_k = b^{(1)}_k + \sum_{j=1}^{D} w^{(1)}_{kj} x_j
     \vec{x} is the input vector.
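The forward pass on this slide can be sketched in NumPy. This is a minimal illustration; the layer sizes and random weights are arbitrary choices, not values from the lecture.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Two-layer feedforward net: sigmoid hidden layer, linear output layer."""
    e1 = W1 @ x + b1                  # first-layer excitation e^(1)_k
    h = 1.0 / (1.0 + np.exp(-e1))     # hidden activation h_k = sigma(e^(1)_k)
    e2 = W2 @ h + b2                  # second-layer excitation e^(2)_k
    return e2                         # linear output nodes: yhat_k = e^(2)_k

# Tiny example: D = 2 inputs, N = 3 hidden nodes, K = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
yhat = forward(np.array([1.0, -1.0]), W1, b1, W2, b2)
print(yhat.shape)  # (2,)
```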

  7. Neural Network = Universal Approximator

     Assume:
     - Linear output nodes: \hat{y}_k = e^{(2)}_k
     - Smoothly nonlinear hidden nodes: d\sigma/de finite
     - Smooth target function: \hat{y} = h(\vec{x}, W, b) approximates y = h^*(\vec{x}) \in \mathbb{H}, where \mathbb{H} is some class of sufficiently smooth functions of \vec{x} (functions whose Fourier transform has a first moment less than some finite number C)
     - There are N hidden nodes, h_k, 1 \le k \le N
     - The input vectors are distributed with some probability density function, p(\vec{x}), over which we can compute expected values.

     Then (Barron, 1993) showed that

     \max_{h^* \in \mathbb{H}} \min_{W,b} E\left[ |h(\vec{x}, W, b) - h^*(\vec{x})|^2 \right] \le O\left(\frac{1}{N}\right)

  8. Outline (Part 2: Example #1: Neural Net as Universal Approximator)

  9. Target: Can we get the neural net to compute this function?

     Suppose our goal is to find some weights and biases, W^{(1)}, \vec{b}^{(1)}, W^{(2)}, and \vec{b}^{(2)}, so that \hat{y}(\vec{x}) is the nonlinear function shown here: [figure: target function]

 10. Excitation, First Layer: e^{(1)}_k = b^{(1)}_k + \sum_{j=1}^{2} w^{(1)}_{kj} x_j

     The first layer of the neural net just computes a linear function of \vec{x}. Here's an example:

 11. Activation, First Layer: h_k = \tanh(e^{(1)}_k)

     The activation nonlinearity then "squashes" the linear function:

 12. Second Layer: \hat{y}_k = b^{(2)}_k + \sum_{j=1}^{2} w^{(2)}_{kj} h_j

     The second layer then computes a linear combination of the first-layer activations, which is sufficient to match our desired function:
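As a small illustration of this idea (with assumed weights, not values from the lecture), a linear combination of just two shifted tanh units already produces a localized "bump" function:

```python
import numpy as np

x = np.linspace(-3, 3, 7)
# Two hidden units with opposite shifts; their scaled difference forms a bump.
h1 = np.tanh(4 * (x + 1))           # turns on for x > -1
h2 = np.tanh(4 * (x - 1))           # turns on for x > +1
yhat = 0.5 * (h1 - h2)              # second layer: linear combination of activations
print(np.round(yhat, 3))            # near 1 inside (-1, 1), near 0 outside
```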

 13. Outline (Part 3: Example #2: Semicircle → Parabola)

 14. Example #2: Semicircle → Parabola

     Can we design a neural net that converts a semicircle (x_0^2 + x_1^2 = 1) to a parabola (y_1 = y_0^2)?
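Training data for this task can be generated as follows. This is a sketch; the mapping (x_0, x_1) → (x_0, x_0^2) is an assumed parameterization consistent with the constraints x_0^2 + x_1^2 = 1 and y_1 = y_0^2.

```python
import numpy as np

# Points on the upper unit semicircle: x0^2 + x1^2 = 1, x1 >= 0
t = np.linspace(0, np.pi, 50)
X = np.stack([np.cos(t), np.sin(t)], axis=1)    # inputs (x0, x1)

# Assumed target: each point maps to (y0, y1) = (x0, x0^2), which lies on y1 = y0^2
Y = np.stack([X[:, 0], X[:, 0] ** 2], axis=1)

assert np.allclose(X[:, 0] ** 2 + X[:, 1] ** 2, 1.0)  # inputs on the semicircle
assert np.allclose(Y[:, 1], Y[:, 0] ** 2)             # outputs on the parabola
```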

 15. Two-Layer Feedforward Neural Network

     \hat{y} = h(\vec{x}, W^{(1)}, \vec{b}^{(1)}, W^{(2)}, \vec{b}^{(2)})

     Output layer (linear): \hat{y}_k = e^{(2)}_k, where e^{(2)}_k = b^{(2)}_k + \sum_{j=1}^{N} w^{(2)}_{kj} h_j
     Hidden layer: h_k = \sigma(e^{(1)}_k), where e^{(1)}_k = b^{(1)}_k + \sum_{j=1}^{D} w^{(1)}_{kj} x_j
     \vec{x} is the input vector.

 16. Example #2: Semicircle → Parabola

     Let's define some vector notation:

     Second layer: define \vec{w}^{(2)}_j = [w^{(2)}_{0j}, w^{(2)}_{1j}]^T, the j-th column of the W^{(2)} matrix, so that

        \hat{y} = \vec{b} + \sum_j \vec{w}^{(2)}_j h_j   means   \hat{y}_k = b_k + \sum_j w^{(2)}_{kj} h_j  \forall k.

     First layer activation function: h_k = \sigma(e^{(1)}_k)

     First layer excitation: define \bar{w}^{(1)}_k = [w^{(1)}_{k0}, w^{(1)}_{k1}], the k-th row of the W^{(1)} matrix, so that

        e^{(1)}_k = \bar{w}^{(1)}_k \vec{x}   means   e^{(1)}_k = \sum_j w^{(1)}_{kj} x_j  \forall k.
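The row/column conventions above can be checked numerically (a minimal sketch with random matrices of the shapes used in this example):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))    # first-layer weights, rows wbar_k^(1)
W2 = rng.normal(size=(2, 3))    # second-layer weights, columns w_j^(2)
x = rng.normal(size=2)
b2 = rng.normal(size=2)

# Row view: e^(1)_k = wbar_k^(1) @ x, the k-th row of W1 times x
e1 = np.array([W1[k] @ x for k in range(3)])
assert np.allclose(e1, W1 @ x)

# Column view: yhat = b2 + sum_j w_j^(2) h_j, with w_j^(2) the j-th column of W2
h = np.tanh(e1)
yhat = b2 + sum(W2[:, j] * h[j] for j in range(3))
assert np.allclose(yhat, b2 + W2 @ h)
```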

 17. Second Layer = Piece-Wise Approximation

     The second layer of the network approximates \hat{y} using a bias term \vec{b}, plus correction vectors \vec{w}^{(2)}_j, each scaled by its activation h_j:

        \hat{y} = \vec{b}^{(2)} + \sum_j \vec{w}^{(2)}_j h_j

     The activation, h_j, is a number between 0 and 1. For example, we could use the logistic sigmoid function:

        h_k = \sigma(e^{(1)}_k) = \frac{1}{1 + \exp(-e^{(1)}_k)} \in (0, 1)

     The logistic sigmoid is a differentiable approximation to a unit step function.
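The claim that the logistic sigmoid approximates a unit step can be checked numerically. This is a minimal sketch; the gain values are arbitrary.

```python
import numpy as np

def sigmoid(e):
    """Logistic sigmoid: differentiable approximation to a unit step."""
    return 1.0 / (1.0 + np.exp(-e))

# As the excitation is scaled up, sigma(gain * e) approaches a unit step at e = 0.
for gain in (1, 10, 100):
    print(gain, sigmoid(gain * -0.5), sigmoid(gain * 0.5))
```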

 18. [Figures: step and logistic nonlinearities; signum and tanh nonlinearities]

 19. First Layer = A Series of Decisions

     The first layer of the network decides whether or not to "turn on" each of the h_j's. It does this by comparing \vec{x} to a series of linear threshold vectors:

        h_k = \sigma(\bar{w}^{(1)}_k \vec{x}) \approx \begin{cases} 1 & \bar{w}^{(1)}_k \vec{x} > 0 \\ 0 & \bar{w}^{(1)}_k \vec{x} < 0 \end{cases}
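This soft-thresholding behavior can be illustrated with an assumed weight vector (the values below are chosen only for illustration):

```python
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

w = np.array([10.0, -10.0])      # a linear threshold vector (one row of W^(1))
X = np.array([[1.0, 0.0],        # w @ x = +10 -> h near 1 ("turn on")
              [0.0, 1.0],        # w @ x = -10 -> h near 0 ("turn off")
              [0.6, 0.8]])       # w @ x = -2  -> h in between (soft decision)
h = sigmoid(X @ w)
print(np.round(h, 3))
```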

 20. Example #2: Semicircle → Parabola [figure]

 21. Outline (Part 4: Learning: Gradient Descent and Back-Propagation)

 22. How to train a neural network

     1. Find a training dataset that contains n examples showing the desired output, \vec{y}_i, that the NN should compute in response to input vector \vec{x}_i:

        D = \{(\vec{x}_1, \vec{y}_1), \ldots, (\vec{x}_n, \vec{y}_n)\}

     2. Randomly initialize the weights and biases, W^{(1)}, \vec{b}^{(1)}, W^{(2)}, and \vec{b}^{(2)}.
     3. Perform forward propagation: find out what the neural net computes as \hat{y}_i for each \vec{x}_i.
     4. Define a loss function that measures how badly \hat{y} differs from \vec{y}.
     5. Perform back-propagation to improve W^{(1)}, \vec{b}^{(1)}, W^{(2)}, and \vec{b}^{(2)}.
     6. Repeat steps 3-5 until convergence.
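The six steps above can be sketched end-to-end for the semicircle → parabola task. This is a minimal NumPy illustration, not the lecture's implementation: the hidden-layer size, learning rate, iteration count, and the target mapping (x_0, x_1) → (x_0, x_0^2) are all assumptions, and the gradient formulas follow the standard chain rule.

```python
import numpy as np

rng = np.random.default_rng(417)

# Step 1: training data -- semicircle inputs, parabola targets (assumed mapping)
t = np.linspace(0, np.pi, 100)
X = np.stack([np.cos(t), np.sin(t)], axis=1)     # n x 2 inputs
Y = np.stack([X[:, 0], X[:, 0] ** 2], axis=1)    # n x 2 targets

# Step 2: random initialization (N = 8 hidden nodes)
N = 8
W1, b1 = rng.normal(size=(N, 2)), np.zeros(N)
W2, b2 = rng.normal(size=(2, N)), np.zeros(2)

eta = 0.05
for step in range(5000):
    # Step 3: forward propagation (tanh hidden layer, linear output)
    H = np.tanh(X @ W1.T + b1)
    Yhat = H @ W2.T + b2
    # Step 4: loss L = (1/2n) sum_i ||y_i - yhat_i||^2; err carries dL/dYhat * n
    err = Yhat - Y
    # Step 5: back-propagation (chain rule; 1/n applied once per parameter)
    dW2 = err.T @ H / len(X)
    db2 = err.mean(axis=0)
    dE1 = (err @ W2) * (1 - H ** 2)              # tanh'(e) = 1 - tanh(e)^2
    dW1 = dE1.T @ X / len(X)
    db1 = dE1.mean(axis=0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= eta * g                             # gradient-descent update
    # Step 6: repeat until convergence

# Final loss after training
H = np.tanh(X @ W1.T + b1)
Yhat = H @ W2.T + b2
loss = 0.5 * np.mean(np.sum((Yhat - Y) ** 2, axis=1))
print(round(loss, 4))
```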

 23. Loss Function: How should h(\vec{x}) be "similar to" h^*(\vec{x})?

     Minimum Mean Squared Error (MMSE):

        W^*, b^* = \arg\min L = \arg\min \frac{1}{2n} \sum_{i=1}^{n} \| \vec{y}_i - \hat{y}(\vec{x}_i) \|^2

     MMSE Solution: \hat{y} \to E[\vec{y} | \vec{x}]. If the training samples (\vec{x}_i, \vec{y}_i) are i.i.d., then

        \lim_{n\to\infty} L = \frac{1}{2} E\left[ \| \vec{y} - \hat{y} \|^2 \right],

     which is minimized by

        \hat{y}_{MMSE}(\vec{x}) = E[\vec{y} | \vec{x}]
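The claim that squared error is minimized by the conditional mean can be checked numerically. This is a small illustration with assumed Gaussian data; for a constant predictor, E[y | x] reduces to the unconditional mean E[y].

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=100_000)   # assumed data with E[y] = 3

# Mean squared error of a constant prediction c, swept over a grid:
cs = np.linspace(0, 6, 601)
losses = [np.mean((y - c) ** 2) for c in cs]
best = cs[int(np.argmin(losses))]
print(round(best, 1))  # close to the mean E[y] = 3.0
```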

 24. Gradient Descent: How do we improve W and b?

     Given some initial neural net parameter (called u_{kj} in this figure), we want to find a better value of the same parameter. We do that using gradient descent:

        u_{kj} \leftarrow u_{kj} - \eta \frac{dL}{du_{kj}},

     where \eta is a learning rate (some small constant, e.g., \eta = 0.02 or so).
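The update rule can be illustrated on a one-dimensional quadratic loss (an assumed toy function, not from the lecture):

```python
# Gradient descent on L(u) = (u - 2)^2, with dL/du = 2(u - 2).
u, eta = 0.0, 0.02
for _ in range(500):
    u -= eta * 2 * (u - 2)   # u <- u - eta * dL/du
print(round(u, 3))           # 2.0 (the minimizer of L)
```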

 25. Gradient Descent = Local Optimization

     Given some initial W, b, find new values of W, b with lower error:

        w^{(1)}_{kj} \leftarrow w^{(1)}_{kj} - \eta \frac{dL}{dw^{(1)}_{kj}}, \qquad w^{(2)}_{kj} \leftarrow w^{(2)}_{kj} - \eta \frac{dL}{dw^{(2)}_{kj}}

     \eta = learning rate. If \eta is too large, gradient descent won't converge; if it is too small, convergence is slow. Quasi-Newton methods such as L-BFGS, and adaptive step-size methods such as Adam, choose the step size automatically at each iteration, so they typically converge much faster.
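The effect of the learning rate can be illustrated on the toy loss L(u) = u^2 (assumed for illustration, gradient 2u): a small η converges slowly, a moderate η converges quickly, and a too-large η diverges.

```python
def run(eta, steps=50):
    """Gradient descent on L(u) = u^2 starting from u = 1; each step multiplies u by (1 - 2*eta)."""
    u = 1.0
    for _ in range(steps):
        u -= eta * 2 * u
    return u

print(abs(run(0.01)))   # small eta: slow but steady progress toward 0
print(abs(run(0.45)))   # moderate eta: very fast convergence to 0
print(abs(run(1.10)))   # eta too large: |u| grows each step -- divergence
```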
