Deep Feedforward Networks Lecture slides for Chapter 6 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last updated 2016-10-04
Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)
XOR is not linearly separable
Figure 6.1, left: the original x space (axes x1, x2); no single line separates the inputs where XOR is 1 from those where it is 0. (Goodfellow 2017)
Rectified Linear Activation
g(z) = max{0, z}
Figure 6.3: the rectified linear activation, zero for z < 0 and linear for z > 0. (Goodfellow 2017)
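A minimal NumPy sketch of the activation above, applied elementwise; the array form is an illustrative convention, not code from the book:

```python
import numpy as np

def relu(z):
    """Rectified linear activation g(z) = max{0, z}, applied elementwise."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```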
Network Diagrams
Figure 6.2: two equivalent drawings of the XOR network, with inputs x1, x2, hidden units h1, h2, output y, and weights W (input to hidden) and w (hidden to output). (Goodfellow 2017)
Solving XOR
f(x; W, c, w, b) = w^T max{0, W^T x + c} + b   (6.3)
W = [[1, 1], [1, 1]]   (6.4)
c = [0, -1]^T   (6.5)
w = [1, -2]^T   (6.6)
(Goodfellow 2017)
Solving XOR
Figure 6.1: the original x space (left) and the learned h space (right); in the learned space the two inputs with output 1 collapse onto a single point, and the classes become linearly separable. (Goodfellow 2017)
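A short NumPy check that the parameters in equations (6.4)-(6.6), with b = 0 as in the book's worked example, make equation (6.3) compute XOR on all four inputs; this is an illustrative sketch, not the book's code:

```python
import numpy as np

# Parameters from equations (6.4)-(6.6); b = 0 as in the book's worked example.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def f(x):
    """f(x; W, c, w, b) = w^T max{0, W^T x + c} + b   (equation 6.3)."""
    h = np.maximum(0.0, W.T @ x + c)   # hidden layer: the learned h space
    return w @ h + b

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", f(np.array(x, dtype=float)))   # prints 0, 1, 1, 0: XOR
```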
Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)
Gradient-Based Learning • Specify • Model • Cost • Design model and cost so cost is smooth • Minimize cost using gradient descent or related techniques (Goodfellow 2017)
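A minimal sketch of the "minimize cost using gradient descent" step on a toy smooth cost; the cost function, learning rate, and step count are illustrative assumptions, not values from the book:

```python
def cost(theta):
    # Toy smooth cost: J(theta) = (theta - 3)^2, minimized at theta = 3
    return (theta - 3.0) ** 2

def grad(theta):
    # Its gradient dJ/dtheta = 2 (theta - 3)
    return 2.0 * (theta - 3.0)

theta = 0.0
learning_rate = 0.1          # illustrative choice
for step in range(100):
    theta -= learning_rate * grad(theta)   # gradient descent update

print(theta, cost(theta))    # theta approaches 3, cost approaches 0
```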
Conditional Distributions and Cross-Entropy
J(θ) = −E_{x,y ∼ p̂_data} log p_model(y | x)   (6.12)
(Goodfellow 2017)
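A sketch of estimating the expectation in (6.12) as an average negative log-likelihood over a batch, assuming a Bernoulli output parameterized by a sigmoid; the logits and labels are made-up illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(z, y):
    """Negative log-likelihood -log p_model(y | x) for a Bernoulli output
    with p(y = 1 | x) = sigmoid(z)."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([2.0, -1.0, 0.5])   # illustrative logits produced by the model
y = np.array([1.0, 0.0, 1.0])    # illustrative labels
J = nll(z, y).mean()             # empirical estimate of the expectation in (6.12)
print(J)
```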
Output Types
Output Type | Output Distribution | Output Layer    | Cost Function
Binary      | Bernoulli           | Sigmoid         | Binary cross-entropy
Discrete    | Multinoulli         | Softmax         | Discrete cross-entropy
Continuous  | Gaussian            | Linear          | Gaussian cross-entropy (MSE)
Continuous  | Mixture of Gaussian | Mixture Density | Cross-entropy
Continuous  | Arbitrary           | Various         | See Part III: GAN, VAE, FVBN
(Goodfellow 2017)
Mixture Density Outputs
Figure 6.4: samples (y plotted against x) from a network with a mixture density output layer; the conditional distribution p(y | x) is multimodal, which a single Gaussian output could not capture. (Goodfellow 2017)
Don’t mix and match
Sigmoid output with target of 1: the figure plots σ(z), the cross-entropy loss, and the MSE loss as functions of z. The MSE loss saturates when z is very negative (the model is confidently wrong), so its gradient vanishes exactly where learning is most needed, while the cross-entropy loss keeps a strong gradient there. Pair a sigmoid output with cross-entropy, not MSE.
(Goodfellow 2017)
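A small NumPy check of the saturation argument above; the specific logits are illustrative, and the gradients are hand-derived for a sigmoid output with target y = 1 rather than taken from the book:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])   # confidently wrong, uncertain, confidently right
p = sigmoid(z)                   # predicted probability; target is y = 1

grad_xent = p - 1.0                        # d/dz of -log sigmoid(z)
grad_mse = 2.0 * (p - 1.0) * p * (1 - p)   # d/dz of (sigmoid(z) - 1)^2

print(grad_xent)  # stays close to -1 when the model is confidently wrong
print(grad_mse)   # nearly 0 at z = -5: the MSE gradient vanishes, learning stalls
```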
Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)
Hidden units • Use ReLUs, 90% of the time • For RNNs, see Chapter 10 • For some research projects, get creative • Many hidden units perform comparably to ReLUs. New hidden units that perform comparably are rarely interesting. (Goodfellow 2017)
Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)
Architecture Basics
A feedforward network with inputs x1, x2, hidden units h1, h2, and output y; depth counts the number of layers, width the number of units per layer. (Goodfellow 2017)
Universal Approximator Theorem • One hidden layer is enough to represent (not learn ) an approximation of any function to an arbitrary degree of accuracy • So why deeper? • Shallow net may need (exponentially) more width • Shallow net may overfit more (Goodfellow 2017)
Exponential Representation Advantage of Depth Figure 6.5 (Goodfellow 2017)
Better Generalization with Greater Depth
Figure 6.6: test accuracy (percent, roughly 92 to 96.5) versus number of layers (3 to 11); accuracy increases steadily with depth. (Goodfellow 2017)
Large, Shallow Models Overfit More
Figure 6.7: test accuracy (percent, roughly 91 to 97) versus number of parameters (up to about 1 × 10^8) for a 3-layer convolutional network, a 3-layer fully connected network, and an 11-layer convolutional network. The deep convolutional model reaches the highest accuracy; adding parameters to the shallow models does not close the gap. (Goodfellow 2017)
Roadmap • Example: Learning XOR • Gradient-Based Learning • Hidden Units • Architecture Design • Back-Propagation (Goodfellow 2017)
Back-Propagation
• Back-propagation is “just the chain rule” of calculus:
dz/dx = (dz/dy)(dy/dx)   (6.44)
∇_x z = (∂y/∂x)^T ∇_y z   (6.46)
• But it’s a particular implementation of the chain rule
• Uses dynamic programming (table filling)
• Avoids recomputing repeated subexpressions
• Speed vs. memory tradeoff
(Goodfellow 2017)
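A sketch of the vector chain rule (6.46) on a toy pair of operations; the choice y = Ax (so the Jacobian ∂y/∂x is A) and z = 0.5‖y‖² is an illustrative assumption used so the result can be checked against the direct derivative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # illustrative Jacobian: y = A x, so dy/dx = A
x = rng.standard_normal(2)

y = A @ x
z = 0.5 * np.sum(y ** 2)          # scalar output z = 0.5 ||y||^2

grad_y = y                        # gradient of z with respect to y
grad_x = A.T @ grad_y             # (dy/dx)^T grad_y, i.e. equation (6.46)

print(np.allclose(grad_x, A.T @ A @ x))   # True: matches the direct derivative
```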
Simple Back-Prop Example
Figure: a network with inputs x1, x2, hidden units h1, h2, and output y. Forward prop computes the activations and the loss; back-prop then computes the derivatives in the reverse order. (Goodfellow 2017)
Computation Graphs
Figure 6.8: examples of computation graphs. (a) multiplication z = x × y; (b) logistic regression ŷ = σ(x^T w + b); (c) a ReLU layer H = max{0, XW + b}; (d) linear regression, where the weights w feed both the prediction (dot product with x) and the weight decay penalty λ Σ_i w_i². (Goodfellow 2017)
Repeated Subexpressions
∂z/∂w   (6.50)
  = (∂z/∂y)(∂y/∂x)(∂x/∂w)   (6.51)
  = f'(y) f'(x) f'(w)   (6.52)
  = f'(f(f(w))) f'(f(w)) f'(w)   (6.53)
Figure 6.9: the chain w → x → y → z with x = f(w), y = f(x), z = f(y). Back-prop stores f(w) from the forward pass and avoids computing it twice, unlike the expanded form (6.53). (Goodfellow 2017)
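A sketch of the chain in figure 6.9, caching the intermediate values x = f(w) and y = f(x) on the forward pass so the backward pass (6.52) never recomputes f(w); the choice of f as tanh is an illustrative assumption:

```python
import math

def f(a):
    return math.tanh(a)              # any smooth f works; tanh is illustrative

def f_prime(a):
    return 1.0 - math.tanh(a) ** 2   # derivative of tanh

w = 0.3

# Forward pass: store the intermediate values (the "table filling").
x = f(w)      # x = f(w)
y = f(x)      # y = f(f(w))
z = f(y)      # z = f(f(f(w)))

# Backward pass: equation (6.52) reuses the cached x and y
# instead of recomputing f(w) and f(f(w)) as in (6.53).
dz_dw = f_prime(y) * f_prime(x) * f_prime(w)
print(dz_dw)
```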
Symbol-to-Symbol Differentiation
Figure 6.10: back-prop extends the graph w → x → y → z (each arrow an application of f) with nodes for f'(w), f'(x), f'(y) and the products dz/dy, dz/dx, dz/dw, so the derivatives are themselves symbolic graph nodes that can be evaluated or differentiated further. (Goodfellow 2017)
Neural Network Loss Function
Figure 6.11: the computation graph for training a single-layer MLP with cross-entropy and weight decay. H = relu(X W⁽¹⁾), the predictions H W⁽²⁾ enter the cross-entropy with the targets y to give J_MLE, and the total cost J adds the weight decay term λ times the summed squares of W⁽¹⁾ and W⁽²⁾. (Goodfellow 2017)
Hessian-vector Products
Hv = ∇_x [ (∇_x f(x))^T v ]   (6.59)
(Goodfellow 2017)
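A sketch of identity (6.59) checked numerically. The outer gradient is approximated by a finite difference of the inner gradient along v (an autodiff framework would apply (6.59) exactly); the quadratic f(x) = 0.5 x^T A x with symmetric A is an illustrative choice whose Hessian is exactly A:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T                      # symmetric, so f below has Hessian exactly A

def grad_f(x):
    # Gradient of f(x) = 0.5 x^T A x
    return A @ x

x = rng.standard_normal(4)
v = rng.standard_normal(4)
eps = 1e-6

# Finite-difference stand-in for the outer gradient in (6.59):
# Hv is the directional derivative of grad f along v.
Hv_approx = (grad_f(x + eps * v) - grad_f(x)) / eps

print(np.allclose(Hv_approx, A @ v, atol=1e-4))   # True: matches H v = A v
```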
Questions (Goodfellow 2017)