CS489/698 Lecture 9 (Feb 1, 2017): Multi-layer Neural Networks, Error Backpropagation


  1. CS489/698 Lecture 9: Feb 1, 2017 Multi-layer Neural Networks, Error Backpropagation [D] Chapt. 10, [HTF] Chapt. 11, [B] Sec. 5.2, 5.3, [M] Sec. 16.5, [RN] Sec. 18.7 CS489/698 (c) 2017 P. Poupart 1

  2. Quick Recap: Linear Models • Linear Regression • Linear Classification

  3. Quick Recap: Non-linear Models • Non-linear classification • Non-linear regression

  4. Non-linear Models • Convenient modeling assumption: linearity • Extension: non-linearity can be obtained by mapping to a non-linear feature space • Limit: the basis functions are chosen a priori and are fixed • Question: can we work with unrestricted non-linear models?

  5. Flexible Non-Linear Models • Idea 1: Select basis functions that correspond to the training data and retain only a subset of them (e.g., Support Vector Machines) • Idea 2: Learn non-linear basis functions (e.g., Multi-layer Neural Networks)

  6. Two-Layer Architecture • Feed-forward neural network • Hidden units: z_j = h(Σ_i w_ji x_i) • Output units: y_k = σ(Σ_j w_kj z_j) • Overall: y_k(x, w) = σ(Σ_j w_kj h(Σ_i w_ji x_i))
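
A minimal NumPy sketch of this forward computation (my own illustration, not code from the lecture): one hidden layer, bias terms omitted, with h = tanh and σ = identity as example activation choices.

    import numpy as np

    def two_layer_forward(x, W1, W2, h=np.tanh, sigma=lambda a: a):
        # x  : input vector, shape (d,)
        # W1 : hidden-layer weights w_ji, shape (m, d)
        # W2 : output-layer weights w_kj, shape (k, m)
        a_hidden = W1 @ x      # a_j = sum_i w_ji x_i
        z = h(a_hidden)        # hidden unit outputs z_j = h(a_j)
        a_out = W2 @ z         # a_k = sum_j w_kj z_j
        return sigma(a_out)    # network outputs y_k = sigma(a_k)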

  7. Common activation functions • Threshold: h(a) = sign(a) • Sigmoid: σ(a) = 1/(1 + e^(-a)) • Gaussian: h(a) = exp(-a²/2) • Tanh: tanh(a) = (e^a - e^(-a))/(e^a + e^(-a)) • Identity: h(a) = a
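
The same activations written out in NumPy. The exact threshold and Gaussian forms on the slide did not survive extraction, so the versions below are standard textbook choices and should be read as assumptions:

    import numpy as np

    def threshold(a):
        return np.where(a >= 0, 1.0, -1.0)   # sign-style threshold

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))      # 1 / (1 + e^-a)

    def gaussian(a):
        return np.exp(-0.5 * a ** 2)         # e^(-a^2/2), one common form

    def tanh(a):
        return np.tanh(a)                    # (e^a - e^-a) / (e^a + e^-a)

    def identity(a):
        return a                             # used for regression outputs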

  8. Adaptive non-linear basis functions • Non-linear regression – h: non-linear function (e.g., tanh) and σ: identity • Non-linear classification – h: non-linear function (e.g., tanh) and σ: sigmoid

  9. Weight training • Parameters: • Objectives: – Error minimization • Backpropagation (aka “backprop”) – Maximum likelihood – Maximum a posteriori – Bayesian learning

  10. Least squared error • Error function: E(w) = ½ Σ_n (y(x_n, w) - t_n)² • When y(x, w) = Σ_j w_kj z_j(x), with non-linear basis functions z_j(x) = h(Σ_i w_ji x_i), then we are optimizing a linear combination of non-linear basis functions whose parameters w_ji are themselves learned
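
In LaTeX, the error function and the decomposition the slide alludes to (a sketch in the notation of the earlier slides; single output unit and no bias terms, for brevity):

    E(\mathbf{w}) = \tfrac{1}{2}\sum_{n}\bigl(y(\mathbf{x}_n,\mathbf{w}) - t_n\bigr)^2,
    \qquad
    y(\mathbf{x},\mathbf{w}) = \sum_{j} w_{j}\, z_j(\mathbf{x}),
    \qquad
    z_j(\mathbf{x}) = h\Bigl(\sum_{i} w_{ji}\, x_i\Bigr)
    % For fixed hidden-layer weights w_{ji}, y is linear in the output weights w_{j};
    % backprop additionally adapts the basis functions z_j by learning the w_{ji}.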

  11. Sequential Gradient Descent • For each example (x_n, t_n), adjust the weights as follows: w ← w − η ∇E_n(w) • How can we compute the gradient ∇E_n(w) efficiently given an arbitrary network structure? • Answer: the backpropagation algorithm
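
A sketch of this per-example update loop. The routine backprop_gradients is a stand-in for the gradient computation derived on the following slides, and eta (the learning rate) and the epoch count are illustrative values:

    # Sequential (stochastic) gradient descent: after each training example,
    # take a step against that example's gradient.
    def sequential_gradient_descent(X, T, W1, W2, backprop_gradients,
                                    eta=0.01, epochs=100):
        for _ in range(epochs):
            for x, t in zip(X, T):                   # one example at a time
                dW1, dW2 = backprop_gradients(x, t, W1, W2)
                W1 -= eta * dW1                      # w <- w - eta * dE_n/dw
                W2 -= eta * dW2
        return W1, W2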

  12. Backpropagation Algorithm • Two phases: – Forward phase: compute output of each unit – Backward phase: compute delta at each unit

  13. Forward phase • Propagate inputs forward to compute the output of each unit • Output at unit j: z_j = h(a_j), where a_j = Σ_i w_ji z_i and the sum runs over the units i feeding into unit j

  14. Backward phase • Use chain rule to recursively compute the gradient – For each weight w_ji: ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji) – Let δ_j ≡ ∂E_n/∂a_j, then ∂E_n/∂w_ji = δ_j (∂a_j/∂w_ji) – Since a_j = Σ_i w_ji z_i, then ∂a_j/∂w_ji = z_i, so ∂E_n/∂w_ji = δ_j z_i
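
For a hidden unit j that feeds into units k, the chain rule also gives the recursion that lets the deltas be computed backwards layer by layer (standard form, cf. [B] Sec. 5.3; the slide's own rendering of it was lost in extraction):

    \delta_j \equiv \frac{\partial E_n}{\partial a_j}
            = \sum_{k} \frac{\partial E_n}{\partial a_k}\,\frac{\partial a_k}{\partial a_j}
            = h'(a_j) \sum_{k} w_{kj}\,\delta_k ,
    \qquad
    \frac{\partial E_n}{\partial w_{ji}} = \delta_j\, z_i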

  15. Simple Example • Consider a network with two layers: – Hidden nodes: z_j = tanh(a_j), where a_j = Σ_i w_ji x_i • Tip: d tanh(a)/da = 1 − tanh²(a) – Output node: y_k = Σ_j w_kj z_j (identity output) • Objective: squared error E_n = ½ Σ_k (y_k − t_k)²

  16. Simple Example • Forward propagation: – Hidden units: a_j = Σ_i w_ji x_i, z_j = tanh(a_j) – Output units: y_k = Σ_j w_kj z_j • Backward propagation: – Output units: δ_k = y_k − t_k – Hidden units: δ_j = (1 − z_j²) Σ_k w_kj δ_k • Gradients: – Hidden layer: ∂E_n/∂w_ji = δ_j x_i – Output layer: ∂E_n/∂w_kj = δ_k z_j
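
A runnable NumPy sketch of exactly these equations (one training example, tanh hidden units, identity outputs, squared error; bias terms omitted, and all variable names are my own):

    import numpy as np

    def backprop_example(x, t, W1, W2):
        # x  : input vector, shape (d,)
        # t  : target vector, shape (k,)
        # W1 : hidden-layer weights w_ji, shape (m, d)
        # W2 : output-layer weights w_kj, shape (k, m)

        # Forward propagation
        a = W1 @ x                  # a_j = sum_i w_ji x_i
        z = np.tanh(a)              # z_j = tanh(a_j)
        y = W2 @ z                  # y_k = sum_j w_kj z_j (identity output)

        # Backward propagation
        delta_out = y - t                              # delta_k = y_k - t_k
        delta_hid = (1 - z ** 2) * (W2.T @ delta_out)  # delta_j = (1 - z_j^2) sum_k w_kj delta_k

        # Gradients
        dW2 = np.outer(delta_out, z)   # dE_n/dw_kj = delta_k z_j
        dW1 = np.outer(delta_hid, x)   # dE_n/dw_ji = delta_j x_i

        E = 0.5 * np.sum((y - t) ** 2)
        return E, dW1, dW2

    # Example usage with random weights (illustrative shapes: 2 inputs, 3 hidden, 1 output)
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 2))
    W2 = rng.normal(size=(1, 3))
    E, dW1, dW2 = backprop_example(np.array([0.5, -1.0]), np.array([0.3]), W1, W2)

The returned gradients can be sanity-checked against finite differences of E with respect to each weight.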

  17. Non-linear regression examples • Two-layer network: – 3 tanh hidden units and 1 identity output unit (example plots not reproduced in this transcript)

  18. Analysis • Efficiency: – Fast gradient computation: linear in the number of weights • Convergence: – Slow convergence (linear rate) – May get trapped in local optima • Prone to overfitting – Solutions: early stopping, regularization (add penalty term to objective)
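
One standard choice of penalty term consistent with “add penalty term to objective” is weight decay; the specific regularizer is not stated on the slide, so this is an assumption:

    \tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert^2
    \qquad\Longrightarrow\qquad
    \nabla \tilde{E}(\mathbf{w}) = \nabla E(\mathbf{w}) + \lambda\,\mathbf{w}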
