Neural Networks and Backpropagation

10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Neural Networks and Backpropagation
Matt Gormley
Lecture 20
April 3, 2017
Neural Net Readings: Murphy --, Bishop 5, HTF 11, Mitchell 4

Reminders
• Homework 6: Unsupervised Learning
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring

Neural Networks Outline
Last Lecture:
• Logistic Regression (Recap)
  – Data, Model, Learning, Prediction
• Neural Networks
  – A Recipe for Machine Learning
  – Visual Notation for Neural Networks
  – Example: Logistic Regression Output Surface
  – 2-Layer Neural Network
  – 3-Layer Neural Network
This Lecture:
• Neural Net Architectures
  – Objective Functions
  – Activation Functions
• Backpropagation
  – Basic Chain Rule (of calculus)
  – Chain Rule for Arbitrary Computation Graph
  – Backpropagation Algorithm
  – Module-based Automatic Differentiation (Autodiff)

DECISION BOUNDARY EXAMPLES

Example #1: Diagonal Band

Example #2: One Pocket

Example #3: Four Gaussians

Example #4: Two Pockets

Example #1: Diagonal Band
Error in slides: "layers" should read "number of hidden units".
All the neural networks in this section used 1 hidden layer.

ARCHITECTURES

Neural Network Architectures
Even for a basic Neural Network, there are many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function

Activation Functions
Neural Network with sigmoid activation functions (diagram layers: Input, Hidden Layer, Output)
(A) Input: Given $x_i, \forall i$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \forall j$
(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}, \forall j$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$
(F) Loss: $J = \frac{1}{2}(y - y^*)^2$

Activation Functions
Neural Network with arbitrary nonlinear activation functions (diagram layers: Input, Hidden Layer, Output)
(A) Input: Given $x_i, \forall i$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \forall j$
(C) Hidden (nonlinear): $z_j = \sigma(a_j), \forall j$
(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$
(E) Output (nonlinear): $y = \sigma(b)$
(F) Loss: $J = \frac{1}{2}(y - y^*)^2$
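
The forward computation (A) through (F) can be sketched in a few lines of plain Python. This is only an illustration, not code from the lecture: the function names `logistic` and `forward`, the argument names, and the convention that `x[0] = 1` and a constant `z_0 = 1` act as bias terms (suggested by the sums starting at index 0) are all assumptions.

```python
import math

def logistic(u):
    """Sigmoid / logistic function 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, alpha, beta, y_star, sigma=logistic):
    """Forward pass (A)-(F) for a network with one hidden layer.

    x      : inputs, with x[0] == 1 as a bias term (assumption)
    alpha  : one weight row per hidden unit, alpha[j][i]
    beta   : output weights; beta[0] multiplies a constant z_0 = 1 (assumption)
    y_star : target value
    sigma  : activation used in steps (C) and (E)
    """
    a = [sum(w * xi for w, xi in zip(row, x)) for row in alpha]  # (B) hidden, linear
    z = [1.0] + [sigma(aj) for aj in a]                          # (C) hidden, nonlinear
    b = sum(bj * zj for bj, zj in zip(beta, z))                  # (D) output, linear
    y = sigma(b)                                                 # (E) output, nonlinear
    J = 0.5 * (y - y_star) ** 2                                  # (F) quadratic loss
    return y, J

# Toy call with made-up weights: 2 inputs plus bias, 2 hidden units.
y, J = forward(x=[1.0, 0.5, -1.0],
               alpha=[[0.0, 1.0, 1.0], [0.1, -0.2, 0.3]],
               beta=[0.0, 1.0, -1.0],
               y_star=1.0)
```

Passing `sigma=math.tanh` instead of the default swaps the nonlinearity in steps (C) and (E) without touching the linear steps.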

Activation Functions
So far, we've assumed that the activation function (nonlinearity) is always the sigmoid function…
Sigmoid / Logistic Function: $\mathrm{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$

Activation Functions
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs
Alternate 1: tanh
Like logistic function but shifted to range [-1, +1]
Slide from William Cohen
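
One way to make "like logistic but shifted" concrete is the identity tanh(u) = 2 · logistic(2u) − 1, which rescales the logistic curve from (0, 1) to (−1, 1) and centers it at zero. The check below is a small sketch; the helper name `tanh_via_logistic` is hypothetical.

```python
import math

def logistic(u):
    """Sigmoid / logistic function 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def tanh_via_logistic(u):
    """tanh written as a rescaled, recentered logistic."""
    return 2.0 * logistic(2.0 * u) - 1.0

# Compare against the standard library on a few sample points.
max_gap = max(abs(tanh_via_logistic(u) - math.tanh(u))
              for u in (-3.0, -0.5, 0.0, 0.5, 3.0))
```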

Figure: sigmoid vs. tanh, depth 4 (AISTATS 2010)
Figure from Glorot & Bengio (2010)

Activation Functions
• A new change: modifying the nonlinearity
  – ReLU often used in vision tasks
Alternate 2: rectified linear unit
Linear with a cutoff at zero
(Implementation: clip the gradient when you pass zero)
Slide from William Cohen

Activation Functions
• A new change: modifying the nonlinearity
  – ReLU often used in vision tasks
Alternate 2: rectified linear unit
Soft version: log(exp(x)+1)
• Doesn't saturate (at one end)
• Sparsifies outputs
• Helps with vanishing gradient
Slide from William Cohen
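
The rectified linear unit, its clipped gradient, and the soft version log(exp(x)+1) can be written out as below. This is a sketch under stated assumptions: the function names are hypothetical, and setting the ReLU gradient to 0 exactly at zero is a common convention rather than something the slide specifies.

```python
import math

def relu(u):
    """Linear with a cutoff at zero."""
    return max(0.0, u)

def relu_grad(u):
    """Gradient is 1 for u > 0 and clipped to 0 otherwise (0 at u == 0 is an assumed convention)."""
    return 1.0 if u > 0 else 0.0

def softplus(u):
    """Soft version log(exp(u) + 1), rearranged so exp never overflows."""
    return max(0.0, u) + math.log1p(math.exp(-abs(u)))

def softplus_grad(u):
    """The derivative of softplus is the logistic function (fine for moderate u)."""
    return 1.0 / (1.0 + math.exp(-u))

# For large positive inputs both functions keep growing, so the gradient stays near 1.
samples = [(u, relu(u), softplus(u)) for u in (-5.0, 0.0, 5.0, 1000.0)]
```

The gradient of ReLU is exactly zero for negative inputs, which is the source of the sparse outputs, while softplus only approaches zero there.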

Objective Functions for NNs
• Regression:
  – Use the same objective as Linear Regression
  – Quadratic loss (i.e. mean squared error)
• Classification:
  – Use the same objective as Logistic Regression
  – Cross-entropy (i.e. negative log likelihood)
  – This requires probabilities, so we add an additional "softmax" layer at the end of our network
Forward and backward computations:
• Quadratic: forward $J = \frac{1}{2}(y - y^*)^2$, backward $\frac{dJ}{dy} = y - y^*$
• Cross Entropy: forward $J = y^* \log(y) + (1 - y^*) \log(1 - y)$, backward $\frac{dJ}{dy} = y^* \frac{1}{y} + (1 - y^*) \frac{1}{y - 1}$
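
The forward and backward expressions for the two objectives can be checked against each other numerically. The sketch below implements both exactly as written on the slide and compares each analytic derivative with a central finite difference; the function names and the step size `eps` are assumptions. Note that J for cross entropy, as written, is the log-likelihood of a Bernoulli target, so minimizing cross-entropy corresponds to minimizing its negation (both J and dJ/dy flip sign).

```python
import math

def quadratic(y, y_star):
    """Forward value and backward derivative of the quadratic loss."""
    J = 0.5 * (y - y_star) ** 2
    dJ_dy = y - y_star
    return J, dJ_dy

def cross_entropy_objective(y, y_star):
    """Forward value and derivative of J = y* log(y) + (1 - y*) log(1 - y), for 0 < y < 1."""
    J = y_star * math.log(y) + (1.0 - y_star) * math.log(1.0 - y)
    dJ_dy = y_star / y + (1.0 - y_star) / (y - 1.0)
    return J, dJ_dy

def finite_difference(f, y, y_star, eps=1e-6):
    """Central-difference estimate of dJ/dy, used only as a sanity check."""
    return (f(y + eps, y_star)[0] - f(y - eps, y_star)[0]) / (2.0 * eps)

# Gap between analytic and numeric derivatives at one sample point.
gaps = {f.__name__: abs(f(0.3, 1.0)[1] - finite_difference(f, 0.3, 1.0))
        for f in (quadratic, cross_entropy_objective)}
```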

Cross-entropy vs. Quadratic loss
Figure from Glorot & Bengio (2010)

Background: A Recipe for Machine Learning
1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)
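
Step 4 of the recipe can be sketched as a short SGD loop. The formulas for the decision function, loss, and goal are not reproduced in this transcript, so every concrete choice below is an assumption made for illustration: a linear decision function, squared loss, the learning rate, the number of epochs, and the toy data.

```python
import random

def sgd(data, grad_loss, theta, lr=0.1, epochs=50, seed=0):
    """Repeatedly take a small step opposite the gradient of one example's loss."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(epochs):
        examples = list(data)
        rng.shuffle(examples)                       # visit examples in random order
        for x, y_star in examples:
            g = grad_loss(theta, x, y_star)
            theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def grad_squared_loss(theta, x, y_star):
    """Gradient of 0.5 * (theta . x - y*)^2 with respect to theta (assumed model and loss)."""
    y = sum(t * xi for t, xi in zip(theta, x))
    return [(y - y_star) * xi for xi in x]

# Toy, noise-free data generated from y* = 1 + 2 * x1, with x[0] = 1 as a bias feature.
data = [([1.0, x1], 2.0 * x1 + 1.0) for x1 in (-1.0, -0.5, 0.0, 0.5, 1.0)]
theta = sgd(data, grad_squared_loss, theta=[0.0, 0.0])
```

Swapping in a different decision function or loss only requires a different `grad_loss`; the update rule itself stays the same.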

Objective Functions
Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer.
1) Minimizing sum of squared errors…
2) Minimizing sum of squared errors plus squared Euclidean norm of weights…
3) Minimizing cross-entropy…
4) Minimizing hinge loss…
…gives…
5) …MLE estimates of weights assuming target follows a Bernoulli with parameter given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value
A. 1=5, 2=7, 3=6, 4=8
B. 1=5, 2=7, 3=8, 4=6
C. 1=7, 2=5, 3=5, 4=7
D. 1=7, 2=5, 3=6, 4=8
E. 1=8, 2=6, 3=5, 4=7
F. 1=8, 2=6, 3=8, 4=6

BACKPROPAGATION

Background: A Recipe for Machine Learning
1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

Training: Approaches to Differentiation
• Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?
• Question 2: When can we make the gradient computation efficient?
