
Neural Networks, Chapter 11 in ESL II, STK-IN4300 Statistical Learning Methods in Data Science - PowerPoint PPT Presentation



  1. Neural Networks, Chapter 11 in ESL II. STK-IN4300 Statistical Learning Methods in Data Science. Odd Kolbjørnsen, oddkol@math.uio.no

  2. Learning today: Neural nets
     • Projection pursuit
       – What is it?
       – How to solve it: stagewise fitting
     • Neural nets
       – What is it?
       – Graphical display
       – Connection to projection pursuit
       – How to solve it: backpropagation
       – Stochastic gradient descent
       – Deep and wide
       – CNN
     • Example

  3. Neural network
     • Used for prediction
     • Universal approximation: with enough data and the correct algorithm you will get it right eventually…
     • Used for both «regression type» and «classification type» problems
     • Many versions and forms; currently deep learning is a hot topic
     • Often portrayed as fully automatic, but tailoring might help
     • Performs highly advanced analysis
     • Can create utterly complex models which are hard to decipher and hard to use for knowledge transfer
     • The network provides good predictions, but is it for the right reasons? Constructed example from: Ribeiro et al. (2016), “Why Should I Trust You?” Explaining the Predictions of Any Classifier

  4. In neural nets, training is based on minimization of a loss function over the training set. Neural nets are defined by a specific form of the model $f(X)$.

     General form: $L(Y, \hat{f}(X)) = \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$

     The target might be multi-dimensional: $y_i = (y_{i1}, \dots, y_{iK})^T$.

     • Continuous response («regression type»): squared error (common)
       $L(Y, \hat{f}(X)) = \sum_{i=1}^{N} \sum_{k=1}^{K} \left( y_{ik} - \hat{f}_k(x_i) \right)^2$

     • Discrete response (K classes), with $y_{ik} = 1$ if $y_i = k$ and $y_{ik} = 0$ otherwise, and $\hat{f}_k(x_i) \approx \mathrm{Prob}(y_{ik} = 1)$:
       – Squared error: $L(Y, \hat{f}(X)) = \sum_{i=1}^{N} \sum_{k=1}^{K} \left( y_{ik} - \hat{f}_k(x_i) \right)^2$
       – Cross-entropy (deviance): $L(Y, \hat{f}(X)) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{f}_k(x_i)$
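     As a concrete illustration (not from the slides), a minimal numpy sketch of the two loss functions above, assuming the fitted values are collected in an N × K array F_hat, with Y one-hot encoded in the classification case:

```python
import numpy as np

def squared_error(Y, F_hat):
    """Sum of squared errors over observations and outputs.
    Y and F_hat are N x K arrays (Y one-hot for classification)."""
    return np.sum((Y - F_hat) ** 2)

def cross_entropy(Y, F_hat, eps=1e-12):
    """Cross-entropy (deviance) loss; F_hat holds class probabilities."""
    return -np.sum(Y * np.log(F_hat + eps))

# Hypothetical usage with N = 3 observations and K = 2 classes
Y = np.array([[1, 0], [0, 1], [1, 0]])
F_hat = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
print(squared_error(Y, F_hat), cross_entropy(Y, F_hat))
```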

  5. Projection pursuit regression (Friedman and Tukey 1974; Friedman and Stuetzle 1981)

     $f(X) = \sum_{m=1}^{M} g_m(\omega_m^T X)$

     • $g_m$: unknown functions ($\mathbb{R} \to \mathbb{R}$)
     • $\omega_m$: unknown feature weights, unit vectors (size $p \times 1$)
     • $X$: explanatory variables (size $p \times 1$)
     • Derived feature (number m): $V_m = \omega_m^T X$ (size $1 \times 1$)
     • Ridge functions are constant along directions orthogonal to the directional unit vector $\omega_m$
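     The ridge-function property can be seen in a small sketch (an illustration, not course code): evaluating a hypothetical two-term PPR model and shifting the input along a direction orthogonal to both $\omega_m$ leaves the prediction unchanged; np.sin and np.square are stand-ins for fitted $g_m$:

```python
import numpy as np

def ppr_predict(x, omegas, g_funcs):
    """Evaluate f(x) = sum_m g_m(omega_m^T x) for one input vector x."""
    return sum(g(omega @ x) for omega, g in zip(omegas, g_funcs))

# Hypothetical two-term model in p = 3 dimensions
omegas = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
g_funcs = [np.sin, np.square]

x = np.array([0.5, -1.0, 2.0])
shift = np.array([0.0, 0.0, 5.0])   # orthogonal to both omega_1 and omega_2
print(ppr_predict(x, omegas, g_funcs))          # same value ...
print(ppr_predict(x + shift, omegas, g_funcs))  # ... ridge functions are constant orthogonally
```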

  6. Fitting projection pursuit: M = 1
     • The M = 1 model, known as the single index model in econometrics:
       $f(X) = g(\omega^T X) = g(V)$, with $V = \omega^T X$
     • Iterate between the following two steps:
       – If $\omega$ is known, fitting $\hat{g}(v)$ is just a 1D smoothing problem:
         smoothing spline, local linear (or polynomial) regression, kernel smoothing, k-nearest neighbours, …
       – If $g(\cdot)$ is known, $\hat{\omega}$ is obtained by a quasi-Newton search:
         $g(\omega^T x_i) \approx g(\omega_{\text{old}}^T x_i) + g'(\omega_{\text{old}}^T x_i)\,(\omega - \omega_{\text{old}})^T x_i$
         Minimize the objective function with the approximation inserted:
         $\sum_{i=1}^{N} \left( y_i - g(\omega^T x_i) \right)^2 \approx \sum_{i=1}^{N} g'(\omega_{\text{old}}^T x_i)^2 \left[ \left( \omega_{\text{old}}^T x_i + \frac{y_i - g(\omega_{\text{old}}^T x_i)}{g'(\omega_{\text{old}}^T x_i)} \right) - \omega^T x_i \right]^2$
         Solve for $\omega$ by weighted least squares regression with weights $g'(\omega_{\text{old}}^T x_i)^2$.
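     A sketch of one such quasi-Newton step, assuming vectorized callables g and g_prime for the current smooth and its derivative; this hypothetical helper solves the weighted least-squares problem above and rescales $\omega$ to a unit vector:

```python
import numpy as np

def update_omega(X, y, omega_old, g, g_prime, eps=1e-8):
    """One quasi-Newton (Gauss-Newton) update of the direction omega,
    given the current smooth g and its derivative g_prime.
    X: N x p matrix, y: length-N response. A sketch, not course code."""
    v = X @ omega_old                     # current derived feature omega_old^T x_i
    gp = g_prime(v)
    w = gp ** 2                           # weights g'(omega_old^T x_i)^2
    target = v + (y - g(v)) / (gp + eps)  # adjusted response from the linearization
    W = np.diag(w)
    # weighted least squares of target on X (no intercept)
    omega = np.linalg.solve(X.T @ W @ X, X.T @ W @ target)
    return omega / np.linalg.norm(omega)  # rescale to a unit vector
```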

  7. Fitting projection pursuit, M > 1: stage-wise (greedy) fitting of $f(X) = \sum_{m=1}^{M} g_m(\omega_m^T X)$
     – Set $y_{i,1} = y_i$
     – For $m = 1, \dots, M$:
       • Assume there is just one function to match (as on the previous page)
       • Minimize the loss with respect to $y_{i,m}$ to obtain $g_m(\cdot)$ and $\omega_m$:
         $[\hat{g}_m(\cdot), \hat{\omega}_m] = \operatorname{argmin}_{g_m(\cdot),\, \omega_m} \sum_{i=1}^{N} \left( y_{i,m} - g_m(\omega_m^T x_i) \right)^2$
       • Store $\hat{g}_m(\cdot)$ and $\hat{\omega}_m$
       • Subtract the estimate from the data: $y_{i,m+1} = y_{i,m} - \hat{g}_m(\hat{\omega}_m^T x_i)$
     – Final prediction: $\hat{f}(X) = \sum_{m=1}^{M} \hat{g}_m(\hat{\omega}_m^T X)$
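     The stage-wise loop, as a sketch: fit_single_term is a hypothetical routine that solves the M = 1 problem from the previous slide (alternating smoothing and the quasi-Newton direction search) and returns the pair $(\hat{g}_m, \hat{\omega}_m)$:

```python
import numpy as np

def fit_ppr_stagewise(X, y, M, fit_single_term):
    """Greedy stage-wise fit of an M-term projection pursuit model.
    fit_single_term(X, r) is assumed to return (g_hat, omega_hat);
    this is a sketch of the procedure, not the course implementation."""
    terms = []
    r = y.copy()                           # y_{i,1} = y_i
    for m in range(M):
        g_hat, omega_hat = fit_single_term(X, r)
        terms.append((g_hat, omega_hat))   # store g_m and omega_m
        r = r - g_hat(X @ omega_hat)       # subtract the fitted term from the data

    def f_hat(X_new):
        return sum(g(X_new @ omega) for g, omega in terms)

    return f_hat
```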

  8. Implementation details
     1. A smoothing method with efficient evaluation of $g(v)$ and $g'(v)$ is needed, e.g. local regression or smoothing splines.
     2. The $g_m(v)$ from previous steps can be readjusted using a backfitting procedure (Chapter 9), but it is unclear whether this improves the performance:
        1. Set $r_i = y_i - \hat{f}(x_i) + \hat{g}_m(\hat{\omega}_m^T x_i)$
        2. Re-estimate $g_m(\cdot)$ from $r_i$ (and center the result)
        3. Do this repeatedly for $m = 1, \dots, M, 1, \dots, M, \dots$
     3. It is not common to readjust $\hat{\omega}_m$, as this is computationally demanding.
     4. Stopping criterion for the number of terms to include:
        1. Stop when the model does not improve appreciably
        2. Use cross-validation to determine M
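     A sketch of the backfitting readjustment in point 2, with the directions $\hat{\omega}_m$ held fixed; smooth_1d(v, r) is a hypothetical 1D smoother that returns a re-estimated, centred ridge function:

```python
import numpy as np

def backfit_ridge_functions(X, y, terms, smooth_1d, n_sweeps=3):
    """Readjust each g_m while holding the directions omega_m fixed.
    terms: list of (g_m, omega_m) pairs from the stage-wise fit.
    Hypothetical helper names; a sketch of the slide's procedure."""
    for _ in range(n_sweeps):                          # repeat m = 1..M, 1..M, ...
        for m, (g_m, omega_m) in enumerate(terms):
            f_hat = sum(g(X @ w) for g, w in terms)    # current full fit
            r = y - f_hat + g_m(X @ omega_m)           # partial residual for term m
            g_new = smooth_1d(X @ omega_m, r)          # re-estimate and center
            terms[m] = (g_new, omega_m)
    return terms
```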

  9. Example (figure): projection pursuit fit on 1000 training observations, using two terms.

  10. Neural network
      • A simplified model of a nerve system
      Perceptron: input, weights, net input, activation function
      • Inputs $x_0 = 1, x_1, \dots, x_p$ with weights $\beta_0, \beta_1, \dots, \beta_p$
      • Net input: $v = \sum_{j=0}^{p} \beta_j x_j$
      • Output: $\sigma(v) = \sigma\!\left( \sum_{j=0}^{p} \beta_j x_j \right)$
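      A minimal sketch of the perceptron's forward pass, with the constant input $x_0 = 1$ carrying the bias weight $\beta_0$ (the example numbers are arbitrary):

```python
import numpy as np

def perceptron_output(x, beta, activation):
    """Forward pass of a single perceptron: prepend the constant input
    x_0 = 1, form the net input v = sum_j beta_j x_j, apply the activation."""
    x_aug = np.concatenate(([1.0], x))    # x_0 = 1 carries the bias beta_0
    v = beta @ x_aug                      # net input
    return activation(v)

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
print(perceptron_output(np.array([0.2, -1.5]), np.array([0.1, 0.8, -0.3]), sigmoid))
```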

  11. Activation functions $\sigma(v)$
      • Initially: the binary step function was used
      • Next: sigmoid = logistic = soft step (smooth)
      • Now: there is a «rag bag» of alternatives, some more suited than others for specific tasks
        – ArcTan (smooth)
        – Rectified linear, ReLU (and variants) (continuous)
        – Gaussian (NB: not monotone, gives different behaviour)
      Illustrations from: https://en.wikipedia.org/wiki/Activation_function
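      For reference, the listed activation functions written as plain numpy expressions (a sketch; the exact parametrizations on the Wikipedia page may differ):

```python
import numpy as np

# A few of the activation functions listed above
binary_step = lambda v: np.where(v >= 0, 1.0, 0.0)
sigmoid     = lambda v: 1.0 / (1.0 + np.exp(-v))    # logistic / soft step
arctan      = lambda v: np.arctan(v)
relu        = lambda v: np.maximum(0.0, v)          # rectified linear
gaussian    = lambda v: np.exp(-v ** 2)             # not monotone

v = np.linspace(-3, 3, 7)
for name, f in [("step", binary_step), ("sigmoid", sigmoid),
                ("arctan", arctan), ("relu", relu), ("gaussian", gaussian)]:
    print(name, np.round(f(v), 3))
```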

  12. Single hidden layer feed-forward neural net

      $f(X) = \sum_{m=1}^{M} \gamma_m \, \sigma(\beta_m^T X + \beta_{0m})$

      • $\sigma$: activation function ($\mathbb{R} \to \mathbb{R}$), e.g. the sigmoid
      • $\beta_m$: unknown feature weights (size $p \times 1$), not unit vectors
      • $\beta_{0m}$: «bias» or «shift»
      • $X$: explanatory variables (size $p \times 1$)
      • Derived feature (number m): $\beta_m^T X + \beta_{0m}$ (size $1 \times 1$)
      • Connection to projection pursuit: with $\omega_m = \beta_m / s_m$ (the «PP feature», a unit vector) and $s_m = \lVert \beta_m \rVert$,
        $\sigma(\beta_m^T X + \beta_{0m}) = \sigma(s_m \cdot \omega_m^T X + \beta_{0m}) = \sigma(s_m \cdot V_m + \beta_{0m})$
      (Figure: the sigmoid $\sigma(s \cdot v)$ for scales $s = 0.5$, $s = 1$ and $s = 10$.)
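      A sketch of the single hidden layer forward pass, with the hidden weights collected in an M × p matrix B (the example network below is made up for illustration):

```python
import numpy as np

def single_hidden_layer(x, beta0, B, gamma):
    """f(x) = sum_m gamma_m * sigmoid(beta0_m + beta_m^T x).
    B is an M x p weight matrix; beta0 and gamma are length-M vectors."""
    z = 1.0 / (1.0 + np.exp(-(beta0 + B @ x)))   # hidden activations
    return gamma @ z                              # weighted sum of hidden units

# Hypothetical network with p = 3 inputs and M = 2 hidden units
rng = np.random.default_rng(0)
B, beta0, gamma = rng.normal(size=(2, 3)), np.zeros(2), np.array([1.0, -0.5])
print(single_hidden_layer(np.array([0.1, 0.4, -0.2]), beta0, B, gamma))
```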

  13. Graphical display of a single hidden layer feed-forward neural network

      Output: $f_k(X) = \sum_{m=1}^{M} \gamma_{k,m} \, \sigma(\beta_m^T X + \beta_{0m})$

      • Output layer: weights $\gamma$ (or $W_2$)
      • Hidden layer: $\sigma(\cdot)$, weights $\beta$ (or $W_1$)
      • Input layer: $X$
      Feed forward means:
      • Connections in the graph are directional
      • The direction goes from input to output
      Note! With respect to the model definition we will, however, also traverse the graph in the opposite direction …

  14. The output layer is often «different»

      Hidden layer: $Z_m = \sigma(\beta_{0,m} + \beta_m^T X)$, $m = 1, \dots, M$
      Output layer: $T_k = \gamma_{0,k} + \gamma_k^T Z$, $k = 1, \dots, K$

      Some alternatives for $f_k(\cdot)$:
      • Transform, $\sigma(T_k)$: same as the «hidden» layers
      • Identity, $T_k$: common in a regression setting
      • Joint transform, $g_k(T)$: common for classification, e.g. softmax

      Identity:
      $f_k(X) = T_k = \gamma_{0,k} + \sum_{m=1}^{M} \gamma_{k,m} \, \sigma(\beta_{0,m} + \beta_m^T X)$

      Softmax:
      $f_k(X) = \frac{\exp(T_k)}{\sum_{l=1}^{K} \exp(T_l)} = \frac{\exp(\gamma_{0,k} + \gamma_k^T Z)}{\sum_{l=1}^{K} \exp(\gamma_{0,l} + \gamma_l^T Z)}$
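      A sketch of the output-layer alternatives, mapping the hidden activations Z to K outputs either by the identity or by the softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something from the slides:

```python
import numpy as np

def output_layer(Z, gamma0, Gamma, kind="softmax"):
    """Map hidden activations Z (length M) to K outputs.
    Gamma is K x M, gamma0 is length K; kind selects the output transform."""
    T = gamma0 + Gamma @ Z                  # T_k = gamma_{0,k} + gamma_k^T Z
    if kind == "identity":                  # regression setting
        return T
    if kind == "softmax":                   # classification: class probabilities
        e = np.exp(T - np.max(T))           # subtract max for numerical stability
        return e / e.sum()
    raise ValueError(kind)

# Hypothetical example: M = 2 hidden units, K = 3 classes
Z = np.array([0.3, -1.2])
gamma0, Gamma = np.zeros(3), np.arange(6).reshape(3, 2) * 0.1
print(output_layer(Z, gamma0, Gamma))       # probabilities, sum to 1
```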

  15. Comparison of projection pursuit (PP) and neural nets (NN)

      PP: $f(X) = \sum_{m=1}^{M_{PP}} g_m(\omega_m^T X)$   vs   NN: $f(X) = \sum_{m=1}^{M_{NN}} \gamma_m \, \sigma(\beta_m^T X + \beta_{0m})$

      Term by term: $g_m(\omega_m^T X)$ vs $\gamma_m \, \sigma(s_m \cdot \omega_m^T X + \beta_{0m})$, with $s_m = \lVert \beta_m \rVert$

      • The flexibility of $g_m$ is much larger than what is obtained with $s_m$ and $\beta_{0m}$, which are the additional parameters of neural nets
      • There are usually fewer terms in PP than in NN, i.e. $M_{PP} \ll M_{NN}$
      • Both methods are powerful for regression and classification
      • Effective in problems with a high signal-to-noise ratio
      • Suited for prediction without interpretation
      • Identifiability of the weights is an open question and creates problems for interpretation
      • The fitting procedures are different
