
Neural Networks: Representations, Fall 2017. Learning in the net.



  1. Neural Networks Representations Fall 2017

  2. Learning in the net • Problem: Given a collection of input-output pairs, learn the function

  3. Learning for classification (figure: two classes of points in the (x1, x2) plane) • When the net must learn to classify..

  4. Learning for classification • In reality the classes are in general not really cleanly separated • So what is the function we learn?

  5. In reality: Trivial linear example (figure: data in the (x1, x2) plane) • Two-dimensional example – Blue dots (on the floor) on the “red” side – Red dots (suspended at Y=1) on the “blue” side – No line will cleanly separate the two colors

  6. Non-linearly separable data: 1-D example (figure: labels y plotted against x) • One-dimensional example for visualization – All (red) dots at Y=1 represent instances of class Y=1 – All (blue) dots at Y=0 are from class Y=0 – The data are not linearly separable • In this 1-D example, a linear separator is a threshold • No threshold will cleanly separate red and blue dots

  7. Undesired Function (figure: y vs. x; same data and bullets as slide 6)

  8. What if? (figure: y vs. x; same data and bullets as slide 6)

  9. What if? (figure: 90 red instances and 10 blue instances near the same x) • What must the value of the function be at this X? – 1, because red dominates? – 0.9, the average?

  10. What if? (figure: 90 red instances and 10 blue instances near the same x) • What must the value of the function be at this X? Estimate: ≈ P(1|X) – 1, because red dominates? Potentially much more useful than a simple 1/0 decision – 0.9, the average? Also potentially more realistic

  11. What if? (figure annotation: should an infinitesimal nudge of the red dot change the function estimate entirely? If not, how do we estimate P(1|X), since the positions of the red and blue X values are different?) • What must the value of the function be at this X? Estimate: ≈ P(1|X) – 1, because red dominates? Potentially much more useful than a simple 1/0 decision – 0.9, the average? Also potentially more realistic

  12. The probability of y=1 (figure: y vs. x) • Consider this differently: at each point, look at a small window around that point • Plot the average value within the window – This is an approximation of the probability of Y=1 at that point
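
A minimal sketch of this windowed-average estimate, assuming 1-D inputs with 0/1 labels; the function name windowed_prob and the window half-width h are illustrative choices, not from the slides:

```python
import numpy as np

def windowed_prob(xs, ys, query_x, h=0.5):
    """Approximate P(y=1 | x=query_x) by averaging the 0/1 labels
    of all training points whose x lies within +/- h of query_x."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    mask = np.abs(xs - query_x) <= h       # points inside the window
    if not mask.any():                     # empty window: no estimate here
        return None
    return ys[mask].mean()                 # fraction of class-1 points

# Example: 90 points of class 1 and 10 of class 0 near x = 2
xs = np.concatenate([np.full(90, 2.0), np.full(10, 2.1)])
ys = np.concatenate([np.ones(90), np.zeros(10)])
print(windowed_prob(xs, ys, 2.05, h=0.2))  # 0.9, the window average
```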

  13.–24. The probability of y=1 (animation: the same windowed-average plot repeated as the window slides across the x axis; each frame plots the average value within the window, an approximation of the probability of 1 at that point)

  25. The logistic regression model P(y=1|x) = 1 / (1 + e^(−(w0 + w1·x))) (figure: sigmoid curve rising from the y=0 dots to the y=1 dots as x increases) • Class 1 becomes increasingly probable going left to right – Very typical in many problems

  26. The logistic perceptron P(y=1|x) = 1 / (1 + e^(−(w0 + w1·x))) (figure: a single sigmoid unit with input x, weights w0 and w1, and output y) • A sigmoid perceptron with a single input models the a posteriori probability of the class given the input
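
A small sketch of this single-input logistic perceptron; the weights w0 and w1 below are made-up illustrative values, not parameters from the slides:

```python
import math

def logistic_perceptron(x, w0=-2.0, w1=1.5):
    """Single-input sigmoid unit: P(y=1 | x) = 1 / (1 + exp(-(w0 + w1*x)))."""
    z = w0 + w1 * x                      # affine pre-activation
    return 1.0 / (1.0 + math.exp(-z))    # sigmoid squashes z into (0, 1)

# The output rises smoothly toward 1 as x moves left to right
for x in (-2, 0, 2, 4):
    print(x, round(logistic_perceptron(x), 3))
```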

  27. Non-linearly separable data (figure: data in the (x1, x2) plane) • Two-dimensional example – Blue dots (on the floor) on the “red” side – Red dots (suspended at Y=1) on the “blue” side – No line will cleanly separate the two colors

  28. Logistic regression Decision: y > 0.5? When X is a 2-D variable: P(Y=1|X) = 1 / (1 + exp(−(Σ_j w_j·x_j + w_0))) (figure: a sigmoid unit with inputs x1, x2, weights w1, w2, and bias w0) • This is the perceptron with a sigmoid activation – It actually computes the probability that the input belongs to class 1
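
A sketch of the same unit with a vector input and the y > 0.5 decision rule; the weight values below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def logistic_regression(x, w, w0):
    """P(Y=1 | X=x) for weight vector w and bias w0."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + w0)))

# Illustrative 2-D weights and input (not values from the slides)
w, w0 = np.array([1.0, -2.0]), 0.5
x = np.array([0.8, 0.3])
p = logistic_regression(x, w, w0)
decision = 1 if p > 0.5 else 0      # the "y > 0.5?" rule from the slide
print(p, decision)
```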

  29. Estimating the model (figure: sigmoid curve fitted to the y-vs-x data) P(y|x) = f(x) = 1 / (1 + e^(−(w0 + w1·x))) • Given the training data (many (x, y) pairs, represented by the dots), estimate w0 and w1 for the curve

  30. Estimating the model (figure: the same y-vs-x data) • Easier to represent using a y = +1/−1 notation: P(y=1|x) = 1 / (1 + e^(−(w0 + w1·x))) and P(y=−1|x) = 1 / (1 + e^(w0 + w1·x)), which combine into P(y|x) = 1 / (1 + e^(−y·(w0 + w1·x)))

  31. Estimating the model • Given: Training data (X1, y1), (X2, y2), …, (XN, yN) • The Xs are vectors; the ys are binary class values, written as ±1 (as on the previous slide) so the formula below applies • Total probability of the data: P((X1, y1), (X2, y2), …, (XN, yN)) = Π_j P(Xj, yj) = Π_j P(yj|Xj)·P(Xj) = Π_j [1 / (1 + e^(−yj·(w0 + wᵀXj)))]·P(Xj)

  32. Estimating the model • Likelihood: P(Training data) = Π_j [1 / (1 + e^(−yj·(w0 + wᵀXj)))]·P(Xj) • Log likelihood: log P(Training data) = Σ_j log P(Xj) − Σ_j log(1 + e^(−yj·(w0 + wᵀXj)))
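
A sketch of evaluating this log likelihood as a function of the weights, assuming ±1 labels; the Σ_j log P(Xj) term is dropped because it does not depend on w0 and w, and the dataset below is an illustrative assumption:

```python
import numpy as np

def neg_log_likelihood(w0, w, X, y):
    """Sum_j log(1 + exp(-y_j * (w0 + w.X_j))) for labels y_j in {-1, +1}.
    This is the weight-dependent part of -log P(Training data)."""
    margins = y * (X @ w + w0)                 # y_j * (w0 + w^T X_j)
    return np.sum(np.log1p(np.exp(-margins)))  # log1p for numerical stability

# Tiny illustrative dataset (not from the slides)
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
y = np.array([-1, -1, +1])
print(neg_log_likelihood(0.0, np.array([0.5, 0.5]), X, y))
```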

  33. Maximum Likelihood Estimate (ŵ0, ŵ1) = argmax over (w0, w1) of log P(Training data) • Equals (note argmin rather than argmax): (ŵ0, ŵ1) = argmin over (w0, w1) of Σ_j log(1 + e^(−yj·(w0 + wᵀXj))) • Identical to minimizing the KL divergence between the desired output y and the actual output 1 / (1 + e^(−(w0 + wᵀXj))) • Cannot be solved directly; needs gradient descent
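
A minimal gradient-descent sketch for this minimization; the slides only state that gradient descent is needed, so the learning rate, step count, and toy data below are all illustrative assumptions:

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.1, steps=1000):
    """Gradient descent on Sum_j log(1 + exp(-y_j*(w0 + w.X_j))), y_j in {-1,+1}."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + w0)
        # d/dm log(1 + e^(-m)) = -1 / (1 + e^(m)); chain rule adds y_j and X_j
        coeff = -y / (1.0 + np.exp(margins))
        w -= lr * (X.T @ coeff)
        w0 -= lr * coeff.sum()
    return w0, w

# Tiny illustrative 1-D dataset (not from the slides)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, +1, +1])
w0, w = fit_logistic_gd(X, y)
print(w0, w)   # the fitted sigmoid crosses 0.5 between x = 1 and x = 2
```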

  34. So what about this one? (figure: data in the (x1, x2) plane) • Non-linear classifiers..

  35. First consider the separable case.. (figure: two separable classes in the (x1, x2) plane) • When the net must learn to classify..

  36. First consider the separable case.. (figure: the data in the (x1, x2) plane alongside a network diagram) • For a “sufficient” net

  37. First consider the separable case.. (figure: as on the previous slide) • For a “sufficient” net • This final perceptron is a linear classifier

  38. First consider the separable case.. (figure: as before, with the penultimate-layer outputs marked “???”) • For a “sufficient” net • This final perceptron is a linear classifier over the output of the penultimate layer

  39. First consider the separable case.. (figure: the net's inputs x1, x2 and its penultimate-layer outputs y1, y2) • For perfect classification the output of the penultimate layer must be linearly separable
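
A tiny sketch of this picture, assuming a one-hidden-layer net with ReLU units and hand-picked weights (architecture and all values are illustrative, not from the slides): the hidden layer produces the penultimate representation, and the final unit is just a linear threshold on that representation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def small_net(x, W1, b1, w2, b2):
    """Two-layer net: the hidden layer produces the penultimate output h,
    and the final perceptron is a plain linear classifier over h."""
    h = relu(W1 @ x + b1)          # penultimate-layer output
    score = w2 @ h + b2            # linear function of h
    return 1 if score > 0 else 0   # final unit: a threshold on the score

# Hand-picked weights for XOR-like data: not separable in (x1, x2),
# but the hidden layer maps it to an h-space where a line suffices
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])
b2 = -0.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, small_net(np.array(x, dtype=float), W1, b1, w2, b2))
```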
