Neural Networks: Representations


  1. Neural Networks Representations

  2. Learning in the net • Problem: Given a collection of input-output pairs, learn the function that maps inputs to outputs

  3. Learning for classification [figure: training instances in (x1, x2) space] • When the net must learn to classify.. – Learn the classification boundaries that separate the training instances

  4. Learning for classification [figure: the same (x1, x2) space] • In reality – In general the classes are not really cleanly separated • So what is the function we learn?

  5. In reality: Trivial linear example [figure: 2-D data in (x1, x2)] • Two-dimensional example – Blue dots (on the floor) on the “red” side – Red dots (suspended at Y=1) on the “blue” side – No line will cleanly separate the two colors

  6. Non-linearly separable data: 1-D example [figure: labels y plotted against x] • One-dimensional example for visualization – All (red) dots at Y=1 represent instances of class Y=1 – All (blue) dots at Y=0 are from class Y=0 – The data are not linearly separable • In this 1-D example, a linear separator is a threshold • No threshold will cleanly separate red and blue dots

  7. Undesired function [same 1-D example and bullets as slide 6; the figure overlays a function estimate we would not want]

  8. What if? [same 1-D example and bullets as slide 6]

  9. What if? [figure: at one x, 90 instances of the red class (y=1) and 10 of the blue class (y=0)] • What must the value of the function be at this x? – 1, because red dominates? – 0.9, the average?

  10. What if? [same figure: 90 red instances, 10 blue instances at one x] • What must the value of the function be at this x? Estimate: – 1, because red dominates? – 0.9, the average? A soft estimate like 0.9 is potentially much more useful than a simple 1/0 decision, and also potentially more realistic

  11. What if? [same figure: 90 red instances, 10 blue instances at one x] • What must the value of the function be at this x? – 1, because red dominates? – 0.9, the average? (potentially much more useful than a simple 1/0 decision, and potentially more realistic) • Should an infinitesimal nudge of the red dot change the function estimate entirely? If not, how do we estimate P(y=1|x), since the positions of the red and blue x values are different?

  12. The probability of y=1 [figure: y vs. x] • Consider this differently: at each point look at a small window around that point • Plot the average value within the window – This is an approximation of the probability of Y=1 at that point
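
A minimal numpy sketch of this windowed-average construction (the synthetic 1-D data, the window half-width, and the query grid below are assumptions made up for illustration):

    import numpy as np

    # Hypothetical 1-D data: red dots have label 1, blue dots label 0
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 10.0, 500))
    y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-(x - 5.0)))).astype(float)

    def window_average(x_train, y_train, x_query, half_width=0.5):
        """Average the 0/1 labels inside a small window around each query point;
        this is a crude approximation of P(y=1 | x) at that point."""
        est = np.full(len(x_query), np.nan)
        for i, x0 in enumerate(x_query):
            mask = np.abs(x_train - x0) <= half_width
            if mask.any():
                est[i] = y_train[mask].mean()
        return est

    grid = np.linspace(0.0, 10.0, 101)
    p_hat = window_average(x, y, grid)   # approximate P(y=1 | x) along the grid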

  13.-24. The probability of y=1 (animation frames repeating the same bullets as slide 12; only the figure changes)

  25. The logistic regression model: P(y=1|x) = 1 / (1 + e^{-(w0 + w1 x)}) [figure: sigmoid curve between y=0 and y=1 as a function of x] • Class 1 becomes increasingly probable going left to right – Very typical in many problems

  26. The logistic perceptron: P(y|x) = 1 / (1 + e^{-(w0 + w1 x)}) • A sigmoid perceptron with a single input models the a posteriori probability of the class given the input
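
As a one-function sketch (the weights w0 and w1 below are made-up values, not estimated from anything):

    import numpy as np

    def sigmoid_perceptron(x, w0=-5.0, w1=1.0):
        """Single-input sigmoid perceptron: models P(y=1 | x).
        w0 and w1 are hypothetical weights chosen only for illustration."""
        return 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))

    # The modeled probability of class 1 rises with x
    print(sigmoid_perceptron(np.array([2.0, 5.0, 8.0])))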

  27. Non-linearly separable data [figure: 2-D data in (x1, x2)] • Two-dimensional example – Blue dots (on the floor) on the “red” side – Red dots (suspended at Y=1) on the “blue” side – No line will cleanly separate the two colors

  28. Logistic regression • When x is a 2-D variable: P(y=1|x) = 1 / (1 + e^{-(w0 + w1 x1 + w2 x2)}); decision: y > 0.5? • This is the perceptron with a sigmoid activation – It actually computes the probability that the input belongs to class 1 – Decision boundaries may be obtained by comparing the probability to a threshold • These boundaries will be lines (hyperplanes in higher dimensions) • The sigmoid perceptron is a linear classifier
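
A small sketch of the 2-D case with made-up weights, checking that thresholding the probability at 0.5 is the same as asking which side of the line w0 + w1*x1 + w2*x2 = 0 the point lies on:

    import numpy as np

    def p_class1(x1, x2, w0=-1.0, w1=2.0, w2=-3.0):
        # 2-D sigmoid perceptron; the weights are illustrative only
        return 1.0 / (1.0 + np.exp(-(w0 + w1 * x1 + w2 * x2)))

    x1, x2 = 0.9, 0.2
    score = -1.0 + 2.0 * x1 - 3.0 * x2       # the linear score w0 + w1*x1 + w2*x2
    # P(y=1|x) > 0.5 exactly when the linear score is positive, so the decision
    # boundary is the line where the score is zero: a linear classifier.
    print(p_class1(x1, x2) > 0.5, score > 0)  # both tests agree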

  29. Estimating the model [figure: data points and the fitted sigmoid curve over x] P(y|x) = f(x) = 1 / (1 + e^{-(w0 + w1 x)}) • Given the training data (the many (x, y) pairs represented by the dots), estimate w0 and w1 for the curve

  30. Estimating the model • Easier to represent using a y = +1/-1 notation: P(y=1|x) = 1 / (1 + e^{-(w0 + w1 x)}), P(y=-1|x) = 1 / (1 + e^{(w0 + w1 x)}) • Both cases combine into P(y|x) = 1 / (1 + e^{-y (w0 + w1 x)})
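
A quick numeric check, with arbitrary values for w0, w1 and x, that the single +1/-1 formula reproduces both class probabilities:

    import numpy as np

    w0, w1, x = 0.5, 2.0, 1.3                       # illustrative values only
    p_pos = 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))    # P(y = +1 | x)
    p_neg = 1.0 / (1.0 + np.exp(w0 + w1 * x))       # P(y = -1 | x)
    for label, target in ((+1, p_pos), (-1, p_neg)):
        p_unified = 1.0 / (1.0 + np.exp(-label * (w0 + w1 * x)))
        print(label, np.isclose(p_unified, target))  # True for both labels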

  31. Estimating the model • Given: training data (x_1, y_1), ..., (x_N, y_N) • The x_i are vectors, the y_i are binary (0/1) class values • Total probability of the data: P(data) = prod_i P(y_i | x_i)

  32. Estimating the model • Likelihood: P(data) = prod_i f(x_i)^{y_i} (1 - f(x_i))^{1 - y_i}, where f(x) = 1 / (1 + e^{-(w0 + w1 x)}) • Log likelihood: log P(data) = sum_i [ y_i log f(x_i) + (1 - y_i) log(1 - f(x_i)) ]

  33. Maximum likelihood estimate: (w0, w1) = argmax log P(data) • Equals (note argmin rather than argmax): (w0, w1) = argmin - sum_i [ y_i log f(x_i) + (1 - y_i) log(1 - f(x_i)) ] • Identical to minimizing the KL divergence between the desired output and the actual output • Cannot be solved directly, needs gradient descent
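
A minimal gradient-descent sketch of this estimation (the synthetic data, learning rate, and iteration count are arbitrary choices for illustration); it minimizes the negative log likelihood, i.e. the cross entropy between the desired and actual outputs:

    import numpy as np

    # Synthetic 1-D training data generated from a known logistic model
    rng = np.random.default_rng(1)
    x = rng.uniform(-3.0, 3.0, 200)
    y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(1.5 * x - 0.5)))).astype(float)

    w0, w1, lr = 0.0, 0.0, 0.1
    for _ in range(2000):
        f = 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))   # current estimate of P(y=1 | x)
        # Gradient of the (mean) negative log likelihood with respect to w0, w1
        g0 = np.mean(f - y)
        g1 = np.mean((f - y) * x)
        w0 -= lr * g0
        w1 -= lr * g1

    print(w0, w1)   # should land near the generating parameters (-0.5, 1.5)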

  34. So what about this one? [figure: 2-D data] • Non-linear classifiers..

  35. First consider the separable case.. [figure: classes in (x1, x2)] • When the net must learn to classify..

  36. First consider the separable case.. [figures: two panels in (x1, x2)] • For a “sufficient” net

  37. First consider the separable case.. [figures: two panels in (x1, x2)] • For a “sufficient” net • This final perceptron is a linear classifier

  38. First consider the separable case.. [figures: two panels in (x1, x2), with a “???” callout] • For a “sufficient” net • This final perceptron is a linear classifier over the output of the penultimate layer

  39. First consider the separable case.. [figures: input space (x1, x2) and penultimate-layer outputs (y1, y2)] • For perfect classification the output of the penultimate layer must be linearly separable

  40. First consider the separable case.. [figures: input space (x1, x2) and penultimate-layer outputs (y1, y2)] • The rest of the network may be viewed as a transformation that maps the data from non-linearly separable classes to linearly separable features – We can now attach any linear classifier on top of it for perfect classification – It need not be a perceptron – In fact, slapping an SVM on top of the features may be more generalizable!

  41. First consider the separable case.. (same as slide 40, with the last point refined: for binary classifiers, an SVM on top of the features may be more generalizable)
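
A sketch of this idea with scikit-learn (the two-moons data, the network size, and the feature-extraction helper below are assumptions for illustration): train a small MLP, reuse its hidden layers as the learned feature transform, and fit a linear SVM on those features.

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    # A non-linearly separable 2-D problem standing in for the slide's example
    X, y = make_moons(n_samples=500, noise=0.1, random_state=0)

    # Small MLP; its hidden layers learn the non-linear transformation
    mlp = MLPClassifier(hidden_layer_sizes=(16, 2), activation="relu",
                        max_iter=5000, random_state=0).fit(X, y)

    def penultimate_features(net, data):
        """Forward pass through the hidden layers only (helper written for this
        sketch; it reuses the fitted weights in net.coefs_ / net.intercepts_)."""
        h = data
        for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
            h = np.maximum(h @ W + b, 0.0)      # ReLU hidden activations
        return h

    # Replace the final perceptron with a linear SVM on the learned features
    svm = SVC(kernel="linear").fit(penultimate_features(mlp, X), y)
    print(svm.score(penultimate_features(mlp, X), y))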
