Neural Networks Representations Fall 2017
Learning in the net • Problem: Given a collection of input-output pairs, learn the function
Learning for classification • When the net must learn to classify..
Learning for classification • In reality – In general the classes are not cleanly separated • So what is the function we learn?
In reality: the trivial linear example • Two-dimensional example – Blue dots (on the floor) on the “red” side – Red dots (suspended at Y=1) on the “blue” side – No line will cleanly separate the two colors
Non-linearly separable data: 1-D example • One-dimensional example for visualization – All (red) dots at Y=1 represent instances of class Y=1 – All (blue) dots at Y=0 are from class Y=0 – The data are not linearly separable • In this 1-D example, a linear separator is a threshold • No threshold will cleanly separate the red and blue dots
Undesired Function • (Same 1-D example as above; the figure shows a function we would not want to learn.)
What if? (figure: at nearly the same x, 90 red instances at y=1 and 10 blue instances at y=0) • What must the value of the function be at this x? – 1, because red dominates? – 0.9, the average?
What if? • What must the value of the function be at this x? Estimate: ≈ P(y=1|x) – 1, because red dominates? Potentially much more useful than a simple 1/0 decision – 0.9 = 90/(90+10), the average? Also potentially more realistic
What if? • Should an infinitesimal nudge of the red dot change the function estimate entirely? • If not, how do we estimate P(y=1|x)? (The positions of the red and blue x values are different.)
The probability of y=1 • Consider this differently: at each point, look at a small window around that point • Plot the average value within the window – This is an approximation of the probability of Y=1 at that point
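To make the window idea concrete, here is a minimal sketch (not from the slides) that estimates P(y=1|x) by averaging the 0/1 labels inside a small window around each query point; the data, the window half-width, and the function name window_estimate are made-up illustrative choices.

```python
# Sketch: window-based estimate of P(y=1 | x) for 1-D data.
# Everything here (data, window width) is illustrative, not from the lecture.
import numpy as np

def window_estimate(x_train, y_train, x_query, half_width=0.5):
    """Average the 0/1 labels whose x lies within +/- half_width of x_query."""
    mask = np.abs(x_train - x_query) <= half_width
    if not mask.any():
        return 0.5  # empty window: fall back to an uninformative value
    return y_train[mask].mean()

# Toy 1-D data: class 1 becomes more likely as x grows.
x_train = np.array([0.1, 0.4, 0.9, 1.3, 1.8, 2.2, 2.7, 3.1, 3.6, 4.0])
y_train = np.array([0,   0,   0,   1,   0,   1,   1,   1,   1,   1])

for xq in (0.5, 2.0, 3.5):
    print(xq, window_estimate(x_train, y_train, xq))
```

Shrinking the window trades smoothness for sensitivity to individual points, which is exactly the "infinitesimal nudge" concern raised above.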
The logistic regression model P(y=1|x) = 1/(1 + e^{-(w_0 + w_1 x)}) (figure: sigmoid curve rising from y=0 to y=1 as x increases) • Class 1 becomes increasingly probable going left to right – Very typical in many problems
The logistic perceptron P(y=1|x) = 1/(1 + e^{-(w_0 + w_1 x)}) (figure: a single perceptron with input x, weights w_0 and w_1, and sigmoid output y) • A sigmoid perceptron with a single input models the a posteriori probability of the class given the input
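As a minimal sketch of the formula above; the weights w0 = -2.0 and w1 = 1.5 are made-up values, not from the lecture.

```python
# Sketch: a one-input sigmoid perceptron computing P(y=1 | x) = 1 / (1 + exp(-(w0 + w1*x))).
import math

def sigmoid_perceptron(x, w0=-2.0, w1=1.5):  # illustrative weights
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

print(sigmoid_perceptron(0.0))  # low probability of class 1
print(sigmoid_perceptron(3.0))  # high probability of class 1
```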
Non-linearly separable data • Two-dimensional example – Blue dots (on the floor) on the “red” side – Red dots (suspended at Y=1) on the “blue” side – No line will cleanly separate the two colors
Logistic regression (figure: a perceptron with inputs x_1, x_2, weights w_1, w_2, bias w_0, and sigmoid output y) Decision: y > 0.5? • When X is a 2-D variable: P(Y=1|X) = 1/(1 + exp(-(Σ_j w_j x_j + w_0))) • This is the perceptron with a sigmoid activation – It actually computes the probability that the input belongs to class 1
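A minimal sketch of the 2-D case with the y > 0.5 decision rule; the weight vector, bias, and input point are made-up illustrative values.

```python
# Sketch: 2-D logistic regression, P(Y=1 | X) = sigmoid(w . x + w0),
# with "predict class 1 if the probability exceeds 0.5".
import numpy as np

def predict_proba(x, w, w0):
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + w0)))

w, w0 = np.array([1.0, -2.0]), 0.5   # illustrative parameters
x = np.array([0.3, 0.8])             # illustrative input
p = predict_proba(x, w, w0)
print(p, "class 1" if p > 0.5 else "class 0")
```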
Estimating the model P(y=1|x) = f(x) = 1/(1 + e^{-(w_0 + w_1 x)}) • Given the training data (many (x, y) pairs represented by the dots), estimate w_0 and w_1 for the curve
Estimating the model • Easier to represent using a y = +1/-1 notation: P(y=1|x) = 1/(1 + e^{-(w_0 + w_1 x)}), P(y=-1|x) = 1/(1 + e^{(w_0 + w_1 x)}) • Both cases combine into a single expression: P(y|x) = 1/(1 + e^{-y(w_0 + w_1 x)})
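A quick numerical check of the +1/-1 notation (with made-up weights and input): both class probabilities come from the single expression P(y|x) = 1/(1 + e^{-y(w_0 + w_1 x)}), and they sum to 1.

```python
# Sketch: with labels y in {+1, -1}, one formula covers both classes.
import math

def p_label(y, x, w0, w1):
    return 1.0 / (1.0 + math.exp(-y * (w0 + w1 * x)))

w0, w1, x = -1.0, 2.0, 0.8  # illustrative values
print(p_label(+1, x, w0, w1) + p_label(-1, x, w0, w1))  # prints 1.0
```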
Estimating the model • Given: training data (X_1, y_1), (X_2, y_2), ..., (X_N, y_N) • The X's are vectors, the y's are binary (0/1) class values • Total probability of the data: P((X_1, y_1), (X_2, y_2), ..., (X_N, y_N)) = Π_i P(X_i, y_i) = Π_i P(y_i | X_i) P(X_i) = Π_i [1/(1 + e^{-y_i(w_0 + w^T X_i)})] P(X_i)
Estimating the model • Likelihood: P(Training data) = Π_i [1/(1 + e^{-y_i(w_0 + w^T X_i)})] P(X_i) • Log likelihood: log P(Training data) = Σ_i log P(X_i) − Σ_i log(1 + e^{-y_i(w_0 + w^T X_i)})
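A minimal sketch of the weight-dependent part of this log likelihood (the log P(X_i) terms do not involve w_0 or w, so they are omitted); the data and weights below are made-up.

```python
# Sketch: sum_i -log(1 + exp(-y_i * (w0 + w^T X_i))) for +/-1 labels.
import numpy as np

def log_likelihood(X, y, w, w0):
    z = y * (X @ w + w0)                  # y_i * (w0 + w^T X_i)
    return -np.sum(np.log1p(np.exp(-z)))  # sum_i log sigmoid(z_i)

X = np.array([[0.2, 1.0], [1.5, -0.3], [2.0, 0.8]])  # illustrative data
y = np.array([-1, +1, +1])
print(log_likelihood(X, y, w=np.array([0.5, 1.0]), w0=-0.2))
```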
Maximum Likelihood Estimate (ŵ_0, ŵ) = argmax_{w_0, w} log P(Training data) • Equals (note argmin rather than argmax): (ŵ_0, ŵ) = argmin_{w_0, w} Σ_i log(1 + e^{-y_i(w_0 + w^T X_i)}) • Identical to minimizing the KL divergence between the desired output y_i and the actual output 1/(1 + e^{-(w_0 + w^T X_i)}) • Cannot be solved directly; needs gradient descent
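A minimal sketch of gradient descent on the argmin objective above, Σ_i log(1 + e^{-y_i(w_0 + w^T X_i)}); the toy data, learning rate, and iteration count are made-up, and this is only an illustration, not the lecture's reference implementation.

```python
# Sketch: batch gradient descent on L(w0, w) = sum_i log(1 + exp(-y_i * (w0 + w^T X_i))).
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = y * (X @ w + w0)
        s = 1.0 / (1.0 + np.exp(z))                   # = 1 - sigmoid(z_i)
        grad_w = -(X * (s * y)[:, None]).sum(axis=0)  # dL/dw
        grad_w0 = -(s * y).sum()                      # dL/dw0
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # illustrative 1-D data
y = np.array([-1, -1, +1, +1])
print(fit_logistic(X, y))
```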
So what about this one? • Non-linear classifiers..
First consider the separable case.. • When the net must learn to classify..
First consider the separable case.. • For a “sufficient” net
First consider the separable case.. • For a “sufficient” net • This final perceptron is a linear classifier
First consider the separable case.. • For a “sufficient” net • This final perceptron is a linear classifier over the output of the penultimate layer
First consider the separable case.. (figure: a network with inputs x_1, x_2 and penultimate-layer outputs y_1, y_2 feeding the final perceptron) • For perfect classification, the output of the penultimate layer must be linearly separable
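A minimal sketch of this point with made-up weights: a tiny 2-2-1 threshold network whose hidden layer maps a problem that is not linearly separable in the input (class 1 iff |x_1| > 1) to penultimate outputs (y_1, y_2) on which the final unit is an ordinary linear classifier.

```python
# Sketch: the hidden layer produces features y1 = [x1 > 1], y2 = [x1 < -1];
# the final perceptron is just a linear classifier (an OR) over those features.
import numpy as np

def step(z):
    return (z > 0).astype(float)

W_hidden = np.array([[ 1.0, 0.0],    # y1 fires when x1 > 1
                     [-1.0, 0.0]])   # y2 fires when x1 < -1
b_hidden = np.array([-1.0, -1.0])
w_out, b_out = np.array([1.0, 1.0]), -0.5

def classify(x):
    h = step(W_hidden @ x + b_hidden)              # penultimate-layer outputs (y1, y2)
    return step(np.array([w_out @ h + b_out]))[0]  # linear classifier over them

print(classify(np.array([ 2.0, 0.0])))  # 1.0: |x1| > 1
print(classify(np.array([ 0.0, 0.0])))  # 0.0: |x1| <= 1
print(classify(np.array([-3.0, 5.0])))  # 1.0: |x1| > 1
```

In the (y_1, y_2) feature space the two classes are linearly separable even though no single line separates them in the original input space.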