Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein
Logistic Regression
Perceptron & Probabilities • What if we want a probability p(y|x)? • The perceptron gives us a prediction y • Let’s illustrate this with binary classification Illustrations: Graham Neubig
The logistic function • “Softer” function than in perceptron • Can account for uncertainty • Differentiable
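To make the logistic function concrete, here is a minimal sketch (assuming a NumPy-style setup; the weight vector `w` and feature vector `phi_x` are hypothetical values, not from the slides) that computes $P(y=1 \mid x) = \frac{1}{1+e^{-w\cdot\phi(x)}}$:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and feature vectors, for illustration only
w = np.array([0.5, -1.0, 2.0])
phi_x = np.array([1.0, 0.0, 1.0])

p_positive = sigmoid(np.dot(w, phi_x))   # P(y = +1 | x)
p_negative = 1.0 - p_positive            # P(y = -1 | x)
print(p_positive, p_negative)
```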
Logistic regression: how to train? • Train based on conditional likelihood • Find parameters w that maximize the conditional likelihood of all labels $y_i$ given their examples $x_i$: $\hat{w} = \arg\max_w \prod_i P(y_i \mid x_i; w)$
Stochastic gradient ascent (or descent) • Online training algorithm for logistic regression and other probabilistic models • Update the weights after every training example • Move in the direction given by the gradient • Size of the update step is scaled by the learning rate
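Written as an update rule (this matches the pseudocode that appears later in the deck; α is the learning rate): for each training example (x, y), $w \leftarrow w + \alpha \, \frac{\partial P(y \mid x)}{\partial w}$.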
Gradient of the logistic function
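For reference, with $\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}$, the derivative is $\frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z)) = \frac{e^{z}}{(1 + e^{z})^{2}}$; this is the quantity that drives the gradient updates in the worked examples that follow.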
Example: Person/not-person classification problem Given an introductory sentence in Wikipedia, predict whether the article is about a person
Example: initial update
Example: second update
How to set the learning rate? • Various strategies • Decay over time: $\eta = \frac{1}{C + t}$, where C is a parameter and t is the number of samples seen so far • Use a held-out set: increase the learning rate when the held-out likelihood increases
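A minimal sketch of the decay schedule above (the constant C, the toy data, and the variable names are illustrative assumptions, not from the slide):

```python
training_data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]  # toy (features, label) pairs

C = 100.0  # decay constant: a tunable hyperparameter
for t, (x, y) in enumerate(training_data):
    learning_rate = 1.0 / (C + t)  # eta = 1 / (C + t): shrinks as more samples are seen
    # ... one stochastic gradient update with this learning_rate would go here ...
```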
Multiclass version
Some models are better than others… • Consider these 2 examples • Which of the 2 models below is better? • Classifier 2 will probably generalize better: it does not include irrelevant information => the smaller model is better
Regularization • A penalty on adding extra weights • L2 regularization: penalty proportional to $\|w\|_2^2$ (the squared weights) • big penalty on large weights • small penalty on small weights • L1 regularization: penalty proportional to $\|w\|_1$ (the absolute values of the weights) • Uniform increase whether weights are large or small • Will cause many weights to become exactly zero
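A hedged sketch of how these two penalties enter the training objective (the regularization strength `lam` and the function names are illustrative, not from the slides):

```python
import numpy as np

def l2_penalty(w, lam):
    """L2 regularization: lam * sum of squared weights (big penalty on large weights)."""
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    """L1 regularization: lam * sum of absolute values (drives many weights to exactly zero)."""
    return lam * np.sum(np.abs(w))

w = np.array([2.0, 0.1, -0.5])
# The regularized objective subtracts the penalty from the (log-)likelihood being maximized.
print(l2_penalty(w, lam=0.01), l1_penalty(w, lam=0.01))
```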
L1 regularization in online learning
What you should know • Standard supervised learning set-up for text classification • Difference between train vs. test data • How to evaluate • 3 examples of supervised linear classifiers • Naïve Bayes, Perceptron, Logistic Regression • Learning as optimization: what is the objective function optimized? • Difference between generative vs. discriminative classifiers • Smoothing, regularization • Overfitting, underfitting
Neural networks
Person/not-person classification problem Given an introductory sentence in Wikipedia, predict whether the article is about a person
Formalizing binary prediction
The Perceptron: a “machine” to calculate a weighted sum • [Figure: word-count features φ(“A”) = 1, φ(“site”) = 1, φ(“located”) = 1, φ(“Maizuru”) = 1, φ(“,”) = 2, φ(“in”) = 1, φ(“Kyoto”) = 1, φ(“priest”) = 0, φ(“black”) = 0, each multiplied by a weight (0, -3, 0, -1, …) and summed, then passed through sign] • Prediction: $y = \operatorname{sign}\big(\sum_i w_i \, \phi_i(x)\big)$
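A minimal sketch of that weighted-sum “machine” in code (the feature dictionary follows the figure, but the weight values are hypothetical, not the exact ones from the slide):

```python
def perceptron_predict(weights, features):
    """Return sign(sum_i w_i * phi_i(x)) for sparse feature dictionaries."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1 if score >= 0 else -1

# Word-count features for one Wikipedia intro sentence (from the figure)
features = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2, "in": 1, "Kyoto": 1}
weights = {"site": -3.0, "Maizuru": -1.0, "Kyoto": 0.0, "priest": 2.0}  # hypothetical weights

print(perceptron_predict(weights, features))  # -1 here, i.e. "not a person" under these weights
```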
The Perceptron: Geometric interpretation • [Figure: O and X points in feature space, separated by a straight line (a linear decision boundary)]
Limitation of perceptron ● can only find linear separations between positive and negative examples ● [Figure: an XOR-like arrangement of X and O points that no single straight line can separate]
Neural Networks ● Connect together multiple perceptrons [Figure: the word features φ(“A”) = 1, φ(“site”) = 1, φ(“located”) = 1, φ(“Maizuru”) = 1, φ(“,”) = 2, φ(“in”) = 1, φ(“Kyoto”) = 1, φ(“priest”) = 0, φ(“black”) = 0 feed several perceptron units, whose outputs feed a further unit] ● Motivation: Can represent non-linear functions!
Neural Networks: key terms [Figure: the multi-layer network from the previous slide, annotated with the terms below] • Input (aka features) • Output • Nodes • Layers • Hidden layers • Activation function (non-linear) • Multi-layer perceptron
Example ● Create two classifiers over the original features: the four points are φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}, labeled X, O, O, X respectively ● [Figure: classifier 1 computes φ1[0] = sign(1·φ0[0] + 1·φ0[1] - 1), i.e. weights w0,0 = {1, 1} with bias b0,0 = -1; classifier 2 computes φ1[1] = sign(-1·φ0[0] - 1·φ0[1] - 1), i.e. weights w0,1 = {-1, -1} with bias b0,1 = -1]
Example ● These classifiers map the points to a new space: φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1} ● [Figure: in the φ1 space, x1 and x4 coincide at {-1, -1}, while x2 and x3 sit at separate corners]
Example ● In the new space, the examples are linearly separable! ● [Figure: a final unit φ2[0] = y = sign(1·φ1[0] + 1·φ1[1] + 1) separates the O points (x2, x3) from the X points (x1, x4)]
Example wrap-up: Forward propagation ● The final net: the same two hidden units (weights {1, 1} and {-1, -1}, each with bias -1) feeding one output unit φ2[0] (weights {1, 1}, bias 1), with tanh activations in place of sign
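A minimal sketch of forward propagation for that final net, assuming the weights read off the figure above (hidden units {1, 1} and {-1, -1} with bias -1 each; output unit {1, 1} with bias 1) and tanh activations:

```python
import numpy as np

# Weights taken from the example figure (treated here as given, not re-derived)
W_hidden = np.array([[1.0, 1.0],     # hidden unit 1 -> phi1[0]
                     [-1.0, -1.0]])  # hidden unit 2 -> phi1[1]
b_hidden = np.array([-1.0, -1.0])
w_out = np.array([1.0, 1.0])
b_out = 1.0

def forward(phi0):
    """Forward propagation: input features -> hidden layer -> output score."""
    phi1 = np.tanh(W_hidden @ phi0 + b_hidden)   # hidden layer
    phi2 = np.tanh(w_out @ phi1 + b_out)         # output layer
    return phi2

for phi0 in ([-1, 1], [1, 1], [-1, -1], [1, -1]):
    print(phi0, np.sign(forward(np.array(phi0, dtype=float))))
# Output signs: x1 -> -1 (X), x2 -> +1 (O), x3 -> +1 (O), x4 -> -1 (X)
```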
Softmax Function for multiclass classification ● The sigmoid function generalized to multiple classes: $P(y \mid x) = \frac{e^{w \cdot \phi(x, y)}}{\sum_{\tilde{y}} e^{w \cdot \phi(x, \tilde{y})}}$, where the numerator scores the current class y and the denominator sums the scores of all classes $\tilde{y}$ ● Can be expressed using matrix/vector ops: $\mathbf{s} = \exp\big(W \cdot \phi(x, y)\big)$, $\mathbf{p} = \mathbf{s} \,/\, \sum_{\tilde{s} \in \mathbf{s}} \tilde{s}$
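A minimal sketch of a softmax computed with matrix/vector operations (here using a class-by-feature weight matrix W applied to class-independent features, a common simplification; the max subtraction is a standard numerical-stability trick, not something shown on the slide):

```python
import numpy as np

def softmax(scores):
    """p_k = exp(s_k) / sum_j exp(s_j); subtracting the max keeps exp from overflowing."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

W = np.array([[0.5, -1.0, 2.0, 0.0],   # one row of weights per class (3 classes, 4 features)
              [1.0,  0.0, 0.0, 1.0],
              [-0.5, 0.5, 1.0, 0.0]])
phi_x = np.array([1.0, 0.0, 2.0, 1.0])

p = softmax(W @ phi_x)   # P(y | x) for each of the 3 classes
print(p, p.sum())        # the probabilities sum to 1
```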
Stochastic Gradient Descent • Online training algorithm for probabilistic models:
w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α * dP(y|x)/dw
• In other words: for every training example, calculate the gradient (the direction that will increase the probability of y), then move in that direction, multiplied by learning rate α
Gradient of the Sigmoid Function • Take the derivative of the probability:
$\frac{d}{dw} P(y = 1 \mid x) = \frac{d}{dw} \, \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}} = \phi(x) \, \frac{e^{w \cdot \phi(x)}}{\left(1 + e^{w \cdot \phi(x)}\right)^{2}}$
$\frac{d}{dw} P(y = -1 \mid x) = \frac{d}{dw} \left( 1 - \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}} \right) = -\phi(x) \, \frac{e^{w \cdot \phi(x)}}{\left(1 + e^{w \cdot \phi(x)}\right)^{2}}$
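Putting the previous two slides together, a hedged runnable sketch of the update (the toy data, iteration count, and learning rate value are illustrative; labels are in {-1, +1}):

```python
import numpy as np

def grad_p(w, phi, y):
    """dP(y | x)/dw for the sigmoid model, following the derivation above (y in {-1, +1})."""
    e = np.exp(np.dot(w, phi))
    return y * phi * e / (1.0 + e) ** 2

# Toy data: (feature vector, label) pairs with labels in {-1, +1}
data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]

w = np.zeros(2)
alpha = 0.1
for _ in range(100):                      # "for I iterations"
    for phi, y in data:                   # "for each labeled pair x, y in the data"
        w += alpha * grad_p(w, phi, y)    # "w += alpha * dP(y|x)/dw"
print(w)                                  # first weight grows positive, second negative
```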
Learning: We Don't Know the Derivative for Hidden Units! • For NNs, we only know the correct tag for the last layer: for the output weights, $\frac{dP(y = 1 \mid \mathbf{x})}{dw_4} = h(\mathbf{x}) \, \frac{e^{w_4 \cdot h(\mathbf{x})}}{\left(1 + e^{w_4 \cdot h(\mathbf{x})}\right)^{2}}$, but for the hidden-layer weights $\frac{dP(y = 1 \mid \mathbf{x})}{dw_1} = ?$, $\frac{dP(y = 1 \mid \mathbf{x})}{dw_2} = ?$, $\frac{dP(y = 1 \mid \mathbf{x})}{dw_3} = ?$ [Figure: w1, w2, w3 map the input features ϕ(x) to hidden values h(x), and w4 maps h(x) to the output y = 1]
Answer: Back-Propagation • Calculate the derivative with the chain rule: $\frac{dP(y = 1 \mid x)}{dw_1} = \frac{dP(y = 1 \mid x)}{dh_1(x)} \cdot \frac{dh_1(x)}{dw_1}$, where $\frac{dP(y = 1 \mid x)}{dh_1(x)} = \frac{e^{w_4 \cdot h(x)}}{\left(1 + e^{w_4 \cdot h(x)}\right)^{2}} \, w_{1,4}$ (the error of the next unit, δ4, times the connecting weight) and $\frac{dh_1(x)}{dw_1}$ is the gradient of this unit • In general, calculate the error for unit i based on the next units j: $\frac{dP(y = 1 \mid x)}{dh_i(x)} = \sum_j \delta_j \, w_{i,j}$
Backpropagation = Gradient descent + Chain rule
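To make the chain-rule bookkeeping concrete, here is a hedged sketch of backpropagation for the small two-layer tanh network from the earlier example, trained with a squared-error-style objective for simplicity (the loss choice, random initialization, learning rate, and variable names are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # hidden layer weights: 2 tanh units
b1 = np.zeros(2)
w2 = rng.normal(size=2)        # output unit weights
b2 = 0.0
alpha = 0.1

# XOR-like data from the earlier example: inputs phi0 and targets y in {-1, +1}
data = [([-1, 1], -1), ([1, 1], 1), ([-1, -1], 1), ([1, -1], -1)]

for _ in range(2000):
    for phi0, y in data:
        phi0 = np.array(phi0, dtype=float)
        # Forward propagation
        h = np.tanh(W1 @ phi0 + b1)                 # hidden activations
        out = np.tanh(w2 @ h + b2)                  # network output
        # Backpropagation: chain rule, ascending the objective -0.5 * (y - out)^2
        delta_out = (y - out) * (1 - out ** 2)      # error of the output unit
        delta_h = (1 - h ** 2) * (w2 * delta_out)   # hidden errors: next unit's error times connecting weight
        # Gradient ascent updates, scaled by the learning rate
        w2 += alpha * delta_out * h
        b2 += alpha * delta_out
        W1 += alpha * np.outer(delta_h, phi0)
        b1 += alpha * delta_h

for x, _ in data:
    x = np.array(x, dtype=float)
    print(x, np.sign(np.tanh(w2 @ np.tanh(W1 @ x + b1) + b2)))
# With a successful run the signs match the example's labels (-1, +1, +1, -1);
# with only two hidden units, training from a random start can occasionally stall.
```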
Feed Forward Neural Nets • All connections point forward, from the input features ϕ(x) to the output y • It is a directed acyclic graph (DAG)
Neural Networks • Non-linear classification • Prediction: forward propagation • Vector/matrix operations + non-linearities • Training: backpropagation + stochastic gradient descent For more details, see CIML Chap 7