Logistic Regression Dr. Besnik Fetahu
Supervised Classification
• Input instances: $X = \{x^{(1)}, \ldots, x^{(n)}\}$
• Output labels (classes): $Y = \{T, F\}$
• Training set of IID examples (input-target samples): $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$
• Learn a function $f(x^{(i)}) \to y^{(i)}$ that maps $x^{(i)}$ to $y^{(i)}$
Generative vs. Discriminative Classifiers
• Generative and discriminative models are two different kinds of machine learning models used for classification
• Generative models (Naïve Bayes) learn the joint distribution P(x,y):
  • How are the observations of the different classes generated? P(x|Y=y)
• Discriminative models (Logistic Regression) learn only how to distinguish between the different classes:
  • Which features best distinguish the different classes? P(Y=y|x)
Generative vs. Discriminative Classifiers
• Generative: will try to model what horses look like!
• Discriminative: will try to map horse instances to the correct class!
Generative Models
Naïve Bayes
• For an input instance x (e.g. a document) predict the class y (e.g. the topic):
  $y_{\max} = \arg\max_{y \in Y} P(Y=y \mid x) = \arg\max_{y \in Y} \frac{P(x \mid Y=y)\, P(Y=y)}{P(x)} = \arg\max_{y \in Y} P(x \mid Y=y)\, P(Y=y) = \arg\max_{y \in Y} P(x_1 \ldots x_k \mid Y=y)\, P(Y=y)$
  where $P(x \mid Y=y)$ is the likelihood and $P(Y=y)$ is the prior
Naïve Bayes
$y_{\max} = \arg\max_{y \in Y} P(x_1 \ldots x_k \mid Y=y)\, P(Y=y) = \arg\max_{y \in Y} P(x_1 \mid y) \cdot \ldots \cdot P(x_k \mid y) \cdot P(y) = \arg\max_{y \in Y} P(y) \prod_{i=1}^{k} P(x_i \mid y)$
Feature independence assumption
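To make the arg-max concrete, here is a minimal sketch in Python, assuming the prior and per-word likelihoods have already been estimated from training counts; the classes, words, and probability values are invented for illustration:

```python
import math

# Hypothetical estimates for a two-class toy problem (not from real data).
log_prior = {"sports": math.log(0.6), "politics": math.log(0.4)}
log_likelihood = {
    "sports":   {"game": math.log(0.10), "vote": math.log(0.01)},
    "politics": {"game": math.log(0.02), "vote": math.log(0.12)},
}

def predict(tokens):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    scores = {}
    for y in log_prior:
        scores[y] = log_prior[y] + sum(log_likelihood[y][t] for t in tokens)
    return max(scores, key=scores.get)

print(predict(["game", "game", "vote"]))  # -> "sports"
```

Working in log space turns the product of likelihoods into a sum and avoids numerical underflow for long documents.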
Generative Classifiers
• Generative models try to model the input space (e.g. what are the characteristics of instances belonging to some class y)
• Use the Bayes rule to make predictions
• By modelling P(x|y), generative models solve an intermediate problem that is not directly related to P(y|x): what class does x belong to?
• The number of parameters, $O(|X| \cdot n \cdot |Y|)$, is linear in the feature space and the number of classes
• Describe how likely a class y is to generate some instance x (the likelihood term)
Discriminative Models
Discriminative Models
• Map the input instance features to the correct target label!
• Discriminative models optimize directly for accuracy in predicting the right class
• Assign high weights to the input features that best discriminate between the different classes
• Logistic regression is a discriminative model
• Use a sigmoid or softmax function to determine the right class for P(y|x)
Logistic Regression
• What do we need for a logistic regression model in the binary case?
  • Feature representation $x^{(i)} = [x^{(i)}_1, \ldots, x^{(i)}_k]$
  • Classification function: sigmoid function
  • Objective function for learning (loss function)
  • Algorithm for optimizing the loss function
• LR learns a set of feature weights w and a bias factor b from training data for the classification task
LR – Classification
• Classification: $z = \left( \sum_{i=1}^{k} w_i x_i \right) + b$
• w represents the importance of the individual features for our input space (e.g. “awesome” is important in determining positive sentiment)
• b is the bias term, also called the intercept
LR – Classification
• Classification: $z = \left( \sum_{i=1}^{k} w_i x_i \right) + b$
• To classify, we push z through a sigmoid function (aka logistic function): $\sigma(z) = \frac{1}{1 + e^{-z}}$
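A one-line NumPy version of the sigmoid, just to illustrate its squashing behaviour (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 -- the midpoint
print(sigmoid(4.0))   # ~0.982, large positive z pushes towards 1
print(sigmoid(-4.0))  # ~0.018, large negative z pushes towards 0
```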
LR – Classification
$\sigma(z) = \frac{1}{1 + e^{-z}}$
(plot of the sigmoid curve)
LR – Classification
• How can we classify through the sigmoid function?
  $P(y=1) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$
  $P(y=0) = 1 - \sigma(w \cdot x + b) = 1 - \frac{1}{1 + e^{-(w \cdot x + b)}} = \frac{e^{-(w \cdot x + b)}}{1 + e^{-(w \cdot x + b)}}$
• Decision boundary:
  $\hat{y}_i = \begin{cases} 1 & \text{if } P(y=1 \mid x_i) > 0.5 \\ 0 & \text{otherwise} \end{cases}$
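Putting the score and the sigmoid together, a sketch of the classification rule might look like this (the function names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """P(y = 1 | x) under the logistic regression model."""
    return sigmoid(np.dot(w, x) + b)

def predict(w, b, x, threshold=0.5):
    """Decision boundary: label 1 if P(y = 1 | x) exceeds the threshold, else 0."""
    return 1 if predict_proba(w, b, x) > threshold else 0
```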
LR - Feature Space
LR – Classification Example
• Assume we know the optimal w and b:
  $w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]$, $b = 0.1$
• For the input $x = [3, 2, 1, 3, 0, 4.15]$:
  $P(Y=1 \mid x) = \sigma(w \cdot x + b) = \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.15] + 0.1) = \sigma(0.805) \approx 0.69$
  $P(Y=0 \mid x) = 1 - \sigma(w \cdot x + b) \approx 0.31$
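The arithmetic above can be double-checked in a couple of lines (same w, b, and x as in the example):

```python
import numpy as np

w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
b = 0.1
x = np.array([3.0, 2.0, 1.0, 3.0, 0.0, 4.15])

z = np.dot(w, x) + b               # 0.805
p_pos = 1.0 / (1.0 + np.exp(-z))   # sigma(z)
print(z, p_pos, 1.0 - p_pos)       # ~0.805, ~0.69, ~0.31
```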
LR – Feature Design/Engineering
• Design features based on the training set
• Features should reflect linguistic intuitions (i.e. a document with positive sentiment will contain more words that have a prior positive sentiment)
• n-gram features to capture contextual/topical information in NLP tasks
• POS tags to capture stylistic information
• What features would be useful to determine sentence boundaries?
• How about correlated features?
How do we learn the parameters of LR?
Cross-entropy loss function
• Why do we need a loss function?
  $L(\hat{y}, y)$ = how much does our prediction $\hat{y}$ differ from $y$
• What function can we use for L?
  • MSE (mean squared error), used in regression, is very hard to optimize for probabilistic output
  • Conditional maximum likelihood?
    • Choose w, b such that they maximize the log probability of the true labels in the training data (the negative log likelihood loss is also called the cross-entropy loss)
Cross-entropy loss function
• The binary labelling case can be expressed in terms of the Bernoulli distribution:
  $p(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1-y}$
  $\log p(y \mid x) = \log\left[\hat{y}^{\,y} (1 - \hat{y})^{1-y}\right] = y \log \hat{y} + (1 - y) \log(1 - \hat{y})$
• This is the log likelihood that should be maximized, so that w, b maximize the probability of the predicted labels being close to the true labels
Cross-entropy loss function
• Since we want a loss function to minimize, we flip the sign of the log likelihood:
  $L_{CE}(\hat{y}, y) = -\log p(y \mid x) = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$
  $L_{CE}(w, b) = -[y \log \sigma(w \cdot x + b) + (1 - y) \log(1 - \sigma(w \cdot x + b))]$
  where $\hat{y} = \sigma(w \cdot x + b)$ is the LR model
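A small sketch of the per-example loss, assuming $\hat{y}$ has already been computed by the model; the two sample calls illustrate how the loss rewards confident correct predictions and punishes confident wrong ones:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Binary cross-entropy for a single example.

    y is the gold label (0 or 1); y_hat = sigma(w . x + b) is the model's
    predicted probability of the positive class.
    """
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy(0.9, 1))  # ~0.105: confident and correct, small loss
print(cross_entropy(0.9, 0))  # ~2.303: confident and wrong, large loss
```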
Cross-entropy loss function
• Why does it make sense to minimize the negative log likelihood?
• A perfect classifier would assign a probability close to 1 to the correct class (y=1 or y=0)
• The closer our prediction for the correct class is to 1 the better the classifier; vice versa, the closer it is to zero the worse it is
• The loss goes to zero for perfect classification, whereas it goes to infinity when we get everything wrong (log 0)
• Since $\hat{y}$ and $1 - \hat{y}$ sum to one, raising the probability of the correct label necessarily comes at the expense of the wrong label
Cross-entropy loss function
• Loss function for the entire training set:
  $\mathrm{Cost}(w, b) = \frac{1}{m} \sum_{i=1}^{m} L_{CE}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma(w \cdot x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w \cdot x^{(i)} + b)\right) \right]$
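Vectorized over the whole training set, the average cost could be computed roughly as follows (the matrix shapes are an assumption about how the data is stored):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Average cross-entropy over a training set.

    X is an (m, k) matrix of feature vectors, y an (m,) vector of 0/1 labels.
    """
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```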
How can we find the minimum?
Gradient Descent – GD
• Optimal parameters for our loss function:
  $\hat{\theta} = \arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} L_{CE}(y^{(i)}, x^{(i)}; \theta)$
• GD finds the minimum of a function by figuring out in which direction in the parameter space the function’s slope is rising most steeply and moving in the opposite direction
• In the case of convex functions, GD finds the global optimum (minimum)
• The cross-entropy loss is a convex function
Gradient Descent – GD
Gradient Descent – GD
• GD computes the gradient of the loss function at the current point and then moves in the opposite direction, so that the loss function is minimized
• The magnitude of the move in gradient descent is determined by the value of the slope (or derivative), weighted by some learning rate
• In the case of a function with one parameter:
  $w^{t+1} = w^{t} - \eta \frac{d}{dw} f(x; w)$
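As an illustration of this one-parameter update rule, a toy example that minimizes $(w-3)^2$; the objective, learning rate, and step count are made up and not part of the slides:

```python
# Toy objective f(w) = (w - 3)^2, minimized at w = 3.
def grad_f(w):
    return 2.0 * (w - 3.0)  # df/dw

w, eta = 0.0, 0.1
for t in range(50):
    w = w - eta * grad_f(w)  # w_{t+1} = w_t - eta * df/dw

print(w)  # close to 3.0, the minimum
```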
Gradient Descent - GD
However, as we approach the minimum the gradient becomes smaller and smaller, so there is no need to adaptively shrink the learning rate: the steps naturally get smaller as the slope becomes less steep.
Figure: gradient descent with small (top) and large (bottom) learning rates. Source: Andrew Ng’s Machine Learning course on Coursera
Gradient Descent – GD
• The cross-entropy loss function has many parameters whose optimal values GD needs to find, so we operate in an N-dimensional parameter space
• The gradient expresses the directional components of the sharpest slope along each of those N dimensions
Gradient Descent – GD
• Through GD we answer the question:
  • “How much would a small change in $w_i$ influence the total loss $L$?”
  $\nabla_{\theta} L(f(x;\theta), y) = \begin{bmatrix} \frac{\partial}{\partial w_1} L(f(x;\theta), y) \\ \frac{\partial}{\partial w_2} L(f(x;\theta), y) \\ \vdots \\ \frac{\partial}{\partial w_n} L(f(x;\theta), y) \end{bmatrix}$
  $\theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f(x;\theta), y)$
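One way to make the question “how much would a small change in $w_i$ influence the loss?” concrete is a finite-difference probe: nudge one weight by a tiny ε and divide the change in the cost by ε, which approximates the corresponding gradient component. A sketch under that assumption, with an invented toy dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Tiny made-up dataset, only to probe the loss surface.
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.5, 1.5]])
y = np.array([1.0, 0.0, 1.0])
w, b, eps = np.array([0.1, -0.2]), 0.0, 1e-6

for i in range(len(w)):
    w_plus = w.copy()
    w_plus[i] += eps
    # Finite-difference approximation of dCost/dw_i.
    print(i, (cost(w_plus, b, X, y) - cost(w, b, X, y)) / eps)
```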
Gradient Descent – GD
• GD in the case of the cross-entropy loss:
  $\mathrm{Cost}(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma(w \cdot x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w \cdot x^{(i)} + b)\right) \right]$
  $\frac{\partial\, \mathrm{Cost}(w, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left[ \sigma(w \cdot x^{(i)} + b) - y^{(i)} \right] x^{(i)}_j$
Gradient Descent – GD
• Use the following derivatives to derive the partial derivative of the cross-entropy loss function:
  $\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$
  $\frac{d}{dx} \ln(x) = \frac{1}{x}$
  $\frac{\partial\, \mathrm{Cost}(w, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left[ \sigma(w \cdot x^{(i)} + b) - y^{(i)} \right] x^{(i)}_j$
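Putting the gradient formula to work, a bare-bones batch gradient-descent loop for logistic regression might look like the sketch below; the synthetic data, learning rate, and iteration count are illustrative choices, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data for illustration only: labels follow a known weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w > 0).astype(float)

w, b, eta = np.zeros(3), 0.0, 0.5

for step in range(500):
    y_hat = sigmoid(X @ w + b)
    error = y_hat - y                  # sigma(w . x + b) - y
    grad_w = X.T @ error / len(y)      # dCost/dw_j = (1/m) sum_i error_i * x_j^(i)
    grad_b = error.mean()              # same derivation with x_j = 1
    w -= eta * grad_w                  # theta_{t+1} = theta_t - eta * gradient
    b -= eta * grad_b

accuracy = ((sigmoid(X @ w + b) > 0.5).astype(float) == y).mean()
print(w, b, accuracy)
```

The bias update follows the same derivation as the weight update, with the corresponding "feature" fixed to 1.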
Gradient Descent – GD