Multiclass Logistic Regression, Multilayer Perceptron


  1. NPFL129, Lecture 4: Multiclass Logistic Regression, Multilayer Perceptron. Milan Straka, October 26, 2020. Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics.

  2. Logistic Regression

An extension of the perceptron, which models the conditional probabilities of $p(C_0 \mid x)$ and of $p(C_1 \mid x)$. Logistic regression can in fact handle also more than two classes, which we will see shortly.

Logistic regression employs the following parametrization of the conditional class probabilities:

$$p(C_1 \mid x) = \sigma(x^T w + b),$$
$$p(C_0 \mid x) = 1 - p(C_1 \mid x),$$

where $\sigma$ is a sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

It can be trained using the SGD algorithm.
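
A minimal sketch of this parametrization in NumPy; the weight, bias, and input values below are invented purely for illustration and are not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights, bias, and input chosen only to illustrate the formulas.
w, b = np.array([0.8, -1.2]), 0.3
x = np.array([1.5, 0.4])

p_C1 = sigmoid(x @ w + b)   # p(C_1 | x) = sigma(x^T w + b)
p_C0 = 1 - p_C1             # p(C_0 | x) = 1 - p(C_1 | x)
print(p_C1, p_C0)
```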

  3. Logistic Regression

To give some meaning to the sigmoid function, starting with

$$p(C_1 \mid x) = \sigma(y(x; w)) = \frac{1}{1 + e^{-y(x; w)}},$$

we can arrive at

$$y(x; w) = \log \frac{p(C_1 \mid x)}{p(C_0 \mid x)},$$

where the prediction $y(x; w)$ of the model is called a logit and is the logarithm of the odds of the two class probabilities.

  4. Logistic Regression

To train the logistic regression $y(x; w) = x^T w$, we use MLE (the maximum likelihood estimation). Note that $p(C_1 \mid x; w) = \sigma(y(x; w))$.

Therefore, the loss for a batch $X = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}$ is

$$L(X) = \frac{1}{N} \sum_i -\log p(C_{t_i} \mid x_i; w).$$

Input: input dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{0, +1\}^N$), learning rate $\alpha \in \mathbb{R}^+$.
- $w \leftarrow 0$
- until convergence (or until patience is over), process a batch of $N$ examples:
  - $g \leftarrow \nabla_w \frac{1}{N} \sum_i -\log p(C_{t_i} \mid x_i; w)$
  - $w \leftarrow w - \alpha g$
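
A sketch of this SGD procedure in NumPy; the synthetic dataset and the hyperparameters (learning rate, batch size, number of epochs) are illustrative assumptions, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic data: N examples with D features, binary targets in {0, 1}.
rng = np.random.default_rng(42)
N, D = 200, 3
X = rng.normal(size=(N, D))
t = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(D)                            # w <- 0
alpha, batch_size, epochs = 0.1, 32, 100   # illustrative hyperparameters

for _ in range(epochs):                    # stands in for "until convergence / patience"
    perm = rng.permutation(N)
    for start in range(0, N, batch_size):
        batch = perm[start:start + batch_size]
        p = sigmoid(X[batch] @ w)          # p(C_1 | x; w)
        # Gradient of the mean NLL over the batch: 1/|B| * X_B^T (p - t_B).
        g = X[batch].T @ (p - t[batch]) / len(batch)
        w -= alpha * g                     # w <- w - alpha * g

print(w)
```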

  5. Linearity in Logistic Regression

  6. Generalized Linear Models

The logistic regression is in fact an extended linear regression. A linear regression model which is followed by some activation function $a$ is called a generalized linear model:

$$p(t \mid x; w, b) = a(y(x; w, b)) = a(x^T w + b).$$

Name | Activation | Distribution | Loss | Gradient
linear regression | identity | ? | MSE $\propto \mathbb{E}\,(y(x) - t)^2$ | $(y(x) - t) \cdot x$
logistic regression | $\sigma(x)$ | Bernoulli | NLL $\propto \mathbb{E}\, -\log p(t \mid x)$ | $(a(y(x)) - t) \cdot x$

  7. Mean Square Error as MLE

During regression, we predict a number, not a real probability distribution. In order to generate a distribution, we might consider a distribution with the mean equal to the predicted value and a fixed variance $\sigma^2$; the most general such distribution is the normal distribution.

  8. Mean Square Error as MLE

Therefore, assume our model generates a distribution $p(t \mid x; w) = \mathcal{N}(t; y(x; w), \sigma^2)$. Now we can apply MLE and get

$$\arg\max_w p(X; w) = \arg\min_w \sum_{i=1}^N -\log p(t_i \mid x_i; w)$$
$$= \arg\min_w -\sum_{i=1}^N \log \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(t_i - y(x_i; w))^2}{2\sigma^2}}$$
$$= \arg\min_w -N \log(2\pi\sigma^2)^{-1/2} + \sum_{i=1}^N \frac{(t_i - y(x_i; w))^2}{2\sigma^2}$$
$$= \arg\min_w \sum_{i=1}^N \frac{(t_i - y(x_i; w))^2}{2\sigma^2} = \arg\min_w \frac{1}{N} \sum_{i=1}^N (t_i - y(x_i; w))^2.$$
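
A quick numerical check of this equivalence; the targets, predictions, and $\sigma^2$ below are arbitrary. For a fixed $\sigma^2$, the Gaussian NLL is an affine function of the MSE, so both have the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=50)      # arbitrary targets
y = rng.normal(size=50)      # arbitrary model predictions
sigma2 = 0.7                 # arbitrary fixed variance

# Gaussian negative log likelihood of the targets under N(t; y, sigma^2).
nll = np.sum(0.5 * np.log(2 * np.pi * sigma2) + (t - y) ** 2 / (2 * sigma2))
mse = np.mean((t - y) ** 2)

# NLL = N/2 * log(2*pi*sigma^2) + N / (2*sigma^2) * MSE.
N = len(t)
print(np.isclose(nll, N / 2 * np.log(2 * np.pi * sigma2) + N / (2 * sigma2) * mse))  # True
```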

  9. Generalized Linear Models

We have therefore extended the GLM table to:

Name | Activation | Distribution | Loss | Gradient
linear regression | identity | Normal | NLL $\propto$ MSE | $(y(x) - t) \cdot x$
logistic regression | $\sigma(x)$ | Bernoulli | NLL $\propto \mathbb{E}\, -\log p(t \mid x)$ | $(a(y(x)) - t) \cdot x$

  10. Multiclass Logistic Regression

To extend the binary logistic regression to a multiclass case with $K$ classes, we:

- generate $K$ outputs, each with its own set of weights, so that for $W \in \mathbb{R}^{D \times K}$,
  $$y(x; W) = x^T W, \quad \text{or in other words,} \quad y(x; W)_i = x^T W_{*, i},$$
- generalize the sigmoid function to a softmax function, such that
  $$\operatorname{softmax}(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}.$$

Note that the original sigmoid function can be written as

$$\sigma(x) = \operatorname{softmax}([x\ \ 0])_0 = \frac{e^x}{e^x + e^0} = \frac{1}{1 + e^{-x}}.$$

The resulting classifier is also known as multinomial logistic regression, maximum entropy classifier or softmax regression.
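
A small sketch of the softmax function and of the sigmoid-as-softmax identity above; the logit value is arbitrary:

```python
import numpy as np

def softmax(y):
    # Subtracting the maximum leaves the result unchanged (softmax is shift-invariant,
    # see the next slide) but avoids overflow in np.exp.
    e = np.exp(y - np.max(y))
    return e / e.sum()

x = 1.7                                  # arbitrary logit
print(softmax(np.array([x, 0.0]))[0])    # softmax([x, 0])_0 ...
print(1 / (1 + np.exp(-x)))              # ... equals sigma(x)
```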

  11. Multiclass Logistic Regression

From the definition of the softmax function

$$\operatorname{softmax}(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}},$$

it is natural to obtain the interpretation of the model outputs $y(x; W)$ as logits:

$$y(x; W)_i = \log p(C_i \mid x; W) + c.$$

The constant $c$ is present, because the output of the model is overparametrized (the probability of for example the last class could be computed from the remaining ones). This is connected to the fact that softmax is invariant to addition of a constant:

$$\operatorname{softmax}(y + c)_i = \frac{e^{y_i + c}}{\sum_j e^{y_j + c}} = \frac{e^{y_i}}{\sum_j e^{y_j}} \cdot \frac{e^c}{e^c} = \operatorname{softmax}(y)_i.$$
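
The shift invariance can be verified directly; subtracting $\max(y)$ before exponentiating (as in the previous sketch) relies on exactly this property. The vector and constant below are arbitrary:

```python
import numpy as np

def softmax(y):
    e = np.exp(y)
    return e / e.sum()

y = np.array([2.0, -1.0, 0.5])      # arbitrary logits
c = 7.3                             # arbitrary constant
print(np.allclose(softmax(y), softmax(y + c)))   # True: adding c does not change softmax
```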

  12. Multiclass Logistic Regression

The difference between the softmax and the sigmoid output can be compared on the binary case, where the binary logistic regression model outputs are

$$y(x; w) = \log \frac{p(C_1 \mid x; w)}{p(C_0 \mid x; w)},$$

while the outputs of the softmax variant with two outputs can be interpreted as

$$y(x; W)_0 = \log p(C_0 \mid x; W) + c \quad \text{and} \quad y(x; W)_1 = \log p(C_1 \mid x; W) + c.$$

If we consider $y(x; W)_0$ to be zero, the model can then predict only the probability $p(C_1 \mid x)$, and the constant $c$ is fixed to $-\log p(C_0 \mid x; W)$, recovering the original interpretation.

Therefore, we could produce only $K - 1$ outputs for $K$-class classification and define $y_K = 0$, resulting in the interpretation of the model outputs analogous to the binary case:

$$y(x; W)_i = \log \frac{p(C_i \mid x; W)}{p(C_K \mid x; W)}.$$

  13. Multiclass Logistic Regression

Using the softmax function, we naturally define

$$p(C_i \mid x; W) = \operatorname{softmax}(x^T W)_i = \frac{e^{(x^T W)_i}}{\sum_j e^{(x^T W)_j}}.$$

We can then use MLE and train the model using stochastic gradient descent.

Input: input dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{0, 1, \ldots, K - 1\}^N$), learning rate $\alpha \in \mathbb{R}^+$.
- $w \leftarrow 0$
- until convergence (or until patience is over), process a batch of $N$ examples:
  - $g \leftarrow \nabla_w \frac{1}{N} \sum_i -\log p(C_{t_i} \mid x_i; w)$
  - $w \leftarrow w - \alpha g$
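
A sketch of this multiclass training loop in NumPy; the synthetic dataset and the hyperparameters (learning rate, batch size, number of epochs) are illustrative assumptions, not part of the slides:

```python
import numpy as np

def softmax(Y):
    e = np.exp(Y - Y.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Synthetic data: N examples with D features, targets in {0, ..., K-1}.
rng = np.random.default_rng(1)
N, D, K = 300, 4, 3
X = rng.normal(size=(N, D))
t = np.argmax(X @ rng.normal(size=(D, K)), axis=1)

W = np.zeros((D, K))                       # W <- 0
alpha, batch_size, epochs = 0.1, 32, 50    # illustrative hyperparameters

for _ in range(epochs):                    # stands in for "until convergence / patience"
    perm = rng.permutation(N)
    for start in range(0, N, batch_size):
        b = perm[start:start + batch_size]
        P = softmax(X[b] @ W)              # p(C_i | x; W) for every example in the batch
        P[np.arange(len(b)), t[b]] -= 1    # softmax output minus one-hot target
        g = X[b].T @ P / len(b)            # gradient of the mean NLL w.r.t. W
        W -= alpha * g                     # W <- W - alpha * g

print((np.argmax(X @ W, axis=1) == t).mean())   # training accuracy
```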

  14. Multiclass Logistic Regression

Note that the decision regions of the binary/multiclass logistic regression are convex (and therefore connected).

To see this, consider $x_A$ and $x_B$ in the same decision region $R_k$. Any point $x$ lying on the line connecting them is their linear combination, $x = \lambda x_A + (1 - \lambda) x_B$, and from the linearity of $y(x) = W x$ it follows that

$$y(x) = \lambda y(x_A) + (1 - \lambda) y(x_B).$$

Given that $y_k(x_A)$ was the largest among $y(x_A)$ and also given that $y_k(x_B)$ was the largest among $y(x_B)$, it must be the case that $y_k(x)$ is the largest among all $y(x)$.
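
The linearity step used in the argument can be checked numerically; the weights, points, and $\lambda$ below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 4))                        # arbitrary D x K weights
x_A, x_B = rng.normal(size=3), rng.normal(size=3)  # two arbitrary points
lam = 0.3                                          # arbitrary lambda in [0, 1]

x = lam * x_A + (1 - lam) * x_B
# y(x) = lambda * y(x_A) + (1 - lambda) * y(x_B)
print(np.allclose(x @ W, lam * (x_A @ W) + (1 - lam) * (x_B @ W)))   # True
```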

  15. Generalized Linear Models

The multiclass logistic regression can now be added to the GLM table:

Name | Activation | Distribution | Loss | Gradient
linear regression | identity | Normal | NLL $\propto$ MSE | $(y(x) - t) \cdot x$
logistic regression | $\sigma(x)$ | Bernoulli | NLL $\propto \mathbb{E}\, -\log p(t \mid x)$ | $(a(y(x)) - t) \cdot x$
multiclass logistic regression | $\operatorname{softmax}(x)$ | categorical | NLL $\propto \mathbb{E}\, -\log p(t \mid x)$ | $(a(y(x)) - 1_t) \cdot x$

  16. Poisson Regression

There exist several other GLMs; we now describe a last one, this time for regression and not for classification. Compared to regular linear regression, where we assume the output distribution is normal, we turn our attention to the Poisson distribution.

Poisson Distribution

The Poisson distribution is a discrete distribution suitable for modeling the probability of a given number of events occurring in a fixed time interval, if these events occur with a known rate and independently of each other:

$$P(x = k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}.$$

It is easy to show that if $x$ has a Poisson distribution,

$$\mathbb{E}[x] = \lambda, \qquad \operatorname{Var}(x) = \lambda.$$
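
A quick sanity check of the PMF and of the mean and variance by sampling; $\lambda$ and the sample size are arbitrary illustrative choices:

```python
import numpy as np
from math import exp, factorial

lam = 3.5                                       # arbitrary rate

def pmf(k):
    # P(x = k; lambda) = lambda^k e^{-lambda} / k!
    return lam ** k * exp(-lam) / factorial(k)

print(sum(pmf(k) for k in range(100)))          # ~1.0: probabilities sum to one

samples = np.random.default_rng(3).poisson(lam, size=100_000)
print(samples.mean(), samples.var())            # both close to lambda
```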
