COMP24111: Machine Learning and Optimisation, Chapter 3: Logistic Regression


  1. COMP24111: Machine Learning and Optimisation
     Chapter 3: Logistic Regression
     Dr. Tingting Mu
     Email: tingting.mu@manchester.ac.uk

  2. Outline
  • Understand the concept of likelihood.
  • Know some simple ways to build a likelihood function for classification and regression.
  • Understand the logistic regression model.
  • Understand the Newton-Raphson update and iterative reweighted least squares.
  • Understand the linear basis function model (a nonlinear model).

  3. Linear Regression: Least Square (Chapter 2)
  • The model assumes a linear relationship between the input variables and the estimated output variable:
    \hat{y} = w^T \tilde{x}
  • Model parameters are fitted by minimising the sum-of-squares error.
  • Is there a different way to interpret this?
  [Figure: training samples and the fitted regression line, y plotted against x]
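As a reminder of how this fitting is done in practice, here is a minimal sketch (not part of the original slides; the toy data and variable names are my own) that fits w by least squares with numpy:

```python
import numpy as np

# Toy 1-D training data (illustrative only, not from the slides).
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([2.1, 4.0, 5.9, 8.2, 9.8])

# Augment each input with a constant 1 so that w = [bias, slope]
# matches the notation y_hat = w^T x_tilde.
X_tilde = np.column_stack([np.ones_like(x), x])

# Least-squares solution of min_w sum_i (y_i - w^T x_tilde_i)^2.
w, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

y_hat = X_tilde @ w             # fitted outputs
sse = np.sum((y - y_hat) ** 2)  # the sum-of-squares error being minimised
print(w, sse)
```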

  4. Probabilistic View
  • Assume the output variable is a random variable.
  • It is generated by adding noise to a linear function:
    y = f(x) + noise, i.e. y = w^T \tilde{x} + noise
  • What is the chance we observe this sample?
  • Optimise w by maximising the chance of observing the training samples.
  [Figure: training samples with the fitted line, y plotted against x]

  5. Likelihood
  • In informal contexts, likelihood means probability.
  • In statistics, it is a function of the parameters of a statistical model, computed from the given data.
  • A more formal definition: the likelihood of a set of parameter values (w) given the observed data (x) is the probability assumed for the observed data given those parameter values:
    Likelihood(w \mid x) = p(x \mid w), written L(w) for simplification.
  • Maximum likelihood estimator: the model parameters are optimised so that the probability of observing the training data is maximised.
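To make the definition concrete, here is a small sketch of my own (the coin-flip data are invented for illustration) that evaluates a Bernoulli likelihood over a grid of parameter values, with the observed data held fixed:

```python
import numpy as np

# Observed coin flips (1 = head, 0 = tail); illustrative data only.
flips = np.array([1, 1, 0, 1, 0, 1, 1, 1])

def likelihood(theta, data):
    """L(theta) = p(data | theta) for independent Bernoulli flips."""
    return np.prod(theta**data * (1 - theta)**(1 - data))

# The likelihood is a function of the parameter theta, data held fixed.
thetas = np.linspace(0.01, 0.99, 99)
L = np.array([likelihood(t, flips) for t in thetas])

# The maximum likelihood estimate is (up to grid resolution) the fraction of heads.
print(thetas[np.argmax(L)], flips.mean())
```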

  6. Maximum Likelihood for Linear Regression
  • The output variable is a random variable: y = w^T \tilde{x} + noise.
  • The noise is a random variable. It follows a Gaussian distribution with zero mean (\mu = 0). The standard deviation \sigma quantifies the amount of variation of a set of data values.
  • Gaussian distribution with mean \mu and variance \sigma^2:
    N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
  [Figure: Gaussian densities for (\mu = 0, \sigma = 1), (\mu = 0, \sigma = 2) and (\mu = 1, \sigma = 1), from https://kanbanize.com/blog/normal-gaussian-distribution-over-cycle-time/]
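A minimal sketch of this generative view (my own illustration; the "true" weights and noise level are assumptions): outputs are produced by adding zero-mean Gaussian noise to a linear function, and the univariate density above can be evaluated directly:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2), as defined on the slide."""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

rng = np.random.default_rng(0)

# Illustrative 'true' parameters: y = w0 + w1*x + Gaussian noise.
w_true = np.array([1.0, 3.0])
sigma = 0.5

x = rng.uniform(1.0, 3.0, size=50)
noise = rng.normal(0.0, sigma, size=x.shape)
y = w_true[0] + w_true[1] * x + noise

# Density of the zero-mean noise at a few points, using the slide's formula.
print(gaussian_pdf(np.array([-1.0, 0.0, 1.0]), mu=0.0, sigma2=sigma**2))
```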

  7. Maximum Likelihood for Linear Regression
  • Because y = w^T \tilde{x} + noise, the output variable also follows a Gaussian distribution, and its mean is \mu = w^T \tilde{x}:
    p(y \mid x, w, \beta) = N(y \mid w^T \tilde{x}, \beta^{-1})
    where \beta is the noise precision (inverse variance), \beta^{-1} = \sigma^2.
  • Probability of observing the i-th training sample:
    p(y_i \mid x_i, w, \beta) = N(y_i \mid w^T \tilde{x}_i, \beta^{-1})
  • Probability of observing all the N training samples:
    p(Y \mid X, w, \beta) = \prod_{i=1}^{N} p(y_i \mid x_i, w, \beta) = \prod_{i=1}^{N} N(y_i \mid w^T \tilde{x}_i, \beta^{-1})

  8. Maximum Likelihood for Linear Regression
  • Likelihood function:
    L(w, \beta) = \prod_{i=1}^{N} N(y_i \mid w^T \tilde{x}_i, \beta^{-1}),
    where N(x \mid \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta (x - \mu)^2}{2} \right)
  • Log-likelihood function (taking the logarithm of the likelihood function):
    O(w, \beta) = \ln L(w, \beta) = \sum_{i=1}^{N} \ln N(y_i \mid w^T \tilde{x}_i, \beta^{-1})
                = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \frac{\beta}{2} \sum_{i=1}^{N} (y_i - w^T \tilde{x}_i)^2
    The final sum is the sum-of-squares error function from Chapter 2.
  • Optimising w by maximising the log-likelihood function is therefore equivalent to minimising the sum-of-squares error function, under the assumption of additive zero-mean Gaussian noise.
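As a quick numerical check of this equivalence (my own sketch; the toy data, the assumed known precision beta, and the use of scipy's general-purpose optimiser are all illustrative choices), the w that maximises the Gaussian log-likelihood matches the least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(1, 3, size=40)
X_tilde = np.column_stack([np.ones_like(x), x])
y = 1.0 + 3.0 * x + rng.normal(0, 0.5, size=x.shape)
beta = 1.0 / 0.5**2  # noise precision, assumed known here for simplicity

def neg_log_likelihood(w):
    """-O(w, beta) = -(N/2)ln(beta) + (N/2)ln(2*pi) + (beta/2)*SSE(w)."""
    resid = y - X_tilde @ w
    N = len(y)
    return -(N / 2) * np.log(beta) + (N / 2) * np.log(2 * np.pi) + (beta / 2) * np.sum(resid**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x   # maximise the log-likelihood
w_lsq, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)      # minimise the sum of squares
print(w_mle, w_lsq)  # the two solutions agree up to optimiser tolerance
```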

  9. Multivariate Gaussian Distribution
  • Multivariate Gaussian distribution with mean vector \mu and covariance matrix \Sigma (d dimensions):
    N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
  • Covariance measures the joint variability of two random variables: cov(x, y) = E[(x - E[x])(y - E[y])].
  • A bivariate example, where \rho is the correlation between x_1 and x_2: the vector (x_1, x_2) follows a Gaussian with mean vector [\mu_1, \mu_2]^T and covariance matrix [[\sigma_1^2, \rho\sigma_1\sigma_2], [\rho\sigma_1\sigma_2, \sigma_2^2]], with density
    \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} \right] \right)
  [Figure: density surfaces and contour plots for three bivariate cases]
    Case 1: \mu_1 = \mu_2 = 0, \sigma_1 = \sigma_2 = 1, \rho = 0
    Case 2: \mu_1 = \mu_2 = 0, \sigma_1 = \sigma_2 = 1, \rho = 0.5
    Case 3: \mu_1 = \mu_2 = 1, \sigma_1 = 0.2, \sigma_2 = 1, \rho = 0
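A small sketch of my own (not from the slides) evaluating the multivariate density above with numpy; the covariance matrices correspond to two of the bivariate cases listed for the figure:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) in d dimensions."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi)**d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.zeros(2)

# Case 1: sigma1 = sigma2 = 1, rho = 0 (identity covariance).
Sigma1 = np.array([[1.0, 0.0], [0.0, 1.0]])
# Case 2: sigma1 = sigma2 = 1, rho = 0.5 (correlated components).
Sigma2 = np.array([[1.0, 0.5], [0.5, 1.0]])

x = np.array([0.5, -0.5])
print(mvn_pdf(x, mu, Sigma1), mvn_pdf(x, mu, Sigma2))
```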

  10. Maximum Likelihood for Binary Classification (Gaussian Distribution)
  • The probability of observing a sample belonging to one of the two possible classes follows the Bernoulli distribution (a simple probabilistic model for flipping coins).
    Flip a coin: p = \theta_1^{y} \theta_2^{1-y}, i.e. p = \theta_1 if y = 1 (head) and p = \theta_2 if y = 0 (tail).
    For a labelled sample: p(x, y) = \theta(x,1)^{y} \theta(x,0)^{1-y}, i.e. \theta(x,1) if y = 1 and \theta(x,0) if y = 0.
  • Samples from each class are random variables following a Gaussian distribution, with prior class probabilities:
    \theta(x,1) = p(C_1) p(x \mid C_1) = \alpha N(x \mid \mu_1, \Sigma),   where p(C_1) = \alpha
    \theta(x,0) = p(C_2) p(x \mid C_2) = (1 - \alpha) N(x \mid \mu_2, \Sigma),   where p(C_2) = 1 - \alpha
  • Assume the two classes have different mean vectors (\mu_1 and \mu_2) but the same covariance matrix \Sigma.

  11. Maximum Likelihood for Binary Classification (Gaussian Distribution)
  • Likelihood function:
    L(\alpha, \mu_1, \mu_2, \Sigma) = \prod_{i=1}^{N} p(x_i, y_i) = \prod_{i=1}^{N} \left[ \alpha N(x_i \mid \mu_1, \Sigma) \right]^{y_i} \left[ (1 - \alpha) N(x_i \mid \mu_2, \Sigma) \right]^{1 - y_i}
  • Log-likelihood function:
    O(\alpha, \mu_1, \mu_2, \Sigma) = \ln L(\alpha, \mu_1, \mu_2, \Sigma)
      = \sum_{i=1}^{N} \left[ y_i \ln \alpha + (1 - y_i) \ln(1 - \alpha) \right] + \sum_{i=1}^{N} y_i \ln N(x_i \mid \mu_1, \Sigma) + \sum_{i=1}^{N} (1 - y_i) \ln N(x_i \mid \mu_2, \Sigma)

  12. Maximum Likelihood for Binary Classification (Gaussian Distribution)
  • We need to decide the optimal setting of the following model parameters:
    \alpha: class prior; \mu_1: mean vector of class 1; \mu_2: mean vector of class 2; \Sigma: shared covariance matrix for both classes.
  • Optimal parameters are obtained by setting the gradients of the log-likelihood to zero:
    \frac{\partial O}{\partial \alpha} = 0 \Rightarrow \alpha^* = \frac{N_1}{N}
    (the prior probability of a class is simply the fraction of the training samples in that class)
    \frac{\partial O}{\partial \mu_1} = 0 \Rightarrow \mu_1^* = \frac{1}{N_1} \sum_{i=1}^{N} y_i x_i
    \frac{\partial O}{\partial \mu_2} = 0 \Rightarrow \mu_2^* = \frac{1}{N_2} \sum_{i=1}^{N} (1 - y_i) x_i
    (the mean vector of each class is simply the average of the training samples in that class)
    \frac{\partial O}{\partial \Sigma} = 0 \Rightarrow \Sigma^* = \frac{N_1}{N} \Sigma_1 + \frac{N_2}{N} \Sigma_2,  where \Sigma_C = \frac{1}{N_C} \sum_{i \in \text{class } C} (x_i - \mu_C)(x_i - \mu_C)^T,  C = 1, 2
    (the covariance matrix is a weighted average of the covariance matrices associated with each of the two classes)
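These closed-form estimates translate directly into code. A minimal sketch of my own (the synthetic two-class data and variable names are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 2-D data: class 1 (y = 1) and class 2 (y = 0).
X1 = rng.normal(loc=[2.0, 2.0], scale=0.8, size=(20, 2))
X2 = rng.normal(loc=[-1.0, 0.0], scale=0.8, size=(20, 2))
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(len(X1)), np.zeros(len(X2))])

N, N1, N2 = len(y), y.sum(), (1 - y).sum()

# Closed-form maximum likelihood estimates from the slide.
alpha = N1 / N                                   # class prior
mu1 = (y[:, None] * X).sum(axis=0) / N1          # mean of class 1
mu2 = ((1 - y)[:, None] * X).sum(axis=0) / N2    # mean of class 2

def class_cov(Xc, mu):
    """Per-class covariance (1/N_C) * sum (x_i - mu)(x_i - mu)^T."""
    diffs = Xc - mu
    return diffs.T @ diffs / len(Xc)

# Shared covariance: weighted average of the per-class covariances.
Sigma = (N1 / N) * class_cov(X[y == 1], mu1) + (N2 / N) * class_cov(X[y == 0], mu2)
print(alpha, mu1, mu2, Sigma)
```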

  13. Example: Binary Classification
  • 20 training samples from class A, each characterised by 2 features; 20 training samples from class B, each characterised by 2 features.
  [Figure: training samples and the separation boundary]
  • Red region: p(y = class A, x \mid \alpha^*, \mu_1^*, \mu_2^*, \Sigma^*) < p(y = class B, x \mid \alpha^*, \mu_1^*, \mu_2^*, \Sigma^*)
  • Blue region: p(y = class A, x \mid \alpha^*, \mu_1^*, \mu_2^*, \Sigma^*) \geq p(y = class B, x \mid \alpha^*, \mu_1^*, \mu_2^*, \Sigma^*)
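The boundary in the figure comes from comparing the two joint probabilities. A short sketch of my own showing the decision rule (the fitted parameter values below are placeholders standing in for \alpha^*, \mu_1^*, \mu_2^*, \Sigma^*):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed fitted parameters, standing in for the maximum likelihood estimates.
alpha = 0.5
mu1 = np.array([2.0, 2.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[0.7, 0.1], [0.1, 0.7]])

def predict(x):
    """Return 'class A' if p(y = A, x) >= p(y = B, x), else 'class B'."""
    p_joint_A = alpha * multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
    p_joint_B = (1 - alpha) * multivariate_normal.pdf(x, mean=mu2, cov=Sigma)
    return "class A" if p_joint_A >= p_joint_B else "class B"

print(predict(np.array([1.5, 1.0])), predict(np.array([-0.5, 0.2])))
```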

  14.
  • We just used the following model to formulate a likelihood function for binary classification:
    p(x, y) = \theta(x,1)^{y} \theta(x,0)^{1-y}
    \theta(x,1) = \alpha N(x \mid \mu_1, \Sigma): the probability of observing (x, class 1)
    \theta(x,0) = (1 - \alpha) N(x \mid \mu_2, \Sigma): the probability of observing (x, class 2)
    L(\alpha, \mu_1, \mu_2, \Sigma) = \prod_{i=1}^{N} p(x_i, y_i)
  • Is there another way to formulate the likelihood function for classification?

  15. Logistic Regression: Binary Classification
  • Another way to construct the likelihood function. Given the class label y \in \{0, 1\}:
    p(y \mid x) = \theta(y=1 \mid x)^{y} \, \theta(y=0 \mid x)^{1-y} = \theta(y=1 \mid x)^{y} \left[ 1 - \theta(y=1 \mid x) \right]^{1-y}
    \theta(y=1 \mid x): given an observed sample x, the probability that it is from class 1.
    \theta(y=0 \mid x): given an observed sample x, the probability that it is from class 0.
    \theta(y=1 \mid x) + \theta(y=0 \mid x) = 1
    Likelihood = \prod_{i=1}^{N} p(y_i \mid x_i)
  • We directly model \theta(y=1 \mid x):
    \theta(y=1 \mid x) = \sigma(w^T \tilde{x}) = \frac{1}{1 + \exp(-w^T \tilde{x})}
    where \sigma(x) = \frac{1}{1 + \exp(-x)} is the logistic sigmoid function, and w^T \tilde{x} is a linear model as learned in Chapter 2.
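A minimal sketch of my own (the weights and labelled samples are assumed for illustration): the sigmoid applied to a linear score gives \theta(y=1 \mid x), and the likelihood of a labelled set is the product of the Bernoulli terms:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed weights [bias, w1, w2] and illustrative labelled samples.
w = np.array([-0.5, 1.2, -0.8])
X = np.array([[1.0, 0.5], [2.0, 1.5], [-1.0, 0.3]])
y = np.array([1, 1, 0])

X_tilde = np.column_stack([np.ones(len(X)), X])  # prepend the constant 1
theta1 = sigmoid(X_tilde @ w)                    # theta(y=1 | x) for each sample

# Likelihood = prod_i theta1_i^{y_i} * (1 - theta1_i)^{1 - y_i}
likelihood = np.prod(theta1**y * (1 - theta1)**(1 - y))
log_likelihood = np.sum(y * np.log(theta1) + (1 - y) * np.log(1 - theta1))
print(theta1, likelihood, log_likelihood)
```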
