

  1. Machine Learning - MT 2016 3. Maximum Likelihood Varun Kanade University of Oxford October 17, 2016

  2. Outline Probabilistic Perspective of Machine Learning ◮ Probabilistic Formulation of the Linear Model ◮ Maximum Likelihood Estimate ◮ Relation to the Least Squares Estimate

  3. Outline Probability Review Linear Regression and Maximum Likelihood Information, Entropy, KL Divergence

  4. Univariate Gaussian (Normal) Distribution The univariate normal distribution X ∼ N(µ, σ²) is defined by the density function
         p(x) = 1/(√(2π)σ) · exp(−(x − µ)²/(2σ²))
     Here µ is the mean and σ² is the variance.
         ∫ p(x) dx = 1,   ∫ x·p(x) dx = µ,   ∫ (x − µ)²·p(x) dx = σ²   (all integrals over (−∞, ∞))
     [Figure: Gaussian density with mean µ; the shaded area between a and b is Pr(a ≤ x ≤ b).]
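A quick numerical check of the density and the three integrals on the slide, as a minimal sketch using NumPy and SciPy (the values of µ and σ² below are arbitrary illustrations, not from the slides):

```python
import numpy as np
from scipy.integrate import quad

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.0, 4.0   # illustrative values
total, _ = quad(lambda x: gaussian_pdf(x, mu, sigma2), -np.inf, np.inf)
mean, _ = quad(lambda x: x * gaussian_pdf(x, mu, sigma2), -np.inf, np.inf)
var, _ = quad(lambda x: (x - mu) ** 2 * gaussian_pdf(x, mu, sigma2), -np.inf, np.inf)
print(total, mean, var)  # approximately 1, mu and sigma2
```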

  5. Sampling from a Gaussian distribution To sample from X ∼ N(µ, σ²), it suffices to sample Y ∼ N(0, 1) and set X = µ + σY (equivalently, Y = (X − µ)/σ). Cumulative distribution function:
         Φ(x; 0, 1) = ∫_{−∞}^{x} 1/√(2π) · exp(−t²/2) dt
     [Figure: the standard normal density and its CDF Φ; draw y ∼ Unif([0, 1]) and return a = Φ⁻¹(y).]
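The figure describes inverse-transform sampling: draw y ∼ Unif([0, 1]) and return Φ⁻¹(y), then rescale to the desired mean and standard deviation. A minimal sketch of that idea, assuming SciPy's norm.ppf for the inverse CDF:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sample_gaussian(mu, sigma, n):
    """Inverse-transform sampling: y ~ Unif([0,1]), z = Phi^{-1}(y), x = mu + sigma * z."""
    y = rng.uniform(0.0, 1.0, size=n)
    z = norm.ppf(y)              # standard normal samples via the inverse CDF
    return mu + sigma * z        # rescale N(0, 1) samples to N(mu, sigma^2)

samples = sample_gaussian(mu=2.0, sigma=1.5, n=100_000)
print(samples.mean(), samples.std())  # close to 2.0 and 1.5
```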

  6. Bivariate Normal (Gaussian) Distribution Suppose X₁ ∼ N(µ₁, σ₁²) and X₂ ∼ N(µ₂, σ₂²) are independent. The joint probability distribution p(x₁, x₂) is a bivariate normal distribution:
         p(x₁, x₂) = p(x₁) · p(x₂)
                   = 1/(√(2π)σ₁) · exp(−(x₁ − µ₁)²/(2σ₁²)) · 1/(√(2π)σ₂) · exp(−(x₂ − µ₂)²/(2σ₂²))
                   = 1/(2π(σ₁²σ₂²)^{1/2}) · exp(−[(x₁ − µ₁)²/(2σ₁²) + (x₂ − µ₂)²/(2σ₂²)])
                   = 1/(2π|Σ|^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
     where
         x = [x₁, x₂]ᵀ,   µ = [µ₁, µ₂]ᵀ,   Σ = [[σ₁², 0], [0, σ₂²]]
     Note: All equiprobable points lie on an ellipse.
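A small check that, for independent components, the product of the two univariate densities matches the bivariate density with diagonal Σ; a sketch with illustrative parameters:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu1, s1 = 0.0, 1.0   # X1 ~ N(mu1, s1^2)
mu2, s2 = 3.0, 2.0   # X2 ~ N(mu2, s2^2), independent of X1

mu = np.array([mu1, mu2])
Sigma = np.diag([s1 ** 2, s2 ** 2])   # independence gives a diagonal covariance matrix

x = np.array([0.7, 2.1])
joint = multivariate_normal(mu, Sigma).pdf(x)
product = norm(mu1, s1).pdf(x[0]) * norm(mu2, s2).pdf(x[1])
print(joint, product)   # equal up to floating-point error
```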

  7. Covariance and Correlation For random variables X and Y, the covariance measures how the random variables change jointly:
         cov(X, Y) = E[(X − E[X]) · (Y − E[Y])]
     Covariance depends on the scale. The (Pearson) correlation coefficient normalizes the covariance to give a value between −1 and +1:
         corr(X, Y) = cov(X, Y) / √(var(X) · var(Y)),
     where var(X) = E[(X − E[X])²] and var(Y) = E[(Y − E[Y])²].
     Independent variables are uncorrelated, but the converse is not true! (See the sketch below for an uncorrelated yet dependent pair.)
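A sketch of the "uncorrelated but not independent" point: with X standard normal and Y = X², Y is completely determined by X, yet their sample covariance and correlation are approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2                       # fully determined by x, hence dependent on x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr_xy = cov_xy / np.sqrt(x.var() * y.var())
print(cov_xy, corr_xy)           # both close to 0: uncorrelated yet not independent
```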

  8. Multivariate Gaussian Distribution Suppose x is a D-dimensional random vector. The covariance matrix consists of all pairwise covariances:
         cov(x) = E[(x − E[x])(x − E[x])ᵀ]
                = [ var(X₁)        cov(X₁, X₂)   ···   cov(X₁, X_D) ]
                  [ cov(X₂, X₁)    var(X₂)       ···   cov(X₂, X_D) ]
                  [     ⋮              ⋮          ⋱        ⋮        ]
                  [ cov(X_D, X₁)   cov(X_D, X₂)  ···   var(X_D)     ]
     If µ = E[x] and Σ = cov(x), the multivariate normal is defined by the density
         N(µ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
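A direct implementation of the density formula above, as a minimal sketch (the test point and parameters are made up purely for illustration):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at a D-dimensional point x."""
    D = len(mu)
    diff = x - mu
    quad_form = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
print(mvn_density(np.array([1.2, -0.8, 0.4]), mu, Sigma))
```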

  9. Suppose you are given three independent samples: x₁ = 0.3, x₂ = 1.4, and x₃ = 1.7. You know that the data is generated from either N(0, 1) or N(2, 1). Let θ represent the parameters (µ, σ) of the two distributions. Then the probability of observing the data with parameter θ is called the likelihood:
         p(x₁, x₂, x₃ | θ) = p(x₁ | θ) · p(x₂ | θ) · p(x₃ | θ)
     We have to choose between θ = (0, 1) and θ = (2, 1). Which one is more likely?
     [Figure: the two candidate densities (µ = 0 and µ = 2) with the samples x₁, x₂, x₃ marked on the x-axis.]
     Maximum Likelihood Estimation (MLE): pick the parameter θ that maximises the likelihood.
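The comparison on this slide can be carried out directly: compute the likelihood of the three samples under each candidate θ and pick the larger; a sketch using SciPy:

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.3, 1.4, 1.7])

def likelihood(x, mu, sigma):
    """Likelihood of i.i.d. samples under N(mu, sigma^2)."""
    return np.prod(norm(mu, sigma).pdf(x))

for mu in (0.0, 2.0):
    print(mu, likelihood(x, mu, 1.0))
# The likelihood is larger for mu = 2, so among the two candidates the MLE is theta = (2, 1).
```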

  10. Outline Probability Review Linear Regression and Maximum Likelihood Information, Entropy, KL Divergence

  11. Linear Regression Linear Model:
          y = w₀x₀ + w₁x₁ + ··· + w_D x_D + ε = w · x + ε
      Noise/uncertainty: model y given x, w as a random variable with mean wᵀx,
          E[y | x, w] = wᵀx
      We will be specific in choosing the distribution of y given x and w. Let us assume that given x, w, y is normal with mean wᵀx and variance σ²:
          p(y | w, x) = N(wᵀx, σ²),   i.e.   y = wᵀx + N(0, σ²)
      Alternatively, we may view this model as ε ∼ N(0, σ²) (Gaussian noise).
      Discriminative Framework: throughout this lecture, think of the inputs x₁, ..., x_N as fixed.
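To make the generative story concrete, a sketch that samples data from the model y = w · x + ε with ε ∼ N(0, σ²); the dimensions and the "true" parameters are hypothetical, chosen only for illustration (the first column of X is the constant feature x₀ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 200, 3
sigma_true = 0.5
w_true = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical weights, including the bias w_0

X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, D))])  # x_0 = 1 for the bias term
y = X @ w_true + sigma_true * rng.standard_normal(N)           # y = w.x + eps, eps ~ N(0, sigma^2)
```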

  12. Likelihood of Linear Regression (Gaussian Noise Model) Suppose we observe data (x_i, y_i), i = 1, ..., N. What is the likelihood of observing the data for model parameters w, σ?
      MLE Estimator: find parameters which maximise the likelihood (the product of the "likelihood density" segments).
      Least Squares Estimator: find parameters which minimise the sum of squares of the residuals (the sum of squares of the segments).

  13. Likelihood of Linear Regression (Gaussian Noise Model) Suppose we observe data (x_i, y_i), i = 1, ..., N. What is the likelihood of observing the data for model parameters w, σ?
          p(y₁, ..., y_N | x₁, ..., x_N, w, σ) = ∏_{i=1}^{N} p(y_i | x_i, w, σ)
      According to the model, y_i ∼ wᵀx_i + N(0, σ²), so
          p(y₁, ..., y_N | x₁, ..., x_N, w, σ) = ∏_{i=1}^{N} 1/√(2πσ²) · exp(−(y_i − wᵀx_i)²/(2σ²))
                                               = (1/(2πσ²))^{N/2} · exp(−1/(2σ²) · ∑_{i=1}^{N} (y_i − wᵀx_i)²)
      We want to find parameters w and σ that maximise the likelihood.

  14. Likelihood of Linear Regression (Gaussian Noise Model) Let us consider the likelihood p(y | X, w, σ):
          p(y₁, ..., y_N | x₁, ..., x_N, w, σ) = (1/(2πσ²))^{N/2} · exp(−1/(2σ²) · ∑_{i=1}^{N} (y_i − wᵀx_i)²)
      As log : ℝ⁺ → ℝ is an increasing function, we can instead maximise the log of the likelihood (called the log-likelihood), which results in a simpler mathematical expression:
          LL(y₁, ..., y_N | x₁, ..., x_N, w, σ) = −(N/2) log(2πσ²) − 1/(2σ²) · ∑_{i=1}^{N} (y_i − wᵀx_i)²
      In vector form,
          LL(y | X, w, σ) = −(N/2) log(2πσ²) − 1/(2σ²) · (Xw − y)ᵀ(Xw − y)
      Let’s first find the w that maximises the log-likelihood.
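The vectorised log-likelihood translates directly to code; a minimal sketch (reusing the synthetic X, y, w_true and sigma_true from the hypothetical data-generation sketch above):

```python
import numpy as np

def log_likelihood(w, sigma, X, y):
    """LL(y | X, w, sigma) = -(N/2) log(2 pi sigma^2) - (1/(2 sigma^2)) (Xw - y)^T (Xw - y)."""
    N = len(y)
    resid = X @ w - y
    return -0.5 * N * np.log(2 * np.pi * sigma ** 2) - resid @ resid / (2 * sigma ** 2)

# e.g. log_likelihood(w_true, sigma_true, X, y) exceeds the value at a perturbed w
```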

  15. Maximum Likelihood and Least Squares Estimates We’d like to find w that maximises the log-likelihood
          LL(y | X, w, σ) = −(N/2) log(2πσ²) − 1/(2σ²) · (Xw − y)ᵀ(Xw − y)
      Alternatively, we can minimise the negative log-likelihood
          NLL(y | X, w, σ) = 1/(2σ²) · (Xw − y)ᵀ(Xw − y) + (N/2) log(2πσ²)
      Recall the objective function we used for the least squares estimate in the previous lecture:
          L(w) = 1/(2N) · (Xw − y)ᵀ(Xw − y)
      For minimisation with respect to w, the two objectives are the same up to a constant additive and multiplicative factor!

  16. Maximum Likelihood Estimate for Linear Regression As the maximum likelihood estimator w_ML coincides with the least squares estimator, we have
          w_ML = (XᵀX)⁻¹ Xᵀ y
      Recall the form of the negative log-likelihood:
          NLL(y | X, w, σ) = 1/(2σ²) · (Xw − y)ᵀ(Xw − y) + (N/2) log(2πσ²)
      We can also find the maximum likelihood estimate for σ. Exercise on sheet 2: show that the MLE of σ is given by
          σ²_ML = (1/N) · (Xw_ML − y)ᵀ(Xw_ML − y)
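Both closed-form estimates fit in a few lines; a sketch that solves the normal equations rather than forming the inverse explicitly, which is the numerically preferable route:

```python
import numpy as np

def fit_gaussian_mle(X, y):
    """w_ML = (X^T X)^{-1} X^T y and sigma^2_ML = (1/N) (X w_ML - y)^T (X w_ML - y)."""
    w_ml = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
    resid = X @ w_ml - y
    sigma2_ml = resid @ resid / len(y)
    return w_ml, sigma2_ml

# w_ml, sigma2_ml = fit_gaussian_mle(X, y)   # X, y as in the earlier synthetic-data sketch
```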

  17. Prediction using the MLE for Linear Regression Given training data (x_i, y_i), i = 1, ..., N, we can obtain the MLE w_ML and σ_ML. On a new point x_new, we can use these to make a prediction and also give confidence intervals:
          ŷ_new = w_ML · x_new
          y_new ∼ ŷ_new + N(0, σ²_ML)
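A sketch of the prediction step: the point prediction is w_ML · x_new, and the Gaussian noise model gives a predictive interval whose width is proportional to σ_ML (1.96 standard deviations for an approximate 95% interval; the helper name is my own):

```python
import numpy as np

def predict_with_interval(x_new, w_ml, sigma2_ml, z=1.96):
    """Point prediction w_ML . x_new and an approximate 95% predictive interval."""
    y_hat = w_ml @ x_new
    half_width = z * np.sqrt(sigma2_ml)   # y_new ~ y_hat + N(0, sigma^2_ML)
    return y_hat, (y_hat - half_width, y_hat + half_width)
```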

  18. Summary: MLE for Linear Regression (Gaussian Noise) Model
      ◮ Linear model: y = w · x + ε
      ◮ Explicitly model ε ∼ N(0, σ²)
      Maximum Likelihood Estimation
      ◮ Every w, σ defines a probability distribution over the observed data
      ◮ Pick w and σ that maximise the likelihood of observing the data
      Algorithm
      ◮ As in the previous lecture, we have closed-form expressions
      ◮ The algorithm simply implements elementary matrix operations

  19. Outliers and Laplace Distribution If the data has outliers, we can model the noise using a distribution that has heavier tails. For the linear model y = w · x + ε, use ε ∼ Lap(0, b), where the density function for Lap(µ, b) is given by
          p(x) = 1/(2b) · exp(−|x − µ| / b)
      [Figure: Laplace and normal distributions with the same mean and variance.]
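A small comparison of the two densities at a few points, matching mean and variance as in the figure (Var[Lap(0, b)] = 2b², so the Gaussian gets σ² = 2b²); the values are purely illustrative:

```python
import numpy as np

def laplace_pdf(x, mu, b):
    """Density of Lap(mu, b)."""
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

b = 1.0
for x in (0.0, 2.0, 4.0, 6.0):
    print(x, laplace_pdf(x, 0.0, b), gaussian_pdf(x, 0.0, 2 * b ** 2))
# Far from the mean the Laplace density is larger: heavier tails, hence more tolerant of outliers.
```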

  20. Maximum Likelihood for Laplace Noise Model Given data (x_i, y_i), i = 1, ..., N, let us express the likelihood of observing the data in terms of the model parameters w and b:
          p(y₁, ..., y_N | x₁, ..., x_N, w, b) = ∏_{i=1}^{N} 1/(2b) · exp(−|y_i − wᵀx_i| / b)
                                               = 1/(2b)^N · exp(−(1/b) · ∑_{i=1}^{N} |y_i − wᵀx_i|)
      As in the case of the Gaussian noise model, we look at the negative log-likelihood:
          NLL(y | X, w, b) = (1/b) · ∑_{i=1}^{N} |y_i − wᵀx_i| + N log(2b)
      Thus, the maximum likelihood estimate in this case can be obtained by minimising the sum of the absolute values of the residuals, which is the same objective we discussed in the last lecture in the context of fitting a linear model that is robust to outliers.
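There is no closed form for the minimiser of the sum of absolute residuals; a minimal sketch using a general-purpose optimiser from SciPy (a linear-programming or iteratively-reweighted formulation would also work):

```python
import numpy as np
from scipy.optimize import minimize

def fit_laplace_mle(X, y):
    """MLE of w under Laplace noise: minimise sum_i |y_i - w^T x_i|."""
    def objective(w):
        return np.abs(X @ w - y).sum()
    w0 = np.linalg.solve(X.T @ X, X.T @ y)          # least-squares solution as a starting point
    result = minimize(objective, w0, method="Nelder-Mead")
    return result.x

# w_lap = fit_laplace_mle(X, y)   # X, y as in the earlier synthetic-data sketch
```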

  21. Outline Probability Review Linear Regression and Maximum Likelihood Information, Entropy, KL Divergence
