Machine Learning: Estimation
Hamid R. Rabiee
Spring 2015
http://ce.sharif.edu/courses//93-94/2/ce717-1/
Agenda
Introduction
Maximum Likelihood Estimation
Maximum A Posteriori Estimation
Bayesian Estimators
Density Estimation
Model the probability distribution p(x) of a random variable x, given a finite set x1, ..., xN of observations.
A good estimator is:
Unbiased: the sampling distribution of the estimator is centered on the true parameter value.
Efficient: it has the smallest possible standard error compared to other estimators.
Methods for parameter estimation:
Maximum Likelihood Estimation (MLE)
Maximum A Posteriori estimation (MAP)
Likelihood Function
Consider n independent observations of x: x_1, ..., x_n, where x follows f(x; θ).
The joint pdf for the whole data sample is:
f(x_1, ..., x_n; θ) = ∏_{i=1}^{n} f(x_i; θ)
Now evaluate this function at the data sample obtained and regard it as a function of the parameter(s). This is the likelihood function (with the x_i held constant):
L(θ | x_1, ..., x_n) = ∏_{i=1}^{n} f(x_i; θ)
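Not in the original slides: a minimal Python sketch, assuming a hypothetical N(θ, 1) model and simulated data, that evaluates the likelihood on a grid of θ values with the sample held fixed.

# Evaluate L(theta | x) on a grid of theta values; the data x stay fixed.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)      # hypothetical observed sample

def likelihood(theta, x):
    # L(theta | x) = prod_i f(x_i; theta) for the N(theta, 1) density
    dens = np.exp(-0.5 * (x[None, :] - theta[:, None]) ** 2) / np.sqrt(2 * np.pi)
    return dens.prod(axis=1)

thetas = np.linspace(0.0, 4.0, 401)
L = likelihood(thetas, x)
print("theta maximizing L on the grid:", thetas[np.argmax(L)])
print("sample mean (the analytic MLE):", x.mean())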
Maximum Likelihood Estimation (MLE)
Likelihood function: L(θ | x) = ∏_{i=1}^{n} f(x_i; θ)
For each sample point x, let θ̂(x) be the parameter value at which L(θ | x) attains its maximum as a function of θ. The maximum likelihood estimator of θ based on a sample x is θ̂(x).
The MLE is the parameter point for which the observed sample is most likely.
Maximum Likelihood Estimation (MLE)
If the likelihood function is differentiable (in θ_i), possible candidates for the MLE are the values (θ_1, ..., θ_k) that solve:
∂L(θ | x)/∂θ_i = 0,  i = 1, ..., k   (equivalently, ∂ log L(θ | x)/∂θ_i = 0)
Note that these solutions are only candidates. To find the exact MLE we should also check the boundary of the parameter space and verify the second-order conditions for a maximum.
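Not in the original slides: a minimal sketch of solving the first-order condition numerically, assuming a hypothetical exponential(λ) model and simulated data; the MLE is found by minimizing the negative log-likelihood and compared with the closed form n / Σ x_i.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0 / 3.0, size=200)   # hypothetical data, true rate lambda = 3

def neg_log_likelihood(lam):
    # log L(lambda) = n*log(lambda) - lambda * sum(x_i)
    return -(x.size * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print("numerical MLE:", res.x)
print("analytic MLE (n / sum x_i):", x.size / x.sum())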
Example 1
Adapted from slides of Harvard University.
Example 2: MLE for a Gaussian with unknown mean
Let x_1, x_2, ..., x_n be iid samples from N(θ, 1). Find the MLE of θ.
Solution:
log L(θ) = −(n/2) log(2π) − (1/2) ∑_{i=1}^{n} (x_i − θ)^2
d log L(θ)/dθ = ∑_{i=1}^{n} (x_i − θ) = 0  ⇒  θ̂ = (1/n) ∑_{i=1}^{n} x_i = x̄
Maximum Likelihood Estimation (MLE)
Sometimes it is more convenient to use the log-likelihood.
Let x_1, x_2, ..., x_n be iid samples from Bernoulli(p); then the likelihood function is:
L(p) = ∏_{i=1}^{n} p^{x_i} (1 − p)^{1 − x_i} = p^{∑ x_i} (1 − p)^{n − ∑ x_i}
Maximizing log L(p) gives p̂ = (1/n) ∑_{i=1}^{n} x_i.
Invariance: if θ̂ is the MLE of θ, then for any function τ(θ) the MLE of τ(θ) is τ(θ̂).
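Not in the original slides: a minimal sketch with hypothetical data showing the Bernoulli MLE p̂ = mean(x) and the invariance property applied to the odds τ(p) = p/(1 − p).

import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(n=1, p=0.3, size=500)           # iid Bernoulli(0.3) sample

p_hat = x.mean()                                 # maximizes p^sum(x) * (1-p)^(n-sum(x))
odds_hat = p_hat / (1.0 - p_hat)                 # MLE of tau(p) = p/(1-p) by invariance
print("p_hat =", p_hat, " odds_hat =", odds_hat)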
Example 3: MLE for a Gaussian with unknown mean and variance
Let x_1, x_2, ..., x_N be iid samples from N(μ, σ²). Find the MLE for θ = (μ, σ²).
Solution:
log L(μ, σ²) = −(N/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^{N} (x_i − μ)^2
Setting the partial derivatives to zero gives:
μ̂ = (1/N) ∑_{i=1}^{N} x_i = x̄
σ̂² = (1/N) ∑_{i=1}^{N} (x_i − x̄)^2
Exercise: prove that the MLE for the variance of a Gaussian is biased (a simulation sketch follows below).
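Not in the original slides: a simulation sketch (not a proof), with hypothetical settings, illustrating that the MLE of the Gaussian variance has expectation ((N − 1)/N) σ² rather than σ².

import numpy as np

rng = np.random.default_rng(3)
N, sigma2, trials = 5, 4.0, 100_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
mle_var = samples.var(axis=1, ddof=0)            # ddof=0 divides by N: the MLE
print("mean of MLE variance:", mle_var.mean())   # close to (N-1)/N * sigma2 = 3.2
print("true variance       :", sigma2)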
Property of MLE
To use bivariate calculus to verify that a function H(θ_1, θ_2) has a maximum at (θ̂_1, θ̂_2), it must be shown that the following three conditions hold:
a) The first-order partial derivatives are zero: ∂H/∂θ_1 = ∂H/∂θ_2 = 0 at (θ̂_1, θ̂_2).
b) At least one second-order partial derivative is negative: ∂²H/∂θ_1² < 0 or ∂²H/∂θ_2² < 0.
c) The determinant of the matrix of second-order derivatives (the Hessian) is positive:
(∂²H/∂θ_1²)(∂²H/∂θ_2²) − (∂²H/∂θ_1 ∂θ_2)² > 0
Example 4: MLE for the multinomial distribution (hint: use Lagrange multipliers)
Observed counts N_1, ..., N_k with ∑_i N_i = N give the likelihood
L(θ_1, ..., θ_k) ∝ ∏_{i=1}^{k} θ_i^{N_i},  subject to ∑_{i=1}^{k} θ_i = 1.
Solution: maximize log L subject to the constraint (next slide).
MLE: Multinomial Distribution
Introduce a Lagrange multiplier λ and maximize
∑_i N_i log θ_i + λ (1 − ∑_i θ_i).
Setting the derivative with respect to θ_i to zero gives N_i/θ_i = λ, so θ_i = N_i/λ; the constraint ∑_i θ_i = 1 then gives λ = ∑_i N_i = N.
Hence θ̂_i = N_i / N, the relative frequency of category i.
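Not in the original slides: a minimal sketch with hypothetical counts showing that the relative frequencies maximize the multinomial log-likelihood; a small perturbation check confirms the log-likelihood does not increase when probability mass is shifted between categories.

import numpy as np

counts = np.array([12, 30, 8, 50])               # hypothetical N_1..N_k
theta_hat = counts / counts.sum()                # theta_hat_i = N_i / N

def log_lik(theta):
    return np.sum(counts * np.log(theta))

print("theta_hat:", theta_hat, " log L:", log_lik(theta_hat))
eps = 0.01                                       # move mass from category 2 to category 1
perturbed = theta_hat + np.array([eps, -eps, 0.0, 0.0])
print("perturbed log L:", log_lik(perturbed))    # should be smaller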
Example 5: MLE for the uniform distribution U(0, θ)
Solution: write the likelihood with an indicator function:
L(θ) = ∏_{i=1}^{n} (1/θ) I(0 ≤ x_i ≤ θ) = θ^{−n} I(θ ≥ max_i x_i)
L(θ) is decreasing in θ on θ ≥ max_i x_i and zero below it, so the maximum is attained at θ̂ = max(x_1, ..., x_n).
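Not in the original slides: a minimal sketch with simulated data showing that the U(0, θ) likelihood is zero below the sample maximum and decreasing above it, so the MLE is the sample maximum.

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 7.5, size=50)               # hypothetical true theta = 7.5

def likelihood(theta):
    # theta^(-n) if the indicator I(theta >= max_i x_i) holds, else 0
    return theta ** (-x.size) if theta >= x.max() else 0.0

print("MLE (sample max):", x.max())
print("L at MLE        :", likelihood(x.max()))
print("L slightly below:", likelihood(x.max() - 0.1))   # 0: violates the indicator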
Maximum A Posteriori Estimation
Approximation: instead of averaging over all parameter values, consider only the most probable value (i.e., the value with the highest posterior probability).
Usually a very good approximation, and much simpler.
MAP value ≠ expected value.
MAP → ML for infinite data (as long as the prior is nonzero everywhere).
Given a set of observations D and a prior distribution on the parameters, find the parameter vector that maximizes p(D | θ) p(θ):
θ̂_MAP = argmax_θ p(D | θ) p(θ)
Maximum A Posteriori Estimation
Priors:
Uninformative priors: uniform distribution.
Conjugate priors: closed-form representation of the posterior; P(θ) and P(θ|D) have the same form.
Distribution      Conjugate prior
Binomial          Beta
Multinomial       Dirichlet
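Not in the original slides: a minimal sketch of a conjugate update with hypothetical numbers. With a Beta(a, b) prior on the Bernoulli/Binomial parameter, the posterior after h successes in n trials is Beta(a + h, b + n − h), and its mode gives the MAP estimate.

a, b = 2.0, 2.0                                  # prior pseudo-counts
n, h = 20, 14                                    # hypothetical data: 14 heads in 20 flips

post_a, post_b = a + h, b + (n - h)              # posterior stays in the Beta family
map_estimate = (post_a - 1) / (post_a + post_b - 2)   # mode of Beta(post_a, post_b)
mle_estimate = h / n
print("posterior: Beta(%.0f, %.0f)" % (post_a, post_b))
print("MAP:", map_estimate, " MLE:", mle_estimate)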
MAP vs. MLE
Adapted from slides of A. Zisserman.
MAP vs. MLE
MLE: choose the value that maximizes the probability of the observed data, θ̂_MLE = argmax_θ p(D | θ). It can suffer from overfitting.
MAP: choose the value that is most probable given the observed data and the prior belief, θ̂_MAP = argmax_θ p(D | θ) p(θ). It can avoid overfitting.
When are MAP and MLE the same?
Example 6: MAP for a Gaussian with unknown mean and a Gaussian prior
Let x_1, x_2, ..., x_N be iid samples from N(μ, σ²) with σ² known, and let the prior on μ be N(μ_0, σ_0²). Find the MAP estimate of μ.
Solution: maximizing log p(x_1, ..., x_N | μ) + log p(μ) over μ gives
μ̂_MAP = (σ_0² ∑_{i=1}^{N} x_i + σ² μ_0) / (N σ_0² + σ²),
a weighted combination of the sample mean and the prior mean; as N → ∞ it approaches the MLE x̄ (see the sketch below).
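Not in the original slides: a minimal sketch with simulated data that checks the closed-form MAP estimate above against a brute-force grid search over log p(D | μ) + log p(μ).

import numpy as np

rng = np.random.default_rng(5)
mu0, sigma0_2 = 0.0, 1.0                          # prior N(mu0, sigma0^2)
sigma2, N = 4.0, 10                               # known likelihood variance
x = rng.normal(2.5, np.sqrt(sigma2), size=N)      # hypothetical observations

mu_map = (sigma0_2 * x.sum() + sigma2 * mu0) / (N * sigma0_2 + sigma2)

grid = np.linspace(-5, 5, 10_001)
log_post = (-0.5 * ((x[None, :] - grid[:, None]) ** 2).sum(axis=1) / sigma2
            - 0.5 * (grid - mu0) ** 2 / sigma0_2)
print("closed-form MAP:", mu_map, " grid argmax:", grid[np.argmax(log_post)])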
Bayes Estimators
Suppose that we have a prior distribution for θ: π(θ).
Let f(x | θ) be the sampling distribution; then the conditional distribution of θ given the sample x is:
π(θ | x) = f(x | θ) π(θ) / m(x)
where m(x) is the marginal distribution of x:
m(x) = ∫ f(x | θ) π(θ) dθ
Example 7: Bayesian estimation for a Gaussian with unknown mean and a Gaussian prior
Let x_1, ..., x_N be iid samples from x_t ~ N(θ, σ_0²) with σ_0² known, and let the prior be θ ~ N(μ, σ²).
Solution: the posterior is Gaussian,
π(θ | x_1, ..., x_N) = N( (σ² ∑_t x_t + σ_0² μ) / (N σ² + σ_0²),  σ² σ_0² / (N σ² + σ_0²) ),
so the Bayes estimator (the posterior mean) shrinks the sample mean toward the prior mean μ.
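Not in the original slides: a minimal sketch with simulated data computing the posterior mean and variance from the closed form above; under squared-error loss, the posterior mean is the Bayes estimator.

import numpy as np

rng = np.random.default_rng(6)
mu, sigma2 = 0.0, 1.0                             # prior theta ~ N(mu, sigma^2)
sigma0_2, N = 2.0, 15                             # known observation variance
x = rng.normal(1.8, np.sqrt(sigma0_2), size=N)    # hypothetical observations

post_var = 1.0 / (N / sigma0_2 + 1.0 / sigma2)    # posterior precision is the sum of precisions
post_mean = post_var * (x.sum() / sigma0_2 + mu / sigma2)
print("posterior mean (Bayes estimator):", post_mean)
print("posterior variance              :", post_var)
print("compare: MLE (sample mean)      :", x.mean())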
Bayesian Estimators
Both ML and MAP return only a single, specific value for the parameter Θ.
Bayesian estimation, by contrast, computes the full posterior distribution P(Θ | X).
If the prior is well-behaved (i.e., it does not assign zero density to any feasible parameter value), then both the MLE and the Bayesian prediction converge to the same value as the number of training samples increases.
Any Questions?
End of Lecture 2
Thank you!
Spring 2015
http://ce.sharif.edu/courses//93-94/2/ce717-1/