

  1. Review of Estimation Theory
     Berlin 2003
     References: 1. X. Huang et al., Spoken Language Processing, Chapter 3

  2. Introduction
     • Estimation theory is the most important theory and method in statistical inference
     • Statistical inference
       – Data generated in accordance with some unknown probability distribution must be analyzed
       – Some type of inference about the unknown distribution must be made, e.g., about the characteristics (parameters) of the distribution generating the experimental data, such as the mean and variance
     • Notation
       – Estimator: $\hat{\theta} = g(\mathbf{X})$, a function of the vector of random variables $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$
       – Estimate: $\hat{\theta} = g(\mathbf{x})$, the value of $g$ at the vector of sample values $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$
       – $\Phi$: the parameters of the distribution
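A minimal sketch of the estimator/estimate distinction (the Gaussian data and the function name `g` are my own illustration, not from the slides):

```python
import numpy as np

# Toy illustration: the sample mean g is an estimator (a function of
# random variables); applying it to observed sample values yields an
# estimate (a single number).
rng = np.random.default_rng(42)

def g(samples):
    """Estimator for the mean parameter of the generating distribution."""
    return np.mean(samples)

x = rng.normal(loc=10.0, scale=2.0, size=1000)  # observed sample values
print(g(x))                                      # the estimate, ~ 10
```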

  3. Introduction
     • Three common estimators (estimation methods)
       – Minimum mean square error estimator
         • Estimates the random variable itself
         • Function approximation, curve fitting, ...
       – Maximum likelihood estimator
         • Estimates the parameters of the distribution of the random variables
       – Bayes' estimator
         • Estimates the parameters of the distribution of the random variables (with the parameters themselves treated as random variables; see Bayesian estimation below)

  4. Minimum Mean Square Error Estimation and Least Square Error Estimation
     • There are two random variables $X$ and $Y$. When observing the value of $X$, we want to find a transform $\hat{Y} = g(X, \Phi)$ ($\Phi$: the parameter vector of the function $g$) to predict the value of $Y$
       – Minimum mean square error (MMSE) estimation, when the joint distribution $f_{X,Y}(X, Y)$ is known:
         $\Phi_{MMSE} = \arg\min_{\Phi} E\left[\left(Y - g(X, \Phi)\right)^2\right]$
       – Least square error (LSE) estimation, when $n$ sample pairs $(x_i, y_i)$ are observed:
         $\Phi_{LSE} = \arg\min_{\Phi} \sum_{i=1}^{n} \left[y_i - g(x_i, \Phi)\right]^2$
     • Based on the law of large numbers, when the joint probability is uniform or the number of samples approaches infinity, MMSE and LSE are equivalent (see the sketch below)
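A numerical sketch of the LSE-to-MMSE convergence claim (the one-parameter predictor $g(x, \Phi) = \Phi x$ and the simulated data are my own toy assumptions):

```python
import numpy as np

# As n grows, the empirical average squared error approaches
# E[(Y - g(X, Phi))^2] by the law of large numbers, so the LSE
# minimizer approaches the MMSE minimizer.
rng = np.random.default_rng(0)

def lse_objective(phi, x, y):
    """Average of the squared errors (y_i - phi * x_i)^2."""
    return np.mean((y - phi * x) ** 2)

phis = np.linspace(0.0, 4.0, 401)
for n in (10, 100_000):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)   # true slope 2
    best = phis[np.argmin([lse_objective(p, x, y) for p in phis])]
    print(n, best)   # the minimizer settles near 2 as n grows
```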

  5. Minimum Mean Square Error Estimation and Least Square Error Estimation
     • Constant functions $g(X) = c$
       – MMSE: $\nabla_c E\left[(Y - c)^2\right] = 0 \;\Rightarrow\; c_{MMSE} = E[Y]$ (the mean)
       – LSE: $\nabla_c \sum_{i=1}^{n} (y_i - c)^2 = 0 \;\Rightarrow\; c_{LSE} = \frac{1}{n}\sum_{i=1}^{n} y_i$ (the sample mean)
     • Linear functions $g(X) = aX + b$
       – MMSE:
         $\nabla_a E\left[\left(Y - (aX + b)\right)^2\right] = 0 \;\Rightarrow\; aE[X^2] + bE[X] - E[XY] = 0$
         $\nabla_b E\left[\left(Y - (aX + b)\right)^2\right] = 0 \;\Rightarrow\; aE[X] + b - E[Y] = 0$
         Solving the two equations gives
         $a = \frac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}, \qquad b = E[Y] - \rho_{XY}\frac{\sigma_Y}{\sigma_X}E[X]$
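The closed forms above, evaluated with sample moments in place of true expectations (a sketch; the simulated data and variable names are my own):

```python
import numpy as np

# Best constant and best linear predictors from sample moments.
rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=50_000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

c = y.mean()                                  # best constant: sample mean of Y
a = np.cov(x, y, bias=True)[0, 1] / x.var()   # a = cov(X, Y) / Var(X)
b = y.mean() - a * x.mean()                   # equivalently E[Y] - a E[X]
print(c, a, b)                                # a ~ 3, b ~ 1
```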

  6. Minimum Mean Square Error Estimation and Least Square Error Estimation
     • Linear functions – LSE
       • Suppose the $x_i$ are $d$-dimensional vectors and the $y_i$ are scalars. With a leading column of ones, the model in matrix form is
         $\hat{Y} = XA, \quad X = \begin{pmatrix} 1 & x_1^1 & \cdots & x_1^d \\ 1 & x_2^1 & \cdots & x_2^d \\ \vdots & \vdots & & \vdots \\ 1 & x_n^1 & \cdots & x_n^d \end{pmatrix}, \quad A = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_d \end{pmatrix}, \quad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$
       • The squared error and its gradient:
         $e(A) = \left\| Y - \hat{Y} \right\|^2 = \sum_{i=1}^{n} \left( A^t x_i - y_i \right)^2$
         $\nabla e(A) = \sum_{i=1}^{n} 2\left( A^t x_i - y_i \right) x_i = 2X^t\left( XA - Y \right) = 0$
         $\Rightarrow\; X^t X A = X^t Y \;\Rightarrow\; \hat{A} = \left( X^t X \right)^{-1} X^t Y$
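A sketch of the normal equations in NumPy (variable names and data are mine; `np.linalg.lstsq` is the numerically safer route in practice, and both are shown for comparison):

```python
import numpy as np

# Solve X^t X A = X^t Y for the LSE coefficient vector A = (a_0, ..., a_d)^t.
rng = np.random.default_rng(2)
n, d = 200, 3
x = rng.normal(size=(n, d))
y = x @ np.array([1.0, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=n)

X = np.hstack([np.ones((n, 1)), x])           # n x (d+1) design matrix
A = np.linalg.solve(X.T @ X, X.T @ y)         # explicit normal equations
A_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(A, A_lstsq)                             # both ~ (4, 1, -2, 0.5)
```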

  7. Maximum Likelihood Estimation (MLE/ML)
     • ML is the most widely used parametric estimation method
     • A set of random samples $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ is drawn independently according to a distribution with pdf $p(x \mid \Phi)$
       – Given a sequence of random samples $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, its likelihood is defined as $p_n(\mathbf{x} \mid \Phi)$, the joint pdf of $(x_1, x_2, \ldots, x_n)$:
         $p_n(\mathbf{x} \mid \Phi) = \prod_{k=1}^{n} p(x_k \mid \Phi)$, since $X_1, X_2, \ldots, X_n$ are i.i.d.
       – The maximum likelihood estimator of $\Phi$ is denoted
         $\Phi_{ML} = \arg\max_{\Phi} p_n(\mathbf{x} \mid \Phi) = \arg\max_{\Phi} \prod_{k=1}^{n} p(x_k \mid \Phi)$
       – Since the logarithm is a monotonically increasing function, the parameter set $\Phi_{ML}$ that maximizes the log-likelihood also maximizes the likelihood. The log-likelihood can be expressed as
         $l(\Phi) = \log p_n(\mathbf{x} \mid \Phi) = \sum_{k=1}^{n} \log p(x_k \mid \Phi)$
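A toy check that the likelihood product and the log-likelihood sum share the same maximizer (my own example, not from the slides); it also hints at why the log form is preferred: the raw product underflows quickly as $n$ grows.

```python
import numpy as np

# Likelihood is a product of per-sample pdfs; its log is a sum.
rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=1.0, size=20)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

mus = np.linspace(4.0, 6.0, 2001)
lik = np.array([np.prod(gauss_pdf(x, mu)) for mu in mus])
loglik = np.array([np.sum(np.log(gauss_pdf(x, mu))) for mu in mus])
print(mus[np.argmax(lik)], mus[np.argmax(loglik)])  # identical maximizer
```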

  8. Maximum Likelihood Estimation (MLE/ML)
     • If $p_n(\mathbf{x} \mid \Phi)$ is a differentiable function of $\Phi$, $\Phi_{ML}$ can be attained by taking the partial derivatives with respect to $\Phi$ and setting them to zero
       – Let $\Phi = (\Phi_1, \Phi_2, \ldots, \Phi_M)^t$ be an $M$-component parameter vector. Then
         $\nabla_{\Phi}\, l(\Phi) = \sum_{k=1}^{n} \nabla_{\Phi} \log p(x_k \mid \Phi) = 0, \qquad \nabla_{\Phi} = \left( \frac{\partial}{\partial \Phi_1}, \ldots, \frac{\partial}{\partial \Phi_M} \right)^t$
     • Example: $p(x \mid \Phi)$ is a univariate Gaussian pdf with parameter set $\Phi = (\mu, \sigma^2)$
       $p(x \mid \Phi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
       $\log p_n(\mathbf{x} \mid \Phi) = \sum_{k=1}^{n} \log p(x_k \mid \Phi) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{k=1}^{n} (x_k - \mu)^2$
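A quick numeric sanity check that the closed-form log-likelihood above equals the direct per-sample sum (the sample and parameter values are arbitrary):

```python
import numpy as np

# Verify: sum_k log p(x_k | mu, sigma2) == -(n/2) log(2 pi sigma2)
#         - (1 / (2 sigma2)) * sum_k (x_k - mu)^2
rng = np.random.default_rng(4)
x = rng.normal(size=100)
mu, sigma2 = 0.3, 1.5
n = x.size

direct = np.sum(np.log(np.exp(-(x - mu) ** 2 / (2 * sigma2))
                       / np.sqrt(2 * np.pi * sigma2)))
closed = -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)
print(np.isclose(direct, closed))  # True
```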

  9. Maximum Likelihood Estimation (MLE/ML)
     • Example: univariate Gaussian pdf (cont.)
       – Take the partial derivatives of the above expression and set them to zero ($\Phi$ itself is fixed but unknown):
         $\frac{\partial}{\partial \mu} \log p_n(\mathbf{x} \mid \Phi) = \frac{1}{\sigma^2} \sum_{k=1}^{n} (x_k - \mu) = 0$
         $\frac{\partial}{\partial \sigma^2} \log p_n(\mathbf{x} \mid \Phi) = -\frac{n}{2\sigma^2} + \frac{\sum_{k=1}^{n} (x_k - \mu)^2}{2\sigma^4} = 0$
       – The maximum likelihood estimates for $\mu$ and $\sigma^2$ are
         $\mu_{ML} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \sigma^2_{ML} = \frac{1}{n}\sum_{k=1}^{n} \left( x_k - \mu_{ML} \right)^2$
     • The maximum likelihood estimates of the mean and variance are just the sample mean and sample variance
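The two ML estimates computed directly (a sketch; note the $1/n$ normalizer, which matches `np.var`'s default `ddof=0` rather than the unbiased $1/(n-1)$):

```python
import numpy as np

# ML estimates for a univariate Gaussian: sample mean and (biased)
# sample variance.
rng = np.random.default_rng(5)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)

mu_ml = x.mean()                       # (1/n) * sum x_k
sigma2_ml = np.mean((x - mu_ml) ** 2)  # (1/n) * sum (x_k - mu_ml)^2 == x.var()
print(mu_ml, sigma2_ml)                # ~ 2 and ~ 9
```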

  10. Maximum Likelihood Estimation (MLE/ML)
     • Example: multivariate Gaussian pdf
       $p(\mathbf{x} \mid \Phi) = \frac{1}{(2\pi)^{d/2} \left| \mathbf{\Sigma} \right|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$
       – The maximum likelihood estimates for $\boldsymbol{\mu}$ and $\mathbf{\Sigma}$ are
         $\hat{\boldsymbol{\mu}}_{MLE} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$
         $\hat{\mathbf{\Sigma}}_{MLE} = \frac{1}{n}\sum_{k=1}^{n} \left( \mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MLE} \right)\left( \mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MLE} \right)^t$
     • The maximum likelihood estimates of the mean vector and covariance matrix are just the sample mean vector and sample covariance matrix
     • In fact, $\Phi_{MLE}$ itself is also a Gaussian-distributed random variable
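A sketch of the multivariate estimates (the true mean and covariance are my own toy values); `np.cov(..., bias=True)` uses the same $1/n$ normalizer as the formulas above:

```python
import numpy as np

# ML estimates for a multivariate Gaussian: sample mean vector and the
# 1/n outer-product sum for the covariance.
rng = np.random.default_rng(6)
true_mu = np.array([1.0, -1.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
x = rng.multivariate_normal(true_mu, true_cov, size=20_000)   # shape (n, d)

mu_hat = x.mean(axis=0)
diff = x - mu_hat
cov_hat = diff.T @ diff / x.shape[0]   # (1/n) sum (x_k - mu)(x_k - mu)^t
print(mu_hat, cov_hat)                 # close to true_mu, true_cov
print(np.allclose(cov_hat, np.cov(x.T, bias=True)))  # True
```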

  11. Bayesian Estimation
     • Bayesian estimation has a different philosophy than maximum likelihood (ML) estimation
       – ML assumes the parameter set $\Phi$ is fixed but unknown (a non-informative, uniform prior)
       – Bayesian estimation assumes the parameter set $\Phi$ is itself a random variable with a prior distribution $p(\Phi)$
       – Given a sequence of random samples $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, which are i.i.d. with joint pdf $p(\mathbf{x} \mid \Phi)$, the posterior distribution of $\Phi$ follows from Bayes' rule:
         $p(\Phi \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \Phi)\, p(\Phi)}{p(\mathbf{x})} \;\propto\; p(\mathbf{x} \mid \Phi)\, p(\Phi)$
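A grid-based sketch of Bayes' rule for a Gaussian mean with a Gaussian prior (the distributional choices here are my own illustration; the normalizing integral plays the role of $p(\mathbf{x})$):

```python
import numpy as np

# Unnormalized posterior = likelihood * prior; dividing by its integral
# recovers p(Phi | x) on the grid.
rng = np.random.default_rng(7)
x = rng.normal(loc=1.5, scale=1.0, size=30)   # data with unknown mean Phi

phis = np.linspace(-2.0, 4.0, 1201)
loglik = np.array([-0.5 * np.sum((x - p) ** 2) for p in phis])  # sigma = 1
log_prior = -0.5 * phis ** 2                  # N(0, 1) prior on Phi
post = np.exp(loglik + log_prior - np.max(loglik + log_prior))
post /= np.trapz(post, phis)                  # normalize (the p(x) term)
print(phis[np.argmax(post)])   # posterior mode, between 0 and x.mean()
```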

  12. Bayesian Estimation
     • $p(\Phi \mid \mathbf{x})$: the posterior probability, the distribution of $\Phi$ after we have observed the values of the random variables
     • $p(\Phi)$: a conjugate prior of the random variables (or vectors) is defined as the prior distribution for the parameters of their density function (e.g., a Gaussian), before we observe the values of the random variables
     • The joint pdf/likelihood function (a univariate Gaussian with unknown mean $\Phi$ and known variance $\sigma^2$):
       $p(\mathbf{x} \mid \Phi) = \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^n} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{x_i - \Phi}{\sigma} \right)^2 \right) \;\propto\; \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{x_i - \Phi}{\sigma} \right)^2 \right)$
     • The prior is also a Gaussian distribution:
       $p(\Phi) = \frac{1}{\sqrt{2\pi}\,\nu} \exp\left( -\frac{1}{2} \left( \frac{\Phi - \mu}{\nu} \right)^2 \right) \;\propto\; \exp\left( -\frac{1}{2} \left( \frac{\Phi - \mu}{\nu} \right)^2 \right)$
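Because the Gaussian prior is conjugate to the Gaussian likelihood, the posterior is again Gaussian. The closed form below is the standard known-variance result (stated as a sketch, with my own variable names) and is checked against a grid computation:

```python
import numpy as np

# Gaussian likelihood (known sigma) x Gaussian prior N(mu0, nu^2)
# => Gaussian posterior N(mu_n, nu_n^2).
rng = np.random.default_rng(8)
sigma, mu0, nu = 1.0, 0.0, 2.0
x = rng.normal(loc=1.0, scale=sigma, size=25)
n = x.size

# Standard closed-form posterior parameters (precisions add).
nu_n2 = 1.0 / (n / sigma**2 + 1.0 / nu**2)
mu_n = nu_n2 * (x.sum() / sigma**2 + mu0 / nu**2)

# Grid check: the argmax of likelihood * prior should sit at mu_n.
phis = np.linspace(-1.0, 3.0, 4001)
logpost = (-0.5 * np.array([np.sum((x - p) ** 2) for p in phis]) / sigma**2
           - 0.5 * (phis - mu0) ** 2 / nu**2)
print(mu_n, phis[np.argmax(logpost)])   # the two should agree
```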
