Cramér-Rao Bounds and Monte Carlo Calculation of the Fisher Information Matrix
Interfaces 2004
James C. Spall
The Johns Hopkins University Applied Physics Laboratory
james.spall@jhuapl.edu
Introduction
• Fundamental role of data analysis is to extract information from data
• Parameter estimation for models is central to the process of extracting information
• The Fisher information matrix plays a central role in parameter estimation for measuring information: the information matrix summarizes the amount of information in the data relative to the parameters being estimated
Problem Setting
• Consider the classical statistical problem of estimating parameter vector θ from n data vectors z_1, z_2, …, z_n
• Suppose we have a probability density and/or mass function associated with the data
• The parameters θ appear in the probability function and affect the nature of the distribution
  – Example: z_i ∼ N(mean(θ), covariance(θ)) for all i
• Let L(θ | z_1, z_2, …, z_n) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data
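For independent data (the setting assumed in this added illustration), the likelihood factors into a product of the individual densities, and its log into a sum:

$$
L(\theta \mid z_1, \ldots, z_n) = \prod_{i=1}^{n} p(z_i \mid \theta),
\qquad
\log L(\theta \mid z_1, \ldots, z_n) = \sum_{i=1}^{n} \log p(z_i \mid \theta)
$$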
Selected Applications
• Information matrix is a measure of performance for several applications. Four uses are:
1. Confidence regions for parameter estimation
  – Uses asymptotic normality and/or the Cramér-Rao inequality
2. Prediction bounds for mathematical models
3. Basis for the “D-optimal” criterion for experimental design
  – Information matrix serves as measure of how well θ can be estimated for a given set of inputs
4. Basis for the “noninformative prior” in Bayesian analysis
  – Sometimes used for “objective” Bayesian inference
Information Matrix
• Recall likelihood function L(θ | z_1, z_2, …, z_n)
• Information matrix defined as

$$
F_n(\theta) \;=\; E\!\left[ \frac{\partial \log L}{\partial \theta} \cdot \frac{\partial \log L}{\partial \theta^{T}} \right]
$$

where the expectation is w.r.t. z_1, z_2, …, z_n
• Equivalent form based on the Hessian matrix:

$$
F_n(\theta) \;=\; -\,E\!\left[ \frac{\partial^{2} \log L}{\partial \theta \, \partial \theta^{T}} \right]
$$

• F_n(θ) is positive semidefinite of dimension p × p (p = dim(θ))
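A small worked example (added here for illustration; not from the original slides): for i.i.d. scalar data z_i ∼ N(θ, σ²) with σ² known,

$$
\log L(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(z_i - \theta)^2,
\qquad
\frac{\partial^{2} \log L}{\partial \theta^{2}} = -\frac{n}{\sigma^2},
$$

so both forms of the definition give F_n(θ) = n/σ²: information grows linearly in the number of data vectors.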
Information Matrix (cont’d)
• Connection of F_n(θ) and uncertainty in estimate θ̂_n is rigorously specified via two famous results (θ* = true value of θ):
1. Asymptotic normality:

$$
\sqrt{n}\,\bigl(\hat{\theta}_n - \theta^{*}\bigr) \xrightarrow{\ \text{dist}\ } N\!\left(0, \bar{F}^{-1}\right),
\qquad
\bar{F} \equiv \lim_{n \to \infty} \frac{F_n(\theta^{*})}{n}
$$

2. Cramér-Rao inequality:

$$
\operatorname{cov}\bigl(\hat{\theta}_n\bigr) \;\ge\; F_n(\theta^{*})^{-1} \quad \text{for all } n \ \text{(unbiased } \hat{\theta}_n\text{)}
$$

• Above two results indicate: greater variability of θ̂_n ⇒ “smaller” F_n(θ) (and vice versa)
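Continuing the scalar Gaussian illustration above: the sample mean attains the Cramér-Rao bound exactly,

$$
\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^{n} z_i,
\qquad
\operatorname{var}\bigl(\hat{\theta}_n\bigr) = \frac{\sigma^2}{n} = F_n(\theta)^{-1},
$$

which is the sense in which a “larger” F_n(θ) permits a less variable estimate.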
Computation of Information Matrix
• Analytical formula for F_n(θ) requires first- or second-derivative info. and an expectation calculation
  – Often impossible or very difficult to compute in real-world models
  – Involves expected value of highly nonlinear (possibly unknown) functions of data
• Schematic below summarizes “easy” Monte Carlo-based method for determining F_n(θ) (a code sketch follows the implementation Q&A below)
  – Uses averages of very efficient (simultaneous perturbation) Hessian estimates
  – Hessian estimates evaluated at artificial (pseudo) data
  – Computational horsepower instead of analytical analysis
Schematic of Monte Carlo Method for Estimating Information Matrix
[Schematic figure not reproduced: pseudodata generated at θ → SP Hessian estimates → averaging → estimate of F_n(θ)]
Optimal Implementation
• Several implementation questions/answers:
Q. How to compute (cheap) Hessian estimates?
A. Use simultaneous perturbation (SP) based method (IEEE Trans. Auto. Control, 2000, pp. 1839–1853)
Q. How to allocate per-realization (M) and across-realization (N) averaging?
A. M = 1 is the optimal solution for a fixed total number of Hessian estimates. However, M > 1 is useful when accounting for cost of generating pseudo data.
Q. Can correlation be introduced to improve overall accuracy of F̂_{M,N}(θ)?
A. Yes, antithetic random numbers can reduce variance of the elements in F̂_{M,N}(θ). Discussed on slides below.
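To make the procedure concrete, here is a minimal Python sketch of the Monte Carlo estimator (added for illustration; not the author's code). It assumes the exact gradient of the log-likelihood is available, i.e., the gradient-based variant; the likelihood-only variant would instead form SP gradient approximations from function values. The names `sp_hessian_estimate`, `mc_fisher_info`, and `grad_logL_factory` are invented for this sketch.

```python
import numpy as np

def sp_hessian_estimate(grad_logL, theta, c=1e-4, rng=None):
    # One simultaneous-perturbation (SP) Hessian estimate: two gradient
    # evaluations at theta +/- c*delta, with delta a random +/-1 vector,
    # symmetrized as in the SP Hessian form of Spall (2000).
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.size)
    dG = grad_logL(theta + c * delta) - grad_logL(theta - c * delta)
    H = np.outer(dG / (2.0 * c), 1.0 / delta)
    return 0.5 * (H + H.T)

def mc_fisher_info(grad_logL_factory, theta, N, M=1, seed=0):
    # F_hat_{M,N}(theta): average of -Hessian over N pseudodata sets,
    # with M Hessian estimates averaged per pseudodata set.
    rng = np.random.default_rng(seed)
    F_hat = np.zeros((theta.size, theta.size))
    for _ in range(N):
        grad_logL = grad_logL_factory(rng)      # fresh pseudodata drawn at theta
        H_bar = sum(sp_hessian_estimate(grad_logL, theta, rng=rng)
                    for _ in range(M)) / M
        F_hat -= H_bar / N                      # F_n = -E[Hessian of log L]
    return F_hat

# Toy check: z_i ~ N(theta, 1), n = 30, so the true F_n(theta) = 30.
n = 30
def grad_logL_factory(rng):
    z = rng.normal(loc=0.5, scale=1.0, size=n)        # pseudodata at theta = 0.5
    return lambda th: np.atleast_1d(np.sum(z - th))   # d(log L)/d(theta)

print(mc_fisher_info(grad_logL_factory, np.array([0.5]), N=200))  # ~[[30.]]
```

With M = 1 and independent perturbations this matches the allocation recommended above; M > 1 reuses each pseudodata set for several Hessian estimates, trading some statistical efficiency for fewer pseudodata generations.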
Antithetic Random Numbers
• Above solution (M = 1) assumes all Hessian estimates generated with independent perturbation vectors
• Is it possible to introduce correlated perturbations to reduce variability?
• Implemented based on M > 1
  – Contrasts with optimal solution above of M = 1
• Antithetic random numbers (ARNs) are a way to reduce variability of sums of pseudo random numbers
  – Contrast with common random numbers for differences of pseudo random numbers
• Based on introducing negative correlation according to var(X + Y) = var(X) + var(Y) + 2cov(X, Y) (see the demo below)
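A generic demonstration of the mechanism (added here; this is the textbook antithetic-variates construction, not the specific Hessian-averaging scheme of the talk): for a monotone g, pairing U with 1 − U makes cov(g(U), g(1 − U)) negative, so the pair average has smaller variance than two independent draws.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
g = np.exp                        # monotone integrand: E[g(U)] = e - 1

# Averages of independent pairs
u1, u2 = rng.uniform(size=(2, N))
indep = 0.5 * (g(u1) + g(u2))

# Averages of antithetic pairs: U and 1 - U give cov(g(U), g(1 - U)) < 0
u = rng.uniform(size=N)
anti = 0.5 * (g(u) + g(1.0 - u))

print(indep.var())   # ~0.12
print(anti.var())    # ~0.004 -- much smaller at the same sampling cost
```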
Implementing Antithetic Random Numbers
• Implementing ARNs represents both art and science
  – Typically more difficult than common random numbers
• Possible to write down analytical basis for “best” implementation of ARNs
  – Unusable in practice
  – Requires full knowledge of true Hessian values
• Practical implementation requires problem insight and approximations
• Not a panacea, but sometimes useful to increase accuracy and/or reduce computational cost
Numerical Experiments for Monte Carlo Method of Estimating Information Matrix
• Consider a problem of estimating µ and Σ from data z_i ∼ N(µ, Σ + P_i) ∀ i; let n = 30
  – A problem with known information matrix
  – Useful for comparing approach here with known result
  – P_i’s assumed known (non-identical)
• Have dim(z_i) = 4 and dim(θ) = 14, so 14 × (14+1)/2 = 105 unique elements in F_n(θ) need to be calculated (the count is spelled out below)
• Real-world implementation of Monte Carlo method is for problems where solution is not known (unlike this example)
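The parameter count is quick arithmetic (spelled out here for clarity): the mean µ contributes 4 parameters and the symmetric 4 × 4 matrix Σ contributes its unique entries, so

$$
\dim(\theta) = 4 + \frac{4 \cdot 5}{2} = 14,
\qquad
\frac{p(p+1)}{2} = \frac{14 \cdot 15}{2} = 105
$$

unique elements of the symmetric matrix F_n(θ) to estimate.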
Evaluation Criteria
• Let F̂_{M,N}(θ) denote the estimate for the Fisher info. matrix from M Hessian estimates at each pseudodata vector and N pseudodata vectors
• Many ways of comparing F̂_{M,N}(θ) and the true matrix F_n(θ) = F_30(θ)
• As summary measure we use the standard matrix (spectral) norm (scaled; a one-line implementation is sketched below):

$$
\text{norm} \;=\; \frac{\bigl\| \hat{F}_{M,N}(\theta) - F_n(\theta) \bigr\|}{\bigl\| F_n(\theta) \bigr\|}
$$
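The criterion is a one-liner in practice (a minimal sketch; `F_hat` and `F_true` are placeholder names):

```python
import numpy as np

def scaled_spectral_norm_error(F_hat, F_true):
    # Spectral (2-)norm of the error matrix, scaled by the norm of the truth
    return np.linalg.norm(F_hat - F_true, ord=2) / np.linalg.norm(F_true, ord=2)
```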
Focus of Numerical Experiments
• Two tables below show results of numerical studies of various implementations
  – Optimality of M = 1 under fixed budget B = MN of Hessian estimates
  – Value of gradient information (when available) in improving estimate
  – Value of ARNs
• Assume only likelihood values are available (i.e., no gradient) in study of M = 1
  – Crude Hessian estimates based on difference of SP gradient estimates
  – Harder to obtain good Hessian estimates than when exact gradient is available
Two Studies: Optimality of M = 1 and Value of Gradient Information
• Values in columns (a), (b), and (c) are scaled matrix norms; P-values shown at right give the associated statistical significance
• Constant budget B of SP Hessian estimates (B = MN)
• P-values based on two-sided t-test

            (a) M = 1       (b) M = 20      (c) M = 1
            N = 40,000      N = 2000        N = 40,000
            Likelihood      Likelihood      Gradient
            values          values          values
  Norm      0.0502          0.0532          0.0183

  P-value (a) vs. (b): 0.0009      P-value (a) vs. (c): < 10⁻¹⁰
Test of Antithetic Random Numbers for µ Portion of F_n(θ): Matrix Norms and P-Value
• Constant budget of SP Hessian estimates (B = MN)
• P-values based on two-sided t-test
• SP Hessian estimates based on true gradient values

  M = 1, N = 40,000 (no ARNs):  0.0084
  M = 2, N = 20,000 (ARNs):     0.0071
  P-value:                      0.018
Concluding Remarks
• Fisher information matrix is a central quantity in data analysis and parameter estimation
  – Measures information in data relative to quantities being estimated
  – Applications in confidence bounds, prediction error bounds, experimental design, Bayesian analysis, etc.
• Direct computation of information matrix in general nonlinear problems usually impossible
• Described Monte Carlo approach for computing matrix in arbitrarily complex (nonlinear) estimation problems:
  – Replaces detailed analytical analysis with computational power via resampling
  – Easy to implement, but may be computationally demanding