Modern Computational Statistics
Lecture 4: Numerical Integration
Cheng Zhang
School of Mathematical Sciences, Peking University
September 23, 2019
Overview
◮ Statistical inference often depends on intractable integrals
$$I(f) = \int_\Omega f(x)\,dx$$
◮ This is especially true in Bayesian statistics, where the posterior distribution is usually not available in closed form
◮ In some situations, the likelihood itself may depend on intractable integrals, so frequentist methods would also require numerical integration
◮ In this lecture, we start by discussing some simple numerical methods that can be easily used in low dimensional problems
◮ Next, we will discuss several Monte Carlo strategies that can be implemented even when the dimension is high
Newton-Côtes Quadrature
◮ Consider a one-dimensional integral of the form
$$I(f) = \int_a^b f(x)\,dx$$
◮ A common strategy for approximating this integral is to use a tractable approximating function $\tilde{f}(x)$ that can be integrated easily
◮ We typically constrain the approximating function to agree with $f$ on a grid of points: $x_1, x_2, \ldots, x_n$
Newton-Côtes Quadrature
◮ Newton-Côtes methods use equally-spaced grids
◮ The approximating function is a polynomial
◮ The integral is then approximated with a weighted sum as follows
$$\hat{I} = \sum_{i=1}^{n} w_i f(x_i)$$
◮ In its simplest case, we can use the Riemann rule by partitioning the interval $[a, b]$ into $n$ subintervals of length $h = \frac{b-a}{n}$; then
$$\hat{I}_L = h \sum_{i=0}^{n-1} f(a + ih)$$
This is obtained using a piecewise constant function $\tilde{f}$ that matches $f$ at the left endpoint of each subinterval
Newton-Côtes Quadrature
◮ Alternatively, the approximating function could agree with the integrand at the right or middle point of each subinterval
$$\hat{I}_R = h \sum_{i=1}^{n} f(a + ih), \qquad \hat{I}_M = h \sum_{i=0}^{n-1} f\!\left(a + \left(i + \tfrac{1}{2}\right)h\right)$$
◮ In either case, the approximating function is a zero-order polynomial
◮ To improve the approximation, we can use the trapezoidal rule, which uses a piecewise linear function that agrees with $f(x)$ at both ends of each subinterval
$$\hat{I}_T = \frac{h}{2} f(a) + h \sum_{i=1}^{n-1} f(x_i) + \frac{h}{2} f(b)$$
Newton-Côtes Quadrature
◮ We can further improve the approximation by using higher order polynomials
◮ Simpson's rule uses a quadratic approximation over each subinterval
$$\int_{x_i}^{x_{i+1}} f(x)\,dx \approx \frac{x_{i+1} - x_i}{6} \left[ f(x_i) + 4 f\!\left(\frac{x_i + x_{i+1}}{2}\right) + f(x_{i+1}) \right]$$
◮ In general, we can use any polynomial of degree $k$
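The composite rules above are straightforward to implement. Below is a minimal Python sketch (the test integrand and grid sizes are illustrative choices, not from the slides) that approximates $\int_0^1 e^x\,dx$ with the left Riemann, trapezoidal, and Simpson rules.

```python
import numpy as np

def riemann_left(f, a, b, n):
    """Left Riemann rule: piecewise constant approximation."""
    h = (b - a) / n
    x = a + h * np.arange(n)          # left endpoints of the n subintervals
    return h * np.sum(f(x))

def trapezoid(f, a, b, n):
    """Trapezoidal rule: piecewise linear approximation."""
    h = (b - a) / n
    x = a + h * np.arange(n + 1)      # all n + 1 grid points
    return h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

def simpson(f, a, b, n):
    """Simpson's rule: quadratic approximation on each subinterval."""
    h = (b - a) / n
    x = a + h * np.arange(n + 1)
    mid = 0.5 * (x[:-1] + x[1:])      # midpoints of each subinterval
    return (h / 6) * np.sum(f(x[:-1]) + 4 * f(mid) + f(x[1:]))

# Example: integrate exp(x) on [0, 1]; the exact value is e - 1
f, exact = np.exp, np.e - 1
for rule in (riemann_left, trapezoid, simpson):
    print(rule.__name__, abs(rule(f, 0, 1, 100) - exact))
```

With 100 subintervals the errors shrink rapidly from the left Riemann rule to Simpson's rule, reflecting the increasing polynomial order.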
Gaussian Quadrature
◮ Newton-Côtes rules require equally spaced grids
◮ With a suitably flexible choice of $n + 1$ nodes, $x_0, x_1, \ldots, x_n$, and corresponding weights, $A_0, A_1, \ldots, A_n$,
$$\sum_{i=0}^{n} A_i f(x_i)$$
gives the exact integral for all polynomials of degree less than or equal to $2n + 1$
◮ This is called Gaussian quadrature, which is especially useful for integrals of the form $\int_a^b f(x) w(x)\,dx$, where $w(x)$ is a nonnegative function and $\int_a^b x^k w(x)\,dx < \infty$ for all $k \geq 0$
Orthogonal Functions
◮ In general, for square integrable functions,
$$\int_a^b f(x)^2 w(x)\,dx < \infty$$
denoted as $f \in L^2_{w,[a,b]}$, we define the inner product as
$$\langle f, g \rangle_{w,[a,b]} = \int_a^b f(x) g(x) w(x)\,dx$$
where $f, g \in L^2_{w,[a,b]}$
◮ We say two functions $f$ and $g$ are orthogonal if $\langle f, g \rangle_{w,[a,b]} = 0$. If $f$ and $g$ are also scaled so that $\langle f, f \rangle_{w,[a,b]} = 1$ and $\langle g, g \rangle_{w,[a,b]} = 1$, then $f$ and $g$ are orthonormal
Orthogonal Polynomials
◮ We can define a sequence of orthogonal polynomials by a recursive rule
$$T_{k+1}(x) = (\alpha_{k+1} + \beta_{k+1} x)\, T_k(x) - \gamma_{k+1}\, T_{k-1}(x)$$
◮ Example: Chebyshev polynomials (first kind)
$$T_0(x) = 1, \quad T_1(x) = x, \quad T_{n+1}(x) = 2x\, T_n(x) - T_{n-1}(x)$$
◮ $T_n(x)$ are orthogonal with respect to $w(x) = \frac{1}{\sqrt{1 - x^2}}$ on $[-1, 1]$:
$$\int_{-1}^{1} T_n(x)\, T_m(x)\, \frac{1}{\sqrt{1 - x^2}}\,dx = 0, \quad \forall\, n \neq m$$
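As a quick illustration (not part of the original slides), the three-term recursion above is easy to code directly; here it is checked against the closed form $T_n(\cos\theta) = \cos(n\theta)$:

```python
import numpy as np

def chebyshev_T(n, x):
    """Evaluate the first-kind Chebyshev polynomial T_n(x) by the recursion
    T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)."""
    T_prev, T_curr = np.ones_like(x), x      # T_0 = 1, T_1 = x
    if n == 0:
        return T_prev
    for _ in range(1, n):
        T_prev, T_curr = T_curr, 2 * x * T_curr - T_prev
    return T_curr

# Check against the identity T_n(cos(theta)) = cos(n * theta)
x = np.linspace(-1, 1, 201)
for n in range(6):
    assert np.allclose(chebyshev_T(n, x), np.cos(n * np.arccos(x)))
print("recursion matches cos(n arccos x) for n = 0, ..., 5")
```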
Orthogonal Polynomials
◮ In general, orthogonal polynomials are not unique, since $\langle f, g \rangle = 0$ implies $\langle cf, dg \rangle = 0$
◮ To make the orthogonal polynomials unique, we can use one of the following standardizations
  ◮ make the polynomials orthonormal: $\langle f, f \rangle = 1$
  ◮ set the leading coefficient of $T_j(x)$ to 1
◮ Orthogonal polynomials form a basis for $L^2_{w,[a,b]}$, so any function in this space can be written as
$$f(x) = \sum_{n=0}^{\infty} a_n T_n(x), \quad \text{where } a_n = \frac{\langle f, T_n \rangle}{\langle T_n, T_n \rangle}$$
Gaussian Quadrature
◮ Let $\{T_n(x)\}_{n=0}^{\infty}$ be a sequence of orthogonal polynomials with respect to $w$ on $[a, b]$.
◮ Denote the $n + 1$ roots of $T_{n+1}(x)$ by $a < x_0 < x_1 < \ldots < x_n < b$.
◮ We can find weights $A_0, A_1, \ldots, A_n$ such that
$$\int_a^b P(x) w(x)\,dx = \sum_{i=0}^{n} A_i P(x_i), \quad \forall\, \deg(P) \leq 2n + 1$$
◮ To do that, we first show: there exist weights $A_0, A_1, \ldots, A_n$ such that
$$\int_a^b P(x) w(x)\,dx = \sum_{i=0}^{n} A_i P(x_i), \quad \forall\, \deg(P) < n + 1$$
Gaussian Quadrature
◮ Sketch of proof. We only need to satisfy
$$\int_a^b x^k w(x)\,dx = \sum_{i=0}^{n} A_i x_i^k, \quad \forall\, k = 0, 1, \ldots, n$$
This leads to a system of linear equations
$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_n \\ \vdots & \vdots & & \vdots \\ x_0^n & x_1^n & \cdots & x_n^n \end{pmatrix} \begin{pmatrix} A_0 \\ A_1 \\ \vdots \\ A_n \end{pmatrix} = \begin{pmatrix} I_0 \\ I_1 \\ \vdots \\ I_n \end{pmatrix}$$
where $I_k = \int_a^b x^k w(x)\,dx$. The determinant of the coefficient matrix is a Vandermonde determinant, and is non-zero since $x_i \neq x_j$ for all $i \neq j$
Gaussian Quadrature
◮ Now we show that the above Gaussian quadrature is exact for polynomials of degree $\leq 2n + 1$
◮ Let $P(x)$ be a polynomial with $\deg(P) \leq 2n + 1$. There exist polynomials $g(x)$ and $r(x)$ such that
$$P(x) = g(x)\, T_{n+1}(x) + r(x), \quad \deg(g) \leq n, \ \deg(r) \leq n$$
Since $\deg(g) \leq n$, $g$ is orthogonal to $T_{n+1}$, so $\int_a^b g(x) T_{n+1}(x) w(x)\,dx = 0$; moreover, $T_{n+1}(x_i) = 0$ implies $P(x_i) = r(x_i)$. Therefore,
$$\int_a^b P(x) w(x)\,dx = \int_a^b r(x) w(x)\,dx = \sum_{i=0}^{n} A_i r(x_i) = \sum_{i=0}^{n} A_i P(x_i)$$
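To make the construction concrete, here is a small Python sketch (an illustration, assuming the Legendre case $w(x) = 1$ on $[-1, 1]$; the degree and test polynomial are arbitrary choices) that takes the roots of $T_{n+1}$ as nodes, solves the Vandermonde system above for the weights, and checks exactness on a polynomial of degree $2n + 1$:

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

n = 4                                         # use n + 1 = 5 nodes
nodes = Legendre.basis(n + 1).roots()         # roots of the (n+1)-st Legendre polynomial

# Vandermonde system: sum_i A_i x_i^k = I_k = int_{-1}^{1} x^k dx, for k = 0, ..., n
V = np.vander(nodes, n + 1, increasing=True).T
moments = np.array([(1 - (-1) ** (k + 1)) / (k + 1) for k in range(n + 1)])
weights = np.linalg.solve(V, moments)

# Exactness check on a random polynomial of degree 2n + 1
rng = np.random.default_rng(0)
P = np.polynomial.Polynomial(rng.normal(size=2 * n + 2))
exact = P.integ()(1) - P.integ()(-1)          # exact integral over [-1, 1]
approx = np.sum(weights * P(nodes))
print(abs(exact - approx))                    # near machine precision
```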
Monte Carlo Method
◮ We now discuss the Monte Carlo method mainly in the context of statistical inference
◮ As before, suppose we are interested in estimating
$$I(h) = \int_a^b h(x)\,dx$$
◮ If we can draw iid samples $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ uniformly from $(a, b)$, we can approximate the integral as
$$\hat{I}_n = (b - a)\, \frac{1}{n} \sum_{i=1}^{n} h(x^{(i)})$$
◮ Note that we can think of the integral as
$$\int_a^b (b - a)\, h(x) \cdot \frac{1}{b - a}\,dx$$
where $\frac{1}{b - a}$ is the density of $\mathrm{Uniform}(a, b)$
Monte Carlo Method
◮ In general, we are interested in integrals of the form $\int_{\mathcal{X}} h(x) f(x)\,dx$, where $f(x)$ is a probability density function
◮ Analogous to the above argument, we can approximate this integral (or expectation) by drawing iid samples $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ from the density $f(x)$ and then
$$\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} h(x^{(i)})$$
◮ Based on the law of large numbers, we know that
$$\hat{I}_n \xrightarrow{\;p\;} I \quad \text{as } n \to \infty$$
◮ And based on the central limit theorem
$$\sqrt{n}\,(\hat{I}_n - I) \to N(0, \sigma^2), \quad \sigma^2 = \mathrm{Var}(h(X))$$
Example: estimating π
◮ Let $h(x) = \mathbf{1}_{B(0,1)}(x)$; then $\pi = 4 \int_{[-1,1]^2} h(x) \cdot \frac{1}{4}\,dx$
◮ Monte Carlo estimate of π
$$\hat{I}_n = \frac{4}{n} \sum_{i=1}^{n} \mathbf{1}_{B(0,1)}(x^{(i)}), \quad x^{(i)} \sim \mathrm{Uniform}([-1, 1]^2)$$
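A minimal Python sketch of this estimator (the sample size and seed are illustrative), including a CLT-based standard error computed from the same samples:

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 100_000

# Draw points uniformly on [-1, 1]^2 and test whether they fall in the unit disk
x = rng.uniform(-1, 1, size=(n, 2))
inside = np.sum(x ** 2, axis=1) <= 1.0

h = 4.0 * inside                      # h(x^{(i)}) scaled so its sample mean estimates pi
pi_hat = h.mean()
se = h.std(ddof=1) / np.sqrt(n)       # CLT-based standard error

print(f"pi_hat = {pi_hat:.4f} +/- {1.96 * se:.4f}")
```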
Monte Carlo vs Quadrature
◮ Convergence rate for Monte Carlo: $O(n^{-1/2})$
$$p\left( |\hat{I}_n - I| \leq \frac{\sigma}{\sqrt{n\delta}} \right) \geq 1 - \delta, \quad \forall\, \delta$$
often slower than quadrature methods ($O(n^{-2})$ or better)
◮ However, the convergence rate of Monte Carlo does not depend on the dimensionality
◮ On the other hand, quadrature methods are difficult to extend to multidimensional problems because of the curse of dimensionality: the actual convergence rate becomes $O(n^{-k/d})$ for an order-$k$ method in dimension $d$
◮ This makes the Monte Carlo strategy very attractive for high dimensional problems
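The $O(n^{-1/2})$ rate can be seen empirically. The sketch below (an illustration using the π estimator from the previous slide; sample sizes and repetition count are arbitrary choices) fits the slope of log-error against log-sample-size, which should come out close to $-1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_pi(n):
    """One Monte Carlo estimate of pi with n uniform points on [-1, 1]^2."""
    x = rng.uniform(-1, 1, size=(n, 2))
    return 4.0 * np.mean(np.sum(x ** 2, axis=1) <= 1.0)

ns = np.array([10 ** k for k in range(2, 7)])
# Average the absolute error over repeated runs to smooth out randomness
errs = np.array([np.mean([abs(mc_pi(n) - np.pi) for _ in range(20)]) for n in ns])

# Fit the slope of log(error) vs log(n); expected to be close to -1/2
slope = np.polyfit(np.log(ns), np.log(errs), 1)[0]
print(f"estimated error decay rate: n^{slope:.2f}")
```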
Exact Simulation
◮ Monte Carlo methods require sampling a set of points chosen randomly from a probability distribution
◮ For a simple distribution $f(x)$ whose inverse cumulative distribution function (CDF) exists, we can sample $x$ from $f$ as follows
$$x = F^{-1}(u), \quad u \sim \mathrm{Uniform}(0, 1)$$
where $F^{-1}$ is the inverse CDF of $f$
◮ Proof. $p(a \leq x \leq b) = p(F(a) \leq u \leq F(b)) = F(b) - F(a)$
Examples
◮ Exponential distribution: $f(x) = \theta \exp(-\theta x)$. The CDF is
$$F(a) = \int_0^a \theta \exp(-\theta x)\,dx = 1 - \exp(-\theta a)$$
therefore, $x = F^{-1}(u) = -\frac{1}{\theta} \log(1 - u) \sim f(x)$. Since $1 - u$ also follows the uniform distribution, we often use $x = -\frac{1}{\theta} \log(u)$ instead
◮ Normal distribution: $f(x) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{x^2}{2})$. Box-Muller transform:
$$X = \sqrt{-2 \log U_1}\, \cos(2\pi U_2), \qquad Y = \sqrt{-2 \log U_1}\, \sin(2\pi U_2)$$
where $U_1 \sim \mathrm{Uniform}(0, 1)$, $U_2 \sim \mathrm{Uniform}(0, 1)$
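A minimal Python sketch of both samplers (the rate θ, sample size, and seed are illustrative choices), with sample moments as a quick sanity check:

```python
import numpy as np

rng = np.random.default_rng(42)
n, theta = 100_000, 2.0

# Exponential(theta) via the inverse CDF: x = -log(u) / theta
u = rng.uniform(size=n)
x_exp = -np.log(u) / theta
print("exponential sample mean:", x_exp.mean(), "vs 1/theta =", 1 / theta)

# Standard normal pairs via the Box-Muller transform
u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
r = np.sqrt(-2.0 * np.log(u1))
x, y = r * np.cos(2 * np.pi * u2), r * np.sin(2 * np.pi * u2)
print("normal sample mean and variance:", x.mean(), x.var())
```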