An introduction to holonomic gradient method in statistics Akimichi Takemura, Univ. Tokyo September 5, 2012
Items 1. First example: Airy-like function 2. Holonomic function and holonomic gradient method (HGM) 3. Another example: incomplete gamma function 4. Wishart distribution and hypergeometric function of a matrix argument 5. HGM for two-dimensional Wishart matrix 6. Pfaffian system for general dimension 7. Numerical experiments 1
• Coauthors on HGM: H.Hashiguchi, T.Koyama, H.Nakayama, K.Nishiyama, M.Noro, Y.Numata, K.Ohara, T.Sei, N.Takayama. • HGM proposed in “Holonomic gradient descent and its application to the Fisher-Bingham integral”, Advances in Applied Mathematics , 47, 639–658. N 3 OST 2 . 2011. • We have now about 7 manuscripts on HGM, mainly by Takayama group. • Wishart discussed in arXiv:1201.0472v3 2
Ex.1: Airy-like function An exercise problem by Nobuki Takayama on Sep.16, 2009, during “Kobe Gr¨ obner School”. Question: Let � ∞ e − t − xt 3 dt, A ( x ) = x > 0 . 0 Derive a differential equation satisfied by A ( x ) . Answer: 1 = (27 x 3 ∂ 2 x + 54 x 2 ∂ x + 6 x + 1) A ( x ) = 27 x 3 A �� ( x ) + 54 x 2 A � ( x ) + (6 x + 1) A ( x ) . 3
• Actually this question is pretty hard, even if you are told the answer. • I was struggling with this problem, wasting lots of papers and wondering why I was doing this exercise. • After one hour, I suddenly realized that this is indeed an important problem in statistics. 4
• We try to confirm the given answer. • First � ∞ � ∞ e − t − xt 3 dt = − t 3 e − t − xt 3 dt A � ( x ) = ∂ x 0 0 � ∞ t 6 e − t − xt 3 dt. A �� ( x ) = 0 • Hence 27 x 3 A �� ( x ) + 54 x 2 A � ( x ) + (6 x + 1) A ( x ) � ∞ (27 x 3 t 6 − 54 x 2 t 3 + 6 x + 1) e − t − xt 3 dt = 0 = 1 ?? 5
• Integration by parts: ∂ t e − t − xt 3 = − (1 + 3 xt 2 ) e − t − xt 3 . Hence te − t − xt 3 � ∞ � 0 = 0 � ∞ t (1 + 3 xt 2 ) e − t − xt 3 . = A ( x ) − 0 6
• Similarly, if you work hard and work out ( − 9 x 2 t 4 + 3 xt 2 + 6 xt − 1) e − t − xt 3 � ∞ � 1 = 0 by integration by parts ( how did you get this? ), you obtain the answer: � ∞ (27 x 3 t 6 − 54 x 2 t 3 + 6 x + 1) e − t − xt 3 dt = 1 . 0 Believe and try • ( − 9 x 2 t 4 + 3 xt 2 + 6 xt − 1) � + (9 x 2 t 4 − 3 xt 2 − 6 xt + 1)(1 + 3 xt 2 ) = ( − 36 x 2 t 3 + 6 xt + 6 x ) + (9 x 2 t 4 − 3 xt 2 − 6 xt + 1) + (27 x 3 t 6 − 9 x 2 t 4 − 18 x 2 t 3 + 3 xt 2 ) = 27 x 3 t 6 − 54 x 2 t 3 + 6 x + 1 ( OK! ) 7
• Actually Prof. Takayama (or one of his students) did this by computer, asking us to do this by hand( !) ← Integration algorithm for D -modules based on Gr¨ obner basis for D -modules 8
Is this exercise related to statistics? • Change the notation and let � ∞ e − x − θx 3 dx. A ( θ ) = 0 • Let 1 A ( θ ) e − x − θx 3 , f ( x ; θ ) = x, θ > 0 . • This is an exponential family with the su ffi cient statistic T ( x ) = x 3 . (We can absorb e − x into the base measure dx .) 9
• Therefore we are evaluating the normalizing constant and its derivatives of an exponential family. • We now know 1 = 27 θ 3 A �� ( θ ) + 54 θ 2 A � ( θ ) + (6 θ + 1) A ( θ ) . • Hence the Fisher information A �� ( θ ) is automatically obtained from A ( θ ) and A � ( θ ) . 10
• Suppose that we have independent observations x 1 , . . . , x n from f ( x ; θ ) . • Then the log likelihood � ( θ ) is written as n � x 3 � ( θ ) = − θ i − n log A ( θ ) i =1 and the likelihood equation is n i − nA � ( θ ) � x 3 0 = − A ( θ ) . i =1 • Can we numerically evaluate A ( θ ) and A � ( θ ) ? (Also A �� ( θ ) for Newton-Raphson?) 11
• For illustration, we use simple linear approximation (Euler method for numerical integration). A ( θ + ∆ θ ) . = A ( θ ) + ∆ θA � ( θ ) A � ( θ + ∆ θ ) . = A � ( θ ) + ∆ θA �� ( θ ) . • But from the differential equation we know 1 A �� ( θ ) = 27 θ 3 (1 − (6 θ + 1) A ( θ ) − 54 θ 2 A � ( θ )) . 12
• Punch line: if you keep numerical values of A ( θ ) , A � ( θ ) at one point θ , then you can compute these values at nearby θ + ∆ θ . • At each point, higher-order derivatives A �� ( θ ) , A ��� ( θ ) , . . . , can be computed as needed. • Hence by numerically solving ODE, you can compute A ( θ ) and its derivatives at any point → “Holonomic Gradient Method” • For explanation we used Euler method, but in our actual implementation we use Runge-Kutta method to solve ODE. 13
Holonomic function and holonomic gradient method (HGM) Univariate homogeneous case: • A smooth function f is holonomic if f satisfies the following ODE 0 = h k ( x ) f ( k ) ( x ) + · · · + h 1 ( x ) f � ( x ) + h 0 ( x ) f ( x ) , where h k ( x ) , . . . , h 1 ( x ) , h 0 ( x ) are rational functions of x . 14
• Approximation f ( x + ∆ x ) f � ( x + ∆ x ) . . . f ( k − 1) ( x + ∆ x ) 0 1 0 . . . 0 f ( x ) 0 0 1 . . . 0 f � ( x ) . = I k + ∆ x . . . . . . 0 0 . . . 0 1 f ( k − 1) ( x ) − h k − 1 ( x ) − h 0 ( x ) − h 1 ( x ) . . . h k ( x ) h k ( x ) h k ( x ) 15
Multivariate case (for simplicity let dim = 2 ) • A smooth f ( x, y ) is holonomic if for each of x and y , fixing the other variable arbitrarily, we have a holonomic function in x and y . • Namely, if there exist rational functions h 1 0 ( x, y ) , . . . , h 1 k 1 ( x, y ) , h 2 0 ( x, y ) , . . . , h 2 k 2 ( x, y ) in x, y such that k 1 � h 1 i ( x, y ) ∂ i x f ( x, y ) = 0 , i =0 k 2 � h 2 i ( x, y ) ∂ i y f ( x, y ) = 0 . i =0 16
• Consider ∂ r 1 x ∂ r 2 y f ( x, y ) . If r 1 ≥ k 1 or r 2 ≥ k 2 , we can always compute this by recursively applying the differential equations. • As in the univariate case, if we keep numerical values of ∂ i x ∂ j y f ( x, y ) in the range i = 0 , . . . , k 1 − 1 , j = 0 , . . . , k 2 − 1 , then we can always compute other higher-order derivatives. • We can also approximate ∂ i x ∂ j y f ( x + ∆ x, y + ∆ y ) by the values { ∂ i x ∂ j y f ( x, y ) } i =0 ,...,k 1 − 1 ,j =0 ,...,k 2 − 1 . 17
• We usually only need to keep a subset of { ∂ i x ∂ j y f ( x, y ) } i =0 ,...,k 1 − 1 ,j =0 ,...,k 2 − 1 in memory. The subset is given by the set of standard monomials obtained by the division algorithm based on a Gr¨ obner basis. 18
Which functions are holonomic (Zeilberger(1990))? • Polynomials and rational functions are holonomic. • exp( rational ) , log( rational ) are holonomic. • f, g : holonomic ⇒ f + g, f × g : holonomic • f ( x 1 , . . . , x m ) : holonomic � ⇒ f ( x 1 , . . . , x m ) dx m : holonomic • Restriction of a holonomic f to an a ffi ne subspace is holonomic. 19
• Holonomocity is also defined for generalized functions. • From the above properties, it is often easy to tell that a given function (such as the Airy-like function) is holonomic, i.e., it must satisfy a differential equation with rational function coe ffi cients. • The problem is to find the explicit form of the di ff erential equation. 20
Ex.2: Incomplete Gamma function • Consider incomplete Gamma function � x y α − 1 e − y dy, G ( x ) = α, x > 0 . 0 • From general theory, G ( x ) is holonomic. • We integrate in the opposite direction and change scale: � x � x (1 − y ( x − y ) α − 1 e − ( x − y ) dy = x α − 1 e − x x ) α − 1 e y dy 0 0 � 1 = x α e − x (1 − z ) α − 1 e xz dz 0 21
• Expand e xz into Taylor series (not clever?) � 1 ∞ 1 � G ( x ) = x α e − x k ! x k (1 − z ) α − 1 z k dz 0 k =0 ∞ Γ ( α ) Γ ( k + 1) � = x α e − x k ! Γ ( α + k + 1) x k k =0 ∞ = 1 (1) k � αx α e − x ( α + 1) k k ! x k k =0 ( a ) k = a ( a + 1) . . . ( a + k − 1) = 1 αx α e − x 1 F 1 (1; α + 1; x ) 22
• Confluent hypergeometric function 1 F 1 ∞ ( a ) k � ( c ) k k ! x k F = 1 F 1 ( a ; c ; x ) = k =0 • Differential equation (ODE) satisfied by F : xF �� ( x ) + ( c − x ) F � ( x ) − aF = 0 ( Comparison of coe ffi cients: a + k c + k k + ca + k c + k − k − a = 0) 23
• “Holonomic gradient method”: We obtain the value of F ( x + ∆ x ) from F ( x ) by solving the ODE. • If we know F � ( x ) , then we can do the following F ( x + ∆ x ) . = F ( x ) + ∆ xF � ( x ) • For the next step we also need to update F � ( x + ∆ x ) : F � ( x + ∆ x ) . = F � ( x ) + ∆ xF �� ( x ) 24
• Further we need F �� ( x ) , . . . . But note that at any x we have F �� ( x ) = − (( c − x ) F � ( x ) − aF ( x ))1 x. Hence we can keep only ( F ( x ) , F � ( x )) in memory. We can always compute higher-order derivatives from these two. • In summary, we can use the updating procedure F ( x ) + ∆ xF � ( x ) F ( x ) . → F � ( x ) F � ( x ) + ∆ xF �� ( x ) 25
Recommend
More recommend