An introduction to holonomic gradient method in statistics Akimichi - PowerPoint PPT Presentation

An introduction to holonomic gradient method in statistics Akimichi Takemura, Univ. Tokyo September 5, 2012

Items 1. First example: Airy-like function 2. Holonomic function and holonomic gradient method (HGM) 3. Another example: incomplete gamma function 4. Wishart distribution and hypergeometric function of a matrix argument 5. HGM for two-dimensional Wishart matrix 6. Pfaffian system for general dimension 7. Numerical experiments 1

• Coauthors on HGM: H.Hashiguchi, T.Koyama, H.Nakayama, K.Nishiyama, M.Noro, Y.Numata, K.Ohara, T.Sei, N.Takayama. • HGM proposed in “Holonomic gradient descent and its application to the Fisher-Bingham integral”, Advances in Applied Mathematics , 47, 639–658. N 3 OST 2 . 2011. • We have now about 7 manuscripts on HGM, mainly by Takayama group. • Wishart discussed in arXiv:1201.0472v3 2

Ex.1: Airy-like function An exercise problem by Nobuki Takayama on Sep.16, 2009, during “Kobe Gr¨ obner School”. Question: Let � ∞ e − t − xt 3 dt, A ( x ) = x > 0 . 0 Derive a differential equation satisfied by A ( x ) . Answer: 1 = (27 x 3 ∂ 2 x + 54 x 2 ∂ x + 6 x + 1) A ( x ) = 27 x 3 A �� ( x ) + 54 x 2 A � ( x ) + (6 x + 1) A ( x ) . 3

• Actually this question is pretty hard, even if you are told the answer. • I was struggling with this problem, wasting lots of papers and wondering why I was doing this exercise. • After one hour, I suddenly realized that this is indeed an important problem in statistics. 4

• We try to confirm the given answer. • First � ∞ � ∞ e − t − xt 3 dt = − t 3 e − t − xt 3 dt A � ( x ) = ∂ x 0 0 � ∞ t 6 e − t − xt 3 dt. A �� ( x ) = 0 • Hence 27 x 3 A �� ( x ) + 54 x 2 A � ( x ) + (6 x + 1) A ( x ) � ∞ (27 x 3 t 6 − 54 x 2 t 3 + 6 x + 1) e − t − xt 3 dt = 0 = 1 ?? 5

• Integration by parts: ∂ t e − t − xt 3 = − (1 + 3 xt 2 ) e − t − xt 3 . Hence te − t − xt 3 � ∞ � 0 = 0 � ∞ t (1 + 3 xt 2 ) e − t − xt 3 . = A ( x ) − 0 6

• Similarly, if you work hard and work out ( − 9 x 2 t 4 + 3 xt 2 + 6 xt − 1) e − t − xt 3 � ∞ � 1 = 0 by integration by parts ( how did you get this? ), you obtain the answer: � ∞ (27 x 3 t 6 − 54 x 2 t 3 + 6 x + 1) e − t − xt 3 dt = 1 . 0 Believe and try • ( − 9 x 2 t 4 + 3 xt 2 + 6 xt − 1) � + (9 x 2 t 4 − 3 xt 2 − 6 xt + 1)(1 + 3 xt 2 ) = ( − 36 x 2 t 3 + 6 xt + 6 x ) + (9 x 2 t 4 − 3 xt 2 − 6 xt + 1) + (27 x 3 t 6 − 9 x 2 t 4 − 18 x 2 t 3 + 3 xt 2 ) = 27 x 3 t 6 − 54 x 2 t 3 + 6 x + 1 ( OK! ) 7

• Actually Prof. Takayama (or one of his students) did this by computer, asking us to do this by hand( !) ← Integration algorithm for D -modules based on Gr¨ obner basis for D -modules 8

Is this exercise related to statistics? • Change the notation and let � ∞ e − x − θx 3 dx. A ( θ ) = 0 • Let 1 A ( θ ) e − x − θx 3 , f ( x ; θ ) = x, θ > 0 . • This is an exponential family with the su ffi cient statistic T ( x ) = x 3 . (We can absorb e − x into the base measure dx .) 9

• Therefore we are evaluating the normalizing constant and its derivatives of an exponential family. • We now know 1 = 27 θ 3 A �� ( θ ) + 54 θ 2 A � ( θ ) + (6 θ + 1) A ( θ ) . • Hence the Fisher information A �� ( θ ) is automatically obtained from A ( θ ) and A � ( θ ) . 10

• Suppose that we have independent observations x 1 , . . . , x n from f ( x ; θ ) . • Then the log likelihood � ( θ ) is written as n � x 3 � ( θ ) = − θ i − n log A ( θ ) i =1 and the likelihood equation is n i − nA � ( θ ) � x 3 0 = − A ( θ ) . i =1 • Can we numerically evaluate A ( θ ) and A � ( θ ) ? (Also A �� ( θ ) for Newton-Raphson?) 11

• For illustration, we use simple linear approximation (Euler method for numerical integration). A ( θ + ∆ θ ) . = A ( θ ) + ∆ θA � ( θ ) A � ( θ + ∆ θ ) . = A � ( θ ) + ∆ θA �� ( θ ) . • But from the differential equation we know 1 A �� ( θ ) = 27 θ 3 (1 − (6 θ + 1) A ( θ ) − 54 θ 2 A � ( θ )) . 12

• Punch line: if you keep numerical values of A ( θ ) , A � ( θ ) at one point θ , then you can compute these values at nearby θ + ∆ θ . • At each point, higher-order derivatives A �� ( θ ) , A �� ( θ ) , . . . , can be computed as needed. • Hence by numerically solving ODE, you can compute A ( θ ) and its derivatives at any point → “Holonomic Gradient Method” • For explanation we used Euler method, but in our actual implementation we use Runge-Kutta method to solve ODE. 13

Holonomic function and holonomic gradient method (HGM) Univariate homogeneous case: • A smooth function f is holonomic if f satisfies the following ODE 0 = h k ( x ) f ( k ) ( x ) + · · · + h 1 ( x ) f � ( x ) + h 0 ( x ) f ( x ) , where h k ( x ) , . . . , h 1 ( x ) , h 0 ( x ) are rational functions of x . 14

• Approximation   f ( x + ∆ x )   f � ( x + ∆ x )     .   .   .     f ( k − 1) ( x + ∆ x )     0 1 0 . . . 0   f ( x )     0 0 1 . . . 0       f � ( x )       .       = I k + ∆ x . . . .       .       .       0 0 . . . 0 1       f ( k − 1) ( x )     − h k − 1 ( x ) − h 0 ( x ) − h 1 ( x ) . . . h k ( x ) h k ( x ) h k ( x ) 15

Multivariate case (for simplicity let dim = 2 ) • A smooth f ( x, y ) is holonomic if for each of x and y , fixing the other variable arbitrarily, we have a holonomic function in x and y . • Namely, if there exist rational functions h 1 0 ( x, y ) , . . . , h 1 k 1 ( x, y ) , h 2 0 ( x, y ) , . . . , h 2 k 2 ( x, y ) in x, y such that k 1 � h 1 i ( x, y ) ∂ i x f ( x, y ) = 0 , i =0 k 2 � h 2 i ( x, y ) ∂ i y f ( x, y ) = 0 . i =0 16

• Consider ∂ r 1 x ∂ r 2 y f ( x, y ) . If r 1 ≥ k 1 or r 2 ≥ k 2 , we can always compute this by recursively applying the differential equations. • As in the univariate case, if we keep numerical values of ∂ i x ∂ j y f ( x, y ) in the range i = 0 , . . . , k 1 − 1 , j = 0 , . . . , k 2 − 1 , then we can always compute other higher-order derivatives. • We can also approximate ∂ i x ∂ j y f ( x + ∆ x, y + ∆ y ) by the values { ∂ i x ∂ j y f ( x, y ) } i =0 ,...,k 1 − 1 ,j =0 ,...,k 2 − 1 . 17

• We usually only need to keep a subset of { ∂ i x ∂ j y f ( x, y ) } i =0 ,...,k 1 − 1 ,j =0 ,...,k 2 − 1 in memory. The subset is given by the set of standard monomials obtained by the division algorithm based on a Gr¨ obner basis. 18

Which functions are holonomic (Zeilberger(1990))? • Polynomials and rational functions are holonomic. • exp( rational ) , log( rational ) are holonomic. • f, g : holonomic ⇒ f + g, f × g : holonomic • f ( x 1 , . . . , x m ) : holonomic � ⇒ f ( x 1 , . . . , x m ) dx m : holonomic • Restriction of a holonomic f to an a ffi ne subspace is holonomic. 19

• Holonomocity is also defined for generalized functions. • From the above properties, it is often easy to tell that a given function (such as the Airy-like function) is holonomic, i.e., it must satisfy a differential equation with rational function coe ffi cients. • The problem is to find the explicit form of the di ff erential equation. 20

Ex.2: Incomplete Gamma function • Consider incomplete Gamma function � x y α − 1 e − y dy, G ( x ) = α, x > 0 . 0 • From general theory, G ( x ) is holonomic. • We integrate in the opposite direction and change scale: � x � x (1 − y ( x − y ) α − 1 e − ( x − y ) dy = x α − 1 e − x x ) α − 1 e y dy 0 0 � 1 = x α e − x (1 − z ) α − 1 e xz dz 0 21

• Expand e xz into Taylor series (not clever?) � 1 ∞ 1 � G ( x ) = x α e − x k ! x k (1 − z ) α − 1 z k dz 0 k =0 ∞ Γ ( α ) Γ ( k + 1) � = x α e − x k ! Γ ( α + k + 1) x k k =0 ∞ = 1 (1) k � αx α e − x ( α + 1) k k ! x k k =0 ( a ) k = a ( a + 1) . . . ( a + k − 1) = 1 αx α e − x 1 F 1 (1; α + 1; x ) 22

• Confluent hypergeometric function 1 F 1 ∞ ( a ) k � ( c ) k k ! x k F = 1 F 1 ( a ; c ; x ) = k =0 • Differential equation (ODE) satisfied by F : xF �� ( x ) + ( c − x ) F � ( x ) − aF = 0 ( Comparison of coe ffi cients: a + k c + k k + ca + k c + k − k − a = 0) 23

• “Holonomic gradient method”: We obtain the value of F ( x + ∆ x ) from F ( x ) by solving the ODE. • If we know F � ( x ) , then we can do the following F ( x + ∆ x ) . = F ( x ) + ∆ xF � ( x ) • For the next step we also need to update F � ( x + ∆ x ) : F � ( x + ∆ x ) . = F � ( x ) + ∆ xF �� ( x ) 24

• Further we need F �� ( x ) , . . . . But note that at any x we have F �� ( x ) = − (( c − x ) F � ( x ) − aF ( x ))1 x. Hence we can keep only ( F ( x ) , F � ( x )) in memory. We can always compute higher-order derivatives from these two. • In summary, we can use the updating procedure      F ( x ) + ∆ xF � ( x )  F ( x )  . →  F � ( x ) F � ( x ) + ∆ xF �� ( x ) 25

An introduction to holonomic gradient method in statistics Akimichi - PowerPoint PPT Presentation

An introduction to holonomic gradient method in statistics Akimichi Takemura, Univ. Tokyo September 5, 2012 Items 1. First example: Airy-like function 2. Holonomic function and holonomic gradient method (HGM) 3. Another example: incomplete

Gr obner Bases and Holonomic Gradient Method Evaluation of A -Hypergeometric Polynomials .

Holonomic D -Modules, the Dixmie Conjecture and the Jacobian Conjecture Vladimir Bavula

Gradient Analysis NMDS Indirect Gradient Analysis NMDS Direct Gradient Analysis Objective:

Conjugate Gradient (CG) Majid Lesani Alireza Masoum Overview Backpropagation Gradient

Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh COMP 551 (Fall 2020)

How to use Gradient and Multi-Texture 1. Many situations, we need use the gradient texture for our

CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient

CS 6316 Machine Learning Gradient Descent Yangfeng Ji Department of Computer Science University

Riemann-Hilbert correspondence for irregular holonomic D -modules (joint work with Masaki

DD-finite functions in Sage 2 Using the package Computing beyond holonomic functions 3 Computing

Non-holonomic Planning Jane Li Assistant Professor Mechanical Engineering & Robotics

Weakly-unambiguous Parikh automata and their link with holonomic series Alin Bostan, Arnaud

New results on asymptotics of holonomic sequences Cyril Banderier CNRS, LIPN, Paris XIII

The Scientific Method The Scientific Method The Scientific Method involves 6 steps: Problem

Thermodynamics of 2+1 flavor QCD with the SF t X method based on the gradient flow SF t X method :

Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of

Computer Architecture Summer 2020 Pipelining Tyler Bletsch Duke University Includes material

The mixed Higgs- R 2 inflationary model Alexei A. Starobinsky Landau Institute for Theoretical

Unit 5: Pipelining Load-use stalling Pipelined multi-cycle operations Control hazards

Gravitational waves Part I ICTP, 18-22 June 2018 Literature Some figures in these lectures are

Roots of Polynomials Under Repeated Differentiation Stefan Steinerberger UCLA/Caltech, October

Twisted N = 4 Super Yang-Mills Theory in Background Katsushi Ito Tokyo Institute of

An action principle for Vasilievs 4D equations Nicolas Boulanger Universit e de Mons,

Structures on Doubled Geometry Cecilia Albertsson (YITP, Kyoto U) arXiv:0907.nnnn [hep-th]

An introduction to holonomic gradient method in statistics Akimichi - PowerPoint PPT Presentation

An introduction to holonomic gradient method in statistics Akimichi Takemura, Univ. Tokyo September 5, 2012 Items 1. First example: Airy-like function 2. Holonomic function and holonomic gradient method (HGM) 3. Another example: incomplete

Gr obner Bases and Holonomic Gradient Method Evaluation of A -Hypergeometric Polynomials .

Holonomic D -Modules, the Dixmie Conjecture and the Jacobian Conjecture Vladimir Bavula

Gradient Analysis NMDS Indirect Gradient Analysis NMDS Direct Gradient Analysis Objective:

Conjugate Gradient (CG) Majid Lesani Alireza Masoum Overview Backpropagation Gradient

Applied Machine Learning Gradient Descent Methods Siamak Ravanbakhsh COMP 551 (Fall 2020)

How to use Gradient and Multi-Texture 1. Many situations, we need use the gradient texture for our

CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient

CS 6316 Machine Learning Gradient Descent Yangfeng Ji Department of Computer Science University

Riemann-Hilbert correspondence for irregular holonomic D -modules (joint work with Masaki

DD-finite functions in Sage 2 Using the package Computing beyond holonomic functions 3 Computing

Non-holonomic Planning Jane Li Assistant Professor Mechanical Engineering &amp; Robotics

Weakly-unambiguous Parikh automata and their link with holonomic series Alin Bostan, Arnaud

New results on asymptotics of holonomic sequences Cyril Banderier CNRS, LIPN, Paris XIII

The Scientific Method The Scientific Method The Scientific Method involves 6 steps: Problem

Thermodynamics of 2+1 flavor QCD with the SF t X method based on the gradient flow SF t X method :

Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of

Computer Architecture Summer 2020 Pipelining Tyler Bletsch Duke University Includes material

The mixed Higgs- R 2 inflationary model Alexei A. Starobinsky Landau Institute for Theoretical

Unit 5: Pipelining Load-use stalling Pipelined multi-cycle operations Control hazards

Gravitational waves Part I ICTP, 18-22 June 2018 Literature Some figures in these lectures are

Roots of Polynomials Under Repeated Differentiation Stefan Steinerberger UCLA/Caltech, October

Twisted N = 4 Super Yang-Mills Theory in Background Katsushi Ito Tokyo Institute of

An action principle for Vasilievs 4D equations Nicolas Boulanger Universit e de Mons,

Structures on Doubled Geometry Cecilia Albertsson (YITP, Kyoto U) arXiv:0907.nnnn [hep-th]

Non-holonomic Planning Jane Li Assistant Professor Mechanical Engineering & Robotics