Automatic Differentiation of programs and its applications to Scientific Computing Laurent Hasco¨ et Laurent.Hascoet@sophia.inria.fr http://www-sop.inria.fr/tropics Manchester, april 7 th , 2005 Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 1 / 67
Outline Introduction 1 Formalization 2 ......... Multi-directional 3 Reverse AD 4 ......... Alternative formalizations 5 Reverse AD for Optimization 6 Performance issues 7 ......... Static Analyses in AD tools 8 Some AD Tools 9 10 Validation 11 ......... Expert-level AD 12 Conclusion Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 2 / 67
So you need derivatives ?... Given a program P computing a function F R m R n F : I → I X �→ Y we want to build a program that computes the derivatives of F . Specifically, we want the derivatives of the dependent, i.e. some variables in Y , with respect to the independent, i.e. some variables in X . Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 3 / 67
Which derivatives do you want? Derivatives come in various shapes and flavors: � � ∂ y j Jacobian Matrices: J = ∂ x i Directional or tangent derivatives, differentials: dY = ˙ Y = J × dX = J × ˙ X Gradients: � � ∂ y When n = 1 output : gradient = J = ∂ x i t × J When n > 1 outputs: gradient = Y Higher-order derivative tensors Taylor coefficients Intervals ? Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 4 / 67
Divided Differences Given ˙ X , run P twice, and compute ˙ Y Y = P ( X + ε ˙ X ) − P ( X ) ˙ ε Pros: immediate; no thinking required ! Cons: approximation; what ε ? ⇒ Not so cheap after all ! Optimization wants inexpensive and accurate derivatives. ⇒ Let’s go for exact, analytic derivatives ! Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 5 / 67
AD Example: analytic tangent differentiation by Program transformation SUBROUTINE FOO(v1, v2, v4, p1) REAL v1,v2,v3,v4,p1 v3 = 2.0*v1 + 5.0 v4 = v3 + p1*v2/v3 END Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 6 / 67
AD Example: analytic tangent differentiation by Program transformation SUBROUTINE FOO(v1, v2, v4, p1) REAL v1,v2,v3,v4,p1 v3d = 2.0*v1d v3 = 2.0*v1 + 5.0 v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3) v4 = v3 + p1*v2/v3 END Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 6 / 67
AD Example: analytic tangent differentiation by Program transformation • SUBROUTINE FOO(v1, v1d, v2, v2d, v4, v4d, p1) REAL v1d,v2d,v3d,v4d REAL v1,v2,v3,v4,p1 v3d = 2.0*v1d v3 = 2.0*v1 + 5.0 v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3) v4 = v3 + p1*v2/v3 END Just inserts “differentiated instructions” into FOO Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 6 / 67
Outline Introduction 1 Formalization 2 ......... Multi-directional 3 Reverse AD 4 ......... Alternative formalizations 5 Reverse AD for Optimization 6 Performance issues 7 ......... Static Analyses in AD tools 8 Some AD Tools 9 10 Validation 11 ......... Expert-level AD 12 Conclusion Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 7 / 67
Take control away! We differentiate programs. But control ⇒ non-differentiability! Freeze the current control: ⇒ the program becomes a simple sequence of instructions ⇒ AD differentiates these sequences: CodeList 1 Diff(CodeList 1) Control 1: Control 1 CodeList 2 Diff(CodeList 2) Program Diff(Program) Control N: Control N CodeList N Diff(CodeList N) Caution: the program is only piecewise differentiable ! Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 8 / 67
Computer Programs as Functions Identify sequences of instructions { I 1 ; I 2 ; . . . I p − 1 ; I p ; } with composition of functions. Each simple instruction I k : v4 = v3 + v2/v3 R q → I R q where is a function f k : I The output v4 is built from the input v2 and v3 All other variable are passed unchanged Thus we see P : { I 1 ; I 2 ; . . . I p − 1 ; I p ; } as f = f p ◦ f p − 1 ◦ · · · ◦ f 1 Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 9 / 67
Using the Chain Rule f = f p ◦ f p − 1 ◦ · · · ◦ f 1 We define for short: W 0 = X and W k = f k ( W k − 1 ) The chain rule yields: f ′ ( X ) = f ′ p ( W p − 1 ) . f ′ p − 1 ( W p − 2 ) . . . . . f ′ 1 ( W 0 ) Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 10 / 67
Tangent mode and Reverse mode Full J is expensive and often useless. We’d better compute useful projections of J. tangent AD : Y = f ′ ( X ) . ˙ ˙ 1 ( W 0 ) . ˙ X = f ′ p ( W p − 1 ) . f ′ p − 1 ( W p − 2 ) . . . f ′ X reverse AD : X = f ′ t ( X ) . Y = f ′ t 1 ( W 0 ) . . . . f ′ t p − 1 ( W p − 2 ) . f ′ t p ( W p − 1 ) . Y Evaluate both from right to left: ⇒ always matrix × vector Theoretical cost is about 4 times the cost of P Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 11 / 67
Costs of Tangent and Reverse AD [ ( ) R m → I R n F : I m inputs ] Gradient n outputs Tangent J costs m ∗ 4 ∗ P using the tangent mode Good if m < = n J costs n ∗ 4 ∗ P using the reverse mode Good if m >> n (e.g n = 1 in optimization) Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 12 / 67
Back to the Tangent Mode example v3 = 2.0*v1 + 5.0 v4 = v3 + p1*v2/v3 Elementary Jacobian matrices: 1 1 1 1 f ′ ( X ) = ... ... 1 2 0 1 − p 1 ∗ v 2 p 1 0 0 1 v 2 v 3 3 v 3 = 2 ∗ ˙ ˙ v 1 v 3 ∗ (1 − p 1 ∗ v 2 / v 2 v 4 = ˙ ˙ 3 ) + ˙ v 2 ∗ p 1 / v 3 Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 13 / 67
Tangent Mode example continued Tangent AD keeps the structure of P : ... v3d = 2.0*v1d v3 = 2.0*v1 + 5.0 v4d = v3d*(1-p1*v2/(v3*v3)) + v2d*p1/v3 v4 = v3 + p1*v2/v3 ... Differentiated instructions inserted into P ’s original control flow. Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 14 / 67
Outline Introduction 1 Formalization 2 ......... Multi-directional 3 Reverse AD 4 ......... Alternative formalizations 5 Reverse AD for Optimization 6 Performance issues 7 ......... Static Analyses in AD tools 8 Some AD Tools 9 10 Validation 11 ......... Expert-level AD 12 Conclusion Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 15 / 67
Multi-directional mode and Jacobians If you want ˙ Y = f ′ ( X ) . ˙ X for the same X and several ˙ X either run the tangent differentiated program several times, evaluating f several times. or run a “Multi-directional” tangent once, evaluating f once. Same for X = f ′ t ( X ) . Y for several Y . In particular, multi-directional tangent or reverse is good to get the full Jacobian. Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 16 / 67
Sparse Jacobians with seed matrices When sparse Jacobian, use “seed matrices” to propagate fewer ˙ X or Y Multi-directional tangent mode: a b 1 a b c 1 c × = d 1 d e f g 1 e f g Multi-directional reverse mode: a b � 1 � a � � 1 c c b × = 1 1 d e f d g e f g Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 17 / 67
Outline Introduction 1 Formalization 2 ......... Multi-directional 3 Reverse AD 4 ......... Alternative formalizations 5 Reverse AD for Optimization 6 Performance issues 7 ......... Static Analyses in AD tools 8 Some AD Tools 9 10 Validation 11 ......... Expert-level AD 12 Conclusion Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 18 / 67
Focus on the Reverse mode X = f ′ t ( X ) . Y = f ′ t 1 ( W 0 ) . . . f ′ t p ( W p − 1 ) . Y I p − 1 ; W = Y ; W = f ′ t p ( W p − 1 ) * W ; Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 19 / 67
Focus on the Reverse mode X = f ′ t ( X ) . Y = f ′ t 1 ( W 0 ) . . . f ′ t p ( W p − 1 ) . Y I p − 2 ; I p − 1 ; W = Y ; W = f ′ t p ( W p − 1 ) * W ; Restore W p − 2 before I p − 2 ; W = f ′ t p − 1 ( W p − 2 ) * W ; Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 19 / 67
Focus on the Reverse mode X = f ′ t ( X ) . Y = f ′ t 1 ( W 0 ) . . . f ′ t p ( W p − 1 ) . Y I 1 ; ... I p − 2 ; I p − 1 ; W = Y ; W = f ′ t p ( W p − 1 ) * W ; Restore W p − 2 before I p − 2 ; W = f ′ t p − 1 ( W p − 2 ) * W ; ... Restore W 0 before I 1 ; W = f ′ t 1 ( W 0 ) * W ; X = W ; Instructions differentiated in the reverse order ! Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 19 / 67
Reverse mode: graphical interpretation I I I I I 1 2 3 p-2 p-1 I p I time p-1 I I 2 1 Bottleneck: memory usage (“Tape”). Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 20 / 67
Back to the example v3 = 2.0*v1 + 5.0 v4 = v3 + p1*v2/v3 Transposed Jacobian matrices: 1 0 1 2 p 1 1 1 f ′ t ( X ) = ... v 3 ... 1 1 − p 1 ∗ v 2 0 v 2 3 1 0 v 2 = v 2 + v 4 ∗ p 1 / v 3 ... v 1 = v 1 + 2 ∗ v 3 v 3 = 0 Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 21 / 67
Recommend
More recommend