Automatic Differentiation of programs and its applications to - PowerPoint PPT Presentation

Automatic Differentiation of programs and its applications to Scientific Computing Laurent Hasco¨ et Laurent.Hascoet@sophia.inria.fr http://www-sop.inria.fr/tropics Manchester, april 7 th , 2005 Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 1 / 67

Outline Introduction 1 Formalization 2 ......... Multi-directional 3 Reverse AD 4 ......... Alternative formalizations 5 Reverse AD for Optimization 6 Performance issues 7 ......... Static Analyses in AD tools 8 Some AD Tools 9 10 Validation 11 ......... Expert-level AD 12 Conclusion Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 2 / 67

So you need derivatives ?... Given a program P computing a function F R m R n F : I → I X �→ Y we want to build a program that computes the derivatives of F . Specifically, we want the derivatives of the dependent, i.e. some variables in Y , with respect to the independent, i.e. some variables in X . Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 3 / 67

Which derivatives do you want? Derivatives come in various shapes and flavors: � � ∂ y j Jacobian Matrices: J = ∂ x i Directional or tangent derivatives, differentials: dY = ˙ Y = J × dX = J × ˙ X Gradients: � � ∂ y When n = 1 output : gradient = J = ∂ x i t × J When n > 1 outputs: gradient = Y Higher-order derivative tensors Taylor coefficients Intervals ? Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 4 / 67

Divided Differences Given ˙ X , run P twice, and compute ˙ Y Y = P ( X + ε ˙ X ) − P ( X ) ˙ ε Pros: immediate; no thinking required ! Cons: approximation; what ε ? ⇒ Not so cheap after all ! Optimization wants inexpensive and accurate derivatives. ⇒ Let’s go for exact, analytic derivatives ! Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 5 / 67

AD Example: analytic tangent differentiation by Program transformation SUBROUTINE FOO(v1, v2, v4, p1) REAL v1,v2,v3,v4,p1 v3 = 2.0*v1 + 5.0 v4 = v3 + p1*v2/v3 END Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 6 / 67

AD Example: analytic tangent differentiation by Program transformation SUBROUTINE FOO(v1, v2, v4, p1) REAL v1,v2,v3,v4,p1 v3d = 2.0*v1d v3 = 2.0*v1 + 5.0 v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3) v4 = v3 + p1*v2/v3 END Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 6 / 67

AD Example: analytic tangent differentiation by Program transformation • SUBROUTINE FOO(v1, v1d, v2, v2d, v4, v4d, p1) REAL v1d,v2d,v3d,v4d REAL v1,v2,v3,v4,p1 v3d = 2.0*v1d v3 = 2.0*v1 + 5.0 v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3) v4 = v3 + p1*v2/v3 END Just inserts “differentiated instructions” into FOO Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 6 / 67

Take control away! We differentiate programs. But control ⇒ non-differentiability! Freeze the current control: ⇒ the program becomes a simple sequence of instructions ⇒ AD differentiates these sequences: CodeList 1 Diff(CodeList 1) Control 1: Control 1 CodeList 2 Diff(CodeList 2) Program Diff(Program) Control N: Control N CodeList N Diff(CodeList N) Caution: the program is only piecewise differentiable ! Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 8 / 67

Computer Programs as Functions Identify sequences of instructions { I 1 ; I 2 ; . . . I p − 1 ; I p ; } with composition of functions. Each simple instruction I k : v4 = v3 + v2/v3 R q → I R q where is a function f k : I The output v4 is built from the input v2 and v3 All other variable are passed unchanged Thus we see P : { I 1 ; I 2 ; . . . I p − 1 ; I p ; } as f = f p ◦ f p − 1 ◦ · · · ◦ f 1 Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 9 / 67

Using the Chain Rule f = f p ◦ f p − 1 ◦ · · · ◦ f 1 We define for short: W 0 = X and W k = f k ( W k − 1 ) The chain rule yields: f ′ ( X ) = f ′ p ( W p − 1 ) . f ′ p − 1 ( W p − 2 ) . . . . . f ′ 1 ( W 0 ) Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 10 / 67

Tangent mode and Reverse mode Full J is expensive and often useless. We’d better compute useful projections of J. tangent AD : Y = f ′ ( X ) . ˙ ˙ 1 ( W 0 ) . ˙ X = f ′ p ( W p − 1 ) . f ′ p − 1 ( W p − 2 ) . . . f ′ X reverse AD : X = f ′ t ( X ) . Y = f ′ t 1 ( W 0 ) . . . . f ′ t p − 1 ( W p − 2 ) . f ′ t p ( W p − 1 ) . Y Evaluate both from right to left: ⇒ always matrix × vector Theoretical cost is about 4 times the cost of P Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 11 / 67

Costs of Tangent and Reverse AD [ ( ) R m → I R n F : I m inputs ] Gradient n outputs Tangent J costs m ∗ 4 ∗ P using the tangent mode Good if m < = n J costs n ∗ 4 ∗ P using the reverse mode Good if m >> n (e.g n = 1 in optimization) Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 12 / 67

Back to the Tangent Mode example v3 = 2.0*v1 + 5.0 v4 = v3 + p1*v2/v3 Elementary Jacobian matrices:   1   1 1 1   f ′ ( X ) = ...    ...     1 2 0      1 − p 1 ∗ v 2 p 1 0 0 1 v 2 v 3 3 v 3 = 2 ∗ ˙ ˙ v 1 v 3 ∗ (1 − p 1 ∗ v 2 / v 2 v 4 = ˙ ˙ 3 ) + ˙ v 2 ∗ p 1 / v 3 Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 13 / 67

Tangent Mode example continued Tangent AD keeps the structure of P : ... v3d = 2.0*v1d v3 = 2.0*v1 + 5.0 v4d = v3d*(1-p1*v2/(v3*v3)) + v2d*p1/v3 v4 = v3 + p1*v2/v3 ... Differentiated instructions inserted into P ’s original control flow. Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 14 / 67

Multi-directional mode and Jacobians If you want ˙ Y = f ′ ( X ) . ˙ X for the same X and several ˙ X either run the tangent differentiated program several times, evaluating f several times. or run a “Multi-directional” tangent once, evaluating f once. Same for X = f ′ t ( X ) . Y for several Y . In particular, multi-directional tangent or reverse is good to get the full Jacobian. Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 16 / 67

Sparse Jacobians with seed matrices When sparse Jacobian, use “seed matrices” to propagate fewer ˙ X or Y Multi-directional tangent mode:       a b 1 a b c 1 c        ×  =       d 1 d     e f g 1 e f g Multi-directional reverse mode:   a b � 1 � a � � 1 c c b   ×  =   1 1 d e f d g  e f g Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 17 / 67

Focus on the Reverse mode X = f ′ t ( X ) . Y = f ′ t 1 ( W 0 ) . . . f ′ t p ( W p − 1 ) . Y I p − 1 ; W = Y ; W = f ′ t p ( W p − 1 ) * W ; Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 19 / 67

Focus on the Reverse mode X = f ′ t ( X ) . Y = f ′ t 1 ( W 0 ) . . . f ′ t p ( W p − 1 ) . Y I p − 2 ; I p − 1 ; W = Y ; W = f ′ t p ( W p − 1 ) * W ; Restore W p − 2 before I p − 2 ; W = f ′ t p − 1 ( W p − 2 ) * W ; Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 19 / 67

Focus on the Reverse mode X = f ′ t ( X ) . Y = f ′ t 1 ( W 0 ) . . . f ′ t p ( W p − 1 ) . Y I 1 ; ... I p − 2 ; I p − 1 ; W = Y ; W = f ′ t p ( W p − 1 ) * W ; Restore W p − 2 before I p − 2 ; W = f ′ t p − 1 ( W p − 2 ) * W ; ... Restore W 0 before I 1 ; W = f ′ t 1 ( W 0 ) * W ; X = W ; Instructions differentiated in the reverse order ! Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 19 / 67

Reverse mode: graphical interpretation I I I I I 1 2 3 p-2 p-1 I p I time p-1 I I 2 1 Bottleneck: memory usage (“Tape”). Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 20 / 67

Back to the example v3 = 2.0*v1 + 5.0 v4 = v3 + p1*v2/v3 Transposed Jacobian matrices:   1 0  1 2  p 1 1 1   f ′ t ( X ) = ...   v 3  ...   1 1 − p 1 ∗ v 2   0     v 2  3 1 0 v 2 = v 2 + v 4 ∗ p 1 / v 3 ... v 1 = v 1 + 2 ∗ v 3 v 3 = 0 Manchester, april 7 th , 2005 Laurent Hasco¨ et () AD 21 / 67

Automatic Differentiation of programs and its applications to - PowerPoint PPT Presentation

Automatic Differentiation of programs and its applications to Scientific Computing Laurent Hasco et Laurent.Hascoet@sophia.inria.fr http://www-sop.inria.fr/tropics Manchester, april 7 th , 2005 Manchester, april 7 th , 2005 Laurent Hasco

JUST THE MATHS SLIDES NUMBER 10.4 DIFFERENTIATION 4 (Products and quotients) &

Differentiation Differentiation stems from beliefs about differences among learners, how they

JUST THE MATHS SLIDES NUMBER 10.3 DIFFERENTIATION 3 (Elementary techniques of

4.4. Vertical Differentiation Matilde Machado Industrial Organization- Matilde Machado Vertical

Beautiful differentiation Conal Elliott LambdaPix 1 September, 2009 ICFP Conal Elliott

Automatic Verification of Automatic Verification of Automatic Verification of Automatic

CSC421/2516 Lecture 6: Automatic Differentiation Roger Grosse and Jimmy Ba Roger Grosse and

Automatic Differentiation Tools for FreeFem++ Workshop FreeFem++ Sylvain Auliac (

CSC321 Lecture 10: Automatic Differentiation Roger Grosse Roger Grosse CSC321 Lecture 10:

Vector Forward Mode Automatic Differentiation on SIMD/SIMT architectures Jan H uckelheim,

A Simply Typed -Calculus of Forward Automatic Differentiation Oleksandr Manzyuk National

0 Towards Polyhedral Automatic Differentiation uckelheim 1,2 Navjot Kukreja 1 Jan H December

CSC421/2516 Lecture 3: Automatic Differentiation & Distributed Representations Jimmy Ba

Automatic Differentiation by Program Transformation Laurent Hasco et INRIA Sophia-Antipolis,

Derivative Evaluation by Automatic Differentiation of Programs Laurent Hasco et

Multiple Programs How do programs communicate? 1 Multiple Programs How do programs communicate?

Virtual Elements for the Stokes problem L. Beiro da Veiga in collaboration with: P .

Design of the LBNF Beamline Jim Hylen, for the DUNE collaboration Fermilab

Computational Geometry and Computer Algebra Sylvain Lazard INRIA Nancy Joint work with C.

Protocol Stack and Layers Summary CSC 249 March 29, 2018 1 Chapter 6: Summary q Data link

Adjoint approach to optimization using automatic differentiation (AD) Praveen. C

Power law networks Social and Technological Networks Rik Sarkar University of Edinburgh, 2019.

Big Data Analytics: What is Big Data? H. Andrew Schwartz Stony Brook University CSE545, Fall

C++ C+ + is an object-oriented extension of C C was designed by Dennis Ritchie at Bell

Automatic Differentiation of programs and its applications to - PowerPoint PPT Presentation

Automatic Differentiation of programs and its applications to Scientific Computing Laurent Hasco et Laurent.Hascoet@sophia.inria.fr http://www-sop.inria.fr/tropics Manchester, april 7 th , 2005 Manchester, april 7 th , 2005 Laurent Hasco

JUST THE MATHS SLIDES NUMBER 10.4 DIFFERENTIATION 4 (Products and quotients) &amp;

Differentiation Differentiation stems from beliefs about differences among learners, how they

JUST THE MATHS SLIDES NUMBER 10.3 DIFFERENTIATION 3 (Elementary techniques of

4.4. Vertical Differentiation Matilde Machado Industrial Organization- Matilde Machado Vertical

Beautiful differentiation Conal Elliott LambdaPix 1 September, 2009 ICFP Conal Elliott

Automatic Verification of Automatic Verification of Automatic Verification of Automatic

CSC421/2516 Lecture 6: Automatic Differentiation Roger Grosse and Jimmy Ba Roger Grosse and

Automatic Differentiation Tools for FreeFem++ Workshop FreeFem++ Sylvain Auliac (

CSC321 Lecture 10: Automatic Differentiation Roger Grosse Roger Grosse CSC321 Lecture 10:

Vector Forward Mode Automatic Differentiation on SIMD/SIMT architectures Jan H uckelheim,

A Simply Typed -Calculus of Forward Automatic Differentiation Oleksandr Manzyuk National

0 Towards Polyhedral Automatic Differentiation uckelheim 1,2 Navjot Kukreja 1 Jan H December

CSC421/2516 Lecture 3: Automatic Differentiation &amp; Distributed Representations Jimmy Ba

Automatic Differentiation by Program Transformation Laurent Hasco et INRIA Sophia-Antipolis,

Derivative Evaluation by Automatic Differentiation of Programs Laurent Hasco et

Multiple Programs How do programs communicate? 1 Multiple Programs How do programs communicate?

Virtual Elements for the Stokes problem L. Beiro da Veiga in collaboration with: P .

Design of the LBNF Beamline Jim Hylen, for the DUNE collaboration Fermilab

Computational Geometry and Computer Algebra Sylvain Lazard INRIA Nancy Joint work with C.

Protocol Stack and Layers Summary CSC 249 March 29, 2018 1 Chapter 6: Summary q Data link

Adjoint approach to optimization using automatic differentiation (AD) Praveen. C

Power law networks Social and Technological Networks Rik Sarkar University of Edinburgh, 2019.

Big Data Analytics: What is Big Data? H. Andrew Schwartz Stony Brook University CSE545, Fall

C++ C+ + is an object-oriented extension of C C was designed by Dennis Ritchie at Bell

JUST THE MATHS SLIDES NUMBER 10.4 DIFFERENTIATION 4 (Products and quotients) &

CSC421/2516 Lecture 3: Automatic Differentiation & Distributed Representations Jimmy Ba