Derivative Evaluation by Automatic Differentiation of Programs

Laurent Hascoët
Laurent.Hascoet@sophia.inria.fr
http://www-sop.inria.fr/tropics

CEA-EDF-INRIA Summer School, July 2005
Outline

1. Introduction
2. Formalization
3. Reverse AD
4. Alternative formalizations
5. Memory issues in Reverse AD: Checkpointing
6. Multi-directional
7. Reverse AD for Optimization
8. AD for Sensitivity to Uncertainties
9. Some AD Tools
10. Static Analyses in AD tools
11. The TAPENADE AD tool
12. Validation of AD results
13. Expert-level AD
14. Conclusion
So you need derivatives?...

Given a program P computing a function F:

  $$F : \mathbb{R}^m \to \mathbb{R}^n, \quad X \mapsto Y$$

we want to build a program that computes the derivatives of F.
Specifically, we want the derivatives of the dependent (i.e. some variables in Y) with respect to the independent (i.e. some variables in X).
Which derivatives do you want?

Derivatives come in various shapes and flavors:
- Jacobian matrices: $J = \left(\frac{\partial y_j}{\partial x_i}\right)$
- Directional or tangent derivatives, differentials: $dY = \dot Y = J \times dX = J \times \dot X$
- Gradients:
  - when n = 1 output: gradient $= J = \left(\frac{\partial y}{\partial x_i}\right)$
  - when n > 1 outputs: gradient $= \overline{Y}^{\,t} \times J$
- Higher-order derivative tensors
- Taylor coefficients
- Intervals?
Divided Differences

Given $\dot X$, run P twice, and compute $\dot Y$:

  $$\dot Y = \frac{P(X + \varepsilon \dot X) - P(X)}{\varepsilon}$$

Pros: immediate; no thinking required!
Cons: approximation; what $\varepsilon$?
⇒ Not so cheap after all!
Most applications require inexpensive and accurate derivatives.
⇒ Let's go for exact, analytic derivatives!
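A minimal sketch of the step-size dilemma (a hypothetical cubic f, not from the slides): with divided differences, a large $\varepsilon$ gives truncation error and a small $\varepsilon$ gives floating-point cancellation, so no single $\varepsilon$ is safe.

  ! Hypothetical illustration: divided differences vs. the exact
  ! derivative of f(x) = x**3. The error first shrinks with eps
  ! (truncation), then grows again (round-off cancellation).
  PROGRAM DIVDIFF
    IMPLICIT NONE
    DOUBLE PRECISION :: x, eps, dd, exact
    INTEGER :: k
    x = 1.0D0
    exact = 3.0D0*x*x           ! analytic derivative of x**3
    eps = 1.0D0
    DO k = 1, 15
      eps = eps / 10.0D0
      dd = (f(x + eps) - f(x)) / eps
      PRINT '(A,ES9.2,A,ES10.3)', ' eps=', eps, '  error=', ABS(dd - exact)
    END DO
  CONTAINS
    DOUBLE PRECISION FUNCTION f(x)
      DOUBLE PRECISION, INTENT(IN) :: x
      f = x**3
    END FUNCTION f
  END PROGRAM DIVDIFF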
Automatic Differentiation

Augment program P to make it compute the analytic derivatives.

P:
  a = b*T(10) + c

The differentiated program must somehow compute:

P':
  da = db*T(10) + b*dT(10) + dc

How can we achieve this?
- AD by Overloading
- AD by Program transformation
AD by Overloading

Tools: ADOL-C, ADTAGEO, ...
Few manipulations required: DOUBLE → ADOUBLE; link with the provided overloaded +, -, *, ...
Easy extension to higher-order derivatives, Taylor series, intervals, ... but not so easy for gradients.
Anecdote: switching real → complex (carrying dx in the imaginary part) almost works:

  x = a*b → (x, dx) = (a*b - da*db, a*db + da*b)

Note the spurious -da*db term polluting the value.
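The overloading idea in a minimal dual-number sketch (hypothetical Fortran 90 types and operators, not the actual API of ADOL-C or any tool): each value carries its tangent, and overloaded arithmetic applies the exact product rule, with no spurious -da*db term.

  ! Hypothetical dual-number module: AD by overloading in miniature.
  MODULE DUAL_MOD
    IMPLICIT NONE
    TYPE DUAL
      DOUBLE PRECISION :: v   ! value
      DOUBLE PRECISION :: d   ! tangent derivative
    END TYPE DUAL
    INTERFACE OPERATOR(*)
      MODULE PROCEDURE DUAL_MUL
    END INTERFACE
    INTERFACE OPERATOR(+)
      MODULE PROCEDURE DUAL_ADD
    END INTERFACE
  CONTAINS
    FUNCTION DUAL_MUL(a, b) RESULT(c)
      TYPE(DUAL), INTENT(IN) :: a, b
      TYPE(DUAL) :: c
      c%v = a%v * b%v
      c%d = a%d*b%v + a%v*b%d   ! exact product rule
    END FUNCTION DUAL_MUL
    FUNCTION DUAL_ADD(a, b) RESULT(c)
      TYPE(DUAL), INTENT(IN) :: a, b
      TYPE(DUAL) :: c
      c%v = a%v + b%v
      c%d = a%d + b%d
    END FUNCTION DUAL_ADD
  END MODULE DUAL_MOD

  PROGRAM DEMO
    USE DUAL_MOD
    IMPLICIT NONE
    TYPE(DUAL) :: a, b, x
    a = DUAL(2.0D0, 1.0D0)   ! seed da = 1: differentiate w.r.t. a
    b = DUAL(3.0D0, 0.0D0)
    x = a*b + b
    PRINT *, 'x =', x%v, ' dx/da =', x%d   ! prints 9.0 and 3.0
  END PROGRAM DEMO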
AD by Program Transformation

Tools: ADIFOR, TAF, TAPENADE, ...
Complex transformation required: build a new program that computes the analytic derivatives explicitly.
Requires a sophisticated, compiler-like tool:
1. PARSING
2. ANALYSIS
3. DIFFERENTIATION
4. REGENERATION
Overloading vs Transformation

Overloading is versatile; transformed programs are efficient:
- Global program analyses are possible, and most welcome!
- The compiler can optimize the generated program.
Example: Tangent Differentiation by Program Transformation

  SUBROUTINE FOO(v1, v2, v4, p1)
    REAL v1,v2,v3,v4,p1
    v3 = 2.0*v1 + 5.0
    v4 = v3 + p1*v2/v3
  END
Example: Tangent Differentiation by Program Transformation
(derivative statements inserted)

  SUBROUTINE FOO(v1, v2, v4, p1)
    REAL v1,v2,v3,v4,p1
    v3d = 2.0*v1d
    v3 = 2.0*v1 + 5.0
    v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3)
    v4 = v3 + p1*v2/v3
  END
Example: Tangent Differentiation by Program Transformation
(new arguments and declarations added)

  SUBROUTINE FOO(v1, v1d, v2, v2d, v4, v4d, p1)
    REAL v1d,v2d,v3d,v4d
    REAL v1,v2,v3,v4,p1
    v3d = 2.0*v1d
    v3 = 2.0*v1 + 5.0
    v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3)
    v4 = v3 + p1*v2/v3
  END

Just inserts "differentiated instructions" into FOO.
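A hypothetical driver (not part of the slides) shows how such a tangent routine is used: seed (v1d, v2d) with the input direction $\dot X$ and read the directional derivative off v4d.

  ! Hypothetical caller of the tangent-differentiated FOO.
  PROGRAM DRIVER
    REAL v1, v1d, v2, v2d, v4, v4d, p1
    v1 = 1.0
    v2 = 2.0
    p1 = 3.0
    ! Seed direction Xdot = (1, 0): derivative w.r.t. v1 only
    v1d = 1.0
    v2d = 0.0
    CALL FOO(v1, v1d, v2, v2d, v4, v4d, p1)
    PRINT *, 'v4 =', v4, ' dv4/dv1 =', v4d
  END PROGRAM DRIVER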
Outline

1. Introduction
2. Formalization
3. Reverse AD
4. Alternative formalizations
5. Memory issues in Reverse AD: Checkpointing
6. Multi-directional
7. Reverse AD for Optimization
8. AD for Sensitivity to Uncertainties
9. Some AD Tools
10. Static Analyses in AD tools
11. The TAPENADE AD tool
12. Validation of AD results
13. Expert-level AD
14. Conclusion
Dealing with the Programs' Control

Programs contain control: discrete ⇒ non-differentiable.

  if (x <= 1.0)
    printf("x too small");
  else {
    y = 1.0;
    while (y <= 10.0) {
      y = y*x;
      x = x+0.5;
    }
  }

Not differentiable at x = 1.0 (the test switches branches).
Not differentiable at x = 2.9221444... (the number of loop iterations changes).
Take Control Away!

We differentiate programs. But control ⇒ non-differentiability!
Freeze the current control: for one given control, the program becomes a simple list of instructions ⇒ differentiable:

  printf("x too small");

or

  y = 1.0; y = y*x; x = x+0.5; ...

AD differentiates these code lists, one per possible control:

  Control 1: CodeList 1 → Diff(CodeList 1)
  Control 2: CodeList 2 → Diff(CodeList 2)
  ...
  Control N: CodeList N → Diff(CodeList N)

and Diff(Program) reassembles the Diff(CodeList k) under the original control, as sketched below.

Caution: the program is only piecewise differentiable!
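A hedged sketch of that reassembly (recast in Fortran to match the other examples in this talk): tangent AD keeps the frozen control structure and inserts a derivative statement before each differentiable assignment, so each control path carries the derivative of its own code list.

  ! Hypothetical tangent differentiation of the control example.
  ! The if/while control is kept; derivative statements are inserted.
  IF (x .LE. 1.0) THEN
    PRINT *, 'x too small'
  ELSE
    yd = 0.0
    y = 1.0
    DO WHILE (y .LE. 10.0)
      yd = yd*x + y*xd   ! derivative of y = y*x
      y = y*x
      x = x + 0.5        ! xd unchanged: d(x+0.5)/dx = 1
    END DO
  END IF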
Computer Programs as Functions

Identify sequences of instructions

  { I_1; I_2; ... I_{p-1}; I_p; }

with composition of functions.
Each simple instruction I_k, e.g.

  v4 = v3 + v2/v3

is a function $f_k : \mathbb{R}^q \to \mathbb{R}^q$ where:
- the output v4 is built from the inputs v2 and v3,
- all other variables are passed unchanged.

Thus we see P: { I_1; I_2; ... I_{p-1}; I_p; } as

  $$f = f_p \circ f_{p-1} \circ \cdots \circ f_1$$
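For instance (a worked example, not on the original slide), if the variables in scope are $(v_2, v_3, v_4)$, the instruction above is the function

  $$f_k(v_2, v_3, v_4) = \left( v_2,\; v_3,\; v_3 + \frac{v_2}{v_3} \right)$$

which copies $v_2$ and $v_3$ through unchanged and overwrites $v_4$.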
Using the Chain Rule

We see program P as: $f = f_p \circ f_{p-1} \circ \cdots \circ f_1$

We define for short: $W_0 = X$ and $W_k = f_k(W_{k-1})$

The chain rule yields:

  $$f'(X) = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdots f'_1(W_0)$$
The Jacobian Program

  $$f'(X) = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdots f'_1(W_0)$$

translates immediately into a program that computes the Jacobian J:

  I_1;   /* W = f_1(W) */
  I_2;   /* W = f_2(W) */
  ...
  I_p;   /* W = f_p(W) */
The Jacobian Program

  $$f'(X) = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdots f'_1(W_0)$$

translates immediately into a program that computes the Jacobian J:

  W = X;
  J = f'_1(W);
  I_1;                /* W = f_1(W) */
  J = f'_2(W) * J;
  I_2;                /* W = f_2(W) */
  ...
  J = f'_p(W) * J;
  I_p;                /* W = f_p(W) */
  Y = W;
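A minimal sketch of this scheme on the FOO example (a hypothetical driver, assuming the state vector W = (v1, v2, v3, v4)): J starts as the identity and is multiplied by each elementary Jacobian $f'_k(W)$ just before instruction $I_k$ executes.

  ! Hypothetical full-Jacobian propagation for the two instructions
  ! of FOO. FP holds the elementary Jacobian f'_k(W).
  PROGRAM JACPROP
    IMPLICIT NONE
    REAL :: W(4), J(4,4), FP(4,4), p1
    INTEGER :: i
    p1 = 3.0
    W = (/ 1.0, 2.0, 0.0, 0.0 /)   ! (v1, v2, v3, v4)
    J = 0.0
    DO i = 1, 4
      J(i,i) = 1.0                 ! J starts as the identity
    END DO
    ! I1: v3 = 2.0*v1 + 5.0
    FP = 0.0
    FP(1,1) = 1.0; FP(2,2) = 1.0; FP(4,4) = 1.0
    FP(3,1) = 2.0                  ! dv3/dv1
    J = MATMUL(FP, J)
    W(3) = 2.0*W(1) + 5.0
    ! I2: v4 = v3 + p1*v2/v3
    FP = 0.0
    FP(1,1) = 1.0; FP(2,2) = 1.0; FP(3,3) = 1.0
    FP(4,2) = p1/W(3)                     ! dv4/dv2
    FP(4,3) = 1.0 - p1*W(2)/(W(3)*W(3))   ! dv4/dv3
    J = MATMUL(FP, J)
    W(4) = W(3) + p1*W(2)/W(3)
    PRINT *, 'dv4/dv1 =', J(4,1), ' dv4/dv2 =', J(4,2)
  END PROGRAM JACPROP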
Tangent Mode and Reverse Mode

The full J is expensive and often useless. We'd better compute useful projections of J.

Tangent AD:

  $$\dot Y = f'(X) \cdot \dot X = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdots f'_1(W_0) \cdot \dot X$$

Reverse AD:

  $$\overline X = f'^{\,t}(X) \cdot \overline Y = f'^{\,t}_1(W_0) \cdots f'^{\,t}_{p-1}(W_{p-2}) \cdot f'^{\,t}_p(W_{p-1}) \cdot \overline Y$$

Evaluate both from right to left ⇒ always matrix × vector products.
Theoretical cost is about 4 times the cost of P.
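To make the transposed product concrete, a hypothetical adjoint sweep for the FOO example (anticipating the Reverse AD section; the vb names for adjoint variables are an assumption, not taken from any tool's actual output): the transposed elementary Jacobians are applied from the last instruction back to the first.

  ! Hypothetical adjoint sweep for FOO, run after the forward sweep
  ! has computed v3. Seed v4b = 1.0 to obtain the gradient of v4.
  ! Adjoint of I2: v4 = v3 + p1*v2/v3
  v3b = v3b + v4b*(1.0 - p1*v2/(v3*v3))
  v2b = v2b + v4b*p1/v3
  v4b = 0.0
  ! Adjoint of I1: v3 = 2.0*v1 + 5.0
  v1b = v1b + 2.0*v3b
  v3b = 0.0
  ! On exit: (v1b, v2b) = gradient of v4 w.r.t. (v1, v2)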
Costs of Tangent and Reverse AD

$F : \mathbb{R}^m \to \mathbb{R}^n$, with m inputs and n outputs.

- The full J costs m * 4 * P using the tangent mode (one tangent run per input direction). Good if m <= n.
- The full J costs n * 4 * P using the reverse mode (one reverse run per output). Good if m >> n (e.g. n = 1 in optimization).

For example, a scalar cost function of m = 10^6 inputs: the gradient costs about 4 * P in reverse mode, versus about 4 * 10^6 * P column by column in tangent mode.
Back to the Tangent Mode Example

  v3 = 2.0*v1 + 5.0
  v4 = v3 + p1*v2/v3

Elementary Jacobian matrices, on the variables (v1, v2, v3, v4):

$$f'(X) = \cdots
\begin{pmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & \frac{p_1}{v_3} & 1 - \frac{p_1 v_2}{v_3^2} & 0
\end{pmatrix}
\begin{pmatrix}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
2 & 0 & 0 & 0\\
0 & 0 & 0 & 1
\end{pmatrix}
\cdots$$

$$\dot v_3 = 2\,\dot v_1$$
$$\dot v_4 = \dot v_3 \left(1 - \frac{p_1 v_2}{v_3^2}\right) + \dot v_2\,\frac{p_1}{v_3}$$
Tangent Mode Example, Continued

Tangent AD keeps the structure of P:

  ...
  v3d = 2.0*v1d
  v3 = 2.0*v1 + 5.0
  v4d = v3d*(1-p1*v2/(v3*v3)) + v2d*p1/v3
  v4 = v3 + p1*v2/v3
  ...

Differentiated instructions are inserted into P's original control flow.
Outline

1. Introduction
2. Formalization
3. Reverse AD
4. Alternative formalizations
5. Memory issues in Reverse AD: Checkpointing
6. Multi-directional
7. Reverse AD for Optimization
8. AD for Sensitivity to Uncertainties
9. Some AD Tools
10. Static Analyses in AD tools
11. The TAPENADE AD tool
12. Validation of AD results
13. Expert-level AD
14. Conclusion