Differential Programming
Gabriel Peyré, École Normale Supérieure
www.numerical-tours.com
https://mathematical-coffees.github.io (organized by Mérouane Debbah & Gabriel Peyré)
Seminar topics: optimization, deep learning, optimal transport, quantum computing, compressed sensing, artificial intelligence, mean field games, topos.
Speakers: Alexandre Gramfort (INRIA), Yves Achdou (Paris 6), Frédéric Magniez (CNRS and Paris 7), Olivier Grisel (INRIA), Daniel Bennequin (Paris 7), Edouard Oyallon (CentraleSupelec), Olivier Guéant (Paris 1), Marco Cuturi (ENSAE), Gabriel Peyré (CNRS and ENS), Iordanis Kerenidis (CNRS and Paris 7), Jalal Fadili (ENSICaen), Joris Van den Bossche (INRIA), Guillaume Lecué (CNRS and ENSAE).
Model Fitting in Data Sciences

    min_θ E(θ) ≝ L(f(x, θ), y)

with x the input, θ the parameter, f the model, y the output and L the loss.

Deep learning: f(·, θ) maps the input x to class probabilities y; the parameters θ = (θ_1, θ_2, θ_3, θ_4, ...) are the weights of the successive layers.
Super-resolution: y is the observation, θ the unknown image, and f(x, ·) the degradation operator.
Medical imaging registration: f(·, θ) is a diffeomorphism parameterized by θ, matching an image x onto an image y.
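A minimal sketch (not from the slides) of the generic fitting problem min_θ E(θ) = L(f(x, θ), y), instantiated with an illustrative linear model and squared-error loss; the names f, L, E and the data shapes are assumptions made for the example.

```python
# Sketch of the generic model-fitting objective E(theta) = L(f(x, theta), y),
# instantiated here with a linear model and a squared-error loss.
import numpy as np

def f(x, theta):
    """Model: here simply a linear map parameterized by theta."""
    return x @ theta

def L(prediction, y):
    """Loss: squared error between the model output and the target."""
    return 0.5 * np.sum((prediction - y) ** 2)

def E(theta, x, y):
    """Objective to be minimized over the parameter theta."""
    return L(f(x, theta), y)

x = np.random.randn(50, 3)      # input
theta = np.random.randn(3)      # parameter
y = np.random.randn(50)         # target output
print(E(theta, x, y))
```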
Gradient-based Methods

    min_θ E(θ) ≝ L(f(x, θ), y)

Gradient descent:  θ^(ℓ+1) = θ^(ℓ) − τ_ℓ ∇E(θ^(ℓ))

[Figure: behaviour of the iterates as a function of ℓ for a small step size τ_ℓ, a large τ_ℓ, and the optimal τ_ℓ = τ*.]

Many generalizations: Nesterov / heavy-ball, (quasi-)Newton, stochastic / incremental methods, proximal splitting (for non-smooth E), ...
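A short sketch of the gradient descent iteration θ^(ℓ+1) = θ^(ℓ) − τ_ℓ ∇E(θ^(ℓ)), reusing the least-squares objective of the previous sketch; the closed-form gradient and the step-size choice are illustrative assumptions.

```python
# Gradient descent on E(theta) = 0.5 * ||x @ theta - y||^2.
import numpy as np

def grad_E(theta, x, y):
    """Gradient of the least-squares objective."""
    return x.T @ (x @ theta - y)

x = np.random.randn(50, 3)
y = np.random.randn(50)
theta = np.zeros(3)
tau = 1.0 / np.linalg.norm(x, 2) ** 2   # constant step size 1/L, L = Lipschitz constant of grad E

for it in range(200):
    theta = theta - tau * grad_E(theta, x, y)
```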
The Complexity of Gradient Computation

Setup: E : ℝ^n → ℝ computable in K operations.
Hypothesis: elementary operations (a × b, log(a), √a, ...) and their derivatives cost O(1).

Question: what is the complexity of computing ∇E : ℝ^n → ℝ^n ?

Finite differences:
    ∇E(θ) ≈ (1/ε) ( E(θ + εδ_1) − E(θ), ..., E(θ + εδ_n) − E(θ) )
K(n + 1) operations, intractable for large n.

Theorem [Seppo Linnainmaa, 1970]: there is an algorithm to compute ∇E in O(K) operations.
This algorithm is reverse-mode automatic differentiation.
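A sketch of the finite-difference approximation, which makes the K(n + 1) cost explicit: one evaluation of E at θ plus one per coordinate direction δ_i. The test function E below is an arbitrary illustrative choice.

```python
# Finite-difference approximation of grad E: n + 1 evaluations of E,
# hence roughly K * (n + 1) operations if one evaluation costs K.
import numpy as np

def finite_diff_grad(E, theta, eps=1e-6):
    n = theta.size
    E0 = E(theta)                       # 1 evaluation
    g = np.zeros(n)
    for i in range(n):                  # n further evaluations
        delta = np.zeros(n)
        delta[i] = eps                  # eps * delta_i
        g[i] = (E(theta + delta) - E0) / eps
    return g

E = lambda th: np.sum(th ** 2) + np.log(1 + th[0] ** 2)
print(finite_diff_grad(E, np.array([1.0, -2.0, 0.5])))
```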
Differentiating Composition of Functions

Composition evaluated by the chain
    x = x_0 → x_1 → x_2 → ... → x_R → x_{R+1} ∈ ℝ,    x_{r+1} = g_r(x_r),
where g_r : ℝ^{n_r} → ℝ^{n_{r+1}} has Jacobian ∂g_r(x_r) ∈ ℝ^{n_{r+1} × n_r}; for the scalar-valued last map, ∇g_R(x_R) = [∂g_R(x_R)]^⊤ ∈ ℝ^{n_R × 1}.

Chain rule (write A_r ≝ ∂g_r(x_r), so A_R is 1 × n_R and A_0 is n_1 × n_0):
    ∂g(x) = A_R × A_{R−1} × ... × A_1 × A_0

Forward mode: multiply from the input side,
    ∂g(x) = A_R × ( A_{R−1} × ( ... × (A_1 × A_0) ... ) ),
each intermediate product is an n_{r+1} × n_0 matrix. Complexity: (R − 1) n^3 + n^2 operations if all n_r = n, i.e. O(n^3).

Backward mode: multiply from the output side,
    ∂g(x) = ( ( ... (A_R × A_{R−1}) × ... ) × A_1 ) × A_0,
each intermediate product is a 1 × n_r row vector. Complexity: R n^2 operations, i.e. O(n^2).
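A sketch comparing the two accumulation orders on random Jacobian matrices A_0, ..., A_R; the sizes n and R are arbitrary illustrative choices. Both orders yield the same ∂g(x), but the backward order only ever propagates a row vector.

```python
# Forward vs backward accumulation of the chain rule
# dg = A_R x A_{R-1} x ... x A_0 when the output is scalar (A_R is 1 x n).
import numpy as np

n, R = 100, 10
A = [np.random.randn(n, n) for _ in range(R)]   # A_0, ..., A_{R-1}
A.append(np.random.randn(1, n))                 # A_R: scalar output

# Forward accumulation: A_r x ... x A_0, built from the input side.
P = A[0]
for r in range(1, R + 1):
    P = A[r] @ P            # (n x n) times (n x n) until the very last step: ~ n^3 each

# Backward accumulation: row vector A_R x ... x A_r, built from the output side.
v = A[R]
for r in range(R - 1, -1, -1):
    v = v @ A[r]            # (1 x n) times (n x n): ~ n^2 each

print(np.allclose(P, v))    # both equal the full Jacobian dg(x)
```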
Feedforward Computational Graphs

    x_{r+1} = g_r(x_r, θ_r),    E(θ) = L(x_{R+1}, y)

The chain x = x_0 → x_1 → ... → x_{R+1} is driven by per-layer parameters θ_0, ..., θ_R, and the loss L compares the output x_{R+1} to y.

Example: deep neural network (here fully connected)
    x_{r+1} = ρ(A_r x_r + b_r),    θ_r = (A_r, b_r),
with x_r ∈ ℝ^{d_r}, A_r ∈ ℝ^{d_{r+1} × d_r}, b_r ∈ ℝ^{d_{r+1}}, and ρ a pointwise non-linearity.

Logistic loss (classification):
    L(x_{R+1}, y) ≝ log Σ_i exp(x_{R+1,i}) − Σ_i x_{R+1,i} y_i,
    ∇_{x_{R+1}} L(x_{R+1}, y) = e^{x_{R+1}} / Σ_i e^{x_{R+1,i}} − y.
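A sketch of the forward pass of such a fully connected network together with the logistic loss; the choice of ReLU for ρ, the layer sizes and the weight scaling are assumptions made for the example.

```python
# Forward pass x_{r+1} = rho(A_r x_r + b_r) and logistic (softmax) loss.
import numpy as np

rho = lambda u: np.maximum(u, 0.0)      # pointwise non-linearity (here ReLU)

dims = [20, 15, 10, 5]                  # d_0, ..., d_{R+1}
A = [np.random.randn(dims[r + 1], dims[r]) * 0.1 for r in range(len(dims) - 1)]
b = [np.zeros(dims[r + 1]) for r in range(len(dims) - 1)]

def forward(x):
    for Ar, br in zip(A, b):
        x = rho(Ar @ x + br)
    return x                            # x_{R+1}

def logistic_loss(scores, y):
    """L(x_{R+1}, y) = log sum_i exp(x_{R+1,i}) - <x_{R+1}, y>."""
    return np.log(np.sum(np.exp(scores))) - scores @ y

x = np.random.randn(dims[0])
y = np.eye(dims[-1])[2]                 # one-hot class label
print(logistic_loss(forward(x), y))
```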
Backpropagation Algorithm

    x_{r+1} = g_r(x_r, θ_r),    E(θ) = L(x_{R+1}, y)

Proposition: initializing ∇_{x_{R+1}} E = ∇ L(x_{R+1}, y), for all r = R, ..., 0,
    ∇_{x_r} E = [∂_{x_r} g_r(x_r, θ_r)]^⊤ (∇_{x_{r+1}} E),
    ∇_{θ_r} E = [∂_{θ_r} g_r(x_r, θ_r)]^⊤ (∇_{x_{r+1}} E).

Example: deep neural network x_{r+1} = ρ(A_r x_r + b_r). With M_r ≝ ρ'(A_r x_r + b_r) ⊙ (∇_{x_{r+1}} E), for all r = R, ..., 0,
    ∇_{x_r} E = A_r^⊤ M_r,    ∇_{A_r} E = M_r x_r^⊤,    ∇_{b_r} E = M_r.
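A sketch implementing this backward recursion for the ReLU network and logistic loss of the previous sketch (those choices are assumptions, not part of the slides): a forward pass stores the intermediate states, then the gradient is propagated from the loss back to every A_r, b_r.

```python
# Backpropagation for x_{r+1} = rho(A_r x_r + b_r) with the logistic loss:
#   M_r = rho'(A_r x_r + b_r) .* grad_{x_{r+1}} E,
#   grad_{x_r} E = A_r^T M_r,  grad_{A_r} E = M_r x_r^T,  grad_{b_r} E = M_r.
import numpy as np

rho = lambda u: np.maximum(u, 0.0)
drho = lambda u: (u > 0).astype(float)           # rho'(u)

def backprop(x0, y, A, b):
    # forward pass, storing the states x_r and pre-activations u_r = A_r x_r + b_r
    xs, us, x = [x0], [], x0
    for Ar, br in zip(A, b):
        u = Ar @ x + br
        x = rho(u)
        us.append(u)
        xs.append(x)
    # gradient of the logistic loss with respect to x_{R+1}
    p = np.exp(x) / np.sum(np.exp(x))
    g = p - y                                    # grad_{x_{R+1}} E
    gA, gb = [None] * len(A), [None] * len(A)
    # backward pass: r = R, ..., 0
    for r in range(len(A) - 1, -1, -1):
        M = drho(us[r]) * g                      # M_r
        gA[r] = np.outer(M, xs[r])               # grad_{A_r} E
        gb[r] = M                                # grad_{b_r} E
        g = A[r].T @ M                           # grad_{x_r} E
    return gA, gb

# usage: gA, gb = backprop(x, y, A, b)   with x, y, A, b as in the previous sketch
```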
Recurrent Architectures

Shared parameters: x_{r+1} = g_r(x_r, θ), with the same θ at every stage of the chain x_0 → x_1 → ... → x_{R+1}, and E(θ) = L(x_{R+1}, y).

Recurrent networks for natural language processing: the same map g with the same parameter θ is applied at every time step, x_t = g(x_{t−1}, a_t, θ), consuming inputs a_0, ..., a_T and producing outputs b_0, ..., b_T.

[Figure: recurrent network unrolled over time, with inputs a_t, outputs b_t, hidden states x_t and the shared parameter θ.]

Take-home message: for complicated computational architectures, you do not want to do the computation/implementation by hand.
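A sketch of backpropagation through time for such a shared-parameter recurrence, illustrating that the gradient with respect to the shared θ accumulates one contribution per time step. The toy tanh recurrence, the terminal loss and all sizes are assumptions made for the example.

```python
# Backpropagation through time for x_t = tanh(theta @ x_{t-1} + a_t),
# terminal loss E = 0.5 * ||x_T||^2, with a single shared parameter matrix theta.
import numpy as np

d, T = 8, 20
theta = np.random.randn(d, d) * 0.1              # shared parameter
a = np.random.randn(T, d)                        # inputs a_1, ..., a_T

# forward pass, storing all hidden states x_0, ..., x_T
xs = [np.zeros(d)]
for t in range(T):
    xs.append(np.tanh(theta @ xs[-1] + a[t]))
g_x = xs[-1]                                     # grad_{x_T} E

# backward pass: the shared theta receives one gradient term per time step
g_theta = np.zeros_like(theta)
for t in range(T - 1, -1, -1):
    M = (1 - xs[t + 1] ** 2) * g_x               # tanh'(.) .* grad_{x_{t+1}} E
    g_theta += np.outer(M, xs[t])                # accumulate over time
    g_x = theta.T @ M                            # propagate to x_t
```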
Computational Graph

Computer program ⇔ directed acyclic graph ⇔ linear ordering of the nodes (θ_r)_r computing a function ℓ(θ_1, ..., θ_M).

Forward evaluation of ℓ:
    for r = M+1, ..., R:  θ_r = g_r(θ_{Parents(r)})
    return θ_R

[Figure: example graph with input nodes θ_1, θ_2, computed nodes θ_3, θ_4 and output node θ_5, using operations g_3, g_4, g_5.]
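A sketch of this forward evaluation on a tiny made-up graph (the graph, its operations and the function ℓ(θ_1, θ_2) = θ_1 θ_2 + log(θ_1) are illustrative assumptions): nodes θ_1, ..., θ_M are inputs, each later node applies an elementary operation to its parents, and θ_R is returned.

```python
# Forward evaluation of a computational graph:
#   for r = M+1, ..., R:  theta_r = g_r(theta_{Parents(r)});  return theta_R
import math

M = 2                                             # number of input nodes
graph = {                                         # r -> (g_r, Parents(r))
    3: (lambda t1, t2: t1 * t2, (1, 2)),
    4: (lambda t1: math.log(t1), (1,)),
    5: (lambda t3, t4: t3 + t4, (3, 4)),
}
R = 5

def forward(inputs):
    theta = dict(enumerate(inputs, start=1))      # theta_1, ..., theta_M
    for r in range(M + 1, R + 1):
        g, parents = graph[r]
        theta[r] = g(*(theta[p] for p in parents))
    return theta[R]

print(forward([2.0, 3.0]))                        # 2*3 + log(2)
```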