Differential Programming
Gabriel Peyré, École Normale Supérieure
www.numerical-tours.com
https://mathematical-coffees.github.io (organized by Mérouane Debbah & Gabriel Peyré)
Seminar topics: optimization, deep learning, optimal transport, quantum computing, compressed sensing, artificial intelligence, mean field games, topos.
Speakers: Alexandre Gramfort (INRIA), Yves Achdou (Paris 6), Frédéric Magniez (CNRS and Paris 7), Olivier Grisel (INRIA), Daniel Bennequin (Paris 7), Edouard Oyallon (CentraleSupelec), Olivier Guéant (Paris 1), Marco Cuturi (ENSAE), Gabriel Peyré (CNRS and ENS), Iordanis Kerenidis (CNRS and Paris 7), Jalal Fadili (ENSICaen), Joris Van den Bossche (INRIA), Guillaume Lecué (CNRS and ENSAE).
Model Fitting in Data Sciences

    min_θ E(θ) ≝ L(f(x, θ), y)

with x the input, θ the parameter, f the model, y the output and L the loss.

Deep learning: f(·, θ) maps the input x to class probabilities y; the parameters θ = (θ_1, θ_2, θ_3, θ_4, ...) are the weights of the successive layers.
Super-resolution: y is the observation, θ the unknown image, and f(x, ·) the degradation operator.
Medical imaging registration: f(·, θ) is a diffeomorphism parameterized by θ, matching an image x onto an image y.
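A minimal sketch (not from the slides) of the generic fitting problem min_θ E(θ) = L(f(x, θ), y), instantiated with an illustrative linear model and squared-error loss; the names f, L, E and the data shapes are assumptions made for the example.

```python
# Sketch of the generic model-fitting objective E(theta) = L(f(x, theta), y),
# instantiated here with a linear model and a squared-error loss.
import numpy as np

def f(x, theta):
    """Model: here simply a linear map parameterized by theta."""
    return x @ theta

def L(prediction, y):
    """Loss: squared error between the model output and the target."""
    return 0.5 * np.sum((prediction - y) ** 2)

def E(theta, x, y):
    """Objective to be minimized over the parameter theta."""
    return L(f(x, theta), y)

x = np.random.randn(50, 3)      # input
theta = np.random.randn(3)      # parameter
y = np.random.randn(50)         # target output
print(E(theta, x, y))
```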
Gradient-based Methods

    min_θ E(θ) ≝ L(f(x, θ), y)

Gradient descent:  θ^(ℓ+1) = θ^(ℓ) − τ_ℓ ∇E(θ^(ℓ))

[Figure: behaviour of the iterates as a function of ℓ for a small step size τ_ℓ, a large τ_ℓ, and the optimal τ_ℓ = τ*.]

Many generalizations: Nesterov / heavy-ball, (quasi-)Newton, stochastic / incremental methods, proximal splitting (for non-smooth E), ...
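A short sketch of the gradient descent iteration θ^(ℓ+1) = θ^(ℓ) − τ_ℓ ∇E(θ^(ℓ)), reusing the least-squares objective of the previous sketch; the closed-form gradient and the step-size choice are illustrative assumptions.

```python
# Gradient descent on E(theta) = 0.5 * ||x @ theta - y||^2.
import numpy as np

def grad_E(theta, x, y):
    """Gradient of the least-squares objective."""
    return x.T @ (x @ theta - y)

x = np.random.randn(50, 3)
y = np.random.randn(50)
theta = np.zeros(3)
tau = 1.0 / np.linalg.norm(x, 2) ** 2   # constant step size 1/L, L = Lipschitz constant of grad E

for it in range(200):
    theta = theta - tau * grad_E(theta, x, y)
```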
The Complexity of Gradient Computation

Setup: E : ℝ^n → ℝ computable in K operations.
Hypothesis: elementary operations (a × b, log(a), √a, ...) and their derivatives cost O(1).

Question: what is the complexity of computing ∇E : ℝ^n → ℝ^n ?

Finite differences:
    ∇E(θ) ≈ (1/ε) ( E(θ + εδ_1) − E(θ), ..., E(θ + εδ_n) − E(θ) )
K(n + 1) operations, intractable for large n.

Theorem [Seppo Linnainmaa, 1970]: there is an algorithm to compute ∇E in O(K) operations.
This algorithm is reverse-mode automatic differentiation.
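A sketch of the finite-difference approximation, which makes the K(n + 1) cost explicit: one evaluation of E at θ plus one per coordinate direction δ_i. The test function E below is an arbitrary illustrative choice.

```python
# Finite-difference approximation of grad E: n + 1 evaluations of E,
# hence roughly K * (n + 1) operations if one evaluation costs K.
import numpy as np

def finite_diff_grad(E, theta, eps=1e-6):
    n = theta.size
    E0 = E(theta)                       # 1 evaluation
    g = np.zeros(n)
    for i in range(n):                  # n further evaluations
        delta = np.zeros(n)
        delta[i] = eps                  # eps * delta_i
        g[i] = (E(theta + delta) - E0) / eps
    return g

E = lambda th: np.sum(th ** 2) + np.log(1 + th[0] ** 2)
print(finite_diff_grad(E, np.array([1.0, -2.0, 0.5])))
```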
Differentiating Composition of Functions

Composition evaluated by the chain
    x = x_0 → x_1 → x_2 → ... → x_R → x_{R+1} ∈ ℝ,    x_{r+1} = g_r(x_r),
where g_r : ℝ^{n_r} → ℝ^{n_{r+1}} has Jacobian ∂g_r(x_r) ∈ ℝ^{n_{r+1} × n_r}; for the scalar-valued last map, ∇g_R(x_R) = [∂g_R(x_R)]^⊤ ∈ ℝ^{n_R × 1}.

Chain rule (write A_r ≝ ∂g_r(x_r), so A_R is 1 × n_R and A_0 is n_1 × n_0):
    ∂g(x) = A_R × A_{R−1} × ... × A_1 × A_0

Forward mode: multiply from the input side,
    ∂g(x) = A_R × ( A_{R−1} × ( ... × (A_1 × A_0) ... ) ),
each intermediate product is an n_{r+1} × n_0 matrix. Complexity: (R − 1) n^3 + n^2 operations if all n_r = n, i.e. O(n^3).

Backward mode: multiply from the output side,
    ∂g(x) = ( ( ... (A_R × A_{R−1}) × ... ) × A_1 ) × A_0,
each intermediate product is a 1 × n_r row vector. Complexity: R n^2 operations, i.e. O(n^2).
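A sketch comparing the two accumulation orders on random Jacobian matrices A_0, ..., A_R; the sizes n and R are arbitrary illustrative choices. Both orders yield the same ∂g(x), but the backward order only ever propagates a row vector.

```python
# Forward vs backward accumulation of the chain rule
# dg = A_R x A_{R-1} x ... x A_0 when the output is scalar (A_R is 1 x n).
import numpy as np

n, R = 100, 10
A = [np.random.randn(n, n) for _ in range(R)]   # A_0, ..., A_{R-1}
A.append(np.random.randn(1, n))                 # A_R: scalar output

# Forward accumulation: A_r x ... x A_0, built from the input side.
P = A[0]
for r in range(1, R + 1):
    P = A[r] @ P            # (n x n) times (n x n) until the very last step: ~ n^3 each

# Backward accumulation: row vector A_R x ... x A_r, built from the output side.
v = A[R]
for r in range(R - 1, -1, -1):
    v = v @ A[r]            # (1 x n) times (n x n): ~ n^2 each

print(np.allclose(P, v))    # both equal the full Jacobian dg(x)
```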
Feedforward Computational Graphs

    x_{r+1} = g_r(x_r, θ_r),    E(θ) = L(x_{R+1}, y)

The chain x = x_0 → x_1 → ... → x_{R+1} is driven by per-layer parameters θ_0, ..., θ_R, and the loss L compares the output x_{R+1} to y.

Example: deep neural network (here fully connected)
    x_{r+1} = ρ(A_r x_r + b_r),    θ_r = (A_r, b_r),
with x_r ∈ ℝ^{d_r}, A_r ∈ ℝ^{d_{r+1} × d_r}, b_r ∈ ℝ^{d_{r+1}}, and ρ a pointwise non-linearity.

Logistic loss (classification):
    L(x_{R+1}, y) ≝ log Σ_i exp(x_{R+1,i}) − Σ_i x_{R+1,i} y_i,
    ∇_{x_{R+1}} L(x_{R+1}, y) = e^{x_{R+1}} / Σ_i e^{x_{R+1,i}} − y.
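A sketch of the forward pass of such a fully connected network together with the logistic loss; the choice of ReLU for ρ, the layer sizes and the weight scaling are assumptions made for the example.

```python
# Forward pass x_{r+1} = rho(A_r x_r + b_r) and logistic (softmax) loss.
import numpy as np

rho = lambda u: np.maximum(u, 0.0)      # pointwise non-linearity (here ReLU)

dims = [20, 15, 10, 5]                  # d_0, ..., d_{R+1}
A = [np.random.randn(dims[r + 1], dims[r]) * 0.1 for r in range(len(dims) - 1)]
b = [np.zeros(dims[r + 1]) for r in range(len(dims) - 1)]

def forward(x):
    for Ar, br in zip(A, b):
        x = rho(Ar @ x + br)
    return x                            # x_{R+1}

def logistic_loss(scores, y):
    """L(x_{R+1}, y) = log sum_i exp(x_{R+1,i}) - <x_{R+1}, y>."""
    return np.log(np.sum(np.exp(scores))) - scores @ y

x = np.random.randn(dims[0])
y = np.eye(dims[-1])[2]                 # one-hot class label
print(logistic_loss(forward(x), y))
```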
Backpropagation Algorithm

    x_{r+1} = g_r(x_r, θ_r),    E(θ) = L(x_{R+1}, y)

Proposition: initializing ∇_{x_{R+1}} E = ∇ L(x_{R+1}, y), for all r = R, ..., 0,
    ∇_{x_r} E = [∂_{x_r} g_r(x_r, θ_r)]^⊤ (∇_{x_{r+1}} E),
    ∇_{θ_r} E = [∂_{θ_r} g_r(x_r, θ_r)]^⊤ (∇_{x_{r+1}} E).

Example: deep neural network x_{r+1} = ρ(A_r x_r + b_r). With M_r ≝ ρ'(A_r x_r + b_r) ⊙ (∇_{x_{r+1}} E), for all r = R, ..., 0,
    ∇_{x_r} E = A_r^⊤ M_r,    ∇_{A_r} E = M_r x_r^⊤,    ∇_{b_r} E = M_r.
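A sketch implementing this backward recursion for the ReLU network and logistic loss of the previous sketch (those choices are assumptions, not part of the slides): a forward pass stores the intermediate states, then the gradient is propagated from the loss back to every A_r, b_r.

```python
# Backpropagation for x_{r+1} = rho(A_r x_r + b_r) with the logistic loss:
#   M_r = rho'(A_r x_r + b_r) .* grad_{x_{r+1}} E,
#   grad_{x_r} E = A_r^T M_r,  grad_{A_r} E = M_r x_r^T,  grad_{b_r} E = M_r.
import numpy as np

rho = lambda u: np.maximum(u, 0.0)
drho = lambda u: (u > 0).astype(float)           # rho'(u)

def backprop(x0, y, A, b):
    # forward pass, storing the states x_r and pre-activations u_r = A_r x_r + b_r
    xs, us, x = [x0], [], x0
    for Ar, br in zip(A, b):
        u = Ar @ x + br
        x = rho(u)
        us.append(u)
        xs.append(x)
    # gradient of the logistic loss with respect to x_{R+1}
    p = np.exp(x) / np.sum(np.exp(x))
    g = p - y                                    # grad_{x_{R+1}} E
    gA, gb = [None] * len(A), [None] * len(A)
    # backward pass: r = R, ..., 0
    for r in range(len(A) - 1, -1, -1):
        M = drho(us[r]) * g                      # M_r
        gA[r] = np.outer(M, xs[r])               # grad_{A_r} E
        gb[r] = M                                # grad_{b_r} E
        g = A[r].T @ M                           # grad_{x_r} E
    return gA, gb

# usage: gA, gb = backprop(x, y, A, b)   with x, y, A, b as in the previous sketch
```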
Recurrent Architectures

Shared parameters: x_{r+1} = g_r(x_r, θ), with the same θ at every stage of the chain x_0 → x_1 → ... → x_{R+1}, and E(θ) = L(x_{R+1}, y).

Recurrent networks for natural language processing: the same map g with the same parameter θ is applied at every time step, x_t = g(x_{t−1}, a_t, θ), consuming inputs a_0, ..., a_T and producing outputs b_0, ..., b_T.

[Figure: recurrent network unrolled over time, with inputs a_t, outputs b_t, hidden states x_t and the shared parameter θ.]

Take-home message: for complicated computational architectures, you do not want to do the computation/implementation by hand.
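A sketch of backpropagation through time for such a shared-parameter recurrence, illustrating that the gradient with respect to the shared θ accumulates one contribution per time step. The toy tanh recurrence, the terminal loss and all sizes are assumptions made for the example.

```python
# Backpropagation through time for x_t = tanh(theta @ x_{t-1} + a_t),
# terminal loss E = 0.5 * ||x_T||^2, with a single shared parameter matrix theta.
import numpy as np

d, T = 8, 20
theta = np.random.randn(d, d) * 0.1              # shared parameter
a = np.random.randn(T, d)                        # inputs a_1, ..., a_T

# forward pass, storing all hidden states x_0, ..., x_T
xs = [np.zeros(d)]
for t in range(T):
    xs.append(np.tanh(theta @ xs[-1] + a[t]))
g_x = xs[-1]                                     # grad_{x_T} E

# backward pass: the shared theta receives one gradient term per time step
g_theta = np.zeros_like(theta)
for t in range(T - 1, -1, -1):
    M = (1 - xs[t + 1] ** 2) * g_x               # tanh'(.) .* grad_{x_{t+1}} E
    g_theta += np.outer(M, xs[t])                # accumulate over time
    g_x = theta.T @ M                            # propagate to x_t
```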
Computational Graph

Computer program ⇔ directed acyclic graph ⇔ linear ordering of the nodes (θ_r)_r computing a function ℓ(θ_1, ..., θ_M).

Forward evaluation of ℓ:
    for r = M+1, ..., R:  θ_r = g_r(θ_{Parents(r)})
    return θ_R

[Figure: example graph with input nodes θ_1, θ_2, computed nodes θ_3, θ_4 and output node θ_5, using operations g_3, g_4, g_5.]
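A sketch of this forward evaluation on a tiny made-up graph (the graph, its operations and the function ℓ(θ_1, θ_2) = θ_1 θ_2 + log(θ_1) are illustrative assumptions): nodes θ_1, ..., θ_M are inputs, each later node applies an elementary operation to its parents, and θ_R is returned.

```python
# Forward evaluation of a computational graph:
#   for r = M+1, ..., R:  theta_r = g_r(theta_{Parents(r)});  return theta_R
import math

M = 2                                             # number of input nodes
graph = {                                         # r -> (g_r, Parents(r))
    3: (lambda t1, t2: t1 * t2, (1, 2)),
    4: (lambda t1: math.log(t1), (1,)),
    5: (lambda t3, t4: t3 + t4, (3, 4)),
}
R = 5

def forward(inputs):
    theta = dict(enumerate(inputs, start=1))      # theta_1, ..., theta_M
    for r in range(M + 1, R + 1):
        g, parents = graph[r]
        theta[r] = g(*(theta[p] for p in parents))
    return theta[R]

print(forward([2.0, 3.0]))                        # 2*3 + log(2)
```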