  1. Differential Programming. Gabriel Peyré, École Normale Supérieure. www.numerical-tours.com

  2. https://mathematical-coffees.github.io, organized by Mérouane Debbah & Gabriel Peyré. Topics: Optimization, Deep Learning, Optimal Transport, Quantum Computing, Compressed Sensing, Artificial Intelligence, Mean Field Games, Topos. Speakers: Alexandre Gramfort (INRIA), Yves Achdou (Paris 6), Frédéric Magniez (CNRS and Paris 7), Olivier Grisel (INRIA), Daniel Bennequin (Paris 7), Edouard Oyallon (CentraleSupélec), Olivier Guéant (Paris 1), Marco Cuturi (ENSAE), Gabriel Peyré (CNRS and ENS), Iordanis Kerenidis (CNRS and Paris 7), Jalal Fadili (ENSICaen), Joris Van den Bossche (INRIA), Guillaume Lecué (CNRS and ENSAE).

  3. Model Fitting in Data Sciences. Definition: min_θ E(θ) := L(f(x, θ), y), where x is the input, y the output, f the model, L the loss, and θ the parameter.

  4. Model Fitting in Data Sciences. Deep learning: f(·, θ) maps the input x to class probabilities y; θ collects the layer parameters (θ_1, ..., θ_4 in the diagram).

  5. Model Fitting in Data Sciences. Super-resolution: f(x, ·) is the degradation operator, y the observation, and θ the unknown image.

  6. Model Fitting in Data Sciences. Medical imaging registration: f(·, θ) is a diffeomorphism deforming the image x to match the image y.
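
As a concrete toy instance of this objective (a minimal sketch: the linear model f and squared loss L below are illustrative choices, not taken from the slides):

    import numpy as np

    def f(x, theta):
        # Illustrative model: a linear map parameterized by the matrix theta.
        return theta @ x

    def L(pred, y):
        # Illustrative loss: half the squared Euclidean distance.
        return 0.5 * np.sum((pred - y) ** 2)

    def E(theta, x, y):
        # The generic fitting objective E(theta) = L(f(x, theta), y).
        return L(f(x, theta), y)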

  7. Gradient-based Methods. Gradient descent: θ^(ℓ+1) = θ^ℓ − τ_ℓ ∇E(θ^ℓ). [Figure: convergence of E(θ^ℓ) for a small, a large, and the optimal step size τ_ℓ = τ*.]

  8. Gradient-based Methods. Many generalizations of the gradient descent above: Nesterov / heavy-ball, (quasi-)Newton, stochastic / incremental methods, proximal splitting (for non-smooth E), ...
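
A minimal gradient-descent loop implementing the update above, with a fixed step size τ for simplicity (grad_E is assumed to be provided, e.g. by the finite differences or backpropagation discussed next):

    import numpy as np

    def gradient_descent(grad_E, theta0, tau=0.1, n_iter=100):
        # theta^(l+1) = theta^l - tau * grad E(theta^l)
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iter):
            theta = theta - tau * grad_E(theta)
        return theta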

  9. The Complexity of Gradient Computation. Setup: E : R^n → R computable in K operations. Hypothesis: elementary operations (a × b, log(a), √a, ...) and their derivatives cost O(1).

  10. The Complexity of Gradient Computation. Question: what is the complexity of computing ∇E : R^n → R^n?

  11. The Complexity of Gradient Computation. Finite differences: ∇E(θ) ≈ (1/ε) (E(θ + ε δ_1) − E(θ), ..., E(θ + ε δ_n) − E(θ)), which takes K(n + 1) operations: intractable for large n.

  12. The Complexity of Gradient Computation. Theorem [Seppo Linnainmaa, 1970]: there is an algorithm that computes ∇E in O(K) operations.

  13. The Complexity of Gradient Computation. This algorithm is reverse-mode automatic differentiation.
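
A minimal sketch of the finite-difference approximation from slide 11; each coordinate needs one extra evaluation of E, which is where the K(n + 1) cost comes from:

    import numpy as np

    def finite_difference_grad(E, theta, eps=1e-6):
        # n + 1 evaluations of E in total: one at theta, one per coordinate.
        theta = np.asarray(theta, dtype=float)
        E0 = E(theta)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            delta = np.zeros_like(theta)
            delta[i] = eps
            grad[i] = (E(theta + delta) - E0) / eps
        return grad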

  14. Differentiating Composition of Functions. Consider a composition x = x_0 → x_1 → ... → x_R → x_{R+1} ∈ R with x_{r+1} = g_r(x_r) and g_r : R^(n_r) → R^(n_{r+1}). The Jacobian of each step is ∂g_r(x_r) ∈ R^(n_{r+1} × n_r); since the output is scalar, ∇g_R(x_R) = [∂g_R(x_R)]^⊤ ∈ R^(n_R × 1).

  15. Differentiating Composition of Functions. Chain rule: ∂g(x) = ∂g_R(x_R) × ∂g_{R−1}(x_{R−1}) × ... × ∂g_1(x_1) × ∂g_0(x_0), a product of matrices A_r = ∂g_r(x_r) of sizes 1 × n_R, n_R × n_{R−1}, ..., n_2 × n_1, n_1 × n_0.

  16. Differentiating Composition of Functions. Forward evaluation of the product, accumulating from the input side: ∂g(x) = A_R × (A_{R−1} × (... × (A_2 × (A_1 × A_0)) ...)). Complexity (taking n_r = n for r = 0, ..., R and a scalar output, n_{R+1} = 1): (R − 1) n^3 + n^2, i.e. O(n^3).

  17. Differentiating Composition of Functions. Backward evaluation, accumulating from the output side: ∂g(x) = ((... ((A_R × A_{R−1}) × A_{R−2}) ...) × A_1) × A_0. Complexity (same setting): R n^2, i.e. O(n^2), since every partial product is a 1 × n row vector.
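
A minimal NumPy sketch contrasting the two accumulation orders on a given list of Jacobians A_0, ..., A_R (the Jacobians themselves are taken as inputs here, which is an illustrative simplification):

    import numpy as np

    def jacobian_forward(As):
        # As = [A_0, A_1, ..., A_R]; accumulate from the input side.
        # Each step multiplies a full n x n matrix: O(n^3) per step.
        J = As[0]
        for A in As[1:]:
            J = A @ J
        return J

    def jacobian_backward(As):
        # Accumulate from the output side: once A_R (a 1 x n row) is folded in,
        # every partial product stays a row vector, so each step costs O(n^2).
        J = As[-1]
        for A in reversed(As[:-1]):
            J = J @ A
        return J

Both functions return the same matrix A_R × ... × A_0; only the bracketing, and hence the cost, differs.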

  18. Feedforward Computational Graphs. Each node now takes a parameter: x_{r+1} = g_r(x_r, θ_r), and E = L(x_{R+1}, y). The graph chains x = x_0 → g_0 → x_1 → g_1 → ... → g_R → x_{R+1} → L → E, with θ_0, ..., θ_R feeding the g_r and y feeding the loss L.

  19. Feedforward Computational Graphs. Example: deep neural network (here fully connected): x_{r+1} = ρ(A_r x_r + b_r), with θ_r = (A_r, b_r), x_r ∈ R^(d_r), A_r ∈ R^(d_{r+1} × d_r), b_r ∈ R^(d_{r+1}), and ρ a pointwise non-linearity.

  20. Feedforward Computational Graphs. Logistic loss (classification): L(x_{R+1}, y) := log Σ_i exp(x_{R+1,i}) − Σ_i x_{R+1,i} y_i, with gradient ∇_{x_{R+1}} L(x_{R+1}, y) = e^{x_{R+1}} / Σ_i e^{x_{R+1,i}} − y.
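
A minimal NumPy forward pass for this fully connected example, together with the logistic loss and its gradient (ρ is taken to be ReLU here, an illustrative choice; the slides do not fix a particular non-linearity):

    import numpy as np

    def rho(u):
        # Pointwise non-linearity (illustrative choice: ReLU).
        return np.maximum(u, 0.0)

    def forward(x, params):
        # params = [(A_0, b_0), ..., (A_R, b_R)];  x_{r+1} = rho(A_r x_r + b_r).
        xs = [x]
        for A, b in params:
            xs.append(rho(A @ xs[-1] + b))
        return xs  # all intermediate x_r are kept: backpropagation needs them

    def logistic_loss(x_out, y):
        # L(x_{R+1}, y) = log sum_i exp(x_{R+1,i}) - <x_{R+1}, y>
        return np.log(np.sum(np.exp(x_out))) - np.dot(x_out, y)

    def logistic_loss_grad(x_out, y):
        # grad_{x_{R+1}} L = softmax(x_{R+1}) - y
        p = np.exp(x_out)
        return p / p.sum() - y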

  21. Backpropagation Algorithm. Same feedforward graph as above: x_{r+1} = g_r(x_r, θ_r), E = L(x_{R+1}, y).

  22. Backpropagation Algorithm. Proposition: for r = R, ..., 0, ∇_{x_r} E = [∂_{x_r} g_r(x_r, θ_r)]^⊤ (∇_{x_{r+1}} E) and ∇_{θ_r} E = [∂_{θ_r} g_r(x_r, θ_r)]^⊤ (∇_{x_{r+1}} E), the recursion being initialized with ∇_{x_{R+1}} E = ∇_{x_{R+1}} L(x_{R+1}, y).

  23. Backpropagation Algorithm. Example: deep neural network x_{r+1} = ρ(A_r x_r + b_r). For r = R, ..., 0, set M_r := ρ'(A_r x_r + b_r) ⊙ ∇_{x_{r+1}} E; then ∇_{x_r} E = A_r^⊤ M_r, ∇_{A_r} E = M_r x_r^⊤, and ∇_{b_r} E = M_r.
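
A minimal sketch of this recursion, written on top of the forward-pass sketch above (same ReLU choice, so ρ'(u) = 1_{u > 0}; the helper names come from that earlier sketch, not from an official implementation):

    import numpy as np

    def backprop(xs, params, y):
        # xs = [x_0, ..., x_{R+1}] as returned by forward(); returns the
        # gradients (grad_{A_r} E, grad_{b_r} E) for every layer r.
        grads = [None] * len(params)
        g = logistic_loss_grad(xs[-1], y)       # g = grad_{x_{R+1}} E
        for r in range(len(params) - 1, -1, -1):
            A, b = params[r]
            u = A @ xs[r] + b                   # pre-activation of layer r
            M = (u > 0).astype(float) * g       # M_r = rho'(u) * grad_{x_{r+1}} E
            grads[r] = (np.outer(M, xs[r]), M)  # (M_r x_r^T, M_r)
            g = A.T @ M                         # grad_{x_r} E = A_r^T M_r
        return grads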

  24. Recurrent Architectures. Shared parameters: the same θ is used at every step, x_{r+1} = g_r(x_r, θ), and E = L(x_{R+1}, y).

  25. Recurrent Architectures. Recurrent networks for natural language processing: the same map g, with shared parameter θ, is applied at each time step, taking the previous state x_{t−1} and the current input a_t to produce the next state x_t and an output b_t (t = 0, ..., T).

  26. Recurrent Architectures. Take-home message: for complicated computational architectures, you do not want to do the computation/implementation by hand.
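
For illustration, a minimal unrolled recurrence with a shared parameter (a toy linear-plus-tanh recursion, not the slides' NLP model); the gradient with respect to the shared θ accumulates one contribution per time step, which is exactly the bookkeeping one does not want to carry out by hand:

    import numpy as np

    def unroll(theta, x0, inputs):
        # x_t = tanh(theta @ x_{t-1} + a_t): the same theta is reused at every step.
        xs = [x0]
        for a in inputs:
            xs.append(np.tanh(theta @ xs[-1] + a))
        return xs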

  27. Computational Graph

  28. Computational Graph. Computer program ⇔ directed acyclic graph ⇔ linear ordering of the nodes (θ_r)_r computing ℓ. Pseudocode: function ℓ(θ_1, ..., θ_M): for r = M + 1, ..., R: θ_r = g_r(θ_{Parents(r)}); return θ_R. [Diagram: a small example DAG with inputs θ_1, θ_2, θ_3, internal functions g_3, g_4, g_5, and an output node.]
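
A minimal sketch of evaluating such a graph in its linear (topological) ordering; the encoding below (a list of (function, parent indices) pairs) is an illustrative choice, not the slides' notation:

    import math

    def evaluate(inputs, nodes):
        # inputs: values of theta_1, ..., theta_M (stored 0-indexed).
        # nodes: (g_r, parent_indices) for r = M+1, ..., R, in topological order.
        theta = list(inputs)
        for g, parents in nodes:
            theta.append(g(*[theta[p] for p in parents]))
        return theta[-1]  # theta_R, the output of the program

    # Usage: the program l(t1, t2, t3) = t1 * t2 + exp(t3)
    value = evaluate(
        [2.0, 3.0, 0.0],
        [(lambda a, b: a * b, [0, 1]),   # theta_4 = theta_1 * theta_2
         (math.exp, [2]),                # theta_5 = exp(theta_3)
         (lambda a, b: a + b, [3, 4])],  # output  = theta_4 + theta_5
    )  # value == 7.0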
