

  1. What does backpropagation compute?
     Edouard Pauwels (IRIT, Toulouse 3), joint work with Jérôme Bolte (TSE, Toulouse 1).
     Optimization for machine learning, CIRM, March 2020.

  4. Plan
     Motivation: there is something that we do not understand in backpropagation for deep learning.
     Nonsmooth analysis is not really compatible with calculus.
     Contribution: conservative set-valued fields; analytic, geometric and algorithmic properties.

  9. Backpropagation
     Automatic differentiation (AD, 70s): automatized numerical implementation of the chain rule.
     H : R^p → R^p, G : R^p → R^p, f : R^p → R (differentiable); f ∘ G ∘ H : R^p → R,
     ∇(f ∘ G ∘ H)^T = ∇f^T × J_G × J_H.
     Function = program: smooth elementary operations, combined smoothly: x ↦ (H(x), G(H(x)), f(G(H(x)))).
     Forward mode of AD: ∇f^T × (J_G × J_H). Backward mode of AD: (∇f^T × J_G) × J_H.
     Backpropagation: backward AD for neural network training. It computes the gradient (provided that everything is smooth).
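To make the two accumulation orders concrete, here is a minimal NumPy sketch (an illustration with made-up choices of H, G and f, not the authors' code): it evaluates the same product ∇f^T × J_G × J_H in the forward-mode and backward-mode association orders and checks that they coincide.

```python
import numpy as np

# Illustrative sketch: the chain rule for f ∘ G ∘ H, accumulated in the two
# orders that correspond to forward and backward mode AD.
p = 3
rng = np.random.default_rng(0)
A, B, c = rng.standard_normal((p, p)), rng.standard_normal((p, p)), rng.standard_normal(p)

H = lambda x: np.tanh(A @ x)     # H : R^p -> R^p
G = lambda z: np.tanh(B @ z)     # G : R^p -> R^p
f = lambda z: float(c @ z)       # f : R^p -> R  (linear, so grad f = c)

x = rng.standard_normal(p)
J_H = (1 - np.tanh(A @ x) ** 2)[:, None] * A       # Jacobian of H at x
J_G = (1 - np.tanh(B @ H(x)) ** 2)[:, None] * B    # Jacobian of G at H(x)
grad_f = c                                         # gradient of f at G(H(x))

forward = grad_f @ (J_G @ J_H)      # forward mode: Jacobian products, input to output
backward = (grad_f @ J_G) @ J_H     # backward mode: vector-Jacobian products, output to input
assert np.allclose(forward, backward)   # both equal ∇(f ∘ G ∘ H)(x)^T
```

In the backward order every intermediate product is a row vector of size p, which is why backpropagation is the cheap option when f is scalar-valued.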

  13. Neural network / compositional modeling
     [Diagram: input x ∈ R^p, layers z_0 ∈ R^p, z_1 ∈ R^{p_1}, ..., z_L ∈ R^{p_L}.]
     For i = 1, ..., L: z_i ∈ R^{p_i} ("layer"), z_i = φ_i(W_i z_{i-1} + b_i), where
     φ_i : R^{p_i} → R^{p_i} are nonlinear "activation functions", W_i ∈ R^{p_i × p_{i-1}}, b_i ∈ R^{p_i},
     and θ = (W_1, b_1, ..., W_L, b_L) are the model parameters.
     F_θ(x) = z_L = φ_L(W_L φ_{L-1}(W_{L-1}(... φ_1(W_1 x + b_1) ...) + b_{L-1}) + b_L).
     Training set {(x_i, y_i)}_{i=1}^n in R^p × R^{p_L}, loss ℓ : R^{p_L} × R^{p_L} → R_+:
     min_θ J(θ) := (1/n) Σ_{i=1}^n ℓ(F_θ(x_i), y_i) = (1/n) Σ_{i=1}^n J_i(θ).
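A minimal Python sketch of F_θ and the empirical risk J (illustrative only: the layer sizes, activations and squared loss below are made up, not from the slides):

```python
import numpy as np

# Illustrative sketch of the compositional model F_theta and the empirical risk J.
def F(theta, x, activations):
    z = x
    for (W, b), phi in zip(theta, activations):
        z = phi(W @ z + b)                 # z_i = phi_i(W_i z_{i-1} + b_i)
    return z                               # z_L = F_theta(x)

def J(theta, data, activations, loss):
    # J(theta) = (1/n) sum_i loss(F_theta(x_i), y_i)
    return np.mean([loss(F(theta, x, activations), y) for x, y in data])

# Made-up example: L = 2 layers, relu then identity, squared loss.
relu = lambda t: np.maximum(t, 0.0)
p, p1, p2, n = 4, 8, 2, 16
rng = np.random.default_rng(0)
theta = [(rng.standard_normal((p1, p)), np.zeros(p1)),
         (rng.standard_normal((p2, p1)), np.zeros(p2))]
data = [(rng.standard_normal(p), rng.standard_normal(p2)) for _ in range(n)]
print(J(theta, data, [relu, lambda t: t], lambda z, y: float(np.sum((z - y) ** 2))))
```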

  15. Backpropagation and learning
     Stochastic (minibatch) gradient algorithm: given (I_k)_{k ∈ N} i.i.d., uniform on {1, ..., n}, and (α_k)_{k ∈ N} positive, iterate
     θ_{k+1} = θ_k − α_k ∇J_{I_k}(θ_k).
     Backpropagation: backward mode of automatic differentiation used to compute ∇J_i.
     Profusion of numerical tools, e.g. TensorFlow, PyTorch. Democratized the usage of these models. Goes beyond neural nets (differentiable programming).
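As a concrete illustration (a sketch with a made-up model and data; PyTorch is only one of the tools mentioned above), the iteration θ_{k+1} = θ_k − α_k ∇J_{I_k}(θ_k), where the call to backward() is the backpropagation step:

```python
import torch

# Illustrative sketch: minibatch stochastic gradient with backpropagation.
torch.manual_seed(0)
n, p = 64, 10
X, Y = torch.randn(n, p), torch.randn(n, 1)
model = torch.nn.Sequential(torch.nn.Linear(p, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
loss_fn = torch.nn.MSELoss()

alpha = 0.01                                   # step size alpha_k (kept constant here)
for k in range(100):
    I_k = torch.randint(n, (8,))               # I_k: minibatch drawn uniformly from {1, ..., n}
    J_Ik = loss_fn(model(X[I_k]), Y[I_k])      # J_{I_k}(theta_k)
    model.zero_grad()
    J_Ik.backward()                            # backpropagation: reverse-mode AD computes the gradient
    with torch.no_grad():
        for theta in model.parameters():
            theta -= alpha * theta.grad        # theta_{k+1} = theta_k - alpha_k * grad J_{I_k}(theta_k)
```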

  16. Nonsmooth activations
     Positive part: relu(t) = max{0, t}.
     Less straightforward examples:
     Max pooling in convolutional networks.
     knn grouping layers, farthest point subsampling layers (Qi et al., 2017, PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space).
     Sorting layers (Anil et al., 2019, Sorting Out Lipschitz Function Approximation, ICML).

  18. Nonsmooth backpropagation
     Set relu′(0) = 0 and implement the chain rule of smooth calculus: (f ∘ g)′ = g′ × (f′ ∘ g).
     TensorFlow examples:
     [Plots: relu and relu′, abs and abs′, leaky_relu and leaky_relu′, relu6 and relu6′ on [-2, 2].]
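A minimal TensorFlow sketch of that convention (illustrative; this is not the slides' code): querying the value reverse-mode AD assigns to relu′ at 0.

```python
import tensorflow as tf

# Illustrative sketch: what AD returns for relu' at the kink t = 0.
x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    y = tf.nn.relu(x)
print(float(tape.gradient(y, x)))   # 0.0: the convention relu'(0) = 0
```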

  19. AD acts on programs, not on functions
     relu2(t) = relu(−t) + t = relu(t),
     relu3(t) = (1/2)(relu(t) + relu2(t)) = relu(t).
     [Plots: relu2 and relu2′, relu3 and relu3′ on [-2, 2].]
     Known from the AD literature (e.g. Griewank 08, Kakade & Lee 2018).
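The plots can be reproduced directly; a minimal TensorFlow sketch (illustrative, not from the slides): three programs that compute the same function max{0, t}, for which AD returns three different values at 0.

```python
import tensorflow as tf

# Illustrative sketch: relu, relu2 and relu3 are the same function, but AD
# differentiates the programs, not the function, and disagrees at t = 0.
def relu2(t): return tf.nn.relu(-t) + t                  # = relu(t) as a function
def relu3(t): return 0.5 * (tf.nn.relu(t) + relu2(t))    # = relu(t) as a function

for name, prog in [("relu", tf.nn.relu), ("relu2", relu2), ("relu3", relu3)]:
    x = tf.Variable(0.0)
    with tf.GradientTape() as tape:
        y = prog(x)
    print(name, float(tape.gradient(y, x)))   # 0.0, 1.0 and 0.5 respectively
```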

  20. Derivative of zero at 0
     zero(t) = relu2(t) − relu(t) = 0.
     [Plot: zero and zero′ on [-2, 2].]
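And the same sketch for zero (illustrative): a program that computes the constant 0, for which AD returns 1 at the origin.

```python
import tensorflow as tf

# Illustrative sketch: AD output for zero(t) = relu2(t) - relu(t) = 0 at t = 0.
def relu2(t): return tf.nn.relu(-t) + t
x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    y = relu2(x) - tf.nn.relu(x)      # identically 0 as a function of x
print(float(tape.gradient(y, x)))      # 1.0, although the zero function has derivative 0
```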

  21. AD acts on programs, not on functions
     Derivative of sine at 0: sin′ = cos.
     [Plots: sin and sin′, mysin and mysin′, mysin2 and mysin2′, mysin3 and mysin3′ on [-2, 2].]

  26. Consequences for optimization and learning
     No convexity, no calculus: only the inclusion ∂(f + g) ⊂ ∂f + ∂g holds in general.
     Minibatch + subgradient, with J(θ) = (1/n) Σ_{i=1}^n J_i(θ) locally Lipschitz and I uniform on {1, ..., n}:
     in the convex case, v_i ∈ ∂J_i(θ) for i = 1, ..., n implies E_I[v_I] ∈ ∂J(θ);
     without convexity there is no sum rule, so v_i ∈ ∂J_i(θ) no longer guarantees E_I[v_I] ∈ ∂J(θ);
     with automatic differentiation, the computed v_i = D_i(θ) need not even lie in ∂J_i(θ).
     Discrepancy between what we analyse and what we implement:
     Analysis: θ_{k+1} = θ_k − α_k(v_k + ε_k), v_k ∈ ∂J(θ_k), (ε_k)_{k ∈ N} zero mean (martingale increments). (Davis et al., 2018, Stochastic subgradient method converges on tame functions, FOCM.)
     Implementation: θ_{k+1} = θ_k − α_k D_{I_k}(θ_k).
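To see the failure of the sum rule concretely, here is a small counterexample with the Clarke subdifferential (an illustration, not taken from the slides): with n = 2 terms, a mean of valid subgradients need not be a subgradient of the mean.

```latex
% Illustrative counterexample: take n = 2 with
%   J_1(theta) = relu(theta),  J_2(theta) = theta - relu(theta),
% so that J = (J_1 + J_2)/2 = theta/2 is smooth.
\begin{align*}
  &\partial J_1(0) = [0,1], \qquad \partial J_2(0) = [0,1], \qquad \partial J(0) = \{\tfrac12\},\\
  &v_1 = v_2 = 1 \ \Rightarrow\ \mathbb{E}_I[v_I] = \tfrac12 (v_1 + v_2) = 1 \notin \partial J(0).
\end{align*}
```

The same two functions also show that the inclusion above can be strict: ∂(J_1 + J_2)(0) = {1} while ∂J_1(0) + ∂J_2(0) = [0, 2].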

  27. Question
     [Diagram, smooth vs. nonsmooth case: a function J and a program P computing it; differentiation gives ∇J (resp. ∂J), autodiff applied to P gives D, and the question is how D relates numerically to ∇J (resp. ∂J).]
     A mathematical model for "nonsmooth automatic differentiation"?

  28. Outline
     1. Conservative set-valued fields
     2. Properties of conservative fields
     3. Consequences for deep learning

  31. What is a derivative?
     Linear operator: derivative : C^1(R) → C^0(R), f ↦ f′.
     Notions of subgradients inherited from the calculus of variations follow the "operator" view.
