Neural Networks with Cheap Differential Operators
Ricky T. Q. Chen, David Duvenaud
Differential Operators
• Want to compute operators such as the divergence:
  ∇ ⋅ f = ∑_{i=1}^{d} ∂f_i(x)/∂x_i,  where f : ℝ^d → ℝ^d is a neural net.
• Applications:
  • Solving PDEs
  • Fitting SDEs
  • Finding fixed points
  • Continuous normalizing flows
Automatic Differentiation (AD)
Reverse-mode AD gives cheap vector-Jacobian products:
  vᵀ [df(x)/dx] = ∑_{i=1}^{d} v_i ∂f_i(x)/∂x
• For the full Jacobian, we need d separate backward passes.
• In general, the Jacobian diagonal has the same cost as the full Jacobian!
• We restrict the architecture to allow one-pass diagonal computation.
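A minimal PyTorch sketch of the vector-Jacobian product and the resulting d-pass cost of the Jacobian diagonal for a generic f (the helper names are illustrative, not from the paper's code):

```python
import torch

def vjp(f, x, v):
    """One reverse-mode pass: returns v^T (df/dx) at x."""
    x = x.detach().requires_grad_(True)
    y = f(x)
    return torch.autograd.grad(y, x, grad_outputs=v)[0]

def jacobian_diagonal_naive(f, x):
    """Exact Jacobian diagonal of a generic f : R^d -> R^d, using d backward passes."""
    d = x.shape[-1]
    basis = torch.eye(d)
    rows = [vjp(f, x, basis[i]) for i in range(d)]   # row i of the Jacobian
    return torch.stack([rows[i][i] for i in range(d)])
```

Each basis vector e_i recovers one full row of the Jacobian, of which only the i-th entry is kept; this is why, for a general network, the diagonal costs as much as the full matrix.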
HollowNets
Allow efficient computation of dimension-wise derivatives of order k, D^k_dim f(x), with only k backward passes, regardless of the dimension d.
Example (k = 1): D_dim f(x) = the Jacobian diagonal.
[Figure: full Jacobian vs. Jacobian diagonal]
HollowNet Architecture
HollowNets are composed of two sub-networks:
• Hidden units that do not depend on their respective input: h_i = c_i(x_{−i}).
• Output units that depend only on their respective input and hidden units: f_i(x) = τ_i([x_i, h_i]).
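A sketch of this two-part architecture in PyTorch, assuming per-dimension MLPs for the conditioners c_i and transformers τ_i (the class name, layer sizes, and masking-by-zeroing are illustrative choices, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class HollowNet(nn.Module):
    """f_i(x) = tau_i([x_i, h_i]) with h_i = c_i(x_{-i})."""
    def __init__(self, d, width=64):
        super().__init__()
        self.d = d
        # c_i: sees x with x_i masked out, so h_i never depends on x_i.
        self.conditioners = nn.ModuleList(
            nn.Sequential(nn.Linear(d, width), nn.Tanh(), nn.Linear(width, width))
            for _ in range(d))
        # tau_i: maps [x_i, h_i] to the scalar output f_i(x).
        self.transformers = nn.ModuleList(
            nn.Sequential(nn.Linear(width + 1, width), nn.Tanh(), nn.Linear(width, 1))
            for _ in range(d))

    def hidden(self, x):                      # x: (d,)
        hs = []
        for i in range(self.d):
            mask = torch.ones(self.d)
            mask[i] = 0.0                     # drop the dependence on x_i
            hs.append(self.conditioners[i](x * mask))
        return torch.stack(hs)                # (d, width)

    def forward(self, x):
        h = self.hidden(x)
        outs = [self.transformers[i](torch.cat([x[i:i + 1], h[i]]))
                for i in range(self.d)]
        return torch.cat(outs)                # (d,)
```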
HollowNet Jacobians
Can get exact dimension-wise derivatives by disconnecting some dependencies in the backward pass, i.e. detach in PyTorch or stop_gradient in TensorFlow.
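One way to realize this in PyTorch, using the HollowNet sketch above: detach h(x) before the transformer so the backward pass only sees the diagonal path, then a single VJP with a vector of ones returns the exact Jacobian diagonal (the function name is illustrative):

```python
def jacobian_diagonal_hollow(net, x):
    """Exact Jacobian diagonal of a HollowNet in a single backward pass."""
    x = x.detach().requires_grad_(True)
    h = net.hidden(x).detach()                # stop_gradient: cut the hollow path dh/dx
    outs = [net.transformers[i](torch.cat([x[i:i + 1], h[i]])) for i in range(net.d)]
    f = torch.cat(outs)                       # same values as net(x), but diagonal Jacobian
    ones = torch.ones_like(f)
    return torch.autograd.grad(f, x, grad_outputs=ones)[0]   # = diag(df/dx)
```

Because the detached graph's Jacobian is exactly diagonal, the ones-vector VJP sums each column, and each column has a single nonzero entry, so the result is the diagonal itself.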
HollowNet Jacobians
Can factor the Jacobian into:
• A diagonal matrix (dimension-wise dependencies).
• A hollow matrix (all other interactions).
  d/dx f = ∂τ(x, h)/∂x  [diagonal]  +  ∂τ(x, h)/∂h · ∂h(x)/∂x  [hollow]
Application I: Finding Fixed Points
Root-finding problems (f(x) = 0) can be solved with Jacobi-Newton:
  Fixed-point iteration:  x_{t+1} = x_t − f(x_t)
  Jacobi-Newton:          x_{t+1} = x_t − [D_dim f(x_t)]^{−1} f(x_t)
• Same solution, with faster convergence.
• We applied this to implicit ODE solvers for stiff equations.
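A hedged sketch of the Jacobi-Newton update using the single-pass diagonal routine sketched earlier (the solver loop, step count, and tolerance are illustrative):

```python
def jacobi_newton_solve(net, x0, steps=100, tol=1e-6):
    """Solve f(x) = 0 with x <- x - [D_dim f(x)]^{-1} f(x);
    each step costs one forward and one backward pass through the HollowNet."""
    x = x0.detach()
    for _ in range(steps):
        fx = net(x).detach()
        if fx.abs().max() < tol:
            break
        diag = jacobian_diagonal_hollow(net, x)
        x = x - fx / diag                     # elementwise "inverse" of the diagonal
    return x
```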
Application II: Continuous Normalizing Flows
• Transforms distributions through an ODE: dx(t)/dt = f(x(t), t).
• Change in density given by the divergence:
  d log p(x, t)/dt = −tr(df(x)/dx) = −∑_{i=1}^{d} [D_dim f(x)]_i
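With the single-pass diagonal, the divergence term is just a sum, so the log-density dynamics can be sketched as follows (assuming `dynamics` follows the HollowNet structure above; the minus sign follows the usual instantaneous change-of-variables convention):

```python
def dlogp_dt(dynamics, x):
    """d log p(x(t), t) / dt = -div f(x) = -sum_i [D_dim f(x)]_i."""
    return -jacobian_diagonal_hollow(dynamics, x).sum()
```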
Learning Stochastic Differential Equations
• The Fokker-Planck equation describes the density change using D_dim and D²_dim:
  ∂p(t, x)/∂t = ∑_{i=1}^{d} [ −(D_dim f) p − (∇p) ⊙ f + ½( (D²_dim diag(g)²) p + 2 (D_dim diag(g)²) ⊙ (∇p) + diag(g)² ⊙ (D_dim ∇p) ) ]_i
Takeaways
• Dimension-wise derivatives are costly for general functions.
• Restricting to hollow Jacobians gives cheap diagonal grads.
• Useful for PDEs, SDEs, normalizing flows, and optimization.