Structure preservation in (some) deep learning architectures
Brynjulf Owren
Department of Mathematical Sciences, NTNU, Trondheim, Norway
LMS-Bath Symposium – 2020
Joint work with: Martin Benning, Elena Celledoni, Matthias Ehrhardt, Christian Etmann, Carola-Bibiane Schönlieb and Ferdia Sherry
Main sources for this talk
• Benning, Martin; Celledoni, Elena; Ehrhardt, Matthias J.; Owren, Brynjulf; Schönlieb, Carola-Bibiane, Deep Learning as Optimal Control Problems: Models and Numerical Methods, J. Comput. Dyn. 6 (2019), no. 2, 171–198.
• Celledoni, Elena; Ehrhardt, Matthias J.; Etmann, Christian; McLachlan, Robert I.; Owren, Brynjulf; Schönlieb, Carola-Bibiane; Sherry, Ferdia, Structure preserving deep learning, arXiv:2006.03364 (June 2020).
Neural networks as discrete dynamical systems
Neural network layers:
\[ \varphi_k : X_k \times \Theta_k \to X_{k+1}, \]
where $\Theta_k$ is the parameter space of layer $k$ and $X_k$ is the $k$th feature space.
The full neural network
\[ \Psi : X \times \Theta \to Y, \qquad (x, \theta) \mapsto z_K, \]
can then be defined via the iteration
\[ z_0 = x, \qquad z_{k+1} = \varphi_k(z_k, \theta_k), \quad k = 0, \dots, K-1. \]
An extra final layer may be needed: $\eta : X_K \times \Theta_K \to Y$.
In this talk, $X_k = X$ for all $k$.
Training the neural network
Training data: $(x_n, y_n)_{n=1}^N \subset X \times Y$.
Training the network amounts to minimising the loss function
\[ \min_{\theta \in \Theta} \left\{ E(\theta) = \frac{1}{N} \sum_{n=1}^N L_n(\Psi(x_n, \theta)) + R(\theta) \right\}, \]
where
• $L_n(y) : Y \to \mathbb{R}_\infty$ is the loss for a specific data point,
• $R : \Theta \to \mathbb{R}_\infty$ acts as a regulariser which penalises and constrains unwanted solutions.
We can define the loss over a batch of $N$ data points in terms of the final layer as
\[ E(z; \theta) = \frac{1}{N} \sum_{n=1}^N L_n(\eta(z_n), \theta) + R(\theta). \]
ResNet model (He et al. (2016))
$\Psi : X \times \Theta \to X$, $\Psi(x, \theta) = z_K$, given by the iteration
\[ z_0 = x, \qquad z_{k+1} = z_k + \sigma(A_k z_k + b_k), \quad k = 0, \dots, K-1, \qquad y = \eta(w^T z_K + \mu). \]
• $\sigma$ is a nonlinear activation function, a scalar function acting element-wise on vectors.
• $\theta_k = (A_k, b_k)$ for $k \le K-1$, and $\theta_K = (w, \mu)$.
The ResNet layers can be seen as a time stepper for the ODE
\[ \dot z = \sigma(A(t) z + b(t)), \quad t \in [0, T]. \]
It is the explicit Euler method with stepsize $h = 1$.
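The reading of ResNet as explicit Euler can be sketched in a few lines. This is a minimal illustration, not code from the talk: the names `matvec` and `resnet_forward` are hypothetical, features are plain Python lists, $\sigma = \tanh$ is assumed, and the final layer $\eta$ is left out. Setting `h=1.0` gives the ResNet iteration as stated on the slide; other values of `h` give the Euler discretisation of the ODE with a different stepsize.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, z: Vector) -> Vector:
    """Plain matrix-vector product A z."""
    return [sum(a * x for a, x in zip(row, z)) for row in A]


def resnet_forward(
    x: Vector,
    layers: Sequence[Tuple[Matrix, Vector]],
    h: float = 1.0,
    sigma: Callable[[float], float] = math.tanh,
) -> Vector:
    """Iterate z_{k+1} = z_k + h * sigma(A_k z_k + b_k), starting from z_0 = x."""
    z = list(x)
    for A, b in layers:
        u = [ui + bi for ui, bi in zip(matvec(A, z), b)]
        z = [zi + h * sigma(ui) for zi, ui in zip(z, u)]
    return z


# Toy example (assumed values): three layers with A_k = 0 and b_k = (0.5, 0.5),
# so every layer adds the same increment h * tanh(0.5) to each component.
zero = [[0.0, 0.0], [0.0, 0.0]]
z_K = resnet_forward([1.0, -2.0], [(zero, [0.5, 0.5])] * 3, h=1.0)
```

With $A_k = 0$ the layers no longer depend on $z_k$, which makes the output easy to check by hand: each component grows by $K h \tanh(0.5)$.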
Activations – examples
$\sigma_1(x) = \tanh x$, $\sigma_2(x) = \max(0, x)$ (ReLU)
[Figure: four panels on $x \in [-4, 4]$ showing $\sigma_1(x) = \tanh(x)$, $\sigma_1'(x) = 1 - \tanh^2(x)$, $\sigma_2(x) = \max(0, x)$ and $\sigma_2'(x) = \mathrm{Heaviside}(x)$.]
The continuous optimal control problem – summarised
\[ \min_{(\theta, z) \in \Theta \times \mathcal{X}^N} \left\{ E(\theta, z) = \frac{1}{N} \sum_{n=1}^N L_n(z_n(T)) + R(\theta) \right\} \]
such that
\[ \dot z_n = f(z_n, \theta(t)), \quad z_n(0) = x_n, \quad n = 1, \dots, N. \]
Training as an Optimal Control Problem
The first order optimality conditions can be phrased as a Hamiltonian Boundary Value Problem (Benning et al. (2019)). Define
\[ H(z, p; \theta) = \langle p, f(z; \theta) \rangle. \]
Solve
\[ \dot z = \frac{\partial H}{\partial p}, \qquad \dot p = -\frac{\partial H}{\partial z}, \qquad 0 = \frac{\partial H}{\partial \theta}, \]
with boundary conditions
\[ z(0) = x, \qquad p(T) = \left. \frac{\partial L}{\partial z} \right|_{t=T}. \]
For ResNet, $f(z; \theta) = \sigma(A(t) z + b(t))$, and we shall discuss alternative vector fields $f$.
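As a worked instance, the conditions above can be written out for the ResNet vector field. This is a sketch rather than a formula taken from the talk: the shorthand $u = Az + b$ and the element-wise product $\odot$ are notation introduced here, and $\theta = (A, b)$ is assumed to enter only through $f$.

```latex
\begin{aligned}
H(z, p; A, b) &= \langle p, \sigma(Az + b) \rangle = \sum_i p_i \, \sigma(u_i), \qquad u = Az + b, \\
\dot z &= \partial_p H = \sigma(u), \\
\dot p &= -\partial_z H = -A^T \big( \sigma'(u) \odot p \big), \\
0 &= \partial_A H = \big( \sigma'(u) \odot p \big) z^T, \qquad 0 = \partial_b H = \sigma'(u) \odot p .
\end{aligned}
```

The second line is the ResNet ODE itself, and the third is the adjoint equation $\dot p = -Df(z)^T p$ with $Df(z) = \mathrm{diag}(\sigma'(u)) A$.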
Solving the HBVP
Standard procedure:
  Initial guess $\theta^{(0)}$
  while not converged
    Sweep forward $\dot z = f(z; \theta^{(i)})$ to get $z_1, \dots, z_K$, with $z_k = \varphi(z_{k-1})$
    Backprop on $\dot p = -Df(z)^T p$ to obtain $\nabla_\theta E$
    Update by some descent method, e.g. $\theta^{(i+1)} = \theta^{(i)} - \tau \nabla_\theta E(\theta^{(i)})$
• Chen et al. (2018) suggest using a black-box solver: obtain $z(T)$ and then integrate $(z(t), p(t))$ backwards in time simultaneously to save memory.
• Problematic for various reasons: no explicit solver satisfying the first order optimality conditions, plus stability issues.
• Gholami et al. (2019) amend the problem with a checkpointing method, so that only forward sweeps through the feature spaces are needed. Again, first order optimality is not so clear.
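The standard procedure above (forward sweep, backward sweep, descent update) can be sketched for a toy scalar ResNet. Everything specific here is an assumption made for illustration: scalar features, $\sigma = \tanh$, a single training pair $(x, y)$, the loss $L = \tfrac12 (z_K - y)^2$, no regulariser, and the hypothetical names `forward`, `gradient` and `train`.

```python
import math
from typing import List, Tuple

Params = List[Tuple[float, float]]  # one (a_k, b_k) pair per layer


def forward(x: float, params: Params, h: float) -> List[float]:
    """Forward sweep: states z_0..z_K of z_{k+1} = z_k + h*tanh(a_k*z_k + b_k)."""
    zs = [x]
    for a, b in params:
        zs.append(zs[-1] + h * math.tanh(a * zs[-1] + b))
    return zs


def gradient(x: float, y: float, params: Params, h: float):
    """Backward sweep for L = 0.5*(z_K - y)^2; returns (loss, per-layer gradients)."""
    zs = forward(x, params, h)
    p = zs[-1] - y  # p_K = dL/dz_K
    grads: Params = [(0.0, 0.0)] * len(params)
    for k in reversed(range(len(params))):
        a, b = params[k]
        ds = 1.0 - math.tanh(a * zs[k] + b) ** 2     # sigma'(a_k z_k + b_k)
        grads[k] = (h * p * ds * zs[k], h * p * ds)  # uses p_{k+1}
        p = p * (1.0 + h * ds * a)                   # p_k from p_{k+1}
    return 0.5 * (zs[-1] - y) ** 2, grads


def train(x: float, y: float, params: Params, h: float = 0.5,
          tau: float = 0.1, iters: int = 200):
    """Plain gradient descent: theta <- theta - tau * grad E(theta)."""
    losses = []
    for _ in range(iters):
        loss, grads = gradient(x, y, params, h)
        losses.append(loss)
        params = [(a - tau * ga, b - tau * gb)
                  for (a, b), (ga, gb) in zip(params, grads)]
    return params, losses


# Toy run (assumed values): four layers, map x = 1 towards y = 2.
trained, losses = train(x=1.0, y=2.0, params=[(0.1, 0.0)] * 4)
```

The backward loop is the discrete counterpart of $\dot p = -Df(z)^T p$ for this particular time stepper, so the gradient it returns is the exact gradient of the discretised loss.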
DTO vs OTD
Two options:
1 DTO (discretise-then-optimise). Discretise the forward ODE ($\dot z = f(z; \theta)$) by some numerical method $\varphi$. Then solve the discrete optimisation problem, based on the gradients $\nabla_{\theta_k} E(z_K; \theta_K)$.
2 OTD (optimise-then-discretise). Solve the Hamiltonian boundary value problem by a numerical method $\bar\varphi : (z_k, p_k) \mapsto (\varphi(z_k), p_{k+1})$ and compute $\frac{\partial \varphi}{\partial \theta}(z_k, \theta_k)^T p_{k+1}$ for each $k$.
Theorem (Benning et al. 2019, Sanz-Serna 2015)
DTO and OTD are equivalent if the overall method $\bar\varphi$ for the Hamiltonian boundary value problem preserves quadratic invariants (a.k.a. symplectic). That is,
\[ \nabla_{\theta_k} E(z_K; \theta_K) = \frac{\partial \varphi}{\partial \theta}(z_k, \theta_k)^T p_{k+1}. \]
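A small worked case may help make the theorem concrete. Take $\varphi$ to be explicit Euler with stepsize $h$ (an assumption of this sketch, with the regulariser ignored). Applying the chain rule to the discrete problem gives the backward recursion for $p_k$ in the second line, and the pair of recursions is the symplectic Euler method applied to $H(z, p; \theta) = \langle p, f(z; \theta) \rangle$.

```latex
\begin{aligned}
z_{k+1} &= \varphi(z_k, \theta_k) = z_k + h \, f(z_k; \theta_k)
         = z_k + h \, \partial_p H(z_k, p_{k+1}; \theta_k), \\
p_k &= \partial_z \varphi(z_k, \theta_k)^T p_{k+1}
     = p_{k+1} + h \, D_z f(z_k; \theta_k)^T p_{k+1}
     = p_{k+1} + h \, \partial_z H(z_k, p_{k+1}; \theta_k), \\
\nabla_{\theta_k} E &= \partial_\theta \varphi(z_k, \theta_k)^T p_{k+1}
     = h \, \partial_\theta f(z_k; \theta_k)^T p_{k+1}.
\end{aligned}
```

In this case ordinary backpropagation through the Euler steps (DTO) and the symplectic Euler discretisation of the boundary value problem (OTD) produce the same $p_k$, hence the same gradients.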
An illustration
[Figure not reproduced.]
Generalisation mode – Forward problem
Once the network has been trained, the parameters $\theta(t)$ are known. Generalisation (the forward problem) becomes a non-autonomous initial value problem
\[ \dot z = \bar f(t, z) := f(z; \theta(t)), \quad z(0) = x. \]
- Arguably, one may ask for good "stability properties" for the forward problem. Haber & Ruthotto (2017), Zhang & Schaeffer (2020).
- Stability may also be desired in "backward time", Chang et al. (2018).
What is our freedom in choosing good models?
- Restrict the parameter space $\Theta$ ($A$ skew-symmetric, negative definite, manifold-valued, ...)
- Alter the structure of the vector field $f$ (Hamiltonian, dissipative, measure preserving, ...)
- Apply an integrator with good stability properties
Notions of stability
• Linear stability analysis (Haber and Ruthotto). For a nonlinear vector field $f(t, z)$, look at the spectrum of
\[ J(t, z) := \frac{\partial f}{\partial z}(t, z), \qquad \mathrm{Re}\, \lambda_i \le 0. \]
Works only locally and only with autonomous vector fields.
• Nonlinear stability analysis: look at norm contractivity/growth
\[ \| z_2(t) - z_1(t) \| \le C(t) \, \| z_2(0) - z_1(0) \|. \]
Such conditions can be ensured by imposing Lipschitz type conditions. E.g. for inner product spaces, with $\nu \in \mathbb{R}$,
\[ \langle f(t, z_2) - f(t, z_1), z_2 - z_1 \rangle \le \nu \, \| z_2 - z_1 \|_2^2, \quad \forall z_1, z_2, \; t \in [0, T] \]
\[ \Rightarrow \quad \| z_2(t) - z_1(t) \| \le e^{\nu t} \, \| z_2(0) - z_1(0) \|. \]
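The step from the one-sided Lipschitz condition to the exponential bound is short and can be spelled out. The sketch below writes $\Delta(t) := z_2(t) - z_1(t)$ (notation introduced here) for the difference of two solutions and uses the norm induced by the inner product.

```latex
\begin{aligned}
\frac{d}{dt} \, \tfrac12 \| \Delta(t) \|^2
  &= \langle f(t, z_2) - f(t, z_1), \, z_2 - z_1 \rangle
  \le \nu \, \| \Delta(t) \|^2, \\
\frac{d}{dt} \Big( e^{-2 \nu t} \| \Delta(t) \|^2 \Big) &\le 0
  \quad \Longrightarrow \quad
  \| \Delta(t) \| \le e^{\nu t} \, \| \Delta(0) \| .
\end{aligned}
```

In particular $\nu \le 0$ gives a non-expansive flow, and $\nu < 0$ gives exponential contraction.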
Example of a stability result (Celledoni et al. (2020))
We consider for simplicity the ODE model
\[ \dot z = -A(t)^T \sigma(A(t) z + b(t)) = f(t, z). \]
Here $\dot z = -\nabla_z V$ with $V = \gamma(A(t) z + b(t)) \mathbf{1}$, where $\gamma' = \sigma$.
Theorem
1 Let $V(t, z)$ be twice differentiable and convex in the second argument. Then the vector field $f(t, z) = -\nabla V(t, z)$ satisfies a one-sided Lipschitz condition with $\nu \le 0$.
2 Suppose that $\sigma(s)$ is absolutely continuous and $0 \le \sigma'(s) \le 1$ a.e. in $\mathbb{R}$. Then the one-sided Lipschitz condition holds for any $A(t)$ and $b(t)$ with $-\mu_*^2 \le \nu_\sigma \le 0$, where $\mu_* = \min_t \mu(t)$ and $\mu(t)$ is the smallest singular value of $A(t)$. In particular, $\nu_\sigma = -\mu_*^2$ is obtained when $\sigma(s) = s$.
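A quick numerical illustration of the non-expansive behaviour: two trajectories of the gradient model are advanced with explicit Euler and their distance is recorded. The setup is assumed for this sketch and is not from the talk: constant $A$ and $b$ chosen arbitrarily, $\sigma = \tanh$, and a stepsize satisfying $h \|A\|_2^2 \le 2$, under which the Euler map itself is non-expansive when $0 \le \sigma' \le 1$. The helper names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]

# Assumed toy parameters, held constant over all layers.
A: Matrix = [[1.0, 2.0], [0.0, 1.0]]
b: Vector = [0.5, -0.3]


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M: Matrix) -> Matrix:
    return [list(col) for col in zip(*M)]


def f(z: Vector) -> Vector:
    """Vector field f(z) = -A^T tanh(A z + b)."""
    u = [ui + bi for ui, bi in zip(matvec(A, z), b)]
    s = [math.tanh(ui) for ui in u]
    return [-g for g in matvec(transpose(A), s)]


def euler_step(z: Vector, h: float) -> Vector:
    return [zi + h * fi for zi, fi in zip(z, f(z))]


def dist(v: Vector, w: Vector) -> float:
    return math.sqrt(sum((vi - wi) ** 2 for vi, wi in zip(v, w)))


h = 0.1  # h * ||A||_2^2 <= h * ||A||_F^2 = 0.6 <= 2
z1, z2 = [1.0, -2.0], [-0.5, 1.5]
distances = [dist(z1, z2)]
for _ in range(50):
    z1, z2 = euler_step(z1, h), euler_step(z2, h)
    distances.append(dist(z1, z2))
```

The list `distances` should be non-increasing, which is the discrete analogue of the bound with $\nu \le 0$.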
Hamiltonian architectures
Chang et al. (2018). Let
\[ H(t, z, p) = T(t, p) + V(t, z). \]
Let $\gamma_i : \mathbb{R} \to \mathbb{R}$ be such that $\gamma_i'(t) = \sigma_i(t)$, $i = 1, 2$, and set
\[ T(t, p) = \gamma_1(A_1(t) p + b_1(t)) \mathbf{1}, \qquad V(t, z) = \gamma_2(A_2(t) z + b_2(t)) \mathbf{1}, \]
where $\mathbf{1} = (1, \dots, 1)^T$. This leads to models of the form
\[ \dot z = \partial_p H = A_1(t)^T \sigma_1(A_1(t) p + b_1(t)), \qquad \dot p = -\partial_z H = -A_2(t)^T \sigma_2(A_2(t) z + b_2(t)). \]
Two particular Hamiltonian cases
1 A simple case is obtained by choosing $\sigma_1(s) := s$, $A_1(t) \equiv I$, $b_1(t) \equiv 0$ and $\sigma_2(s) := \sigma(s)$, which after eliminating $p$ yields the second order ODE
\[ \ddot z = -\partial_z V = -A(t)^T \sigma(A(t) z + b(t)). \]
2 A second example:
\[ \dot z = A(t)^T \sigma(A(t) p + b(t)), \qquad \dot p = -A(t)^T \sigma(A(t) z + b(t)). \]
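The first case can be turned into a network by applying a symplectic integrator layer by layer. The sketch below uses Störmer-Verlet, chosen here for illustration and not necessarily the scheme used in the cited work. It assumes scalar features, $\sigma = \tanh$ (so that $\gamma = \log\cosh$), and parameters $(a_k, b_k)$ frozen within each step; `force`, `energy` and `verlet_network` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def force(z: float, a: float, b: float) -> float:
    """-dV/dz for V(z) = log(cosh(a*z + b)), i.e. -a * tanh(a*z + b)."""
    return -a * math.tanh(a * z + b)


def energy(z: float, p: float, a: float, b: float) -> float:
    """H(z, p) = p^2 / 2 + log(cosh(a*z + b))."""
    return 0.5 * p * p + math.log(math.cosh(a * z + b))


def verlet_network(
    z: float, p: float, params: Sequence[Tuple[float, float]], h: float
) -> List[Tuple[float, float]]:
    """One Stormer-Verlet step per layer (a_k, b_k); returns all (z_k, p_k)."""
    traj = [(z, p)]
    for a, b in params:
        p_half = p + 0.5 * h * force(z, a, b)    # half kick
        z = z + h * p_half                       # drift
        p = p_half + 0.5 * h * force(z, a, b)    # half kick
        traj.append((z, p))
    return traj


# Autonomous test case (assumed values): the same (a, b) in all 1000 layers,
# so the energy H should stay close to its initial value.
a0, b0, h = 1.0, 0.0, 0.1
trajectory = verlet_network(1.0, 0.0, [(a0, b0)] * 1000, h)
e_start = energy(1.0, 0.0, a0, b0)
drift = max(abs(energy(z, p, a0, b0) - e_start) for z, p in trajectory)
```

With constant parameters the energy error stays small and bounded over many layers. With layer-dependent $(a_k, b_k)$ the Hamiltonian is no longer conserved, which is the non-autonomous situation the talk turns to next.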
Non-autonomous Hamiltonian problems
Autonomous problems
• Two important geometric properties
  • The flow preserves the Hamiltonian
  • The flow is symplectic
• Numerical schemes can be symplectic or energy preserving, with excellent long time behaviour
Non-autonomous Hamiltonian problems
• The situation is less clear; there are at least two ways to interpret the dynamics
  1 'Autonomise' by adding time as a dependent variable (contact manifold). A preserved two-form can be introduced, $\omega = dp \wedge dq - dH \wedge dt$, but the Hamiltonian is not preserved along the flow
  2 Extend the system by adding time and a conjugate momentum variable $p_t$. Define the extended Hamiltonian $K(q, p, t, p_t) = H(q, p, t) + p_t$ and the symplectic form $\Omega = dp \wedge dq + dp_t \wedge dt$
The extended system
\[ \dot z = \partial_p H, \qquad \dot p = -\partial_z H, \qquad \dot t = 1, \qquad \dot p_t = -\partial_t H \]
• An obvious strategy would be to study the dynamics of the extended autonomous Hamiltonian system.
• Unfortunately, it does not give a lot of information
• Any level set of $K$ is unbounded
• Chang et al. (2018) report good numerical results with this type of model, but I am not aware of any theoretical justification
• Asorey et al. (1983) contains a number of results on the relations between the dynamics on the contact manifold and the extended manifold [more work to be done in this direction]
• LO Jay (2020), Marthinsen & O (2016) provide conditions on numerical integrators to be canonical in the non-autonomous case
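A short calculation shows why conservation of the extended Hamiltonian says little about $(z, p)$. It is a sketch using the extended Hamiltonian in the form $K(z, p, t, p_t) = H(t, z, p) + p_t$, with $K_0$ (notation introduced here) denoting its initial value.

```latex
\begin{aligned}
\frac{d}{dt} K
  &= \partial_z H \cdot \dot z + \partial_p H \cdot \dot p + \partial_t H \, \dot t + \dot p_t \\
  &= \partial_z H \cdot \partial_p H - \partial_p H \cdot \partial_z H + \partial_t H - \partial_t H = 0, \\
p_t(t) &= K_0 - H\big(t, z(t), p(t)\big), \qquad K_0 := K \big|_{t = 0} .
\end{aligned}
```

So $K$ is conserved, but only because $p_t$ absorbs whatever change occurs in $H$. Since $p_t$ can take any real value, fixing $K = K_0$ places no bound on $H$, and hence none on $(z, p)$.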