Robust pricing and hedging via neural SDEs
Robust pricing and hedging via neural SDEs
Lukasz Szpruch, University of Edinburgh and The Alan Turing Institute, London
Joint work with David Siska and Marc Sabate-Vilades (University of Edinburgh) and Zan Zuric and Antoine Jacquier (Imperial College London)


Generative modelling
◮ Generative models such as GANs and VAEs have demonstrated great success in seemingly high-dimensional setups.
◮ Input: a source distribution µ and a target distribution ν, i.e. input-output data.
◮ A generative model is a transport map T from µ to ν, i.e. a map that "pushes µ onto ν". We write T#µ = ν.
◮ Parametrise the transport map as T(θ), θ ∈ R^p, e.g. a network architecture or the Heston model.
◮ Seek θ* such that T(θ*)#µ ≈ ν.
◮ One needs to choose the metric
  D(T(θ)#µ, ν) := sup_{f ∈ K} | ∫ f(x) (T(θ)#µ)(dx) − ∫ f(x) ν(dx) |.
◮ K could be the set of options we want to calibrate to, or a set of neural networks.
◮ The modelling choices are: 1. the metric D; 2. the parametrisation of T; 3. the algorithm used for training!
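As a toy sketch of these choices (everything here is an illustrative assumption: a Gaussian source and target, an affine map T(θ)(x) = a + b·x, and K the first two monomials {x, x²}), training reduces to matching expectations of the test functions in K:

```python
import math, random

random.seed(0)
n = 10_000
mu = [random.gauss(0.0, 1.0) for _ in range(n)]              # source samples from mu
nu = [1.0 + 2.0 * random.gauss(0.0, 1.0) for _ in range(n)]  # target nu = N(1, 2^2)

# Sample moments; K = {x, x^2}, so D only needs the first two moments.
m1, m2 = sum(mu) / n, sum(x * x for x in mu) / n
t1, t2 = sum(nu) / n, sum(y * y for y in nu) / n

def D(theta):
    """sup_{f in K} |E[f(T(theta)#mu)] - E[f(nu)]| for the affine map T(x) = a + b*x."""
    a, b = theta
    p1 = a + b * m1                               # E[T(X)]
    p2 = a * a + 2 * a * b * m1 + b * b * m2      # E[T(X)^2]
    return max(abs(p1 - t1), abs(p2 - t2))

# Crude grid search over theta = (a, b), standing in for gradient training.
best = min(((a / 10, b / 10) for a in range(-30, 31) for b in range(41)), key=D)
print(best)
```

The recovered map is close to the true shift-and-scale (a, b) ≈ (1, 2); swapping K for option payoffs or a discriminator network changes the metric but not the structure of the problem.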

Generative modelling in finance
Pros:
◮ Expressive and work in high dimensions
◮ Data driven by design, adaptable to changes in the environment
◮ Provide a new perspective on classical problems in finance
Cons:
◮ Parameters are not interpretable - a black-box approach
◮ Training algorithms are data hungry
◮ Models might be hard to work with, e.g. how to go from Q to P?
◮ A largely empirical field, lacking standardised benchmarks and theoretical guarantees

Robust pricing and hedging via neural SDEs

Model Calibration
Classical calibration:
◮ Pick a parametric model (S_t(θ))_{t∈[0,T]} (e.g. an Itô process) with parameters θ ∈ R^p
◮ The parametric model induces a martingale measure Q(θ)
◮ Input data: prices of traded derivatives (p(Φ_i))_{i=0}^M with corresponding payoffs (Φ_i)_{i=0}^M
◮ Output: θ* such that p(Φ_i) ≈ E^{Q(θ*)}[Φ_i]
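In the simplest possible instance of this pipeline (one parameter, one instrument; Black-Scholes is an illustrative stand-in for the parametric model, and the market quote is synthetic), calibration is root-finding on the pricing map θ ↦ E^{Q(θ)}[Φ]:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(sigma, S0=1.0, K=1.0, T=0.5, r=0.0):
    """Model price E^{Q(theta)}[e^{-rT}(S_T - K)^+] under Black-Scholes, theta = sigma."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

p_market = bs_call(0.2)   # hypothetical market quote, generated at sigma = 0.2

# Calibration: bisection on sigma (call prices are increasing in sigma).
lo, hi = 1e-4, 2.0
for _ in range(60):
    sigma_hat = 0.5 * (lo + hi)
    if bs_call(sigma_hat) < p_market:
        lo = sigma_hat
    else:
        hi = sigma_hat
print(round(sigma_hat, 4))
```

With many instruments and many parameters this becomes the loss-minimisation problem discussed later, but the structure is the same: match model expectations to quotes.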

Robust price bounds
◮ There are infinitely many models consistent with the market
◮ M - the set of all martingale measures calibrated to the data
◮ Compute conservative bounds for the price: sup_{Q∈M} E^Q[Ψ] and inf_{Q∈M} E^Q[Ψ]
◮ Use duality theory to deduce a (semi-static) hedging strategy
◮ The bounds obtained are typically too wide to be of practical value
◮ Challenges: a) incorporate prior information to restrict the search space M; b) design efficient algorithms for computing price bounds and the corresponding hedges

Classical risk models, generative models, neural SDEs, robust finance

Neural SDEs
◮ We build an Itô process (X_t^θ)_{t∈[0,T]} with parameters θ ∈ R^p:
  dS_t^θ = r S_t^θ dt + σ_S(t, X_t^θ, θ) dW_t,
  dV_t^θ = b_V(t, X_t^θ, θ) dt + σ_V(t, X_t^θ, θ) dW_t,
  X_t^θ = (S_t^θ, V_t^θ),
where σ_S, b_V, σ_V are given by neural networks (and can be path-dependent)
◮ The model induces a martingale probability measure Q(θ)
◮ The solution map is an instance of causal transport
◮ See [Cuchiero et al., 2020] for neural SDEs with a prior on the volatility process
◮ See [Arribas et al., 2020] for Sig-SDEs (a neural SDE in a signature feature space)
◮ Neural SDEs are easy to work with, e.g. a consistent change from Q to P
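A minimal simulation sketch (pure Python; untrained random networks stand in for σ_S, b_V, σ_V, and the 0.2 / 0.1 scalings are arbitrary choices, not from the source): an Euler scheme for the neural SDE, checking that with r = 0 the discounted stock price has mean S_0, consistent with Q(θ) being a martingale measure:

```python
import math, random

random.seed(1)

def mlp(params, x):
    """Tiny one-hidden-layer tanh network R^3 -> R (an assumed stand-in architecture)."""
    (W1, b1), (W2, b2) = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

def rand_params(n_in=3, n_hid=4):
    return (([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)],
             [random.uniform(-1, 1) for _ in range(n_hid)]),
            ([random.uniform(-1, 1) for _ in range(n_hid)],
             random.uniform(-1, 1)))

theta_sS, theta_bV, theta_sV = rand_params(), rand_params(), rand_params()
r, T, n_steps, n_paths = 0.0, 1.0, 20, 2_000
dt = T / n_steps

total = 0.0
for _ in range(n_paths):
    s, v = 1.0, 0.2
    for k in range(n_steps):
        x = (k * dt, s, v)
        dw = random.gauss(0.0, math.sqrt(dt))
        # dS = r*S dt + sigma_S(t, X; theta) dW ;  dV = b_V dt + sigma_V dW
        s, v = (s + r * s * dt + 0.2 * abs(mlp(theta_sS, x)) * dw,
                v + 0.1 * mlp(theta_bV, x) * dt + 0.1 * mlp(theta_sV, x) * dw)
    total += math.exp(-r * T) * s
mean_discounted = total / n_paths
print(round(mean_discounted, 3))
```

The martingale property holds by construction of the drift, whatever the (untrained) network weights are; training only shapes the law of the paths.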

Neural SDEs
i) Calibration to market prices. Find model parameters θ* such that model prices match market prices:
  θ* ∈ argmin_{θ∈Θ} Σ_{i=1}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)).
ii) Robust pricing. Find model parameters θ^{l,*} and θ^{u,*} which provide robust arbitrage-free price bounds for an illiquid derivative, subject to the available market data:
  θ^{l,*} ∈ argmin_{θ∈Θ} E^{Q(θ)}[Ψ]  subject to  Σ_{i=1}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)) = 0,
  θ^{u,*} ∈ argmax_{θ∈Θ} E^{Q(θ)}[Ψ]  subject to  Σ_{i=1}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)) = 0,
where ℓ : R × R → [0, ∞) is a convex loss function such that min_{x,y∈R} ℓ(x, y) = 0.

Stochastic optimisation
Let M = 1 and define the loss function
  h(θ) = ℓ(E^{Q(θ)}[Φ], p(Φ)).
Then in the gradient update step we have
  ∂_θ h(θ) = ∂_x ℓ(E^Q[Φ(X^θ)], p(Φ)) · E^Q[∂_θ Φ(X^θ)].
Since ℓ is typically not the identity, the mini-batch estimator of ∂_θ h(θ), obtained by replacing Q with the empirical measure Q^N,
  ∂_θ h^N(θ) := ∂_x ℓ(E^{Q^N}[Φ(X^θ)], p(Φ)) · E^{Q^N}[∂_θ Φ(X^θ)],
is a biased estimator of ∂_θ h.
Lemma 1. For ℓ(x, y) = |x − y|², we have
  E^Q | ∂_θ h^N(θ) − ∂_θ h(θ) | ≤ (2/N) Var^Q[Φ(X^θ)]^{1/2} Var^Q[∂_θ Φ(X^θ)]^{1/2}.
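The bias can be seen numerically in a toy model (all of this is an illustrative assumption, not the source's setup: X^θ = θ + Z with Z ~ N(0,1), Φ(x) = x², ℓ(x, y) = (x − y)², batch size N = 2). Because ℓ is nonlinear, the same mini-batch appears in both factors of the gradient and their covariance shows up as a bias; using an independent batch for the outer factor removes it:

```python
import math, random

random.seed(2)
theta, p, N, n_batches = 0.5, 1.0, 2, 100_000

def phi(z):   return (theta + z) ** 2      # payoff Phi(X^theta) with X^theta = theta + Z
def dphi(z):  return 2.0 * (theta + z)     # pathwise derivative w.r.t. theta

# ell(x, y) = (x - y)^2, so  d/dtheta h(theta) = 2*(E[Phi] - p)*E[dPhi].
true_grad = 2.0 * (theta**2 + 1.0 - p) * (2.0 * theta)   # E[Phi] = theta^2 + 1 for Z ~ N(0,1)

biased = unbiased = 0.0
for _ in range(n_batches):
    z = [random.gauss(0.0, 1.0) for _ in range(2 * N)]
    m_phi  = sum(phi(x) for x in z[:N]) / N
    m_dphi = sum(dphi(x) for x in z[:N]) / N
    m_phi2 = sum(phi(x) for x in z[N:]) / N
    biased   += 2.0 * (m_phi - p) * m_dphi    # same mini-batch in both factors: biased
    unbiased += 2.0 * (m_phi2 - p) * m_dphi   # independent batch for the outer factor
biased /= n_batches
unbiased /= n_batches
print(round(biased - true_grad, 2), round(unbiased - true_grad, 2))
```

In this toy case the bias equals 2·Cov(Φ, ∂_θΦ)/N, which is well inside the Lemma 1 bound (2/N)·Var[Φ]^{1/2}·Var[∂_θΦ]^{1/2}; both shrink at the 1/N rate, as the lemma states.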

Learning PDEs
Let
  dX_t^β = σ(t, (X_{s∧t}^β)_{s∈[0,T]}, β) dW_t,
  F_t^β := F^β(t, (X_{s∧t}^β)_{s∈[0,T]}) = E[ Φ((X_s^β)_{s∈[0,T]}) | (X_{s∧t}^β)_{s∈[0,T]} ].
The martingale representation theorem, via functional Itô calculus, gives
  Φ((X_s^β)_{s∈[0,T]}) = F_t^β + ∫_t^T ∇_ω F^β((X_{r∧s}^β)_{r∈[0,T]}) dX_s^β,
so that
  E[ Φ((X_s^β)_{s∈[0,T]}) − F_t^β − ∫_t^T ∇_ω F^β((X_{r∧s}^β)_{r∈[0,T]}) dX_s^β | (X_{s∧t}^β)_{s∈[0,T]} ] = 0.
◮ One can learn (parametric) path-dependent PDEs
◮ We obtain an unbiased approximation to the PDE by hybrid Monte Carlo/deep learning, see [Vidales et al., 2018]

Neural SDEs - Algorithm
Input: time grid π = {t_0, t_1, ..., t_{N_steps}} for the numerical scheme.
Input: option payoffs (Φ_j)_{j=1}^{N_prices}.
Input: market option prices p(Φ_j), j = 1, ..., N_prices.
for epoch = 1 : N_epochs do
  Generate N_trn paths (x^{π,θ,i}_{t_n})_{n=0}^{N_steps} = (s^{π,θ,i}_{t_n}, v^{π,θ,i}_{t_n})_{n=0}^{N_steps}, i = 1, ..., N_trn, using the Euler scheme.
  During one epoch: freeze ξ and use Adam to update
    θ = argmin_θ Σ_{j=1}^{N_prices} ( E^{N_trn}[ Φ_j(X^{π,θ}) − Σ_{k=0}^{N_steps−1} h(t_k, X^{π,θ}_{t_k}; ξ_j) ΔS^{π,θ}_{t_k} ] − p(Φ_j) )².
  During one epoch: freeze θ and use Adam to update ξ by minimising the sample variance
    ξ = argmin_ξ Σ_{j=1}^{N_prices} Var^{N_trn}[ Φ_j(X^{π,θ}) − Σ_{k=0}^{N_steps−1} h(t_k, X^{π,θ}_{t_k}; ξ_j) ΔS^{π,θ}_{t_k} ].
end for
return θ and ξ_j for all payoffs (Φ_j)_{j=1}^{N_prices}.
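The role of the hedging term Σ_k h(t_k, ·) ΔS_{t_k} as a control variate can be illustrated in a plain Black-Scholes setting (a toy stand-in for the neural SDE, with illustrative parameters; the closed-form delta plays the role of the learned strategy h): subtracting the hedge P&L leaves the price estimate unbiased but shrinks its variance.

```python
import math, random

random.seed(3)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

S0, K, sigma, T, n_steps, n_paths = 1.0, 1.0, 0.2, 1.0, 25, 5_000
dt = T / n_steps

def bs_delta(t, s):
    """Black-Scholes delta, used as the hedging strategy h(t_k, S_{t_k})."""
    tau = max(T - t, 1e-12)
    d1 = (math.log(s / K) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))
    return norm_cdf(d1)

plain, hedged = [], []
for _ in range(n_paths):
    s, hedge_pnl = S0, 0.0
    for k in range(n_steps):
        # exact lognormal step, so E[ds | s] = 0 (r = 0)
        ds = s * (math.exp(sigma * random.gauss(0, math.sqrt(dt)) - 0.5 * sigma**2 * dt) - 1)
        hedge_pnl += bs_delta(k * dt, s) * ds    # sum_k h(t_k, S_{t_k}) * Delta S
        s += ds
    payoff = max(s - K, 0.0)
    plain.append(payoff)
    hedged.append(payoff - hedge_pnl)            # control variate: same mean, lower variance

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(round(var(plain) / var(hedged), 1))   # variance reduction factor
```

In the algorithm above, the variance-minimisation step over ξ learns this hedge instead of using a closed form.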

Results
We calibrate a (local) stochastic volatility model
  dS_t = r S_t dt + σ_S(t, S_t, V_t, ν) S_t dB_t^S,  S_0 = 1,
  dV_t = b_V(V_t, φ) dt + σ_V(V_t, ϕ) dB_t^V,  V_0 = v_0,
  d⟨B^S, B^V⟩_t = ρ dt,
to European option prices
  p(Φ) := E^{Q(θ)}[Φ] = e^{−rT} E^{Q(θ)}[(S_T − K)^+ | S_0 = 1]
for maturities of 2, 4, ..., 12 months and typically 21 uniformly spaced strikes in [0.8, 1.2].
As an example of an illiquid derivative for which we wish to find robust bounds, we take the lookback option
  p(Ψ) := E^{Q(θ)}[Ψ] = e^{−rT} E^{Q(θ)}[ max_{t∈[0,T]} S_t − S_T | X_0 = 1 ].
We generate synthetic data using the Heston model.
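A minimal Monte Carlo sketch of this synthetic-data setup (Heston dynamics with illustrative parameter values that are not taken from the source; full-truncation Euler for the variance), pricing the lookback payoff max_t S_t − S_T:

```python
import math, random

random.seed(4)
# Hypothetical Heston parameters, chosen only for illustration.
kappa, vbar, eta, rho, v0, S0, T = 1.5, 0.04, 0.3, -0.7, 0.04, 1.0, 1.0
n_steps, n_paths = 50, 4_000
dt = T / n_steps

payoffs = []
for _ in range(n_paths):
    s, v, s_max = S0, v0, S0
    for _ in range(n_steps):
        z1 = random.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1 - rho**2) * random.gauss(0.0, 1.0)
        vp = max(v, 0.0)                      # truncate negative variance in the square root
        s *= math.exp(math.sqrt(vp * dt) * z1 - 0.5 * vp * dt)
        v += kappa * (vbar - v) * dt + eta * math.sqrt(vp * dt) * z2
        s_max = max(s_max, s)
    payoffs.append(s_max - s)                 # lookback payoff  max_t S_t - S_T
price = sum(payoffs) / n_paths
print(round(price, 3))
```

The payoff is nonnegative pathwise, so the Monte Carlo price is positive; in the experiments this Heston price is the "true" value that the robust bounds should bracket.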

Calibration to market prices
Figure: vanilla option prices and implied volatility curves of the 10 calibrated neural SDEs vs. the market data for different maturities.

Robust pricing
Figure: exotic option prices are in blue; the calibration error is in grey. The three box plots in each group arise, respectively, from targeting the lower bound, an ad hoc price, and the upper bound for the illiquid derivative. Each box plot comes from 10 different runs of the neural SDE calibration.

Control variate effect on training
Figure: root mean squared error of calibration to vanilla option prices, with and without the hedging strategy parametrisation.

Joint SPX and VIX calibration with neural SDEs
Consider the neural SDE
  dS_t^θ = S_t^θ σ(t, V_t^θ; θ) dW_t,
  dV_t^θ = a(t, V_t^θ; θ) dt + b(t, V_t^θ; θ) dB_t,
  ρ dt = d⟨W, B⟩_t.
It can be shown that the VIX dynamics at time t ∈ [0, T] can be expressed as
  VIX_t² := (1/Δτ) E[ ∫_t^{t+Δτ} σ_s² ds | F_t ] = −(2/Δτ) E[ log(S_{t+Δτ}/S_t) | F_t ],  Δτ = 30/365.
The VIX future with maturity T is then given by F^{VIX}_{t,T} := E[VIX_T | F_t], and VIX options are defined as
  C^{VIX}(T, K) := E[(VIX_T − K)^+ | F_t],  P^{VIX}(T, K) := E[(K − VIX_T)^+ | F_t].
Joint work with Antoine Jacquier, Marc Sabate Vidales, David Siska and Zan Zuric.
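The log-contract identity behind the VIX formula can be sanity-checked in the constant-volatility special case (an illustrative assumption, not the neural SDE itself), where −(2/Δτ)·E[log(S_{t+Δτ}/S_t)] should recover σ² exactly:

```python
import math, random

random.seed(5)
sigma, dtau, n = 0.2, 30 / 365, 100_000

# With constant vol, S_{t+dtau}/S_t = exp(sigma*W_dtau - sigma^2*dtau/2),
# so E[log return] = -sigma^2*dtau/2 and the VIX formula returns sigma^2.
acc = 0.0
for _ in range(n):
    log_ret = sigma * random.gauss(0.0, math.sqrt(dtau)) - 0.5 * sigma**2 * dtau
    acc += log_ret
vix_sq = -2.0 / dtau * (acc / n)
print(round(math.sqrt(vix_sq), 3))   # ≈ sigma = 0.2
```

In the neural SDE the instantaneous variance σ(t, V_t; θ)² is stochastic, so the same conditional expectation is evaluated by nested simulation or a learned approximation rather than in closed form.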

Calibration to market data
Figure: calibration to market data (data source: OptionMetrics) containing SPX options, VIX options and VIX futures for T = 1, ..., 6 months.

Calibration to market data
Figure: calibrated neural SDE errors on SPX options and VIX options. Hatched cells correspond to maturity/strike combinations for which no market data was available.

Extensions
◮ A neural SDE model under the real-world measure P(θ)
◮ Let ζ : [0,T] × R^d × R^p → R^n be another parametric function
◮ Let
  b^{S,P}(t, X_t^θ, θ) := r S_t^θ + σ_S(t, X_t^θ, θ) ζ(t, X_t^θ, θ),
  b^{V,P}(t, X_t^θ, θ) := b_V(t, X_t^θ, θ) + σ_V(t, X_t^θ, θ) ζ(t, X_t^θ, θ).
◮ We now define the real-world measure P(θ) via the Radon-Nikodym derivative
  dP(θ)/dQ(θ) := exp( ∫_0^T ζ(t, X_t^θ, θ) dW_t − (1/2) ∫_0^T |ζ(t, X_t^θ, θ)|² dt ).
◮ Under appropriate assumptions on ζ (e.g. boundedness), P(θ) is a probability measure and, by the Girsanov theorem, we can find a Brownian motion (W_t^{P(θ)})_{t∈[0,T]} such that
  dS_t^θ = b^{S,P}(t, X_t^θ, θ) dt + σ_S(t, X_t^θ, θ) dW_t^{P(θ)},
  dV_t^θ = b^{V,P}(t, X_t^θ, θ) dt + σ_V(t, X_t^θ, θ) dW_t^{P(θ)}.
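A quick numerical check of this measure change (with a constant ζ for illustration): the Radon-Nikodym weights average to one under Q, and reweighting W_T reproduces the Girsanov drift ζT, i.e. E^P[W_T] = ζT:

```python
import math, random

random.seed(6)
zeta, T, n_steps, n_paths = 0.5, 1.0, 50, 10_000
dt = T / n_steps

w_sum = z_sum = 0.0
for _ in range(n_paths):
    w, log_weight = 0.0, 0.0
    for _ in range(n_steps):
        dw = random.gauss(0.0, math.sqrt(dt))
        log_weight += zeta * dw - 0.5 * zeta**2 * dt   # stochastic exponential of zeta
        w += dw
    weight = math.exp(log_weight)
    w_sum += weight            # E_Q[dP/dQ] should be 1
    z_sum += weight * w        # E_P[W_T] = E_Q[(dP/dQ) * W_T] should be zeta*T
print(round(w_sum / n_paths, 2), round(z_sum / n_paths, 2))
```

With a state-dependent network ζ(t, X_t; θ), the same weights are accumulated along each simulated path, which is what makes the joint Q/P calibration tractable.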

Extensions
◮ We can incorporate additional market information, e.g. bounds on realised variance, by adding additional constraints during training
◮ We can use neural SDEs to adversarially train hedging strategies, using ideas from distributionally robust optimisation
◮ We can simplify the learned models using ideas from explainable machine learning

Neural SDEs
Pros:
◮ Expressive yet consistent with the classical framework
◮ Data driven by design, adaptable to changes in the environment
◮ Provide consistent models for calibrating under Q and P
◮ Provide a systematic framework for model selection
◮ Ability to learn in a low-data regime (due to a good prior)
◮ (Some) theoretical guarantees for the generalisation error
Cons:
◮ Parameters are not interpretable, but the models are
◮ Computationally more intense than classical models, but because we train with gradient descent, recalibration is (typically) cheap

Neural ODEs - a perspective on recurrent neural networks

Example
◮ Recurrent neural networks can be written as
  X_{l+1} = X_l + φ(X_l, θ_l),  X_0 = ξ ∈ R^d
◮ The infinite-depth network (useful when fitting time-series data):
  dX_t^ξ(θ) = φ(X_t^ξ(θ), θ_t) dt,  t ∈ [0, 1],  X_0 = ξ ∈ R^d
◮ Take input-output data (ξ, ζ) ∼ M. Our objective is to minimise
  J(θ) := ∫_{R^d × R^d} |ζ − X_T^ξ(θ)|² M(dξ, dζ)
◮ Goal: find θ̃ such that
  (d/dε) J(θ + ε(θ̃ − θ)) |_{ε=0} ≤ 0
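The network/ODE correspondence can be checked directly for the illustrative choice φ(x) = tanh(x) (fixed weights, scalar state), where the ODE flow has the closed form sinh(X_t) = sinh(X_0)·e^t: the residual network is exactly the Euler scheme, and its error shrinks as the depth grows.

```python
import math

def resnet(x0, n_layers):
    """Residual network X_{l+1} = X_l + (1/L)*tanh(X_l): the Euler scheme for the ODE."""
    x, h = x0, 1.0 / n_layers
    for _ in range(n_layers):
        x += h * math.tanh(x)
    return x

x0 = 0.7
exact = math.asinh(math.sinh(x0) * math.e)   # flow of dX/dt = tanh(X) at t = 1
errs = [abs(resnet(x0, L) - exact) for L in (4, 16, 64, 256)]
print([round(e, 5) for e in errs])           # error shrinks roughly like 1/L
```

This is the sense in which "infinite depth" is the ODE limit: depth L plays the role of the step size 1/L.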

Relaxed stochastic control and deep learning
◮ A mean-field perspective on neural networks:
  (1/n) Σ_{i=1}^n β^{n,i} ϕ(α^{n,i} · z + ρ^{n,i} · ζ) = ∫_{R^d} β ϕ(α · z + ρ · ζ) ν^n(dβ, dα, dρ)
◮ Let φ(z, a, ζ) = β ϕ(α · z + ρ · ζ), with a = (β, α, ρ), and consider
  X_t^{ν,ξ,ζ} = ξ + ∫_0^t ∫ φ(X_r^{ν,ξ,ζ}, a, ζ) ν_r(da) dr
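A sketch of this mean-field identity: the width-n network is the integral of β·ϕ(α·z) against the empirical measure ν^n of its parameters, and by the law of large numbers it approaches the mean-field integral as n grows (the Gaussian parameter distributions are an illustrative choice, and the ρ·ζ input is dropped for brevity):

```python
import math, random

def width_n_net(n, z):
    """(1/n) * sum_i beta_i * tanh(alpha_i * z): the integral against the empirical nu^n."""
    random.seed(7 + n)   # fresh iid parameters for each width
    return sum(random.gauss(1.0, 0.5) * math.tanh(random.gauss(1.0, 0.5) * z)
               for _ in range(n)) / n

z = 0.8
limit = width_n_net(200_000, z)   # large-n proxy for the mean-field integral against nu
errs = [abs(width_n_net(n, z) - limit) for n in (10, 100, 10_000)]
print([round(e, 4) for e in errs])
```

Training the weights then corresponds to evolving the measure ν rather than the individual neurons, which is the relaxed-control viewpoint developed on the next slides.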

Relaxed stochastic control and deep learning
◮ The objective is
  J^{σ,M}(ν) := ∫_{R^d×S} [ ∫_0^T ∫ f_t(X_t^{ν,ξ,ζ}, a, ζ) ν_t(da) dt + g(X_T^{ν,ξ,ζ}, ζ) ] M(dξ, dζ) + (σ²/2) ∫_0^T Ent(ν_t) dt.
◮ The entropy is
  Ent(m) := ∫_{R^d} m(x) log( m(x)/g(x) ) dx if m is a.c. w.r.t. the Lebesgue measure, and ∞ otherwise,
with Gibbs measure g(x) = e^{−U(x)}, where U is such that ∫_{R^d} e^{−U(x)} dx = 1.
◮ See work by Weinan E [Weinan, 2017]; Cuchiero, Larsson, Teichmann [Cuchiero et al., 2019]; Hu, Kazeykina, Ren [Hu et al., 2019]

Relaxed stochastic control and deep learning
◮ The goal is to find, for each t ∈ [0, T], a vector field flow (b_{s,t})_{s≥0} such that the measure flow (ν_{s,t})_{s≥0} given by
  ∂_s ν_{s,t} = div(ν_{s,t} b_{s,t}),  s ≥ 0,  ν_{0,t} = ν_t^0 ∈ P_2(R^p),
satisfies that s ↦ J^σ(ν_{s,·}) is decreasing.
◮ Relaxed Hamiltonian:
  H_t^σ(x, p, m, ζ) := ∫ h_t(x, p, a, ζ) m(da) + (σ²/2) Ent(m),
  h_t(x, p, a, ζ) := φ_t(x, a, ζ) p + f_t(x, a, ζ)
◮ The adjoint process: P_T^{ξ,ζ}(ν) = (∇_x g)(X_T^{ξ,ζ}(ν), ζ),
  dP_t^{ξ,ζ}(ν) = −(∇_x H_t)(X_t^{ξ,ζ}(ν), P_t^{ξ,ζ}(ν), ν_t) dt

Pontryagin's principle
Theorem 2. If ν ∈ V_2 is (locally) optimal then it must solve the following system:
  ν_t = argmin_{µ ∈ P_2(R^p)} ∫_{R^d×S} H_t^σ(X_t^{ξ,ζ}, P_t^{ξ,ζ}, µ, ζ) M(dξ, dζ),
  dX_t^{ξ,ζ} = Φ(X_t^{ξ,ζ}, ν_t, ζ) dt,  X_0^{ξ,ζ} = ξ ∈ R^d,
  dP_t^{ξ,ζ} = −(∇_x H_t)(X_t^{ξ,ζ}, P_t^{ξ,ζ}, ν_t, ζ) dt,  P_T^{ξ,ζ} = (∇_x g)(X_T^{ξ,ζ}, ζ).

Gradient flow dynamics
  dθ_{s,t} = −[ ∫_{R^d×S} (∇_a h_t)(X_{s,t}^{ξ,ζ}, P_{s,t}^{ξ,ζ}, θ_{s,t}, ζ) M(dξ, dζ) + (σ²/2)(∇_a U)(θ_{s,t}) ] ds + σ dB_s,
where, for t ∈ [0, T],
  ν_{s,t} = L(θ_{s,t}),
  X_{s,t}^{ξ,ζ} = ξ + ∫_0^t Φ_r(X_{s,r}^{ξ,ζ}, ν_{s,r}, ζ) dr,
  P_{s,t}^{ξ,ζ} = (∇_x g)(X_{s,T}^{ξ,ζ}, ζ) + ∫_t^T (∇_x H_r)(X_{s,r}^{ξ,ζ}, P_{s,r}^{ξ,ζ}, ν_{s,r}, ζ) dr.
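These dynamics are Langevin dynamics on the parameters. A toy version (an illustrative assumption: a quadratic potential V(θ) = θ²/2 standing in for the mean-field gradient term, scalar θ) shows the noisy gradient descent sampling its invariant Gibbs measure, here N(0, σ²/2):

```python
import math, random

random.seed(8)
sigma, ds, n_steps, burn = 0.8, 0.01, 200_000, 50_000

def grad_V(x):
    return x          # V(theta) = theta^2/2: stand-in for the mean-field gradient term

theta, samples = 0.0, []
for k in range(n_steps):
    # Euler step of  d theta = -grad V(theta) ds + sigma dB_s
    theta += -grad_V(theta) * ds + sigma * random.gauss(0.0, math.sqrt(ds))
    if k >= burn:
        samples.append(theta)
m = sum(samples) / len(samples)
var = sum((x - m) ** 2 for x in samples) / len(samples)
print(round(var, 3))   # invariant density ∝ exp(-2V/sigma^2), i.e. N(0, sigma^2/2)
```

This is the "training as sampling" picture: the iterates do not converge to a point but to samples from the invariant measure, matching the main result on the next slides.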

Main Result
Theorem 3. Assume that σ > 0. Then
i) if ν* ∈ argmin_{ν∈V_2} J^σ(ν), then ν* is an invariant measure given by
  ν_t*(a) = e^{−(2/σ²) h_t(a, ν*, M)} g(a);
ii) if σ²κ − 4L > 0, then ν* is unique and, for all s ≥ 0 and any initial law L(θ_{0,·}),
  W_2^T(L(θ_{s,·}), ν*)² ≤ e^{−λs} W_2^T(L(θ_{0,·}), ν*)².
◮ W_q^T(µ, ν) := ( ∫_0^T W_q(µ_t, ν_t)^q dt )^{1/q}
◮ h_t(a, µ, M) := ∫_{R^d×S} h_t(X_t^{ξ,ζ}(µ), P_t^{ξ,ζ}(µ), a, ζ) M(dξ, dζ)

Generalisation Error
◮ Recall the cost function
  J^{σ,M}(ν) := ∫_{R^d×S} [ ∫_0^T ∫ f_t(X_t^{ν,ξ,ζ}, a, ζ) ν_t(da) dt + g(X_T^{ν,ξ,ζ}, ζ) ] M(dξ, dζ) + (σ²/2) ∫_0^T Ent(ν_t) dt.
◮ In practice, one does not have access to the population distribution M, but works with the finite sample M^{N_1} := (1/N_1) Σ_{j=1}^{N_1} δ_{(ξ^j, ζ^j)}
◮ Practitioners use J^{M^{N_1}}(ν), and NOT J^{σ,M^{N_1}}(ν), to set the stopping criteria for learning
◮ The entropy term can be viewed as implicit regularisation

Generalisation Error
Theorem 4. Let σ²κ ≫ 0. There is c > 0, independent of λ, S, N_1, N_2, d and p, such that
  E| J^M(ν^{*,σ}) − J^M(ν^{σ,N_1,N_2,Δs}_{S,·}) | ≤ c ( e^{−λS} + 1/N_1 + 1/N_2 + h ).
The generalisation error is given by
  J^M(ν^{σ,N_1,N_2,Δs}_{S,·}) − J^M(ν^{*,σ}) − (σ²/2) ∫_0^T Ent(ν_t^{*,σ}) dt + min_{µ∈V_2} J^{σ,M}(µ).
◮ N_1 - size of the training data
◮ N_2 - proxy for the number of parameters
◮ γ - learning rate
◮ S/γ - proxy for the training time
◮ By discretising the ODEs we can get estimates on the number of layers

Generalisation Error
Assumption 1. Fix ε > 0 and N_1 > 0. Assume that, for all M^{N_1}, J^{M^{N_1}}(ν^{*,σ,N_1}) ≤ ε.
Theorem 5. There is c > 0, independent of λ, S, N_1, N_2, d and p, such that
  E| J^M(ν^{σ,N_1,N_2,Δs}_{S,·}) | ≤ ε² + c ( e^{−λS} + 1/N_1 + 1/N_2 + h ).

Outlook
We have a full analysis of the convergence of the regularised gradient descent algorithm for deep networks modelled by ODEs.
Key messages:
◮ Training of neural nets should be viewed as a sampling problem rather than an optimisation problem
◮ Wasserstein gradient flow provides a framework to study the convergence of training algorithms
◮ Probabilistic numerical analysis provides quantitative bounds that do not suffer from the curse of dimensionality

References I
[Arribas et al., 2020] Arribas, I. P., Salvi, C., and Szpruch, L. (2020). Sig-SDEs model for quantitative finance. arXiv preprint arXiv:2006.00218.
[Belkin et al., 2018] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118.
[Buehler et al., 2019] Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8):1271-1291.
[Buehler et al., 2020] Buehler, H., Horvath, B., Lyons, T., Perez Arribas, I., and Wood, B. (2020). Generating financial markets with signatures. Available at SSRN.
[Cuchiero et al., 2020] Cuchiero, C., Khosrawi, W., and Teichmann, J. (2020). A generative adversarial network approach to calibration of local stochastic volatility models. arXiv preprint arXiv:2005.02505.
[Cuchiero et al., 2019] Cuchiero, C., Larsson, M., and Teichmann, J. (2019). Deep neural networks, generic universal interpolation, and controlled ODEs. arXiv preprint arXiv:1908.07838.
[Gierjatowicz et al., 2020] Gierjatowicz, P., Sabate-Vidales, M., Siska, D., Szpruch, L., and Zuric, Z. (2020). Robust pricing and hedging via neural SDEs. Available at SSRN 3646241.
[Heiss et al., 2019] Heiss, J., Teichmann, J., and Wutte, H. (2019). How implicit regularization of neural networks affects the learned function - part I. arXiv preprint arXiv:1911.02903.
[Henry-Labordere, 2019] Henry-Labordere, P. (2019). Generative models for financial data. Available at SSRN 3408007.
[Hernandez, 2016] Hernandez, A. (2016). Model calibration with neural networks. Available at SSRN 2812140.
