Robust pricing and hedging via neural SDEs
Robust pricing and hedging via neural SDEs
Lukasz Szpruch, University of Edinburgh and The Alan Turing Institute, London
Joint work with David Siska and Marc Sabate-Vilades (University of Edinburgh) and Zan Zuric and Antoine Jacquier (Imperial College London)


Generative modelling
◮ Generative models such as GANs and VAEs have demonstrated great success in seemingly high-dimensional setups.
◮ Input: a source distribution µ and a target distribution ν, i.e. input-output data.
◮ A generative model is a transport map T from µ to ν, i.e. a map that "pushes µ onto ν". We write T#µ = ν.
◮ Parametrise the transport map as T(θ), θ ∈ R^p, e.g. a network architecture or the Heston model.
◮ Seek θ* such that T(θ*)#µ ≈ ν.
◮ One needs to choose the metric
  D(T(θ)#µ, ν) := sup_{f ∈ K} | ∫ f(x) (T(θ)#µ)(dx) − ∫ f(x) ν(dx) |.
◮ K could be the set of options we want to calibrate to, or a set of neural networks.
◮ The modelling choices are: 1. the metric D; 2. the parametrisation of T; 3. the algorithm used for training!
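As a toy sketch of these choices (everything here is an illustrative assumption: a Gaussian source and target, an affine map T(θ)(x) = a + b·x, and K the first two monomials {x, x²}), training reduces to matching expectations of the test functions in K:

```python
import math, random

random.seed(0)
n = 10_000
mu = [random.gauss(0.0, 1.0) for _ in range(n)]              # source samples from mu
nu = [1.0 + 2.0 * random.gauss(0.0, 1.0) for _ in range(n)]  # target nu = N(1, 2^2)

# Sample moments; K = {x, x^2}, so D only needs the first two moments.
m1, m2 = sum(mu) / n, sum(x * x for x in mu) / n
t1, t2 = sum(nu) / n, sum(y * y for y in nu) / n

def D(theta):
    """sup_{f in K} |E[f(T(theta)#mu)] - E[f(nu)]| for the affine map T(x) = a + b*x."""
    a, b = theta
    p1 = a + b * m1                               # E[T(X)]
    p2 = a * a + 2 * a * b * m1 + b * b * m2      # E[T(X)^2]
    return max(abs(p1 - t1), abs(p2 - t2))

# Crude grid search over theta = (a, b), standing in for gradient training.
best = min(((a / 10, b / 10) for a in range(-30, 31) for b in range(41)), key=D)
print(best)
```

The recovered map is close to the true shift-and-scale (a, b) ≈ (1, 2); swapping K for option payoffs or a discriminator network changes the metric but not the structure of the problem.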

Generative modelling in finance
Pros:
◮ Expressive and work in high dimensions
◮ Data driven by design, adaptable to changes in the environment
◮ Provide a new perspective on classical problems in finance
Cons:
◮ Parameters are not interpretable - a black-box approach
◮ Training algorithms are data hungry
◮ Models might be hard to work with, e.g. how to go from Q to P?
◮ A largely empirical field, lacking standardised benchmarks and theoretical guarantees

Robust pricing and hedging via neural SDEs

Model Calibration
Classical calibration:
◮ Pick a parametric model (S_t(θ))_{t∈[0,T]} (e.g. an Itô process) with parameters θ ∈ R^p
◮ The parametric model induces a martingale measure Q(θ)
◮ Input data: prices of traded derivatives (p(Φ_i))_{i=0}^M with corresponding payoffs (Φ_i)_{i=0}^M
◮ Output: θ* such that p(Φ_i) ≈ E^{Q(θ*)}[Φ_i]
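In the simplest possible instance of this pipeline (one parameter, one instrument; Black-Scholes is an illustrative stand-in for the parametric model, and the market quote is synthetic), calibration is root-finding on the pricing map θ ↦ E^{Q(θ)}[Φ]:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(sigma, S0=1.0, K=1.0, T=0.5, r=0.0):
    """Model price E^{Q(theta)}[e^{-rT}(S_T - K)^+] under Black-Scholes, theta = sigma."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

p_market = bs_call(0.2)   # hypothetical market quote, generated at sigma = 0.2

# Calibration: bisection on sigma (call prices are increasing in sigma).
lo, hi = 1e-4, 2.0
for _ in range(60):
    sigma_hat = 0.5 * (lo + hi)
    if bs_call(sigma_hat) < p_market:
        lo = sigma_hat
    else:
        hi = sigma_hat
print(round(sigma_hat, 4))
```

With many instruments and many parameters this becomes the loss-minimisation problem discussed later, but the structure is the same: match model expectations to quotes.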

Robust price bounds
◮ There are infinitely many models consistent with the market
◮ M - the set of all martingale measures calibrated to the data
◮ Compute conservative bounds for the price: sup_{Q∈M} E^Q[Ψ] and inf_{Q∈M} E^Q[Ψ]
◮ Use duality theory to deduce a (semi-static) hedging strategy
◮ The bounds obtained are typically too wide to be of practical value
◮ Challenges: a) incorporate prior information to restrict the search space M; b) design efficient algorithms for computing price bounds and the corresponding hedges

Classical risk models, generative models, neural SDEs, robust finance

Neural SDEs
◮ We build an Itô process (X_t^θ)_{t∈[0,T]} with parameters θ ∈ R^p:
  dS_t^θ = r S_t^θ dt + σ_S(t, X_t^θ, θ) dW_t,
  dV_t^θ = b_V(t, X_t^θ, θ) dt + σ_V(t, X_t^θ, θ) dW_t,
  X_t^θ = (S_t^θ, V_t^θ),
where σ_S, b_V, σ_V are given by neural networks (and can be path-dependent)
◮ The model induces a martingale probability measure Q(θ)
◮ The solution map is an instance of causal transport
◮ See [Cuchiero et al., 2020] for neural SDEs with a prior on the volatility process
◮ See [Arribas et al., 2020] for Sig-SDEs (a neural SDE in a signature feature space)
◮ Neural SDEs are easy to work with, e.g. a consistent change from Q to P
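A minimal simulation sketch (pure Python; untrained random networks stand in for σ_S, b_V, σ_V, and the 0.2 / 0.1 scalings are arbitrary choices, not from the source): an Euler scheme for the neural SDE, checking that with r = 0 the discounted stock price has mean S_0, consistent with Q(θ) being a martingale measure:

```python
import math, random

random.seed(1)

def mlp(params, x):
    """Tiny one-hidden-layer tanh network R^3 -> R (an assumed stand-in architecture)."""
    (W1, b1), (W2, b2) = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

def rand_params(n_in=3, n_hid=4):
    return (([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)],
             [random.uniform(-1, 1) for _ in range(n_hid)]),
            ([random.uniform(-1, 1) for _ in range(n_hid)],
             random.uniform(-1, 1)))

theta_sS, theta_bV, theta_sV = rand_params(), rand_params(), rand_params()
r, T, n_steps, n_paths = 0.0, 1.0, 20, 2_000
dt = T / n_steps

total = 0.0
for _ in range(n_paths):
    s, v = 1.0, 0.2
    for k in range(n_steps):
        x = (k * dt, s, v)
        dw = random.gauss(0.0, math.sqrt(dt))
        # dS = r*S dt + sigma_S(t, X; theta) dW ;  dV = b_V dt + sigma_V dW
        s, v = (s + r * s * dt + 0.2 * abs(mlp(theta_sS, x)) * dw,
                v + 0.1 * mlp(theta_bV, x) * dt + 0.1 * mlp(theta_sV, x) * dw)
    total += math.exp(-r * T) * s
mean_discounted = total / n_paths
print(round(mean_discounted, 3))
```

The martingale property holds by construction of the drift, whatever the (untrained) network weights are; training only shapes the law of the paths.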

Neural SDEs
i) Calibration to market prices. Find model parameters θ* such that model prices match market prices:
  θ* ∈ argmin_{θ∈Θ} Σ_{i=1}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)).
ii) Robust pricing. Find model parameters θ^{l,*} and θ^{u,*} which provide robust arbitrage-free price bounds for an illiquid derivative, subject to the available market data:
  θ^{l,*} ∈ argmin_{θ∈Θ} E^{Q(θ)}[Ψ]  subject to  Σ_{i=1}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)) = 0,
  θ^{u,*} ∈ argmax_{θ∈Θ} E^{Q(θ)}[Ψ]  subject to  Σ_{i=1}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)) = 0,
where ℓ : R × R → [0, ∞) is a convex loss function such that min_{x,y∈R} ℓ(x, y) = 0.

Stochastic optimisation
Let M = 1 and define the loss function
  h(θ) = ℓ(E^{Q(θ)}[Φ], p(Φ)).
Then in the gradient update step we have
  ∂_θ h(θ) = ∂_x ℓ(E^Q[Φ(X^θ)], p(Φ)) · E^Q[∂_θ Φ(X^θ)].
Since ℓ is typically not the identity, the mini-batch estimator of ∂_θ h(θ), obtained by replacing Q with the empirical measure Q^N,
  ∂_θ h^N(θ) := ∂_x ℓ(E^{Q^N}[Φ(X^θ)], p(Φ)) · E^{Q^N}[∂_θ Φ(X^θ)],
is a biased estimator of ∂_θ h.
Lemma 1. For ℓ(x, y) = |x − y|², we have
  E^Q | ∂_θ h^N(θ) − ∂_θ h(θ) | ≤ (2/N) Var^Q[Φ(X^θ)]^{1/2} Var^Q[∂_θ Φ(X^θ)]^{1/2}.
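The bias can be seen numerically in a toy model (all of this is an illustrative assumption, not the source's setup: X^θ = θ + Z with Z ~ N(0,1), Φ(x) = x², ℓ(x, y) = (x − y)², batch size N = 2). Because ℓ is nonlinear, the same mini-batch appears in both factors of the gradient and their covariance shows up as a bias; using an independent batch for the outer factor removes it:

```python
import math, random

random.seed(2)
theta, p, N, n_batches = 0.5, 1.0, 2, 100_000

def phi(z):   return (theta + z) ** 2      # payoff Phi(X^theta) with X^theta = theta + Z
def dphi(z):  return 2.0 * (theta + z)     # pathwise derivative w.r.t. theta

# ell(x, y) = (x - y)^2, so  d/dtheta h(theta) = 2*(E[Phi] - p)*E[dPhi].
true_grad = 2.0 * (theta**2 + 1.0 - p) * (2.0 * theta)   # E[Phi] = theta^2 + 1 for Z ~ N(0,1)

biased = unbiased = 0.0
for _ in range(n_batches):
    z = [random.gauss(0.0, 1.0) for _ in range(2 * N)]
    m_phi  = sum(phi(x) for x in z[:N]) / N
    m_dphi = sum(dphi(x) for x in z[:N]) / N
    m_phi2 = sum(phi(x) for x in z[N:]) / N
    biased   += 2.0 * (m_phi - p) * m_dphi    # same mini-batch in both factors: biased
    unbiased += 2.0 * (m_phi2 - p) * m_dphi   # independent batch for the outer factor
biased /= n_batches
unbiased /= n_batches
print(round(biased - true_grad, 2), round(unbiased - true_grad, 2))
```

In this toy case the bias equals 2·Cov(Φ, ∂_θΦ)/N, which is well inside the Lemma 1 bound (2/N)·Var[Φ]^{1/2}·Var[∂_θΦ]^{1/2}; both shrink at the 1/N rate, as the lemma states.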

Learning PDEs
Let
  dX_t^β = σ(t, (X_{s∧t}^β)_{s∈[0,T]}, β) dW_t,
  F_t^β := F^β(t, (X_{s∧t}^β)_{s∈[0,T]}) = E[ Φ((X_s^β)_{s∈[0,T]}) | (X_{s∧t}^β)_{s∈[0,T]} ].
The martingale representation theorem, via functional Itô calculus, gives
  Φ((X_s^β)_{s∈[0,T]}) = F_t^β + ∫_t^T ∇_ω F^β((X_{r∧s}^β)_{r∈[0,T]}) dX_s^β,
so that
  E[ Φ((X_s^β)_{s∈[0,T]}) − F_t^β − ∫_t^T ∇_ω F^β((X_{r∧s}^β)_{r∈[0,T]}) dX_s^β | (X_{s∧t}^β)_{s∈[0,T]} ] = 0.
◮ One can learn (parametric) path-dependent PDEs
◮ We obtain an unbiased approximation to the PDE by hybrid Monte Carlo/deep learning, see [Vidales et al., 2018]

Neural SDEs - Algorithm
Input: time grid π = {t_0, t_1, ..., t_{N_steps}} for the numerical scheme.
Input: option payoffs (Φ_j)_{j=1}^{N_prices}.
Input: market option prices p(Φ_j), j = 1, ..., N_prices.
for epoch = 1 : N_epochs do
  Generate N_trn paths (x^{π,θ,i}_{t_n})_{n=0}^{N_steps} = (s^{π,θ,i}_{t_n}, v^{π,θ,i}_{t_n})_{n=0}^{N_steps}, i = 1, ..., N_trn, using the Euler scheme.
  During one epoch: freeze ξ and use Adam to update
    θ = argmin_θ Σ_{j=1}^{N_prices} ( E^{N_trn}[ Φ_j(X^{π,θ}) − Σ_{k=0}^{N_steps−1} h(t_k, X^{π,θ}_{t_k}; ξ_j) ΔS^{π,θ}_{t_k} ] − p(Φ_j) )².
  During one epoch: freeze θ and use Adam to update ξ by minimising the sample variance
    ξ = argmin_ξ Σ_{j=1}^{N_prices} Var^{N_trn}[ Φ_j(X^{π,θ}) − Σ_{k=0}^{N_steps−1} h(t_k, X^{π,θ}_{t_k}; ξ_j) ΔS^{π,θ}_{t_k} ].
end for
return θ and ξ_j for all payoffs (Φ_j)_{j=1}^{N_prices}.
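The role of the hedging term Σ_k h(t_k, ·) ΔS_{t_k} as a control variate can be illustrated in a plain Black-Scholes setting (a toy stand-in for the neural SDE, with illustrative parameters; the closed-form delta plays the role of the learned strategy h): subtracting the hedge P&L leaves the price estimate unbiased but shrinks its variance.

```python
import math, random

random.seed(3)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

S0, K, sigma, T, n_steps, n_paths = 1.0, 1.0, 0.2, 1.0, 25, 5_000
dt = T / n_steps

def bs_delta(t, s):
    """Black-Scholes delta, used as the hedging strategy h(t_k, S_{t_k})."""
    tau = max(T - t, 1e-12)
    d1 = (math.log(s / K) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))
    return norm_cdf(d1)

plain, hedged = [], []
for _ in range(n_paths):
    s, hedge_pnl = S0, 0.0
    for k in range(n_steps):
        # exact lognormal step, so E[ds | s] = 0 (r = 0)
        ds = s * (math.exp(sigma * random.gauss(0, math.sqrt(dt)) - 0.5 * sigma**2 * dt) - 1)
        hedge_pnl += bs_delta(k * dt, s) * ds    # sum_k h(t_k, S_{t_k}) * Delta S
        s += ds
    payoff = max(s - K, 0.0)
    plain.append(payoff)
    hedged.append(payoff - hedge_pnl)            # control variate: same mean, lower variance

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(round(var(plain) / var(hedged), 1))   # variance reduction factor
```

In the algorithm above, the variance-minimisation step over ξ learns this hedge instead of using a closed form.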

Results
We calibrate a (local) stochastic volatility model
  dS_t = r S_t dt + σ_S(t, S_t, V_t, ν) S_t dB_t^S,  S_0 = 1,
  dV_t = b_V(V_t, φ) dt + σ_V(V_t, ϕ) dB_t^V,  V_0 = v_0,
  d⟨B^S, B^V⟩_t = ρ dt,
to European option prices
  p(Φ) := E^{Q(θ)}[Φ] = e^{−rT} E^{Q(θ)}[(S_T − K)^+ | S_0 = 1]
for maturities of 2, 4, ..., 12 months and typically 21 uniformly spaced strikes in [0.8, 1.2].
As an example of an illiquid derivative for which we wish to find robust bounds, we take the lookback option
  p(Ψ) := E^{Q(θ)}[Ψ] = e^{−rT} E^{Q(θ)}[ max_{t∈[0,T]} S_t − S_T | X_0 = 1 ].
We generate synthetic data using the Heston model.
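A minimal Monte Carlo sketch of this synthetic-data setup (Heston dynamics with illustrative parameter values that are not taken from the source; full-truncation Euler for the variance), pricing the lookback payoff max_t S_t − S_T:

```python
import math, random

random.seed(4)
# Hypothetical Heston parameters, chosen only for illustration.
kappa, vbar, eta, rho, v0, S0, T = 1.5, 0.04, 0.3, -0.7, 0.04, 1.0, 1.0
n_steps, n_paths = 50, 4_000
dt = T / n_steps

payoffs = []
for _ in range(n_paths):
    s, v, s_max = S0, v0, S0
    for _ in range(n_steps):
        z1 = random.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1 - rho**2) * random.gauss(0.0, 1.0)
        vp = max(v, 0.0)                      # truncate negative variance in the square root
        s *= math.exp(math.sqrt(vp * dt) * z1 - 0.5 * vp * dt)
        v += kappa * (vbar - v) * dt + eta * math.sqrt(vp * dt) * z2
        s_max = max(s_max, s)
    payoffs.append(s_max - s)                 # lookback payoff  max_t S_t - S_T
price = sum(payoffs) / n_paths
print(round(price, 3))
```

The payoff is nonnegative pathwise, so the Monte Carlo price is positive; in the experiments this Heston price is the "true" value that the robust bounds should bracket.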

Calibration to market prices
Figure: vanilla option prices and implied volatility curves of the 10 calibrated neural SDEs vs. the market data for different maturities.

Robust pricing
Figure: exotic option prices are in blue; the calibration error is in grey. The three box plots in each group arise, respectively, from targeting the lower bound, an ad hoc price, and the upper bound for the illiquid derivative. Each box plot comes from 10 different runs of the neural SDE calibration.

Control variate effect on training
Figure: root mean squared error of calibration to vanilla option prices, with and without the hedging strategy parametrisation.

Joint SPX and VIX calibration with neural SDEs
Consider the neural SDE
  dS_t^θ = S_t^θ σ(t, V_t^θ; θ) dW_t,
  dV_t^θ = a(t, V_t^θ; θ) dt + b(t, V_t^θ; θ) dB_t,
  ρ dt = d⟨W, B⟩_t.
It can be shown that the VIX dynamics at time t ∈ [0, T] can be expressed as
  VIX_t² := (1/Δτ) E[ ∫_t^{t+Δτ} σ_s² ds | F_t ] = −(2/Δτ) E[ log(S_{t+Δτ}/S_t) | F_t ],  Δτ = 30/365.
The VIX future with maturity T is then given by F^{VIX}_{t,T} := E[VIX_T | F_t], and VIX options are defined as
  C^{VIX}(T, K) := E[(VIX_T − K)^+ | F_t],  P^{VIX}(T, K) := E[(K − VIX_T)^+ | F_t].
Joint work with Antoine Jacquier, Marc Sabate Vidales, David Siska and Zan Zuric.
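The log-contract identity behind the VIX formula can be sanity-checked in the constant-volatility special case (an illustrative assumption, not the neural SDE itself), where −(2/Δτ)·E[log(S_{t+Δτ}/S_t)] should recover σ² exactly:

```python
import math, random

random.seed(5)
sigma, dtau, n = 0.2, 30 / 365, 100_000

# With constant vol, S_{t+dtau}/S_t = exp(sigma*W_dtau - sigma^2*dtau/2),
# so E[log return] = -sigma^2*dtau/2 and the VIX formula returns sigma^2.
acc = 0.0
for _ in range(n):
    log_ret = sigma * random.gauss(0.0, math.sqrt(dtau)) - 0.5 * sigma**2 * dtau
    acc += log_ret
vix_sq = -2.0 / dtau * (acc / n)
print(round(math.sqrt(vix_sq), 3))   # ≈ sigma = 0.2
```

In the neural SDE the instantaneous variance σ(t, V_t; θ)² is stochastic, so the same conditional expectation is evaluated by nested simulation or a learned approximation rather than in closed form.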

Calibration to market data
Figure: calibration to market data (data source: OptionMetrics) containing SPX options, VIX options and VIX futures for T = 1, ..., 6 months.

Calibration to market data
Figure: calibrated neural SDE errors on SPX options and VIX options. Hatched cells correspond to maturity/strike combinations for which no market data was available.

Extensions
◮ A neural SDE model under the real-world measure P(θ)
◮ Let ζ : [0,T] × R^d × R^p → R^n be another parametric function
◮ Let
  b^{S,P}(t, X_t^θ, θ) := r S_t^θ + σ_S(t, X_t^θ, θ) ζ(t, X_t^θ, θ),
  b^{V,P}(t, X_t^θ, θ) := b_V(t, X_t^θ, θ) + σ_V(t, X_t^θ, θ) ζ(t, X_t^θ, θ).
◮ We now define the real-world measure P(θ) via the Radon-Nikodym derivative
  dP(θ)/dQ(θ) := exp( ∫_0^T ζ(t, X_t^θ, θ) dW_t − (1/2) ∫_0^T |ζ(t, X_t^θ, θ)|² dt ).
◮ Under appropriate assumptions on ζ (e.g. boundedness), P(θ) is a probability measure and, by the Girsanov theorem, we can find a Brownian motion (W_t^{P(θ)})_{t∈[0,T]} such that
  dS_t^θ = b^{S,P}(t, X_t^θ, θ) dt + σ_S(t, X_t^θ, θ) dW_t^{P(θ)},
  dV_t^θ = b^{V,P}(t, X_t^θ, θ) dt + σ_V(t, X_t^θ, θ) dW_t^{P(θ)}.
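A quick numerical check of this measure change (with a constant ζ for illustration): the Radon-Nikodym weights average to one under Q, and reweighting W_T reproduces the Girsanov drift ζT, i.e. E^P[W_T] = ζT:

```python
import math, random

random.seed(6)
zeta, T, n_steps, n_paths = 0.5, 1.0, 50, 10_000
dt = T / n_steps

w_sum = z_sum = 0.0
for _ in range(n_paths):
    w, log_weight = 0.0, 0.0
    for _ in range(n_steps):
        dw = random.gauss(0.0, math.sqrt(dt))
        log_weight += zeta * dw - 0.5 * zeta**2 * dt   # stochastic exponential of zeta
        w += dw
    weight = math.exp(log_weight)
    w_sum += weight            # E_Q[dP/dQ] should be 1
    z_sum += weight * w        # E_P[W_T] = E_Q[(dP/dQ) * W_T] should be zeta*T
print(round(w_sum / n_paths, 2), round(z_sum / n_paths, 2))
```

With a state-dependent network ζ(t, X_t; θ), the same weights are accumulated along each simulated path, which is what makes the joint Q/P calibration tractable.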

Extensions
◮ We can incorporate additional market information, e.g. bounds on realised variance, by adding additional constraints during training
◮ We can use neural SDEs to adversarially train hedging strategies, using ideas from distributionally robust optimisation
◮ We can simplify the learned models using ideas from explainable machine learning

Neural SDEs
Pros:
◮ Expressive yet consistent with the classical framework
◮ Data driven by design, adaptable to changes in the environment
◮ Provide consistent models for calibrating under Q and P
◮ Provide a systematic framework for model selection
◮ Ability to learn in a low-data regime (due to a good prior)
◮ (Some) theoretical guarantees for the generalisation error
Cons:
◮ Parameters are not interpretable, but the models are
◮ Computationally more intense than classical models, but because we train with gradient descent, recalibration is (typically) cheap

Neural ODEs - a perspective on recurrent neural networks

Example
◮ Recurrent neural networks can be written as
  X_{l+1} = X_l + φ(X_l, θ_l),  X_0 = ξ ∈ R^d
◮ The infinite-depth network (useful when fitting time-series data):
  dX_t^ξ(θ) = φ(X_t^ξ(θ), θ_t) dt,  t ∈ [0, 1],  X_0 = ξ ∈ R^d
◮ Take input-output data (ξ, ζ) ∼ M. Our objective is to minimise
  J(θ) := ∫_{R^d × R^d} |ζ − X_T^ξ(θ)|² M(dξ, dζ)
◮ Goal: find θ̃ such that
  (d/dε) J(θ + ε(θ̃ − θ)) |_{ε=0} ≤ 0
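The network/ODE correspondence can be checked directly for the illustrative choice φ(x) = tanh(x) (fixed weights, scalar state), where the ODE flow has the closed form sinh(X_t) = sinh(X_0)·e^t: the residual network is exactly the Euler scheme, and its error shrinks as the depth grows.

```python
import math

def resnet(x0, n_layers):
    """Residual network X_{l+1} = X_l + (1/L)*tanh(X_l): the Euler scheme for the ODE."""
    x, h = x0, 1.0 / n_layers
    for _ in range(n_layers):
        x += h * math.tanh(x)
    return x

x0 = 0.7
exact = math.asinh(math.sinh(x0) * math.e)   # flow of dX/dt = tanh(X) at t = 1
errs = [abs(resnet(x0, L) - exact) for L in (4, 16, 64, 256)]
print([round(e, 5) for e in errs])           # error shrinks roughly like 1/L
```

This is the sense in which "infinite depth" is the ODE limit: depth L plays the role of the step size 1/L.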

Relaxed stochastic control and deep learning
◮ A mean-field perspective on neural networks:
  (1/n) Σ_{i=1}^n β^{n,i} ϕ(α^{n,i} · z + ρ^{n,i} · ζ) = ∫_{R^d} β ϕ(α · z + ρ · ζ) ν^n(dβ, dα, dρ)
◮ Let φ(z, a, ζ) = β ϕ(α · z + ρ · ζ), with a = (β, α, ρ), and consider
  X_t^{ν,ξ,ζ} = ξ + ∫_0^t ∫ φ(X_r^{ν,ξ,ζ}, a, ζ) ν_r(da) dr
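A sketch of this mean-field identity: the width-n network is the integral of β·ϕ(α·z) against the empirical measure ν^n of its parameters, and by the law of large numbers it approaches the mean-field integral as n grows (the Gaussian parameter distributions are an illustrative choice, and the ρ·ζ input is dropped for brevity):

```python
import math, random

def width_n_net(n, z):
    """(1/n) * sum_i beta_i * tanh(alpha_i * z): the integral against the empirical nu^n."""
    random.seed(7 + n)   # fresh iid parameters for each width
    return sum(random.gauss(1.0, 0.5) * math.tanh(random.gauss(1.0, 0.5) * z)
               for _ in range(n)) / n

z = 0.8
limit = width_n_net(200_000, z)   # large-n proxy for the mean-field integral against nu
errs = [abs(width_n_net(n, z) - limit) for n in (10, 100, 10_000)]
print([round(e, 4) for e in errs])
```

Training the weights then corresponds to evolving the measure ν rather than the individual neurons, which is the relaxed-control viewpoint developed on the next slides.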

Relaxed stochastic control and deep learning
◮ The objective is
  J^{σ,M}(ν) := ∫_{R^d×S} [ ∫_0^T ∫ f_t(X_t^{ν,ξ,ζ}, a, ζ) ν_t(da) dt + g(X_T^{ν,ξ,ζ}, ζ) ] M(dξ, dζ) + (σ²/2) ∫_0^T Ent(ν_t) dt.
◮ The entropy is
  Ent(m) := ∫_{R^d} m(x) log( m(x)/g(x) ) dx if m is a.c. w.r.t. the Lebesgue measure, and ∞ otherwise,
with Gibbs measure g(x) = e^{−U(x)}, where U is such that ∫_{R^d} e^{−U(x)} dx = 1.
◮ See work by Weinan E [Weinan, 2017]; Cuchiero, Larsson, Teichmann [Cuchiero et al., 2019]; Hu, Kazeykina, Ren [Hu et al., 2019]

Relaxed stochastic control and deep learning
◮ The goal is to find, for each t ∈ [0, T], a vector field flow (b_{s,t})_{s≥0} such that the measure flow (ν_{s,t})_{s≥0} given by
  ∂_s ν_{s,t} = div(ν_{s,t} b_{s,t}),  s ≥ 0,  ν_{0,t} = ν_t^0 ∈ P_2(R^p),
satisfies that s ↦ J^σ(ν_{s,·}) is decreasing.
◮ Relaxed Hamiltonian:
  H_t^σ(x, p, m, ζ) := ∫ h_t(x, p, a, ζ) m(da) + (σ²/2) Ent(m),
  h_t(x, p, a, ζ) := φ_t(x, a, ζ) p + f_t(x, a, ζ)
◮ The adjoint process: P_T^{ξ,ζ}(ν) = (∇_x g)(X_T^{ξ,ζ}(ν), ζ),
  dP_t^{ξ,ζ}(ν) = −(∇_x H_t)(X_t^{ξ,ζ}(ν), P_t^{ξ,ζ}(ν), ν_t) dt

Pontryagin's principle
Theorem 2. If ν ∈ V_2 is (locally) optimal then it must solve the following system:
  ν_t = argmin_{µ ∈ P_2(R^p)} ∫_{R^d×S} H_t^σ(X_t^{ξ,ζ}, P_t^{ξ,ζ}, µ, ζ) M(dξ, dζ),
  dX_t^{ξ,ζ} = Φ(X_t^{ξ,ζ}, ν_t, ζ) dt,  X_0^{ξ,ζ} = ξ ∈ R^d,
  dP_t^{ξ,ζ} = −(∇_x H_t)(X_t^{ξ,ζ}, P_t^{ξ,ζ}, ν_t, ζ) dt,  P_T^{ξ,ζ} = (∇_x g)(X_T^{ξ,ζ}, ζ).

Gradient flow dynamics
  dθ_{s,t} = −[ ∫_{R^d×S} (∇_a h_t)(X_{s,t}^{ξ,ζ}, P_{s,t}^{ξ,ζ}, θ_{s,t}, ζ) M(dξ, dζ) + (σ²/2)(∇_a U)(θ_{s,t}) ] ds + σ dB_s,
where, for t ∈ [0, T],
  ν_{s,t} = L(θ_{s,t}),
  X_{s,t}^{ξ,ζ} = ξ + ∫_0^t Φ_r(X_{s,r}^{ξ,ζ}, ν_{s,r}, ζ) dr,
  P_{s,t}^{ξ,ζ} = (∇_x g)(X_{s,T}^{ξ,ζ}, ζ) + ∫_t^T (∇_x H_r)(X_{s,r}^{ξ,ζ}, P_{s,r}^{ξ,ζ}, ν_{s,r}, ζ) dr.
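These dynamics are Langevin dynamics on the parameters. A toy version (an illustrative assumption: a quadratic potential V(θ) = θ²/2 standing in for the mean-field gradient term, scalar θ) shows the noisy gradient descent sampling its invariant Gibbs measure, here N(0, σ²/2):

```python
import math, random

random.seed(8)
sigma, ds, n_steps, burn = 0.8, 0.01, 200_000, 50_000

def grad_V(x):
    return x          # V(theta) = theta^2/2: stand-in for the mean-field gradient term

theta, samples = 0.0, []
for k in range(n_steps):
    # Euler step of  d theta = -grad V(theta) ds + sigma dB_s
    theta += -grad_V(theta) * ds + sigma * random.gauss(0.0, math.sqrt(ds))
    if k >= burn:
        samples.append(theta)
m = sum(samples) / len(samples)
var = sum((x - m) ** 2 for x in samples) / len(samples)
print(round(var, 3))   # invariant density ∝ exp(-2V/sigma^2), i.e. N(0, sigma^2/2)
```

This is the "training as sampling" picture: the iterates do not converge to a point but to samples from the invariant measure, matching the main result on the next slides.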

Main Result
Theorem 3. Assume that σ > 0. Then
i) if ν* ∈ argmin_{ν∈V_2} J^σ(ν), then ν* is an invariant measure given by
  ν_t*(a) = e^{−(2/σ²) h_t(a, ν*, M)} g(a);
ii) if σ²κ − 4L > 0, then ν* is unique and, for all s ≥ 0 and any initial law L(θ_{0,·}),
  W_2^T(L(θ_{s,·}), ν*)² ≤ e^{−λs} W_2^T(L(θ_{0,·}), ν*)².
◮ W_q^T(µ, ν) := ( ∫_0^T W_q(µ_t, ν_t)^q dt )^{1/q}
◮ h_t(a, µ, M) := ∫_{R^d×S} h_t(X_t^{ξ,ζ}(µ), P_t^{ξ,ζ}(µ), a, ζ) M(dξ, dζ)

Generalisation Error
◮ Recall the cost function
  J^{σ,M}(ν) := ∫_{R^d×S} [ ∫_0^T ∫ f_t(X_t^{ν,ξ,ζ}, a, ζ) ν_t(da) dt + g(X_T^{ν,ξ,ζ}, ζ) ] M(dξ, dζ) + (σ²/2) ∫_0^T Ent(ν_t) dt.
◮ In practice, one does not have access to the population distribution M, but works with the finite sample M^{N_1} := (1/N_1) Σ_{j=1}^{N_1} δ_{(ξ^j, ζ^j)}
◮ Practitioners use J^{M^{N_1}}(ν), and NOT J^{σ,M^{N_1}}(ν), to set the stopping criteria for learning
◮ The entropy term can be viewed as implicit regularisation

Generalisation Error
Theorem 4. Let σ²κ ≫ 0. There is c > 0, independent of λ, S, N_1, N_2, d and p, such that
  E| J^M(ν^{*,σ}) − J^M(ν^{σ,N_1,N_2,Δs}_{S,·}) | ≤ c ( e^{−λS} + 1/N_1 + 1/N_2 + h ).
The generalisation error is given by
  J^M(ν^{σ,N_1,N_2,Δs}_{S,·}) − J^M(ν^{*,σ}) − (σ²/2) ∫_0^T Ent(ν_t^{*,σ}) dt + min_{µ∈V_2} J^{σ,M}(µ).
◮ N_1 - size of the training data
◮ N_2 - proxy for the number of parameters
◮ γ - learning rate
◮ S/γ - proxy for the training time
◮ By discretising the ODEs we can get estimates on the number of layers

Generalisation Error
Assumption 1. Fix ε > 0 and N_1 > 0. Assume that, for all M^{N_1}, J^{M^{N_1}}(ν^{*,σ,N_1}) ≤ ε.
Theorem 5. There is c > 0, independent of λ, S, N_1, N_2, d and p, such that
  E| J^M(ν^{σ,N_1,N_2,Δs}_{S,·}) | ≤ ε² + c ( e^{−λS} + 1/N_1 + 1/N_2 + h ).

Outlook
We have a full analysis of the convergence of the regularised gradient descent algorithm for deep networks modelled by ODEs.
Key messages:
◮ Training of neural nets should be viewed as a sampling problem rather than an optimisation problem
◮ Wasserstein gradient flow provides a framework to study the convergence of training algorithms
◮ Probabilistic numerical analysis provides quantitative bounds that do not suffer from the curse of dimensionality

References I
[Arribas et al., 2020] Arribas, I. P., Salvi, C., and Szpruch, L. (2020). Sig-SDEs model for quantitative finance. arXiv preprint arXiv:2006.00218.
[Belkin et al., 2018] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118.
[Buehler et al., 2019] Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8):1271-1291.
[Buehler et al., 2020] Buehler, H., Horvath, B., Lyons, T., Perez Arribas, I., and Wood, B. (2020). Generating financial markets with signatures. Available at SSRN.
[Cuchiero et al., 2020] Cuchiero, C., Khosrawi, W., and Teichmann, J. (2020). A generative adversarial network approach to calibration of local stochastic volatility models. arXiv preprint arXiv:2005.02505.
[Cuchiero et al., 2019] Cuchiero, C., Larsson, M., and Teichmann, J. (2019). Deep neural networks, generic universal interpolation, and controlled ODEs. arXiv preprint arXiv:1908.07838.
[Gierjatowicz et al., 2020] Gierjatowicz, P., Sabate-Vidales, M., Siska, D., Szpruch, L., and Zuric, Z. (2020). Robust pricing and hedging via neural SDEs. Available at SSRN 3646241.
[Heiss et al., 2019] Heiss, J., Teichmann, J., and Wutte, H. (2019). How implicit regularization of neural networks affects the learned function - part I. arXiv preprint arXiv:1911.02903.
[Henry-Labordere, 2019] Henry-Labordere, P. (2019). Generative models for financial data. Available at SSRN 3408007.
[Hernandez, 2016] Hernandez, A. (2016). Model calibration with neural networks. Available at SSRN 2812140.
