Data Generating Distribution
Suppose that there exists a probability distribution on $\mathbb{R}^{784}$ that randomly generates handwritten digits.
→ Variational Autoencoder Demo
A New Look
Suppose that our training data consists of samples drawn according to a given data distribution $(X, Y)$.
If we knew the data distribution $(X, Y)$, the best functional relation between $X$ and $Y$ would simply be $\mathbb{E}[Y \mid X = x]$!
But we only have samples and do not know the distribution $(X, Y)$.
A mathematical learning problem seeks to infer the regression function $\mathbb{E}[Y \mid X = x]$ from random samples $(x_i, y_i)_{i=1}^m$ of $(X, Y)$.
Mathematical Formulation
Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $X: \Omega \to \mathbb{R}^d$ and $Y: \Omega \to \mathbb{R}^n$ be random vectors. Find the best functional relationship $\hat{U}: \mathbb{R}^d \to \mathbb{R}^n$ between these vectors in the sense that
$$\hat{U} = \operatorname*{argmin}_{U: \mathbb{R}^d \to \mathbb{R}^n} \int_\Omega |U(X(\omega)) - Y(\omega)|^2 \, dP(\omega) = \operatorname*{argmin}_{U: \mathbb{R}^d \to \mathbb{R}^n} \mathbb{E}\big[|U(X) - Y|^2\big].$$
We have $\hat{U}(x) = \mathbb{E}[Y \mid X = x]$. $\hat{U}$ is called the regression function.
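This identity is the $L^2$ projection property of the conditional expectation; a short derivation sketch (standard, not specific to these slides): for any candidate $U$,
$$\mathbb{E}\big[|U(X) - Y|^2\big] = \mathbb{E}\big[|U(X) - \mathbb{E}[Y \mid X]|^2\big] + \mathbb{E}\big[|\mathbb{E}[Y \mid X] - Y|^2\big],$$
since the cross term vanishes by the tower property, $\mathbb{E}\big[(U(X) - \mathbb{E}[Y \mid X]) \cdot (\mathbb{E}[Y \mid X] - Y)\big] = \mathbb{E}\big[(U(X) - \mathbb{E}[Y \mid X]) \cdot \mathbb{E}[\mathbb{E}[Y \mid X] - Y \mid X]\big] = 0$. The second term does not depend on $U$, so the risk is minimized by $U(x) = \mathbb{E}[Y \mid X = x]$.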
Statistical Learning Theory
Let $z = \big((x_1, y_1), \ldots, (x_m, y_m)\big)$ be $m$ realizations of samples independently drawn according to $(X, Y)$. For a function $U: \mathbb{R}^d \to \mathbb{R}^k$ define the empirical risk of $U$ by
$$\mathcal{E}_z(U) = \frac{1}{m} \sum_{i=1}^m |U(x_i) - y_i|^2.$$
Empirical Risk Minimization (ERM) picks a hypothesis class $\mathcal{H} \subset C(\mathbb{R}^d, \mathbb{R}^k)$ and computes the empirical regression function
$$\hat{U}_{\mathcal{H}, z} \in \operatorname*{argmin}_{U \in \mathcal{H}} \mathcal{E}_z(U).$$
Example: $\mathcal{H} = \{\text{polynomials of degree} \le p\}$.
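A minimal numerical illustration of ERM over this polynomial hypothesis class (the target function, noise level, and sample size below are arbitrary toy choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Data distribution (X, Y): X uniform on [-1, 1], Y = f(X) + noise
f = lambda x: np.sin(2 * np.pi * x)          # "unknown" regression function (toy choice)
m = 20                                        # number of training samples
x_train = rng.uniform(-1, 1, m)
y_train = f(x_train) + 0.2 * rng.normal(size=m)

x_test = rng.uniform(-1, 1, 2000)             # fresh samples to approximate the true risk
y_test = f(x_test) + 0.2 * rng.normal(size=2000)

for p in [1, 3, 5, 10, 15]:
    coeffs = np.polyfit(x_train, y_train, deg=p)   # ERM over polynomials of degree <= p
    emp_risk = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_risk = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {p:2d}: empirical risk {emp_risk:.4f}, test risk {test_risk:.4f}")

Low degrees underfit (both risks large); high degrees drive the empirical risk toward zero while the test risk grows again, which is exactly the trade-off discussed next.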
Degree too low: underfitting. Degree too high: overfitting!
Figure: Error with Polynomial Degree
Bias-Variance Problem
The "capacity" of the hypothesis space has to be adapted to the complexity of the target function and the sample size!
Bias-Variance Decomposition
Let $(X, Y)$ be the data generating random variables and $\hat{U}$ the regression function. Let $z = (x_i, y_i)_{i=1}^m$ be i.i.d. samples, $\mathcal{H}$ a hypothesis class and $\hat{U}_{\mathcal{H}, z}$ the empirical regression function. We seek to understand the error
$$\epsilon := \mathcal{E}(\hat{U}_{\mathcal{H}, z}) - \mathcal{E}(\hat{U}) = \mathbb{E}\big[|\hat{U}_{\mathcal{H}, z}(X) - \hat{U}(X)|^2\big],$$
where $\mathcal{E}(U) := \mathbb{E}\big[|U(X) - Y|^2\big]$ denotes the risk.

Bias-Variance Decomposition: Let $U_{\mathcal{H}} := \operatorname*{argmin}_{U \in \mathcal{H}} \mathbb{E}\big[|U(X) - \hat{U}(X)|^2\big]$, and call $\epsilon_{\mathrm{approx}} := \mathbb{E}\big[|U_{\mathcal{H}}(X) - \hat{U}(X)|^2\big]$ the approximation error and $\epsilon_{\mathrm{generalize}} := \mathcal{E}(\hat{U}_{\mathcal{H}, z}) - \mathcal{E}(U_{\mathcal{H}})$ the generalization error. Then
$$\epsilon = \epsilon_{\mathrm{approx}} + \epsilon_{\mathrm{generalize}}.$$

Main Theorem [e.g., Cucker-Zhou (2007)]: If $m \gtrsim \ln\big(\mathcal{N}(\mathcal{H}, c \cdot \eta)\big)/\eta^2$ (and very strong conditions hold), then $\epsilon_{\mathrm{generalize}} \le \eta$ w.h.p., where $\mathcal{N}(\mathcal{H}, s)$ is the $s$-covering number of $\mathcal{H}$ w.r.t. $L^\infty$.

Problems for Data Science Applications:
The assumption that the data is i.i.d. is debatable.
Deep learning operates in a different asymptotic regime (where often #DOFs >> #training samples).
Without knowing $P_{(X, Y)}$ it is impossible to control the approximation error.
PDEs as Learning Problems
Explicit Solution of the Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t, x) = \frac{\partial^2 u}{\partial x_1^2}(t, x) + \frac{\partial^2 u}{\partial x_2^2}(t, x) + \frac{\partial^2 u}{\partial x_3^2}(t, x), \qquad u(0, x) = \varphi(x),$$
for $t \in (0, \infty)$, $x \in \mathbb{R}^3$; $d = 4$. Then
$$u(t, x) = \frac{1}{(4\pi t)^{3/2}} \int_{\mathbb{R}^3} \varphi(y) \exp\big(-|x - y|^2/(4t)\big) \, dy.$$
In other words,
$$u(t, x) = \mathbb{E}[\varphi(Z^x_t)], \qquad Z^x_t \sim \mathcal{N}(x, 2t\, I).$$
In other words, for $x \in [a, b]^3$, $X \sim \mathcal{U}[a, b]^3$ and $Y = \varphi(Z^X_t)$ we have
$$u(t, x) = \mathbb{E}[Y \mid X = x].$$

The solution u(t, x) of the PDE can be interpreted as the solution of the learning problem with data distribution $(X, Y)$, where $X \sim \mathcal{U}[a, b]^3$, $Y = \varphi(Z^X_t)$ and $Z^X_t \sim \mathcal{N}(X, 2t\, I)$!

Contrary to conventional ML problems, the data distribution is now explicitly known – we can simulate as much training data as we want!

We will see in a minute that similar properties hold for a much more general class of PDEs!
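A minimal Monte Carlo sketch of this observation (the initial condition phi and the evaluation point below are arbitrary choices, not from the slides): sampling $Z^x_t \sim \mathcal{N}(x, 2t\,I)$ and averaging phi recovers u(t, x).

import numpy as np

rng = np.random.default_rng(1)

phi = lambda y: np.sum(y**2, axis=-1)    # toy initial condition; exact solution is |x|^2 + 6t
t = 0.5
x = np.array([1.0, -2.0, 0.5])

# Sample Z^x_t ~ N(x, 2t I) and average phi to estimate u(t, x) = E[phi(Z^x_t)]
n_samples = 10**6
Z = x + np.sqrt(2 * t) * rng.normal(size=(n_samples, 3))
u_mc = phi(Z).mean()

u_exact = np.sum(x**2) + 6 * t           # closed form for this particular phi
print(f"Monte Carlo: {u_mc:.4f}, exact: {u_exact:.4f}")

The same sampling, with x drawn uniformly from the box, generates the training pairs (x_i, y_i) of the learning formulation above.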
Linear Kolmogorov Equations
Given $\Sigma: \mathbb{R}^d \to \mathbb{R}^{d \times d}$, $\mu: \mathbb{R}^d \to \mathbb{R}^d$ and an initial value $\varphi: \mathbb{R}^d \to \mathbb{R}$, find $u: \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}$ with
$$\frac{\partial u}{\partial t}(t, x) = \tfrac{1}{2} \operatorname{Trace}\!\big(\Sigma(x) \Sigma^T(x) \operatorname{Hess}_x u(t, x)\big) + \mu(x) \cdot \nabla_x u(t, x), \qquad (t, x) \in [0, T] \times \mathbb{R}^d,$$
$$u(0, x) = \varphi(x).$$
Examples include convection-diffusion equations and the Black-Scholes equation.
Standard methods such as sparse grid methods, sparse tensor product methods, spectral methods, finite element methods or finite difference methods are incapable of solving such equations in high dimensions (e.g., d = 100)!
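For orientation, two direct substitutions into this equation (not spelled out on the slides): choosing $\Sigma(x) = \sqrt{2}\, I$ and $\mu(x) = 0$ gives $\tfrac{1}{2}\operatorname{Trace}(2 I \operatorname{Hess}_x u) = \Delta_x u$, i.e. the heat equation from the previous slide. Choosing $\Sigma(x) = \sigma \operatorname{diag}(x_1, \ldots, x_d)$ and $\mu(x) = \mu \cdot x$ gives
$$\frac{\partial u}{\partial t}(t, x) = \frac{\sigma^2}{2} \sum_{i=1}^d x_i^2 \frac{\partial^2 u}{\partial x_i^2}(t, x) + \mu \sum_{i=1}^d x_i \frac{\partial u}{\partial x_i}(t, x),$$
which is the Black-Scholes equation of the next slide up to the time reversal $t \mapsto T - t$ (initial vs. terminal condition).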
Special Case: Pricing of Financial Derivatives
Given a portfolio consisting of d assets with values $(x_i(t))_{i=1}^d$.
European Max Option: At time T, exercise the option and receive
$$G(x) := \max\Big(\max_{i=1,\ldots,d}(x_i - K_i), \, 0\Big).$$
Black-Scholes (1973): in the absence of correlations the portfolio value u(t, x) satisfies
$$\frac{\partial u}{\partial t}(t, x) + \mu \sum_{i=1}^d x_i \frac{\partial u}{\partial x_i}(t, x) + \frac{\sigma^2}{2} \sum_{i=1}^d |x_i|^2 \frac{\partial^2 u}{\partial x_i^2}(t, x) = 0,$$
$$u(T, x) = G(x).$$
Pricing Problem: u(0, x) = ??
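A minimal Monte Carlo sketch of the pricing problem (all parameter values are placeholders, and the PDE as stated carries no discounting term, so the price is taken to be the plain expected payoff): each asset follows an uncorrelated geometric Brownian motion with drift mu and volatility sigma, so the terminal values can be simulated exactly and the payoff G averaged.

import numpy as np

rng = np.random.default_rng(2)

# Placeholder parameters (not from the slides)
d = 100                      # number of assets
T = 1.0
mu, sigma = 0.05, 0.2
x0 = np.full(d, 100.0)       # initial asset values
K = np.full(d, 110.0)        # strikes K_i

def payoff(S):
    """European max option payoff G(x) = max(max_i (x_i - K_i), 0)."""
    return np.maximum(np.max(S - K, axis=-1), 0.0)

# Exact simulation of the uncorrelated geometric Brownian motions at time T
n_samples = 10**5
xi = rng.normal(size=(n_samples, d))
S_T = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * xi)

# Under the PDE as stated (no discounting term), u(0, x0) = E[G(S_T)]
price = payoff(S_T).mean()
stderr = payoff(S_T).std() / np.sqrt(n_samples)
print(f"u(0, x0) ≈ {price:.2f} ± {stderr:.2f}")

This yields the price only at the single point x0; the learning formulation of the next section recovers the price as a function of x.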
Kolmogorov PDEs as Learning Problems
For $x \in \mathbb{R}^d$ and $t \in \mathbb{R}_+$ let
$$Z^x_t := x + \int_0^t \mu(Z^x_s) \, ds + \int_0^t \Sigma(Z^x_s) \, dW_s.$$
Then (Feynman-Kac)
$$u(T, x) = \mathbb{E}[\varphi(Z^x_T)].$$

Lemma (Beck-Becker-G-Jaafari-Jentzen (2018)): Let $X \sim \mathcal{U}[a, b]^d$ and let $Y = \varphi(Z^X_T)$. The solution $\hat{U}$ of the mathematical learning problem with data distribution $(X, Y)$ is given by
$$\hat{U}(x) = u(T, x), \qquad x \in [a, b]^d,$$
where u solves the corresponding Kolmogorov equation.
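When $Z^x_T$ admits no closed-form simulation, the Euler-Maruyama scheme applies; a minimal sketch for a single point x (the drift, diffusion, initial condition and numerical parameters below are placeholders, not from the slides):

import numpy as np

rng = np.random.default_rng(3)

# Placeholder coefficients of the Kolmogorov equation
d = 10
T = 1.0
mu = lambda z: -0.5 * z                       # drift mu(z)
Sigma = lambda z: 0.3 * np.eye(d)             # diffusion Sigma(z), constant in this toy example
phi = lambda z: np.linalg.norm(z, axis=-1)    # initial condition phi

def euler_maruyama(x, n_paths, n_steps):
    """Simulate n_paths realizations of Z^x_T with the Euler-Maruyama scheme."""
    dt = T / n_steps
    Z = np.tile(x, (n_paths, 1))
    for _ in range(n_steps):
        dW = np.sqrt(dt) * rng.normal(size=(n_paths, d))
        # Sigma is constant here; for state-dependent Sigma, evaluate it path by path
        Z = Z + mu(Z) * dt + dW @ Sigma(Z).T
    return Z

x = np.ones(d)
Z_T = euler_maruyama(x, n_paths=10**5, n_steps=100)
print("u(T, x) ≈", phi(Z_T).mean())           # Feynman-Kac: u(T, x) = E[phi(Z^x_T)]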
Solving linear Kolmogorov Equations by means of Neural Network Based Learning
The Vanilla DL Paradigm
Every image is given as a 28 × 28 matrix $x \in \mathbb{R}^{28 \times 28} \cong \mathbb{R}^{784}$.
Every label is given as a 10-dimensional vector $y \in \mathbb{R}^{10}$ describing the 'probability' of each digit.
Given labeled training data $(x_i, y_i)_{i=1}^m \subset \mathbb{R}^{784} \times \mathbb{R}^{10}$.
Fix a network architecture, e.g., the number of layers (for example L = 3) and the numbers of neurons ($N_1 = 30$, $N_2 = 30$).
The learning goal is to find the empirical regression function $f_z \in \mathcal{H}_\sigma(784, 30, 30, 10)$.
Typically solved by stochastic first order optimization methods.
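A minimal sketch of this paradigm in TensorFlow/Keras (the flattened images x_train and one-hot labels y_train are assumed to be given; here they are filled with random placeholders, and the ReLU activation, optimizer and epoch count are arbitrary choices):

import numpy as np
import tensorflow as tf

# Assumed given: x_train of shape (m, 784), y_train of shape (m, 10); placeholders below
m = 1000
x_train = np.random.rand(m, 784).astype("float32")
y_train = np.eye(10, dtype="float32")[np.random.randint(0, 10, m)]

# Hypothesis class H_sigma(784, 30, 30, 10): two hidden layers with 30 neurons each
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Empirical risk = mean squared error, minimized by stochastic gradient descent
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")
model.fit(x_train, y_train, batch_size=32, epochs=5, verbose=0)
print("final empirical risk:", model.evaluate(x_train, y_train, verbose=0))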
Description of Image Content: ImageNet Challenge
Deep Learning Algorithm
1. Generate training data $z = (x_i, y_i)_{i=1}^m \overset{\text{iid}}{\sim} (X, \varphi(Z^X_T))$ by simulating $Z^X_T$ with the Euler-Maruyama scheme.
2. Apply the Deep Learning Paradigm to this training data
...meaning that (i) we pick a network architecture $(N_0 = d, N_1, \ldots, N_L = 1)$ and let $\mathcal{H} = \mathcal{H}_\sigma(N_0, \ldots, N_L)$, and (ii) attempt to approximately compute
$$\hat{U}_{\mathcal{H}, z} = \operatorname*{argmin}_{U \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m (U(x_i) - y_i)^2$$
in TensorFlow (see the sketch below).
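A minimal end-to-end sketch of this algorithm for a Black-Scholes-type example (dimensions, coefficients, network sizes and training schedule are placeholders, not the configuration used in Beck-Becker-G-Jaafari-Jentzen (2018)):

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(4)

# Placeholder problem: uncorrelated Black-Scholes dynamics with a max-option payoff
d, T = 10, 1.0
mu_c, sigma_c = 0.05, 0.2
K = 110.0
a, b = 90.0, 110.0
phi = lambda s: np.maximum(np.max(s - K, axis=-1), 0.0)

# Step 1: training data z = (x_i, y_i), x_i ~ U[a, b]^d, y_i = phi(Z^{x_i}_T), via Euler-Maruyama
m, n_steps = 2 * 10**5, 50
dt = T / n_steps
X = rng.uniform(a, b, size=(m, d)).astype("float32")
Z = X.copy()
for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.normal(size=(m, d))
    Z = Z + mu_c * Z * dt + sigma_c * Z * dW
Y = phi(Z).astype("float32").reshape(-1, 1)

# Step 2: ERM over a network architecture (N_0 = d, N_1 = N_2 = 50, N_L = 1)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(d,)),
    tf.keras.layers.Dense(50, activation="tanh"),
    tf.keras.layers.Dense(50, activation="tanh"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
model.fit(X, Y, batch_size=1024, epochs=10, verbose=0)

# The trained network approximates x -> u(T, x) = E[phi(Z^x_T)] on the whole box [a, b]^d
x_query = np.full((1, d), 100.0, dtype="float32")
print("u(T, x) ≈", model(x_query).numpy()[0, 0])

Unlike the pointwise Monte Carlo estimate earlier, a single training run yields the solution simultaneously for all initial values in [a, b]^d.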