Deep Neural Networks and Partial Differential Equations: Approximation Theory and Structural Properties
Philipp Christian Petersen
Joint work with:
◮ Helmut Bölcskei (ETH Zürich)
◮ Philipp Grohs (University of Vienna)
◮ Joost Opschoor (ETH Zürich)
◮ Gitta Kutyniok (TU Berlin)
◮ Mones Raslan (TU Berlin)
◮ Christoph Schwab (ETH Zürich)
◮ Felix Voigtlaender (KU Eichstätt-Ingolstadt)
Today’s Goal
Goal of this talk: Discuss the suitability of neural networks as an ansatz system for the solution of PDEs. Two threads:
Approximation theory:
◮ universal approximation
◮ optimal approximation rates for all classical function spaces
◮ reduced curse of dimension
Structural properties:
◮ non-convex, non-closed ansatz spaces
◮ parametrization not stable
◮ very hard to optimize over
Outline
Neural networks
◮ Introduction to neural networks
◮ Approaches to solve PDEs
Approximation theory of neural networks
◮ Classical results
◮ Optimality
◮ High-dimensional approximation
Structural results
◮ Convexity
◮ Closedness
◮ Stable parametrization
Neural networks
We consider neural networks as a special kind of function:
◮ $d = N_0 \in \mathbb{N}$: input dimension,
◮ $L$: number of layers,
◮ $\varrho : \mathbb{R} \to \mathbb{R}$: activation function,
◮ $T_\ell : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $\ell = 1, \ldots, L$: affine-linear maps.
Then $\Phi_\varrho : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by
$\Phi_\varrho(x) = T_L(\varrho(T_{L-1}(\varrho(\cdots \varrho(T_1(x)))))), \quad x \in \mathbb{R}^d,$
is called a neural network (NN). The sequence $(d, N_1, \ldots, N_L)$ is called the architecture of $\Phi_\varrho$.
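To make the composition above concrete, here is a minimal Python (NumPy) sketch of evaluating such a network; the architecture $(d, N_1, N_2) = (2, 5, 1)$, the random weights, and the choice $\varrho = \tanh$ are illustrative assumptions, not part of the definition.

```python
import numpy as np

def evaluate_nn(affine_maps, activation, x):
    """Evaluate Phi_rho(x) = T_L(rho(T_{L-1}(... rho(T_1(x))))) for given affine maps."""
    *hidden, (A_last, b_last) = affine_maps
    for A, b in hidden:                    # apply T_1, ..., T_{L-1}, each followed by rho
        x = activation(A @ x + b)
    return A_last @ x + b_last             # final affine map T_L, no activation

# Example: architecture (d, N_1, N_2) = (2, 5, 1) with rho = tanh and random weights
rng = np.random.default_rng(0)
maps = [(rng.standard_normal((5, 2)), rng.standard_normal(5)),   # T_1
        (rng.standard_normal((1, 5)), rng.standard_normal(1))]   # T_2
print(evaluate_nn(maps, np.tanh, np.array([0.3, -0.7])))
```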
Why are neural networks interesting? - I
Deep Learning: Deep learning describes a variety of techniques based on data-driven adaptation of the affine-linear maps in a neural network.
Overwhelming success:
◮ Image classification
◮ Text understanding
◮ Game intelligence
[Figure: Ren, He, Girshick, Sun; 2015]
⇒ Hardware design of the future!
Why are neural networks interesting? - II
Expressibility: Neural networks constitute a very powerful architecture.
Theorem (Cybenko; 1989, Hornik; 1991, Pinkus; 1999)
Let $d \in \mathbb{N}$, $K \subset \mathbb{R}^d$ compact, $f : K \to \mathbb{R}$ continuous, $\varrho : \mathbb{R} \to \mathbb{R}$ continuous and not a polynomial. Then for every $\varepsilon > 0$ there exists a two-layer NN $\Phi_\varrho$ with $\|f - \Phi_\varrho\|_\infty \le \varepsilon$.
Efficient expressibility: $\mathbb{R}^M \ni \theta \mapsto (T_1, \ldots, T_L) \mapsto \Phi^\varrho_\theta$ yields a parametrized system of functions. In a sense this parametrization is optimally efficient. (More on this below.)
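The "efficient expressibility" point refers to the map from a flat parameter vector $\theta \in \mathbb{R}^M$ to the affine maps $(T_1, \ldots, T_L)$ and hence to $\Phi^\varrho_\theta$. A minimal sketch of this parametrization, reusing `evaluate_nn` from above; the fixed architecture and the ordering of the entries of $\theta$ are assumptions made for illustration:

```python
import numpy as np

def unflatten(theta, arch):
    """Map a flat parameter vector theta in R^M to the affine maps (T_1, ..., T_L)."""
    maps, pos = [], 0
    for n_in, n_out in zip(arch[:-1], arch[1:]):
        A = theta[pos:pos + n_out * n_in].reshape(n_out, n_in)
        pos += n_out * n_in
        b = theta[pos:pos + n_out]
        pos += n_out
        maps.append((A, b))
    return maps

arch = (2, 5, 1)                                            # (d, N_1, N_2)
M = sum(o * i + o for i, o in zip(arch[:-1], arch[1:]))     # number of parameters
theta = np.random.default_rng(1).standard_normal(M)
T_maps = unflatten(theta, arch)                             # theta -> (T_1, T_2)
# evaluate_nn(T_maps, np.tanh, x) then realizes Phi_theta at a point x
```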
How can we apply NNs to solve PDEs?
PDE problem: For $D \subset \mathbb{R}^d$, $d \in \mathbb{N}$, find $u$ such that
$G(x, u(x), \nabla u(x), \nabla^2 u(x)) = 0 \quad \text{for all } x \in D.$
Approach of [Lagaris, Likas, Fotiadis; 1998]: Let $(x_i)_{i \in I} \subset D$ and find a NN $\Phi^\varrho_\theta$ such that
$G(x_i, \Phi^\varrho_\theta(x_i), \nabla \Phi^\varrho_\theta(x_i), \nabla^2 \Phi^\varrho_\theta(x_i)) = 0 \quad \text{for all } i \in I.$
Standard methods can be used to find the parameters $\theta$.
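Below is a minimal sketch of this collocation idea for the 1D model problem $-u''(x) = f(x)$ on $(0,1)$ with $u(0) = u(1) = 0$, using JAX to differentiate $\Phi^\varrho_\theta$. The specific PDE, the tanh network, the boundary penalty, and plain gradient descent are illustrative assumptions, not the original setup of Lagaris et al.:

```python
import jax
import jax.numpy as jnp

def net(params, x):                                # small tanh network Phi_theta: R -> R
    (W1, b1), (W2, b2) = params
    return (W2 @ jnp.tanh(W1 * x + b1) + b2)[0]

f = lambda x: jnp.pi**2 * jnp.sin(jnp.pi * x)      # exact solution is then u(x) = sin(pi x)

u_x = jax.grad(net, argnums=1)                     # d/dx Phi_theta
u_xx = jax.grad(u_x, argnums=1)                    # d^2/dx^2 Phi_theta

def loss(params, xs):
    # squared PDE residual at the collocation points x_i plus a boundary penalty
    res = jax.vmap(lambda x: -u_xx(params, x) - f(x))(xs)
    bc = net(params, 0.0) ** 2 + net(params, 1.0) ** 2
    return jnp.mean(res**2) + bc

key1, key2 = jax.random.split(jax.random.PRNGKey(0))
params = [(jax.random.normal(key1, (20,)), jnp.zeros(20)),        # hidden layer, width 20
          (jax.random.normal(key2, (1, 20)) / 20, jnp.zeros(1))]  # output layer
xs = jnp.linspace(0.0, 1.0, 64)                                   # collocation points x_i

grad_loss = jax.jit(jax.grad(loss))
for _ in range(2000):                              # "standard methods": plain gradient descent
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grad_loss(params, xs))
```

With this toy setup the trained network typically approximates $u(x) = \sin(\pi x)$ up to optimization error; the point is only that the residual at collocation points becomes an ordinary loss in $\theta$.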
Approaches to solve PDEs - Examples
General framework: Deep Ritz method [E, Yu; 2017]: NNs as trial functions, SGD naturally replaces quadrature (a sketch follows below).
High-dimensional PDEs [Sirignano, Spiliopoulos; 2017]: Let $D \subset \mathbb{R}^d$, $d \ge 100$, and find $u$ such that
$\frac{\partial u}{\partial t}(t, x) + H(u)(t, x) = 0, \quad (t, x) \in [0, T] \times \Omega,$ + BC + IC.
As the number of parameters of the NN increases, the minimizer of the associated energy approaches the true solution. No mesh generation required!
[Berner, Grohs, Hornung, Jentzen, von Wurstemberger; 2017]: Phrasing the problem as empirical risk minimization ⇒ provably no curse of dimension in the approximation problem or the number of samples.
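For the Deep Ritz idea, the essential point is that the energy integral is estimated from random sample points, so each SGD step only needs a fresh batch instead of a quadrature rule. A sketch of such a Monte Carlo energy estimate for the same 1D model problem $-u'' = f$, $u(0) = u(1) = 0$, as above (an illustrative toy setup rather than the examples of E and Yu); `net` and `f` are as in the previous sketch:

```python
import jax
import jax.numpy as jnp

def ritz_energy(params, key, net, f, n_samples=256):
    """Monte Carlo estimate of E(u) = int_0^1 (0.5*u'(x)^2 - f(x)*u(x)) dx + boundary penalty."""
    xs = jax.random.uniform(key, (n_samples,))                       # random sample points
    du = jax.vmap(jax.grad(net, argnums=1), in_axes=(None, 0))(params, xs)
    u = jax.vmap(net, in_axes=(None, 0))(params, xs)
    interior = jnp.mean(0.5 * du**2 - f(xs) * u)                     # sample mean ~ integral
    boundary = net(params, 0.0) ** 2 + net(params, 1.0) ** 2         # soft boundary conditions
    return interior + boundary

# jax.grad(ritz_energy)(params, key, net, f) can drive the same gradient-descent loop as in
# the collocation sketch, drawing a new key (i.e. a new batch of points) in every step.
```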
How can we apply NNs to solve PDEs?
Deep learning and PDEs: Both approaches above are based on two ideas.
◮ Neural networks are highly efficient in representing solutions of PDEs, hence the complexity of the problem can be greatly reduced.
◮ There exist black-box methods from machine learning that solve the optimization problem.
This talk:
◮ We will show exactly how efficient the representations are.
◮ We will raise doubts that the black box can produce reliable results in general.
Approximation theory of neural networks
Complexity of neural networks
Recall: $\Phi_\varrho(x) = T_L(\varrho(T_{L-1}(\varrho(\cdots \varrho(T_1(x)))))), \; x \in \mathbb{R}^d$. Each affine-linear map $T_\ell$ is defined by a matrix $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and a translation $b_\ell \in \mathbb{R}^{N_\ell}$ via $T_\ell(x) = A_\ell x + b_\ell$.
The number of weights $W(\Phi_\varrho)$ and the number of neurons $N(\Phi_\varrho)$ are
$W(\Phi_\varrho) = \sum_{j \le L} \left( \|A_j\|_{\ell^0} + \|b_j\|_{\ell^0} \right) \quad \text{and} \quad N(\Phi_\varrho) = \sum_{j=0}^{L} N_j.$
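In code, $W(\Phi_\varrho)$ just counts the nonzero entries of the $A_j$ and $b_j$, while $N(\Phi_\varrho)$ sums the layer widths including $N_0 = d$; a small sketch (the example matrices are arbitrary):

```python
import numpy as np

def num_weights(affine_maps):
    """W(Phi) = sum of nonzero entries of all A_j and b_j (their l^0 'norms')."""
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in affine_maps)

def num_neurons(affine_maps):
    """N(Phi) = N_0 + N_1 + ... + N_L, read off from the shapes of the A_j."""
    return affine_maps[0][0].shape[1] + sum(A.shape[0] for A, _ in affine_maps)

maps = [(np.array([[1.0, 0.0], [2.0, 3.0]]), np.array([0.0, 1.0])),   # A_1, b_1
        (np.array([[0.0, 4.0]]),             np.array([5.0]))]        # A_2, b_2
print(num_weights(maps), num_neurons(maps))   # 6 nonzero weights, 2 + 2 + 1 = 5 neurons
```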
Power of the architecture — Exemplary results
Given f from some class of functions, how many weights/neurons does an ε-approximating NN need to have? Not so many...
Theorem (Maiorov, Pinkus; 1999)
There exists an activation function $\varrho_{\mathrm{weird}} : \mathbb{R} \to \mathbb{R}$ that
◮ is analytic and strictly increasing,
◮ satisfies $\lim_{x \to -\infty} \varrho_{\mathrm{weird}}(x) = 0$ and $\lim_{x \to \infty} \varrho_{\mathrm{weird}}(x) = 1$,
such that for any $d \in \mathbb{N}$, any $f \in C([0,1]^d)$, and any $\varepsilon > 0$, there is a 3-layer $\varrho_{\mathrm{weird}}$-network $\Phi^{\varrho_{\mathrm{weird}}}_\varepsilon$ with
$\|f - \Phi^{\varrho_{\mathrm{weird}}}_\varepsilon\|_{L^\infty} \le \varepsilon \quad \text{and} \quad N(\Phi^{\varrho_{\mathrm{weird}}}_\varepsilon) = 9d + 3.$
Power of the architecture — Exemplary results
◮ Barron; 1993: Approximation rate for functions with one finite Fourier moment using shallow networks with an activation function $\varrho$ that is sigmoidal of order zero.
◮ Mhaskar; 1993: Let $\varrho$ be a sigmoidal function of order $k \ge 2$. For $f \in C^s([0,1]^d)$, we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim N(\Phi^\varrho_n)^{-s/d}$ and $L(\Phi^\varrho_n) = L(d, s, k)$.
◮ Yarotsky; 2017: For $f \in C^s([0,1]^d)$ and $\varrho(x) = x_+$ (called ReLU), we have $\|f - \Phi^\varrho_n\|_{L^\infty} \lesssim W(\Phi^\varrho_n)^{-s/d}$ and $L(\Phi^\varrho_n) \asymp \log(n)$ (a numerical sketch of the underlying construction follows below).
◮ Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer NNs.
◮ He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019: ReLU NNs reproduce the approximation rates of h-, p- and hp-FEM.
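The ReLU rates of Yarotsky; 2017 rest on the observation that $x^2$ can be approximated on $[0,1]$ by subtracting scaled compositions of a ReLU "hat" function, with uniform error $4^{-(m+1)}$ after $m$ compositions. A small numerical sketch of that construction (the error check at the end is only illustrative):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
hat = lambda x: 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)   # tent map on [0, 1]

def square_approx(x, m):
    """Yarotsky-style approximation of x**2 on [0, 1] using m composed hat functions."""
    out, g = x.copy(), x.copy()
    for s in range(1, m + 1):
        g = hat(g)                    # g = hat composed with itself s times
        out -= g / 4**s
    return out

x = np.linspace(0.0, 1.0, 10_001)
for m in (2, 4, 6, 8):
    print(m, np.max(np.abs(square_approx(x, m) - x**2)))   # error decays like 4**-(m+1)
```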
Lower bounds
Optimal approximation rates: Lower bounds on the required network size only exist under additional assumptions. (Recall the networks based on $\varrho_{\mathrm{weird}}$.) Options:
(A) Place restrictions on the activation function (e.g. only consider the ReLU), thereby excluding pathological examples like $\varrho_{\mathrm{weird}}$. (⇒ VC-dimension bounds)
(B) Place restrictions on the weights. (⇒ information-theoretic bounds, entropy arguments)
(C) Use still other concepts such as continuous N-widths.