UAT: From Shallow to Deep
Ju Sun, Computer Science & Engineering, University of Minnesota, Twin Cities
January 30, 2020
Logistics
– LaTeX source of homework posted in Canvas (thanks to Logan Stapleton!)
– Mind your LaTeX! Mind your math!
  * Ten Signs a Claimed Mathematical Breakthrough is Wrong
  * Paper Gestalt (50%/18%, 2009) ⟹ Deep Paper Gestalt (50%/0.4%, 2018)
– Matrix Cookbook? Yes and no
Outline
– Recap and more thoughts
– From shallow to deep NNs
Supervised learning as function approximation
– Underlying true function: f_0
– Training data: y_i ≈ f_0(x_i)
– Choose a family of functions H so that there exists f ∈ H with f close to f_0
– Approximation capacity: the choice of H matters (e.g., linear? quadratic? sinusoids?) (small illustration below)
– Optimization & generalization: how to find the best f ∈ H also matters
We focus on approximation capacity now.
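To make the point that the choice of H matters concrete, here is a minimal NumPy sketch (my own illustration, not from the slides): the target f_0, the noise level, and the two candidate families (affine vs. low-order sinusoids) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true function and noisy training data y_i ~ f0(x_i)
f0 = lambda x: np.sin(2 * np.pi * x)
x = rng.uniform(0, 1, 200)
y = f0(x) + 0.05 * rng.standard_normal(200)

def lstsq_fit(features):
    """Least-squares fit of y on the given feature map of x."""
    Phi = features(x)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return lambda t: features(t) @ w

# Two hypothesis classes H: affine functions vs. low-order sinusoids
f_lin = lstsq_fit(lambda t: np.stack([np.ones_like(t), t], axis=1))
f_sin = lstsq_fit(lambda t: np.stack([np.ones_like(t),
                                      np.sin(2 * np.pi * t),
                                      np.cos(2 * np.pi * t)], axis=1))

grid = np.linspace(0, 1, 1000)
print("affine family max error   :", np.max(np.abs(f_lin(grid) - f0(grid))))
print("sinusoid family max error :", np.max(np.abs(f_sin(grid) - f0(grid))))
```

The sinusoid family drives the error down by orders of magnitude because f_0 essentially lies in it; no amount of fitting can make the affine family do better than a constant-order error.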
Approximation capacities of NNs
– A single neuron has limited capacity
– Deep NNs with linear activations are no better (quick numerical check below)
– Add both depth and nonlinear activation ⟹ the universal approximation theorem: a 2-layer network (linear activation at the output) can approximate arbitrary continuous functions arbitrarily well, provided that the hidden layer is sufficiently wide
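A quick numerical check of the "linear activations are no better" bullet (a minimal sketch; the layer widths are arbitrary): composing any number of affine layers collapses to a single affine map.

```python
import numpy as np

rng = np.random.default_rng(0)

# A three-layer "network" with identity (linear) activations
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((4, 5)), rng.standard_normal(4)
W3, b3 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def deep_linear(x):
    return W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# It collapses to a single affine map W x + b
W = W3 @ W2 @ W1
b = W3 @ (W2 @ b1 + b2) + b3

x = rng.standard_normal(3)
print(np.allclose(deep_linear(x), W @ x + b))  # True
```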
[A] universal approximation theorem (UAT)
Theorem (UAT, [Cybenko, 1989, Hornik, 1991]). Let σ : R → R be a nonconstant, bounded, and continuous function. Let I_m denote the m-dimensional unit hypercube [0, 1]^m, and let C(I_m) denote the space of real-valued continuous functions on I_m. Then, given any ε > 0 and any function f ∈ C(I_m), there exist an integer N, real constants v_i, b_i ∈ R, and real vectors w_i ∈ R^m for i = 1, ..., N, such that we may define
F(x) = Σ_{i=1}^{N} v_i σ(w_i⊺ x + b_i)
as an approximate realization of f; that is, |F(x) − f(x)| < ε for all x ∈ I_m.
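To see the theorem in action for m = 1, here is a minimal constructive sketch (my own illustration, not Cybenko's proof): steep logistic sigmoids placed on a grid build a staircase F(x) = Σ_i v_i σ(w_i x + b_i) that tracks a continuous target f on [0, 1]; the target and the grid size are arbitrary choices.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))  # logistic sigmoid
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x                 # illustrative target in C([0, 1])

N, c = 200, 2000.0            # number of hidden units and sigmoid steepness
t = np.linspace(0, 1, N + 1)  # grid points t_0, ..., t_N

# Unit i (i >= 1) is a steep sigmoid acting like a unit step at t_i and
# contributes the jump f(t_i) - f(t_{i-1}); unit 0 is saturated (w = 0,
# large b) and supplies the constant offset f(t_0).
w = np.concatenate([[0.0], np.full(N, c)])
b = np.concatenate([[50.0], -c * t[1:]])
v = np.concatenate([[f(t[0])], np.diff(f(t))])

F = lambda x: sigma(np.outer(x, w) + b) @ v   # F(x) = sum_i v_i sigma(w_i x + b_i)

xs = np.linspace(0, 1, 5000)
print("max |F - f| on [0, 1]:", np.max(np.abs(F(xs) - f(xs))))
```

Increasing N (and the steepness c) drives the reported maximum error down, matching the "sufficiently wide" proviso.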
Thoughts
– Approximate continuous functions with vector outputs, i.e., I_m → R^n? Think of the component functions.
– Map to [0, 1], {−1, +1}, [0, ∞)? Choose an appropriate activation σ at the output:
  F(x) = σ( Σ_{i=1}^{N} v_i σ(w_i⊺ x + b_i) )
  ... universality holds in modified form.
– Get deeper? Three-layer NN? Change to matrix-vector notation for convenience (sketch below):
  F(x) = w⊺ σ(W_2 σ(W_1 x + b_1) + b_2) = Σ_k w_k g_k(x),
  i.e., the w_k's linearly combine the functions g_k(x), the same structure as in the shallow case.
– For geeks: approximate both f and f′? Check out [Hornik et al., 1990].
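A minimal sketch of the matrix-vector form F(x) = w⊺ σ(W_2 σ(W_1 x + b_1) + b_2) referenced above; the widths are arbitrary and the weights are random, since the point is only the shape of the computation.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))  # elementwise nonlinearity

rng = np.random.default_rng(0)
m, h1, h2 = 3, 8, 6                          # input dim and hidden widths (arbitrary)
W1, b1 = rng.standard_normal((h1, m)), rng.standard_normal(h1)
W2, b2 = rng.standard_normal((h2, h1)), rng.standard_normal(h2)
w = rng.standard_normal(h2)

def F(x):
    g = sigma(W2 @ sigma(W1 @ x + b1) + b2)  # g(x): the h2 functions g_k
    return w @ g                             # sum_k w_k g_k(x)

print(F(rng.standard_normal(m)))
```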
Learn to take square roots
Suppose we lived in a time when the square root was not yet defined ...
– Training data: (x_i, x_i²), where x_i ∈ R
– Forward: if x ↦ y, then −x ↦ y also
– To invert, what should we output? What if we just throw in the training data?
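What actually happens if we "just throw in the training data" (with inputs and outputs swapped, so the network sees x² and must predict x)? A minimal sketch, assuming a random-feature ReLU model fit by least squares as a stand-in for a trained two-layer network: with both (x², x) and (x², −x) present, the squared loss is minimized by the average of the two valid answers, so the predictions collapse toward 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Forward" map is squaring; we try to learn the inverse from data.
x = rng.uniform(-2, 2, 20000)            # contains both signs
inputs, targets = (x ** 2)[:, None], x   # learn x^2 -> x

# Random ReLU features + least squares (illustrative stand-in for training)
W, b = rng.standard_normal((1, 32)), rng.standard_normal(32)
phi = lambda z: np.maximum(z @ W + b, 0.0)
coef, *_ = np.linalg.lstsq(phi(inputs), targets, rcond=None)
predict = lambda z: phi(z) @ coef

test = np.array([[1.0], [4.0], [9.0]])
print(predict(test))   # close to 0, not to +/-1, +/-2, +/-3:
                       # the squared loss averages the two valid answers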
Visual "proof" of UAT
What about ReLU?
– A single ReLU; a difference of ReLUs (see the bump sketch below)
– What happens when the slopes of the ReLUs are changed?
How general can σ be? ... it is enough that σ is not a polynomial [Leshno et al., 1993]
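The "difference of ReLUs" construction, as a minimal sketch (the shifts and slopes are arbitrary): two shifted ReLUs of equal slope make a soft step, and the difference of two soft steps makes a bump; raising the slope sharpens it toward an indicator, which is exactly the knob the "slopes changed" question points at.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# A "soft step" of height 1 rising on [a, a + 1/s]: slope s, then flat
step = lambda x, a, s: s * (relu(x - a) - relu(x - a - 1.0 / s))

# A bump on roughly [0.3, 0.7]: one step up minus one step up placed later.
# Increasing the slope s makes the sides steeper, approaching an indicator.
bump = lambda x, s: step(x, 0.3, s) - step(x, 0.7, s)

x = np.linspace(0, 1, 11)
print(np.round(bump(x, s=50.0), 3))
```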
Outline
– Recap and more thoughts
– From shallow to deep NNs
What's bad about shallow NNs?
From the UAT, "... there exist an integer N, ...", but how large?
What happens in 1D? Assume the target f is 1-Lipschitz, i.e., |f(x) − f(y)| ≤ |x − y| for all x, y ∈ R. For ε accuracy, we need about 1/ε bumps.
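Spelling out the counting behind "need about 1/ε bumps" (a sketch of the standard argument; the slide itself only states the conclusion):

```latex
% Counting argument for the 1D case (sketch).
Partition $[0,1]$ into $N$ intervals $[t_{i-1}, t_i]$ of width $1/N$ and let
$\hat f \equiv f(t_{i-1})$ on each interval (one bump/step per interval,
i.e.\ $O(N)$ hidden units). Since $f$ is $1$-Lipschitz, for $x \in [t_{i-1}, t_i]$,
\[
  |f(x) - \hat f(x)| = |f(x) - f(t_{i-1})| \le |x - t_{i-1}| \le \tfrac{1}{N},
\]
so accuracy $\varepsilon$ forces $N \approx 1/\varepsilon$: the width of the
shallow network scales like $1/\varepsilon$.
```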
What's bad about shallow NNs?
From the UAT, "... there exist an integer N, ...", but how large?
What happens in 2D? Visual proof in 2D first: σ(w⊺x + b), with σ the sigmoid, approaches a 2D step function when w is made large. (Credit: CMU 11-785)
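A minimal numerical check of the "approaches a 2D step function when w is made large" claim; the direction, offset, and thresholds below are arbitrary choices of mine.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))

# A unit direction and offset defining the boundary u.x + b0 = 0 in 2D
u, b0 = np.array([1.0, 1.0]) / np.sqrt(2.0), -0.5
step = lambda X: (X @ u + b0 > 0).astype(float)   # ideal 2D step function

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100000, 2))

for c in [1.0, 10.0, 100.0, 1000.0]:              # scale ||w|| = c up
    vals = sigma(c * (X @ u + b0))
    # Fraction of sample points where the sigmoid is still far from the step
    print(c, np.mean(np.abs(vals - step(X)) > 0.1))
```

The printed fractions shrink roughly like 1/c: only points within distance O(1/c) of the boundary still see a non-step value.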
Visual proof for 2D functions
Keep increasing the number of step functions, distributed evenly ... (Image credit: CMU 11-785)
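One way to check this construction numerically (a sketch, not the slide's exact figure): average 2D step functions whose normal directions are spread evenly over the circle, all thresholded at distance r from the origin; only points inside the disk of radius r satisfy every step, so thresholding the average near 1 carves out an approximate cylinder.

```python
import numpy as np

N, r = 64, 1.0                                   # number of directions, radius (arbitrary)
angles = np.linspace(0, 2 * np.pi, N, endpoint=False)
U = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # evenly spread normals

# Average of N half-plane indicators u_k . x <= r
avg_steps = lambda X: np.mean(X @ U.T <= r, axis=1)

# Only points inside the disk of radius r satisfy all N inequalities,
# so thresholding the average near 1 carves out an approximate cylinder.
cylinder = lambda X: (avg_steps(X) > 1 - 0.5 / N).astype(float)

X = np.array([[0.0, 0.0], [0.9, 0.0], [0.8, 0.8], [1.5, 0.0], [3.0, 3.0]])
print(avg_steps(X))
print(cylinder(X))   # ~1 inside the disk, 0 outside
```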
What's bad about shallow NNs?
From the UAT, "... there exist an integer N, ...", but how large?
What happens in 2D? Assume the target f is 1-Lipschitz, i.e., |f(x) − f(y)| ≤ ‖x − y‖₂ for all x, y ∈ R². (Image credit: CMU 11-785)
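Completing the counting argument by analogy with the 1D case (a sketch; the original deck presumably continues in this direction):

```latex
% Counting argument for the 2D case, by analogy (sketch).
Cover $[0,1]^2$ with $N^2$ squares of side $1/N$ and let $\hat f$ equal $f$ at
one corner of each square (one localized bump/``cylinder'' per square). For
$x$ in a square with chosen corner $c$,
\[
  |f(x) - \hat f(x)| = |f(x) - f(c)| \le \|x - c\|_2 \le \tfrac{\sqrt{2}}{N},
\]
so accuracy $\varepsilon$ forces $N \approx 1/\varepsilon$ per axis, i.e.\ on
the order of $1/\varepsilon^2$ bumps in 2D, and $\approx 1/\varepsilon^{d}$ in
$d$ dimensions: the required width of a shallow network can grow exponentially
with the input dimension.
```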