Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks
Kenta Oono (1,2), Taiji Suzuki (1,3)
{kenta_oono, taiji}@mist.i.u-tokyo.ac.jp
1. The University of Tokyo  2. Preferred Networks, Inc.  3. RIKEN AIP
Thirty-sixth International Conference on Machine Learning (ICML 2019), June 13th 2019, Long Beach, CA, U.S.A.
Poster: June 13th, Pacific Ballroom #77
Key Takeaway
Q. Why do ResNet-type CNNs work well?
A. A hidden sparse structure promotes good performance.
Problem Setting
We consider a non-parametric regression problem: y = f°(x) + ξ
f°: true function (e.g., Hölder, Barron, or Besov class), ξ: Gaussian noise.
Given n i.i.d. samples, we pick an estimator f̂ from the hypothesis class ℱ, the set of functions realized by CNNs with a specified architecture.
Goal: evaluate the estimation error ℛ(f̂) := E_X[ |f̂(X) − f°(X)|² ].
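To make the setting concrete, below is a minimal NumPy sketch of the regression model and a Monte Carlo estimate of ℛ(f̂). The true function (a sine), the noise level, and the polynomial estimator are placeholders chosen for illustration only; in the paper, f̂ is selected from a class ℱ of functions realized by ResNet-type CNNs.

```python
# Minimal sketch of the non-parametric regression setting:
# observe y_i = f°(x_i) + ξ_i, then measure ℛ(f̂) = E_X |f̂(X) − f°(X)|².
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):                      # stand-in for the true function f°
    return np.sin(2 * np.pi * x)

n, sigma = 200, 0.1                 # sample size and noise level (assumed values)
x = rng.uniform(0.0, 1.0, size=n)
y = f_true(x) + sigma * rng.normal(size=n)   # y = f°(x) + ξ, ξ ~ N(0, σ²)

# Placeholder estimator f̂ (a degree-5 polynomial fit); the paper instead
# picks f̂ from functions realized by ResNet-type CNNs.
coef = np.polyfit(x, y, deg=5)
f_hat = np.poly1d(coef)

# Monte Carlo approximation of the estimation error ℛ(f̂).
x_test = rng.uniform(0.0, 1.0, size=10_000)
risk = np.mean((f_hat(x_test) - f_true(x_test)) ** 2)
print(f"estimated risk: {risk:.5f}")
```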
Prior Work
ℛ(f̂) ≾ inf_{f∈ℱ} ∥f − f°∥² + Õ(W_ℱ / n)
(Approximation Error)   (Model Complexity)

CNN type  | Parameter size W_ℱ     | Minimax optimality | Discrete optimization
General   | # of all weights       | Sub-optimal ☹      | -
Sparse*   | # of non-zero weights  | Optimal ☺          | Needed ☹
ResNet    | # of all weights       | Optimal ☺          | Not needed ☺

* e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]
n: sample size. ℱ: set of functions realizable by CNNs with a specified architecture. f°: true function (e.g., Hölder, Barron, Besov). Õ(·): O-notation ignoring logarithmic terms.
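For intuition on how this bound yields minimax optimality, here is the standard balancing argument for the β-Hölder case on [0,1]^d (a sketch following the cited references; the exponents come from those works, not from this deck). With W_ℱ = W parameters, ReLU networks approximate β-Hölder functions with squared error roughly W^(−2β/d), so

ℛ(f̂) ≾ W^(−2β/d) + Õ(W / n).

Choosing W ≍ n^(d/(2β+d)) balances the two terms and gives

ℛ(f̂) ≾ Õ( n^(−2β/(2β+d)) ),

which matches the minimax-optimal rate for the β-Hölder class.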
Contribution
ResNet-type CNNs can achieve minimax-optimal rates without unrealistic constraints.

CNN type  | Parameter size W_ℱ     | Minimax optimality | Discrete optimization
General   | # of all weights       | Sub-optimal ☹      | -
Sparse*   | # of non-zero weights  | Optimal ☺          | Needed ☹
ResNet    | # of all weights       | Optimal ☺          | Not needed ☺

* e.g., Hölder case: [Yarotsky, 17; Schmidt-Hieber, 17; Petersen & Voigtlaender, 18]

Key Observation: Known optimal FNNs have block-sparse structures.
Block-sparse FNN
An M-block block-sparse FNN combines fully-connected blocks FC_σ^{(W_1, b_1)}, …, FC_σ^{(W_M, b_M)} through a linear readout:

f_FNN(x) := Σ_{m=1}^{M} w_m · FC_σ^{(W_m, b_m)}(x) − b

Known best approximating FNNs are block-sparse when the true function is
- Barron [Klusowski & Barron, 18]
- Hölder [Yarotsky, 17; Schmidt-Hieber, 17]
- Besov [Suzuki, 19]
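Below is a minimal NumPy sketch of the block-sparse structure itself; the widths, depths, and toy input are assumed for illustration and this is not the paper's construction. Each block FC_m touches only its own weights, and the blocks interact only through the final linear combination.

```python
# Sketch of a block-sparse FNN:  f_FNN(x) = Σ_m w_m · FC_m(x) − b,
# where each FC_m is an independent fully-connected ReLU block.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def fc_block(x, weights, biases):
    """One fully-connected block: alternating affine maps and ReLU activations."""
    h = x
    for W, c in zip(weights, biases):
        h = relu(W @ h + c)
    return h

def block_sparse_fnn(x, blocks, w, b):
    """Linear combination of block outputs; blocks share no weights."""
    outputs = [fc_block(x, Ws, cs) for (Ws, cs) in blocks]
    return sum(w_m @ out for w_m, out in zip(w, outputs)) - b

# Toy instantiation: M = 3 blocks, input dimension d = 4, width 8, depth 2 (assumed sizes).
rng = np.random.default_rng(0)
d, width, depth, M = 4, 8, 2, 3
blocks = []
for _ in range(M):
    Ws = [rng.normal(size=(width, d))] + [rng.normal(size=(width, width)) for _ in range(depth - 1)]
    cs = [rng.normal(size=width) for _ in range(depth)]
    blocks.append((Ws, cs))
w = [rng.normal(size=width) for _ in range(M)]   # readout weights w_m
b = 0.0

x = rng.normal(size=d)
print(block_sparse_fnn(x, blocks, w, b))
```

In the paper, such block-sparse FNNs are then realized by ResNet-type CNNs (roughly, residual blocks playing the role of the FC blocks), which is what transfers the optimal FNN approximation rates to ResNet-type CNNs without explicit sparsity constraints.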