Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search
Youhei Akimoto (presenter; University of Tsukuba / RIKEN AIP), Shinichi Shirakawa (Yokohama National University), Nozomu Yoshinari (Yokohama National University), Kento Uchida (Yokohama National University), Shota Saito (Yokohama National University), Kouhei Nishida (Shinshu University)
Neural Architecture
Neural network architectures (VGGNet, ResNet, Inception, …), often pre-trained on some datasets, are matched to a task (dataset) by trial and error.
Sometimes...
• a known architecture works well on our task. Happy!
Other times...
• we need to find a good one: design a brand-new architecture and train it.
<latexit sha1_base64="HZyaidN5WsSwoTEAv8/MtN9ob3w=">ACtHicbZHfb9MwEMedjB+jbNDBIy8W1aCUJUMBLwgTfDC45DoNqkuleNcWjM7juzLusrKH8oD/wtOG01bx0mWPvqe7873dVYp6TBJ/kTxzoOHjx7vPuk93dt/9rx/8OLMmdoKGAujL3IuAMlSxijRAUXlQWuMwXn2eW3Nn9+BdZJU/7EVQVTzelLKTgGKRZf3nENL+eZpL5qGFsOWls2vt8ON9OYd7YAy1jtiCNfoXZ39BoEUTcPodgH9Qhm385u2y1tb5rN+oNklKyD3oe0gwHp4nR2EB2y3IhaQ4lCcecmaVLh1HOLUihoeqx2UHFxyecwCVhyDW7q1w419DAoOS2MDadEulZvV3iunVvpLNzUHBduO9eK/8tNaiw+T70sqxqhFJtBRa2CM7S1m+bSBqPUKgAXVoa3UrHglgsMn3JnSqbDpU1VzIHYbTmZe47HxvPTAWo7HtWkuJCyW1ROe7fJhLN9gLxqbNt6Hs+NR+n50/OPD4ORrZ/EueUVekyFJySdyQr6TUzImgvyNdqK9aD/+GLNYxLC5GkdzUtyJ+LyH1NQ1xc=</latexit> <latexit sha1_base64="vyiD93EaWMADRqvRUYSVe0cJzgA=">ACH3icbVC7TsMwFHXKq5RXgZHFoqoEqSMsCECgwFok+pCaqHNdprdpJZDtAFeUT+Ah+gIUVZjbESEc2PgP3MdCWI13do3Pu1bWPGzIqlWkOjNTC4tLySno1s7a+sbmV3d6pySASmFRxwALRcJEkjPqkqhipBEKgrjLSN3tXQ79+h0Rkgb+reqHxOGo41OPYqS01Moe2Bw9tGLb5fF9cgSHScJ9ArTymErmzOL5ghwnlgTkiufXT3nz79+Kq3st90OcMSJrzBDUjYtM1ROjISimJEkY0eShAj3UIc0NfURJ9KJRx9KYF4rbegFQpev4Ej9uxEjLmWfu3qSI9WVs95Q/M9rRso7dWLqh5EiPh4f8iIGVQCH6cA2FQr1tcEYUH1WyHuIoGw0hlOXF5ojOxZhOYJ7VS0Toulm50OBdgjDTYA/ugACxwAsrgGlRAFWDwCF7AK3gznox348P4HI+mjMnOLpiCMfgFCE6nCA=</latexit> … … One-Shot Neural Architecture Search Joint Optimization of Architecture c and Weights w Conv 3 x 3 0 NAS as hyper-parameter search W 1 1 c evaluation = 1 training Conv 5 x 5 max f ( w ∗ ( c ) , c ) 0 W 2 + c max x t x t+1 subject to w ∗ ( c ) = argmax f ( w , c ) 1 pooling w avg 0 One-shot NAS pooling optimization of x and c within 1 training max w , c f ( w , c ) w : (W1, W2) c : (0, 0, 1, 0) � 3
<latexit sha1_base64="HVsL6MfL3cRx/YNlI3wm1Sd7FAs=">ADBXichVFNb9NAEF2brxI+msKRy4qoqBUosgsSHCu4cCxS01bKmi8WSer7nqt3XFDZPnMr+GuHLlL/TfdGMbaBMkRlrpvTczejszaGkwyi6DMJbt+/cvbd1v/fg4aPH2/2dJyfOlJaLETfK2LMUnFAyFyOUqMRZYQXoVInT9PzDKn96IayTJj/GZSESDbNcZpIDemnS/8VSXS1q+oIpkSFYaxa0k15SJgonlS+rWqlmOaQK/tCM7rXoVdPD64YynAuEen+fsd41vmHxW1/36fS/Zi3/j9mkP4iGURN0E8QdGJAujiY7wS6bGl5qkSNX4Nw4jgpMKrAouRJ1j5VOFMDPYSbGHuaghUuqZuU13fXKlGbG+pcjbdTrHRVo5Y69ZUacO7WcyvxX7lxidm7pJ5UaLIeWuUlYqioav70am0gqNaegDcSv9XyudgaO/8g2XVPsZCmsu5FRwozXk04qBnWn4UlfMFMICGrsayFxrqSW6Kou731pC3t+sfH6GjfBycEwfj08+PRmcPi+W/EWeUaekz0Sk7fkHwkR2REeDAMjoMk+Bx+Db+F38MfbWkYdD1PyY0If14BKe/6VA=</latexit> Difficulties for Practitioners How to choose / tune the search strategy? Search Space Search Strategy Gradient-Based Method w w + ✏ w r w f ( w , c ( θ )) θ θ + ✏ θ r θ f ( w , c ( θ )) hyper-parameter: step-size Other Choices • Evolutionary Computation Based • Reinforcement Learning Based - how to treat integer variables such as #filters? - how to tune the hyper-parameters in such situations? � 4
<latexit sha1_base64="y6/6GQgAGnSQUJ1h9XWy4NyciY=">ACyXicjVHLThsxFHWGtlBoS4AlGwtUCUQVzdAFLFhEdIPKhkoNIGVC5PHcEAs/pvYdQmrNil/od/RrumHRfkudCVULZNErWTo651zfV1ZI4TCO7xrR3LPnL+YXi4uvXr9Zrm5snrqTGk5dLiRxp5nzIEUGjoUMJ5YGpTMJZdvVhop9dg3XC6M84LqCn2KUWA8EZBqrfPq4lY5GFx6rdzRFxAnapgf0D72T/IcQ8Ha/uRm34jroU5Dcg832Rrz7a49PumvNESaG14q0Mglc6bxAX2PLMouIRqMS0dFIxfsUvoBqiZAtfz9cgVfRuYnA6MDU8jrdl/MzxTzo1VFpyK4dA91ibkLK1b4mC/54UuSgTNp4UGpaRo6GR/NBcWOMpxAIxbEXqlfMgs4xi2/KBKpsIMhTXIgdulGI692F3lfdp3a7PZHBWab8qKpmeDn/67WQ104+0xnOMOvbFIeALGSE4ySPT/EUnO62kvet3U/hSodkGgtknWyQLZKQPdImR+SEdAgn38kP8pP8io6jL9FN9HVqjRr3OWvkQUS3vwEFZe4</latexit> <latexit sha1_base64="HLe5ikCOIzFH1404FMTVwWAgKc=">ADSnicbVFdb9MwFHWyAaN8dfDIi0UFKhqUZCDBC9IEkI8DYluk+q2chxntWo7kX2zqrLyA5H4A/wN3hAvOGkY69orWbo+95x7XuSQgoLUfQzCHd2b9y8tXe7c+fuvfsPuvsPT2xeGsaHLJe5OUuo5VJoPgQBkp8VhlOVSH6azD/W9dMLbqzI9TdYFnys6LkWmWAUPDTtfieLxcTBQVzhZ+/x6lLhA0x4YX0DOexagK+JFI+o+CIpomkKx/6beSF5gAQJ09rwjptJd/bdvSel8PNo0VhVmSuU9V/7LFxL30ys2RteRypm9+deq024sGURN4M4nbpIfaOJ7uB4KkOSsV18AktXYURwWMHTUgmORVh5SWF5TN6Tkf+VRTxe3YNTuv8FOPpDjLjT8acINeVTiqrF2qxDPrD9rtRrcVhuVkL0bO6GLErhmq0FZKTHkuDYQp8JwBnLpE8qM8G/FbEYNZeBtXpuSKP+HwuQXfo0sV4rqtHNOdI81/mlsnlFEuUWVbWFy9h/ruFpw2RbmbUxW9oSmHGgXuHNia9bsZmcHA7i14PDr296Rx9am/bQY/QE9VGM3qIj9BkdoyFiwatgGEyCafgj/BX+Dv+sqGHQah6htdjZ/Qt0fBdr</latexit> <latexit sha1_base64="ei1X4bv0b/x2an2mpDxfz7crXD4=">AC93icbVLdbtMwFHbC3xb+OrjkxmJC6iRUJZsQ0yS0CW4QVwPRbVJdVY7jtFbtOLJPVqrgh+AJuEPc8jwDrwDTlqdeuRLH8+5zvy8fc5LaWwEMe/g/DW7Tt3721tR/cfPHz0uLPz5MzqyjDeZ1pqc5FSy6UoeB8ESH5RGk5VKvl5On3X1M8vubFCF59hXvKhouNC5IJR8KlRB4iX0Y1mc1eYsKYw3n3P97DEfkxhOgxuiZP1xhAoDH7qrwx4+eoOJKGCtv+z6jSiRLTkZSVXNXDTq7Ma9uA18EyRLsHty3Ht1/PXvt9PRTjAlmWaV4gUwSa0dJHEJw5oaExyF5HK8pKyKR3zgYcFVdwO61Yeh1/4TIZzbfzyE7bZqx01VdbOVeqZisLEXq81yU21QX54bAWRVkBL9jiorySGDRutMaZMJyBnHtAmRF+Vswm1FAG3pG1W1Ll31AafSkyzrRStMgapV1dk3bcOpWe6Rr9Zs5t4HrVlzDM7dQehOzsW5F9b/BS0pbOoEJ9A5b09y3Yyb4Gy/lxz09j96n96iRWyhZ+g56qIEvUYn6D06RX3E0J8ABdtBFM7D7+GP8OeCGgbLnqdoLcJf/wAWwfXD</latexit> Contributions Novel Search Strategy for One-shot NAS 1. arbitrary search space (categorical + ordinal) 2. robust against its inputs (hyper-param. and search space) Our approach 1. Stochastic Relaxation exponential family Z max w , c f ( w , c ) ⇒ max w , θ J ( w , θ ) := f ( w , c ) p ( c | θ ) d c differentiable w.r.t. w and θ 2. Stochastic Natural Gradient + Adaptive Step-Size w t +1 = w t + ✏ t \ r w J ( w t , θ t ) w θ t +1 = θ t + ✏ t \ θ F ( θ t ) − 1 r θ J ( w t +1 , θ t ) Natural Gradient Under appropriate step-size J ( w t , θ t ) < J ( w t +1 , θ t ) < J ( w t +1 , θ t +1 ) Monotone Improvement � 5
Results and Details
• Faster than, and competitive in accuracy with, other one-shot NAS methods.
The details will be explained at Poster #53.