Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search
Youhei Akimoto (presenter; University of Tsukuba / RIKEN AIP), Shinichi Shirakawa (Yokohama National University), Nozomu Yoshinari (Yokohama National University), Kento Uchida (Yokohama National University), Shota Saito (Yokohama National University), Kouhei Nishida (Shinshu University)
Neural Architecture
Neural network architectures (VGGNet, ResNet, Inception, …), often pre-trained on some datasets, are matched to a task (dataset) by trial and error.
Sometimes...
• a known architecture works well on our task. Happy!
Other times...
• we need to find a good one: design a brand-new architecture and train it.
<latexit sha1_base64="HZyaidN5WsSwoTEAv8/MtN9ob3w=">ACtHicbZHfb9MwEMedjB+jbNDBIy8W1aCUJUMBLwgTfDC45DoNqkuleNcWjM7juzLusrKH8oD/wtOG01bx0mWPvqe7873dVYp6TBJ/kTxzoOHjx7vPuk93dt/9rx/8OLMmdoKGAujL3IuAMlSxijRAUXlQWuMwXn2eW3Nn9+BdZJU/7EVQVTzelLKTgGKRZf3nENL+eZpL5qGFsOWls2vt8ON9OYd7YAy1jtiCNfoXZ39BoEUTcPodgH9Qhm385u2y1tb5rN+oNklKyD3oe0gwHp4nR2EB2y3IhaQ4lCcecmaVLh1HOLUihoeqx2UHFxyecwCVhyDW7q1w419DAoOS2MDadEulZvV3iunVvpLNzUHBduO9eK/8tNaiw+T70sqxqhFJtBRa2CM7S1m+bSBqPUKgAXVoa3UrHglgsMn3JnSqbDpU1VzIHYbTmZe47HxvPTAWo7HtWkuJCyW1ROe7fJhLN9gLxqbNt6Hs+NR+n50/OPD4ORrZ/EueUVekyFJySdyQr6TUzImgvyNdqK9aD/+GLNYxLC5GkdzUtyJ+LyH1NQ1xc=</latexit> <latexit sha1_base64="vyiD93EaWMADRqvRUYSVe0cJzgA=">ACH3icbVC7TsMwFHXKq5RXgZHFoqoEqSMsCECgwFok+pCaqHNdprdpJZDtAFeUT+Ah+gIUVZjbESEc2PgP3MdCWI13do3Pu1bWPGzIqlWkOjNTC4tLySno1s7a+sbmV3d6pySASmFRxwALRcJEkjPqkqhipBEKgrjLSN3tXQ79+h0Rkgb+reqHxOGo41OPYqS01Moe2Bw9tGLb5fF9cgSHScJ9ArTymErmzOL5ghwnlgTkiufXT3nz79+Kq3st90OcMSJrzBDUjYtM1ROjISimJEkY0eShAj3UIc0NfURJ9KJRx9KYF4rbegFQpev4Ej9uxEjLmWfu3qSI9WVs95Q/M9rRso7dWLqh5EiPh4f8iIGVQCH6cA2FQr1tcEYUH1WyHuIoGw0hlOXF5ojOxZhOYJ7VS0Toulm50OBdgjDTYA/ugACxwAsrgGlRAFWDwCF7AK3gznox348P4HI+mjMnOLpiCMfgFCE6nCA=</latexit> … … One-Shot Neural Architecture Search Joint Optimization of Architecture c and Weights w Conv 3 x 3 0 NAS as hyper-parameter search W 1 1 c evaluation = 1 training Conv 5 x 5 max f ( w ∗ ( c ) , c ) 0 W 2 + c max x t x t+1 subject to w ∗ ( c ) = argmax f ( w , c ) 1 pooling w avg 0 One-shot NAS pooling optimization of x and c within 1 training max w , c f ( w , c ) w : (W1, W2) c : (0, 0, 1, 0) � 3
<latexit sha1_base64="HVsL6MfL3cRx/YNlI3wm1Sd7FAs=">ADBXichVFNb9NAEF2brxI+msKRy4qoqBUosgsSHCu4cCxS01bKmi8WSer7nqt3XFDZPnMr+GuHLlL/TfdGMbaBMkRlrpvTczejszaGkwyi6DMJbt+/cvbd1v/fg4aPH2/2dJyfOlJaLETfK2LMUnFAyFyOUqMRZYQXoVInT9PzDKn96IayTJj/GZSESDbNcZpIDemnS/8VSXS1q+oIpkSFYaxa0k15SJgonlS+rWqlmOaQK/tCM7rXoVdPD64YynAuEen+fsd41vmHxW1/36fS/Zi3/j9mkP4iGURN0E8QdGJAujiY7wS6bGl5qkSNX4Nw4jgpMKrAouRJ1j5VOFMDPYSbGHuaghUuqZuU13fXKlGbG+pcjbdTrHRVo5Y69ZUacO7WcyvxX7lxidm7pJ5UaLIeWuUlYqioav70am0gqNaegDcSv9XyudgaO/8g2XVPsZCmsu5FRwozXk04qBnWn4UlfMFMICGrsayFxrqSW6Kou731pC3t+sfH6GjfBycEwfj08+PRmcPi+W/EWeUaekz0Sk7fkHwkR2REeDAMjoMk+Bx+Db+F38MfbWkYdD1PyY0If14BKe/6VA=</latexit> Difficulties for Practitioners How to choose / tune the search strategy? Search Space Search Strategy Gradient-Based Method w w + ✏ w r w f ( w , c ( θ )) θ θ + ✏ θ r θ f ( w , c ( θ )) hyper-parameter: step-size Other Choices • Evolutionary Computation Based • Reinforcement Learning Based - how to treat integer variables such as #filters? - how to tune the hyper-parameters in such situations? � 4
<latexit sha1_base64="y6/6GQgAGnSQUJ1h9XWy4NyciY=">ACyXicjVHLThsxFHWGtlBoS4AlGwtUCUQVzdAFLFhEdIPKhkoNIGVC5PHcEAs/pvYdQmrNil/od/RrumHRfkudCVULZNErWTo651zfV1ZI4TCO7xrR3LPnL+YXi4uvXr9Zrm5snrqTGk5dLiRxp5nzIEUGjoUMJ5YGpTMJZdvVhop9dg3XC6M84LqCn2KUWA8EZBqrfPq4lY5GFx6rdzRFxAnapgf0D72T/IcQ8Ha/uRm34jroU5Dcg832Rrz7a49PumvNESaG14q0Mglc6bxAX2PLMouIRqMS0dFIxfsUvoBqiZAtfz9cgVfRuYnA6MDU8jrdl/MzxTzo1VFpyK4dA91ibkLK1b4mC/54UuSgTNp4UGpaRo6GR/NBcWOMpxAIxbEXqlfMgs4xi2/KBKpsIMhTXIgdulGI692F3lfdp3a7PZHBWab8qKpmeDn/67WQ104+0xnOMOvbFIeALGSE4ySPT/EUnO62kvet3U/hSodkGgtknWyQLZKQPdImR+SEdAgn38kP8pP8io6jL9FN9HVqjRr3OWvkQUS3vwEFZe4</latexit> <latexit sha1_base64="HLe5ikCOIzFH1404FMTVwWAgKc=">ADSnicbVFdb9MwFHWyAaN8dfDIi0UFKhqUZCDBC9IEkI8DYluk+q2chxntWo7kX2zqrLyA5H4A/wN3hAvOGkY69orWbo+95x7XuSQgoLUfQzCHd2b9y8tXe7c+fuvfsPuvsPT2xeGsaHLJe5OUuo5VJoPgQBkp8VhlOVSH6azD/W9dMLbqzI9TdYFnys6LkWmWAUPDTtfieLxcTBQVzhZ+/x6lLhA0x4YX0DOexagK+JFI+o+CIpomkKx/6beSF5gAQJ09rwjptJd/bdvSel8PNo0VhVmSuU9V/7LFxL30ys2RteRypm9+deq024sGURN4M4nbpIfaOJ7uB4KkOSsV18AktXYURwWMHTUgmORVh5SWF5TN6Tkf+VRTxe3YNTuv8FOPpDjLjT8acINeVTiqrF2qxDPrD9rtRrcVhuVkL0bO6GLErhmq0FZKTHkuDYQp8JwBnLpE8qM8G/FbEYNZeBtXpuSKP+HwuQXfo0sV4rqtHNOdI81/mlsnlFEuUWVbWFy9h/ruFpw2RbmbUxW9oSmHGgXuHNia9bsZmcHA7i14PDr296Rx9am/bQY/QE9VGM3qIj9BkdoyFiwatgGEyCafgj/BX+Dv+sqGHQah6htdjZ/Qt0fBdr</latexit> <latexit sha1_base64="ei1X4bv0b/x2an2mpDxfz7crXD4=">AC93icbVLdbtMwFHbC3xb+OrjkxmJC6iRUJZsQ0yS0CW4QVwPRbVJdVY7jtFbtOLJPVqrgh+AJuEPc8jwDrwDTlqdeuRLH8+5zvy8fc5LaWwEMe/g/DW7Tt3721tR/cfPHz0uLPz5MzqyjDeZ1pqc5FSy6UoeB8ESH5RGk5VKvl5On3X1M8vubFCF59hXvKhouNC5IJR8KlRB4iX0Y1mc1eYsKYw3n3P97DEfkxhOgxuiZP1xhAoDH7qrwx4+eoOJKGCtv+z6jSiRLTkZSVXNXDTq7Ma9uA18EyRLsHty3Ht1/PXvt9PRTjAlmWaV4gUwSa0dJHEJw5oaExyF5HK8pKyKR3zgYcFVdwO61Yeh1/4TIZzbfzyE7bZqx01VdbOVeqZisLEXq81yU21QX54bAWRVkBL9jiorySGDRutMaZMJyBnHtAmRF+Vswm1FAG3pG1W1Ll31AafSkyzrRStMgapV1dk3bcOpWe6Rr9Zs5t4HrVlzDM7dQehOzsW5F9b/BS0pbOoEJ9A5b09y3Yyb4Gy/lxz09j96n96iRWyhZ+g56qIEvUYn6D06RX3E0J8ABdtBFM7D7+GP8OeCGgbLnqdoLcJf/wAWwfXD</latexit> Contributions Novel Search Strategy for One-shot NAS 1. arbitrary search space (categorical + ordinal) 2. robust against its inputs (hyper-param. and search space) Our approach 1. Stochastic Relaxation exponential family Z max w , c f ( w , c ) ⇒ max w , θ J ( w , θ ) := f ( w , c ) p ( c | θ ) d c differentiable w.r.t. w and θ 2. Stochastic Natural Gradient + Adaptive Step-Size w t +1 = w t + ✏ t \ r w J ( w t , θ t ) w θ t +1 = θ t + ✏ t \ θ F ( θ t ) − 1 r θ J ( w t +1 , θ t ) Natural Gradient Under appropriate step-size J ( w t , θ t ) < J ( w t +1 , θ t ) < J ( w t +1 , θ t +1 ) Monotone Improvement � 5
Results and Details
• Faster than, and competitive in accuracy with, other one-shot NAS methods.
The details will be explained at Poster #53.