Guided Evolutionary Strategies
Augmenting random search with surrogate gradients


  1. Guided Evolutionary Strategies: augmenting random search with surrogate gradients
     Niru Maheswaranathan // Google Research, Brain Team
     Joint work with: Luke Metz, George Tucker, Dami Choi, Jascha Sohl-Dickstein

  2. Optimizing with surrogate gradients
     Surrogate gradient: a direction that is correlated with the true gradient (but may be biased).
     Example applications:
     • Neural networks with non-differentiable layers (see the sketch below)
     • Meta-learning (where computing an exact meta-gradient is costly)
     • Gradients from surrogate models (synthetic gradients, black-box attacks)
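
A minimal sketch of the first application (not from the talk): a straight-through estimator, a common surrogate gradient for a non-differentiable sign layer. The function names and the clipping threshold are illustrative choices.

    import numpy as np

    def sign_forward(x):
        # Non-differentiable forward pass: hard sign.
        return np.sign(x)

    def sign_surrogate_grad(x, upstream_grad):
        # Straight-through estimator: treat sign(x) as a clipped identity
        # on the backward pass. The result is correlated with, but not
        # equal to, the true gradient (which is zero almost everywhere).
        return upstream_grad * (np.abs(x) <= 1.0)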

  3. Optimizing with surrogate gradients
     Where does this sit among optimization methods?
     • Zeroth-order: only function values, f(x)
     • First-order: gradient information, ∇f(x)
     • Guided ES: in between, steering random search around the current iterate x^(t) with the surrogate gradient
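
For reference, a minimal sketch of the zeroth-order end of this spectrum: a vanilla antithetic ES gradient estimate that uses only function evaluations. Names and defaults (sigma, num_pairs) are illustrative.

    import numpy as np

    def vanilla_es_grad(f, x, sigma=0.1, num_pairs=10):
        # Zeroth-order estimate: isotropic Gaussian perturbations in
        # antithetic pairs, touching only function values f(x +/- eps).
        g = np.zeros_like(x)
        for _ in range(num_pairs):
            eps = sigma * np.random.randn(x.shape[0])
            g += eps * (f(x + eps) - f(x - eps))
        return g / (2 * sigma**2 * num_pairs)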

  4. Guided evolutionary strategies: schematic
     [Figure: contour plot of a 2D loss surface (loss from 0 to 0.25), with the guiding distribution centered on the current iterate]
     Sample perturbations from the guiding distribution: ε_i ~ N(0, σ²Σ)
     Gradient estimate (antithetic pairs):
         g = (β / (2σ²P)) Σ_{i=1}^{P} ε_i [ f(x + ε_i) − f(x − ε_i) ]
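
A minimal sketch of this estimate, assuming the perturbations in eps_samples were already drawn from N(0, σ²Σ) (sampling is sketched after the next slide); beta is the scale factor in the formula above, and the function name is illustrative.

    import numpy as np

    def guided_es_grad(f, x, eps_samples, sigma=0.1, beta=1.0):
        # Antithetic Guided ES estimate from the slide:
        #   g = beta / (2 sigma^2 P) * sum_i eps_i (f(x + eps_i) - f(x - eps_i))
        P = len(eps_samples)
        g = np.zeros_like(x)
        for eps in eps_samples:
            g += eps * (f(x + eps) - f(x - eps))
        return beta * g / (2 * sigma**2 * P)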

  5. Choosing the guiding distribution
     Standard (vanilla) ES: identity covariance (isotropic perturbations; the α = 1 case below)
     Guided ES: identity + low-rank covariance
         Σ = (α/n) I + ((1 − α)/k) U Uᵀ
     Guiding subspace U ∈ ℝ^{n×k}: columns are the surrogate gradients
     α: tradeoff hyperparameter; n: parameter dimension; k: subspace dimension
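
A minimal sketch of sampling from this guiding distribution; here the columns of U are assumed orthonormal. The low-rank structure means a perturbation can be drawn without ever forming the n × n covariance matrix.

    import numpy as np

    def sample_guided_perturbation(U, alpha, sigma=0.1):
        # Draw eps ~ N(0, sigma^2 * Sigma) with
        #   Sigma = (alpha / n) * I + ((1 - alpha) / k) * U @ U.T
        # by adding an isotropic draw to a draw from the subspace
        # spanned by the surrogate gradients (the columns of U).
        n, k = U.shape
        full_space = np.sqrt(alpha / n) * np.random.randn(n)
        subspace = np.sqrt((1 - alpha) / k) * (U @ np.random.randn(k))
        return sigma * (full_space + subspace)

Paired with the estimator sketched earlier: eps_samples = [sample_guided_perturbation(U, alpha, sigma) for _ in range(P)], with the same sigma passed to guided_es_grad.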

  6. Demo: perturbed quadratic
     A quadratic function with a bias added to the gradient.
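
A minimal sketch of a toy problem in this spirit (the dimension, scales, and names are illustrative): a quadratic loss whose surrogate gradient is the true gradient plus a fixed bias vector.

    import numpy as np

    n = 100
    A = np.random.randn(n, n) / np.sqrt(n)
    bias = np.random.randn(n)   # fixed bias added to the gradient

    def f(x):
        # Quadratic loss.
        return 0.5 * np.sum((A @ x) ** 2)

    def surrogate_grad(x):
        # True gradient A.T @ A @ x plus a fixed bias: correlated with
        # the true gradient, but systematically wrong.
        return A.T @ (A @ x) + bias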

  7. Example applications
     • Unrolled optimization: surrogate gradient from one step of BPTT
     • Synthetic gradients: surrogate gradient comes from a learned synthetic gradient model

  8. Summary
     Guided Evolutionary Strategies: an optimization algorithm for settings where you only have access to surrogate gradients.
     Learn more at our poster: Pacific Ballroom #146
     Code: brain-research/guided-evolutionary-strategies
     Twitter: @niru_m

  9. Choosing optimal hyperparameters
     [Figure: optimal hyperparameter (α) for Guided ES]
     Identity + low-rank covariance: Σ = (α/n) I + ((1 − α)/k) U Uᵀ
