Guided Evolutionary Strategies
Augmenting random search with surrogate gradients
Niru Maheswaranathan // Google Research, Brain Team
Joint work with: Luke Metz, George Tucker, Dami Choi, Jascha Sohl-Dickstein
Optimizing with surrogate gradients
Surrogate gradients: directions that are correlated with the true gradient (but may be biased)
Example applications:
• Neural networks with non-differentiable layers
• Meta-learning (where computing an exact meta-gradient is costly)
• Gradients from surrogate models (synthetic gradients, black box attacks)
<latexit sha1_base64="QXzJ+63yzjnBhE/B24j/45Maj2U=">ACF3icbVDLSgNBEJz1bXxFPXoZDYJ6CLs+15uoBy+igjFCNsrspJMmZ1dZnrFsOQjPKof4028evRbvDiJQXwVDBRV3T3dFSZSGHTdN2dgcGh4ZHRsPDcxOTU9k5+duzBxqjmUeCxjfRkyA1IoKFACZeJBhaFEsph6Drl29AGxGrc2wnUI1YQ4m64AytVL69ylZwtXOdL7jF3d1Nf8unlvjbvrdOvaLbwxcpkD5Or/PvQS3maQKuWTGVDw3wWrGNAouoZMLUgMJ4y3WgIqlikVgqlv3Q5dtkqN1mNtn0LaU793ZCwyph2FtjJi2DS/va74n1dJse5XM6GSFEHxz4/qaQY0+7tCY0cJRtSxjXwu5KeZNpxtEmlAsOwd6i4djOPUlAM4z1WhYw3YiE6tjbGsFi0KU2rT/Z/CUX60Vvo+iebRb29vu5jZEFskRWiEd2yB45IqekRDhpkTvyQB6de+fJeXZePksHnH7PkB5/UDTIugxg=</latexit> Optimizing with surrogate gradients Surrogate gradient directions that are correlated with the true gradient (but may be biased) First-Order Guided ES Zeroth-Order gradient information, 𝝰 f(x) only function values, f(x) x ( t ) Surrogate gradient
<latexit sha1_base64="xgziz2qIJBne+iFHmu40YhOTK+Q=">ACNnicdVBNT1NBFJ2H0BRKbpkM1pN0JjmvVIRd0RZsFExWiDpNM1909vHhPl4mZlH0rz0D/BrWIL/xA07wpY9G+a1JVGjJ5nk5Jx79x70lwK5+P4VzR37/6Dh/MLi7WlR4+fLNdXnu45U1iOHW6ksQcpOJRCY8cL/EgtwgqlbifHn2q/P1jtE4Y/cOPcuwpyLQYCg4+SP36S4a5E9JoypxQlCnwhxk+W8Fr+l7LvIFLzu1xtx80M7SdrvaEU2WlOyEa8nMU2a8QNMsNuv37DBoYXCrXnEpzrJnHueyVYL7jEcY0VDnPgR5BhN1ANCl2vnFwzpq+CMqBDY8PTnk7U3ztKUM6NVBoq23d314l/svrFn642SuFzguPmk8/GhaSekOraOhAWORejgIBbkXYlfJDsMB9CLDGtjHcYvFzmPs1Rwve2DclA5spocfhtow9ZxUNad1FQv9P9lrNZL3Z+tZubH2c5bZAVskLskYS8p5skR2ySzqEkxNySs7Jz+gsuoguo6tp6Vw063lG/kB0fQsJGqxU</latexit> <latexit sha1_base64="tNPwrnTJqaT6qo9hEVG3YAf6sQ=">ACu3icdVHLbhMxFPUMj5bwaIAlG0OEVB6NZqatkiwClWDBhEk0laK08j3JmYeh6yPYjI8gfxF/wG38IGz0wiUVSOZN3je67t63PjUnClg+CX59+4ev2zu6dzt179x/sdR8+OlVFJRlMWSEKeR5TBYLnMNVcCzgvJdAsFnAWX76r9bNvIBUv8i96XcI8o2nOE86odqlF94chzSUzmcZzE/aDBq+D/shcLQho9CmY5JIygyJQVNrIkwUTzN6YSKLJ9btqmxh+Di0F6beQqm4cA8YbomARO8nbfj+6okebrSLw624sE14ibYRbfXNDMcHO2q+FhS4bRCG876ENJovub7IsWJVBrpmgSs3CoNRzQ6XmTIDtkEpBSdklTWHmaE4zUHPTuGHxc5dZ4qSQbuUaN9m/TxiaKbXOYleZUb1S/2p18jptVulkODc8LysNOWsfSiqBdYHr+eAl8C0WDtCmeSuV8xW1Dmv3RQ75D24v0j46O79VIKkupAvDaEyzXhu3d9S8pTU1Lm1tQT/n5xG/fCwH30+6p3In61vu+gJeob2UYgG6AR9QBM0Rczb8469N95bf+wz/6sv2lLf23j9GF2BX/0BEFHZeA=</latexit> Guided evolutionary strategies Schematic 0.25 Sample perturbations ✏ ∼ N (0 , Σ ) Guiding distribution Loss Gradient estimate P � X g = ✏ i ( f ( x + ✏ i ) − f ( x − ✏ i )) 2 � 2 P i =1 0
<latexit sha1_base64="/0j5mgekPFxLzknXSTCBL6dDVg4=">ACS3icbZDLbtQwFIadKYV2uHTaLtkYRkgFxCgpvWQWSBVlAQtEU1baTKMTjwnGWtsJ7KdSqMoz9KnYQld9zm6AxZ4LkJAOZKlX/9/ju3zJYXgxvr+ldYurV8+87KavPuvfsP1lrGycmLzXDiOUi12cJGBRcYWS5FXhWaASZCDxNxofT/PQcteG5OraTAvsSMsVTzsA6a9Dqxp94JoG+onGqgVUxiGIEdaVq+o4+X5hbAX1B58nTuhrXNIo+Hw9ab/T7e6EuyF1ItwLg20adPxZ/RZtsqijQetHPMxZKVFZJsCYXuAXtl+BtpwJrJtxabANoYMe04qkGj61WzFmj5xzpCmuXZHWTpz/5yoQBozkYnrlGBH5t9sav4v65U2DfsV0VpUbH5Q2kpqM3plBcdco3MiokTwDR3f6VsBI6KdVSb8Rt0u2h87+79UKAGm+tnDqLOJFe12y2LH8VT6WjdYHNTnGx3gpcd/+NO+D1gtsKeUgeky0SkH1yQN6SIxIRi7IF/KNXHpfvWvu/dz3trwFjOb5K9qLP8ChAeyzw=</latexit> Guided evolutionary strategies Schematic Choosing the guiding distribution 0.25 Standard (vanilla) ES Guiding distribution Identity covariance Loss nI + (1 − α ) Σ = α UU T k 0 𝛽 : hyperparameter n: parameter dimension
<latexit sha1_base64="OfV4IOQRAbPKw5Qc3Ug12K2VJ3E=">ACL3icbVBNaxNBGJ6tsZo26hHEUaDUHoIuzHq5hbaHnopRjEfkE3D7ORNMmR2dpl5VwjLnvw1Htv+mNJL8drf4MXZJIg1PjDwzPN+P2EihUHXvXG2Hjzc3nlUelx+8nR3b7/y7HnXxKnm0OGxjHU/ZAakUNBgRL6iQYWhRJ64fy4iPe+gTYiVl9xkcAwYlMlJoIztNKo8qpDA6FoEDGchWH2JT/P7A9FBIbO81Gl6tazYb/3qeW+B98r069mrvEH1Ila7RHlV/BOZpBAq5ZMYMPDfBYcY0Ci4hLwepgYTxOZvCwFLF7Jxhtjwjp2+tMqaTWNunkC7VvysyFhmziEKbWaxr/o0V4v9igxQn/jATKkRF8NmqSYkwLT+hYaOAoF5YwroXdlfIZ04yjda4cnIC9RcOZ7fspAc0w1odZwPQ0Eiq3t02D10FBrVsb3mySbr3mvau5nxvV1tHatxJ5Sd6QA+KRj6RFTkmbdAgn38kPckmunAvn2rl1fq5St5x1zQtyD87db0ogqgk=</latexit> <latexit sha1_base64="/0j5mgekPFxLzknXSTCBL6dDVg4=">ACS3icbZDLbtQwFIadKYV2uHTaLtkYRkgFxCgpvWQWSBVlAQtEU1baTKMTjwnGWtsJ7KdSqMoz9KnYQld9zm6AxZ4LkJAOZKlX/9/ju3zJYXgxvr+ldYurV8+87KavPuvfsP1lrGycmLzXDiOUi12cJGBRcYWS5FXhWaASZCDxNxofT/PQcteG5OraTAvsSMsVTzsA6a9Dqxp94JoG+onGqgVUxiGIEdaVq+o4+X5hbAX1B58nTuhrXNIo+Hw9ab/T7e6EuyF1ItwLg20adPxZ/RZtsqijQetHPMxZKVFZJsCYXuAXtl+BtpwJrJtxabANoYMe04qkGj61WzFmj5xzpCmuXZHWTpz/5yoQBozkYnrlGBH5t9sav4v65U2DfsV0VpUbH5Q2kpqM3plBcdco3MiokTwDR3f6VsBI6KdVSb8Rt0u2h87+79UKAGm+tnDqLOJFe12y2LH8VT6WjdYHNTnGx3gpcd/+NO+D1gtsKeUgeky0SkH1yQN6SIxIRi7IF/KNXHpfvWvu/dz3trwFjOb5K9qLP8ChAeyzw=</latexit> Guided evolutionary strategies Schematic Choosing the guiding distribution 0.25 Guided ES Guiding distribution Identity + low rank covariance Loss nI + (1 − α ) Σ = α UU T k 0 Guiding subspace U ∈ R n × k columns are surrogate gradients 𝛽 : hyperparameter n: parameter dimension k: subspace dimension
Demo: perturbed quadratic
Quadratic function with a bias added to the gradient
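A hypothetical end-to-end version of this demo built from the sketches above (problem size, step size, iteration count, and the bias vector are illustrative choices, not the talk's exact settings):

```python
import numpy as np

n = 100
A = np.random.randn(n, n) / np.sqrt(n)
b = np.random.randn(n)
bias = np.random.randn(n)  # fixed bias corrupting the gradient

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def surrogate_grad(x):
    # Biased, but correlated with the true gradient A^T (A x - b).
    return A.T @ (A @ x - b) + bias

x = np.zeros(n)
for _ in range(500):
    # k = 1 guiding subspace spanned by the current surrogate gradient.
    U, _ = np.linalg.qr(surrogate_grad(x)[:, None])
    x -= 0.1 * guided_es_gradient(f, x, make_sampler(U))
```

Because the search distribution keeps full support (the α/n identity term), the estimator can still make progress even though the surrogate direction is biased.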
Example applications
• Unrolled optimization: surrogate gradient from one step of BPTT
• Synthetic gradients: surrogate gradient from a synthetic gradient model
Summary
Guided Evolutionary Strategies: an optimization algorithm for when you only have access to surrogate gradients
Learn more at our poster: Pacific Ballroom #146
Code: brain-research/guided-evolutionary-strategies
Twitter: @niru_m
<latexit sha1_base64="/0j5mgekPFxLzknXSTCBL6dDVg4=">ACS3icbZDLbtQwFIadKYV2uHTaLtkYRkgFxCgpvWQWSBVlAQtEU1baTKMTjwnGWtsJ7KdSqMoz9KnYQld9zm6AxZ4LkJAOZKlX/9/ju3zJYXgxvr+ldYurV8+87KavPuvfsP1lrGycmLzXDiOUi12cJGBRcYWS5FXhWaASZCDxNxofT/PQcteG5OraTAvsSMsVTzsA6a9Dqxp94JoG+onGqgVUxiGIEdaVq+o4+X5hbAX1B58nTuhrXNIo+Hw9ab/T7e6EuyF1ItwLg20adPxZ/RZtsqijQetHPMxZKVFZJsCYXuAXtl+BtpwJrJtxabANoYMe04qkGj61WzFmj5xzpCmuXZHWTpz/5yoQBozkYnrlGBH5t9sav4v65U2DfsV0VpUbH5Q2kpqM3plBcdco3MiokTwDR3f6VsBI6KdVSb8Rt0u2h87+79UKAGm+tnDqLOJFe12y2LH8VT6WjdYHNTnGx3gpcd/+NO+D1gtsKeUgeky0SkH1yQN6SIxIRi7IF/KNXHpfvWvu/dz3trwFjOb5K9qLP8ChAeyzw=</latexit> Choosing optimal hyperparameters Optimal hyperparameter ( ɑ ) Guided ES Identity + low rank covariance nI + (1 − α ) Σ = α UU T k