signSGD: Compressed Optimisation for Non-convex Problems
Jeremy Bernstein (Caltech), Yu-Xiang Wang (UCSB/Amazon), Kamyar Azizzadenesheli (UCI), Anima Anandkumar (Caltech/Amazon)
➤ signSGD: snap gradient components to ±1
➤ Compressed optimisation: reduces communication time
➤ Non-convex problems: realistic for deep learning
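To make the update concrete, here is a minimal sketch of one signSGD step in Python/NumPy (function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def sign_sgd_step(w, g, lr):
    """One signSGD step: move every coordinate by the same amount lr,
    in the direction given by the sign of its stochastic gradient g."""
    return w - lr * np.sign(g)

# Illustrative usage: w = sign_sgd_step(w, stochastic_gradient(w), lr=1e-3)
```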
STRUCTURE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of neural net landscape
➤ Imagenet results
GRADIENT COMPRESSION… WHY CARE?
[Diagram: a parameter server connected to many GPU workers (GPU 1 … GPU 7), each holding a share of the data. Can the gradients exchanged with the server be compressed?]
DISTRIBUTED SGD
[Diagram: each GPU computes a stochastic gradient g on its half of the data and sends it to the parameter server, which aggregates ∑ g and sends the result back.]
SIGN SGD WITH MAJORITY VOTE
[Diagram: each GPU computes a stochastic gradient g on its half of the data and sends only sign(g); the parameter server tallies the votes and broadcasts sign[∑ sign(g)].]
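A rough sketch of this exchange in Python/NumPy, assuming each of the M workers has already computed a stochastic gradient on its own data shard (names are illustrative):

```python
import numpy as np

def majority_vote_step(w, worker_grads, lr):
    """Distributed signSGD with majority vote.

    Uplink: each worker sends only sign(g_m), i.e. 1 bit per component.
    Server: tallies the +/-1 votes and takes their majority, sign(sum of signs).
    Downlink: the majority sign is broadcast (again 1 bit per component)
    and every worker applies the same update."""
    votes = sum(np.sign(g) for g in worker_grads)  # server-side tally of the votes
    return w - lr * np.sign(votes)                 # step along the majority direction
```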
COMPRESSION SAVINGS OF MAJORITY VOTE
[Bar chart: number of bits communicated per gradient component per iteration, SGD vs majority vote.]
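As a rough worked comparison (assuming uncompressed gradients are otherwise exchanged as 32-bit floats, which is what the chart measures against):

```latex
\[
\underbrace{32~\text{bits}}_{\text{SGD: float32 component}}
\;\longrightarrow\;
\underbrace{1~\text{bit}}_{\text{majority vote: sign only}}
\qquad \text{per component, in each direction (worker} \to \text{server and server} \to \text{worker)}.
\]
```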
SIGN SGD IS A SPECIAL CASE OF ADAM
[Diagram: signSGD shown as a special case of Adam; Signum = sign momentum.]
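A minimal sketch of the Signum update (take the sign of a momentum buffer); with beta = 0 it reduces to plain signSGD, and the talk's point is that Adam's update also collapses to such a ±1 step once its second-moment normalisation is switched off. Variable names and the default beta are illustrative:

```python
import numpy as np

def signum_step(w, m, g, lr, beta=0.9):
    """Signum: maintain an exponential moving average m of the stochastic
    gradients and step each coordinate by lr in the direction of sign(m).
    Setting beta = 0 recovers signSGD."""
    m = beta * m + (1 - beta) * g  # momentum (first-moment) buffer
    w = w - lr * np.sign(m)        # sign of the momentum
    return w, m
```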
ADAM… WHY CARE?
[Plot: number of Google Scholar citations, comparing Adam (Kingma & Ba) with SGD (Robbins & Monro) and the Turing test.]
UNIFYING ADAPTIVE GRADIENT METHODS + COMPRESSION

Sign descent:
➤ weak theoretical foundation
➤ incredibly popular (e.g. Adam)
➤ empirically successful

Compressed descent:
➤ weak theoretical foundation
➤ take pains to correct bias

Sign-based gradient compression? Need to build a theory.
STRUCTURE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of neural net landscape
➤ Imagenet results
DOES SIGN SGD EVEN CONVERGE?

What might we fear?
➤ Might not converge at all
➤ Might have horrible dimension dependence
➤ Majority vote may give no speedup by adding extra machines

Our results:
➤ It does converge
➤ We characterise functions where signSGD & majority vote are as nice as SGD
➤ Suggest these functions are typical in deep learning
➤ Compression can be a free lunch
SINGLE WORKER RESULTS

Assumptions:
➤ Objective function lower bound f*
➤ Coordinate-wise variance bound σ⃗
➤ Coordinate-wise gradient Lipschitz constants L⃗

Define:
➤ Number of iterations K
➤ Number of backpropagations N

SGD gets rate
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{\sqrt{N}}\left[\|\vec{L}\|_\infty\,(f_0 - f_*) + \|\vec{\sigma}\|_2^2\right]$$

signSGD gets rate
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec{L}\|_1}\,\Big(f_0 - f_* + \tfrac{1}{2}\Big) + 2\|\vec{\sigma}\|_1\right]^2$$

Worst-case comparisons between the norms appearing in the two rates: ‖g_k‖_1 ≤ √d ‖g_k‖_2, ‖σ⃗‖_1 ≤ √d ‖σ⃗‖_2 and ‖L⃗‖_1 ≤ d ‖L⃗‖_∞.
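To spell out the reasoning these comparisons point at (a rough sketch, arguing up to constants in the spirit of the paper's informal discussion), suppose the gradient, noise and curvature vectors are dense:

```latex
\[
\|g_k\|_1^2 \approx d\,\|g_k\|_2^2, \qquad
\|\vec{\sigma}\|_1^2 \approx d\,\|\vec{\sigma}\|_2^2, \qquad
\|\vec{L}\|_1 \approx d\,\|\vec{L}\|_\infty .
\]
```

Substituting these into the signSGD bound, both the left-hand side and the right-hand side pick up roughly a factor of d relative to the SGD bound, so the dimension factors cancel and the two rates agree up to constants.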
MULTI WORKER RESULTS (with M workers)

Assumptions:
➤ Objective function lower bound f*
➤ Coordinate-wise variance bound σ⃗
➤ Coordinate-wise gradient Lipschitz constants L⃗

SGD gets rate
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{\sqrt{N}}\left[\|\vec{L}\|_\infty\,(f_0 - f_*) + \frac{\|\vec{\sigma}\|_2^2}{M}\right]$$

If the gradient noise is unimodal symmetric, majority vote gets rate
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec{L}\|_1}\,\Big(f_0 - f_* + \tfrac{1}{2}\Big) + \frac{2\|\vec{\sigma}\|_1}{\sqrt{M}}\right]^2$$

i.e. the same variance reduction from M workers as full-precision distributed SGD.
STRUCTURE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of neural net landscape
➤ Imagenet results
CHARACTERISING THE DEEP LEARNING LANDSCAPE EMPIRICALLY
➤ signSGD cares about gradient density
➤ majority vote cares about noise symmetry

Natural measure of density: φ(v) := ‖v‖_1^2 / (d ‖v‖_2^2), which equals 1 for fully dense v and ≈ 0 for fully sparse v.

For large enough mini-batch size, the unimodal symmetric noise assumption is reasonable by the Central Limit Theorem.
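A small Python/NumPy sketch of this density measure (the function name is illustrative):

```python
import numpy as np

def density(v):
    """Gradient density phi(v) = ||v||_1^2 / (d * ||v||_2^2):
    equals 1 when all d components have equal magnitude (fully dense)
    and 1/d, i.e. roughly 0, when only one component is non-zero (fully sparse)."""
    d = v.size
    return np.linalg.norm(v, 1) ** 2 / (d * np.linalg.norm(v, 2) ** 2)

# e.g. density(np.ones(100)) == 1.0, while density(np.eye(100)[0]) == 0.01
```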
STRUCTURE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of neural net landscape
➤ Imagenet results
SIGNUM IS COMPETITIVE ON IMAGENET
➤ Performance very similar to Adam
➤ May want to switch to SGD towards the end of training?
DOES MAJORITY VOTE WORK?
[Plot: CIFAR-10, ResNet-18 experiments; credit Jiawei Zhao (NUAA).]
Poster tonight! 6.15—9 PM @ Hall B #72