Some Success Stories in Bridging Theory and Practice


  1. Some Success Stories in Bridging Theory and Practice Anima Anandkumar, Bren Professor at Caltech; Director of ML Research at NVIDIA

  2. SIGNSGD: COMPRESSED OPTIMIZATION FOR NON-CONVEX PROBLEMS JEREMY BERNSTEIN, JIAWEI ZHAO, KAMYAR AZIZZADENESHELI, YU-XIANG WANG, ANIMA ANANDKUMAR

  3. DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION [Diagram: a parameter server coordinating GPU 1 and GPU 2, each training on 1/2 of the data]

  4. DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION [Diagram: the same setup, with "Compress?" marked on each link between the GPUs and the parameter server]

  5. DISTRIBUTED TRAINING BY MAJORITY VOTE [Diagram: GPUs 1-3 each send sign(g) to the parameter server, which broadcasts sign[sum(sign(g))] back to all workers]
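To make the protocol on this slide concrete, here is a minimal NumPy sketch of one majority-vote step (the function and variable names are illustrative, not from the paper's code): each worker transmits only sign(g), and the server broadcasts the sign of the summed sign vectors.

```python
import numpy as np

def majority_vote_step(params, worker_gradients, lr=1e-3):
    """One update: each worker sends sign(g); the server sums the sign
    vectors and broadcasts the sign of the sum (the majority vote)."""
    # Each worker transmits only 1 bit per coordinate.
    signs = [np.sign(g) for g in worker_gradients]
    # Server: elementwise majority vote = sign of the summed sign vectors.
    vote = np.sign(np.sum(signs, axis=0))
    # Workers apply the 1-bit aggregate update.
    return params - lr * vote

# Toy usage: 3 workers with noisy versions of the same true gradient.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=5)
grads = [true_grad + 0.1 * rng.normal(size=5) for _ in range(3)]
params = majority_vote_step(np.zeros(5), grads)
```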

  6. LARGE-BATCH ANALYSIS: SINGLE WORKER RESULTS

Assumptions:
• Objective function lower bound $f^*$
• Coordinate-wise variance bound $\vec{\sigma}$
• Coordinate-wise gradient Lipschitz $\vec{L}$

Define:
• Number of iterations $K$
• Number of backpropagations $N$

SGD gets rate
$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\big[\|g_k\|_2^2\big] \;\le\; \frac{1}{\sqrt{N}}\Big[\|\vec{L}\|_\infty\,(f_0 - f^*) + \|\vec{\sigma}\|_2^2\Big]$$

signSGD gets rate
$$\mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\Big]^2 \;\le\; \frac{1}{\sqrt{N}}\Big[\sqrt{\|\vec{L}\|_1}\,\Big(f_0 - f^* + \frac{1}{2}\Big) + 2\|\vec{\sigma}\|_1\Big]^2$$

To compare the two in $d$ dimensions, note $\|g_k\|_1^2 \le d\,\|g_k\|_2^2$, $\|\vec{\sigma}\|_1 \le \sqrt{d}\,\|\vec{\sigma}\|_2$, and $\|\vec{L}\|_1 \le d\,\|\vec{L}\|_\infty$.

  7. VECTOR DENSITY & ITS RELEVANCE IN DEEP LEARNING A natural measure of the density of a vector $v \in \mathbb{R}^d$ is $\phi(v) := \|v\|_1^2 / (d\,\|v\|_2^2)$: $\phi(v) = 1$ for a fully dense $v$, and $\phi(v) \approx 0$ for a fully sparse $v$. A fully dense vector is, for example, a sign vector.
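The density measure is easy to check numerically; a small sketch, assuming the $\phi(v) = \|v\|_1^2/(d\,\|v\|_2^2)$ definition above:

```python
import numpy as np

def density(v):
    """phi(v) = ||v||_1^2 / (d * ||v||_2^2), in (0, 1]."""
    d = v.size
    return np.linalg.norm(v, 1) ** 2 / (d * np.linalg.norm(v) ** 2)

sign_vec = np.ones(1000)             # fully dense sign vector
one_hot = np.r_[1.0, np.zeros(999)]  # fully sparse one-hot vector
print(density(sign_vec))  # 1.0
print(density(one_hot))   # 0.001, i.e. 1/d, approaching 0
```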

  8. DISTRIBUTED SIGNSGD: MAJORITY VOTE THEORY If gradients are unimodal and symmetric (reasonable by the central limit theorem), majority vote with M workers converges at a rate whose variance term shrinks by a factor of $\sqrt{M}$: the same variance reduction as distributed SGD.

  9. MINI-BATCH ANALYSIS Under the symmetric noise assumption: [convergence rate shown as an equation on the slide]

  10. CIFAR-10 SNR

  11. SIGNSGD PROVIDES A "FREE LUNCH" p3.2xlarge machines on AWS, ResNet-50 on ImageNet. Throughput gain with only a tiny accuracy loss.
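A back-of-envelope calculation of where the throughput gain comes from, assuming ResNet-50's roughly 25.6M parameters (that figure is an assumption here, not from the slide):

```python
# Communication cost per worker per step: full-precision gradients
# vs. 1-bit signs, for an assumed 25.6M-parameter ResNet-50.
n_params = 25.6e6
fp32_bytes = n_params * 4       # full-precision gradient: ~102 MB
sign_bytes = n_params / 8       # 1 bit per coordinate:    ~3.2 MB
print(fp32_bytes / sign_bytes)  # 32x less communication
```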

  12. SIGNSGD: TIME PER EPOCH

  13. SIGNSGD ACROSS DOMAINS AND ARCHITECTURES Huge throughput gain!

  14. BYZANTINE FAULT TOLERANCE Under the symmetric noise assumption: [robustness result shown on the slide]

  15. SIGNSGD IS ALSO BYZANTINE FAULT TOLERANT
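A toy illustration of the claim (not the paper's experiment): a Byzantine worker can at worst invert its sign vector, so with a majority of honest workers the elementwise vote still recovers the true gradient signs.

```python
import numpy as np

rng = np.random.default_rng(1)
true_grad = rng.normal(size=8)
# 5 honest workers send signs of noisy gradients.
honest = [np.sign(true_grad + 0.1 * rng.normal(size=8)) for _ in range(5)]
# 2 Byzantine workers do the worst they can: send inverted signs.
byzantine = [-np.sign(true_grad) for _ in range(2)]
vote = np.sign(np.sum(honest + byzantine, axis=0))
# Fraction of coordinates where the vote matches the true sign
# (typically 1.0: the honest majority outvotes the adversaries).
print((vote == np.sign(true_grad)).mean())
```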

  16. TAKE-AWAYS FOR SIGNSGD • Converges even under biased gradients and noise. • Faster than SGD in theory and in practice. • For distributed training, similar variance reduction to SGD. • In practice, similar accuracy with far less communication.

  17. LEARNING FROM NOISY SINGLY-LABELED DATA ASHISH KHETAN, ZACHARY C. LIPTON, ANIMA ANANDKUMAR

  18. CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS Majority rule • Simple and common. • Wasteful: ignores the differing quality of individual annotators.

  19. CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS Majority rule • Simple and common. • Wasteful: ignores the differing quality of individual annotators. Annotator-quality models • Can improve accuracy. • Hard: quality must be estimated without ground truth.
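For reference, a minimal sketch of the majority-rule baseline, assuming annotations stored as a mapping from example id to its list of worker labels (an illustrative structure, not the paper's data format):

```python
from collections import Counter

# Majority rule: pick the most common annotation per example,
# ignoring who provided it (hence "wasteful").
annotations = {
    "img1": ["cat", "cat", "dog"],
    "img2": ["dog", "cat", "dog"],
}
labels = {ex: Counter(anns).most_common(1)[0][0]
          for ex, anns in annotations.items()}
print(labels)  # {'img1': 'cat', 'img2': 'dog'}
```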

  20. SOME INTUITIONS Majority rule to estimate annotator quality (prob. of correctness) • Justification: majority rule approaches the ground truth when there are enough workers. • Downside: requires a large number of annotations per example for the majority to be correct.

  21. PROPOSED CROWDSOURCING ALGORITHM Input: noisy crowdsourced annotations. Repeat: • Compute the posterior of the ground-truth labels given the annotator-quality model. • Train the model with a weighted loss, using the posterior as weights. • Use the trained model to infer ground-truth labels. • MLE: update annotator quality using the inferred labels.
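The loop above can be made concrete with a runnable toy stand-in that replaces the trained neural predictor with direct posterior inference on binary labels (a one-coin worker model); everything here is illustrative, not the authors' code:

```python
import numpy as np

def crowdsourcing_em(A, n_rounds=20):
    """A: (n_examples, n_workers) matrix of binary labels in {0, 1}."""
    m = A.shape[1]
    p = np.full(m, 0.7)  # initial guess: every worker 70% accurate
    for _ in range(n_rounds):
        # E-step: posterior that each example's true label is 1,
        # given the current worker-quality estimates.
        llr = np.where(A == 1,
                       np.log(p) - np.log(1 - p),
                       np.log(1 - p) - np.log(p)).sum(axis=1)
        post = 1.0 / (1.0 + np.exp(-llr))  # P(y = 1 | annotations)
        # M-step: MLE of each worker's accuracy against the posterior
        # (the paper instead retrains a predictor with posterior weights).
        agree = post[:, None] * (A == 1) + (1 - post)[:, None] * (A == 0)
        p = np.clip(agree.mean(axis=0), 1e-3, 1 - 1e-3)
    return post, p

# Toy usage: 100 examples, 3 simulated workers with accuracies 0.9, 0.8, 0.6.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
acc = np.array([0.9, 0.8, 0.6])
A = np.where(rng.random((100, 3)) < acc, y[:, None], 1 - y[:, None])
post, p_hat = crowdsourcing_em(A)
print(np.round(p_hat, 2))  # recovers approximately [0.9, 0.8, 0.6]
```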

  22. LABELING ONCE IS OPTIMAL: THEORY Theorem: under a fixed budget, generalization error is minimized with a single annotation per sample. Assumptions: • The best predictor is accurate enough (under no label noise). • Simplified case: all workers have the same quality. • Prob. of being correct > 83%.
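A toy budget calculation behind the theorem, assuming worker accuracy p = 0.85 and a fixed budget of 30,000 annotations (both numbers are assumptions): redundant labels make each label cleaner but shrink the training set.

```python
from math import comb

def majority_accuracy(p, r):
    """Probability that a majority of r independent annotations is correct."""
    return sum(comb(r, k) * p**k * (1 - p)**(r - k)
               for k in range(r // 2 + 1, r + 1))

p, B = 0.85, 30_000
for r in (1, 3, 5):
    print(r, B // r, round(majority_accuracy(p, r), 3))
# r=1: 30000 examples at 0.85 label accuracy
# r=3: 10000 examples at ~0.939 label accuracy
# r=5:  6000 examples at ~0.973 label accuracy
# The theorem says the larger, noisier training set wins.
```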

  23. LABELING ONCE IS OPTIMAL: PRACTICE [Plots: MS-COCO dataset with a fixed budget of 35k annotations; ImageNet dataset with simulated workers and a fixed budget. ~5% vs. majority rule; x-axis: number of workers.]

  24. NEURAL RENDERING MODEL (NRM): JOINT GENERATION AND PREDICTION FOR SEMI-SUPERVISED LEARNING Nhat Ho, Tan Nguyen, Ankit Patel, Anima Anandkumar, Michael Jordan, Richard Baraniuk

  25. SEMI-SUPERVISED LEARNING WITH GENERATIVE MODELS? GAN Merits: • Captures statistics of natural images. • Learnable. Perils: • Feedback is real vs. fake: different from prediction. • Introduces artifacts.

  26. PREDICTIVE VS GENERATIVE MODELS [Diagram: predictive model P(y|x) vs. generative model P(x|y); one model to do both?] • SOTA prediction from CNN models. • What class of p(x|y) yields CNN models for p(y|x)?
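The link between the two directions is Bayes' rule: any generative model p(x|y) with a class prior p(y) induces a predictor, which is why one can ask what class of p(x|y) yields a CNN-shaped p(y|x):

```latex
% Inference in the generative model recovers the predictor.
p(y \mid x) \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
\;\propto\; p(x \mid y)\, p(y)
```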

  27. NEURAL RENDERING MODEL (NRM) [Diagram: deep rendering model generating from object category y, through latent intermediate variables, down to the image x.] Design joint priors for the latent variables based on reverse-engineering CNN predictive architectures.
