Some Success Stories in Bridging Theory and Practice
Anima Anandkumar, Bren Professor at Caltech; Director of ML Research at NVIDIA
SIGNSGD: COMPRESSED OPTIMIZATION FOR NON-CONVEX PROBLEMS
JEREMY BERNSTEIN, JIAWEI ZHAO, KAMYAR AZIZZADENESHELI, YU-XIANG WANG, ANIMA ANANDKUMAR
DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION
[Diagram: a parameter server coordinating GPU 1 and GPU 2, each holding 1/2 of the data; a second build asks whether the gradients on each link between the GPUs and the server can be compressed]
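To see why compressing those links matters, here is a quick back-of-the-envelope sketch; the 25M-parameter count is an assumption, roughly ResNet-50 scale.

```python
# Per-round communication cost, assuming fp32 gradients and a
# hypothetical 25M-parameter model (roughly ResNet-50 scale).
d = 25_000_000                # number of gradient coordinates (assumption)
fp32_mb = 32 * d / 8 / 1e6    # full-precision gradient: 32 bits per coordinate
sign_mb = 1 * d / 8 / 1e6     # sign vector: 1 bit per coordinate

print(fp32_mb)                # 100.0 MB per worker per round
print(sign_mb)                # 3.125 MB per worker per round
print(fp32_mb / sign_mb)      # 32.0x compression on every link
```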
DISTRIBUTED TRAINING BY MAJORITY VOTE
[Diagram: round 1, each of GPUs 1-3 sends sign(g) to the parameter server; round 2, the server broadcasts sign[sum(sign(g))] back to all workers]
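A minimal NumPy sketch of the two-round scheme in the diagram above, with toy worker gradients: each worker ships only sign(g), the server aggregates by elementwise majority vote, and the broadcast back down is again a 1-bit-per-coordinate sign vector. This is the logic only, not the paper's production implementation.

```python
import numpy as np

def majority_vote_step(worker_grads, params, lr=1e-3):
    """One distributed signSGD step with majority-vote aggregation.

    worker_grads: list of stochastic gradients, one per worker.
    """
    # Round 1: each worker compresses its gradient to signs (1 bit/coordinate).
    signs = [np.sign(g) for g in worker_grads]
    # Server: elementwise majority vote over workers, compressed again to signs.
    vote = np.sign(np.sum(signs, axis=0))
    # Round 2: server broadcasts the 1-bit vote; every worker applies the update.
    return params - lr * vote

# Toy usage: 3 workers, 5-dimensional parameter vector, noisy gradients.
rng = np.random.default_rng(0)
params = rng.normal(size=5)
grads = [params + 0.1 * rng.normal(size=5) for _ in range(3)]
params = majority_vote_step(grads, params)
```

With an odd number of workers the vote sum is never zero, so the broadcast is a true sign vector.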
LARGE-BATCH ANALYSIS: SINGLE WORKER RESULTS

Assumptions:
➤ Objective function lower bound $f^*$
➤ Coordinate-wise variance bound $\vec\sigma$
➤ Coordinate-wise gradient Lipschitz $\vec L$

Define:
➤ Number of iterations $K$
➤ Number of backpropagations $N$

SGD gets rate
$$\mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\Big] \le \frac{1}{\sqrt{N}}\Big[\|\vec L\|_\infty\,(f_0 - f^*) + \|\vec\sigma\|_2^2\Big]$$

signSGD gets rate
$$\mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\Big]^2 \le \frac{1}{\sqrt{N}}\Big[\sqrt{\|\vec L\|_1}\,\Big(f_0 - f^* + \frac{1}{2}\Big) + 2\|\vec\sigma\|_1\Big]^2$$

The two rates are comparable up to density: in the worst case $\|g_k\|_1^2 \le d\,\|g_k\|_2^2$, $\|\vec\sigma\|_1^2 \le d\,\|\vec\sigma\|_2^2$, and $\|\vec L\|_1 \le d\,\|\vec L\|_\infty$, so the comparison hinges on how dense the gradient is relative to the noise and curvature vectors.
VECTOR DENSITY & ITS RELEVANCE IN DEEP LEARNING

Natural measure of density: $\phi(\vec v) = \dfrac{\|\vec v\|_1^2}{d\,\|\vec v\|_2^2}$
➤ $\phi(\vec v) = 1$ for a fully dense $\vec v$
➤ $\phi(\vec v) \approx 0$ for a fully sparse $\vec v$
The fully dense extreme is a sign vector.
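The density measure above in a few lines of NumPy; the endpoints match the slide: $\phi = 1$ for a sign vector and $\phi = 1/d \approx 0$ for a 1-sparse vector.

```python
import numpy as np

def density(v):
    """phi(v) = ||v||_1^2 / (d * ||v||_2^2): 1 if fully dense, ~0 if fully sparse."""
    return np.linalg.norm(v, 1) ** 2 / (v.size * np.linalg.norm(v, 2) ** 2)

sign_vec = np.ones(1000)                   # a sign vector: every coordinate is +/-1
sparse = np.zeros(1000); sparse[0] = 3.0   # a 1-sparse vector

print(density(sign_vec))  # 1.0   (fully dense)
print(density(sparse))    # 0.001 (= 1/d, approaches 0 as d grows)
```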
DISTRIBUTED SIGNSGD: MAJORITY VOTE THEORY
If gradient noise is unimodal and symmetric (reasonable by the central limit theorem), majority vote with $M$ workers converges at rate
$$\mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\Big]^2 \le \frac{1}{\sqrt{N}}\Big[\sqrt{\|\vec L\|_1}\,\Big(f_0 - f^* + \frac{1}{2}\Big) + \frac{2}{\sqrt{M}}\|\vec\sigma\|_1\Big]^2$$
Same variance reduction as SGD: the noise term shrinks as $1/\sqrt{M}$.
MINI-BATCH ANALYSIS
Under the symmetric noise assumption: [convergence bound shown on slide]
CIFAR-10 SNR
[Plot: gradient signal-to-noise ratio on CIFAR-10 over training]
SIGNSGD PROVIDES "FREE LUNCH"
p3.2xlarge machines on AWS, ResNet-50 on ImageNet: throughput gain with only a tiny accuracy loss.
SIGNSGD: TIME PER EPOCH
SIGNSGD ACROSS DOMAINS AND ARCHITECTURES Huge throughput gain!
BYZANTINE FAULT TOLERANCE
Under the symmetric noise assumption: [robustness result shown on slide]
SIGNSGD IS ALSO BYZANTINE FAULT TOLERANT
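A toy simulation of why majority vote helps here; the adversary model (workers sending huge inverted gradients) and all sizes are assumptions for illustration, not the paper's exact threat model. The point: under plain averaging a single adversary's influence is unbounded, while under majority vote each adversary casts exactly one ±1 vote per coordinate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, honest_n, byz_n = 1000, 9, 2          # assumption: 9 honest vs 2 Byzantine workers
g = rng.normal(size=d)                   # "true" gradient for this toy example

honest = [g + rng.normal(size=d) for _ in range(honest_n)]
byz = [-1e6 * g for _ in range(byz_n)]   # adversaries send huge inverted gradients

# Averaging (plain distributed SGD): two adversaries dominate the sum.
avg = np.mean(honest + byz, axis=0)
print(np.mean(np.sign(avg) == np.sign(g)))   # ~0.0: the update is fully hijacked

# Majority vote on signs: each adversary gets one bounded vote per coordinate.
vote = np.sign(np.sum([np.sign(w) for w in honest + byz], axis=0))
print(np.mean(vote == np.sign(g)))           # stays well above chance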
TAKE-AWAYS FOR SIGNSGD
• Convergence even under biased gradients and noise.
• Faster than SGD in theory and in practice.
• For distributed training, similar variance reduction as SGD.
• In practice, similar accuracy but with far less communication.
LEARNING FROM NOISY SINGLY-LABELED DATA ASHISH KHETAN, ZACHARY C. LIPTON, ANIMA ANANDKUMAR
CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS
Majority rule
• Simple and common.
• Wasteful: ignores the differing quality of annotators.
Annotator-quality models
• Can improve accuracy.
• Hard: quality must be estimated without ground-truth labels.
SOME INTUITIONS
Use majority rule to estimate the annotator-quality model (probability of being correct).
• Justification: majority rule approaches the ground truth when there are enough workers.
• Downside: requires many annotations per example before the majority is reliable, as the arithmetic below shows.
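That downside is just binomial arithmetic. A small sketch, assuming every annotator is independently correct with probability p = 0.7: the majority of r annotators is correct with probability P(Bin(r, p) > r/2), which climbs only slowly with r.

```python
from math import comb

def majority_correct(r, p):
    """P(majority of r independent annotators is correct), each correct w.p. p."""
    return sum(comb(r, k) * p**k * (1 - p)**(r - k)
               for k in range(r // 2 + 1, r + 1))

for r in (1, 3, 5, 9, 15):
    print(r, round(majority_correct(r, 0.7), 3))
# 1: 0.7, 3: 0.784, 5: 0.837, 9: 0.901, 15: ~0.95
# -> each extra annotation buys a little accuracy but costs a full unit of budget.
```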
PROPOSED CROWDSOURCING ALGORITHM
Input: noisy crowdsourced annotations. Repeat:
1. Compute the posterior over ground-truth labels given the current annotator-quality model.
2. Train the predictor with a weighted loss, using the posterior as weights.
3. Use the trained model to infer ground-truth labels.
4. MLE: update annotator quality using the inferred labels.
A schematic sketch of this loop follows.
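A schematic Python sketch of the loop, under simplifying assumptions: binary labels, one annotation per example, and a single per-worker correctness probability as the annotator-quality model. `train_weighted` and `predict_proba` are hypothetical stand-ins for any supervised learner; this is an illustration of the alternation, not the paper's exact algorithm.

```python
import numpy as np

def crowd_em(X, ann, worker, train_weighted, predict_proba, n_iters=5):
    """Alternate between (E) posterior over true labels given worker quality,
    (M1) weighted supervised training, and (M2) MLE update of worker quality.

    X: features; ann: one binary annotation per example;
    worker: annotator id (0..W-1) per example.
    train_weighted(X, y, w) -> model; predict_proba(model, X) -> P(y=1|x).
    """
    q = np.full(worker.max() + 1, 0.75)   # init: every worker 75% reliable
    prior = np.full(len(ann), 0.5)        # prior on y=1 before any model exists
    for _ in range(n_iters):
        # E-step: posterior P(y=1 | annotation, worker quality) via Bayes' rule.
        like1 = np.where(ann == 1, q[worker], 1 - q[worker])  # P(ann | y=1)
        like0 = np.where(ann == 0, q[worker], 1 - q[worker])  # P(ann | y=0)
        post = like1 * prior / (like1 * prior + like0 * (1 - prior))
        # M-step 1: weighted training -- hard labels weighted by posterior
        # confidence (a simplification of the weighted loss on the slide).
        model = train_weighted(X, (post > 0.5).astype(int), np.abs(2 * post - 1))
        prior = predict_proba(model, X)   # the model now informs the posterior
        # M-step 2: MLE of each worker's quality against the inferred labels.
        inferred = (prior > 0.5).astype(int)
        for w in range(len(q)):
            m = worker == w
            if m.any():
                q[w] = np.clip(np.mean(ann[m] == inferred[m]), 0.51, 0.99)
    return model, q
```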
LABELING ONCE IS OPTIMAL: THEORY
Theorem: Under a fixed budget, generalization error is minimized with a single annotation per sample.
Assumptions:
• The best predictor is accurate enough (under no label noise).
• Simplified case: all workers have the same quality.
• Probability of being correct > 83%.
LABELING ONCE IS OPTIMAL: PRACTICE
[Plots: MS-COCO with a fixed budget of 35k annotations; ImageNet with simulated workers under a fixed budget. Accuracy vs. number of workers, ~5% gain over majority rule]
NEURAL RENDERING MODEL (NRM): JOINT GENERATION AND PREDICTION FOR SEMI-SUPERVISED LEARNING
Nhat Ho, Tan Nguyen, Ankit Patel, Anima Anandkumar, Michael Jordan, Richard Baraniuk
SEMI-SUPERVISED LEARNING WITH GENERATIVE MODELS?
GAN
Merits
• Captures statistics of natural images.
• Learnable.
Perils
• Feedback is real vs. fake: different from prediction.
• Introduces artifacts.
PREDICTIVE VS GENERATIVE MODELS
[Diagram: a predictive model P(y | x) mapping image x to label y, and a generative model P(x | y) mapping label y to image x. One model to do both?]
• SOTA prediction comes from CNN models.
• What class of p(x | y) yields CNN models for p(y | x)?
NEURAL RENDERING MODEL (NRM)
[Diagram: generative rendering path from the object category y through intermediate latent variables down to the image x, mirroring the CNN's predictive path from x back up to y]
Design joint priors for the latent variables based on reverse-engineering CNN architectures.