
Distributed Training on HPC
Presented by: Aaron D. Saxton, PhD



  1. 7/11/19: Distributed Training on HPC, presented by Aaron D. Saxton, PhD

  2. Statistics Review
     • Simple y = m x + b regression
     • Least squares to find m, b
     • With data set {(x_i, y_i)}_{i=1,…,N}
     • Very special, often hard to measure y_i
     • Let the error be S = Σ_{i=1}^{N} [y_i − (m x_i + b)]²
     • Minimize S with respect to m and b
     • Simultaneously solve
       • S_m(m, b) = 0
       • S_b(m, b) = 0
     • Linear system
     • We will consider the more general y = f(x)
       • S_m(m, b) = 0 and S_b(m, b) = 0 may not be linear
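A minimal NumPy sketch of the least-squares fit above, solving the linear system S_m = 0, S_b = 0 via the normal equations. The synthetic data and the variable names are illustrative, not from the slides.

```python
# Least-squares fit of y = m*x + b on synthetic data with NumPy.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # noisy line

# Design matrix [x, 1]; lstsq minimizes S = sum_i (y_i - (m*x_i + b))^2
A = np.column_stack([x, np.ones_like(x)])
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"m ≈ {m:.3f}, b ≈ {b:.3f}")
```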

  3. Statistics Review
     • Regressions with parameterized sets of functions, e.g.
       • y = a x² + b x + c (quadratic)
       • y = Σ_i a_i x^i (polynomial)
       • y = A e^{k x} (exponential)
       • y = 1 / (1 + e^{−k(x − x₀)}) (logistic)
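A small sketch of fitting one of these parameterized families, the logistic, with SciPy's curve_fit. The data, the parameterization, and the initial guess p0 are made-up illustrations, not values from the slides.

```python
# Nonlinear least-squares fit of a logistic curve with scipy.optimize.curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, k, x0):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

rng = np.random.default_rng(1)
x = np.linspace(-5.0, 5.0, 100)
y = logistic(x, 2.0, 0.5) + rng.normal(scale=0.05, size=x.size)

(k, x0), _ = curve_fit(logistic, x, y, p0=[1.0, 0.0])
print(f"k ≈ {k:.3f}, x0 ≈ {x0:.3f}")
```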

  4. Statistics Review
     • Polynomial model of degree n
     • "Degrees of freedom": the model's capacity
     • Figure: Deep Learning, Goodfellow et al., MIT Press, http://www.deeplearningbook.org, 2016
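A brief sketch of the capacity idea: polynomial fits of increasing degree on the same small data set, where low degree underfits and very high degree can chase the noise. The data and degree choices are illustrative only.

```python
# Polynomial fits of increasing degree ("degrees of freedom") on noisy data.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)      # fit a degree-n polynomial
    resid = y - np.polyval(coeffs, x)      # training residuals
    print(f"degree {degree}: training MSE = {np.mean(resid**2):.4f}")
```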

  5. Gradient Descent
     • Searching for a minimum
     • ∇S = (S_{θ_1}, S_{θ_2}, …, S_{θ_n})
     • θ⃗_{t+1} = θ⃗_t − η ∇S
     • η: learning rate
     • Recall, the loss depends on the data; expand the notation: S(θ⃗_t; {x_i, y_i}_N)
     • Recall, S and ∇S are sums over i: S = Σ_{i=1}^{N} [y_i − f_θ(x_i)]²
     • Intuitively, want S with ALL DATA … ?
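A minimal sketch of full-batch gradient descent for the squared loss S(θ) = Σ_i [y_i − f_θ(x_i)]² with f_θ(x) = m·x + b. The learning rate, iteration count, and synthetic data are illustrative choices, not from the slides.

```python
# Full-batch gradient descent: every update uses ALL the data.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 100)
y = 3.0 * x - 0.5 + rng.normal(scale=0.1, size=x.size)

theta = np.zeros(2)            # theta = [m, b]
eta = 0.1                      # learning rate

for step in range(2000):
    err = theta[0] * x + theta[1] - y
    grad = np.array([2.0 * np.sum(err * x),   # dS/dm, summed over all data
                     2.0 * np.sum(err)])      # dS/db
    theta -= eta * grad / x.size              # theta_{t+1} = theta_t - eta * grad

print("m, b ≈", theta)
```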

  6. Gradient Descent

  7. Stochastic Gradient Descent
     • Recall, S is a sum over i: S = Σ_{i=1}^{N} [y_i − f_θ(x_i)]²
     • Single training example (x_i, y_i): sum over only one training example
       • ∇S_{x_i,y_i} = (S_{θ_1}, S_{θ_2}, …, S_{θ_n})|_{x_i,y_i}
       • θ⃗_{t+1} = θ⃗_t − η ∇S_{x_i,y_i}
       • η: learning rate
       • Choose the next example (x_{i+1}, y_{i+1}) (shuffled training set)
     • SGD with mini-batches: many training examples, sum over many training examples
       • Batch size or mini-batch size (this gets ambiguous with distributed training)
     • SGD often outperforms traditional GD; want small batches
       • https://arxiv.org/abs/1609.04836, On Large-Batch Training … Sharp Minima
       • https://arxiv.org/abs/1711.04325, Extremely Large ... in 15 Minutes
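A sketch of SGD with mini-batches for the same squared loss, contrasting with the full-batch update above. The batch size, learning rate, and epoch count are illustrative values, not from the slides.

```python
# Mini-batch SGD: each update uses only a shuffled slice of the data.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 1000)
y = 3.0 * x - 0.5 + rng.normal(scale=0.1, size=x.size)

theta = np.zeros(2)            # [m, b]
eta, batch_size = 0.1, 32

for epoch in range(50):
    order = rng.permutation(x.size)            # shuffled training set
    for start in range(0, x.size, batch_size):
        idx = order[start:start + batch_size]  # one mini-batch
        xb, yb = x[idx], y[idx]
        err = theta[0] * xb + theta[1] - yb
        grad = np.array([2.0 * np.mean(err * xb),
                         2.0 * np.mean(err)])
        theta -= eta * grad                    # update from this batch only

print("m, b ≈", theta)
```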

  8. Neural Networks
     • Activation functions
       • Logistic: σ(x) = 1 / (1 + e^{−x})
       • ReLU (Rectified Linear Unit): σ(x) = max(0, x)
       • Arctan: σ(x) = arctan(x)
     • Softmax
       • g_j(x_1, x_2, …, x_n) = e^{x_j} / Σ_i e^{x_i}
     [Plots of the logistic, ReLU, and arctan activations]
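The activation functions listed on the slide, written out in NumPy as a quick reference.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def arctan(x):
    return np.arctan(x)

def softmax(x):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))
```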

  9. Neural Networks
     • Parameterized function
       • Z_m = σ(α_{0m} + α_m X)
       • T_k = β_{0k} + β_k Z
       • f_k(X) = g_k(T)
     • Linear transformations with pointwise evaluation of a nonlinear function σ: X → Z → T → f(X)
     • Weights to be optimized: α_{0m}, α_m, β_{0k}, β_k
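A sketch of the parameterized function on this slide as a forward pass with one hidden layer, taking σ to be the logistic and g to be the softmax. The layer sizes and random weights are made up for illustration.

```python
# One-hidden-layer forward pass: Z = sigma(alpha0 + alpha X), T = beta0 + beta Z, f(X) = g(T).
import numpy as np

rng = np.random.default_rng(5)
n_in, n_hidden, n_out = 4, 8, 3

alpha0, alpha = rng.normal(size=n_hidden), rng.normal(size=(n_hidden, n_in))
beta0,  beta  = rng.normal(size=n_out),    rng.normal(size=(n_out, n_hidden))

def forward(X):
    Z = 1.0 / (1.0 + np.exp(-(alpha0 + alpha @ X)))  # pointwise nonlinearity
    T = beta0 + beta @ Z                             # second linear map
    e = np.exp(T - np.max(T))
    return e / np.sum(e)                             # softmax output g(T)

print(forward(rng.normal(size=n_in)))
```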

  10. Faux Model Example

  11. Distributed Training, data distributed

  12. Distributed Training, data distributed
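A sketch of the data-distributed idea from these two slides: each worker keeps a full copy of the model and trains on its own shard of the data. The shard function, the rank/world_size values, and the toy data are assumptions for illustration; on a real HPC job they would come from the launcher.

```python
# Data-parallel sharding: each rank owns a contiguous slice of the training set.
import numpy as np

def shard(x, y, rank, world_size):
    """Return the slice of the data set owned by this worker."""
    per_rank = x.shape[0] // world_size
    start = rank * per_rank
    return x[start:start + per_rank], y[start:start + per_rank]

# e.g. with 4 workers, rank 2 sees the third quarter of the data
x = np.arange(100, dtype=float)
y = 2.0 * x
x_local, y_local = shard(x, y, rank=2, world_size=4)
print(x_local[0], x_local[-1])   # 50.0 74.0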

  13. Distributed Training, All Reduce Collective
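A minimal sketch of gradient averaging with an all-reduce collective, assuming mpi4py and an MPI launcher (e.g. `mpiexec -n 4 python script.py`). The "gradient" here is a stand-in array, not a real model gradient.

```python
# Average per-rank gradients with an MPI all-reduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

local_grad = np.full(3, float(rank))     # stand-in for a gradient computed on this rank's shard
global_grad = np.empty_like(local_grad)

comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= world_size                # averaged gradient, identical on all ranks

if rank == 0:
    print("averaged gradient:", global_grad)
```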

  14. Distributed TensorFlow: Parameter Server/Worker (the default, a bad way on HPC)
     [Diagram: parameter servers ps:0 and ps:1 aggregate and update the model parameters; workers worker:0, worker:1, and worker:2 each hold a copy of the model, compute the loss (cross entropy), and run the optimizer (gradient descent).]
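A rough sketch of how the ps/worker cluster on this slide would be declared in TensorFlow 1.x (the API of the deck's era). The hostnames, ports, and role assignment are placeholders; on an HPC system they would come from the job scheduler.

```python
# Declare a parameter-server/worker cluster in TensorFlow 1.x.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222", "node1:2222"],                  # ps:0, ps:1
    "worker": ["node2:2222", "node3:2222", "node4:2222"],    # worker:0..2
})

# Each process starts one server and takes one role in the cluster.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A parameter-server process would instead host variables and block:
# server.join()
```

The slide calls this layout the default but a bad way on HPC; a common reason is that the parameter servers become a central communication bottleneck, which the all-reduce pattern sketched earlier avoids by exchanging gradients directly between workers.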

  15. Other Models: Sequence Modeling
     • Autoregression
       • X_t = c + Σ_{i=1}^{p} φ_i B^i X_t + ε_t
       • Back-shift operator: B^i
     • Autocorrelation
       • R_{XX}(t_1, t_2) = E[X_{t_1} X̄_{t_2}]
     • Other tasks
       • Semantic labeling:
         [art.] [adj.] [adj.] [n.] [v.] [adverb] [art.] [adj.] [adj.] [d.o.]
         The quick red fox jumps over the lazy brown dog
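A sketch of the autoregressive model above, X_t = c + Σ_i φ_i X_{t−i} + ε_t, simulated and then re-fit by least squares on lagged values. The order p, the coefficients, and the series length are illustrative choices.

```python
# Simulate an AR(2) series and recover (c, phi) by least squares on lags.
import numpy as np

rng = np.random.default_rng(6)
c, phi = 0.5, np.array([0.6, -0.2])          # AR(2) coefficients
p, n = len(phi), 2000

x = np.zeros(n)
for t in range(p, n):
    x[t] = c + phi @ x[t - p:t][::-1] + rng.normal(scale=0.1)

# Regress x_t on [1, x_{t-1}, ..., x_{t-p}]
lags = np.column_stack([x[p - i - 1:n - i - 1] for i in range(p)])
A = np.column_stack([np.ones(n - p), lags])
coef, *_ = np.linalg.lstsq(A, x[p:], rcond=None)
print("estimated c, phi:", coef)
```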

  16. Recurrent Neural Networks: Sequence Modeling
     • Few projects use pure RNNs; this example is only for pedagogy
     • An RNN is a model that is as "deep" as the modeled sequence is long
     • LSTMs, gated recurrent units
     • No model-parallel distributed training on the market (June 2019)
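A minimal sketch of a vanilla RNN cell unrolled over a sequence, to make the "as deep as the sequence is long" point concrete: the same weights are reused at every time step. The sizes and random weights are illustrative.

```python
# Unrolled vanilla RNN forward pass: one "layer" per time step, shared weights.
import numpy as np

rng = np.random.default_rng(7)
n_in, n_hidden, seq_len = 3, 5, 10

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

xs = rng.normal(size=(seq_len, n_in))    # one input sequence
h = np.zeros(n_hidden)                   # initial hidden state

for x_t in xs:                           # depth grows with sequence length
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print("final hidden state:", h)
```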
