Distributed Training on HPC
Presented by: Aaron D. Saxton, PhD
7/11/19
Statistics Review
• Simple regression: $y = m x + b$
  • Least squares to find $m$, $b$
  • With data set $\{(x_i, y_i)\}_{i=1,\dots,N}$
  • Very special; often hard to measure $y_i$
• Let the error be $L = \sum_{i=1}^{N} [y_i - (m x_i + b)]^2$
  • Minimize $L$ with respect to $m$ and $b$
• Simultaneously solve
  • $L_m(m, b) = 0$
  • $L_b(m, b) = 0$
  • A linear system
• We will consider the more general $y = f(x)$
  • $L_m(m, b) = 0$ and $L_b(m, b) = 0$ may not be linear
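Below is a minimal sketch of this least-squares fit in NumPy (the synthetic data, slope 2.5, and intercept 1.0 are made up for illustration); solving $L_m = 0$ and $L_b = 0$ is exactly the linear system that `np.linalg.lstsq` solves.

```python
import numpy as np

# Synthetic, noisy observations of y = 2.5*x + 1.0 (illustrative values only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Design matrix [x, 1]; least squares minimizes sum_i (y_i - (m*x_i + b))^2
A = np.column_stack([x, np.ones_like(x)])
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(m, b)  # should be close to 2.5 and 1.0
```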
Statistics Review
• Regressions with parameterized sets of functions, e.g.
  • $y = a x^2 + b x + c$ (quadratic)
  • $y = \sum_i a_i x^i$ (polynomial)
  • $y = a e^{b x}$ (exponential)
  • $y = \dfrac{1}{1 + e^{-k(x - x_0)}}$ (logistic)
Statistics Review
• Polynomial model of degree "n"
• "Degrees of freedom": the model's capacity
• Deep Learning, Goodfellow et al., MIT Press, http://www.deeplearningbook.org, 2016
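A hedged sketch of the degree/capacity trade-off (the data, degrees, and noise level are all made up): fitting polynomials of increasing degree to the same points drives the training error down, with the usual risk of overfitting at high degree.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.shape)  # noisy target

for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x, y, deg=degree)                 # least-squares polynomial fit
    train_err = np.sum((np.polyval(coeffs, x) - y) ** 2)  # error on the training points
    print(degree, train_err)  # error shrinks as the degrees of freedom grow
```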
Gradient Descent
• Searching for a minimum: $\nabla L = (L_{w_1}, L_{w_2}, \dots, L_{w_n})$
• $\vec{w}^{\,t+1} = \vec{w}^{\,t} - \delta \nabla L$
  • $\delta$: learning rate
• Recall, the loss depends on the data; expand notation: $\vec{w}^{\,t}; \{(x_i, y_i)\}$
• Recall $L$ (and hence $\nabla L$) is a sum over $N$: $L = \sum_{i=1}^{N} [y_i - f_{\vec{w}}(x_i)]^2$
• Intuitively, want $L$ with ALL DATA ..... ?
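A minimal full-batch gradient-descent loop for a linear model (learning rate, step count, and data are illustrative assumptions); note that every update sums the gradient over all $N$ data points, which is what the next slides relax.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 100)
y = 3.0 * x - 0.5 + rng.normal(scale=0.05, size=x.shape)  # noisy y = 3x - 0.5

m, b, delta = 0.0, 0.0, 0.1       # delta is the learning rate
for step in range(500):
    pred = m * x + b
    # Gradient of L = sum_i (y_i - (m*x_i + b))^2, averaged over N for a stable step size
    grad_m = -2.0 * np.mean((y - pred) * x)
    grad_b = -2.0 * np.mean(y - pred)
    m -= delta * grad_m
    b -= delta * grad_b
print(m, b)  # approaches 3.0 and -0.5
```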
Gradient Descent
Stochastic Gradient Descent
• Recall $L$ is a sum over $N$: $L = \sum_{i=1}^{N} [y_i - f_{\vec{w}}(x_i)]^2$
• Single training example $(x_i, y_i)$: sum over only one training example
  • $\nabla L_{x_i, y_i} = (L_{w_1}, L_{w_2}, \dots, L_{w_n})\big|_{x_i, y_i}$
  • $\vec{w}^{\,t+1} = \vec{w}^{\,t} - \delta \nabla L_{x_i, y_i}$
  • $\delta$: learning rate
  • Choose the next $(x_{i+1}, y_{i+1})$ (shuffled training set)
• SGD with mini batches
  • Many training examples $(x_i, y_i)$: sum over many training examples
  • Batch size or mini-batch size (this gets ambiguous with distributed training)
• SGD often outperforms traditional GD; want small batches
  • https://arxiv.org/abs/1609.04836, On Large-Batch Training ... Sharp Minima
  • https://arxiv.org/abs/1711.04325, Extremely Large ... in 15 Minutes
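A sketch of mini-batch SGD for the same linear model (batch size, learning rate, and epoch count are illustrative): each epoch shuffles the training set and updates the weights on one small batch at a time instead of the full sum.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 1000)
y = 3.0 * x - 0.5 + rng.normal(scale=0.05, size=x.shape)

m, b, delta, batch_size = 0.0, 0.0, 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(x))                # shuffled training set
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]      # one mini batch
        xb, yb = x[idx], y[idx]
        pred = m * xb + b
        grad_m = -2.0 * np.mean((yb - pred) * xb)  # gradient over the batch only
        grad_b = -2.0 * np.mean(yb - pred)
        m -= delta * grad_m
        b -= delta * grad_b
print(m, b)
```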
Neural Networks
• Activation functions
  • Logistic: $f(x) = \dfrac{1}{1 + e^{-x}}$
  • ReLU (Rectified Linear Unit): $f(x) = \max(0, x)$
  • Arctan: $f(x) = \arctan(x)$
• Softmax
  • $\sigma_j(x_1, x_2, \dots, x_K) = \dfrac{e^{x_j}}{\sum_i e^{x_i}}$
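These activations are one-liners in NumPy; the sketch below just restates the formulas from the slide (the max-subtraction in softmax is a standard numerical-stability trick, not something the slide mentions).

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def arctan(x):
    return np.arctan(x)

def softmax(x):
    z = np.exp(x - np.max(x))  # subtracting the max avoids overflow; result is unchanged
    return z / np.sum(z)

print(softmax(np.array([1.0, 2.0, 3.0])))  # components sum to 1
```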
Neural Networks
• Parameterized function
  • $\vec{h} = f(\beta_0 + \beta \vec{x})$
  • $\vec{g} = \gamma_0 + \gamma \vec{h}$
  • $\vec{g}(\vec{x}) = F_{\vec{w}}(\vec{x})$
• Linear transformations with pointwise evaluation of a nonlinear function, $f(\vec{x}) \to f(W \vec{x})$
• $\gamma_0, \gamma, \beta_0, \beta$: weights to be optimized
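A sketch of this two-layer forward pass in NumPy (the layer sizes, random weights, and the choice of ReLU as the pointwise nonlinearity are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=4)                                       # input vector

beta0, beta = rng.normal(size=8), rng.normal(size=(8, 4))    # layer-1 weights
gamma0, gamma = rng.normal(size=3), rng.normal(size=(3, 8))  # layer-2 weights

f = lambda z: np.maximum(0.0, z)   # pointwise nonlinearity (ReLU assumed here)

h = f(beta0 + beta @ x)            # h = f(beta_0 + beta x)
g = gamma0 + gamma @ h             # g = gamma_0 + gamma h, i.e. g(x) = F_w(x)
print(g)
```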
Faux Model Example
Distributed Training, data distributed
Distributed Training, data distributed
Distributed Training, All Reduce Collective
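A hedged sketch of data-parallel SGD with an allreduce collective, here via mpi4py (the library choice, data, and hyperparameters are assumptions, not taken from the slides): each rank computes a gradient on its own shard of the data, the gradients are summed across ranks with `Allreduce`, and every rank applies the same averaged update. Run with something like `mpiexec -n 4 python train.py`.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)                # each rank holds a different data shard
x = rng.uniform(0.0, 1.0, 256)
y = 3.0 * x - 0.5 + rng.normal(scale=0.05, size=x.shape)

m, b, delta = 0.0, 0.0, 0.1
for step in range(200):
    pred = m * x + b
    local_grad = np.array([-2.0 * np.mean((y - pred) * x),   # dL/dm on this shard
                           -2.0 * np.mean(y - pred)])        # dL/db on this shard
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)      # the allreduce collective
    global_grad /= size                                      # average over ranks
    m -= delta * global_grad[0]
    b -= delta * global_grad[1]

if rank == 0:
    print(m, b)  # identical on every rank after the synchronous updates
```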
Distributed TensorFlow: Parameter Server/Worker (the default, a bad way on HPC)
[Diagram: parameter-server tasks ps:0 and ps:1 aggregate and update the model parameters; worker tasks worker:0, worker:1, and worker:2 each hold the model, compute the cross-entropy loss, and run gradient-descent optimization.]
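A hedged TF 1.x-style sketch of the parameter-server layout in the diagram (hostnames, ports, and task indices are placeholders, and this is only a skeleton, not a full training script): ps tasks hold the variables, worker tasks build the model and optimizer.

```python
import tensorflow as tf  # TF 1.x API assumed

# Two parameter servers and three workers, matching ps:0-1 / worker:0-2 above
cluster = tf.train.ClusterSpec({
    "ps":     ["node01:2222", "node02:2222"],
    "worker": ["node03:2222", "node04:2222", "node05:2222"],
})

# Each process starts one server for its own job/task (worker:0 shown here)
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables on the ps tasks and ops on this worker
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # ... build the model, the cross-entropy loss, and a gradient-descent optimizer ...
    pass
```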
Other models: Sequence Modeling
• Autoregression
  • $X_t = c + \sum_{i=1}^{p} \varphi_i B^i X_t + \varepsilon_t$
  • Back shift operator: $B^i X_t = X_{t-i}$
• Autocorrelation
  • $R_{XX}(t_1, t_2) = E[X_{t_1} X_{t_2}]$
• Other tasks
  • Semantic labeling:
    [art.] [adj.] [adj.] [n.] [v.] [adverb] [art.] [adj.] [adj.] [d.o.]
    The quick red fox jumps over the lazy brown dog
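A small sketch of an autoregressive model in NumPy (the AR(2) coefficients, noise level, and series length are made up): simulate $X_t = c + \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \varepsilon_t$, then recover $c$ and $\varphi$ by least squares on lagged values.

```python
import numpy as np

rng = np.random.default_rng(5)
c, phi = 0.1, np.array([0.6, -0.2])        # illustrative AR(2) coefficients

x = np.zeros(500)
for t in range(2, len(x)):                 # simulate the series
    x[t] = c + phi[0] * x[t - 1] + phi[1] * x[t - 2] + rng.normal(scale=0.1)

# Regress X_t on (1, X_{t-1}, X_{t-2}) to recover c, phi_1, phi_2
A = np.column_stack([np.ones(len(x) - 2), x[1:-1], x[:-2]])
coef, *_ = np.linalg.lstsq(A, x[2:], rcond=None)
print(coef)  # roughly [0.1, 0.6, -0.2]
```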
Recurrent Neural Networks: Sequence Modeling
• Few projects use pure RNNs; this example is only for pedagogy
• An RNN is a model that is as "deep" as the modeled sequence is long
• LSTMs, gated recurrent units
• No model-parallel distributed training on the market (June 2019)
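A toy forward pass of a vanilla RNN cell in NumPy (sizes and weights are made up): the same cell is applied once per time step, so the unrolled computation is as deep as the sequence is long, which is the point the slide makes about why distributing it is hard.

```python
import numpy as np

rng = np.random.default_rng(6)
seq = rng.normal(size=(12, 4))         # 12 time steps, 4 input features each
W_xh = 0.1 * rng.normal(size=(8, 4))   # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(8, 8))   # hidden-to-hidden weights
b_h = np.zeros(8)

h = np.zeros(8)                        # initial hidden state
for x_t in seq:                        # one application of the cell per element
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
print(h)                               # final hidden state summarizes the sequence
```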