5194.01: Introduction to High-Performance Deep Learning
Mesh-TensorFlow & SparkNet
Shen Wang
The Ohio State University
10/21/2020
SparkNet & Mesh-TensorFlow
- SparkNet: Training Deep Networks in Spark
- Mesh-TensorFlow: Deep Learning for Supercomputers
SparkNet: Training Deep Networks in Spark
Background
- Training DNNs is time-consuming; computational clusters can be used to speed up training.
- Many attempts to speed up the training of deep networks rely on asynchronous, lock-free optimization.
- Batch-processing frameworks have become popular, but state-of-the-art deep learning systems rely on custom implementations to support their asynchronous, communication-intensive workloads.
- SparkNet is designed to integrate a distributed training algorithm with existing batch computation frameworks such as MapReduce and Spark.
Architecture of SparkNet: parameter server model
- Master node: keeps the latest model parameters and serves them to the worker nodes.
- Worker nodes: compute gradients with respect to the parameters and ship them back to the master node.
Advantages
- Convenient to integrate model training with existing data-processing pipelines.
- Data can be kept in memory from start to finish; training and visualization happen within a single framework.
- Hardware requirements are minimal.
- Many distributed training approaches require heavy communication; SparkNet does not require communication optimizations within the cluster.
Implementation: distributed training
- The master node broadcasts the parameters to the worker nodes.
- Worker nodes train on their batches individually for 50 iterations and ship the resulting parameters back.
- The master node updates the parameters with the average over the worker nodes and broadcasts the new parameters.
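A minimal Python sketch of this loop. The `Net` and `load_shard` helpers are illustrative stand-ins for SparkNet's Scala/Caffe components, not its actual API; only the control flow (broadcast, local training, averaging) mirrors the scheme above. In SparkNet proper, the shards would live in a cached Spark RDD, the per-worker loop would be a `map`, and parameter shipping would use Spark broadcast/collect.

```python
# Sketch of SparkNet-style synchronous rounds (illustrative stand-ins, not SparkNet's API).
import numpy as np

TAU = 50  # local SGD iterations per round, as on the slide

class Net:
    """Stand-in for a local solver (Caffe in SparkNet); here: linear least squares."""
    def __init__(self, dim=10, lr=0.01):
        self.w, self.lr = np.zeros(dim), lr
    def get_params(self):
        return self.w.copy()
    def set_params(self, w):
        self.w = w.copy()
    def sgd_step(self, batch):
        x, y = batch
        grad = x.T @ (x @ self.w - y) / len(y)
        self.w -= self.lr * grad

def load_shard(seed, n=256, dim=10):
    """Stand-in for loading one worker's data shard into memory."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, dim))
    return x, x @ np.arange(dim) + 0.1 * rng.normal(size=n)

def train_round(params, shard, batch_size=32):
    """One round on one worker: TAU local SGD steps from the broadcast params."""
    net, (x, y) = Net(), shard
    net.set_params(params)
    rng = np.random.default_rng(0)
    for _ in range(TAU):
        idx = rng.integers(0, len(y), size=batch_size)
        net.sgd_step((x[idx], y[idx]))
    return net.get_params()

def sparknet_train(num_rounds=5, num_workers=4):
    shards = [load_shard(k) for k in range(num_workers)]   # each worker keeps its shard in memory
    params = Net().get_params()                            # master holds the latest parameters
    for _ in range(num_rounds):
        # "broadcast" params, run TAU local iterations per worker, then average on the master
        worker_params = [train_round(params, s) for s in shards]
        params = np.mean(worker_params, axis=0)
    return params

if __name__ == "__main__":
    print("averaged params:", sparknet_train())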
Theoretical limitations of different parallelization schemes: no parallelization
- $N_a(b)$: number of serial SGD iterations required to obtain an accuracy of $a$ when training with batch size $b$.
- $C(b)$: time to compute the gradient over a batch of size $b$.
- Total time: $T_0 = N_a(b) \times C(b)$.
- (Schematic: each block corresponds to a single SGD update with batch size $b$.)
Theoretical limitations of different parallelization schemes: naive parallelization
- Distribute the computation by dividing each minibatch across $K$ machines.
- Broadcasting the parameters takes time $S$ (the limitation).
- The per-iteration time on a single node becomes $C(b/K)$.
- Total time: $T_1 = N_a(b) \times (C(b/K) + S)$.
- Speedup: $\dfrac{T_0}{T_1} = \dfrac{N_a(b)\,C(b)}{N_a(b)\,(C(b/K) + S)} = \dfrac{C(b)}{C(b/K) + S}$, which exceeds 1 only if $C(b/K) + S < C(b)$.
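A one-line consequence of this formula (my annotation, not stated on the slide): no matter how many machines are used, the broadcast cost caps the naive speedup,
$$\frac{T_0}{T_1} = \frac{C(b)}{C(b/K) + S} \le \frac{C(b)}{S},$$
so when $S$ is comparable to $C(b)$, naive parallelization cannot give more than a small constant speedup.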
Theoretical limitations of different parallelization schemes: SparkNet parallelization
- Distribute the computation in rounds across $K$ machines; broadcasting the parameters takes time $S$.
- In each round, every machine runs SGD for $\tau$ iterations with batch size $b$, so the per-iteration time on a single node is still $C(b)$.
- $M_a(b, K, \tau)$: number of rounds required to achieve an accuracy of $a$.
- Total time: $T_2 = M_a(b, K, \tau) \times (\tau\, C(b) + S)$.
- Speedup: $\dfrac{T_0}{T_2} = \dfrac{N_a(b)\,C(b)}{M_a(b, K, \tau)\,(\tau\, C(b) + S)}$.
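Rewriting the speedup (my annotation, following directly from the formula above) makes the communication term explicit: the broadcast cost $S$ is paid once per round and amortized over $\tau$ local iterations,
$$\frac{T_0}{T_2} = \frac{N_a(b)}{\tau\, M_a(b,K,\tau)} \cdot \frac{1}{1 + S/(\tau\, C(b))},$$
so for large $\tau$ the communication factor approaches 1 and only the statistical term $N_a(b) / (\tau\, M_a(b,K,\tau))$ matters.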
Theoretical limitations of SparkNet parallelization
- Disregarding the overhead due to synchronization ($S = 0$), the speedup $\dfrac{T_0}{T_2} = \dfrac{N_a(b)\,C(b)}{M_a(b,K,\tau)\,(\tau\, C(b) + S)}$ reduces to $\dfrac{N_a(b)}{\tau\, M_a(b,K,\tau)}$.
- To measure this ratio, they run SparkNet using a modified version of AlexNet on a subset of ImageNet (the first 100 classes, each with approximately 1000 images).
Speedup disregarding communication
- $K = 1$: with only one worker, $\tau$ has no effect.
- $\tau = 1$: equivalent to running serial SGD with a batch size of $K \cdot b$.
- Same $K$: the speedup does not increase as $\tau$ decreases (surprising).
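Why the $\tau = 1$ case reduces to large-batch serial SGD (a short sketch of the reasoning, not quoted from the paper): with one local step per round, averaging the $K$ workers' updated parameters is the same as averaging their gradients,
$$w_{t+1} = \frac{1}{K}\sum_{k=1}^{K}\Big(w_t - \eta\,\nabla \hat{L}_{B_k}(w_t)\Big) = w_t - \eta \cdot \frac{1}{K}\sum_{k=1}^{K}\nabla \hat{L}_{B_k}(w_t),$$
which is exactly one SGD step whose gradient is estimated on the union of the $K$ batches $B_1, \dots, B_K$, i.e. an effective batch size of $K \cdot b$; hence $M_a(b, K, 1) = N_a(Kb)$.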
Speedup with consideration of communication
- Naive parallelization gives no speedup when the communication overhead is large.
- SparkNet, however, gives a relatively consistent speedup even when the communication overhead is quite large.
- Values of $\tau$ considered: 1, 2, 5, 10, 25, 100, 500, 1000, 2500.
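To see this behavior numerically, the two speedup formulas can be evaluated for growing $S$. All constants in the sketch below are made-up placeholders (not the paper's measurements); they only illustrate the qualitative effect of the communication cost.

```python
# Plugging illustrative numbers into the two speedup formulas from the previous slides.
def naive_speedup(c_b, c_b_over_k, s):
    """T0/T1 = C(b) / (C(b/K) + S); note that N_a(b) cancels."""
    return c_b / (c_b_over_k + s)

def sparknet_speedup(c_b, n_a, m_a, tau, s):
    """T0/T2 = N_a(b) C(b) / (M_a(b,K,tau) * (tau C(b) + S))."""
    return (n_a * c_b) / (m_a * (tau * c_b + s))

if __name__ == "__main__":
    c_b, c_b_over_k = 1.0, 0.25   # gradient time on batch b, and on b/K with K=4 (placeholders)
    n_a, tau, m_a = 1000, 50, 7   # serial iterations, local iterations/round, rounds (placeholders)
    for s in [0.0, 0.1, 1.0, 10.0, 100.0]:
        print(f"S={s:6.1f}  naive={naive_speedup(c_b, c_b_over_k, s):6.3f}  "
              f"sparknet={sparknet_speedup(c_b, n_a, m_a, tau, s):6.3f}")
```

With these placeholder numbers the naive speedup collapses once $S$ approaches $C(b)$, while the SparkNet speedup shrinks only by the amortized factor $1/(1 + S/(\tau\, C(b)))$.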
Training benchmarks
- Figure: performance with AlexNet (left) and GoogLeNet (right) on the ImageNet dataset.
- Train the default Caffe model of AlexNet on ImageNet and compare the wall-clock time required to reach 45% accuracy.
- Train the default Caffe model of GoogLeNet on ImageNet and compare the wall-clock time required to reach 40% accuracy.
Dependence of the parallelization scheme on $\tau$
- Each experiment was run with $K = 5$ workers.
Conclusion
- SparkNet is an easy-to-use deep learning implementation for Spark, based on Caffe, that enables easy parallelization of existing Caffe models with minimal modification.
- Their approach is effective even in highly bandwidth-limited settings.
Mesh-TensorFlow: Deep Learning for Supercomputers
Background
- Batch-splitting (data parallelism) is the dominant training strategy for distributed DNNs, but it has drawbacks:
  - inability to train very large models (memory constraints);
  - high latency and inefficiency at small batch sizes.
- Model parallelism can solve these problems, but distribution strategies are complicated to specify and difficult to compile and optimize.
- Mesh-TensorFlow: a language for specifying a general class of distributed tensor computations.
- The user can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors (a sketch follows below).
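A minimal sketch of how this looks in the open-source mesh_tensorflow library, loosely following the style of its README. The dimension names, mesh shape, and layout strings here are illustrative choices of mine, and the lowering step that binds the mesh to actual devices is omitted.

```python
# Illustrative Mesh-TensorFlow snippet (layout values are my own choices).
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Every tensor dimension is named; splitting is decided per dimension name.
batch_dim = mtf.Dimension("batch", 512)
io_dim = mtf.Dimension("io", 1024)
hidden_dim = mtf.Dimension("hidden", 4096)

# Stand-in for an input tensor (a real model would use mtf.import_tf_tensor).
x = mtf.get_variable(mesh, "x", [batch_dim, io_dim])
w = mtf.get_variable(mesh, "w", [io_dim, hidden_dim])
hidden = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))

# A 2x4 processor mesh: split "batch" across mesh axis "rows" (data
# parallelism) and "hidden" across mesh axis "cols" (model parallelism).
mesh_shape = mtf.convert_to_shape("rows:2;cols:4")
layout_rules = mtf.convert_to_layout_rules("batch:rows;hidden:cols")
# Lowering the graph onto devices (e.g. a placement or SIMD mesh
# implementation) is omitted in this sketch.
```

The model code never mentions devices; changing the layout string is enough to move between pure data parallelism, pure model parallelism, or a mix of the two.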