Scaling Deep Learning to 100s of GPUs on Hops Hadoop
Fabio Buso, Software Engineer, Logical Clocks AB
HopsFS: Next generation HDFS
● 37x number of files
● 16x throughput
Scale Challenge Winner (2017)
* https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
** https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
Hops platform (stack):
● Projects, Datasets, Users
● Jobs, Grafana, ELK / REST API / Jupyter, Zeppelin
● Spark, TensorFlow, Hive, Kafka, Flink
● HopsFS, HopsYARN, MySQL NDB Cluster
Version 0.3.0 just released!
Python first
[diagram: per-project Conda environment (Python-3.6, pandas-1.4, Numpy-0.9); install/remove packages from a Conda repo; environment usable by Spark / TensorFlow]
Hops python library: makes development easy
● Hyperparameter searching
● Manage TensorBoard lifecycle
Find big datasets - Dela*
● Discover, share and experiment with interesting datasets
● p2p network of Hops clusters
● ImageNet, YouTube8M, Reddit comments...
● Exploits unused bandwidth
* http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)
Scale out level 1: Parallel hyperparameter searching
Parallel hyperparameter searching

def model(lr, dropout):
    …

# util and tflauncher come from the hops python library
args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}
args_dict_grid = util.grid_params(args_dict)

tflauncher.launch(spark, model, args_dict_grid)

Starts 6 parallel experiments (3 learning rates x 2 dropout values)
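For context, the 3 learning rates and 2 dropout values expand into 6 argument combinations, one per experiment. A plain-Python sketch of that expansion (illustrative only, not the hops util.grid_params implementation):

import itertools

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}

# Cartesian product of the per-parameter value lists -> one dict per run
grid = [dict(zip(args_dict.keys(), values))
        for values in itertools.product(*args_dict.values())]

print(len(grid))   # 6 combinations, one per parallel experiment
print(grid[0])     # {'learning_rate': 0.001, 'dropout': 0.45}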
Scale out level 2: Distributed training
TensorFlowOnSpark (TFoS) by Yahoo!
● Distributed TensorFlow over Spark
● Runs on top of a Hadoop cluster
● PS/Workers executed inside Spark executors
● Uses Spark for resource allocation
  – Our version: exclusive GPU allocations
  – Parameter server(s) do not get GPU(s)
● Manages TensorBoard
Run TFoS

def training_fun(argv, ctx):
    …..
    TFNode.start_cluster_server()
    …..

TFCluster.run(spark, training_fun, num_exec, num_ps, …)

Full conversion guide:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
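A slightly fuller sketch of the pattern the conversion guide describes (illustrative; sc, args, num_executors and num_ps are assumed to be defined, and the exact TFoS signatures have changed between versions):

from tensorflowonspark import TFCluster, TFNode

def training_fun(argv, ctx):
    # ctx describes this executor's role (job_name, task_index) in the TF cluster
    cluster, server = TFNode.start_cluster_server(ctx)
    if ctx.job_name == "ps":
        server.join()          # parameter servers only serve variables
    else:
        pass                   # build the graph, read data, run the training loop

cluster = TFCluster.run(sc, training_fun, args, num_executors, num_ps,
                        tensorboard=True,
                        input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()             # wait for the workers to finish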
Scale out level: Master of the dark arts Horovod
The parameter server architecture doesn't scale
[scaling chart] From: https://github.com/uber/horovod
Horovod by Uber
● Based on previous work by Baidu
● Organizes workers in a ring
● Gradient updates distributed using All-Reduce
● Synchronous protocol
All-Reduce
[diagram: ring all-reduce across GPU 1, GPU 2, GPU 3. Each GPU starts with its own chunks (a_i, b_i, c_i); partial sums are passed around the ring step by step until every GPU holds the full sums a0+a1+a2, b0+b1+b2, c0+c1+c2]
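The diagram compresses several ring steps. Below is a minimal Python sketch that simulates the same ring all-reduce (an illustration of the algorithm only, not Horovod's NCCL/MPI implementation; the participant count and chunk layout are assumptions for the example):

def ring_allreduce(vectors):
    # n participants, each holding a vector split into n chunks.
    # After 2*(n-1) steps every participant holds the element-wise sum.
    n = len(vectors)
    chunks = [list(v) for v in vectors]

    # Phase 1: reduce-scatter. Each step, every rank sends one chunk of
    # partial sums to its right neighbour, which accumulates it.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for r, idx, val in msgs:
            chunks[(r + 1) % n][idx] += val

    # Phase 2: all-gather. The fully reduced chunks circulate around the
    # ring so every rank ends up with every summed chunk.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for r, idx, val in msgs:
            chunks[(r + 1) % n][idx] = val
    return chunks

# Three "GPUs" holding chunks (a_i, b_i, c_i), as in the diagram:
print(ring_allreduce([[1, 10, 100], [2, 20, 200], [3, 30, 300]]))
# every participant ends with [6, 60, 600]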
Hops AllReduce

import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …..

def main(_):
    hvd.init()
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
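The slide elides what differs between the two branches. In the stock Horovod TensorFlow examples, the usual pattern is that every rank broadcasts its initial variables from rank 0, but only rank 0 writes checkpoints; a sketch of that upstream pattern (not necessarily the exact Hops notebook):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# All ranks start from identical weights broadcast from rank 0.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Only rank 0 writes checkpoints, so workers don't overwrite each other.
checkpoint_dir = './checkpoints' if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks) as sess:
    pass  # training loop goes here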
Demo time!
Play with it → hops.io/?q=content/hopsworks-vagrant
Doc → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop