Scaling Deep Learning to 100s of GPUs on Hops Hadoop
Fabio Buso, Software Engineer, Logical Clocks AB
HopsFS: Next generation HDFS
● 37x number of files
● 16x throughput
Scale Challenge Winner (2017)
* https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
** https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf
Hops platform (stack):
● Projects, Datasets, Users
● Jobs, Grafana, ELK / REST API / Jupyter, Zeppelin
● Spark, TensorFlow, Hive, Kafka, Flink
● HopsFS, HopsYARN, MySQL NDB Cluster
Version 0.3.0 just released!
Python first
[diagram: per-project Conda environment (Python-3.6, pandas-1.4, Numpy-0.9); install/remove packages from a Conda repo; environment usable by Spark / TensorFlow]
Hops python library: makes development easy
● Hyperparameter searching
● Manage TensorBoard lifecycle
Find big datasets - Dela*
● Discover, share and experiment with interesting datasets
● p2p network of Hops clusters
● ImageNet, YouTube8M, Reddit comments...
● Exploits unused bandwidth
* http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)
Scale out level 1: Parallel hyperparameter searching
Parallel hyperparameter searching

def model(lr, dropout):
    …

# util and tflauncher come from the hops python library
args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}
args_dict_grid = util.grid_params(args_dict)

tflauncher.launch(spark, model, args_dict_grid)

Starts 6 parallel experiments (3 learning rates x 2 dropout values)
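For context, the 3 learning rates and 2 dropout values expand into 6 argument combinations, one per experiment. A plain-Python sketch of that expansion (illustrative only, not the hops util.grid_params implementation):

import itertools

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}

# Cartesian product of the per-parameter value lists -> one dict per run
grid = [dict(zip(args_dict.keys(), values))
        for values in itertools.product(*args_dict.values())]

print(len(grid))   # 6 combinations, one per parallel experiment
print(grid[0])     # {'learning_rate': 0.001, 'dropout': 0.45}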
Scale out level 2: Distributed training
TensorFlowOnSpark (TFoS) by Yahoo!
● Distributed TensorFlow over Spark
● Runs on top of a Hadoop cluster
● PS/Workers executed inside Spark executors
● Uses Spark for resource allocation
  – Our version: exclusive GPU allocations
  – Parameter server(s) do not get GPU(s)
● Manages TensorBoard
Run TFoS

def training_fun(argv, ctx):
    …..
    TFNode.start_cluster_server()
    …..

TFCluster.run(spark, training_fun, num_exec, num_ps, …)

Full conversion guide:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
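A slightly fuller sketch of the pattern the conversion guide describes (illustrative; sc, args, num_executors and num_ps are assumed to be defined, and the exact TFoS signatures have changed between versions):

from tensorflowonspark import TFCluster, TFNode

def training_fun(argv, ctx):
    # ctx describes this executor's role (job_name, task_index) in the TF cluster
    cluster, server = TFNode.start_cluster_server(ctx)
    if ctx.job_name == "ps":
        server.join()          # parameter servers only serve variables
    else:
        pass                   # build the graph, read data, run the training loop

cluster = TFCluster.run(sc, training_fun, args, num_executors, num_ps,
                        tensorboard=True,
                        input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()             # wait for the workers to finish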
Scale out level: Master of the dark arts Horovod
The parameter server architecture doesn't scale
[scaling chart] From: https://github.com/uber/horovod
Horovod by Uber
● Based on previous work by Baidu
● Organizes workers in a ring
● Gradient updates distributed using All-Reduce
● Synchronous protocol
All-Reduce
[diagram: ring all-reduce across GPU 1, GPU 2, GPU 3. Each GPU starts with its own chunks (a_i, b_i, c_i); partial sums are passed around the ring step by step until every GPU holds the full sums a0+a1+a2, b0+b1+b2, c0+c1+c2]
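The diagram compresses several ring steps. Below is a minimal Python sketch that simulates the same ring all-reduce (an illustration of the algorithm only, not Horovod's NCCL/MPI implementation; the participant count and chunk layout are assumptions for the example):

def ring_allreduce(vectors):
    # n participants, each holding a vector split into n chunks.
    # After 2*(n-1) steps every participant holds the element-wise sum.
    n = len(vectors)
    chunks = [list(v) for v in vectors]

    # Phase 1: reduce-scatter. Each step, every rank sends one chunk of
    # partial sums to its right neighbour, which accumulates it.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for r, idx, val in msgs:
            chunks[(r + 1) % n][idx] += val

    # Phase 2: all-gather. The fully reduced chunks circulate around the
    # ring so every rank ends up with every summed chunk.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for r, idx, val in msgs:
            chunks[(r + 1) % n][idx] = val
    return chunks

# Three "GPUs" holding chunks (a_i, b_i, c_i), as in the diagram:
print(ring_allreduce([[1, 10, 100], [2, 20, 200], [3, 30, 300]]))
# every participant ends with [6, 60, 600]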
Hops AllReduce

import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …..

def main(_):
    hvd.init()
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
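The slide elides what differs between the two branches. In the stock Horovod TensorFlow examples, the usual pattern is that every rank broadcasts its initial variables from rank 0, but only rank 0 writes checkpoints; a sketch of that upstream pattern (not necessarily the exact Hops notebook):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# All ranks start from identical weights broadcast from rank 0.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Only rank 0 writes checkpoints, so workers don't overwrite each other.
checkpoint_dir = './checkpoints' if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks) as sess:
    pass  # training loop goes here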
Demo time!
Play with it → hops.io/?q=content/hopsworks-vagrant
Doc → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop