High Performance Machine Learning: Advances, Challenges and Opportunities Eduardo Rodrigues Lecture @ ERAD-RS - April 11th, 2019
IBM Research
ADVANCES
Artificial Intelligence Deep Blue (1997)
AI and Machine Learning AI ML
Jeopardy (2011)
Debater https://www.youtube.com/watch?v=UeF_N1r91RQ
Machine Learning is becoming central to all industries (not just many of them)
◮ Nine out of 10 executives from around the world describe AI as important to solving their organizations' strategic challenges
◮ Over the next decade, AI enterprise software revenue will grow from $644 million to nearly $39 billion
◮ Services-related revenue should reach almost $150 billion
AI identifies which primates could be carrying the Zika virus
Biophysics-Inspired AI Uses Photons to Help Surgeons Identify Cancer
IBM takes on Alzheimer’s disease with machine learning
Seismic Facies Segmentation Using Deep Learning
Crop detection
Automatic Citrus Tree Detection from UAV Images
Agropad https://www.youtube.com/watch?v=UYVc0TeuK-w
HPC and ML/AI
◮ As data abounds, deeper and more complex models are developed
◮ These models have many parameters and hyperparameters to tune
◮ A cycle of train, test and adjust is repeated many times before good results are achieved
◮ Speeding up this exploratory cycle improves productivity
◮ Parallel execution is the solution
Basics: deep learning, sequential execution
Training basics
◮ loop over mini-batches and epochs
◮ forward propagation
◮ compute loss
◮ backward propagation (gradients)
◮ update parameters

$L = \frac{1}{N_{bs}} \sum_i L_i, \qquad \frac{\partial L_i}{\partial W_n}$
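A minimal sketch of this sequential loop, assuming PyTorch; `model`, `loss_fn` and `train_loader` are placeholder names, not objects from the talk:

```python
# Minimal sketch of the sequential training loop above (PyTorch assumed;
# model, loss_fn and train_loader are placeholders, not from the talk).
import torch

def train(model, loss_fn, train_loader, epochs, lr=0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):            # loop over epochs
        for x, y in train_loader:          # loop over mini-batches
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)    # forward propagation + loss
            loss.backward()                # backward propagation (gradients)
            optimizer.step()               # update parameters
```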
Parallel execution: single node, multi-GPU system
There are many ways to divide the deep neural network; the most common strategy is to divide mini-batches across GPUs:
◮ The model is replicated across GPUs
◮ Data is divided among them
◮ Two possible approaches: non-overlapping division or shuffled division
◮ Each GPU computes the forward pass, the cost and the mini-batch gradients
◮ Gradients are then averaged and stored in a shared space (visible to all GPUs)
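One way to realize this scheme is PyTorch's DistributedDataParallel, which replicates the model and averages gradients during backward(). The sketch below assumes one process per GPU launched externally (e.g. with torchrun); it is an illustration of the strategy, not the exact setup used in the talk:

```python
# Sketch of mini-batch data parallelism on a single multi-GPU node:
# one process per GPU (launched e.g. with torchrun), model replicated,
# gradients averaged automatically during backward(). Illustrative only.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_parallel(model, dataset, local_rank, batch_size=32):
    dist.init_process_group(backend="nccl")          # one process per GPU
    model = DDP(model.to(local_rank), device_ids=[local_rank])
    # DistributedSampler gives each GPU a non-overlapping share of the data;
    # shuffle=True reshuffles that division every epoch (the shuffled approach).
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader
```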
Parallelization strategies: multi-node
A similar strategy can be used across multiple nodes, but it requires communication between nodes. Two strategies:
◮ Asynchronous
◮ Synchronous
Synchronous
◮ Can be implemented with high-efficiency protocols
◮ No need to exchange variables
◮ Faster in terms of time to quality
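The synchronous approach is typically built on collective operations such as allreduce. A minimal sketch with mpi4py and NumPy gradients, as an illustration of the idea rather than the DDL implementation:

```python
# Minimal sketch of synchronous gradient averaging across workers using an
# MPI allreduce, the kind of high-efficiency collective mentioned above.
# mpi4py and NumPy gradients are assumptions made for this illustration.
import numpy as np
from mpi4py import MPI

def synchronous_average(local_grad: np.ndarray) -> np.ndarray:
    comm = MPI.COMM_WORLD
    summed = np.empty_like(local_grad)
    comm.Allreduce(local_grad, summed, op=MPI.SUM)  # every worker gets the sum
    return summed / comm.Get_size()                 # identical average everywhere
```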
DDL - Distributed Deep Learning
◮ We use a mesh/torus-like reduction
◮ Earlier dimensions need more bandwidth to transfer
◮ Later dimensions need less bandwidth to transfer
Hierarchical communication (1)
Hierarchical communication (2): reduce example
This shows a single example of a communication pattern that benefits from hierarchical communication: more bandwidth is needed at the beginning, and progressively less bandwidth is required as the reduction proceeds.
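The same reduction can be organized hierarchically: sum inside each node first, where bandwidth is plentiful, then combine across nodes, where less data moves per link. A rough mpi4py sketch of that two-level pattern, as an illustration rather than DDL's actual code:

```python
# Illustrative two-level reduction: intra-node first (high bandwidth),
# then inter-node between one leader per node (less data per link).
# mpi4py is assumed; this is not DDL's actual implementation.
import numpy as np
from mpi4py import MPI

def hierarchical_average(local_grad: np.ndarray) -> np.ndarray:
    world = MPI.COMM_WORLD
    node = world.Split_type(MPI.COMM_TYPE_SHARED)       # ranks on the same node
    is_leader = node.Get_rank() == 0
    leaders = world.Split(0 if is_leader else 1,        # one leader per node
                          world.Get_rank())

    node_sum = np.empty_like(local_grad)
    node.Reduce(local_grad, node_sum, op=MPI.SUM, root=0)   # stage 1: intra-node

    total = np.empty_like(local_grad)
    if is_leader:
        leaders.Allreduce(node_sum, total, op=MPI.SUM)       # stage 2: inter-node
    node.Bcast(total, root=0)                                # share result in-node
    return total / world.Get_size()
```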
Seismic segmentation models based on DNNs: a symbiotic partnership
◮ Deep Neural Networks have become the main tool for visual recognition
◮ They have also been used by seismologists to help interpret seismic data
◮ Relevant training examples may be sparse
◮ Training these models may take a very long time
◮ Parallel execution speeds up training
Seismic segmentation models based on DNNs: challenges
◮ Current deep learning models (AlexNet, VGG, Inception) do not fit the task well
◮ They are too big
◮ Little data (compared to traditional visual recognition tasks)
◮ Data pre-processing forces the model's input to be smaller
◮ Parallel execution strategies proposed in the literature are not appropriate
What is the recommendation?
Traditional technique
Traditional technique pitfalls
Key assumptions are:
◮ the full batch is very large
◮ the effective mini-batch is still a small fraction of the full batch
A hidden assumption is that small full batches don't need to run in parallel (a rough illustration follows).
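A back-of-the-envelope illustration of why that hidden assumption breaks for small datasets; the 2,000-sample figure is an arbitrary stand-in for a small seismic training set, not a number from the talk:

```python
# Back-of-the-envelope check of the assumptions above. The small-dataset
# size (2,000 samples) is an arbitrary illustration, not a figure from
# the talk; ImageNet-1k has ~1.28M training images.
def effective_fraction(full_batch, per_gpu_batch, gpus):
    """Effective mini-batch (per-GPU batch x #GPUs) as a fraction of the data."""
    return per_gpu_batch * gpus / full_batch

print(effective_fraction(1_281_167, 32, 64))  # ImageNet-scale: ~0.0016
print(effective_fraction(2_000, 32, 64))      # small dataset: ~1.02 (> 1!)
```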
Not only ImageNet can benefit from parallel execution
weak scaling, strong scaling
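The two regimes differ only in which batch size is held fixed as GPUs are added; the sketch below uses an arbitrary batch size of 16 purely for illustration:

```python
# Weak scaling keeps the per-GPU mini-batch fixed (the global batch grows
# with the number of GPUs); strong scaling keeps the global mini-batch
# fixed (each GPU's share shrinks). Batch size 16 is illustrative only.
def batch_configs(gpus, base_batch=16):
    weak = {"per_gpu": base_batch, "global": base_batch * gpus}
    strong = {"per_gpu": base_batch // gpus, "global": base_batch}
    return weak, strong

for g in (2, 4, 8):
    print(g, batch_configs(g))
```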
Our experiments (1): time to run 200 epochs
[Bar chart: execution time (s) for strong vs. weak scaling on 2, 4 and 8 GPUs]
[Line chart: intersection over union (IOU) per epoch, up to 200 epochs, for strong and weak scaling with 2, 4 and 8 GPUs]
Our experiments (2): time to reach 60% IOU
[Bar chart: execution time (s) for strong vs. weak scaling on 2, 4 and 8 GPUs]
[Line chart: intersection over union (IOU) per epoch for strong and weak scaling with 2, 4 and 8 GPUs]
HPC AI
Motivation
◮ End-users must specify several parameters in their job submissions to the queue system, e.g.: number of processors, queue/partition, memory requirements, other resource requirements (see the sketch below)
◮ Those parameters have a direct impact on the job turnaround time and, more importantly, on total system utilization
◮ Frequently, end-users are not aware of the implications of the parameters they use
◮ The system log keeps valuable information that can be leveraged to improve parameter choice
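For concreteness, these are the kinds of flags a user has to guess at submission time. The snippet below uses Slurm flag names purely as an example; the system discussed in the talk may use a different scheduler, and the queue name and resource values are made up:

```python
# Hypothetical job submission showing the parameters end-users must choose.
# Slurm flag names are used only as an example; the queue name, memory and
# walltime values are made up for illustration.
import subprocess

job_params = {
    "--ntasks": "32",        # number of processors
    "--partition": "short",  # queue / partition
    "--mem": "64G",          # memory requirement
    "--time": "02:00:00",    # other resource requirements (walltime)
}
cmd = ["sbatch"] + [f"{k}={v}" for k, v in job_params.items()] + ["job.sh"]
subprocess.run(cmd, check=True)
```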
Related work
◮ Karnak has been used in XSEDE to predict wait time and runtime
◮ Useful for users to plan their experiments
◮ The method may not apply well to other job parameters, for example memory requirements
[Figure: a query job q and its neighborhood of nearby points in the knowledge base; the distance D(q,x) = D(x,q) = d between the query and a stored job determines its neighbors p]
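The figure's query-neighborhood idea can be sketched as a nearest-neighbour regression over past jobs; the scikit-learn snippet below is an illustration of that idea with made-up numbers, not Karnak's actual code:

```python
# Nearest-neighbour illustration of the query/knowledge-base idea in the
# figure: estimate a new job's wait time from the most similar past jobs.
# scikit-learn is used for the sketch; all numbers are made up.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

past_jobs = np.array([[16, 60], [32, 120], [64, 240], [128, 480]])  # [cores, requested minutes]
wait_minutes = np.array([5.0, 12.0, 40.0, 90.0])                    # observed wait times

knn = KNeighborsRegressor(n_neighbors=2).fit(past_jobs, wait_minutes)
print(knn.predict([[48, 180]]))   # estimated wait time for the query job q
```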
Memory requirements
◮ The system owner wants to maximize utilization
◮ Users may not specify memory precisely
◮ Log data can provide training examples for a machine learning approach to predicting memory requirements
◮ This can be seen as a supervised learning task
◮ We have a set of features (e.g. user id, cwd, command parameters, submission time, etc.)
◮ We want to predict the memory requirement (label), as sketched below
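A minimal sketch of turning scheduler-log records into that supervised dataset; pandas is an assumption, and all field names and values are illustrative:

```python
# Sketch of framing the scheduler log as a supervised-learning problem:
# categorical job attributes become features, binned peak memory becomes
# the label. Field names and values are illustrative, not from real logs.
import pandas as pd

log = pd.DataFrame([
    {"user": "alice", "cwd": "/proj/seismic",  "cmd": "train.py", "hour": 9,  "mem_gb": 48},
    {"user": "bob",   "cwd": "/proj/genomics", "cmd": "align",    "hour": 22, "mem_gb": 12},
])
X = pd.get_dummies(log[["user", "cwd", "cmd", "hour"]])         # encoded features
y = pd.cut(log["mem_gb"], bins=[0, 16, 64, 256], labels=False)  # memory class (label)
```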
The Wisdom of Crowds
There are many learning algorithms available, e.g. classification trees, neural networks, instance-based learners, etc. Instead of relying on a single algorithm, we aggregate the predictions of several methods: "Aggregating the judgment of many consistently beats the accuracy of the average member of the group."
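A minimal sketch of that aggregation with scikit-learn: several different learners vote, and the per-sample mode of their predictions is the ensemble's answer. Synthetic data stands in for the job-log features; this is an illustration, not the production setup:

```python
# "Wisdom of crowds" sketch: train several different learners and take the
# mode (majority vote) of their predictions. Synthetic data stands in for
# the job-log features; this is an illustration, not the production setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)
ensemble = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("forest", RandomForestClassifier()),
                ("knn", KNeighborsClassifier())],
    voting="hard")                    # hard voting = per-sample mode
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))        # aggregated memory-class predictions
```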
Comparison between mode and poll, x86 system
[Bar chart: prediction accuracy per segment (0-4) in the x86 system for the mode and poll aggregation strategies; accuracies range from about 0.60 to 0.91]
CHALLENGES
Is the singularity really near? Nick Bostrom, Superintelligence; Yuval Noah Harari, 21 Lessons for the 21st Century
Employment
Flexibility and care Kai-Fu Lee, AI Superpowers: China, Silicon Valley, and the New World Order
Knowledge https://xkcd.com/1838/
http://tylervigen.com/view_correlation?id=359
http://tylervigen.com/view_correlation?id=1703
https://xkcd.com/552/
Judea Pearl, The Book of Why; Pedro Domingos, The Master Algorithm
OPPORTUNITIES
AI
HPC AI
HPC AI App
HPC AI Agri
IBM Cloud
IBM to launch AI research center in Brazil
HPML 2019 - High Performance Machine Learning Workshop @ IEEE/ACM CCGrid, Cyprus http://hpml2019.github.io