High Performance Computing with doAzureParallel
Using Azure as your Parallel Backend for Embarrassingly Parallel Work
JS Tan, Microsoft
Azure Big Compute
Azure Infrastructure
• Commodity VMs: most value for cost
• Fast processors, higher memory-to-core ratio, SSDs
• Fast processors, lower memory-to-core ratio, SSDs
• Most memory, Intel Xeon processors
• HPC / low-latency VMs for compute-intensive workloads
• GPU-enabled VMs for visualization / compute
What is Batch?
• Many individual tasks
• Many computers/VMs
• Tasks are assigned to computers/VMs
Scenarios
• A quant back-testing portfolio strategies
• A data scientist optimizing their model and tuning parameters
• A life-science researcher doing genome sequencing
What do they have in common?
• Scale – computationally expensive work that needs to scale up in order to get results back quickly
• Minimal IT management – the user is a domain specialist, not an IT specialist
• Elastic compute – a temporary need for a lot of capacity
• Cost effective – low-cost strategies are important!
+ They are all probably using R…
doAzureParallel is… an R package that uses Azure as a parallel backend for popular open-source tools – foreach, caret, dplyr, etc.
Foreach using doAzureParallel

foreach(i = 1:100) %dopar% {
  myParallelAlgorithm(...)
}
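The loop above runs on Azure once doAzureParallel is registered as the foreach backend. A minimal sketch of that registration, assuming a credentials file and cluster configuration file already exist (myParallelAlgorithm is a hypothetical user function standing in for any expensive computation):

```r
# Sketch: pointing foreach at Azure via doAzureParallel.
library(doAzureParallel)

# Authenticate with the Azure Batch and Storage accounts
# described in credentials.json.
setCredentials("credentials.json")

# Provision (or connect to) the cluster described in cluster.json.
cluster <- makeCluster("cluster.json")

# Register the cluster as the parallel backend for %dopar%.
registerDoAzureParallel(cluster)

# The same foreach loop as before now fans out across the cluster,
# and the combined results come back to the local R session.
results <- foreach(i = 1:100) %dopar% {
  myParallelAlgorithm(i)   # hypothetical expensive computation
}

# Tear the cluster down when finished to stop incurring VM charges.
stopCluster(cluster)
```

Because only the backend registration changes, the same loop runs locally with doParallel or at scale with doAzureParallel.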
doAzureParallel on Azure Batch
Azure Batch is a platform service that provides easy job scheduling and cluster management, allowing applications or algorithms to run in parallel at scale.
• Capacity on demand; jobs on demand
• Autoscale (more on this later)
• Minimal cluster management (node failures, installs, etc.)
• Hardware choice – use any VM size
• Pay by the minute
• Cost effective – no charge for the service itself; you only pay for the VMs
• More cost effective – low-priority VMs (more on this later)
If you want to run jobs using elastic compute, Batch is a great fit!
Scale
• From 1 to 10,000 VMs in a cluster
• From 1 to millions of tasks
• Your selection of hardware:
  • General-compute VMs (A-series / D-series)
  • Memory/storage-optimized VMs (G-series)
  • Compute-optimized VMs (F-series)
  • GPU-enabled VMs (N-series)
[Chart: time to compute the Mandelbrot set on a local machine vs. 5, 10, and 20 parallel workers]
Minimal Cluster Management
• Abstracts away complex Azure/cloud concepts
• Zero IT-level management
• Work entirely in RStudio
• Monitor / debug your jobs directly in RStudio
• Manage your cluster and multiple jobs directly in RStudio
• The results of your distributed, large-scale work are returned directly to your R session
Minimal Code Change
• Minimal code change is needed to use doAzureParallel
• Easy to use – you can get started in just a few lines of code
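To illustrate how little changes, a hedged sketch contrasting a local doParallel run with the doAzureParallel equivalent (expensiveSimulation is a hypothetical user function; the loop body is identical in both cases):

```r
library(foreach)

# --- Local: parallelize across the cores of one machine ---
library(doParallel)
registerDoParallel(cores = 4)
results <- foreach(i = 1:100) %dopar% expensiveSimulation(i)

# --- Azure: the only change is which backend is registered ---
library(doAzureParallel)
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)
results <- foreach(i = 1:100) %dopar% expensiveSimulation(i)
```

The foreach loop itself is untouched; swapping the registered backend is the entire code change.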
Elastic Compute
• Compute on demand – create/delete your cluster as you need it
• Autoscaling pool = maximizing cloud elasticity
• Long-running batch jobs / overnight work
• Daily scheduled work – pre-provision the cluster so it's ready for you at the beginning of the day
• Bursty work
Cost Effective
• Low priority = (extremely) low cost
• Provisions VMs from Azure's surplus capacity at up to an 80% discount
• Your Azure cluster can contain both regular (dedicated) VMs and low-priority VMs
[Diagram: a local R session driving an Azure Batch pool that mixes dedicated VMs with low-priority VMs at up to 80% discount]
Cost Effective: More about Low Priority
When should I use it?
• Long-running work that can be broken into smaller pieces and that doesn't have a strict deadline
• Experimentation, testing, and evaluating models
What you need to know when using it:
• It is possible that Azure will not allocate your VMs, or that it will take some or all of the capacity back
• If a node is pre-empted:
  • Azure Batch will replace the node for you
  • Azure Batch will reschedule your work so that your job can successfully complete
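A mixed pool is declared in the cluster configuration file passed to makeCluster. A sketch of a cluster.json combining a small dedicated baseline with a larger low-priority pool (the field values here are illustrative, not recommendations; check the package documentation for the schema of your doAzureParallel version):

```json
{
  "name": "my-mixed-pool",
  "vmSize": "Standard_D2_v2",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": { "min": 2, "max": 2 },
    "lowPriorityNodes": { "min": 0, "max": 10 },
    "autoscaleFormula": "QUEUE"
  }
}
```

The dedicated nodes guarantee baseline capacity; the low-priority nodes add cheap burst capacity that Azure may reclaim, in which case Batch reschedules the affected tasks.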
Low-Priority Scenarios
[Charts: three Azure Batch pool capacity-over-time scenarios, distinguishing dedicated, low-priority, and pre-empted nodes: all low-priority nodes (lowest cost); a dedicated baseline plus low-priority nodes (lower cost with guaranteed baseline capacity); and low-priority nodes with autoscale (lower cost while maintaining capacity)]
Questions? www.github.com/azure/doazureparallel https://aka.ms/earl2017
What’s new with doAzureParallel?
• Low-priority support ✓
• Richer job-management experience ✓
• Resource files to preload data ✓
• Parameter-tuning integration with caret ✓
• Simple connector to Azure Blob Storage ✓
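The caret integration works because caret's train() parallelizes its resampling loop over whatever foreach backend is registered. A sketch of parameter tuning on Azure, assuming a cluster.json as before (the random-forest grid on iris is purely illustrative):

```r
# Sketch: caret grid search fanned out across an Azure Batch cluster.
library(caret)
library(doAzureParallel)

setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)

# allowParallel = TRUE lets train() use the registered %dopar% backend,
# so each cross-validation fold / grid point can run on a separate node.
ctrl <- trainControl(method = "cv", number = 10, allowParallel = TRUE)

model <- train(Species ~ ., data = iris,
               method    = "rf",
               tuneGrid  = expand.grid(mtry = 1:4),
               trControl = ctrl)

stopCluster(cluster)
```

No caret-specific plumbing is needed; registering the backend is enough for grid search, random search, and cross-validation to scale out.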
R + Azure Batch
So what R workloads work great on Azure Batch?
• Simulation-based work (VaR calculation, back-testing, Monte Carlo simulations, financial modelling)
• Parameter tuning / model evaluation (grid search, random search, cross-validation, etc.)
• Computing against data / ETL jobs / data-prep jobs
What industries / verticals might be interested in using this?
• Financial services
• Education & research
• Sports analytics
doAzureParallel (since initial release)
• Initial release in March
• Grass-roots strategy
• End-user focused
• Financial services targeted / key messaging has been around simulation-based work
• Interest from the field
• Feedback