ACCELERATOR SCHEDULING AND MANAGEMENT USING A RECOMMENDATION SYSTEM

  1. ACCELERATOR SCHEDULING AND MANAGEMENT USING A RECOMMENDATION SYSTEM
     David Kaeli
     Department of Electrical and Computer Engineering
     Northeastern University, Boston, MA
     kaeli@ece.neu.edu

  2. POPULARITY OF GPUS IN COMPUTING DOMAINS
     § GPUs are popular in two major compute domains:
       § HPC: supercomputing, cloud engines
       § SoC: low-power handheld devices
     § Growth of GPUs in HPC:
       § Avg. 10% growth per year in GPU-based supercomputers in the Top 500 listing
       § The Titan supercomputer records 27.1 PFLOPS using Nvidia K20x GPUs (source: http://top500.org/)
     § Growth in the SoC market:
       § 2.1x increase in GPU-based SoC shipments (in millions of units) every year
       § Projected to increase to 4x annually

  3. PERFORMANCE: CPU VS. GPU (DP FLOPS)
     [Chart: double-precision FLOPS of CPUs vs. GPUs over time]
     Source: Karl Rupp’s website: www.karlrupp.net

  4. RANGE OF APPLICATIONS FOR MODERN GPUS
     Traditional Compute
     • Data parallel
     • Good scalability
     • Example applications: FFT, matrix multiply, NBody, convolution
     Machine Learning
     • Varied kernel sizes
     • Data-dependent kernel launch
     • Non-deterministic launch order
     • Multiple situation-based kernels
     • E.g., outlier detection, DNN, Hidden Markov Models
     System Operations and Security
     • Independent compute kernels with same data
     • Fragmented parallelism
     • E.g., encryption, garbage collection, file-system tasks
     Signal Processing
     • Stage-based computation kernels
     • Kernel replication (channels)
     • Performance bound (video)
     • E.g., filtering, H264 audio, JPEG compression, video processing
     Irregular Computations
     • Synchronization (sorting)
     • Nested parallelism (graph search)
     • E.g., graph traversal, GPU-based sorting, list intersection

  5. KEY FEATURES REQUIRED IN GPUS TO SUPPORT MODERN APPLICATIONS
     § Collaborative Execution
       ü Leverage multiple GPUs and the CPU concurrently to run a single problem
     § Applying Machine Learning approaches to tuning power/performance
       ü Machine learning algorithms are being used everywhere
     § Load Balancing
       ü Prevent starvation when executing multiple kernels together
     § Multiple Application Hosting
       ü Beneficial for cloud GPUs to mitigate user load and improve power efficiency
     § Time, Power and Performance QoS Objectives
       ü Maintain deadlines for low-latency applications

  6. MACHINE LEARNING BASED SCHEDULING FRAMEWORK FOR GPUS
     § GPUs are already appearing as cloud-based instances:
       § Amazon Web Services
       § Nimbix
       § Peer1 Hosting
       § IBM Cloud
       § SkyScale
       § SoftLayer
     § Co-executing applications on GPUs may interfere with each other
     § Interference leads to individual slowdown and reduces system throughput
     ü Mystic: a framework for interference-aware scheduling of applications on GPU clusters/clouds using machine learning

  7. INTERFERENCE ON MULTI-TASKED GPUS
     § Interference is any performance degradation caused by resource contention and conflicts between co-executing applications
     § 45 applications launched on an 8-GPU cluster using GPU-remoting
     § Least Loaded (LL) and Round Robin (RR) schedulers used
     § 41% avg. slowdown over dedicated execution
     § 8 applications show 2.5x slowdown
     § Only 6 applications achieve QoS (80% of dedicated performance)
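
     To make the QoS criterion above concrete, here is a minimal sketch (in Python, with illustrative numbers rather than measured data) of how slowdown and the 80%-of-dedicated-performance QoS check can be computed:

     ```python
     # Minimal sketch: slowdown and the QoS check described on this slide.
     # Timing values are illustrative, not measured data.

     def slowdown(dedicated_time: float, shared_time: float) -> float:
         """Slowdown of an app when co-executed vs. running alone (dedicated)."""
         return shared_time / dedicated_time

     def meets_qos(dedicated_time: float, shared_time: float,
                   threshold: float = 0.80) -> bool:
         """QoS is met if the app retains at least 80% of dedicated performance."""
         achieved = dedicated_time / shared_time  # normalized performance
         return achieved >= threshold

     print(slowdown(10.0, 14.1))   # 1.41 -> a 41% slowdown
     print(meets_qos(10.0, 14.1))  # False: only ~71% of dedicated performance
     ```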

  8. MYSTIC: MACHINE LEARNING BASED INTERFERENCE DETECTION
     § Mystic is a layer implemented on the head node of a cluster
     § Stage 1:
       § Initialize Mystic
       § Create a status entry for each incoming application
       § Collect short, limited profiles for new apps
     § Stage 2:
       § Predict missing metrics for short profiles using collaborative filtering with SVD and the training matrix
       § Fill the PRT with predicted performance values
     § Stage 3:
       § Detect similarity between current and existing applications using the PRT and MAST
       § Generate interference scores
       § Select co-execution candidates
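
     The Stage-2 prediction can be pictured as low-rank matrix completion. Below is a minimal sketch, assuming a profile matrix whose rows are applications and whose columns are performance metrics; the iterative truncated-SVD completion shown is a generic stand-in for the collaborative filtering step, not Mystic's exact implementation, and all values are made up:

     ```python
     import numpy as np

     # Sketch of Stage 2: predict missing profile metrics with truncated SVD.
     # Rows = applications, columns = performance metrics; NaN = not profiled.
     # Generic low-rank completion, not Mystic's exact algorithm.

     def svd_complete(R: np.ndarray, rank: int = 2, iters: int = 50) -> np.ndarray:
         mask = ~np.isnan(R)
         filled = np.where(mask, R, np.nanmean(R, axis=0))  # init with column means
         for _ in range(iters):
             U, s, Vt = np.linalg.svd(filled, full_matrices=False)
             low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-k approximation
             filled = np.where(mask, R, low_rank)             # keep observed entries
         return filled

     # Training-matrix rows plus one short profile with missing metrics (NaN).
     trm = np.array([
         [0.9, 0.2, 0.7, 0.1],
         [0.8, 0.3, 0.6, 0.2],
         [0.1, 0.9, 0.2, 0.8],
         [0.7, np.nan, np.nan, 0.1],  # incoming app: only 2 of 4 metrics profiled
     ])
     print(svd_complete(trm).round(2))
     ```

     Once the missing metrics are filled in, Stage 3 can compare the completed rows (e.g., by a similarity measure such as cosine similarity) to score how strongly two applications are likely to interfere and pick co-execution candidates.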

  9. EVALUATION METHODOLOGY
     § Mystic evaluated on a private cluster running rCUDA [UPV]
     § Node configuration:
       § Xeon E5-2695 CPU
       § 2 Nvidia K40m GPUs
       § 24GB/16GB memory (head/compute)
       § InfiniBand FDR interconnect
       § 12GB GPU DRAM per device
     § 42 applications for training matrix (TRM) creation
     § 55 applications for testing Mystic
     § 100 random launch sequences

     Suite           # Apps  Applications
     PolyBench-GPU   14      2dConv, 3dConv, 3mm, Atax, Bicg, Gemm, Gesummv, Gramschmidt, mvt, Syr2k, Syrk, Correlation, Covariance, FDTD-2d
     SHOC-GPU        12      BFS-Shoc, FFT, MD, MD5Hash, NeuralNet, Scan, Sort, Reduction, Spmv, Stencil-2D, Triad, qtClustering
     Lonestar-GPU    6       BHNbody, BFS, MST, DMR, SP, Sssp
     NUPAR           5       CCL, LSS, HMM, LoKDR, IIR
     openFOAM        8       Dsmc, PDR, thermo, buoyantS, mhd, simpleFoam, sprayFoam, driftFlux
     ASC Proxy Apps  4       CoMD, LULESH, miniFE, miniMD
     Caffe-cuDNN     3       leNet, logReg, fineTune
     MAGMA           3       Dasaxpycp, strstm, dgesv

  10. MYSTIC SCHEDULER PERFORMANCE
      • 90.2% of applications achieve QoS with Mystic
      • Applications experience less than 15% degradation with Mystic
      • Under LL, 75% of applications show severe degradation
      • Short-running interfering apps are co-scheduled
      • RR achieves QoS for only 21% of applications
      • Mystic scales effectively as we increase the number of nodes
      • LL utilizes GPUs more effectively than RR for fewer nodes

  11. SUMMARIZING THE MYSTIC SCHEDULER
      § Mystic is an interference-aware scheduler for GPU clusters
      § Collaborative filtering is used to predict the performance of incoming applications
      § Mystic assigns applications for co-execution so as to minimize interference
      § Evaluated on a 16-node cluster with 55 applications and 100 launch sequences
      “Mystic: Predictive Scheduling for GPU based Cloud Servers Using Machine Learning,” Y. Ukidave, X. Li, and D. Kaeli, 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2016.

  12. AIRAVAT: IMPROVING THE ENERGY EFFICIENCY OF HETEROGENEOUS SYSTEMS/APPLICATIONS
      § New platforms can support concurrent CPU and GPU execution
      § Peak performance/power can be achieved when both devices are used collaboratively
      § Emerging heterogeneous platforms (APUs, Jetson, etc.) have the CPU and GPU share a single memory
      § Multiple/shared clock domains for CPU, GPU, and memory
      § Airavat provides a framework that can improve the Energy Delay Product (EDP) on NVIDIA Jetson GPUs, providing significant power/performance benefits when compared to the NVIDIA baseline power manager
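
      The Energy Delay Product that Airavat optimizes is the product of the energy a run consumes and its runtime, so lower is better. A minimal sketch, with illustrative numbers rather than measured Jetson data, of how an EDP reduction is computed:

      ```python
      # Energy-Delay Product (EDP): standard metric trading off energy and runtime.
      # EDP = energy (J) * delay (s); lower is better. Values are illustrative.

      def edp(power_watts: float, runtime_s: float) -> float:
          energy_j = power_watts * runtime_s
          return energy_j * runtime_s

      baseline = edp(power_watts=10.0, runtime_s=5.0)  # 50 J over 5 s -> 250 J*s
      tuned    = edp(power_watts=9.0,  runtime_s=4.6)  # better frequency tuple
      print(f"EDP reduction: {(1 - tuned / baseline):.1%}")  # ~23.8%
      ```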

  13. AIRAVAT: 2-LEVEL POWER MANAGEMENT FRAMEWORK
      Frequency Approximation Layer (FAL)
      • Uses Random Forest based prediction to select the best frequency tuple for the application
      Collaborative Tuning Layer (CTL)
      • Leverages run-time feedback based on performance counters to fine-tune the CPU/GPU running collaboratively
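
      A minimal sketch of the FAL idea: train a random forest on profiled application features and use it to pick a (CPU, GPU, memory) frequency tuple. The feature set, frequency table, and training data below are hypothetical illustrations, not Airavat's actual model:

      ```python
      # Sketch of the Frequency Approximation Layer: a random-forest classifier
      # maps application features to a (CPU, GPU, memory) frequency tuple.
      # Features, frequencies, and labels are hypothetical, not Airavat's model.
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier

      # Each row: [cpu_utilization, gpu_utilization, mem_bandwidth_util] profiled.
      X_train = np.array([
          [0.9, 0.2, 0.3],   # CPU-bound app
          [0.2, 0.9, 0.7],   # GPU-bound app
          [0.5, 0.5, 0.9],   # memory-bound app
          [0.8, 0.3, 0.2],
          [0.3, 0.8, 0.8],
      ])
      # Labels index into a table of (cpu_MHz, gpu_MHz, mem_MHz) tuples.
      freq_table = [(2000, 300, 800), (1200, 900, 1600), (1000, 600, 1866)]
      y_train = np.array([0, 1, 2, 0, 1])

      model = RandomForestClassifier(n_estimators=100, random_state=0)
      model.fit(X_train, y_train)

      new_app = np.array([[0.25, 0.85, 0.75]])  # short profile of incoming app
      best = freq_table[model.predict(new_app)[0]]
      print(f"Predicted frequency tuple (CPU, GPU, mem) in MHz: {best}")
      ```

      The CTL would then adjust this initial tuple at run time using performance-counter feedback.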

  14. SUMMARY OF AIRAVAT
      • Average EDP reduction is 24%, and can be as high as 70%, when compared to the baseline power manager
      • FAL provides up to 19% energy savings, with an additional 5% from CTL
      • Airavat achieves an application speedup of 1.24x
      • The EDP difference when compared to an oracle power manager is 7%
      • Airavat incurs a 4% increase in power, since it trades higher power for better energy and performance
      “Airavat: Improving Energy Efficiency of Heterogeneous Applications,” accepted to DATE 2018

  15. OTHER AREAS OF ONGOING RESEARCH IN NUCAR
      § Multi2Sim simulator – Kepler CPU/GPU timing simulator – ISPASS’17, HSAsim ongoing
      § Multi-GPU and GPU-based NoC designs – HiPEAC’17
      § NICE – Northeastern Interactive Clustering Engine – NIPS’17
      § PCM reliability – DATE’17, DSN’17 – with Penn St.
      § GPU reliability – SELSE’16, SELSE’17, MWCSC’17, DATE’18
      § GPU hardware transactional memory – EuroPar’17, IEEE TC – with U. of Malaga
      § GPU compilation – CGO’17, ongoing with U. of Ghent
      § GPU scalar execution – ISPASS’13, IPDPS’14, IPDPS’16
      § GPU recursive execution on graphs – APG’17

  16. THE NORTHEASTERN UNIVERSITY COMPUTER ARCHITECTURE RESEARCH LAB

  17. QUESTIONS? THANKS TO OUR SPONSORS …
