ACCELERATOR SCHEDULING AND MANAGEMENT USING A RECOMMENDATION SYSTEM
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli@ece.neu.edu
POPULARITY OF GPUS IN COMPUTING DOMAINS
§ GPUs are popular in two major compute domains:
  § HPC: Supercomputing, cloud engines
  § SoC: Low-power handheld devices
§ Growth of GPUs in HPC:
  § Avg. 10% GPU-based supercomputers every year in the Top 500 listing (Source: http://top500.org/)
  § Titan supercomputer records 27.1 PFLOPS using Nvidia K20x GPUs
§ Growth in the SoC market:
  § 2.1x increase in GPU-based SoC shipments (in millions) every year
  § Projected to increase to 4x annually
PERFORMANCE: CPU VS. GPU (DP FLOPS)
Source: Karl Rupp's website: www.karlrupp.net
RANGE OF APPLICATIONS FOR MODERN GPUS
Traditional Compute
• Data parallel
• Good scalability
• Example applications: FFT, matrix multiply, NBody, convolution
Machine Learning
• Varied kernel sizes
• Data-dependent kernel launch
• Non-deterministic launch order
• Multiple situation-based kernels
• E.g., outlier detection, DNN, Hidden Markov Models
System Operations and Security
• Independent compute kernels with same data
• Fragmented parallelism
• E.g., encryption, garbage collection, file-system tasks
Signal Processing
• Stage-based computation kernels
• Kernel replication (channels)
• Performance bound (video)
• E.g., filtering, H264 audio, JPEG compression, video processing
Irregular Computations
• Synchronization (sorting)
• Nested parallelism (graph search)
• E.g., graph traversal, GPU-based sorting, list intersection
KEY FEATURES REQUIRED IN GPUS TO SUPPORT MODERN APPLICATIONS
§ Collaborative Execution
  ✓ Leverage multiple GPUs and the CPU concurrently to run a single problem
§ Applying Machine Learning approaches to tuning power/performance
  ✓ Machine learning algorithms are being used everywhere
§ Load Balancing
  ✓ Prevent starvation when executing multiple kernels together
§ Multiple Application Hosting
  ✓ Beneficial for cloud GPUs to mitigate user load and improve power efficiency
§ Time, Power and Performance QoS Objectives
  ✓ Maintain deadlines for low-latency applications
MACHINE LEARNING BASED SCHEDULING FRAMEWORK FOR GPUS
§ GPUs are already appearing as cloud-based instances:
  § Amazon Web Services
  § Nimbix
  § Peer1 Hosting
  § IBM Cloud
  § SkyScale
  § SoftLayer
§ Co-executing applications on GPUs may interfere with each other
§ Interference leads to individual slowdown and reduces system throughput
  ✓ Mystic: Framework for interference-aware scheduling of applications on GPU clusters/clouds using machine learning
INTERFERENCE ON MULTI-TASKED GPUS
§ Interference is any performance degradation caused by resource contention and conflicts between co-executing applications
§ 45 applications launched on an 8-GPU cluster using GPU remoting
§ Least-Loaded (LL) and Round-Robin (RR) schedulers used
§ 41% avg. slowdown over dedicated execution
§ 8 applications show 2.5x slowdown
§ Only 6 applications achieve QoS (80% of dedicated performance; see the sketch below)
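To make the slowdown and QoS numbers above concrete, here is a minimal Python sketch of how per-application slowdown and the 80%-of-dedicated QoS test can be computed. The timing values are hypothetical placeholders, not measurements from the study.

    # Minimal sketch of the slowdown and QoS metrics described above.
    # The timing numbers are made-up placeholders, not data from the study.

    def slowdown(t_shared, t_dedicated):
        """Slowdown of an application when co-executed vs. running alone (dedicated)."""
        return t_shared / t_dedicated

    def meets_qos(t_shared, t_dedicated, threshold=0.80):
        """QoS is met if shared-mode performance stays within 80% of dedicated performance."""
        return (t_dedicated / t_shared) >= threshold

    t_alone, t_coexec = 10.0, 14.1           # seconds (placeholder values)
    print(slowdown(t_coexec, t_alone))       # 1.41 -> a 41% slowdown
    print(meets_qos(t_coexec, t_alone))      # False: 10.0 / 14.1 ~ 0.71 < 0.80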
MYSTIC: MACHINE LEARNING BASED INTERFERENCE DETECTION
§ Mystic is a layer implemented on the head node of a cluster
§ Stage-1
  § Initialize Mystic
  § Create a status entry for incoming applications
  § Collect short, limited profiles for new applications
§ Stage-2
  § Predict missing metrics for short profiles using Collaborative Filtering with SVD and the training matrix
  § Fill the PRT with predicted performance values
§ Stage-3
  § Detect similarity between the current and existing applications using the PRT and MAST
  § Generate interference scores
  § Select co-execution candidates (see the sketch below)
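The SVD-based collaborative filtering in Stage-2 and the similarity scoring in Stage-3 can be illustrated with the following Python sketch. This is an illustrative reconstruction, not the Mystic implementation: the matrix layout (rows = profiled applications, columns = performance metrics), the rank, and the cosine-similarity pairing rule are assumptions made for the example.

    import numpy as np

    # Sketch of Stage-2/Stage-3. Assumptions (not from the paper's code):
    #   trm     : training matrix, rows = fully profiled applications, cols = performance metrics
    #   partial : short profile of an incoming application, NaN where a metric was not collected

    def predict_missing(trm, partial, rank=4):
        """Fill in missing metrics of a short profile using a truncated SVD of the training matrix."""
        U, s, Vt = np.linalg.svd(trm, full_matrices=False)
        basis = Vt[:rank]                                  # top-k directions in metric space
        observed = ~np.isnan(partial)
        # Least-squares fit of the observed metrics onto the low-rank basis
        coeffs, *_ = np.linalg.lstsq(basis[:, observed].T, partial[observed], rcond=None)
        estimate = coeffs @ basis                          # reconstruct every metric
        filled = partial.copy()
        filled[~observed] = estimate[~observed]
        return filled

    def interference_score(profile_a, profile_b):
        """Cosine similarity: more overlap in resource demand -> higher expected interference."""
        num = float(profile_a @ profile_b)
        den = float(np.linalg.norm(profile_a) * np.linalg.norm(profile_b)) + 1e-12
        return num / den

    def pick_coexecution_partner(new_profile, running_profiles):
        """Choose the already-running application predicted to interfere the least."""
        scores = [interference_score(new_profile, p) for p in running_profiles]
        return int(np.argmin(scores))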
EVALUATION METHODOLOGY
§ Mystic evaluated on a private cluster running rCUDA [UPV]
§ Node configuration:
  § Xeon E5-2695 CPU
  § 2 Nvidia K40m GPUs
  § 24GB/16GB memory (head/compute)
  § IB FDR interconnect
  § 12GB GPU DRAM per device
§ 42 applications for training matrix (TRM) creation
§ 55 applications for testing Mystic
§ 100 random launch sequences

Application suites (suite, # apps, applications):
§ PolyBench-GPU (14): 2dConv, 3dConv, 3mm, Atax, Bivg, Gemm, Gesummv, Gramschmidt, mvt, Syr2k, Syrk, Correlation, Covariance, FDTD-2d
§ SHOC-GPU (12): BFS-Shoc, FFT, MD, MD5Hash, NeuralNet, Scan, Sort, Reduction, Spmv, Stencil-2D, Triad, qtClustering
§ Lonestar-GPU (6): BHNbody, BFS-LS, MST, DMR, SP, Sssp
§ NUPAR (5): CCL, LSS, HMM, LoKDR, IIR
§ openFOAM (8): Dsmc, PDR, thermo, buoyantS, mhd, simpleFoam, sprayFoam, driftFlux
§ ASC Proxy Apps (4): CoMD, LULESH, miniFE, miniMD
§ Caffe-cuDNN (3): leNet, logReg, fineTune
§ MAGMA (3): Dasaxpycp, strstm, dgesv
MYSTIC SCHEDULER PERFORMANCE
• 90.2% of applications achieve QoS with Mystic
• LL experiences less than 15% degradation with Mystic
• RR achieves QoS for 21% of applications
• 75% of applications show severe degradation
• Short-running interfering apps are co-scheduled
• Mystic scales effectively as we increase the number of nodes
• LL utilizes GPUs more effectively than RR for fewer nodes
SUMMARIZING THE MYSTIC SCHEDULER
§ Mystic is an interference-aware scheduler for GPU clusters
§ Collaborative filtering is used to predict the performance of incoming applications
§ Mystic assigns applications for co-execution so as to minimize interference
§ Evaluated Mystic on a 16-node cluster with 55 applications and 100 launch sequences

"Mystic: Predictive Scheduling for GPU based Cloud Servers Using Machine Learning," U. Ukidave, X. Li and D. Kaeli, 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2016.
AIRAVAT: IMPROVING THE ENERGY EFFICIENCY OF HETEROGENEOUS SYSTEMS/APPLICATIONS
§ New platforms can support concurrent CPU and GPU execution
§ Peak performance/power can be achieved when both devices are used collaboratively
§ Emerging heterogeneous platforms (APUs, Jetson, etc.) have the CPU and GPU sharing a single memory
§ Multiple/shared clock domains for the CPU, GPU and memory
§ Airavat provides a framework that improves the Energy Delay Product (EDP; see the short sketch below) on NVIDIA Jetson GPUs, providing significant power/performance benefits compared to the NVIDIA baseline power manager
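For reference, the Energy Delay Product that Airavat optimizes is simply energy multiplied by execution time, so it rewards configurations that save energy without giving up performance. A tiny Python sketch with placeholder numbers (not results from the study):

    def edp(energy_joules, runtime_seconds):
        """Energy Delay Product: lower is better."""
        return energy_joules * runtime_seconds

    # Placeholder measurements, purely for illustration
    baseline = edp(energy_joules=120.0, runtime_seconds=10.0)
    tuned    = edp(energy_joules=100.0, runtime_seconds=9.0)
    print(1.0 - tuned / baseline)   # fractional EDP reduction vs. the baseline (0.25 here)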
AIRAVAT: 2-LEVEL POWER MANAGEMENT FRAMEWORK
Frequency Approximation Layer (FAL)
• Uses Random Forest based prediction to select the best frequency tuple for the application
Collaborative Tuning Layer (CTL)
• Leverages run-time feedback based on performance counters to fine-tune the CPU/GPU running collaboratively (see the sketch below)
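A minimal sketch of how the two layers could fit together, assuming a scikit-learn random forest trained over per-application features and a DVFS knob exposed as a (CPU, GPU, memory) frequency tuple. The feature vectors, frequency table, and counter-based adjustment rules below are illustrative assumptions, not Airavat's actual model or interface.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Frequency Approximation Layer (FAL): predict a (cpu, gpu, mem) frequency tuple (MHz)
    # from application features. The training data below is random placeholder data.
    freq_table = [(1100, 600, 800), (1400, 700, 1600), (1900, 900, 1600)]   # illustrative tuples

    X_train = np.random.rand(200, 6)                              # placeholder feature vectors
    y_train = np.random.randint(0, len(freq_table), size=200)     # index of best tuple per app
    fal = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

    def fal_predict(app_features):
        """Pick the frequency tuple the forest considers best for this application."""
        idx = int(fal.predict(app_features.reshape(1, -1))[0])
        return freq_table[idx]

    # Collaborative Tuning Layer (CTL): nudge the predicted tuple using run-time counters.
    def ctl_adjust(freqs_mhz, counters):
        cpu, gpu, mem = freqs_mhz
        if counters["gpu_utilization"] < 0.5:        # GPU mostly idle -> lower the GPU clock
            gpu = max(gpu - 100, 300)
        if counters["mem_bandwidth_util"] > 0.9:     # memory-bound -> raise the memory clock
            mem = min(mem + 400, 1600)
        return (cpu, gpu, mem)

    # Example: predict, then refine with feedback from hypothetical counters
    tuple_0 = fal_predict(np.random.rand(6))
    tuple_1 = ctl_adjust(tuple_0, {"gpu_utilization": 0.4, "mem_bandwidth_util": 0.95})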
SUMMARY OF AIRAVAT
• Average EDP reduction is 24%, and can be as high as 70%, when compared to the baseline power manager
• FAL provides up to 19% energy savings, with an additional 5% from CTL
• Airavat achieves an application speedup of 1.24x
• The EDP difference compared to an oracle power manager is 7%
• Airavat incurs 4% higher power, since it trades increased power for better energy and performance

*"Airavat: Improving Energy Efficiency of Heterogeneous Applications," accepted to DATE 2018
OTHER AREAS OF ONGOING RESEARCH IN NUCAR
§ Multi2Sim simulator – Kepler CPU/GPU timing simulator – ISPASS'17, HSAsim ongoing
§ Multi-GPU and GPU-based NoC designs – HiPEAC'17
§ NICE – Northeastern Interactive Clustering Engine – NIPS'17
§ PCM reliability – DATE'17, DSN'17 – with Penn St.
§ GPU reliability – SELSE'16, SELSE'17, MWCSC'17, DATE'18
§ GPU Hardware Transactional Memory – EuroPar'17, IEEE TC – with U. of Malaga
§ GPU compilation – CGO'17, ongoing with U. of Ghent
§ GPU scalar execution – ISPASS'13, IPDPS'14, IPDPS'16
§ GPU recursive execution on graphs – APG'17
THE NORTHEASTERN UNIVERSITY COMPUTER ARCHITECTURE RESEARCH LAB
QUESTIONS?
THANKS TO OUR SPONSORS…