

1. Heterogeneous Datacenters: Options and Opportunities
Jason Cong¹, Muhuan Huang¹·², Di Wu¹·², Cody Hao Yu¹
¹ Computer Science Department, UCLA
² Falcon Computing Solutions, Inc.

2. Data Center Energy Consumption is a Big Deal
In 2013, U.S. data centers consumed an estimated 91 billion kilowatt-hours of electricity, projected to increase to roughly 140 billion kilowatt-hours annually by 2020. That is equivalent to:
• 50 large power plants (500-megawatt, coal-fired)
• $13 billion annually
• 100 million metric tons of carbon pollution per year
(Source: https://www.nrdc.org/resources/americas-data-centers-consuming-and-wasting-growing-amounts-energy)

3. Extensive Efforts on Improving Datacenter Energy Efficiency
◆ Understand the scale-out workloads
§ ISCA'10, ASPLOS'12
§ Mismatch between workloads and processor designs; modern processors are over-provisioned
◆ Trade-off of big-core vs. small-core
§ ISCA'10: web search on small cores with better energy efficiency
§ Baidu taps Marvell for ARM storage-server SoC

4. Focus of Our Research (since 2008) -- Customization
[Figure: beyond parallelization, customization -- adapt the architecture to the application domain. Source: Shekhar Borkar, Intel]


6. Computing Industry is Looking at Customization Seriously
◆ FPGAs gaining popularity among tech giants
§ Microsoft, IBM, Baidu, etc. (e.g., Microsoft Catapult, Intel HARP)
◆ Intel's acquisition of Altera
§ Intel prediction: 30% of datacenter nodes with FPGAs by 2020

7. Contributions of This Paper
◆ Evaluation of different integration options of heterogeneous technologies in datacenters
◆ Efficient programming support for heterogeneous datacenters

8. Small-core on Compute-intensive Workloads
◆ Baselines
§ Xeon: Intel E5-2620, 12-core CPU @ 2.40GHz
§ Atom: Intel D2500 @ 1.8GHz
§ ARM: Cortex-A9 in Zynq @ 800MHz
◆ Data set
§ MNIST: 700K samples, 784 features, 10 labels
◆ Power consumption (averaged)
§ Xeon: 175W/node
§ Atom: 30W/node
§ ARM: 10W/node
◆ Results (normalized to reference Xeon performance)
[Charts: normalized execution time for LR and KM -- 8x ARM: 10.97, 7.8; 8x Atom: 5.26, 3.13. Normalized energy -- 8x ARM: 6.86, 5.21; 8x Atom: 4.88, 3.1]

9. Small Cores Alone Are Not Efficient!

10. Small Core + ACC: FARM
◆ Boost small-core performance with FPGAs
§ 8 Xilinx ZC706 boards
§ 24-port Ethernet switch
§ ~100W power

11. Small-core with FPGA Performance
◆ Setup
§ Data set: MNIST, 700K samples, 784 features, 10 labels
§ Power consumption (averaged): Atom 30W/node, ARM 10W/node
◆ Results (normalized to reference Xeon performance)
[Charts: normalized execution time for LR and KM -- 8x ARM: 10.97, 7.8; 8x Atom: 5.26, 3.13; 8x Zynq: 1.06, 0.69. Normalized energy -- 8x ARM: 6.86, 5.21; 8x Atom: 4.88, 3.1; 8x Zynq: 0.66, 0.43]

12. Small Cores + FPGAs Are More Interesting!

13. Inefficiencies in Small-core
◆ Slower core and memory clocks
§ Task scheduling is slow
§ JVM-to-FPGA data transfer is slow
◆ Limited DRAM size and Ethernet bandwidth
§ Slow data shuffling between nodes
◆ Another option: big-core + FPGA

14. Big-Core + ACC: CDSC FPGA-Enabled Cluster
◆ A 24-node cluster with FPGA-based accelerators
§ Runs on top of Spark and Hadoop (HDFS)
§ Cluster: 1 master/driver, 22 workers, 1 file server, 1 10GbE switch
§ Each node: two Xeon processors, one FPGA PCIe card (Alpha Data), 64 GB RAM, 10GbE NIC
§ Alpha Data board: Virtex-7 FPGA, 16GB on-board RAM

15. Experimental Results
◆ Experimental setup
§ Data set: MNIST, 700K samples, 784 features, 10 labels
◆ Results (normalized to reference Xeon performance)
[Charts: normalized execution time for LR and KM -- 1x Xeon+AlphaData: 0.5, 0.33; 8x Zynq: 1.06, 0.69. Normalized energy -- 1x Xeon+AlphaData: 0.56, 0.38; 8x Zynq: 0.66, 0.43]

16. Overall Evaluation Results
◆ Based on two machine learning workloads
§ Normalized performance (speedup) and energy efficiency (performance/W) relative to big-core solutions

Option            | Performance  | Energy Efficiency
Big-Core + FPGA   | Best, 2.5    | Best, 2.6
Small-Core + FPGA | Better, 1.2  | Best, 1.9
Big-Core          | Good, 1.0    | Good, 1.0
Small-Core        | Bad, 0.25    | Bad, 0.24
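The performance column can be reproduced from the normalized runtimes on the previous slides: speedup is the inverse of the normalized runtime, averaged over LR and KM. Below is a minimal Scala sketch of that arithmetic; the energy-efficiency column additionally depends on the power accounting, which the slides do not fully spell out, so the perfPerWatt helper is only illustrative.

  // Sketch of the normalization behind the table above. speedup() reproduces
  // the performance column from the normalized runtimes on slides 11 and 15;
  // perfPerWatt() shows the shape of the performance/W calculation, but the
  // cluster-power figures needed to reproduce the table exactly are not given
  // on the slides, so it is illustrative only.
  object RelativeResults {
    def speedup(normalizedRuntimes: Seq[Double]): Double =
      normalizedRuntimes.map(1.0 / _).sum / normalizedRuntimes.size

    def perfPerWatt(normalizedRuntime: Double,
                    optionPowerW: Double,
                    baselinePowerW: Double): Double =
      (1.0 / normalizedRuntime) * baselinePowerW / optionPowerW

    def main(args: Array[String]): Unit = {
      // LR and KM runtimes normalized to the big-core baseline (slides 11 and 15)
      println(f"Big-Core+FPGA speedup:   ${speedup(Seq(0.5, 0.33))}%.2f")  // ~2.5
      println(f"Small-Core+FPGA speedup: ${speedup(Seq(1.06, 0.69))}%.2f") // ~1.2
    }
  }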

17. Contributions of This Paper
◆ Evaluation of different integration options of heterogeneous technologies in datacenters
◆ Efficient programming support for heterogeneous datacenters
§ Heterogeneity makes programming hard!

18. More Heterogeneity in the Future
[Figure: a datacenter mixing Intel-Altera HARP nodes, AlphaData PCIe FPGA cards, and NVIDIA GPUs]

19. Programming Challenges
◆ Application level
§ Lots of setup/initialization code
§ Too much hardware-specific knowledge required
§ Java-to-FPGA via JNI: data transfer between host and accelerator
§ Manual data partitioning and task scheduling
◆ FPGA-as-a-Service
§ Sharing, isolation
◆ Heterogeneous hardware (accelerator-rich systems)
§ OpenCL (SDAccel) supports only a single application
§ Lack of portability

20. Introducing Blaze Runtime System
[Architecture: applications (Hadoop MapRed, Spark AccRDD, client) → Blaze runtime system → cluster nodes, each with an ACC Manager → FPGA/GPU accelerator-rich systems]
◆ Friendly interface
◆ User-transparent scheduling
◆ Efficient

21. Overview of Deployment Flow
[Flow: the application designer writes a user application that sends ACC requests, with input and output data, to the node's ACC Manager, which dispatches them to FPGA/GPU accelerators; the accelerator designer registers ACCs into an accelerator look-up table that the manager consults]
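To make the "ACC register / look-up table" step concrete, here is a hedged Scala sketch of what such a registry could look like: the accelerator designer registers a compiled task under an id, and the node manager resolves incoming ACC requests against the table. The names, fields, and path are illustrative, not Blaze's actual API.

  import scala.collection.concurrent.TrieMap

  // Hedged sketch of the accelerator look-up table in the deployment flow above.
  // An accelerator designer registers a compiled task (e.g., a shared library)
  // under an id; the node-level manager resolves application ACC requests by id.
  final case class AccEntry(id: String, libraryPath: String, device: String)

  object AccRegistry {
    private val table = TrieMap.empty[String, AccEntry]

    def register(entry: AccEntry): Unit = table.put(entry.id, entry)

    // Returns None if nothing is registered under this id, in which case the
    // request would fall back to a software implementation.
    def lookup(id: String): Option[AccEntry] = table.get(id)
  }

  // Usage (illustrative):
  //   AccRegistry.register(AccEntry("Logistic", "/path/to/acc_task.so", "FPGA"))
  //   AccRegistry.lookup("Logistic")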

22. Programming Interface for Application

Original Spark code:

  val points = sc.textFile().cache()
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }

With Blaze:

  val points = blaze.wrap(sc.textFile())
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(
      new LogisticGrad(w)
    ).reduce(_ + _)
    w -= gradient
  }

  class LogisticGrad(..) extends Accelerator[T, U] {
    val id: String = "Logistic"
    def call(in: T): U = { p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    }
  }

Spark.mllib integration: no user code changes or recompilation.
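Following the same pattern, other map kernels can be wrapped as Accelerator subclasses. The sketch below applies it to a KMeans-style nearest-center assignment; because the slide abbreviates the Accelerator trait, a minimal stand-in is declared here so the example is self-contained, and the class name and accelerator id are hypothetical.

  // Stand-in for Blaze's Accelerator trait, mirroring the members shown on the
  // slide (id and call); the real trait lives in the Blaze runtime library.
  trait Accelerator[T, U] extends Serializable {
    def id: String
    def call(in: T): U
  }

  // Hedged sketch: a KMeans-style kernel wrapped the same way as LogisticGrad.
  // "KMeansAssign" is a hypothetical accelerator id; call() gives the functional
  // (software) definition of the kernel, while id selects a registered
  // accelerator implementation when one is available on the node.
  class NearestCenter(centers: Array[Array[Double]])
      extends Accelerator[Array[Double], Int] {

    val id: String = "KMeansAssign"

    def call(point: Array[Double]): Int = {
      def dist2(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
      centers.indices.minBy(i => dist2(point, centers(i)))
    }
  }

  // Usage (illustrative): blaze.wrap(points).map(new NearestCenter(centers))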

23. Under the Hood: Getting Data to the FPGA
[Data path: Spark task → (inter-process memcpy) → NAM → (PCIe memcpy) → FPGA device]
[Time breakdown (pie chart): receive data, data preprocessing, data transfer, FPGA computation, other -- shares shown: 3.84%, 17.99%, 28.06%, 17.99%, 32.13%]
◆ Solutions
§ Data caching (see the sketch below)
§ Pipelining
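As a concrete illustration of the data-caching solution listed above, the sketch below keeps the native-ready (serialized) form of each input block keyed by a block id, so iterative workloads pay the JVM-to-native conversion only once. This is a hypothetical helper, not Blaze's implementation.

  import java.nio.ByteBuffer
  import scala.collection.concurrent.TrieMap

  // Hedged sketch of the "data caching" idea: serialize each input block into a
  // native-friendly byte buffer once, keyed by a block id, so later iterations
  // skip the repeated JVM-to-native copy. Illustrative only.
  object NativeBlockCache {
    private val cache = TrieMap.empty[Long, Array[Byte]]

    def getOrSerialize(blockId: Long, block: => Array[Double]): Array[Byte] =
      cache.getOrElseUpdate(blockId, serialize(block))

    private def serialize(values: Array[Double]): Array[Byte] = {
      val buf = ByteBuffer.allocate(values.length * java.lang.Double.BYTES)
      values.foreach(v => buf.putDouble(v))
      buf.array()
    }
  }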

24. Programming Interface for Accelerator

  class LogisticACC : public Task {   // extend the basic Task interface
   public:
    LogisticACC() : Task(2) {}        // specify # of inputs

    // overwrite the compute function
    virtual void compute() {
      // get input/output using provided APIs
      int num_sample  = getInputNumItems(0);
      int data_length = getInputLength(0) / num_sample;
      int weight_size = getInputLength(1);
      double *data    = (double*)getInput(0);
      double *weights = (double*)getInput(1);
      double *grad    = (double*)getOutput(0, weight_size, sizeof(double));
      double *loss    = (double*)getOutput(1, 1, sizeof(double));

      // perform computation
      RuntimeClient runtimeClient;
      LogisticApp theApp(out, in, data_length * sizeof(double), &runtimeClient);
      theApp.run();
    }
  };

Compiled to an ACC_Task shared library (*.so).

25. Complete Accelerator Management Solution
◆ Global Accelerator Manager (GAM)
§ Global resource allocation within a cluster to optimize system throughput and accelerator utilization
◆ Node Accelerator Manager (NAM)
§ Virtualizes accelerators for application tasks
§ Provides accelerator sharing/isolation
[Stack: Spark / MPI / MapReduce / other frameworks → Global Acc Manager alongside YARN → distributed file system (HDFS) → nodes, each running a Node Acc Manager over its accelerators]
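A hedged sketch of the node-level sharing/isolation idea: application tasks never open the device themselves; they ask the NAM for an accelerator slot, and a bounded, fair pool serializes access so concurrent applications time-share the hardware. The names here are illustrative, not the actual NAM interface.

  import java.util.concurrent.Semaphore

  // Hedged sketch of NAM-style sharing/isolation: a bounded, fair pool of
  // accelerator slots is multiplexed among client tasks, so applications never
  // hold the device directly. Illustrative only.
  class AcceleratorSlotPool(numSlots: Int) {
    private val slots = new Semaphore(numSlots, true) // fair queuing across tasks

    def withSlot[A](task: => A): A = {
      slots.acquire()
      try task
      finally slots.release()
    }
  }

  // Usage (illustrative):
  //   val pool = new AcceleratorSlotPool(numSlots = 1)
  //   pool.withSlot { /* send the request to the FPGA and wait for the result */ }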

26. Global Accelerator Management
◆ Places accelerable applications on the nodes with accelerators
§ Minimizes FPGA reprogramming overhead via accelerator-locality-aware scheduling (see the sketch below)
§ → less reprogramming and less straggler effect
§ Naïve allocation: multiple applications share the same FPGA, so frequent reprogramming is needed
§ Better allocation: applications on a node use the same accelerator, so no reprogramming is needed
◆ Heartbeats with the node accelerator managers to collect accelerator status
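A minimal sketch of the accelerator-locality-aware policy described above: prefer a node whose FPGA already holds the requested accelerator, and only fall back to reprogramming some other free node when none is available. The data model is hypothetical, not the paper's GAM implementation.

  // Hedged sketch of accelerator-locality-aware placement. A node that already
  // has the requested bitstream loaded is preferred (no reprogramming);
  // otherwise any node with free capacity is chosen and would be reprogrammed.
  final case class NodeState(name: String, loadedAcc: Option[String], freeSlots: Int)

  object LocalityAwareGam {
    def place(accId: String, nodes: Seq[NodeState]): Option[String] = {
      def freest(candidates: Seq[NodeState]): Option[NodeState] =
        if (candidates.isEmpty) None else Some(candidates.maxBy(_.freeSlots))

      val alreadyLoaded = nodes.filter(n => n.loadedAcc.contains(accId) && n.freeSlots > 0)
      val anyFree       = nodes.filter(_.freeSlots > 0)

      freest(alreadyLoaded)        // locality hit: no FPGA reprogramming
        .orElse(freest(anyFree))   // locality miss: reprogram this node
        .map(_.name)
    }
  }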

27. Results of GAM Optimizations
◆ Policies compared
§ Static partition: manual cluster partition into a KM cluster and an LR cluster
§ Naïve sharing: workloads are free to run on any nodes
§ GAM
[Chart: normalized throughput of static partition, naïve sharing, and GAM as the ratio of LR workloads in the mixed LR-KM workloads varies from 0 to 1; GAM achieves the highest throughput, up to ~1.9x, while naïve sharing falls as low as ~0.6x]

28. Productivity and Efficiency of Blaze
Lines of Code (LOC):

Workload            | Application | Accelerator setup | Partial FaaS
Logistic Regression | 26 → 9      | 104 → 99          | 325 → 0
Kmeans              | 37 → 7      | 107 → 103         | 364 → 0
Compression         | 1 → 1       | 70 → 65           | 360 → 0
Genomics            | 1 → 1       | 227 → 142         | 896 → 0

§ The accelerator kernels are wrapped as a function
§ Partial FaaS does not support accelerator sharing among different applications

29. Blaze Overhead Analysis
◆ Blaze has overhead compared to a manual design
◆ With multiple threads the overhead can be effectively mitigated
[Charts: execution-time breakdown (ms) for manual vs. Blaze -- JVM-to-native, native-to-FPGA, FPGA kernel, software, native-private-to-share; throughput normalized to the manual design for 1 thread vs. 12 threads]

30. Acknowledgements -- CDSC and C-FAR
◆ Center for Domain-Specific Computing (CDSC), under the NSF Expeditions in Computing Program
◆ C-FAR Center, under the STARnet Program
Yuting Chen (UCLA), Hui Huang (UCLA), Muhuan Huang (UCLA), Cody Hao Yu (UCLA), Di Wu (UCLA), Bingjun Xiao (UCLA)
