

1. Heterogeneous Datacenters: Options and Opportunities
Jason Cong¹, Muhuan Huang¹·², Di Wu¹·², Cody Hao Yu¹
¹ Computer Science Department, UCLA
² Falcon Computing Solutions, Inc.

2. Data Center Energy Consumption is a Big Deal
In 2013, U.S. data centers consumed an estimated 91 billion kilowatt-hours of electricity, projected to increase to roughly 140 billion kilowatt-hours annually by 2020. That is equivalent to:
• 50 large power plants (500-megawatt, coal-fired)
• $13 billion annually
• 100 million metric tons of carbon pollution per year
(Source: https://www.nrdc.org/resources/americas-data-centers-consuming-and-wasting-growing-amounts-energy)

3. Extensive Efforts on Improving Datacenter Energy Efficiency
◆ Understand the scale-out workloads
§ ISCA'10, ASPLOS'12
§ Mismatch between workloads and processor designs; modern processors are over-provisioned
◆ Trade-off of big-core vs. small-core
§ ISCA'10: web search on small cores with better energy efficiency
§ Baidu taps Marvell for ARM storage-server SoC

4. Focus of Our Research (since 2008) -- Customization
[Figure: beyond parallelization, customization -- adapt the architecture to the application domain. Source: Shekhar Borkar, Intel]


6. Computing Industry is Looking at Customization Seriously
◆ FPGAs gaining popularity among tech giants
§ Microsoft, IBM, Baidu, etc. (e.g., Microsoft Catapult, Intel HARP)
◆ Intel's acquisition of Altera
§ Intel prediction: 30% of datacenter nodes with FPGAs by 2020

7. Contributions of This Paper
◆ Evaluation of different integration options of heterogeneous technologies in datacenters
◆ Efficient programming support for heterogeneous datacenters

8. Small-core on Compute-intensive Workloads
◆ Baselines
§ Xeon: Intel E5-2620, 12-core CPU @ 2.40GHz
§ Atom: Intel D2500 @ 1.8GHz
§ ARM: Cortex-A9 in Zynq @ 800MHz
◆ Data set
§ MNIST: 700K samples, 784 features, 10 labels
◆ Power consumption (averaged)
§ Xeon: 175W/node
§ Atom: 30W/node
§ ARM: 10W/node
◆ Results (normalized to reference Xeon performance)
[Charts: normalized execution time for LR and KM -- 8x ARM: 10.97, 7.8; 8x Atom: 5.26, 3.13. Normalized energy -- 8x ARM: 6.86, 5.21; 8x Atom: 4.88, 3.1]

9. Small Cores Alone Are Not Efficient!

10. Small Core + ACC: FARM
◆ Boost small-core performance with FPGAs
§ 8 Xilinx ZC706 boards
§ 24-port Ethernet switch
§ ~100W power

11. Small-core with FPGA Performance
◆ Setup
§ Data set: MNIST, 700K samples, 784 features, 10 labels
§ Power consumption (averaged): Atom 30W/node, ARM 10W/node
◆ Results (normalized to reference Xeon performance)
[Charts: normalized execution time for LR and KM -- 8x ARM: 10.97, 7.8; 8x Atom: 5.26, 3.13; 8x Zynq: 1.06, 0.69. Normalized energy -- 8x ARM: 6.86, 5.21; 8x Atom: 4.88, 3.1; 8x Zynq: 0.66, 0.43]

12. Small Cores + FPGAs Are More Interesting!

13. Inefficiencies in Small-core
◆ Slower core and memory clocks
§ Task scheduling is slow
§ JVM-to-FPGA data transfer is slow
◆ Limited DRAM size and Ethernet bandwidth
§ Slow data shuffling between nodes
◆ Another option: big-core + FPGA

14. Big-Core + ACC: CDSC FPGA-Enabled Cluster
◆ A 24-node cluster with FPGA-based accelerators
§ Runs on top of Spark and Hadoop (HDFS)
§ Cluster: 1 master/driver, 22 workers, 1 file server, 1 10GbE switch
§ Each node: two Xeon processors, one FPGA PCIe card (Alpha Data), 64 GB RAM, 10GbE NIC
§ Alpha Data board: Virtex-7 FPGA, 16GB on-board RAM

15. Experimental Results
◆ Experimental setup
§ Data set: MNIST, 700K samples, 784 features, 10 labels
◆ Results (normalized to reference Xeon performance)
[Charts: normalized execution time for LR and KM -- 1x Xeon+AlphaData: 0.5, 0.33; 8x Zynq: 1.06, 0.69. Normalized energy -- 1x Xeon+AlphaData: 0.56, 0.38; 8x Zynq: 0.66, 0.43]

16. Overall Evaluation Results
◆ Based on two machine learning workloads
§ Normalized performance (speedup) and energy efficiency (performance/W) relative to big-core solutions

Option            | Performance  | Energy Efficiency
Big-Core + FPGA   | Best, 2.5    | Best, 2.6
Small-Core + FPGA | Better, 1.2  | Best, 1.9
Big-Core          | Good, 1.0    | Good, 1.0
Small-Core        | Bad, 0.25    | Bad, 0.24
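The performance column can be reproduced from the normalized runtimes on the previous slides: speedup is the inverse of the normalized runtime, averaged over LR and KM. Below is a minimal Scala sketch of that arithmetic; the energy-efficiency column additionally depends on the power accounting, which the slides do not fully spell out, so the perfPerWatt helper is only illustrative.

  // Sketch of the normalization behind the table above. speedup() reproduces
  // the performance column from the normalized runtimes on slides 11 and 15;
  // perfPerWatt() shows the shape of the performance/W calculation, but the
  // cluster-power figures needed to reproduce the table exactly are not given
  // on the slides, so it is illustrative only.
  object RelativeResults {
    def speedup(normalizedRuntimes: Seq[Double]): Double =
      normalizedRuntimes.map(1.0 / _).sum / normalizedRuntimes.size

    def perfPerWatt(normalizedRuntime: Double,
                    optionPowerW: Double,
                    baselinePowerW: Double): Double =
      (1.0 / normalizedRuntime) * baselinePowerW / optionPowerW

    def main(args: Array[String]): Unit = {
      // LR and KM runtimes normalized to the big-core baseline (slides 11 and 15)
      println(f"Big-Core+FPGA speedup:   ${speedup(Seq(0.5, 0.33))}%.2f")  // ~2.5
      println(f"Small-Core+FPGA speedup: ${speedup(Seq(1.06, 0.69))}%.2f") // ~1.2
    }
  }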

17. Contributions of This Paper
◆ Evaluation of different integration options of heterogeneous technologies in datacenters
◆ Efficient programming support for heterogeneous datacenters
§ Heterogeneity makes programming hard!

18. More Heterogeneity in the Future
[Figure: a datacenter mixing Intel-Altera HARP nodes, AlphaData PCIe FPGA cards, and NVIDIA GPUs]

19. Programming Challenges
◆ Application level
§ Lots of setup/initialization code
§ Too much hardware-specific knowledge required
§ Java-to-FPGA via JNI: data transfer between host and accelerator
§ Manual data partitioning and task scheduling
◆ FPGA-as-a-Service
§ Sharing, isolation
◆ Heterogeneous hardware (accelerator-rich systems)
§ OpenCL (SDAccel) supports only a single application
§ Lack of portability

20. Introducing Blaze Runtime System
[Architecture: applications (Hadoop MapRed, Spark AccRDD, client) → Blaze runtime system → cluster nodes, each with an ACC Manager → FPGA/GPU accelerator-rich systems]
◆ Friendly interface
◆ User-transparent scheduling
◆ Efficient

21. Overview of Deployment Flow
[Flow: the application designer writes a user application that sends ACC requests, with input and output data, to the node's ACC Manager, which dispatches them to FPGA/GPU accelerators; the accelerator designer registers ACCs into an accelerator look-up table that the manager consults]
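To make the "ACC register / look-up table" step concrete, here is a hedged Scala sketch of what such a registry could look like: the accelerator designer registers a compiled task under an id, and the node manager resolves incoming ACC requests against the table. The names, fields, and path are illustrative, not Blaze's actual API.

  import scala.collection.concurrent.TrieMap

  // Hedged sketch of the accelerator look-up table in the deployment flow above.
  // An accelerator designer registers a compiled task (e.g., a shared library)
  // under an id; the node-level manager resolves application ACC requests by id.
  final case class AccEntry(id: String, libraryPath: String, device: String)

  object AccRegistry {
    private val table = TrieMap.empty[String, AccEntry]

    def register(entry: AccEntry): Unit = table.put(entry.id, entry)

    // Returns None if nothing is registered under this id, in which case the
    // request would fall back to a software implementation.
    def lookup(id: String): Option[AccEntry] = table.get(id)
  }

  // Usage (illustrative):
  //   AccRegistry.register(AccEntry("Logistic", "/path/to/acc_task.so", "FPGA"))
  //   AccRegistry.lookup("Logistic")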

22. Programming Interface for Application

Original Spark code:

  val points = sc.textFile().cache()
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }

With Blaze:

  val points = blaze.wrap(sc.textFile())
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(
      new LogisticGrad(w)
    ).reduce(_ + _)
    w -= gradient
  }

  class LogisticGrad(..) extends Accelerator[T, U] {
    val id: String = "Logistic"
    def call(in: T): U = { p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    }
  }

Spark.mllib integration: no user code changes or recompilation.
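Following the same pattern, other map kernels can be wrapped as Accelerator subclasses. The sketch below applies it to a KMeans-style nearest-center assignment; because the slide abbreviates the Accelerator trait, a minimal stand-in is declared here so the example is self-contained, and the class name and accelerator id are hypothetical.

  // Stand-in for Blaze's Accelerator trait, mirroring the members shown on the
  // slide (id and call); the real trait lives in the Blaze runtime library.
  trait Accelerator[T, U] extends Serializable {
    def id: String
    def call(in: T): U
  }

  // Hedged sketch: a KMeans-style kernel wrapped the same way as LogisticGrad.
  // "KMeansAssign" is a hypothetical accelerator id; call() gives the functional
  // (software) definition of the kernel, while id selects a registered
  // accelerator implementation when one is available on the node.
  class NearestCenter(centers: Array[Array[Double]])
      extends Accelerator[Array[Double], Int] {

    val id: String = "KMeansAssign"

    def call(point: Array[Double]): Int = {
      def dist2(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
      centers.indices.minBy(i => dist2(point, centers(i)))
    }
  }

  // Usage (illustrative): blaze.wrap(points).map(new NearestCenter(centers))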

23. Under the Hood: Getting Data to the FPGA
[Data path: Spark task → (inter-process memcpy) → NAM → (PCIe memcpy) → FPGA device]
[Time breakdown (pie chart): receive data, data preprocessing, data transfer, FPGA computation, other -- shares shown: 3.84%, 17.99%, 28.06%, 17.99%, 32.13%]
◆ Solutions
§ Data caching (see the sketch below)
§ Pipelining
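As a concrete illustration of the data-caching solution listed above, the sketch below keeps the native-ready (serialized) form of each input block keyed by a block id, so iterative workloads pay the JVM-to-native conversion only once. This is a hypothetical helper, not Blaze's implementation.

  import java.nio.ByteBuffer
  import scala.collection.concurrent.TrieMap

  // Hedged sketch of the "data caching" idea: serialize each input block into a
  // native-friendly byte buffer once, keyed by a block id, so later iterations
  // skip the repeated JVM-to-native copy. Illustrative only.
  object NativeBlockCache {
    private val cache = TrieMap.empty[Long, Array[Byte]]

    def getOrSerialize(blockId: Long, block: => Array[Double]): Array[Byte] =
      cache.getOrElseUpdate(blockId, serialize(block))

    private def serialize(values: Array[Double]): Array[Byte] = {
      val buf = ByteBuffer.allocate(values.length * java.lang.Double.BYTES)
      values.foreach(v => buf.putDouble(v))
      buf.array()
    }
  }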

24. Programming Interface for Accelerator

  class LogisticACC : public Task {   // extend the basic Task interface
   public:
    LogisticACC() : Task(2) {}        // specify # of inputs

    // overwrite the compute function
    virtual void compute() {
      // get input/output using provided APIs
      int num_sample  = getInputNumItems(0);
      int data_length = getInputLength(0) / num_sample;
      int weight_size = getInputLength(1);
      double *data    = (double*)getInput(0);
      double *weights = (double*)getInput(1);
      double *grad    = (double*)getOutput(0, weight_size, sizeof(double));
      double *loss    = (double*)getOutput(1, 1, sizeof(double));

      // perform computation
      RuntimeClient runtimeClient;
      LogisticApp theApp(out, in, data_length * sizeof(double), &runtimeClient);
      theApp.run();
    }
  };

Compiled to an ACC_Task shared library (*.so).

25. Complete Accelerator Management Solution
◆ Global Accelerator Manager (GAM)
§ Global resource allocation within a cluster to optimize system throughput and accelerator utilization
◆ Node Accelerator Manager (NAM)
§ Virtualizes accelerators for application tasks
§ Provides accelerator sharing/isolation
[Stack: Spark / MPI / MapReduce / other frameworks → Global Acc Manager alongside YARN → distributed file system (HDFS) → nodes, each running a Node Acc Manager over its accelerators]
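A hedged sketch of the node-level sharing/isolation idea: application tasks never open the device themselves; they ask the NAM for an accelerator slot, and a bounded, fair pool serializes access so concurrent applications time-share the hardware. The names here are illustrative, not the actual NAM interface.

  import java.util.concurrent.Semaphore

  // Hedged sketch of NAM-style sharing/isolation: a bounded, fair pool of
  // accelerator slots is multiplexed among client tasks, so applications never
  // hold the device directly. Illustrative only.
  class AcceleratorSlotPool(numSlots: Int) {
    private val slots = new Semaphore(numSlots, true) // fair queuing across tasks

    def withSlot[A](task: => A): A = {
      slots.acquire()
      try task
      finally slots.release()
    }
  }

  // Usage (illustrative):
  //   val pool = new AcceleratorSlotPool(numSlots = 1)
  //   pool.withSlot { /* send the request to the FPGA and wait for the result */ }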

26. Global Accelerator Management
◆ Places accelerable applications on the nodes with accelerators
§ Minimizes FPGA reprogramming overhead via accelerator-locality-aware scheduling (see the sketch below)
§ → less reprogramming and less straggler effect
§ Naïve allocation: multiple applications share the same FPGA, so frequent reprogramming is needed
§ Better allocation: applications on a node use the same accelerator, so no reprogramming is needed
◆ Heartbeats with the node accelerator managers to collect accelerator status
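A minimal sketch of the accelerator-locality-aware policy described above: prefer a node whose FPGA already holds the requested accelerator, and only fall back to reprogramming some other free node when none is available. The data model is hypothetical, not the paper's GAM implementation.

  // Hedged sketch of accelerator-locality-aware placement. A node that already
  // has the requested bitstream loaded is preferred (no reprogramming);
  // otherwise any node with free capacity is chosen and would be reprogrammed.
  final case class NodeState(name: String, loadedAcc: Option[String], freeSlots: Int)

  object LocalityAwareGam {
    def place(accId: String, nodes: Seq[NodeState]): Option[String] = {
      def freest(candidates: Seq[NodeState]): Option[NodeState] =
        if (candidates.isEmpty) None else Some(candidates.maxBy(_.freeSlots))

      val alreadyLoaded = nodes.filter(n => n.loadedAcc.contains(accId) && n.freeSlots > 0)
      val anyFree       = nodes.filter(_.freeSlots > 0)

      freest(alreadyLoaded)        // locality hit: no FPGA reprogramming
        .orElse(freest(anyFree))   // locality miss: reprogram this node
        .map(_.name)
    }
  }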

27. Results of GAM Optimizations
◆ Policies compared
§ Static partition: manual cluster partition into a KM cluster and an LR cluster
§ Naïve sharing: workloads are free to run on any nodes
§ GAM
[Chart: normalized throughput of static partition, naïve sharing, and GAM as the ratio of LR workloads in the mixed LR-KM workloads varies from 0 to 1; GAM achieves the highest throughput, up to ~1.9x, while naïve sharing falls as low as ~0.6x]

28. Productivity and Efficiency of Blaze
Lines of Code (LOC):

Workload            | Application | Accelerator setup | Partial FaaS
Logistic Regression | 26 → 9      | 104 → 99          | 325 → 0
Kmeans              | 37 → 7      | 107 → 103         | 364 → 0
Compression         | 1 → 1       | 70 → 65           | 360 → 0
Genomics            | 1 → 1       | 227 → 142         | 896 → 0

§ The accelerator kernels are wrapped as a function
§ Partial FaaS does not support accelerator sharing among different applications

29. Blaze Overhead Analysis
◆ Blaze has overhead compared to a manual design
◆ With multiple threads the overhead can be effectively mitigated
[Charts: execution-time breakdown (ms) for manual vs. Blaze -- JVM-to-native, native-to-FPGA, FPGA kernel, software, native-private-to-share; throughput normalized to the manual design for 1 thread vs. 12 threads]

30. Acknowledgements -- CDSC and C-FAR
◆ Center for Domain-Specific Computing (CDSC), under the NSF Expeditions in Computing Program
◆ C-FAR Center, under the STARnet Program
Yuting Chen (UCLA), Hui Huang (UCLA), Muhuan Huang (UCLA), Cody Hao Yu (UCLA), Di Wu (UCLA), Bingjun Xiao (UCLA)
