
Are Low-Power SoCs Feasible for Heterogenous HPC Workloads?


  1. Are Low-Power SoCs Feasible for Heterogenous HPC Workloads? Max Plauth and Andreas Polze Operating Systems and Middleware Group Hasso Plattner Institute, University of Potsdam, Germany

  2. Introduction
     ■ Operating Systems and Middleware Group
       □ Prof. Dr. Andreas Polze
       □ 7 PhD students, 15 Master's theses in progress
       □ "Extending the reach of Middleware"
     Are Low-Power SoCs Feasible for Heterogenous HPC Workloads? Max Plauth, UCHPC@EuroPar'16, 23.08.2016 Chart 2

  3. Introduction
     ■ SSICLOPS: Scalable and Secure Infrastructures for Cloud Operations

  4. Motivation
     ■ Power efficiency has acquired an additional facet for the HPC sector
       □ Running electricity costs exceed initial acquisition costs
     ■ HPC community is getting interested in low-power (SoC) designs
     ■ This work focuses on the heterogenous aspects
       □ CPU: heterogenous multiprocessing / big.LITTLE paradigm
       □ CPU: improvements of ARMv8-A ISA
       □ GPU: SoC-grade GPUs have become OpenCL capable

  5. Motivation (continued)
     ■ We provide the following contributions:
       □ We investigate the heterogenous capabilities of state-of-the-art SoCs, elaborating on both the heterogenous multiprocessing feature of big.LITTLE CPUs and the compute capabilities of SoC-grade GPUs.
       □ We compare the characteristics of ARMv8-A versus ARMv7-A SoCs.
       □ Based on the narrowing gap between ARM and x86_64 based SoCs, we anticipate the potential of forthcoming ARM designs in the HPC domain.

  6. Related Work
     ■ GreenDestiny (2002) / MegaProto (2005)
       □ First attempts at using low-power hardware in HPC scenarios
       □ GreenDestiny: 240x TM5600 667-MHz CPUs → 13.5 MFLOPS/Watt
       □ MegaProto: 512x TM8820 1-GHz CPUs → 100 MFLOPS/Watt
     ■ Rajovic et al. (2013) / Mont-Blanc project: "Supercomputing with commodity CPUs: Are mobile SoCs ready for HPC?"
       □ NVIDIA Tegra 2 & 3 (Cortex-A9); Samsung Exynos 5250 (Cortex-A15)
       □ No GPU compute support at the time
       □ Outlook is promising, but current SoCs have many issues
         – No ECC, unstable PCIe implementation, etc.

  7. Hardware Targets
     ■ Raspberry Pi 3
       □ SoC: Broadcom BCM2837
       □ CPU: 4x ARM Cortex-A53 CPU, ARMv8-A
         – 1.2 GHz, in-order execution
         – L1$ (I/D): 32KB/32KB
         – L2$: 512KB
       □ Memory: 1GB LPDDR2 (900 MHz)
       □ GPU: BCM VideoCore IV (no compute capabilities)
       □ OS: Ubuntu MATE 15.10 / Linux 4.1.18-v7+ (armv7l)
       □ Compiler: GCC v5.2.1

  8. Hardware Targets (continued)
     ■ Odroid-C2
       □ SoC: Amlogic S905
       □ CPU: 4x ARM Cortex-A53 CPU, ARMv8-A
         – 2.0 GHz, in-order execution
         – L1$ (I/D): 32KB/32KB
         – L2$: 512KB
       □ Memory: 2GB DDR3 (32 bit / 912 MHz)
       □ GPU: ARM Mali-450 (no compute capabilities)
       □ OS: Ubuntu MATE 16.04 / Linux 3.14.29-29 (aarch64)
       □ Compiler: GCC v5.3.1

  9. Hardware Targets (continued)
     ■ Odroid-XU4
       □ SoC: Samsung Exynos 5422
       □ CPU: big.LITTLE octa core, ARMv7-A
         – 4x Cortex-A7, 1.5 GHz, in-order execution
         – 4x Cortex-A15, 2.0 GHz, out-of-order execution
         – L1$ (I/D): 32KB/32KB
         – L2$: 512KB (A7) / 2MB (A15)
       □ Memory: 2GB LPDDR3 (32 bit / 933 MHz, PoP)
       □ GPU: ARM Mali-T628 MP6 (OpenCL v1.1)
       □ OS: Ubuntu MATE 15.10 / Linux 3.10.96-78 (armv7l, HMP)
       □ Compiler: GCC v5.2.1

  10. Hardware Targets (continued)
      ■ HPE ProLiant m710p Server Cartridge
        □ CPU: Intel Xeon E3-1284L v4, 4C/8T, x86_64
          – 2.90 GHz, out-of-order execution
          – L1$ (I/D): 32KB/32KB (per core)
          – L2$: 256KB (per core)
          – L3$: 6MB (shared)
          – L4$: 128MB eDRAM
        □ Memory: 32GB DDR3-1600 SODIMM
        □ GPU: Iris Pro P6300 Broadwell GT3 (OpenCL v1.2)
        □ OS: Ubuntu 16.04 LTS / Linux 4.4.0-21 (x86_64)
        □ Compiler: GCC v5.3.1

  11. Benchmark procedure
      ■ Rodinia Suite: picked 4 tests to cover major Berkeley Dwarfs
        – Structured Grid (Leukocyte Tracking)
        – Unstructured Grid (CFD Solver)
        – Dense Linear Algebra (k-Nearest Neighbours)
        – Graph Traversal (Breadth-First Search)
        □ Warm-up run + 10 repeated measurements
        □ Energy consumption measured (Off/Idle/Load)
      ■ STREAM Benchmark (Memory Bandwidth)
      ■ TinyMemBench (Memory Latency)
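The measurement scheme on this slide (one warm-up run followed by 10 repeated measurements) could be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the names `measure` and `summarize` and the stand-in workload are hypothetical, and the slides do not say how the repetitions were aggregated, so mean and standard deviation are an assumption.

```python
import statistics
import time


def measure(workload, repetitions=10):
    """Run one untimed warm-up, then time `repetitions` runs of `workload`."""
    workload()  # warm-up run; its timing is discarded
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation) of the timings in seconds."""
    # Assumption: the slides do not state which statistic was reported.
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Hypothetical stand-in for a Rodinia benchmark invocation.
    samples = measure(lambda: sum(i * i for i in range(100_000)))
    mean, stdev = summarize(samples)
    print(f"mean {mean:.4f} s, stdev {stdev:.4f} s over {len(samples)} runs")
```

In a real setup the workload would launch the benchmark binary, and wall-clock timing would be paired with the external power readings.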

  12. Power Consumption

              RPI 3    C2     XU4 (A7)   XU4 (A15)   XU4 (GPU)   m710p (CPU)   m710p (GPU)
      Off     0.50     1.00   0.70       0.70        0.70        9.85          9.85
      Idle    1.70     2.30   3.80       3.80        3.80        20.65         20.65
      Load    2.70     4.10   5.10       11.70       6.60        79.45         67.93

      ■ SBCs: power consumption was measured using a power outlet meter
      ■ m710p: power consumption was retrieved through the HPE iLO management interface
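As a rough illustration of how the power figures on this slide relate to the energy-to-computation (EtC) values reported later, the sketch below multiplies load power by runtime. Two things are assumptions: the slide does not state a unit (watts is assumed here), and it does not say whether EtC was derived as load power times time-to-computation. The function names and the example runtime are hypothetical.

```python
# Power readings copied from the slide; the unit is assumed to be watts.
POWER = {
    "RPI 3":       {"off": 0.50, "idle": 1.70,  "load": 2.70},
    "C2":          {"off": 1.00, "idle": 2.30,  "load": 4.10},
    "XU4 (A7)":    {"off": 0.70, "idle": 3.80,  "load": 5.10},
    "XU4 (A15)":   {"off": 0.70, "idle": 3.80,  "load": 11.70},
    "XU4 (GPU)":   {"off": 0.70, "idle": 3.80,  "load": 6.60},
    "m710p (CPU)": {"off": 9.85, "idle": 20.65, "load": 79.45},
    "m710p (GPU)": {"off": 9.85, "idle": 20.65, "load": 67.93},
}


def dynamic_power(platform):
    """Power drawn on top of idle while under load."""
    reading = POWER[platform]
    return reading["load"] - reading["idle"]


def energy_to_computation(platform, seconds):
    """Assumed model: energy in joules = load power * runtime."""
    return POWER[platform]["load"] * seconds


if __name__ == "__main__":
    for name in POWER:
        # 100 s is a hypothetical runtime, not a measured value.
        print(f"{name:12s} dynamic {dynamic_power(name):6.2f}  "
              f"energy for 100 s: {energy_to_computation(name, 100.0):8.1f}")
```

Under this model a platform with a longer runtime can still come out ahead on energy if its load power is low enough, which is the trade-off the benchmark slides examine.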

  13. STREAM memory bandwidth
      [Figure: Memory Bandwidth [MB/s] per platform for the Copy, Scale, Add, and Triad kernels]
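For readers unfamiliar with the four kernels named in the chart, the sketch below shows what each one computes and how a bandwidth figure in MB/s follows from array length and runtime, assuming 8-byte elements as in the reference STREAM benchmark. It is purely illustrative: a pure-Python loop measures interpreter overhead rather than memory bandwidth, and `run_kernel` and `bandwidth_mb_per_s` are hypothetical names.

```python
WORD_BYTES = 8  # STREAM operates on double-precision values

# Number of arrays read or written per element by each kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def run_kernel(name, a, b, c, q=3.0):
    """Apply one STREAM kernel in place to the lists a, b, c."""
    n = len(a)
    if name == "Copy":
        for i in range(n):
            c[i] = a[i]
    elif name == "Scale":
        for i in range(n):
            b[i] = q * c[i]
    elif name == "Add":
        for i in range(n):
            c[i] = a[i] + b[i]
    elif name == "Triad":
        for i in range(n):
            a[i] = b[i] + q * c[i]
    else:
        raise ValueError(f"unknown kernel: {name}")


def bandwidth_mb_per_s(name, n, seconds):
    """Bytes moved by the kernel divided by runtime, in MB/s (10^6 bytes)."""
    return 1e-6 * ARRAYS_TOUCHED[name] * WORD_BYTES * n / seconds


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 elements processed in 10 ms.
    print(bandwidth_mb_per_s("Triad", 1_000_000, 0.01))
```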

  14. TinyMemBench memory latency
      [Figure: Memory Latency [ns] over Block Size [KiB] for RPI 3, C2, XU4 (A7), XU4 (A15), and m710p]

  15. Structured Grid (Leukocyte Tracking)
      [Figure: Energy-to-Computation [J] and Time-to-Computation [s] per platform]
      ■ Heterogeneity: EtC(XU4 GPU) <<< EtC(A15) < EtC(A7)
      ■ ARMv8-A: +105% EtC (C2/64 compared to A7), +24% TtC (C2/64 vs. C2/32)
      ■ ARM vs. x86_64: XU4 GPU delivers competitive EtC and TtC performance

  16. Unstructured Grid (CFD Solver)
      [Figure: Energy-to-Computation [J] and Time-to-Computation [s] per platform]
      ■ Heterogeneity: TtC(XU4 GPU) < TtC(A15) <<< TtC(A7)
      ■ ARMv8-A: +72% EtC (C2/64 compared to A7), no autovectorization → C2/64 > C2/32
      ■ ARM vs. x86_64: XU4 GPU is competitive for EtC, but nothing else

  17. Dense Linear Algebra (k-Nearest Neighbours)
      [Figure: Energy-to-Computation [J] and Time-to-Computation [s] per platform]
      ■ Heterogeneity: EtC(A7) < EtC(A15); TtC(XU4 GPU) <<< TtC(A15) < TtC(A7)
      ■ ARMv8-A: +186% EtC (C2/64 compared to A7), +73% TtC (C2/64 vs. C2/32)
      ■ ARM vs. x86_64: XU4 GPU delivers competitive EtC and TtC performance

  18. Graph Traversal (Breadth-First Search)
      [Figure: Energy-to-Computation [J] and Time-to-Computation [s] for RPI 3, C2 (32), C2 (64), XU4 (A7), XU4 (A15), XU4 (GPU), m710p (CPU), and m710p (GPU)]
      ■ Heterogeneity: EtC(XU4 GPU) < EtC(A7/A15); TtC(A15) < TtC(XU4 GPU) < TtC(A7)
      ■ ARMv8-A: no autovectorization → C2/64 > C2/32
      ■ ARM vs. x86_64: superior EtC performance for ARM-based hardware
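To make the graph traversal workload concrete, here is a textbook queue-based breadth-first search that computes the hop distance from a source node to every reachable node. It is only a sketch of the general technique and not the Rodinia kernel benchmarked on this slide; the function name `bfs_levels` and the example graph are hypothetical.

```python
from collections import deque


def bfs_levels(adjacency, source):
    """Return the hop distance from `source` to every reachable node."""
    distance = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:  # first visit fixes the shortest distance
                distance[neighbour] = distance[node] + 1
                frontier.append(neighbour)
    return distance


if __name__ == "__main__":
    # Small hypothetical graph given as adjacency lists.
    graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
    print(bfs_levels(graph, 0))
```

The access pattern is irregular and data dependent, which is one plausible reason this class of workload behaves differently from the grid and linear algebra kernels.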
