http://www.montblanc-project.eu
Jean-François Méhaut
This project and the research leading to these results have received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

The New Killer Processors
• Overview of the Mont-Blanc projects
• BOAST: a DSL for computing kernels
Corse: Compiler Optimization and Run-time SystEms
Fabrice Rastello, Inria
Joint Project-Team (proposal), June 9, 2015
Project-team composition / Institutional context
• Joint Project-Team (Inria, Grenoble INP, UJF) in the LIG laboratory @ Giant/Minatec
• Fabrice Rastello, Florent Bouchez Tichadou, François Broquedis, Frédéric Desprez, Yliès Falcone, Jean-François Méhaut
• 8 PhD students, 3 post-docs, 1 engineer
Permanent member curricula vitae
• Florent Bouchez Tichadou, MdC UJF (PhD Lyon 2009, 1Y Bangalore, 3Y Kalray, Nanosim): compiler optimization, compiler back-end
• François Broquedis, MdC Grenoble INP (PhD Bordeaux 2010, 1Y Mescal, 3Y Moais): runtime systems, OpenMP, memory management
• Frédéric Desprez, DR1 Inria (Graal, Avalon): parallel algorithms, numerical libraries
• Yliès Falcone, MdC UJF (PhD Grenoble 2009, 2Y Rennes, Vasco): validation, enforcement, debugging, runtime
• Jean-François Méhaut, Pr UJF (Mescal, Nanosim): runtime, debugging, memory management, scientific applications
• Fabrice Rastello, CR1 Inria (PhD Lyon 2000, 2Y STMicro, Compsys, GCG): compiler optimization, graph theory, compiler back-end, automatic parallelization
Overall objectives
• Domain: compiler optimization and runtime systems for performance and energy consumption (not reliability, nor WCET)
• Issues: scalability and heterogeneity/complexity, i.e. a trade-off between specific optimizations and programmability/portability
• Target architectures: VLIW / SIMD / embedded / many-cores / heterogeneous
• Applications: dynamic systems / loop nests / graph algorithms / signal processing
• Approach: combine static/dynamic and compiler/runtime techniques
First, vector processors dominated HPC
• The first Top500 list (June 1993) was dominated by DLP architectures
  • Cray vector: 41%
  • MasPar SIMD: 11%
  • Convex/HP vector: 5%
• The Fujitsu Wind Tunnel is #1 from 1993 to 1996, with 170 GFLOPS
Then, commodity took over special purpose
• ASCI Red, Sandia
  • 1997, 1 TFLOPS
  • 9,298 cores @ 200 MHz
  • Intel Pentium Pro
  • Upgraded to Pentium II Xeon in 1999: 3.1 TFLOPS
• ASCI White, LLNL
  • 2001, 7.3 TFLOPS
  • 8,192 processors @ 375 MHz
  • IBM POWER3
• Transition from vector parallelism to message-passing programming models
Commodity components drive HPC
• RISC processors replaced vectors
• x86 processors replaced RISC
• Vector processors survive as (widening) SIMD extensions
The killer microprocessors
[Chart: MFLOPS per processor, 1974-1999, from 10 to 10,000 MFLOPS; vector machines (Cray-1, Cray C90, NEC SX-4, SX-5) vs. microprocessors (Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA-8200)]
• Microprocessors killed the vector supercomputers
• They were not faster ...
• ... but they were significantly cheaper and greener
• Need 10 microprocessors to achieve the performance of 1 vector CPU
• SIMD vs. MIMD programming paradigms
The killer mobile processors™
[Chart: MFLOPS per processor, 1990-2015, from 100 to 1,000,000 MFLOPS; Alpha, Intel, and AMD processors vs. mobile SoCs (NVIDIA Tegra, Samsung Exynos, 4-core ARMv8 @ 1.5 GHz)]
• Microprocessors killed the vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener
• History may be about to repeat itself ...
  • Mobile processors are not faster ...
  • ... but they are significantly cheaper
Mobile SoC vs. server processor
• Mobile SoC (Tegra 3) (1): 5.2 GFLOPS, $21
• Server processor (Intel E5) (2): 153 GFLOPS (x30), $1500 (x70)
• 15.2 GFLOPS at $21 (?): x10 below the server part, still x70 cheaper
1. Leaked Tegra 3 price from the Nexus 7 Bill of Materials
2. Non-discounted list price for the 8-core Intel E5 Sandy Bridge
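A rough check of these ratios from the quoted figures: 153 / 5.2 ≈ 29 (≈ x30) in performance for 1500 / 21 ≈ 71 (≈ x70) in price, i.e. about 0.10 GFLOPS per dollar for the server part versus about 0.25 GFLOPS per dollar for the mobile SoC; the 15.2 GFLOPS figure would narrow the performance gap to 153 / 15.2 ≈ 10x at an unchanged price.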
SoCs under study: CPU and memory
• NVIDIA Tegra 2: 2 x ARM Cortex-A9 @ 1 GHz; 1 x 32-bit DDR2-333 channel; 32 KB L1 + 1 MB L2
• NVIDIA Tegra 3: 4 x ARM Cortex-A9 @ 1.3 GHz; 2 x 32-bit DDR23-750 channels; 32 KB L1 + 1 MB L2
• Samsung Exynos 5 Dual: 2 x ARM Cortex-A15 @ 1.7 GHz; 2 x 32-bit DDR3-800 channels; 32 KB L1 + 1 MB L2
• Intel Core i7-2760QM: 4 x Intel Sandy Bridge cores @ 2.4 GHz; 2 x 64-bit DDR3-800 channels; 32 KB L1 + 1 MB L2 + 6 MB L3
Evaluated kernels (implemented in pthreads, OpenMP, OpenCL, OmpSs, and CUDA)

Tag     Full name                       Properties
vecop   Vector operation                Common operation in numerical codes
dmmm    Dense matrix-matrix multiply    Data reuse and compute performance
3dstc   3D volume stencil               Strided memory accesses (7-point 3D stencil)
2dcon   2D convolution                  Spatial locality
fft     1D FFT transform                Peak floating-point, variable-stride accesses
red     Reduction operation             Varying levels of parallelism
hist    Histogram calculation           Local privatization and reduction stage
msort   Generic merge sort              Barrier synchronization
nbody   N-body calculation              Irregular memory accesses
amcd    Markov chain Monte-Carlo        Embarrassingly parallel method
spvm    Sparse matrix-vector multiply   Load imbalance
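To make the table concrete, here is a minimal sketch of the kind of kernel listed above, namely a vector operation (vecop) in C with OpenMP; the specific operation (an AXPY-style update), the array names, and the problem size are illustrative assumptions, not the Mont-Blanc benchmark source.

```c
#include <stdio.h>
#include <stdlib.h>

/* Vector-operation (vecop-style) kernel: y = a*x + y.
   Illustrative sketch; the evaluated suite also provides pthreads,
   OpenCL, OmpSs, and CUDA variants of each kernel. */
static void vecop(double a, const double *x, double *y, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    size_t n = 1 << 24;                 /* ~16M elements (assumption) */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (!x || !y) return 1;

    for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
    vecop(3.0, x, y, n);
    printf("y[0] = %.1f\n", y[0]);      /* expect 5.0 */

    free(x);
    free(y);
    return 0;
}
```

Compiled with gcc -fopenmp, this gives a shared-memory baseline on each SoC; the study's kernels come in the programming models listed above.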
Single-core performance and energy
• Tegra 3 is 1.4x faster than Tegra 2
  • Higher clock frequency
• Exynos 5 is 1.7x faster than Tegra 3
  • Better frequency, memory bandwidth, and core microarchitecture
• Intel Core i7 is ~3x better than ARM Cortex-A15 at maximum frequency
• ARM platforms are more energy-efficient than the Intel platform
Multicore performance and energy
• Tegra 3 is as fast as Exynos 5, and a bit more energy-efficient
  • 4-core vs. 2-core
• ARM multicores are as efficient as Intel at the same frequency
• Intel is still more energy-efficient at the highest performance
• The ARM CPU is not the major power sink in the platform
Memory bandwidth (STREAM)
• Exynos 5 improves dramatically over Tegra (4.5x)
  • Dual-channel DDR3
  • ARM Cortex-A15 sustains more in-flight cache misses
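STREAM measures sustainable memory bandwidth with simple vector loops; below is a stripped-down, triad-style sketch in C with OpenMP to show what is being measured (array size and timing method are assumptions; the official STREAM benchmark should be used for numbers comparable to the ones above).

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* STREAM-style triad: a[i] = b[i] + s * c[i].
   Bandwidth estimate = bytes moved (3 arrays) / elapsed time.
   Simplified sketch, not the official STREAM code. */
int main(void)
{
    const size_t n = 16 * 1000 * 1000;   /* large enough to defeat caches (assumption) */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    printf("Triad bandwidth: %.2f GB/s\n",
           3.0 * n * sizeof(double) / (t1 - t0) / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```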
Tibidabo: the first ARM HPC multicore cluster
• Q7 Tegra 2 module: 2 x Cortex-A9 @ 1 GHz; 2 GFLOPS; 5 W (?); 0.4 GFLOPS/W
• Q7 carrier board: 2 x Cortex-A9; 2 GFLOPS; 1 GbE + 100 MbE; 7 W; 0.3 GFLOPS/W
• 1U rackable blade: 8 nodes; 16 GFLOPS; 65 W; 0.25 GFLOPS/W
• 2 racks: 32 blade containers; 256 nodes, 512 cores; 9 x 48-port 1 GbE switches; 512 GFLOPS; 3.4 kW; 0.15 GFLOPS/W
• Proof of concept
  • It is possible to deploy a cluster of smartphone processors
  • Enables software stack development
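The efficiency figures follow from the numbers above: 2 GFLOPS / 5 W = 0.4 GFLOPS/W for the bare module, 2 / 7 ≈ 0.3 for the carrier board, 16 / 65 ≈ 0.25 for a blade, and 512 GFLOPS / 3,400 W ≈ 0.15 GFLOPS/W for the deployed system, i.e. roughly a 2.5x efficiency loss between the SoC and the full cluster once boards, interconnect, and switches are included.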
HPC system software stack on ARM
• Open-source system software stack
  • Ubuntu Linux OS
  • GNU compilers: gcc, g++, gfortran
  • Scientific libraries: ATLAS, FFTW, HDF5, ...
  • Slurm cluster management
  • Runtime libraries: MPICH2, OpenMPI
  • OmpSs toolchain
  • Performance analysis tools: Paraver, Scalasca
  • Allinea DDT 3.1 debugger ported to ARM
[Stack diagram: source files (C, C++, FORTRAN, ...) -> compilers (gcc, gfortran, ..., OmpSs) -> executables linked against scientific libraries (ATLAS, FFTW, HDF5, ...); developer tools (Paraver, Scalasca, ...); cluster management (Slurm); OmpSs runtime library (NANOS++) over MPI, GASNet, CUDA, OpenCL; Linux on CPU+GPU nodes]
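As a small end-to-end illustration of what this stack compiles and runs (a generic example, not Mont-Blanc application code), the MPI program below would be built with the MPI compiler wrapper shipped by MPICH2 or OpenMPI (typically mpicc) and launched through Slurm, e.g. srun -n 8 ./hello.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI example: each rank reports its place in the job and the
   node it runs on. Generic illustration of the toolchain only. */
int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```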
Parallel scalability
• HPC applications scale well on the Tegra 2 cluster
• Capable of exploiting enough nodes to compensate for the lower per-node performance
SoCs under study: interconnection
• NVIDIA Tegra 2: 1 GbE (PCIe); 100 Mbit (USB 2.0)
• NVIDIA Tegra 3: 1 GbE (PCIe); 100 Mbit (USB 2.0)
• Samsung Exynos 5 Dual: 1 GbE (USB 3.0); 100 Mbit (USB 2.0)
• Intel Core i7-2760QM: 1 GbE (PCIe); QDR InfiniBand (PCIe)
Interconnection network: latency
• TCP/IP adds a lot of CPU overhead
• The OpenMX driver interfaces directly with the Ethernet NIC
• The USB stack adds extra latency on top of the network stack
Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results
Interconnection network: bandwidth
• TCP/IP overhead prevents the Cortex-A9 CPU from achieving full bandwidth
• USB stack overheads prevent Exynos 5 from achieving full bandwidth, even with OpenMX
Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results
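Numbers like these are typically measured with a ping-pong microbenchmark between two nodes; a minimal MPI version is sketched below (message size and repetition count are assumptions, and in practice suites such as the OSU or Intel MPI benchmarks are used rather than hand-rolled code).

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Two-rank ping-pong: rank 0 sends a buffer, rank 1 echoes it back.
   Reports half-round-trip time and the resulting bandwidth for one
   message size. Sketch only, not a full benchmark suite. */
int main(int argc, char **argv)
{
    const int reps = 1000;
    const int bytes = 1 << 20;            /* 1 MiB message (assumption) */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * reps);

    if (rank == 0)
        printf("time per message: %.1f us, bandwidth: %.2f MB/s\n",
               one_way * 1e6, bytes / one_way / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}
```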
Interconnect vs. performance ratio

Peak interconnect bytes / FLOP    1 Gb/s    6 Gb/s    40 Gb/s
Tegra 2                           0.06      0.38      2.50
Tegra 3                           0.02      0.14      0.96
Exynos 5250                       0.02      0.11      0.74
Intel i7                          0.00      0.01      0.07

• Mobile SoCs have a low-bandwidth interconnect ...
  • 1 GbE or USB 3.0 (6 Gb/s)
• ... but the ratio to performance is similar to high-end systems
  • 40 Gb/s InfiniBand
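These ratios are simply peak interconnect bandwidth in bytes per second divided by peak FLOPS, taking the per-SoC peak figures used earlier in the deck (e.g. 2 GFLOPS for Tegra 2). For example, for Tegra 2 with 1 GbE: 1 Gb/s = 0.125 GB/s, and 0.125 / 2 ≈ 0.06 bytes per FLOP; with a 40 Gb/s link the same SoC would reach 5 / 2 = 2.5 bytes per FLOP, which is why the columns correspond to 1 GbE, USB 3.0, and InfiniBand-class networks.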