http://www.montblanc-project.eu
Jean-François Méhaut
This project and the research leading to these results have received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

The New Killer Processors
• Overview of the Mont-Blanc projects
• BOAST: a DSL for computing kernels
Corse: Compiler Optimization and Run-time SystEms
Fabrice Rastello, Inria
Joint Project-Team (proposal), June 9, 2015
Project-team composition / Institutional context
• Joint Project-Team (Inria, Grenoble INP, UJF) in the LIG laboratory @ Giant/Minatec
• Fabrice Rastello, Florent Bouchez Tichadou, François Broquedis, Frédéric Desprez, Yliès Falcone, Jean-François Méhaut
• 8 PhD students, 3 post-docs, 1 engineer
Permanent member curricula vitae
• Florent Bouchez Tichadou, MdC UJF (PhD Lyon 2009, 1Y Bangalore, 3Y Kalray, Nanosim): compiler optimization, compiler back-end
• François Broquedis, MdC Grenoble INP (PhD Bordeaux 2010, 1Y Mescal, 3Y Moais): runtime systems, OpenMP, memory management
• Frédéric Desprez, DR1 Inria (Graal, Avalon): parallel algorithms, numerical libraries
• Yliès Falcone, MdC UJF (PhD Grenoble 2009, 2Y Rennes, Vasco): validation, enforcement, debugging, runtime
• Jean-François Méhaut, Pr UJF (Mescal, Nanosim): runtime, debugging, memory management, scientific applications
• Fabrice Rastello, CR1 Inria (PhD Lyon 2000, 2Y STMicro, Compsys, GCG): compiler optimization, graph theory, compiler back-end, automatic parallelization
Overall objectives
• Domain: compiler optimization and runtime systems for performance and energy consumption (not reliability, nor WCET)
• Issues: scalability and heterogeneity/complexity, i.e. a trade-off between specific optimizations and programmability/portability
• Target architectures: VLIW / SIMD / embedded / many-cores / heterogeneous
• Applications: dynamic systems / loop nests / graph algorithms / signal processing
• Approach: combine static/dynamic and compiler/runtime techniques
First, vector processors dominated HPC
• The first Top500 list (June 1993) was dominated by DLP architectures
  • Cray vector: 41%
  • MasPar SIMD: 11%
  • Convex/HP vector: 5%
• The Fujitsu Wind Tunnel is #1 from 1993 to 1996, with 170 GFLOPS
Then, commodity took over special purpose
• ASCI Red, Sandia
  • 1997, 1 TFLOPS
  • 9,298 cores @ 200 MHz
  • Intel Pentium Pro
  • Upgraded to Pentium II Xeon in 1999: 3.1 TFLOPS
• ASCI White, LLNL
  • 2001, 7.3 TFLOPS
  • 8,192 processors @ 375 MHz
  • IBM POWER3
• Transition from vector parallelism to message-passing programming models
Commodity components drive HPC
• RISC processors replaced vectors
• x86 processors replaced RISC
• Vector processors survive as (widening) SIMD extensions
The killer microprocessors
[Chart: MFLOPS per processor, 1974-1999, from 10 to 10,000 MFLOPS; vector machines (Cray-1, Cray C90, NEC SX-4, SX-5) vs. microprocessors (Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA-8200)]
• Microprocessors killed the vector supercomputers
• They were not faster ...
• ... but they were significantly cheaper and greener
• Need 10 microprocessors to achieve the performance of 1 vector CPU
• SIMD vs. MIMD programming paradigms
The killer mobile processors™
[Chart: MFLOPS per processor, 1990-2015, from 100 to 1,000,000 MFLOPS; Alpha, Intel, and AMD processors vs. mobile SoCs (NVIDIA Tegra, Samsung Exynos, 4-core ARMv8 @ 1.5 GHz)]
• Microprocessors killed the vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener
• History may be about to repeat itself ...
  • Mobile processors are not faster ...
  • ... but they are significantly cheaper
Mobile SoC vs. server processor
• Mobile SoC (Tegra 3) (1): 5.2 GFLOPS, $21
• Server processor (Intel E5) (2): 153 GFLOPS (x30), $1500 (x70)
• 15.2 GFLOPS at $21 (?): x10 below the server part, still x70 cheaper
1. Leaked Tegra 3 price from the Nexus 7 Bill of Materials
2. Non-discounted list price for the 8-core Intel E5 Sandy Bridge
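A rough check of these ratios from the quoted figures: 153 / 5.2 ≈ 29 (≈ x30) in performance for 1500 / 21 ≈ 71 (≈ x70) in price, i.e. about 0.10 GFLOPS per dollar for the server part versus about 0.25 GFLOPS per dollar for the mobile SoC; the 15.2 GFLOPS figure would narrow the performance gap to 153 / 15.2 ≈ 10x at an unchanged price.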
SoCs under study: CPU and memory
• NVIDIA Tegra 2: 2 x ARM Cortex-A9 @ 1 GHz; 1 x 32-bit DDR2-333 channel; 32 KB L1 + 1 MB L2
• NVIDIA Tegra 3: 4 x ARM Cortex-A9 @ 1.3 GHz; 2 x 32-bit DDR23-750 channels; 32 KB L1 + 1 MB L2
• Samsung Exynos 5 Dual: 2 x ARM Cortex-A15 @ 1.7 GHz; 2 x 32-bit DDR3-800 channels; 32 KB L1 + 1 MB L2
• Intel Core i7-2760QM: 4 x Intel Sandy Bridge cores @ 2.4 GHz; 2 x 64-bit DDR3-800 channels; 32 KB L1 + 1 MB L2 + 6 MB L3
Evaluated kernels (implemented in pthreads, OpenMP, OpenCL, OmpSs, and CUDA)

Tag     Full name                       Properties
vecop   Vector operation                Common operation in numerical codes
dmmm    Dense matrix-matrix multiply    Data reuse and compute performance
3dstc   3D volume stencil               Strided memory accesses (7-point 3D stencil)
2dcon   2D convolution                  Spatial locality
fft     1D FFT transform                Peak floating-point, variable-stride accesses
red     Reduction operation             Varying levels of parallelism
hist    Histogram calculation           Local privatization and reduction stage
msort   Generic merge sort              Barrier synchronization
nbody   N-body calculation              Irregular memory accesses
amcd    Markov chain Monte-Carlo        Embarrassingly parallel method
spvm    Sparse matrix-vector multiply   Load imbalance
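To make the table concrete, here is a minimal sketch of the kind of kernel listed above, namely a vector operation (vecop) in C with OpenMP; the specific operation (an AXPY-style update), the array names, and the problem size are illustrative assumptions, not the Mont-Blanc benchmark source.

```c
#include <stdio.h>
#include <stdlib.h>

/* Vector-operation (vecop-style) kernel: y = a*x + y.
   Illustrative sketch; the evaluated suite also provides pthreads,
   OpenCL, OmpSs, and CUDA variants of each kernel. */
static void vecop(double a, const double *x, double *y, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    size_t n = 1 << 24;                 /* ~16M elements (assumption) */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (!x || !y) return 1;

    for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
    vecop(3.0, x, y, n);
    printf("y[0] = %.1f\n", y[0]);      /* expect 5.0 */

    free(x);
    free(y);
    return 0;
}
```

Compiled with gcc -fopenmp, this gives a shared-memory baseline on each SoC; the study's kernels come in the programming models listed above.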
Single-core performance and energy
• Tegra 3 is 1.4x faster than Tegra 2
  • Higher clock frequency
• Exynos 5 is 1.7x faster than Tegra 3
  • Better frequency, memory bandwidth, and core microarchitecture
• Intel Core i7 is ~3x better than ARM Cortex-A15 at maximum frequency
• ARM platforms are more energy-efficient than the Intel platform
Multicore performance and energy
• Tegra 3 is as fast as Exynos 5, and a bit more energy-efficient
  • 4-core vs. 2-core
• ARM multicores are as efficient as Intel at the same frequency
• Intel is still more energy-efficient at the highest performance
• The ARM CPU is not the major power sink in the platform
Memory bandwidth (STREAM)
• Exynos 5 improves dramatically over Tegra (4.5x)
  • Dual-channel DDR3
  • ARM Cortex-A15 sustains more in-flight cache misses
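STREAM measures sustainable memory bandwidth with simple vector loops; below is a stripped-down, triad-style sketch in C with OpenMP to show what is being measured (array size and timing method are assumptions; the official STREAM benchmark should be used for numbers comparable to the ones above).

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* STREAM-style triad: a[i] = b[i] + s * c[i].
   Bandwidth estimate = bytes moved (3 arrays) / elapsed time.
   Simplified sketch, not the official STREAM code. */
int main(void)
{
    const size_t n = 16 * 1000 * 1000;   /* large enough to defeat caches (assumption) */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    printf("Triad bandwidth: %.2f GB/s\n",
           3.0 * n * sizeof(double) / (t1 - t0) / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```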
Tibidabo: the first ARM HPC multicore cluster
• Q7 Tegra 2 module: 2 x Cortex-A9 @ 1 GHz; 2 GFLOPS; 5 W (?); 0.4 GFLOPS/W
• Q7 carrier board: 2 x Cortex-A9; 2 GFLOPS; 1 GbE + 100 MbE; 7 W; 0.3 GFLOPS/W
• 1U rackable blade: 8 nodes; 16 GFLOPS; 65 W; 0.25 GFLOPS/W
• 2 racks: 32 blade containers; 256 nodes, 512 cores; 9 x 48-port 1 GbE switches; 512 GFLOPS; 3.4 kW; 0.15 GFLOPS/W
• Proof of concept
  • It is possible to deploy a cluster of smartphone processors
  • Enables software stack development
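The efficiency figures follow from the numbers above: 2 GFLOPS / 5 W = 0.4 GFLOPS/W for the bare module, 2 / 7 ≈ 0.3 for the carrier board, 16 / 65 ≈ 0.25 for a blade, and 512 GFLOPS / 3,400 W ≈ 0.15 GFLOPS/W for the deployed system, i.e. roughly a 2.5x efficiency loss between the SoC and the full cluster once boards, interconnect, and switches are included.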
HPC system software stack on ARM
• Open-source system software stack
  • Ubuntu Linux OS
  • GNU compilers: gcc, g++, gfortran
  • Scientific libraries: ATLAS, FFTW, HDF5, ...
  • Slurm cluster management
  • Runtime libraries: MPICH2, OpenMPI
  • OmpSs toolchain
  • Performance analysis tools: Paraver, Scalasca
  • Allinea DDT 3.1 debugger ported to ARM
[Stack diagram: source files (C, C++, FORTRAN, ...) -> compilers (gcc, gfortran, ..., OmpSs) -> executables linked against scientific libraries (ATLAS, FFTW, HDF5, ...); developer tools (Paraver, Scalasca, ...); cluster management (Slurm); OmpSs runtime library (NANOS++) over MPI, GASNet, CUDA, OpenCL; Linux on CPU+GPU nodes]
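As a small end-to-end illustration of what this stack compiles and runs (a generic example, not Mont-Blanc application code), the MPI program below would be built with the MPI compiler wrapper shipped by MPICH2 or OpenMPI (typically mpicc) and launched through Slurm, e.g. srun -n 8 ./hello.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI example: each rank reports its place in the job and the
   node it runs on. Generic illustration of the toolchain only. */
int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```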
Parallel scalability
• HPC applications scale well on the Tegra 2 cluster
• Capable of exploiting enough nodes to compensate for the lower per-node performance
SoCs under study: interconnection
• NVIDIA Tegra 2: 1 GbE (PCIe); 100 Mbit (USB 2.0)
• NVIDIA Tegra 3: 1 GbE (PCIe); 100 Mbit (USB 2.0)
• Samsung Exynos 5 Dual: 1 GbE (USB 3.0); 100 Mbit (USB 2.0)
• Intel Core i7-2760QM: 1 GbE (PCIe); QDR InfiniBand (PCIe)
Interconnection network: latency
• TCP/IP adds a lot of CPU overhead
• The OpenMX driver interfaces directly with the Ethernet NIC
• The USB stack adds extra latency on top of the network stack
Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results
Interconnection network: bandwidth
• TCP/IP overhead prevents the Cortex-A9 CPU from achieving full bandwidth
• USB stack overheads prevent Exynos 5 from achieving full bandwidth, even with OpenMX
Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results
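Numbers like these are typically measured with a ping-pong microbenchmark between two nodes; a minimal MPI version is sketched below (message size and repetition count are assumptions, and in practice suites such as the OSU or Intel MPI benchmarks are used rather than hand-rolled code).

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Two-rank ping-pong: rank 0 sends a buffer, rank 1 echoes it back.
   Reports half-round-trip time and the resulting bandwidth for one
   message size. Sketch only, not a full benchmark suite. */
int main(int argc, char **argv)
{
    const int reps = 1000;
    const int bytes = 1 << 20;            /* 1 MiB message (assumption) */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * reps);

    if (rank == 0)
        printf("time per message: %.1f us, bandwidth: %.2f MB/s\n",
               one_way * 1e6, bytes / one_way / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}
```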
Interconnect vs. performance ratio

Peak interconnect bytes / FLOP    1 Gb/s    6 Gb/s    40 Gb/s
Tegra 2                           0.06      0.38      2.50
Tegra 3                           0.02      0.14      0.96
Exynos 5250                       0.02      0.11      0.74
Intel i7                          0.00      0.01      0.07

• Mobile SoCs have a low-bandwidth interconnect ...
  • 1 GbE or USB 3.0 (6 Gb/s)
• ... but the ratio to performance is similar to high-end systems
  • 40 Gb/s InfiniBand
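These ratios are simply peak interconnect bandwidth in bytes per second divided by peak FLOPS, taking the per-SoC peak figures used earlier in the deck (e.g. 2 GFLOPS for Tegra 2). For example, for Tegra 2 with 1 GbE: 1 Gb/s = 0.125 GB/s, and 0.125 / 2 ≈ 0.06 bytes per FLOP; with a 40 Gb/s link the same SoC would reach 5 / 2 = 2.5 bytes per FLOP, which is why the columns correspond to 1 GbE, USB 3.0, and InfiniBand-class networks.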