From Classical to Runtime Aware Architectures. Prof. Mateo Valero, BSC Director. Postgraduate Courses, Syec Workshop, 25-26 April, Madrid, 25 April 2017
Technological Achievements
• Transistor (Bell Labs, 1947): DEC PDP-1 (1959), IBM 7090 (1960)
• Integrated circuit (1958): IBM System/360 (1965), DEC PDP-8 (1965)
• Microprocessor (1971): Intel 4004
Birth of the Revolution: The Intel 4004
Introduced November 15, 1971: 108 kHz, 50 KIPS, 2,300 transistors in a 10 μm process.
Sunway TaihuLight
• SW26010 processor (Chinese design, ISA, and fab), 1.45 GHz
• Node = 260 cores (1 socket), 4 core groups, 32 GB memory
• 40,960 nodes in the system; 10,649,600 cores total
• 1.31 PB of primary memory (DDR3)
• 125.4 Pflop/s theoretical peak; 93 Pflop/s HPL, 74% of peak
• 15.3 MW, water cooled
• 3 of the 6 finalists for the Gordon Bell Award @ SC16
Top 500 Supercomputers - November 2016 (Rmax and Rpeak in Gflop/s, Power in kW)

| Rank | Name | Site | Computer | Total Cores | Rmax | Rpeak | Power | Mflops/W |
|------|------|------|----------|-------------|------|-------|-------|----------|
| 1 | Sunway TaihuLight | National Supercomputing Center in Wuxi | Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway | 10,649,600 | 93,014,593 | 125,435,904 | 15,371 | 6051.3 |
| 2 | Tianhe-2 (MilkyWay-2) | National Super Computer Center in Guangzhou | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | 1901.54 |
| 3 | Titan | DOE/SC/Oak Ridge National Laboratory | Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 | 17,590,000 | 27,112,550 | 8,209 | 2142.77 |
| 4 | Sequoia | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 1,572,864 | 17,173,224 | 20,132,659.2 | 7,890 | 2176.58 |
| 5 | Cori | DOE/SC/LBNL/NERSC | Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 622,336 | 14,014,700 | 27,880,653 | 3,939 | 3557.93 |
| 6 | Oakforest-PACS | Joint Center for Advanced High Performance Computing | PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path | 556,104 | 13,554,600 | 24,913,459 | 2,718.7 | 4985.69 |
| 7 | K computer | RIKEN Advanced Institute for Computational Science (AICS) | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705,024 | 10,510,000 | 11,280,384 | 12,659.89 | 830.18 |
| 8 | Piz Daint | Swiss National Supercomputing Centre (CSCS) | Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100 | 206,720 | 9,779,000 | 15,987,968 | 1,312 | 7453.51 |
| 9 | Mira | DOE/SC/Argonne National Laboratory | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | 786,432 | 8,586,612 | 10,066,330 | 3,945 | 2176.58 |
| 10 | Trinity | DOE/NNSA/LANL/SNL | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | 301,056 | 8,100,900 | 11,078,861 | 4,232.63 | 1913.92 |
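The derived columns of the list can be checked against the raw ones. A small sketch (assuming, as the magnitudes imply, that Rmax/Rpeak are listed in Gflop/s and Power in kW):

```python
# Sanity-check the derived columns of the November 2016 Top500 table.
# With Rmax/Rpeak in Gflop/s and Power in kW,
# Mflops/W = (Rmax * 1000 Mflop/Gflop) / (Power * 1000 W/kW) = Rmax / Power.

systems = {
    # name: (rmax_gflops, rpeak_gflops, power_kw)
    "Sunway TaihuLight": (93_014_593, 125_435_904, 15_371),
    "Tianhe-2":          (33_862_700,  54_902_400, 17_808),
    "Titan":             (17_590_000,  27_112_550,  8_209),
}

for name, (rmax, rpeak, power) in systems.items():
    mflops_per_watt = rmax / power
    hpl_efficiency = rmax / rpeak          # fraction of theoretical peak
    print(f"{name}: {mflops_per_watt:.1f} Mflops/W, "
          f"{hpl_efficiency:.0%} of peak")
```

This reproduces the table's power-efficiency column (e.g. 6051.3 Mflops/W for TaihuLight) and the 74%-of-peak figure quoted on the TaihuLight slide.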
Performance Development of HPC over the Last 23 Years from the Top500
[Chart: performance on a log scale, 1994-2016. SUM line: 1.17 TFlop/s to 567 PFlop/s; N=1 line: 59.7 GFlop/s to 93 PFlop/s; N=500 line: 400 MFlop/s to 286 TFlop/s.]
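The SUM line of the chart implies a remarkably steady growth rate, which can be computed directly from its endpoints:

```python
# Average annual growth of aggregate Top500 performance implied by the
# chart: the SUM line goes from 1.17 Tflop/s (1994) to 567 Pflop/s (2016).

start, end, years = 1.17e12, 567e15, 2016 - 1994
annual_growth = (end / start) ** (1 / years)
print(f"average growth: {annual_growth:.2f}x per year")   # roughly 1.8x
```

Roughly 1.8x per year, i.e. the list as a whole historically outpaced Moore's law by combining device scaling with ever-larger machines.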
Supercomputer Performance Road Map
Our origins... Plan Nacional de Investigación
High-Performance Computing group @ Computer Architecture Department (UPC), through CEPBA, CIRI and BSC.
National projects, 1987-2015:
• Parallelism Exploitation in High Speed Architectures (PA85-0314)
• High-Speed Low-Cost Parallel Architecture Design (TIC89-299)
• Architectures and Compilers for Supercomputers (TIC92-880)
• High Performance Computing (TIC95-429)
• High Performance Computing II (TIC98-511-C02-01)
• High Performance Computing III (TIC2001-995-C02-01)
• High Performance Computing IV (TIN2004-07739-C02-01)
• High Performance Computing V (TIN2007-60625)
• High Performance Computing VI (TIN2012-34557)
Industrial collaborations: INTEL, MICROSOFT, SAMSUNG, IBM, NVIDIA, COMPAQ, IBERDROLA, REPSOL.
We have come a long way... (CEPBA/BSC systems, 1985-2010)
• Parsys Multiprocessor (Transputer cluster)
• Parsytec CCi-8D (4.45 Gflop/s)
• Maricel (14.4 Tflop/s, 20 kW)
• Compaq GS-140 and GS-160 (12.5 and 23.4 Gflop/s)
• BULL NovaScale 5160 (48 Gflop/s)
• Convex C3800
• Connection Machine CM-200 (0.64 Gflop/s)
• SGI Origin 2000 (32 Gflop/s)
• SGI Altix 4700 (819.2 Gflop/s)
• SL8500 tape library (6 Petabytes)
• IBM RS-6000 SP & IBM p630 (192+144 Gflop/s)
• MareNostrum, IBM PP970 / Myrinet (42.35, then 94.21 Tflop/s)
• Research prototypes
Barcelona Supercomputing Center. Centro Nacional de Supercomputación
BSC-CNS objectives: supercomputing services to Spanish and EU researchers; R&D in Computer, Life, Earth and Engineering Sciences; PhD programme, technology transfer, public engagement.
BSC-CNS is a consortium: 60% Spanish Government, 30% Catalan Government, 10% Univ. Politècnica de Catalunya (UPC).
Barcelona Supercomputing Center. Centro Nacional de Supercomputación
475 people from 44 countries (as of 31 December 2016).
Competitive project funding secured (2005 to 2017): Europe 71.9 M€, National 34 M€, Companies 38.9 M€, Total 144.8 M€ (information compiled 16/01/2017).
The MareNostrum 3 Supercomputer
Over 10^15 floating point operations per second: nearly 50,000 cores, 100.8 TB of main memory, 3 PB of disk storage.
Usage: 70% PRACE, 24% RES, 6% BSC-CNS.
The MareNostrum 4 Supercomputer
• Total peak performance: 13.7 Pflop/s, 12 times more powerful than MareNostrum 3
• Compute, General Purpose (for the current BSC workload): more than 11 Pflop/s, with 3,456 nodes of Intel Xeon V5 processors
• Compute, Emerging Technologies (for evaluation of 2020 Exascale systems): 3 systems, each of more than 0.5 Pflop/s, with KNL/KNH, Power+NVIDIA, and ARMv8
• Network: IB EDR/OPA, Ethernet
• Storage: more than 10 PB of GPFS, Elastic Storage System
• Operating System: SuSE
Mission of BSC Scientific Departments
• Computer Sciences: to influence the way machines are built, programmed and used: computer architecture, programming models, performance tools, Big Data, Artificial Intelligence
• Earth Sciences: to develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications
• Life Sciences: to understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics)
• CASE: to develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations)
Design of Superscalar Processors: Decoupled from the Software Stack
Programs are "decoupled" from the hardware: the sequential ISA is the simple interface. Applications are written as sequential programs, and the processor extracts instruction-level parallelism (ILP) at runtime.
Latency Has Been a Problem from the Beginning...
[Pipeline diagram: Fetch, Decode, Rename, Instruction Window, Wakeup+Select, Register file, Bypass, Data Cache, Register Write, Commit]
• Feeding the pipeline with the right instructions:
• HW/SW trace cache (ICS'99)
• Prophet/Critic Hybrid Branch Predictor (ISCA'04)
• Locality/reuse:
• Cache Memory with Hybrid Mapping (IASTED'87), Victim Cache
• Dual Data Cache (ICS'95)
• A novel renaming mechanism that boosts software prefetching (ICS'01)
• Virtual-Physical Registers (HPCA'98)
• Kilo-Instruction Processors (ISHPC'03, HPCA'06, ISCA'08)
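The victim cache idea above can be sketched in a few lines: a direct-mapped cache whose conflict victims get a second chance in a small fully associative buffer. The sizes below are illustrative, not those of any published design:

```python
from collections import OrderedDict

# Toy sketch of a victim cache: a direct-mapped L1 backed by a tiny,
# fully associative, LRU-managed victim buffer. Lines evicted from L1
# by conflicts get a second chance in the buffer.

class VictimCache:
    def __init__(self, l1_sets=4, victim_entries=2):
        self.l1 = [None] * l1_sets          # direct-mapped: one tag per set
        self.victim = OrderedDict()         # tag -> None, kept in LRU order
        self.victim_entries = victim_entries
        self.hits = self.misses = 0

    def access(self, addr):
        s = addr % len(self.l1)
        if self.l1[s] == addr:
            self.hits += 1
            return "L1 hit"
        if addr in self.victim:             # swap with the conflicting line
            del self.victim[addr]
            self._evict_to_victim(self.l1[s])
            self.l1[s] = addr
            self.hits += 1
            return "victim hit"
        self.misses += 1
        self._evict_to_victim(self.l1[s])
        self.l1[s] = addr
        return "miss"

    def _evict_to_victim(self, tag):
        if tag is None:
            return
        self.victim[tag] = None
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)  # drop the LRU entry

# Addresses 0 and 4 map to the same set (4 sets) and would ping-pong in
# a plain direct-mapped cache; here the alternating accesses hit in the
# victim buffer after the two compulsory misses.
c = VictimCache()
results = [c.access(a) for a in [0, 4, 0, 4, 0, 4]]
print(results)
```

The point of the original proposal is exactly this pattern: conflict misses in a direct-mapped cache turn into short-latency victim-buffer hits.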
… and the Power Wall Appeared Later
[Pipeline diagram: Fetch, Decode, Rename, Instruction Window, Wakeup+Select, Register file, Bypass, Data Cache, Register Write, Commit]
• Better technologies
• Two-level organization (locality exploitation):
• Register file for superscalar (ISCA'00)
• Instruction queues (ICCD'05)
• Load/Store queues (ISCA'08)
• Direct Wakeup, pointer-based instruction queue design (ICCD'04, ICCD'05)
• Content-aware register file (ISCA'09)
• Fuzzy computation (ICS'01, IEEE CAL'02, IEEE TC'05), currently known as Approximate Computing
Fuzzy Computation: Performance @ Low Power
[Figure: the original image (bmp, binary systems) next to a fuzzy-computed one (jpeg, compression protocols). The fuzzy version trades accuracy and size for efficiency: it used only ~85% of the time while consuming ~75% of the power.]
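One way to make the trade-off concrete is to truncate low-order mantissa bits, as a narrowed floating-point datapath would. The 12-bit truncation below is an illustrative choice, not the scheme from the cited papers:

```python
import struct

# Toy illustration of fuzzy (approximate) computation: zero out the
# low-order mantissa bits of a float32 operand, as a narrower functional
# unit would. Dropping 12 of the 23 mantissa bits bounds the per-value
# relative error by about 2**-11, i.e. under 0.05%.

def truncate_mantissa(x, drop_bits=12):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= ~((1 << drop_bits) - 1)          # clear low mantissa bits
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y

exact = sum(1.0 / n for n in range(1, 1000))
fuzzy = sum(truncate_mantissa(1.0 / n) for n in range(1, 1000))
rel_error = (exact - fuzzy) / exact
print(f"exact={exact:.6f} fuzzy={fuzzy:.6f} rel. error={rel_error:.4%}")
```

The accumulated error stays well below 0.1% for this sum, which is the kind of accuracy loss the slide's jpeg-style example tolerates in exchange for time and power.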
SMT and Memory Latency …
[Pipeline diagram, with Fetch and Commit replicated for Thread 1 … Thread N]
• Simultaneous Multithreading (SMT)
• Benefits of SMT processors:
• Increased core resource utilization
• Basic pipeline unchanged: few replicated resources, others shared
• Some of our contributions:
• Dynamically Controlled Resource Allocation (MICRO 2004)
• Quality of Service (QoS) in SMTs (IEEE TC 2006)
• Runahead Threads for SMTs (HPCA 2008)
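The flavor of dynamically controlled resource allocation can be sketched as follows. This is a simplification with made-up numbers, not the actual MICRO 2004 policy: threads stalled on cache misses get a larger share of a shared queue so they can expose memory-level parallelism:

```python
# Simplified sketch of DCRA-style resource sharing in an SMT core.
# The sharing rule and numbers are illustrative: threads currently
# blocked on cache misses ("slow" threads) receive a double share of
# the shared instruction-queue entries; "fast" threads split the rest.

def allocate_entries(total_entries, threads):
    # threads: list of dicts with "name" and a boolean "pending_miss"
    slow = [t for t in threads if t["pending_miss"]]
    fast = [t for t in threads if not t["pending_miss"]]
    shares = 2 * len(slow) + len(fast)
    per_share = total_entries // shares
    alloc = {t["name"]: 2 * per_share for t in slow}
    alloc.update({t["name"]: per_share for t in fast})
    return alloc

threads = [
    {"name": "T0", "pending_miss": True},    # blocked on a long L2 miss
    {"name": "T1", "pending_miss": False},
    {"name": "T2", "pending_miss": False},
]
print(allocate_entries(32, threads))
```

With a 32-entry queue this yields 16 entries for the miss-bound thread and 8 for each of the others; the real policy recomputes such shares dynamically as thread behavior changes.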
Time Predictability (in multicore and SMT processors)
QoS definition: the ability to provide a minimum performance to a task; it requires biasing processor resource allocation.
Where is it required: increasingly in handheld/desktop devices, and in embedded hard real-time systems (cars, planes, trains, …).
How to achieve it: by controlling how resources are assigned to co-running tasks.
• Soft real-time systems: SMT: DCRA resource allocation policy (MICRO 2004, IEEE Micro 2004); multicores: cache partitioning (ACM OSR 2009, IEEE Micro 2009)
• Hard real-time systems: deterministic resource 'securing' (ISCA 2009); time-randomised designs (DAC 2014, best paper award)
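Cache way-partitioning, one mechanism behind the multicore QoS work above, can be sketched per cache set: each task owns a fixed number of ways, so a streaming co-runner cannot evict a critical task's working set. Sizes and task names are illustrative:

```python
from collections import OrderedDict

# Toy sketch of a way-partitioned shared-cache set for QoS: each task
# owns a fixed number of ways and evicts only within its own partition,
# so a streaming task cannot displace a latency-critical task's lines.

class PartitionedSet:
    def __init__(self, ways_per_task):
        # one LRU stack per task, each capped at that task's way count
        self.parts = {t: OrderedDict() for t in ways_per_task}
        self.ways = ways_per_task

    def access(self, task, addr):
        part = self.parts[task]
        if addr in part:
            part.move_to_end(addr)           # LRU update
            return True                      # hit
        part[addr] = None
        if len(part) > self.ways[task]:
            part.popitem(last=False)         # evict within own partition
        return False                         # miss

s = PartitionedSet({"critical": 2, "streaming": 2})
s.access("critical", "A")
s.access("critical", "B")
# The streaming task touches 100 distinct lines, but only its own
# two ways churn; the critical task's lines survive.
for n in range(100):
    s.access("streaming", n)
print(s.access("critical", "A"))             # still a hit
```

Without the partition, an LRU-shared set would have evicted "A" long before the re-access, which is exactly the interference the cited cache-partitioning work removes to guarantee minimum per-task performance.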