Hierarchical Locality and Parallel Programming in the Extreme Scale Era
Tarek El-Ghazawi
The George Washington University
University of Southern California
September 29, 2016

Overview
- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks

Top Ten Challenges for Exascale: Areas where research and advances are needed!
1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design & Decision
9. Resilience and Correctness
10. Scientific Productivity
Source: DoE ASCAC Subcommittee Report, Feb 2014
Legend (highlighting in the original slide): data movement and/or programming related

Technological Challenges: Combined Bandwidth and Energy Challenges for Exascale
[Figures: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 14]
- Locality and data movement matter a lot: cost (energy and time) increases rapidly with distance
- Locality and data movement are critical even at short distances, and more so at far distances

Technological Challenges: (2) Bandwidth
[Figure panels: "Growing manycore bandwidth requirements"; "Widening gap between available I/O and compute capability". Plot of Bytes/FLOP (0 to 0.35) vs. year (2011.5 to 2015.5) for Xeon Phi (Knights Corner), NVIDIA K20, K40, K80, and Xeon Phi (Knights Landing). Ref: Miller, D. A., Proceedings of the IEEE, 2009.]
- Interconnect is not keeping up with the growth in compute capability
- Many apps require 1 Byte/FLOP off-chip, which is not possible in chips of 10 TFLOPs and beyond
- Intel Knights Landing: 500 GB/s => 1/6 Byte/FLOP
- Huge bandwidth density (GB/s/μm) is needed on-chip due to the large number of cores in a small area
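The Bytes/FLOP figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are made up for this note, and the 3 TFLOP/s peak is not stated on the slide but is what the quoted 500 GB/s and 1/6 Byte/FLOP jointly imply.

```python
from fractions import Fraction

GB = 10**9       # bytes per second in 1 GB/s
TFLOP = 10**12   # floating-point operations per second in 1 TFLOP/s


def bytes_per_flop(bandwidth_bytes_per_s, flops_per_s):
    """Machine balance: bytes of off-chip traffic available per FLOP."""
    return Fraction(bandwidth_bytes_per_s, flops_per_s)


def implied_peak_flops(bandwidth_bytes_per_s, balance):
    """Peak FLOP/s implied by a bandwidth and a Bytes/FLOP ratio."""
    return Fraction(bandwidth_bytes_per_s) / Fraction(balance)


def required_bandwidth(flops_per_s, target_balance):
    """Off-chip bandwidth needed to sustain a target Bytes/FLOP ratio."""
    return Fraction(flops_per_s) * Fraction(target_balance)


# Knights Landing numbers quoted on the slide: 500 GB/s at 1/6 Byte/FLOP.
knl_peak = implied_peak_flops(500 * GB, Fraction(1, 6))   # 3 TFLOP/s (derived)
knl_balance = bytes_per_flop(500 * GB, knl_peak)           # 1/6 Byte/FLOP

# A 10 TFLOPs chip at 1 Byte/FLOP would need 10 TB/s off-chip.
needed = required_bandwidth(10 * TFLOP, 1)

print(f"implied KNL peak: {float(knl_peak) / TFLOP} TFLOP/s")
print(f"KNL balance: {knl_balance} Byte/FLOP")
print(f"bandwidth for 10 TFLOPs at 1 B/FLOP: {float(needed) / 10**12} TB/s")
```

The last number, 20 times the 500 GB/s quoted for Knights Landing, is the gap the slide refers to.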

Architectural Challenges: Architectures are becoming Deeply Hierarchical in Extreme Scale – Chips and Systems
[Figures: TILE64 chip; Cray XC40 system]

Where are Programming Models from That?
- What is a programming model?
  - An abstract virtual machine
  - A view of data and execution
  - The logical interface between architecture and applications
- Why Programming Models?
  - Decouple applications and architectures
  - Write applications that run effectively across architectures
  - Design new architectures that can effectively support legacy applications
- Programming Model Design Considerations
  - Expose modern architectural features to exploit machine power and improve performance
  - Maintain ease of use
  - Together, the two previous points mean increased productivity!

Current Programming Models and Locality Awareness
[Diagram legend: process/thread, address space]

Model                                     Locality awareness   Communication   Examples
Message Passing                           Yes                  Two-sided       MPI
Shared Memory                             No (×)               One-sided       OpenMP
Partitioned Global Address Space (PGAS)   Yes                  One-sided       UPC and Chapel

PGAS Languages Include UPC, Chapel and X10

Memory Accesses in UPC: Shared Address Translation Overheads
- Measurement of the address space overheads
- Set of micro-benchmarks measuring the different aspects separately
[Figure: two charts by type of access (private accesses and shared accesses across threads, up to THREADS-1). One chart shows time (ns), the other the % of time in memory access. Both are broken down into network time, address translation, address incrementation, and memory access. Bandwidth annotations in the figure: 4.25 MB/s, 734 MB/s, 5.25 GB/s.]

Memory Access Costs in Chapel
- Tested shared address access costs in Chapel, using Chapel syntax to test:
  - Local part of a distributed object, un-optimized: accessing local data without saying "local"
  - Local optimized: local part hand-optimized by saying "local"
  - Local and non-distributed
- Compiler optimization -> 2x faster
- Both compiler and hand optimization -> 70x faster
- Compiler optimization affects remote accesses as well
- Both UPC and Chapel require "unproductive!" hand tuning to improve local shared accesses

Fast Address Translation for PGAS
- Software solutions
  - Hand tweaking: non-productive
  - Compiler optimizations: reduced arithmetic for some straightforward cases
  - Look-up tables, full and reduced: take memory! (ICPP05)
  - TLBs ....
- Hardware solutions
  - Create hardware that understands how to traverse the PGAS memory model and supports the basic costly operations
  - Make it available through instructions that the compiler can leverage
  - Some work for UPC, little for Chapel

Hardware Support for PGAS
- Example operations for support in hardware
  - Shared address incrementing
  - Load/store to/from a PGAS shared address
  - Address translation support: convert a shared address to a system virtual address used to perform the access
  - Locality tests for remote data
    - Can be used to tell whether to call the network subroutines, e.g. by testing the affinity field in a work-sharing construct
- Made available as an ISA extension
  - New instructions used directly by the compiler
- Current hardware support and instructions only support address mapping
- Future support for remote data accesses and various types of synchronization is of interest
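To make two of these operations concrete, here is a minimal software model of shared address incrementing and a locality test. It is a sketch under stated assumptions, not the proposed hardware or any UPC runtime's code: the `SharedPtr` fields (thread, phase, virtual address), the function names, and the block-cyclic layout with identical local base addresses on every thread are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedPtr:
    """Hypothetical shared pointer: owning thread, phase in block, local address."""
    thread: int
    phase: int
    va: int


def increment_shared_ptr(p, k, block_size, elem_size, num_threads):
    """Advance p by k elements of a block-cyclic shared array.

    Assumes blocks are dealt round-robin to threads and that each thread
    stores its blocks contiguously in its local memory.
    """
    phase_total = p.phase + k
    thread_total = p.thread + phase_total // block_size
    new_phase = phase_total % block_size
    new_thread = thread_total % num_threads
    # Each full trip around all threads moves one block further in local memory.
    local_blocks = thread_total // num_threads
    new_va = p.va + ((new_phase - p.phase) + local_blocks * block_size) * elem_size
    return SharedPtr(new_thread, new_phase, new_va)


def is_local(p, my_thread):
    """Locality test: does the pointed-to element have affinity to my_thread?"""
    return p.thread == my_thread


# Assumed example: block size 4, 4-byte elements, 4 threads.
start = SharedPtr(thread=0, phase=0, va=0x3F10)
tenth = increment_shared_ptr(start, 10, block_size=4, elem_size=4, num_threads=4)
print(tenth.thread, tenth.phase, hex(tenth.va))
print("local to thread 2:", is_local(tenth, 2))
print("local to thread 0:", is_local(tenth, 0))
```

The divisions and modulo operations in the increment are the kind of per-access arithmetic that makes a software-only implementation costly, which is the motivation for moving it into instructions.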

Hardware/Software Co-Design Platform in a Nutshell
- First prototype in FPGAs, supports small core counts and apps
- Second is primarily software, supports bigger core counts and codes
[Diagram: two tool stacks with the same layers]
- Benchmarking kernels: UPC code out of the box
- BUPC: new instructions inserted into code generation
- GASNet: a runtime system that recognizes and enforces the developed mapping, ported on top of Leon3 (first stack) and Gem5 (second stack)
- Leon3 cores / Gem5: extended with the proposed PGAS hardware support for shared addressing
- Host: Virtex-6 FPGA (first stack); workstation (second stack), with a cluster as future work

PGAS Hardware Support Overview
shared [4] int arrayA[32]; arrayA[10] = 5;

Layout of arrayA across threads:
Thread 0:   0  1  2  3   16 17 18 19
Thread 1:   4  5  6  7   20 21 22 23
Thread 2:   8  9 10 11   24 25 26 27
Thread 3:  12 13 14 15   28 29 30 31

Steps shown in the diagram:
- Shared pointer representation of arrayA: Th=0 Ph=0 Va=0x3f10
- pgas_inc_{x} (address incrementation): Th=2 Ph=2 Va=0x3f18
- pgas_st_{x} (address translation/store): regular pointer representation 0xfff01203f14
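The thread, phase, and address values on this slide can be reproduced in software. The sketch below maps an element index of a `shared [4] int` array to its owner. It assumes 4 threads (as drawn), 4-byte `int` elements, and a local base address of 0x3f10 on every thread; the function name `locate` is made up for this note, and the final translation to the system virtual address is left out because the slide does not give the per-thread base mapping.

```python
def locate(index, base_va, block_size, elem_size, num_threads):
    """Map an array index to (thread, phase, local virtual address).

    Assumes a block-cyclic layout: consecutive blocks of block_size
    elements are dealt round-robin to the threads.
    """
    block = index // block_size           # global block number
    thread = block % num_threads          # owner of that block
    phase = index % block_size            # position inside the block
    local_block = block // num_threads    # how many earlier blocks the owner holds
    va = base_va + (local_block * block_size + phase) * elem_size
    return thread, phase, va


# arrayA[10] from the slide: shared [4] int arrayA[32]
thread, phase, va = locate(10, base_va=0x3F10, block_size=4, elem_size=4, num_threads=4)
print(f"Th={thread} Ph={phase} Va={hex(va)}")  # Th=2 Ph=2 Va=0x3f18

# Rebuild the per-thread layout drawn on the slide.
layout = {t: [] for t in range(4)}
for i in range(32):
    owner, _, _ = locate(i, 0x3F10, 4, 4, 4)
    layout[owner].append(i)
for t, elems in layout.items():
    print(f"Thread {t}: {elems}")
```

Under these assumptions the result for index 10 matches the slide's Th=2 Ph=2 Va=0x3f18, which suggests the diagram uses the same block-cyclic arithmetic.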

Early Results: NPB Kernels with HW Support
[Figure: results on Gem5, Alpha 21264]

Possible Solutions for Hierarchical Locality Exploitation
- Rewrite your code with low-level tricks to target the underlying hierarchical architecture?
  - Great performance, but not productive and non-portable
- Extend programming models with hierarchical syntax and semantics and ask programmers to worry about all of those hardware details? (make them hierarchical-locality-aware!)
  - Portable but not productive
