Hierarchical Locality and Parallel Programming in the Extreme Scale Era

Tarek El-Ghazawi, The George Washington University. University of Southern California, September 29, 2016.


  1. Hierarchical Locality and Parallel Programming in the Extreme Scale Era
     Tarek El-Ghazawi, The George Washington University
     University of Southern California, September 29, 2016

  2. Overview
     - Fundamental Challenges for Extreme Computing
     - Locality and Hierarchical Locality
     - Programming Models
     - Hardware Support for Productive Locality Exploitation: Address Remapping
     - Hierarchical Locality Exploitation
     - Concluding Remarks

  3. Top Ten Challenges for Exascale: Areas Where Research and Advances Are Needed
     1. Energy Efficiency
     2. Interconnect Technology
     3. Memory Technology
     4. Scalable System Software
     5. Programming Systems
     6. Data Management
     7. Exascale Algorithms
     8. Algorithms for Discovery, Design & Decision
     9. Resilience and Correctness
     10. Scientific Productivity
     Source: DoE ASCAC Subcommittee Report, Feb 2014
     Legend (highlighted items): data movement and/or programming related

  4. Technological Challenges: Combined Bandwidth and Energy Challenges for Exascale
     [Figures: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 14]
     - Locality and data movement matter a lot: cost (energy and time) increases rapidly with distance
     - Locality and data movement are critical even at short distances, and more so at far distances

  5. Technological Challenges: (2) Bandwidth
     - Widening gap between available I/O bandwidth and compute capability
     - Growing manycore bandwidth requirements
     [Figure: Bytes/FLOP vs. year, 2011.5 to 2015.5, for Xeon Phi (Knights Corner), NVIDIA K20, K40, K80 and Xeon Phi (Knights Landing). Ref: Miller, D. A., Proceedings of the IEEE, 2009]
     - Interconnect is not keeping up with the growth in compute capability
     - Many apps require 1 Byte/FLOP off-chip, which is not possible in chips of 10 TFLOPs and beyond
     - Intel Knights Landing: 500 GB/s => 1/6 Byte/FLOP
     - Huge bandwidth density (GB/s/μm) is needed on-chip due to the large number of cores in a small area
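
The Byte/FLOP figures on slide 5 are simple ratios of off-chip bandwidth to compute rate. The sketch below reproduces that arithmetic. The 500 GB/s value and the 1 Byte/FLOP target come from the slide; the 3 TFLOP/s peak rate is an assumption, chosen because it is what the slide's 1/6 ratio implies. The function names are illustrative only.

```python
def bytes_per_flop(bandwidth_bytes_per_s, flops_per_s):
    """Ratio of off-chip bandwidth to compute rate."""
    return bandwidth_bytes_per_s / flops_per_s


def required_bandwidth(flops_per_s, target_bytes_per_flop=1.0):
    """Bandwidth (bytes/s) needed to sustain a given Byte/FLOP target."""
    return flops_per_s * target_bytes_per_flop


KNL_BANDWIDTH = 500e9   # 500 GB/s, from the slide
KNL_PEAK_FLOPS = 3e12   # ASSUMPTION: ~3 TFLOP/s, implied by the slide's 1/6 ratio

ratio = bytes_per_flop(KNL_BANDWIDTH, KNL_PEAK_FLOPS)
needed = required_bandwidth(10e12)  # a 10 TFLOP/s chip at 1 Byte/FLOP

print(f"Knights Landing: {ratio:.3f} Byte/FLOP")           # 0.167, i.e. 1/6
print(f"10 TFLOP/s chip needs {needed / 1e12:.0f} TB/s")   # 10 TB/s
```

Under these assumptions a 10 TFLOP/s chip would need about 10 TB/s off-chip to reach 1 Byte/FLOP, twenty times the 500 GB/s quoted for Knights Landing.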

  7. Architectural Challenges: Architectures Are Becoming Deeply Hierarchical in Extreme Scale, in Both Chips and Systems

  11. Architectural Challenges (continued)
      [Figure: Cray XC40]

  12. Architectural Challenges (continued)
      [Figures: TILE64; Cray XC40]

  14. Where Are Programming Models in All of That?
      - What is a programming model?
        - An abstract virtual machine
        - A view of data and execution
        - The logical interface between architecture and applications
      - Why programming models?
        - Decouple applications and architectures
        - Write applications that run effectively across architectures
        - Design new architectures that can effectively support legacy applications
      - Programming model design considerations
        - Expose modern architectural features to exploit machine power and improve performance
        - Maintain ease of use
        - Together, the two previous points mean increased productivity!

  15. Current Programming Models and Locality Awareness
      [Figure: processes/threads and their address spaces under each model]
      - Message Passing: locality aware; two-sided communication; example: MPI
      - Shared Memory: no locality awareness (×); one-sided communication; example: OpenMP
      - Partitioned Global Address Space: locality aware; one-sided communication; examples: UPC and Chapel

  16. PGAS Languages Include UPC, Chapel and X10

  18. Memory Accesses in UPC: Shared Address Translation Overheads
      - Measurement of the address space overheads
      - Set of micro-benchmarks measuring the different aspects separately
      [Figure: two charts by type of access (private and shared, by owning thread up to THREADS-1). One plots time (ns), the other % time in memory access, each broken down into memory access, address incrementation, address translation and network time. Bandwidth labels on the chart: 5.25 GB/s, 734 MB/s, 4.25 MB/s]

  19. Memory Access Costs in Chapel
      - Tested shared address access costs in Chapel, using Chapel syntax to test:
        - Local part of a distributed object, un-optimized: accessing local data without saying "local"
        - Local optimized: local part hand-optimized by saying "local"
        - Local and non-distributed
      - Compiler optimization -> 2x faster
      - Both compiler and hand optimization -> 70x faster
      - Compiler optimization affects remote accesses as well
      - Both UPC and Chapel require "unproductive!" hand tuning to improve local shared accesses

  20. Fast Address Translation for PGAS
      - Software solutions
        - Hand tweaking: non-productive
        - Compiler optimizations: reduced arithmetic for some straightforward cases
        - Lookup tables, full and reduced: take memory! (ICPP05)
        - TLBs ....
      - Hardware solutions
        - Create hardware that understands how to traverse the PGAS memory model and supports its basic costly needs
        - Make it available through instructions that the compiler can leverage
        - Some work for UPC, little for Chapel

  21. Hardware Support for PGAS
      - Example operations for support in hardware
        - Shared address incrementing
        - Load/store to/from a PGAS shared address
          - Address translation support: convert a shared address to a system virtual address used to perform the access
        - Locality tests for remote data
          - Can be used to tell whether to call the network subroutines, e.g. by testing the affinity field in a work-sharing construct
      - Made available as an ISA extension
        - New instructions used directly by the compiler
      - Current hardware support and instructions only support address mapping
      - Future support for remote data accesses and various types of synchronization is of interest
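
To make the three operations on slide 21 concrete, here is a minimal software sketch of what such instructions would compute: incrementing a shared address, testing locality, and translating to a system virtual address. It is an illustration under stated assumptions, not the hardware design from the talk. The `SharedPtr` fields mirror a (thread, phase, address) triple; the block-cyclic layout, the symmetric per-thread segments, the `SEGMENT_BASE` table and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedPtr:
    thread: int  # thread the element has affinity to
    phase: int   # position inside the current block
    va: int      # ASSUMPTION: offset inside that thread's shared segment


def shared_inc(p, n, block, threads, elem_size):
    """Advance p by n >= 0 elements in block-cyclic order."""
    total_phase = p.phase + n
    new_phase = total_phase % block
    total_thread = p.thread + total_phase // block
    new_thread = total_thread % threads
    wraps = total_thread // threads  # full passes over all threads
    # The local offset moves by the phase change, plus one block per wrap.
    new_va = p.va + ((new_phase - p.phase) + wraps * block) * elem_size
    return SharedPtr(new_thread, new_phase, new_va)


def is_local(p, my_thread):
    """Locality test: does the element have affinity to the calling thread?"""
    return p.thread == my_thread


def translate(p, segment_base):
    """Convert a shared address to a system virtual address."""
    return segment_base[p.thread] + p.va


BLOCK, THREADS, ELEM_SIZE = 4, 4, 4
# Hypothetical base address of each thread's shared segment.
SEGMENT_BASE = [0x10000000 * (t + 1) for t in range(THREADS)]

start = SharedPtr(thread=0, phase=0, va=0x100)
p = shared_inc(start, 10, BLOCK, THREADS, ELEM_SIZE)
print(p, is_local(p, 2), hex(translate(p, SEGMENT_BASE)))
```

A runtime would call the network path only when `is_local` is false, which is the decision the slide describes for the affinity test.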

  22. Hardware/Software Co-Design Platform in a Nutshell
      - The first prototype is in FPGAs and supports small core counts and apps
      - The second is primarily software and supports bigger core counts and codes
      [Figure: two stacks. FPGA prototype: benchmarking kernels, BUPC, GASNet, ported on top of Leon3 cores on a Virtex-6 FPGA. Software prototype: benchmarking kernels, BUPC, GASNet, ported on top of Gem5 on a workstation (cluster in the future). Annotations: UPC code out of the box; new instructions inserted into code generation; a runtime system that recognizes and enforces the developed mapping; cores extended with the proposed PGAS hardware support for shared addressing]

  23. PGAS Hardware Support Overview
      shared [4] int arrayA[32];
      arrayA[10] = 5;
      [Figure: layout of arrayA across four threads, and the steps of the store]
      - Thread 0: elements 0-3 and 16-19
      - Thread 1: elements 4-7 and 20-23
      - Thread 2: elements 8-11 and 24-27
      - Thread 3: elements 12-15 and 28-31
      - Shared pointer representation: Th=0 Ph=0 Va=0x3f10
      - pgas_inc_{x} (address incrementation) gives Th=2 Ph=2 Va=0x3f18
      - pgas_st_{x} (address translation/store) uses the regular pointer representation, 0xfff01203f14
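
The layout on slide 23 follows from the block-cyclic distribution declared by `shared [4] int arrayA[32]`. The sketch below recomputes it in plain Python as a worked example. It assumes 4 threads (as drawn on the slide), 4-byte ints, and that every thread's local part of the array starts at the same address 0x3f10; the helper name `locate` is illustrative.

```python
BLOCK = 4            # block size from "shared [4]"
THREADS = 4          # number of threads drawn on the slide
ELEM_SIZE = 4        # ASSUMPTION: sizeof(int) == 4
LOCAL_BASE = 0x3f10  # ASSUMPTION: same local start address on every thread


def locate(i):
    """Map array index i to (thread, phase, local virtual address)."""
    thread = (i // BLOCK) % THREADS
    phase = i % BLOCK
    # Each full pass over all threads adds one more block to the local part.
    local_index = (i // (BLOCK * THREADS)) * BLOCK + phase
    return thread, phase, LOCAL_BASE + local_index * ELEM_SIZE


# Elements owned by each thread, in local storage order.
layout = {t: [i for i in range(32) if locate(i)[0] == t] for t in range(THREADS)}

th, ph, va = locate(10)
print(f"arrayA[10] -> Th={th} Ph={ph} Va={va:#x}")  # Th=2 Ph=2 Va=0x3f18
for t, elems in layout.items():
    print(f"Thread {t}: {elems}")
```

With these assumptions, element 10 lands on thread 2 at phase 2 with Va=0x3f18, matching the shared pointer shown after the increment step.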

  24. Early Results: NPB Kernels with HW Support (Gem5, Alpha 21264)

  26. Possible Solutions for Hierarchical Locality Exploitation
      - Rewrite your code with low-level tricks to target the underlying hierarchical architecture?
        - Great performance, but not productive and not portable
      - Extend programming models with hierarchical syntax and semantics, and ask programmers to worry about all of those hardware details (make them hierarchical-locality-aware)?
        - Portable but not productive
