
Lightweight Requirements Engineering for Exascale Co-design



  1. Lightweight Requirements Engineering for Exascale Co-design Felix Wolf, TU Darmstadt Application System 11/17/19 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 1

  2. Acknowledgement
     • Alexandru Calotoiu, TU Darmstadt
     • Alexander Graf, TU Darmstadt
     • Torsten Hoefler, ETH Zurich
     • Daniel Lorenz, TU Darmstadt
     • Sergei Shudler, TU Darmstadt
     • Sebastian Rinke, TU Darmstadt

  3. Co-design [Figure labels: Workload, System, Better algorithms]

  4. Current performance might be deceptive… [Figure labels: Computation, Communication]

  5. Hardware-specific performance models [Figure labels: Application 1; Performance model 1.1, Performance model 1.2, Performance model 1.3, …; System 1, System 2, System n]

  6. Application-centric requirements models [Figure labels: Application 1; Requirements model 1; System 1, System 2, System n]

  7. Data metabolism at the hardware / software interface [Figure labels: Application, Hardware]

  8. Hardware-independent requirement metrics
     • Memory: #Bytes used, #Loads & stores, Stack distance
     • CPU: #FLOPS
     • Network: #Bytes sent & received

  9. Requirements model of an application
     Set of functions r_i(p, n), with each r_i representing one of the requirement metrics
     p = #processes
     n = input size per process
     • All metrics refer to a single process
     • We model neither time nor energy
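One way to picture such a model is as a mapping from metric names to functions of p and n. The sketch below is only an illustration: the dictionary layout, the metric names, and all coefficients are hypothetical placeholders, not models taken from the talk.

```python
import math
from typing import Callable, Dict

# A requirements model: one function r_i(p, n) per requirement metric.
# All values refer to a single process; time and energy are not modeled.
RequirementsModel = Dict[str, Callable[[int, int], float]]

# Hypothetical example; the functional forms and constants are made up.
example_model: RequirementsModel = {
    "flops": lambda p, n: 100.0 * n * math.log2(p),
    "bytes_sent_received": lambda p, n: 8.0 * n,
    "bytes_used": lambda p, n: 64.0 * n,
    "loads_stores": lambda p, n: 10.0 * n,
    "stack_distance": lambda p, n: 32.0,
}


def evaluate(model: RequirementsModel, p: int, n: int) -> Dict[str, float]:
    """Evaluate every r_i at p processes and input size n per process."""
    return {name: r(p, n) for name, r in model.items()}
```

Calling `evaluate(example_model, p, n)` returns the per-process requirement for every metric at the chosen scale.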

  10. Lightweight requirements engineering for (exascale) co-design
      1. Collect portable requirement metrics
      2. Derive requirement models
      3. Extrapolate to new system

  11. Collection of requirement metrics
      Requirement | Metric
      Computation | # Floating-point operations
      Network comm. | # Bytes sent & received
      Memory footprint | # Bytes used
      Memory access | # Loads & stores
      Memory locality | Stack distance
      Profiling tools named on the slide: getrusage() (memory footprint), Threadspotter (memory access / locality)
      Collection single-threaded (#FLOPS roughly independent of #threads)
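The slide names getrusage() as the source of the memory footprint. A minimal sketch of reading the peak footprint through that call, here via Python's standard `resource` module (Unix only), could look as follows. The helper name `peak_memory_bytes` is an assumption for illustration; how the authors actually invoke getrusage() is not stated in the slides.

```python
import resource
import sys


def peak_memory_bytes() -> int:
    """Peak resident set size of the current process, in bytes."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss * scale


if __name__ == "__main__":
    buffer = bytearray(50 * 1024 * 1024)  # touch some memory first
    print("peak footprint [bytes]:", peak_memory_bytes())
```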

  12. Modeling locality
      Reuse distance vs. stack distance (Paratools Threadspotter)
      Access sequence: A B C B A
      • Second access to B: Reuse distance = 1, Stack distance = 1
      • Second access to A: Reuse distance = 3, Stack distance = 2
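The two distances can be checked on the slide's access sequence with a few lines of code. This is a naive quadratic sketch for illustration, not how a profiler such as Threadspotter computes them; the function name is hypothetical. Reuse distance counts all accesses between two uses of the same address, stack distance counts only the distinct addresses among them.

```python
def reuse_and_stack_distances(trace):
    """Return (address, reuse distance, stack distance) for each access.

    First-time accesses have no previous use and get (address, None, None).
    """
    last_seen = {}
    result = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Accesses strictly between the previous use and this one.
            between = trace[last_seen[addr] + 1 : i]
            result.append((addr, len(between), len(set(between))))
        else:
            result.append((addr, None, None))
        last_seen[addr] = i
    return result


if __name__ == "__main__":
    for entry in reuse_and_stack_distances(list("ABCBA")):
        print(entry)
```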

  13. Automatic empirical performance modeling with Extra-P
      Input: small-scale measurements of the instrumented program (main() { foo() bar() compute() })
      Output: human-readable, multi-parameter performance models of the form
      f(x_1, \ldots, x_m) = \sum_{k=1}^{n} c_k \prod_{l=1}^{m} x_l^{i_{kl}} \cdot \log_2^{j_{kl}}(x_l)
      A. Calotoiu, et al.: Fast Multi-Parameter Performance Modeling (CLUSTER ’16)
      www.scalasca.org/software/extra-p/download.html
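To make the model form concrete, the sketch below evaluates a function given as a sum of terms, each a coefficient c_k times a product of x_l^i · log2(x_l)^j factors. It only evaluates a given model; it does not reproduce Extra-P's model search or its API, and the data layout and example coefficients are assumptions chosen for illustration.

```python
import math


def evaluate_model(terms, x):
    """Evaluate f(x_1..x_m) = sum_k c_k * prod_l x_l^i_kl * log2(x_l)^j_kl.

    terms: list of (c_k, [(i_k1, j_k1), ..., (i_km, j_km)])
    x:     sequence of m positive parameter values
    """
    total = 0.0
    for c, exponents in terms:
        product = c
        for x_l, (i, j) in zip(x, exponents):
            product *= x_l ** i * math.log2(x_l) ** j
        total += product
    return total


if __name__ == "__main__":
    # Hypothetical two-parameter model: 5 + 3 * x_1 * log2(x_2)
    terms = [(5.0, [(0, 0), (0, 0)]), (3.0, [(1, 0), (0, 1)])]
    print(evaluate_model(terms, (10, 8)))
```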

  14. Test applications: Kripke, LULESH, MILC, icoFoam, Relearn

  15. Experimental setup
      • JUQUEEN @ Jülich Supercomputing Centre: IBM Blue Gene/Q
      • Lichtenberg @ TU Darmstadt: Intel Xeon with InfiniBand

  16. Modeling application requirements
      Models represent per-process effects
      p – number of processes
      n – problem size per process
      LULESH:
      Requirement | Metric | Model
      Computation | #FLOPs | 10^5 ⋅ n ⋅ log(n) ⋅ p^0.25 ⋅ log(p)
      Communication | #Bytes sent & received | 10^3 ⋅ n ⋅ p^0.25 ⋅ log(p)
      Memory access | #Loads & stores | 10^5 ⋅ n ⋅ log(n) ⋅ log(p)
      Memory footprint | #Bytes used | 10^5 ⋅ n ⋅ log(n)
      Memory locality | Stack distance | Constant
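The LULESH models on slide 16 can be evaluated directly to see how per-process requirements grow with scale. The sketch below assumes base-2 logarithms (the slide writes only "log") and omits the stack distance because the slide gives it as a constant without a value; the function and key names are hypothetical.

```python
import math


def lulesh_requirements(p: int, n: int) -> dict:
    """Per-process LULESH requirements for p processes, size n per process."""
    log = math.log2  # assumption: the slide does not state the log base
    return {
        "flops": 1e5 * n * log(n) * p ** 0.25 * log(p),
        "bytes_sent_received": 1e3 * n * p ** 0.25 * log(p),
        "loads_stores": 1e5 * n * log(n) * log(p),
        "bytes_used": 1e5 * n * log(n),
    }


if __name__ == "__main__":
    base = lulesh_requirements(p=1024, n=1000)
    doubled = lulesh_requirements(p=2048, n=1000)
    for metric in base:
        # Growth of each per-process requirement when p doubles at fixed n.
        print(metric, round(doubled[metric] / base[metric], 2))
```

The memory footprint does not depend on p, so its ratio stays at 1 when only the process count changes.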

  17. Determining requirements on a new system [Figure: Available sockets → # Processes; Available memory per process → Problem size per process; both → Overall problem size; Requirement models → Requirements (#FLOPS, #Bytes sent, ...)]
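A rough sketch of this workflow: pick the largest per-process problem size whose memory footprint still fits, derive the process count from the sockets, and multiply to get the overall problem size. The footprint model is the LULESH one from slide 16 with an assumed base-2 logarithm; the function names, the processes-per-socket parameter, and the example system values are hypothetical.

```python
import math


def footprint(n: int) -> float:
    """LULESH memory footprint model from slide 16 (log base 2 assumed)."""
    return 1e5 * n * math.log2(n)


def max_problem_size(mem_per_process: float, model, upper: int = 2 ** 60) -> int:
    """Largest n >= 2 with model(n) <= mem_per_process, or 0 if none fits.

    Uses bisection, so the model must grow monotonically with n.
    """
    lo, hi = 2, upper
    if model(lo) > mem_per_process:
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if model(mid) <= mem_per_process:
            lo = mid
        else:
            hi = mid - 1
    return lo


def size_on_system(sockets: int, procs_per_socket: int, mem_per_process: float):
    """Return (# processes, problem size per process, overall problem size)."""
    p = sockets * procs_per_socket
    n = max_problem_size(mem_per_process, footprint)
    return p, n, p * n


if __name__ == "__main__":
    print(size_on_system(sockets=1000, procs_per_socket=16, mem_per_process=4e9))
```

The resulting p and n can then be fed into the remaining requirement models to obtain #FLOPS, #Bytes sent, and so on for the new system.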

  18. Requirements engineering process [Figure labels: Memory capacity, Memory bandwidth, Computational performance, Network bandwidth]

  19. Case study: Three system upgrades
      • Racks x 2
      • Sockets x 2
      • Memory x 2

  20. Three upgrades: summary
      Ratios; each row lists six values, one per column of the slide's table (the apps Kripke, LULESH, MILC, Relearn, icoFoam, plus Baseline)
      System Upgrade A: Double the racks
        Problem size per process | 1 | 1 | 1 | 1 | 0.5 | 1
        Overall problem size | 2 | 2 | 2 | 2 | 1 | 2
        Computation | 1 | 1.2 | 1 | 1 | 0.5 | 1
        Communication | 1 | 1.2 | 1 | 1 | 0.7 | 1
        Memory accesses | 2 | 1.2 | 2.8 | 2 | 0.7 | 1
      System Upgrade B: Double the sockets
        Problem size per process | 0.5 | 0.5 | 0.5 | 0.3 | 0.3 | 0.5
        Overall problem size | 1 | 1 | 1 | 0.5 | 0.6 | 1
        Computation | 0.5 | 0.6 | 0.5 | 0.3 | 0.2 | 0.5
        Communication | 0.5 | 0.6 | 0.5 | 0.3 | 0.3 | 0.5
        Memory accesses | 0.5 | 1 | 1.4 | 1 | 0.5 | 0.5
      System Upgrade C: Double the memory
        Problem size per process | 2 | 1.4 | 2 | 4 | 1.4 | 2
        Overall problem size | 2 | 1.4 | 2 | 4 | 1.4 | 2
        Computation | 2 | 1.4 | 2 | 4 | 1.7 | 2
        Communication | 2 | 1.4 | 2 | 4 | 1.4 | 2
        Memory accesses | 2 | 1.4 | 2 | 4 | 1.4 | 2

  21. Case study II: Three exascale strawman systems
      Metric | Massively parallel | Vector | Hybrid
      Nodes | 2 * 10^4 | 5 * 10^4 | 10^4
      Processors | 2 * 10^9 | 5 * 10^7 | 10^8
      Processors per node | 10^5 | 10^3 | 10^4
      Memory per processor | 5 * 10^6 | 2 * 10^8 | 10^8
      Flop/s per processor | 5 * 10^8 | 2 * 10^10 | 10^10
      Character | Many but weak processors | Few but powerful processors | Moderate number of moderate processors
      Total memory: 10 PB
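The strawman figures can be cross-checked with simple arithmetic: nodes times processors per node should give the processor count, and the per-processor memory and Flop/s should add up to the system totals. The sketch below assumes memory per processor is given in bytes, which the slide does not state explicitly but which matches its 10 PB total; the dictionary layout is hypothetical.

```python
# Values copied from the strawman table; memory assumed to be in bytes.
SYSTEMS = {
    "Massively parallel": dict(nodes=2 * 10**4, procs=2 * 10**9,
                               procs_per_node=10**5,
                               mem_per_proc=5 * 10**6,
                               flops_per_proc=5 * 10**8),
    "Vector": dict(nodes=5 * 10**4, procs=5 * 10**7,
                   procs_per_node=10**3,
                   mem_per_proc=2 * 10**8,
                   flops_per_proc=2 * 10**10),
    "Hybrid": dict(nodes=10**4, procs=10**8,
                   procs_per_node=10**4,
                   mem_per_proc=10**8,
                   flops_per_proc=10**10),
}


def totals(system: dict) -> dict:
    """Derive system-wide totals from the per-processor figures."""
    return {
        "procs_from_nodes": system["nodes"] * system["procs_per_node"],
        "total_memory_bytes": system["procs"] * system["mem_per_proc"],
        "total_flops": system["procs"] * system["flops_per_proc"],
    }


if __name__ == "__main__":
    for name, system in SYSTEMS.items():
        print(name, totals(system))
```

Under the byte assumption, every system has 10^16 bytes (10 PB) of memory and 10^18 Flop/s in total, so the three designs differ only in how those totals are divided among processors.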

  22. Case study II: Three exascale strawman systems
      App | Metric | Massively parallel | Vector | Hybrid
      Kripke | Maximum overall problem size | 10^10 | 10^10 | 10^10
      Kripke | Minimum wall time for benchmark problem [s] | 0.1 | 0.1 | 0.1
      LULESH | Maximum overall problem size | 3.9 · 10^10 | 1.7 · 10^10 | 1.9 · 10^10
      LULESH | Minimum wall time for benchmark problem [s] | 40 | 21.5 | 33
      MILC | Maximum overall problem size | 10^10 | 10^10 | 10^10
      MILC | Minimum wall time for benchmark problem [s] | 10^2 | 10^2 | 10^2
      Relearn | Maximum overall problem size | 5 · 10^10 | 4 · 10^12 | 10^12
      Relearn | Minimum wall time for benchmark problem [s] | 4 | 0.02 | 0.2
      LULESH: bigger problem versus faster solution
      Relearn: vector system clear winner

  23. Summary
      Application-centric requirements models
      Automated
      • No need to integrate hardware knowledge
      • Generation via standard profiling tools
      • Memory locality taken into account
      Practical co-design process
      • Extrapolates requirements to envisaged system
      BOE co-design for large workloads
      • Points out bottlenecks on both sides

  24. Tasking
      Idea: separate problem decomposition from concurrency
      • Decompose the problem into a set of tasks and insert them into a task pool
      • Threads fetch tasks from the pool until all tasks are completed and the pool is empty. Note that a task may create new tasks
      • Advantage: good load balance if the problem is over-decomposed
      [Figure labels: create tasks, Task pool, Scheduler, assign them to threads, Thread pool, fetch tasks]
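The tasking idea can be sketched with a shared queue as the task pool and a fixed set of worker threads. This is a minimal illustration, not a specific tasking runtime: the function names and the convention that a task returns a value plus a list of newly created tasks are assumptions made for the example.

```python
import queue
import threading


def run_task_pool(initial_tasks, num_threads=4):
    """Run tasks from a shared pool until the pool is drained.

    Each task is a callable returning (value, new_tasks); new tasks are
    inserted into the pool, so a task may create further tasks.
    """
    pool = queue.Queue()
    results = []
    results_lock = threading.Lock()

    for task in initial_tasks:
        pool.put(task)

    def worker():
        while True:
            task = pool.get()
            if task is None:  # sentinel: no more work
                pool.task_done()
                return
            try:
                value, new_tasks = task()
                with results_lock:
                    results.append(value)
                # Enqueue children before marking the parent done, so
                # pool.join() cannot return while work is still pending.
                for child in new_tasks:
                    pool.put(child)
            finally:
                pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    pool.join()            # all tasks, including created ones, completed
    for _ in threads:
        pool.put(None)     # tell each worker to stop
    for t in threads:
        t.join()
    return results


def make_sum_task(lo, hi):
    """Task that sums range(lo, hi), splitting large ranges into subtasks."""
    def task():
        if hi - lo <= 4:
            return sum(range(lo, hi)), []
        mid = (lo + hi) // 2
        return 0, [make_sum_task(lo, mid), make_sum_task(mid, hi)]
    return task


if __name__ == "__main__":
    print(sum(run_task_pool([make_sum_task(0, 100)])))
```

Splitting the range into many small tasks over-decomposes the problem, which is what lets idle threads pick up remaining work and balances the load.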
