  1. Analysis of Data Reuse in Task Parallel Runtimes
     Miquel Pericàs⋆, Abdelhalim Amer⋆, Kenjiro Taura† and Satoshi Matsuoka⋆
     ⋆ Tokyo Institute of Technology   † The University of Tokyo
     PMBS'13, Denver, November 18, 2013

  2. Table of Contents
     1 Task Parallel Runtimes
     2 Case Study of Matmul and FMM
     3 Kernel Reuse Distance
     4 Experimental Evaluation
     5 Current Weaknesses
     6 Conclusions

  3. Table of Contents (section divider: 1 Task Parallel Runtimes)

  4. Task Parallel Programming Models
     • Task-parallel programming models are popular tools for multicore programming
     • They are general, simple and can be implemented efficiently
     [Diagram: tasks form a DAG that a runtime layer (Cilk, TBB, OpenMP, ...) schedules onto the cores]
     • Task-parallel runtimes manage the assignment of tasks to cores, allowing programmers to write cleaner code; a minimal sketch follows
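     For concreteness, here is a minimal sketch (not taken from the slides) of a recursive computation expressed as tasks with TBB's task_group, one of the runtimes named above. The programmer only expresses the tasks and their join point; the runtime decides which core executes each spawned task.

     // Minimal sketch: recursive Fibonacci with TBB tasks.
     #include <tbb/task_group.h>
     #include <cstdio>

     long fib(int n) {
         if (n < 2) return n;              // base case: run serially
         long a = 0, b = 0;
         tbb::task_group tg;
         tg.run([&] { a = fib(n - 1); });  // spawn a child task (may be stolen)
         b = fib(n - 2);                   // the parent keeps working
         tg.wait();                        // join: both subtasks have finished
         return a + b;
     }

     int main() { std::printf("fib(30) = %ld\n", fib(30)); return 0; }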

  5. Performance of Runtime Systems
     • Runtime schedulers implement heuristics to maximize parallelism and optimize resource sharing
     • Performance can depend considerably on these heuristics; degradation often occurs without any obvious reason
     [Plot: speed-up vs. number of cores (1–24) for runtimes A–D against linear speed-up, showing large, unexplained differences]

  6. Scalability of Task Parallel Applications
     Why do task parallel codes not scale linearly?
     • Runtime overheads: execution cycles spent inside API calls
     • Parallel idleness: cycles lost to load imbalance and lack of parallelism
     • Resource sharing: additional cycles due to contention or destructive sharing → work time inflation (WTI)

  7. Quantifying Parallelization Stretch
     • OVR_N = non-work overheads at N cores (API + idle time)
     • WTI_N = work time inflation at N cores
     [Diagram: a serial execution of work chunks W1..W6 compared with a 4-core parallel execution whose per-core timelines contain work, API and idle segments, illustrating WTI_4 and OVR_4]
     Parallel stretch: T_par = (T_ser / N) × WTI_N × OVR_N  →  Speed-Up_N = N / (OVR_N × WTI_N)
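     As an illustrative example (hypothetical numbers, not measurements from the paper): with OVR_24 = 1.25 and WTI_24 = 1.2, the model gives Speed-Up_24 = 24 / (1.25 × 1.2) = 16×, so two apparently modest factors already cost a third of the ideal speed-up.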

  8. Table of Contents (section divider: 2 Case Study of Matmul and FMM)

  9. Case Study: Matmul and FMM
     Matrix multiplication (C = A × B)
     • Input size: 4096 × 4096 elements
     • Task inputs/outputs: 2D submatrices
     • Average task size¹: 17 µs
     Fast Multipole Method: tree traversal²
     • Input size: 1 million particles (Plummer distribution)
     • Task inputs/outputs: octree cells (multipoles and vectors of bodies)
     • Average task size: 3.25 µs
     ¹ measured on an Intel Xeon E7-4807 at 1.86 GHz
     ² https://bitbucket.org/rioyokota/exafmm-dev
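     The slides only describe the matmul tasks (2D submatrices as inputs/outputs); the following is a hypothetical sketch of such a decomposition. The tile size BS, the loop order and the use of tbb::task_group are assumptions, not the authors' code. The tiles of A, B and C touched by each task are exactly the kind of kernel inputs/outputs that the KRD instrumentation later records.

     #include <tbb/task_group.h>

     constexpr int N  = 4096;   // matrix dimension, as in the case study
     constexpr int BS = 64;     // assumed tile size (not from the paper)

     // One task: read a BSxBS tile of A and of B, update one tile of C.
     void tile_kernel(const double* A, const double* B, double* C,
                      int ib, int jb, int kb) {
         for (int i = ib; i < ib + BS; ++i)
             for (int k = kb; k < kb + BS; ++k)
                 for (int j = jb; j < jb + BS; ++j)
                     C[i * N + j] += A[i * N + k] * B[k * N + j];
     }

     void matmul(const double* A, const double* B, double* C) {
         for (int kb = 0; kb < N; kb += BS) {        // k-steps run one after another
             tbb::task_group tg;
             for (int ib = 0; ib < N; ib += BS)
                 for (int jb = 0; jb < N; jb += BS)  // one task per tile of C
                     tg.run([=] { tile_kernel(A, B, C, ib, jb, kb); });
             tg.wait();                              // no two concurrent tasks share a C tile
         }
     }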

  10. Case Study: Three Runtimes
     • MassiveThreads: Cilk-like runtime with a random work stealer and a work-first policy
     • Threading Building Blocks: C++ template-based runtime with a random work stealer and a help-first policy
     • Qthread: locality-aware runtime with a shared task queue; a set of workers is grouped into a shepherd, with bulk work stealing across shepherds (50% of the victim's tasks) and a help-first policy
     [Diagram: MassiveThreads and Threading Building Blocks use per-core LIFO task queues (work-first vs. help-first scheduling) with work stealing; Qthread groups the cores of a NUMA node into a shepherd sharing one queue, with bulk work stealing between shepherds]
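     The work-first/help-first distinction above can be sketched as follows. This is illustrative pseudo-code with hypothetical names (Task, local_deque), not the internals of any of the three runtimes.

     #include <deque>
     #include <functional>

     using Task = std::function<void()>;
     thread_local std::deque<Task> local_deque;   // per-worker task queue

     // Work-first (MassiveThreads): push the parent's continuation so it
     // can be stolen, then execute the child immediately.
     void spawn_work_first(const Task& child, const Task& continuation) {
         local_deque.push_back(continuation);
         child();
     }

     // Help-first (TBB, Qthread): push the child so it can be stolen and
     // let the parent continue past the spawn point.
     void spawn_help_first(const Task& child) {
         local_deque.push_back(child);
     }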

  11. Experimental Setup
     • Experimental platform: a 4-socket Intel Xeon E7-4807 (Westmere) machine with 6 cores per die (1.87 GHz) and 18 MB of LLC
     • We specify the same subset of cores for every experiment
     • The following runtime configurations are used:
       Runtime      Task Creation   Work Stealing     Task Queue
       MTH          Work-First      Random / 1 task   Core/LIFO
       TBB          Help-First      Random / 1 task   Core/LIFO
       QTH/Core     Help-First      Random / Bulk     Core/LIFO
       QTH/Socket   Help-First      Random / Bulk     Socket/LIFO

  12. Speed-Ups for Matmul and FMM
     [Plots: speed-up vs. number of cores (1–24) for MassiveThreads, Threading Building Blocks, QThread/Core and QThread/Socket against linear speed-up, one panel per benchmark]
     Performance variation at 24 cores:
     • Matmul: 16×–21× (MTH best, QTH/Socket worst)
     • FMM: 9×–18× (MTH best, TBB worst)

  13. Overheads (OVR_N) for Matmul and FMM
     [Plots: non-work overheads vs. number of cores (1–24) for the four runtime configurations, one panel per benchmark]
     Overheads are obtained by measuring the time cores spend outside of work kernels. At 24 cores:
     • Matmul: 1.1×–1.4× (MTH best; QTH/Socket worst)
     • FMM: 1.3×–2.2× (MTH best; TBB and QTH/Socket worst)

  14. Do Overheads Alone Explain Performance?
     Normalized speed-up overhead product: since Speed-Up_N = N / (OVR_N × WTI_N), we have (Speed-Up_N × OVR_N) / N = 1 / WTI_N
     • The normalized speed-up overhead product is a measure of the performance loss due to resource sharing
     • A value of 1.0 means no work time inflation is occurring
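     For illustration with hypothetical numbers: a runtime reaching Speed-Up_24 = 16 with OVR_24 = 1.25 has a normalized product of 16 × 1.25 / 24 ≈ 0.83, i.e., WTI_24 ≈ 1.2, so work time is inflated by roughly 20% even though the overheads alone would still allow a speed-up of about 19×.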

  15. Normalized Speed-Up Overhead Product
     [Plots: normalized speed-up × overhead product vs. number of cores (1–24) for the four runtime configurations, one panel per benchmark]
     Speed-up degradation due to resource contention:
     • Matmul: 2%–10% (MTH best; TBB worst)
     • FMM: 2%–18% (MTH best; TBB worst)
     • Reason? Cache effects caused by the different orders in which tasks are executed

  16. Performance Bottleneck Analysis
     • Overheads can be studied with a variety of tools:
       • Sampling-based: perf¹, HPCToolkit², Extrae³, etc.
       • Tracing-based: VampirTrace⁴, TAU⁵, Extrae, etc.
       • Runtime library support
     • How can we analyze the impact of different runtime schedulers on data locality?
       → Proposal: use the reuse distance to evaluate cache performance
     ¹ https://perf.wiki.kernel.org
     ² http://hpctoolkit.org/
     ³ http://www.bsc.es/computer-sciences/performance-tools/paraver
     ⁴ http://www.vampir.eu
     ⁵ http://tau.uoregon.edu

  17. Table of Contents (section divider: 3 Kernel Reuse Distance)

  18. Multicore-Aware Reuse Distance
     [Diagram: single-threaded reuse distance (one core, one address stream, one histogram) vs. multi-threaded reuse distance (several cores with private L1/L2 and a shared last-level cache, their address streams interleaved into one histogram)]
     • Generating full address traces is too intrusive → it changes the task schedules
     • Computing the reuse distance is expensive

  19. Lightweight Data Tracing
     We make several assumptions to reduce the cost of the metric:
     • Cache performance is dominated by global (shared) data → short-lived stack variables are not tracked; only the kernel inputs/outputs are recorded
     • Performance is dominated by last-level cache misses → we interleave the address streams of all threads and compute a single reuse distance histogram
     • For large reuse distances, tracking individual loads/stores is not needed → we record kernel inputs in bulk as (timestamp, address, size) tuples

  20. Kernel Reuse Distance (KRD)
     [Diagram: 1) each core generates a kernel data trace of the regions its kernels touch; 2) the per-core traces are merged and synchronized into one stream; 3) a reuse distance histogram is generated, with first-time accesses counted separately]
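     A sketch of how the three steps above could be implemented is shown below. The record format (timestamp, base address, size) follows the previous slide; the struct and function names, the O(n)-per-access LRU stack scan, and matching regions by base address only are simplifications, not the authors' implementation.

     #include <algorithm>
     #include <cstdint>
     #include <limits>
     #include <list>
     #include <map>
     #include <vector>

     struct Access { uint64_t ts; uint64_t addr; uint64_t size; };

     // Step 2: merge per-core traces into one stream ordered by timestamp.
     std::vector<Access> merge(const std::vector<std::vector<Access>>& per_core) {
         std::vector<Access> all;
         for (const auto& trace : per_core)
             all.insert(all.end(), trace.begin(), trace.end());
         std::sort(all.begin(), all.end(),
                   [](const Access& a, const Access& b) { return a.ts < b.ts; });
         return all;
     }

     // Step 3: reuse-distance histogram at region granularity. The reuse
     // distance of an access is the volume of distinct data touched since
     // the previous access to the same region; first-time accesses are
     // counted in a separate "cold" bin.
     std::map<uint64_t, uint64_t> krd_histogram(const std::vector<Access>& trace) {
         constexpr uint64_t COLD = std::numeric_limits<uint64_t>::max();
         std::list<Access> stack;                 // most recently used region at front
         std::map<uint64_t, uint64_t> histogram;  // reuse distance (bytes) -> count
         for (const Access& a : trace) {
             uint64_t dist = 0;
             auto it = stack.begin();
             for (; it != stack.end() && it->addr != a.addr; ++it)
                 dist += it->size;                // distinct bytes touched in between
             if (it == stack.end()) {
                 ++histogram[COLD];               // first-time (cold) access
             } else {
                 ++histogram[dist];
                 stack.erase(it);
             }
             stack.push_front(a);                 // this region is now most recent
         }
         return histogram;
     }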

  21. Kernel Reuse Distance: Application
     KRD provides an intuitive measure of data reuse quality. We want to make quick assessments of reuse and compare the performance of different schedulers.

  22. Table of Contents (section divider: 4 Experimental Evaluation, with subsections "KRD histograms and runtime schedulers" and "KRD histograms and performance")

  23. Instrumentation
     • We record submatrices for Matmul, and multipoles and body arrays for FMM
     • Total overhead is below 5% for FMM and below 1% for Matmul
     • Because the memory traces record data regions rather than individual accesses, histogram generation is much faster
