EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HETEROGENEOUS CACHE-COHERENT SYSTEMS
Moyang Wang, Tuan Ta, Lin Cheng, Christopher Batten
Computer Systems Laboratory, Cornell University
(PowerPoint PPT Presentation)


  1. EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HETEROGENEOUS CACHE-COHERENT SYSTEMS
     Moyang Wang, Tuan Ta, Lin Cheng, Christopher Batten
     Computer Systems Laboratory, Cornell University
     Page 1 of 26

  2. MANYCORE PROCESSORS
     Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation
     Small core count (hardware-based cache coherence):
     - Cavium ThunderX: 48 cores
     - Intel Xeon Phi: 72 cores
     Large core count (software-centric cache coherence / no coherence):
     - Tilera TILE64: 64 cores
     - NVIDIA GV100 GPU: 72 SMs
     - Celerity: 511 cores
     - KiloCore: 1000 cores
     - Adapteva Epiphany: 1024 cores

  3. SOFTWARE CHALLENGE
     - Programmers expect to use familiar shared-memory programming models on manycore processors
     - It is even more difficult to allow cooperative execution between a host processor and a manycore co-processor

     Intel TBB:

         int fib( int n ) {
           if ( n < 2 ) return n;
           int x, y;
           tbb::parallel_invoke(
             [&] { x = fib( n - 1 ); },
             [&] { y = fib( n - 2 ); }
           );
           return (x + y);
         }

     Intel Cilk Plus:

         int fib( int n ) {
           if ( n < 2 ) return n;
           int x = cilk_spawn fib( n - 1 );
           int y = fib( n - 2 );
           cilk_sync;
           return (x + y);
         }

  4. SOFTWARE CHALLENGE (build of slide 3, adding a "Host Processor" label next to the Intel TBB example)

  5. CONTRIBUTIONS
     - A work-stealing runtime for manycore processors with heterogeneous cache coherence (HCC)
       - TBB/Cilk-like programming model
       - Efficient cooperative execution between big and tiny cores
     - Direct task stealing (DTS), a lightweight software and hardware technique to improve performance and energy efficiency
     - Detailed cycle-level evaluation
     [Figure: a big.TINY architecture combines a few big OOO cores with many tiny IO cores on a single die using heterogeneous cache coherence]

  6. EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HCC (outline)
     - Background
     - Implementing Work-Stealing Runtimes on HCC
     - Direct Task Stealing
     - Evaluation
     [Figure: a big.TINY architecture combines a few big OOO cores with many tiny IO cores on a single die using heterogeneous cache coherence]

  7. HETEROGENEOUS CACHE COHERENCE (HCC)
     - We study three exemplary software-centric cache coherence protocols: DeNovo [1], GPU Write-Through (GPU-WT), and GPU Write-Back (GPU-WB)
     - They vary in their strategies to invalidate stale data and propagate dirty data:

                  Stale Data     Dirty Data               Write
                  Invalidation   Propagation              Granularity
         MESI     Writer         Owner, Write-Back        Cache Line
         DeNovo   Reader         Owner, Write-Back        Flexible
         GPU-WT   Reader         No-Owner, Write-Through  Word
         GPU-WB   Reader         No-Owner, Write-Back     Word

     - Prior work on Spandex [2] has studied how to efficiently integrate different protocols into HCC systems

     [1] H. Sung and S. V. Adve. DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations. ASPLOS 2015.
     [2] J. Alsop, M. Sinclair, and S. V. Adve. Spandex: A Flexible Interface for Efficient Heterogeneous Coherence. ISCA 2018.

  8. DYNAMIC TASK PARALLELISM
     - Tasks are generated dynamically at run-time
     - Diverse current and emerging parallel patterns: map (for-each), fork-join, nesting
     - Supported by popular frameworks: Intel Threading Building Blocks (TBB), Intel Cilk Plus, OpenMP
     - Work-stealing runtimes provide automatic load-balancing
     Pictures from Robison et al., Structured Parallel Programming: Patterns for Efficient Computation, 2012

  9. DYNAMIC TASK PARALLELISM (build of slide 8, adding code examples)

         long fib( int n ) {
           if ( n < 2 ) return n;
           long x, y;
           parallel_invoke(
             [&] { x = fib( n - 1 ); },
             [&] { y = fib( n - 2 ); }
           );
           return (x + y);
         }

         void vvadd( int a[], int b[], int dst[], int n ) {
           parallel_for( 0, n, [&]( int i ) {
             dst[i] = a[i] + b[i];
           });
         }

  10. DYNAMIC TASK PARALLELISM (build of slide 8, showing the task object underneath fib)

         class FibTask : public task {
           int n, *sum;

           void execute() {
             if ( n < 2 ) {
               *sum = n;
               return;
             }
             long x, y;
             FibTask a( n - 1, &x );
             FibTask b( n - 2, &y );
             this->reference_count = 2;
             task::spawn( &a );
             task::spawn( &b );
             task::wait( this );
             *sum = x + y;
           }
         };

  11. WORK-STEALING RUNTIMES

         void task::wait( task* p ) {
           while ( p->ref_count > 0 ) {
             // Check local task queue
             task_queue[tid].lock_acquire();
             task* t = task_queue[tid].dequeue();
             task_queue[tid].lock_release();
             if (t) {
               // Execute dequeued task
               t->execute();
               amo_sub( t->parent->ref_count, 1 );
             } else {
               // Steal from another queue
               int vid = choose_victim();
               task_queue[vid].lock_acquire();
               t = task_queue[vid].steal();
               task_queue[vid].lock_release();
               if (t) {
                 t->execute();
                 amo_sub( t->parent->ref_count, 1 );
               }
             }
           }
         }

  14. WORK-STEALING RUNTIMES (build of slide 11)
      [Figure: four cores (Core 0 to Core 3), each with its own task queue and work in progress, running the wait() code above]
