EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HETEROGENEOUS CACHE-COHERENT SYSTEMS
Moyang Wang, Tuan Ta, Lin Cheng, Christopher Batten
Computer Systems Laboratory, Cornell University
(PowerPoint PPT Presentation)


  1. EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HETEROGENEOUS CACHE-COHERENT SYSTEMS
     Moyang Wang, Tuan Ta, Lin Cheng, Christopher Batten
     Computer Systems Laboratory, Cornell University
     Page 1 of 26

  2. MANYCORE PROCESSORS
     Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation
     Small core count (hardware-based cache coherence):
     - Cavium ThunderX: 48 cores
     - Intel Xeon Phi: 72 cores
     Large core count (software-centric cache coherence / no coherence):
     - Tilera TILE64: 64 cores
     - NVIDIA GV100 GPU: 72 SMs
     - Celerity: 511 cores
     - KiloCore: 1000 cores
     - Adapteva Epiphany: 1024 cores

  3. SOFTWARE CHALLENGE
     - Programmers expect to use familiar shared-memory programming models on manycore processors
     - It is even more difficult to allow cooperative execution between a host processor and a manycore co-processor

     Intel TBB:

         int fib( int n ) {
           if ( n < 2 ) return n;
           int x, y;
           tbb::parallel_invoke(
             [&] { x = fib( n - 1 ); },
             [&] { y = fib( n - 2 ); }
           );
           return (x + y);
         }

     Intel Cilk Plus:

         int fib( int n ) {
           if ( n < 2 ) return n;
           int x = cilk_spawn fib( n - 1 );
           int y = fib( n - 2 );
           cilk_sync;
           return (x + y);
         }

  4. SOFTWARE CHALLENGE (build of slide 3, adding a "Host Processor" label next to the Intel TBB example)

  5. CONTRIBUTIONS
     - A work-stealing runtime for manycore processors with heterogeneous cache coherence (HCC)
       - TBB/Cilk-like programming model
       - Efficient cooperative execution between big and tiny cores
     - Direct task stealing (DTS), a lightweight software and hardware technique to improve performance and energy efficiency
     - Detailed cycle-level evaluation
     [Figure: a big.TINY architecture combines a few big OOO cores with many tiny IO cores on a single die using heterogeneous cache coherence]

  6. EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HCC (outline)
     - Background
     - Implementing Work-Stealing Runtimes on HCC
     - Direct Task Stealing
     - Evaluation
     [Figure: a big.TINY architecture combines a few big OOO cores with many tiny IO cores on a single die using heterogeneous cache coherence]

  7. HETEROGENEOUS CACHE COHERENCE (HCC)
     - We study three exemplary software-centric cache coherence protocols: DeNovo [1], GPU Write-Through (GPU-WT), and GPU Write-Back (GPU-WB)
     - They vary in their strategies to invalidate stale data and propagate dirty data:

                  Stale Data     Dirty Data               Write
                  Invalidation   Propagation              Granularity
         MESI     Writer         Owner, Write-Back        Cache Line
         DeNovo   Reader         Owner, Write-Back        Flexible
         GPU-WT   Reader         No-Owner, Write-Through  Word
         GPU-WB   Reader         No-Owner, Write-Back     Word

     - Prior work on Spandex [2] has studied how to efficiently integrate different protocols into HCC systems

     [1] H. Sung and S. V. Adve. DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations. ASPLOS 2015.
     [2] J. Alsop, M. Sinclair, and S. V. Adve. Spandex: A Flexible Interface for Efficient Heterogeneous Coherence. ISCA 2018.

  8. DYNAMIC TASK PARALLELISM
     - Tasks are generated dynamically at run-time
     - Diverse current and emerging parallel patterns: map (for-each), fork-join, nesting
     - Supported by popular frameworks: Intel Threading Building Blocks (TBB), Intel Cilk Plus, OpenMP
     - Work-stealing runtimes provide automatic load-balancing
     Pictures from Robison et al., Structured Parallel Programming: Patterns for Efficient Computation, 2012

  9. DYNAMIC TASK PARALLELISM (build of slide 8, adding code examples)

         long fib( int n ) {
           if ( n < 2 ) return n;
           long x, y;
           parallel_invoke(
             [&] { x = fib( n - 1 ); },
             [&] { y = fib( n - 2 ); }
           );
           return (x + y);
         }

         void vvadd( int a[], int b[], int dst[], int n ) {
           parallel_for( 0, n, [&]( int i ) {
             dst[i] = a[i] + b[i];
           });
         }

  10. DYNAMIC TASK PARALLELISM (build of slide 8, showing the task object underneath fib)

         class FibTask : public task {
           int n, *sum;

           void execute() {
             if ( n < 2 ) {
               *sum = n;
               return;
             }
             long x, y;
             FibTask a( n - 1, &x );
             FibTask b( n - 2, &y );
             this->reference_count = 2;
             task::spawn( &a );
             task::spawn( &b );
             task::wait( this );
             *sum = x + y;
           }
         };

  11. WORK-STEALING RUNTIMES

         void task::wait( task* p ) {
           while ( p->ref_count > 0 ) {
             // Check local task queue
             task_queue[tid].lock_acquire();
             task* t = task_queue[tid].dequeue();
             task_queue[tid].lock_release();
             if (t) {
               // Execute dequeued task
               t->execute();
               amo_sub( t->parent->ref_count, 1 );
             } else {
               // Steal from another queue
               int vid = choose_victim();
               task_queue[vid].lock_acquire();
               t = task_queue[vid].steal();
               task_queue[vid].lock_release();
               if (t) {
                 t->execute();
                 amo_sub( t->parent->ref_count, 1 );
               }
             }
           }
         }

  14. WORK-STEALING RUNTIMES (build of slide 11)
      [Figure: four cores (Core 0 to Core 3), each with its own task queue and work in progress, running the wait() code above]
