  1. Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models
     Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
     University of Illinois at Urbana-Champaign
     hetero@cs.illinois.edu

  2. Motivation
     • Heterogeneous systems are now used for a wide variety of applications
     • Emerging applications have fine-grained synchronization, BUT current GPUs have sub-optimal consistency and coherence
       – Data-race-free (DRF) consistency: de facto and simple, but its coherence has high overhead on synchs (inefficient)
       – Heterogeneous-race-free (HRF) consistency: recent, scoped synchronization with no overhead for local synchs (efficient for local synch), but complex
     • This work: simple consistency + efficient coherence

  3. Motivation (Cont.)
     Do GPU models (HRF) need to be more complex than CPU models (DRF)? NO! Not if coherence is done right!
     • DeNovo+DRF: efficient AND simpler memory model
       – Comparable or better results vs. GPU+DRF and GPU+HRF

  4. Outline
     • Motivation
     • Coherence Protocols and Consistency Models
       – Classification
       – GPU Coherence
       – DeNovo Coherence
       – Coherence and Consistency Summary
     • Results
     • Conclusion

  5. A Classification of Coherence Protocols
     • Read hit: don't return stale data
     • Read miss: find one up-to-date copy
     • Two design axes: who invalidates stale copies (writer vs. reader) and how the up-to-date copy is tracked (ownership vs. writethrough)
       – Writer-initiated invalidation + ownership: MESI
       – Reader-initiated invalidation + ownership: DeNovo
       – Reader-initiated invalidation + writethrough: GPU
     • Reader-initiated invalidations
       – No invalidation or ack traffic, directories, or transient states
     • Obtaining ownership for written data
       – Reuse owned data across synchs (not flushed at synch points)
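The cost difference on the invalidator axis can be made concrete with a toy host-side model. This is a hypothetical sketch only; DirectoryEntry and the two functions are invented for illustration and do not come from the talk or any real protocol implementation. Writer-initiated schemes must track sharers and exchange invalidation/ack messages on writes to shared data, while reader-initiated schemes generate none of that traffic and defer invalidation to each reader's next synch point.

```cuda
#include <vector>

// Writer-initiated invalidation (MESI-like): before a core writes, the
// directory must find every sharer, send each an invalidation, and collect
// acks -- this is what requires sharer lists, extra traffic, and transient states.
struct DirectoryEntry {
    std::vector<int> sharers;   // cores currently holding a copy of this line
};

int writer_initiated_write(DirectoryEntry& dir, int writer_core) {
    int messages = 0;
    for (int core : dir.sharers)
        if (core != writer_core) ++messages;   // one invalidation + one ack per sharer
    dir.sharers = { writer_core };             // writer becomes the only copy
    return 2 * messages;                        // traffic this one write had to generate
}

// Reader-initiated invalidation (DeNovo/GPU-style): the writer notifies no one;
// each reader self-invalidates its potentially stale data at its next acquire,
// so no sharer list, invalidation, or ack traffic is needed on the write.
int reader_initiated_write() {
    return 0;                                   // no invalidation/ack messages
}
```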

  6. GPU Coherence with DRF
     [Diagram: GPU and CPU, each with a private L1 cache, connected over an interconnection network to shared L2 cache banks; at a synch the GPU L1 flushes its dirty data and invalidates all valid data.]
     • With a data-race-free (DRF) memory model
       – No data races; synchs must be explicitly distinguished
       – At all synch points:
         • Flush all dirty data: unnecessary writethroughs
         • Invalidate all data: can't reuse data across synch points
       – Synchronization accesses must go to the last-level cache (LLC)
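A minimal CUDA sketch of what this looks like to the programmer, assuming a recent CUDA toolkit with libcu++'s <cuda/atomic>; the kernel, variable names, and values are illustrative, not from the talk. Under DRF only the acquire/release pair on the flag must be distinguished, with no scope annotations, but with the GPU coherence described above every such synch flushes all dirty data and invalidates the whole L1.

```cuda
#include <cuda/atomic>

// Producer/consumer flag synchronization under a DRF model: the only
// synchronization accesses are the release store and acquire load on `flag`.
// With conventional GPU coherence, the release forces all dirty (writethrough)
// data to the LLC and the acquire invalidates all valid L1 data, so nothing is
// reused across the synch. Assumes `flag` points to device memory initialized
// to 0 and that both blocks are co-resident on the GPU.
__global__ void producer_consumer(int* payload,
                                  cuda::atomic<int, cuda::thread_scope_device>* flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {              // producer
        payload[0] = 42;                                     // ordinary data write
        flag->store(1, cuda::std::memory_order_release);     // synch: release
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {               // consumer
        while (flag->load(cuda::std::memory_order_acquire) == 0) { }  // synch: acquire
        payload[1] = payload[0];                             // sees 42: no data race
    }
}
```

A launch such as producer_consumer<<<2, 32>>>(d_payload, d_flag) after zero-initializing the flag would exercise it; correctness follows purely from the data-race-freedom of the accesses, which is what keeps the DRF model simple.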

  7. GPU Coherence with HRF
     • With a heterogeneous-race-free (HRF) memory model [ASPLOS '14]
       – No data races; synchs and their scopes must be explicitly distinguished
       – At all globally scoped synch points:
         • Flush all dirty data: unnecessary writethroughs
         • Invalidate all data: can't reuse data across synch points
         • Synchronization accesses must go to the last-level cache (LLC)
       – No overhead for locally scoped synchs
     • But higher programming complexity
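For contrast, a hedged sketch of the scoped style HRF asks for, again using libcu++ (whose thread scopes are similar in spirit to, though not identical with, HRF scopes); the kernel and names are illustrative. A block-scoped flag lets the hardware complete the synch locally, which is exactly the efficiency HRF targets, but the scope choice becomes part of program correctness.

```cuda
#include <cuda/atomic>

// HRF-style scoped synchronization from the programmer's point of view:
// the flag is only block-scoped, so the hardware may complete the synch
// locally instead of flushing/invalidating globally visible data.
__global__ void block_scoped_synch(int* payload)
{
    __shared__ int flag_storage;                 // per-block flag
    if (threadIdx.x == 0) flag_storage = 0;
    __syncthreads();

    cuda::atomic_ref<int, cuda::thread_scope_block> flag(flag_storage);

    if (threadIdx.x == 0) {                                   // producer
        payload[blockIdx.x] = 7;                              // data write
        flag.store(1, cuda::std::memory_order_release);       // locally scoped release
    } else if (threadIdx.x == 32) {                           // consumer (separate warp)
        while (flag.load(cuda::std::memory_order_acquire) == 0) { }
        payload[blockIdx.x] += 1;                             // visible: same block, same scope
    }
    // The catch: if the reader were in a *different* block, thread_scope_block
    // would be too weak and the program would be heterogeneous-racy. Choosing
    // the right scope everywhere is the programming complexity HRF introduces.
}
```

Launching with at least 64 threads per block keeps the producer and consumer in different warps; the larger point is that the programmer, not the memory model, must now guarantee the chosen scope covers every communicating thread.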

  8. DeNovo Coherence with DRF
     [Diagram: GPU and CPU L1 caches (with Owned, Dirty, and Valid lines) and shared L2 cache banks over an interconnection network; at a synch the GPU L1 obtains ownership for dirty data and invalidates non-owned data.]
     • With a data-race-free (DRF) memory model
       – No data races; synchs must be explicitly distinguished
       – At all synch points:
         • Obtain ownership for dirty data (instead of flushing it): can reuse owned data across synchs
         • Invalidate all non-owned data
       – Synchronization accesses to owned data can be performed at the L1 (instead of going to the LLC)
     • Only 3% state overhead vs. GPU coherence + HRF
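The protocol actions are in hardware, but a much-simplified host-side model can make the contrast with the previous slides concrete. This is a hypothetical sketch only (line granularity, no LLC registration messages or timing); State, CacheLine, and the function names are invented for illustration.

```cuda
#include <vector>

// Simplified per-line L1 state for a DeNovo-style protocol.
enum class State { Invalid, Valid, Owned };   // Owned = registered at the LLC as the up-to-date copy

struct CacheLine {
    State state = State::Invalid;
    bool  dirty = false;
};

// Release (synch write) under DeNovo+DRF: instead of writing dirty data
// through to the LLC, request ownership (registration) for each dirty line.
// Owned lines stay in the L1 and can be reused after the synch.
void denovo_release(std::vector<CacheLine>& l1) {
    for (auto& line : l1) {
        if (line.dirty) {
            line.state = State::Owned;   // models "obtain ownership at the LLC"
            line.dirty = false;
        }
    }
}

// Acquire (synch read) under DeNovo+DRF: self-invalidate only the lines this
// core does NOT own; owned lines are up to date by construction, so they
// survive the synch point (unlike GPU coherence, which invalidates everything).
void denovo_acquire(std::vector<CacheLine>& l1) {
    for (auto& line : l1) {
        if (line.state == State::Valid)
            line.state = State::Invalid;
    }
}
```

Because owned lines survive denovo_acquire, data produced before a synch can be reused after it, which is the source of the reuse benefit quantified in the results.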

  9. DeNovo Configurations Studied
     • DeNovo+DRF (DD): invalidate all non-owned data at synch points
     • DeNovo-RO+DRF (DD+RO): additionally avoids invalidating read-only data at synch points
     • DeNovo+HRF (DH): additionally reuses valid data if the synch is locally scoped
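The three configurations differ only in their acquire-time invalidation policy. A hypothetical standalone sketch of that policy follows; the Line/Config types, the read_only hint, and the locally_scoped_synch flag are illustrative assumptions, not the paper's implementation.

```cuda
#include <vector>

// Acquire-time invalidation policy for the three DeNovo configurations studied.
// Reuses the simplified line-state idea from the previous sketch
// (Invalid / Valid / Owned), extended with a per-line read-only hint.
enum class LineState { Invalid, Valid, Owned };

struct Line {
    LineState state     = LineState::Invalid;
    bool      read_only = false;   // e.g., marked by software as read-only data
};

enum class Config { DD, DD_RO, DH };   // DeNovo+DRF, DeNovo-RO+DRF, DeNovo+HRF

void acquire(std::vector<Line>& l1, Config cfg, bool locally_scoped_synch) {
    // DeNovo+HRF: a locally scoped acquire needs no invalidation at all,
    // so all valid data is reused (at the cost of a scoped memory model).
    if (cfg == Config::DH && locally_scoped_synch)
        return;

    for (auto& line : l1) {
        if (line.state != LineState::Valid) continue;           // owned lines always survive
        if (cfg == Config::DD_RO && line.read_only) continue;   // DD+RO: keep read-only data
        line.state = LineState::Invalid;                        // DD: invalidate all non-owned data
    }
}
```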

  10. Coherence & Consistency Summary
      • GPU + DRF (GD): no reuse of owned or valid data; synchs cannot be done at the L1
      • GPU + HRF (GH): reuses owned/valid data and does synchs at the L1 only for locally scoped synchs
      • DeNovo + DRF (DD): reuses owned data; does not reuse valid data; synchs at the L1
      • DeNovo-RO + DRF (DD+RO): reuses owned data; reuses read-only valid data; synchs at the L1
      • DeNovo + HRF (DH): reuses owned data; reuses valid data for locally scoped synchs; synchs at the L1

  11. Outline
      • Motivation
      • Coherence Protocols and Consistency Models
      • Results
      • Conclusion

  12. Evaluation Methodology
      • 1 CPU core + 15 GPU compute units (CUs)
        – Each node has a private L1, a scratchpad, and a tile of the shared L2
      • Simulation environment
        – GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT
      • Workloads
        – 10 apps from Rodinia and Parboil: no fine-grained synch
          • DeNovo and GPU coherence perform comparably
        – UC Davis microbenchmarks + UTS from the HRF paper:
          • Mutex, semaphore, barrier, work sharing
          • Show potential for future apps
          • Created two versions of each: globally and locally/hybrid scoped synch

  13. Global Synch – Execution Time
      [Bar chart: normalized execution time of GPU (G*) and DeNovo (D*) configurations on FAM, SLM, SPM, SPMBO, and the average.]
      DeNovo has 28% lower execution time than GPU with global synch

  14. Global Synch – Energy
      [Bar chart: normalized energy of GPU (G*) and DeNovo (D*) configurations on FAM, SLM, SPM, SPMBO, and the average, broken down into N/W, L2 $, L1 D$, scratchpad, and GPU core+ components.]
      DeNovo has 51% lower energy than GPU with global synch

  15. Local Synch – Execution Time
      [Bar chart: normalized execution time of GD, GH, DD, DD+RO, and DH on FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and the average.]
      GPU+HRF is much better than GPU+DRF with local synch [ASPLOS '14]

  16. Local Synch – Execution Time (cont.)
      [Same chart as slide 15.]
      DeNovo+DRF is comparable to GPU+HRF, but with a simpler consistency model

  17. Local Synch – Execution Time (cont.)
      [Same chart as slide 15.]
      DeNovo-RO+DRF reduces the gap by not invalidating read-only data

  18. Local Synch – Execution Time (cont.)
      [Same chart as slide 15.]
      DeNovo+HRF is best, if the consistency complexity is acceptable

  19. Local Synch – Energy
      [Bar chart: normalized energy of GD, GH, DD, DD+RO, and DH on FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and the average, broken down into N/W, L2 $, L1 D$, scratchpad, and GPU core+ components.]
      Energy trends are similar to execution time

  20. Conclusions
      • Emerging heterogeneous apps use fine-grained synch
        – GPU coherence + DRF: inefficient, but simple memory model
        – GPU coherence + HRF: efficient, but complex memory model
        – DeNovo + DRF: efficient AND simple memory model
      • Do GPU models (HRF) need to be more complex than CPU models (DRF)? No: say no to complex consistency models!
