

  1. Towards Exploiting Data Locality for Irregular Applications on Shared-Memory Multicore Architectures Cheng Wang Advisor: Barbara Chapman HPCTools Group, Department of Computer Science, University of Houston, Houston, TX, 77004, USA May 17, 2016

  2. Outline 1 What are irregular applications? 2 Sparse FFT - A case study of irregular applications 3 A padding algorithm to improve the data locality 4 Conclusion & Future work May 17, 2016 Cheng Wang (cwang35@uh.edu) 2 / 22

  3. The Reality of Parallel Computing ... (Slide based on a post from https://highscalability.com)

  4. Why Does CPU Caching Matter? [Figure: the processor-memory performance gap, 1980-2014, and the memory hierarchy that bridges it] • Memory has become the principal performance bottleneck • Improving cache utilization is the key to performance optimization (Source: http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html)

  5. Shared-Memory Multicore Architectures 1 Shared memory • On-chip: (last-level) cache shared by homogeneous/heterogeneous processors • Off-chip: main memory shared by homogeneous/heterogeneous processors

  6. What are Irregular Applications? do i = 1, N ... = x[idx[i]] end do 1 Indirect array reference pattern 2 Commonly found in linked-list, tree, and graph-based applications 3 Poor data locality 4 Especially challenging for shared-memory multicore architectures, as cores compete for memory bandwidth
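The indirect reference pattern on the slide can be sketched in a few lines of Python (the array sizes and values are illustrative):

```python
# A minimal sketch of the indirect (gather) reference pattern from the
# slide: the elements of x are read through an index array idx, so the
# locations touched are unknown until idx is available at runtime.
import random

N = 16
x = [i * 10 for i in range(N)]     # data array
idx = list(range(N))
random.shuffle(idx)                # runtime-dependent, irregular indices

# Neither the compiler nor the hardware prefetcher can predict which
# element of x each iteration reads; spatial locality depends on idx.
result = [x[idx[i]] for i in range(N)]
```

Because `idx` is a permutation here, every element of `x` is gathered exactly once, but in an order that defeats sequential prefetching.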

  7. Approach: Computation/Data Reordering [Figure: computation reordering permutes the iterations that touch the data; data reordering permutes the data elements themselves]
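As a minimal sketch of the computation-reordering side (the values and indices are illustrative, not from the slides): iterations of an independent reduction can be sorted by the index they touch, so that successive iterations access nearby elements.

```python
# Illustrative sketch of computation reordering for an indirect reduction.
# Sorting iterations by the index they touch makes the accesses to x
# monotone; this is legal here because the iterations commute (a sum).
x = [10.0, 20.0, 30.0, 40.0]
idx = [3, 0, 2, 0, 1, 3]

order = sorted(range(len(idx)), key=lambda i: idx[i])  # reordered iterations
total = 0.0
for i in order:
    total += x[idx[i]]    # now walks through x in non-decreasing order
```

Data reordering instead leaves the loop alone and permutes `x` (remapping `idx` to match), which is the approach the later slides pursue.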

  8. Challenges in Dynamic Irregularity Removal 1 Dynamic irregularity • Memory access pattern remains unknown until runtime and may change during computation • Previous work on compile-time transformations rarely applies • Need for transformation at runtime 2 Runtime transformation overhead • Transformation overhead falls on the critical path of the application's execution • The benefits of improved data locality must outweigh the cost of the data layout transformation at runtime

  9. 1 What are irregular applications? 2 Sparse FFT - A case study of irregular applications 3 A padding algorithm to improve the data locality 4 Conclusion & Future work

  10. Sparse FFT 1 A novel compressive-sensing algorithm with massive application domains 2 The Fourier transform is dominated by a small number of "peaks" • The full-size FFT (O(n log n)) is inefficient for such signals 3 Compute the k-sparse Fourier transform in lower time complexity • k-sparse: number of "large" coordinates in the frequency domain

  11. Sparse Data is Ubiquitous ... (Slide based on http://groups.csail.mit.edu/netmit/sFFT/)

  12. Irregular Memory Access Pattern in Sparse FFT [Figure: n coordinates binned into B buckets, an irregular data reference pattern] • Randomly permutes the signal spectrum and bins the n coordinates into a small number of buckets • Irregular memory access pattern: buckets[i % B] += signal[idx] * filter[i]
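The permute-filter-bin step can be sketched as below. The signal, filter, permutation stride `sigma`, and sizes are illustrative placeholders, not the actual sFFT parameters; a real implementation uses a spike-train or Gaussian window filter and a randomly chosen invertible stride.

```python
# Sketch of the sFFT permute+filter+bin loop from the slide: the permuted
# signal is multiplied by a window filter and folded into B buckets.
n, B = 16, 4
sigma = 5                             # odd stride => invertible permutation mod n
signal = [complex(v, 0) for v in range(n)]
filt = [1.0] * n                      # flat window, for illustration only

buckets = [0j] * B
for i in range(n):
    idx = (i * sigma) % n             # irregular, data-dependent index into signal
    buckets[i % B] += signal[idx] * filt[i]

# With a flat filter the permutation only redistributes mass across buckets.
assert sum(buckets) == sum(signal)
```

The read `signal[idx]` strides through memory with step `sigma`, which is exactly the irregular pattern that makes this kernel cache-unfriendly.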

  13. Parallel Sparse FFT 1 Modern architectures are exclusively based on multicore and manycore processors • e.g., multicore CPUs, GPUs, Intel Xeon Phi, etc. • Natural path to improve the performance of sFFT through efficient parallel algorithm design and implementation 2 Standard full-size FFT has been well studied and implemented • FFTW, cuFFT, Intel MKL, etc. • Highly optimized for specific architectures 3 We are the first such effort toward a high-performance parallel sFFT implementation (cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs, C. Wang, S. Chandrasekaran, and B. Chapman, in Proceedings of the 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2016), to appear)

  14. Exec. Time: cusFFT vs. sFFT vs. cuFFT (k = 1000) GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge) [Chart: execution time (sec, log scale) vs. signal size 2^n for n = 19-27; series: sFFT (MIT), PsFFT (6 threads), cusFFT, cuFFT] • cuFFT: full-size FFT library on NVIDIA GPUs • The MIT sequential sFFT is slower than cuFFT • cusFFT is 5x faster than PsFFT and 25x faster than the sequential sFFT • cusFFT is up to 12x faster than cuFFT

  15. Exec. Time: cusFFT vs. sFFT vs. cuFFT (n = 2^25) GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge) [Chart: execution time (sec, log scale) vs. signal sparsity k for k = 5,000-40,000; series: sFFT (MIT), PsFFT (6 threads), cusFFT, cuFFT] • The sequential sFFT is slower than cuFFT • PsFFT is faster than cuFFT until k = 3000 • cusFFT is faster than cuFFT until k = 41,000

  16. 1 What are irregular applications? 2 Sparse FFT - A case study of irregular applications 3 A padding algorithm to improve the data locality 4 Conclusion & Future work

  17. Rethink the Consecutive Packing (CPACK) Algorithm CPACK: a greedy algorithm which packs data into consecutive locations in the order they are first accessed by the computation [Figure: for data access order 9, 23, 103, 23, 67, 23, 67, the original layout (9, 23, 67, 103) incurs 7 cache misses; the CPACK first-touch layout (9, 23, 103, 67) incurs 6] • First-touch policy packs (9, 23) together • Not optimal
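First-touch packing, together with a toy cache model (two elements per line, one line held at a time) that reproduces the miss counts on the slide, can be sketched as follows. The cache parameters are assumptions chosen to match the slide's example, not a real cache configuration.

```python
# Sketch of CPACK (consecutive packing in first-touch order) plus a toy
# cache model that reproduces the slide's 7-vs-6 miss counts.

def cpack(accesses):
    """Pack data entries consecutively in the order they are first touched."""
    layout, pos = [], {}
    for a in accesses:
        if a not in pos:
            pos[a] = len(layout)
            layout.append(a)
    return layout, [pos[a] for a in accesses]   # new layout, remapped accesses

def count_misses(positions, line_size=2):
    """Misses under a one-line cache: a hit needs the line of the last access."""
    misses, cur_line = 0, None
    for p in positions:
        if p // line_size != cur_line:
            misses += 1
            cur_line = p // line_size
    return misses

accesses = [9, 23, 103, 23, 67, 23, 67]   # access order from the slide
print(count_misses(accesses))             # original layout: 7 misses
layout, packed = cpack(accesses)          # layout is [9, 23, 103, 67]
print(count_misses(packed))               # CPACK layout: 6 misses
```

The first-touch order packs (9, 23) into one line, so only the second access hits; every later access to 23 alternates lines and misses.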

  18. Rethink the Consecutive Packing (CPACK) Algorithm Affinity-conscious data reordering [Figure: for the same access order 9, 23, 103, 23, 67, 23, 67, the original layout (9, 23, 67, 103) incurs 7 cache misses; an optimal layout (9, 103, 23, 67) incurs only 4] • CPACK does not consider data affinity (i.e., how closely nearby data elements are accessed together) • Packing (23, 67) rather than (9, 23) yields better locality
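Under the same toy cache model as the previous slide (two elements per line, one line held at a time, an assumption made to match the slide's counts), the affinity-conscious layout can be checked directly:

```python
# The affinity-conscious layout from the slide: 23 and 67, which are
# accessed together repeatedly, share a cache line.
layout = [9, 103, 23, 67]
pos = {v: i for i, v in enumerate(layout)}
accesses = [9, 23, 103, 23, 67, 23, 67]

misses, cur_line = 0, None
for a in accesses:
    line = pos[a] // 2            # toy model: 2 elements per cache line
    if line != cur_line:
        misses, cur_line = misses + 1, line
print(misses)   # 4
```

After the first three accesses warm up the two lines, the trailing 23, 67, 23, 67 run stays entirely within one line, which is exactly the affinity CPACK's first-touch rule cannot see.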

  19. Data Reordering and NP-completeness 1 Finding an optimal data layout is an NP-complete problem (E. Petrank and D. Rawitz, "The Hardness of Cache Conscious Data Placement," POPL '02) 2 No "best" data reordering algorithm works in general 3 Implicit constraint: each data entry has only one copy in the transformed format 4 The complexity can be significantly reduced if more space may be used

  20. A Padding Algorithm that Circumvents the Complexity CPACKE algorithm: extends CPACK by creating duplicated copies of each repeatedly accessed data entry [Figure: for access order 9, 23, 103, 23, 67, 23, 67, the original layout (9, 23, 67, 103) incurs 7 cache misses; the padded layout, which follows the access order and duplicates the repeated entries 23 and 67, incurs only 4] • Advantage: better locality than CPACK • Disadvantage: slight space overhead
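The simplest form of the padding idea can be sketched as below: laying the data out in exact access order, duplicating repeated entries, so every access streams forward through memory. This is a minimal sketch of the duplication principle the slide describes; the real CPACKE algorithm's details may differ.

```python
# Minimal sketch of padding via duplication: one slot per access, so the
# remapped access stream is simply 0, 1, 2, ..., and under the slide's toy
# cache model (2 elements per line, one line at a time) misses drop to 4.

def pad_layout(accesses):
    """Lay data out in exact access order, duplicating repeated entries."""
    layout = list(accesses)                  # one slot per access
    positions = list(range(len(accesses)))   # access i reads slot i
    return layout, positions

def count_misses(positions, line_size=2):
    misses, cur_line = 0, None
    for p in positions:
        if p // line_size != cur_line:
            misses, cur_line = misses + 1, p // line_size
    return misses

accesses = [9, 23, 103, 23, 67, 23, 67]
layout, pos = pad_layout(accesses)
print(count_misses(pos))    # 4 misses, matching the slide
print(len(layout))          # 7 slots vs. 4 unique entries: the space overhead
```

A sequential walk over N slots with 2-element lines costs exactly ceil(N/2) misses, which also shows why the improvement comes at the cost of extra space and an extra write pass to build the padded copy.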

  21. Performance Evaluation Intel Xeon E5-2670 (Sandy Bridge) [Chart: execution time (sec) vs. signal size 2^n for n = 19-28; series: PsFFT (before transformation), PsFFT (after transformation)] • Applies CPACKE to the perm+filter stage in sFFT • Improves the performance of the irregular kernel by 30% • Improves the overall performance of PsFFT by 20%

  22. Conclusion & Future Work 1 A padding-based algorithm that improves the data locality of irregular applications 2 Improves the performance of sFFT by 30% 3 Future work • Evaluate with more irregular applications • Evaluate with other data/computation reordering algorithms • Let the compiler generate the transformed code automatically
