Quantifying the Performance Impacts of Using Local Memory for Many-Core Processors

  1. Quantifying the Performance Impacts of Using Local Memory for Many-Core Processors
     - Jianbin Fang (1), Ana Lucia Varbanescu (2), Henk Sips (1)
     - (1) Delft University of Technology, (2) University of Amsterdam, The Netherlands
     - MuCoCoS'13: Quantifying the Performance Impacts of Using Local Memory

  2. Looking Back on OpenCL
     - OpenCL: Open Computing Language
       - An open programming framework by the Khronos Group
       - Heterogeneous platforms: CPUs, GPUs, MIC, FPGAs, DSPs, …
     - OpenCL platform model
     - An OpenCL program
       - Kernel: a language based on C99
       - Host: a set of APIs
     - Adopted by many vendors
     - Current version: v2.0 (July 2013)

  3. OpenCL and Local Memory
     - Local memory is a key performance factor
       - Fast: on-chip
       - Not a cache: user-managed
     - Current status: using local memory is a trial-and-error process
       - Work hard to enable it …
       - … and hope for a performance gain.

  4. Our Idea
     - Performance impact estimation
       - How can we estimate the benefits of using local memory?
     - Assess the necessity of using local memory
     - Facilitate performance modeling of OpenCL platforms

  5. Local Memory "Myths"
     - Common assumptions about gaining performance with local memory (LM):
       - Data sharing is mandatory
       - Using LM on GPUs is mandatory
       - Using LM on CPUs must be avoided
     - We contradict these myths!
       - Data reuse does not guarantee a performance gain from LM
       - Enabling LM on GPUs can be skipped
       - Enabling LM on CPUs can be beneficial

  6. Data reuse ≠ Performance gain
     - NBody on GTX580
       - Threads share exactly the same data set
     - [Figure: results in GB/s]
     - Conclusion
       - Using local memory performs worse by 18% on average

  7. No data reuse ≠ Performance loss
     - [Figure: diagram with labels D, W_l, W_g, D', W_g']
     - Describe analysis
     - Conclusion
       - Besides data reuse (D), the change in access order (W) matters!
       - Matrix transpose is a good example.
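
Why access order can matter without any data reuse can be sketched with a small counting model of matrix transpose. This is only an illustration: the 16-wide row of work-items, the 16-element memory segment, and the function names are assumptions, not values taken from the slides.

```python
def segments_touched(addresses, segment_elems=16):
    """Count the distinct aligned memory segments covered by element addresses."""
    return len({a // segment_elems for a in addresses})


def transpose_write_segments(n=1024, group=16, use_local=False):
    """Segments written by one row of `group` work-items (ty = 0) in an n x n transpose.

    Naive version: work-item (tx, ty) writes out[tx][ty], so neighbouring
    work-items write addresses that are n elements apart.
    Local-memory version: the tile is transposed on chip first, so work-item
    (tx, ty) writes out[ty][tx] and neighbouring work-items write adjacent addresses.
    """
    ty = 0
    if use_local:
        addrs = [ty * n + tx for tx in range(group)]
    else:
        addrs = [tx * n + ty for tx in range(group)]
    return segments_touched(addrs)


naive = transpose_write_segments(use_local=False)
tiled = transpose_write_segments(use_local=True)
print(naive, tiled)
```

In this model every element is still read once and written once in both versions, so there is no data reuse; only the order in which global memory is touched changes.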

  8. LM on CPUs ≠ Performance loss
     - Image convolution on CPU
       - Intel Xeon E5620 (quad-core)
       - Filter radius is 3
     - [Figure: results in GB/s]
     - Conclusion
       - Using local memory delivers (2x) better performance
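
To give a feel for how much data a convolution with filter radius 3 can share inside one work-group, the following sketch counts global-memory element loads with and without staging a tile in local memory. The 16 x 16 work-group size and the function name are assumptions for illustration; the count is not a bandwidth measurement and does not by itself predict the gain reported on the slide.

```python
def global_loads_per_group(group=16, radius=3, use_local=False):
    """Global-memory element loads issued by one group x group work-group."""
    window = 2 * radius + 1
    if use_local:
        # The group stages its tile plus a halo of `radius` elements per side, once.
        return (group + 2 * radius) ** 2
    # Every work-item reads its full (2r+1) x (2r+1) window from global memory.
    return group * group * window * window


direct = global_loads_per_group(use_local=False)
staged = global_loads_per_group(use_local=True)
print(direct, staged, round(direct / staged, 1))
```

On a CPU, hardware caches already absorb part of this reuse, which is one reason the measured effect has to be quantified per platform rather than derived from load counts alone.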

  9. Performance Impact Estimation
     - Not an easy job
       - No assumptions hold for all cases
       - Application-dependent
       - Platform-dependent
     - Our approach:
       1. Enumerate and analyze all feasible memory access patterns (MAPs)
       2. Quantify and log the local memory impact of each MAP on each platform (in terms of bandwidth)
       3. Model applications as (compositions of) MAPs
       4. Quantify an application's gain by searching the database and composing the results

  10. Our Approach

  11. Stage I: Quantification
      - MAP description: MAP = eMAP + iMAP
      - eMAP: 16 cases
      - Example: eMAP-14

  12. Stage I: Quantification
      - MAP = eMAP + iMAP
      - eMAP: 16 cases
      - eMAP-14: $s = \{\, d_y = t_y,\ d_x = t_y + t_x \,\}$
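
Reading the eMAP-14 mapping on this slide as d_y = t_y and d_x = t_y + t_x, a short sketch can show which data columns each row of work-items touches. The function names and the 4 x 4 work-group size are assumptions for illustration only.

```python
def emap_14(tx, ty):
    """Return the data index (dx, dy) for work-item (tx, ty): dx = ty + tx, dy = ty."""
    return ty + tx, ty


def rows(width=4, height=4):
    """For each work-item row ty, list the data column dx touched by each tx."""
    return [[emap_14(tx, ty)[0] for tx in range(width)] for ty in range(height)]


for ty, cols in enumerate(rows()):
    print(f"ty={ty}: dx={cols}")
```

Each row of work-items reads a contiguous run of columns, and the run shifts right by one element for every increase in ty.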

  13. Stage I: Quantification
      - MAP description: MAP = eMAP + iMAP (34 patterns)
        - eMAP: 16 cases
        - iMAP: 5 cases (Single, Row, Column, Block, Neighbor)
      - Example: MAP-14, Block (4)

  14. Stage I: Quantification
      - Generating benchmarks (MAP-407)
        - Block (4), r = 1
        - eMAP-07: $s = \{\, d_y = t_x,\ d_x = t_x \,\}$
      - [Figure: Max vs. Min]

  15. Stage I: Quantification
      - Min/Max comparison (MAP-407)
      - mbr = B/b

  16. Stage I: Quantification
      - Performance database overview

  17. Performance Database (MAP-407)
      - Platforms: GTX280, HD6970, GTX580, E5620

  18. Stage I: Quantification
      - Performance database summary
        - Access order change
        - Data reuse

  19. Our Approach

  20. Stage II: A Query-based Performance Prediction
      - Kernel performance gain due to LM = ratio of the memory bandwidth after (B) to the memory bandwidth before (b) using LM
      - Predicting bandwidth when using LM
        - Identify MAPs (manually)
        - Query bandwidth information (B, b) from the DB
        - Compose the bandwidths of individual MAPs
      - IC, MM, MT, SOR on GTX580
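
The query-and-compose idea can be sketched as a dictionary lookup followed by a composition step. Everything concrete here is hypothetical: the platform name, the MAP ids, the bandwidth numbers, and in particular the composition rule (transfer times of the individual MAPs add up), which the slides do not specify.

```python
# Hypothetical performance database: (platform, MAP id) -> (b, B) in GB/s.
# All keys and numbers are invented placeholders, not measurements from the talk.
PERF_DB = {
    ("demo-gpu", "MAP-A"): (40.0, 100.0),
    ("demo-gpu", "MAP-B"): (50.0, 60.0),
}


def mbr(platform, map_id, db=PERF_DB):
    """Memory bandwidth ratio B / b for one MAP on one platform."""
    b, B = db[(platform, map_id)]
    return B / b


def predicted_gain(platform, maps, db=PERF_DB):
    """Compose several MAPs into one predicted gain.

    `maps` is a list of (map_id, share) pairs, where share is the fraction of
    the kernel's memory traffic that follows that MAP.
    ASSUMPTION: transfer times add up, so gain = sum(share / b) / sum(share / B).
    """
    before = sum(share / db[(platform, m)][0] for m, share in maps)
    after = sum(share / db[(platform, m)][1] for m, share in maps)
    return before / after


print(mbr("demo-gpu", "MAP-A"))
print(predicted_gain("demo-gpu", [("MAP-A", 0.5), ("MAP-B", 0.5)]))
```

For a kernel with a single MAP, the assumed rule reduces to that MAP's mbr, which matches the single-input case of using the corresponding mbr from the database directly.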

  21. Stage II: A Query-based Performance Prediction
      - Case I: MT, SOR
        - The kernel has one input matrix (and one MAP)
        - Use the corresponding mbr in the DB
      - Case II: MM
      - Case III: IC
        - Assume the filter is small and allocated in on-chip memory
        - Use the mbr of MAP-408

  22. Stage II: A Query-based Performance Prediction

  23. Conclusion
      - Quantifying the performance impact of using local memory on many-cores is possible
        - Not as easy as expected => well-known assumptions don't always hold
        - MAP-based => application-agnostic
        - Query-based => prediction-friendly
        - Database-based => easy to extend
        - Composition-based => applicable to fairly complex kernels

  24. Ongoing Work
      - More MAPs and tests (on more diverse platforms, e.g. MIC)
      - Further investigate the performance interference between MAPs
      - An auto-tuner that enables local memory automatically

  25. Questions
      - Jianbin Fang, PhD student at TU Delft
      - Email: j.fang@tudelft.nl
      - WWW: http://www.pds.ewi.tudelft.nl/fang/
