
MAP-Driven Performance Analysis for Local Memory Usage - Jianbin Fang



1. The 2nd International Workshop on OpenCL (IWOCL'14)
MAP-Driven Performance Analysis for Local Memory Usage
Jianbin Fang+, Henk Sips+, Ana Lucia Varbanescu++
+Delft University of Technology, the Netherlands; ++University of Amsterdam, the Netherlands
May 12-13, 2014, Bristol, England

2. Outline
- Introducing local memory (background)
- Our research question (why)
- Our approach (how)
- Our key findings (results)

3. OpenCL and Local Memory
- Like-a-cache: on-chip and faster
- Not-a-cache: user-managed
- Data elements are shared by the work-items of a work-group
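
As a concrete illustration of this user-managed model (an added sketch, not from the deck), a minimal OpenCL C kernel stages data into __local storage, synchronizes, and then lets the work-items of a work-group share the staged elements. The kernel name, the TILE size, and the assumption that the work-group size equals TILE are illustrative.

    // Minimal local-memory idiom in OpenCL C (illustrative sketch).
    // Assumes a 1-D work-group size of exactly TILE.
    #define TILE 256

    __kernel void average_pairs(__global const float *in,
                                __global float *out)
    {
        __local float tile[TILE];          // on-chip, user-managed storage

        int gid = get_global_id(0);
        int lid = get_local_id(0);

        tile[lid] = in[gid];               // explicit staging: global -> local
        barrier(CLK_LOCAL_MEM_FENCE);      // make the staged data visible

        // Elements staged by one work-item can be read by any other
        // work-item of the same work-group.
        int next = (lid + 1) % TILE;
        out[gid] = 0.5f * (tile[lid] + tile[next]);
    }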

4. Rules of Thumb
- Using local memory on GPUs is preferred (e.g., for data reuse)
- Using local memory on CPUs is not recommended

5. The Reality is ...
- A counter-intuitive example: 3x3 convolution
- On an Intel Xeon E5620 (6 cores)
[Chart: bandwidth (GB/s), with (w/) vs. without (w/o) local memory, for datasets from 128x128 to 2048x2048; higher is better]

6. The Reality is ...
- A counter-intuitive example: 3x3 convolution
- On an Intel Xeon E5620 (6 cores)
- LM on CPUs ≠ performance loss
[Chart: same bandwidth comparison as above]
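
To make the comparison concrete, here is a hedged sketch of the two kernel variants being contrasted: a 3x3 convolution that reads its window straight from global memory, versus one that first stages a block-plus-halo tile into local memory. This is not the authors' benchmark code; R, B, the kernel names, and the clamped border handling are illustrative assumptions.

    // Hedged sketch of the two kernels being compared (not the authors' code).
    // Assumes 2-D work-groups of B x B work-items.
    #define R 1                       // 3x3 stencil radius
    #define B 16                      // work-group edge length (illustrative)
    #define T (B + 2 * R)             // staged tile edge: block plus halo

    // Without local memory: each work-item reads its 3x3 window from global
    // memory and relies on the hardware caches (if any) for reuse.
    __kernel void conv3x3_global(__global const float *in, __global float *out,
                                 __constant float *w, int width, int height)
    {
        int x = get_global_id(0), y = get_global_id(1);
        if (x < R || y < R || x >= width - R || y >= height - R) return;

        float acc = 0.0f;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                acc += w[(dy + R) * 3 + (dx + R)] * in[(y + dy) * width + (x + dx)];
        out[y * width + x] = acc;
    }

    // With local memory: the work-group stages a T x T tile once, then every
    // 3x3 window is served from the (possibly emulated) local tile.
    __kernel void conv3x3_local(__global const float *in, __global float *out,
                                __constant float *w, int width, int height)
    {
        __local float tile[T * T];

        int lx = get_local_id(0), ly = get_local_id(1);
        int x  = get_global_id(0), y  = get_global_id(1);

        // Cooperative staging: the B*B work-items copy all T*T tile elements,
        // clamping reads at the image border.
        for (int i = ly * B + lx; i < T * T; i += B * B) {
            int tx = i % T, ty = i / T;
            int gx = clamp((int)get_group_id(0) * B + tx - R, 0, width  - 1);
            int gy = clamp((int)get_group_id(1) * B + ty - R, 0, height - 1);
            tile[i] = in[gy * width + gx];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        if (x >= width || y >= height) return;
        float acc = 0.0f;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                acc += w[(dy + R) * 3 + (dx + R)] * tile[(ly + R + dy) * T + (lx + R + dx)];
        out[y * width + x] = acc;
    }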

7. When to Use Local Memory?

8. Our Approach

9. MAP Description
- OpenCL organizes parallelism at two levels
- We describe a MAP (memory access pattern) at two levels:
  - eMAP: work-group level
  - iMAP: work-item level
[Figure: an iMAP nested inside an eMAP]

10. 33 MAP Citizens
- eMAP: M00, M01, M10, M11 ∈ {0,1}
- iMAP: Single, Row, Column, Block, Neighbor

11. Micro-Benchmarks
- We generate 2 micro-benchmarks (in OpenCL) for each MAP
- The kernel code consists of:
  - Local space allocation
  - Local data staging
  - Local data access (specified by the MAP)
- We provide a tool (Aristotle*) to facilitate this process
*https://github.com/haibo031031/aristotle
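
The generated kernels themselves are not shown in the deck; the following is only a hedged sketch of the three-part structure listed above (allocation, staging, MAP-specific access), using a "Row"-style work-item pattern as the example. Aristotle's actual output, naming, and MAP encoding will differ; LROWS, LCOLS, and the kernel name are assumptions.

    // Hedged sketch of a generated micro-benchmark kernel (NOT Aristotle's
    // actual output). Assumes 1-D work-groups of exactly LROWS work-items.
    #define LROWS 16
    #define LCOLS 64

    __kernel void map_row_bench(__global const float *in, __global float *out)
    {
        // (1) Local space allocation: one block per work-group.
        __local float block[LROWS * LCOLS];

        int lid = get_local_id(0);
        int grp = get_group_id(0);

        // (2) Local data staging: copy the group's block in from global memory.
        for (int c = 0; c < LCOLS; ++c)
            block[lid * LCOLS + c] = in[(grp * LROWS + lid) * LCOLS + c];
        barrier(CLK_LOCAL_MEM_FENCE);

        // (3) Local data access following the MAP, here a "Row"-style pattern:
        // each work-item streams through one row of the staged block.
        float acc = 0.0f;
        for (int c = 0; c < LCOLS; ++c)
            acc += block[lid * LCOLS + c];
        out[grp * LROWS + lid] = acc;

        // The paired "without local memory" benchmark would perform the same
        // access pattern directly on the global buffer, skipping (1) and (2).
    }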

12. Experimental Platforms
- Devices:
  - SPM-only: NVIDIA C1060
  - SPM + cache: AMD HD7970, NVIDIA C2050, NVIDIA K20
  - Cache-only: Intel Xeon X5650, Intel Xeon E5-2620, Intel Xeon Phi 5110P
- Software environments:
  - AMD APP v2.8
  - Intel OpenCL SDK v3.0
  - NVIDIA CUDA v5.5 (not updated for a long time)

13. Experimental Setup
- Metric: bandwidth
- Datasets: 128, 256, 512, 1024, 2048, 4096
- Block MAPs: r = 3
- Run each measurement for 21 iterations:
  - 1 iteration to warm up
  - Measure 20 iterations
  - Flush caches between iterations
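
A hedged host-side sketch of this measurement scheme (1 warm-up launch, 20 timed launches, bandwidth reported in GB/s). This is not the authors' harness; the function names, wall-clock timing, and the cache-flush placeholder are assumptions.

    /* Sketch of the per-benchmark measurement loop (illustrative only). */
    #include <CL/cl.h>
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* bytes_moved: bytes transferred by one kernel launch. Returns GB/s. */
    double measure_bandwidth(cl_command_queue q, cl_kernel k,
                             size_t gsize, size_t lsize, size_t bytes_moved)
    {
        const int iters = 20;

        /* Warm-up iteration (not measured). */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
        clFinish(q);

        double total = 0.0;
        for (int i = 0; i < iters; ++i) {
            /* flush_caches(q); placeholder: evict data between iterations */
            double t0 = now_sec();
            clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
            clFinish(q);
            total += now_sec() - t0;
        }
        return (double)bytes_moved * iters / total / 1e9;
    }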

14. A Bird's-eye View
[Chart: distribution of performance gain/loss/similar per device: C1060, C2050, K20 (NVIDIA GPUs), HD7970 (AMD GPU), Phi-5110p, E5-2620, X5650 (Intel CPUs & Phi); dataset 4096x4096]

18. SPM Processors
[Chart: performance gain/loss distribution per device, as above]
- Memory bandwidth increase factors:
  - Data reuse (A)
  - Changed memory access orders (B)

19. SPM Processors: w/o Caches vs. w/ Caches
- The performance gain is canceled
- Disabling local memory is better
[Chart: MAP-514 bandwidth with vs. without local memory on the C1060, C2050, and K20; higher is better]

20. Cache-only Processors
[Chart: performance gain/loss distribution per device, as above]
- Emulating local memory on global memory
- Using local memory might utilize caches better
[Chart: MAP-302 bandwidth with vs. without local memory on the Phi-5110P, E5-2620, and X5650; higher is better]
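
To unpack the "emulating local memory" bullet (an added conceptual sketch, not the actual Intel or AMD runtime code): on a cache-only device a __local array has no dedicated on-chip storage, so the runtime can back it with an ordinary per-work-group region of memory. The "local" copy then lives in the same cached DRAM as the global data, and whether the extra copy hurts (extra traffic) or helps (denser, more cache-friendly accesses) is what the measurements probe. Roughly:

    /* Conceptual C sketch of how a CPU runtime may execute one work-group
       whose kernel declares "__local float tile[LSIZE]" (illustrative only). */
    #define LSIZE 256

    static void run_work_group(const float *in, float *out,
                               float *tile,   /* per-group scratch in ordinary,
                                                 cached memory, not on-chip SPM */
                               int group_id, int lsize)
    {
        /* Work-items become loop iterations; "staging" is just a memory copy. */
        for (int lid = 0; lid < lsize; ++lid)
            tile[lid] = in[group_id * lsize + lid];

        /* The barrier is trivial here: all "work-items" above have already run. */

        for (int lid = 0; lid < lsize; ++lid)
            out[group_id * lsize + lid] = 2.0f * tile[lid];
    }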

21. Cache-only Processors
- Using local memory on MAP-302 leads to a bandwidth increase
- Profile the number of cache-line replacements on the E5-2620
[Chart: L1 and L2 cache-line replacement counts; fewer is better]

22. Performance Database Use-Scenario

23. Summary
- Data reuse and access-order changes are positive factors
- Unpredictable local memory performance is due to caches
- Our query-based approach to decide on local memory usage
- Architecture design indications:
  - SPM and caches co-exist!?

24. Follow-up Questions
- Evaluations on more platforms, e.g., tiny GPUs
- How to identify MAPs for a given kernel?
  - Visual inspection
  - Automatic tools
- How to 'predict' the performance impact of using local memory in the presence of multiple MAPs?
- How to enable/disable local memory usage?
  - Enabler*: w/o → w/
  - Disabler**: remove the use of local memory
*J. Fang, et al., "ELMO: A User-Friendly API to Enable Local Memory in OpenCL Kernels," PDP 2013.
**J. Fang, et al., "Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels," ICPP 2014 (in submission).

25. Questions
Jianbin Fang, PhD student at TU Delft, j.fang@tudelft.nl
Authors: Jianbin Fang, Ana Lucia Varbanescu, Henk Sips
