
Exploiting Modern Hardware Features via Lightweight Profiling

Probir Roy, Scalable Tools Workshop '19

High performance and challenges: IBM POWER 9 CPU


  4. Lightweight memory profiling
     • Hardware profiling
     • Event-based sampling with the PMU
       • Intel: precise event-based sampling (PEBS)
       • AMD: instruction-based sampling (IBS)
       • IBM: marked event sampling (MRK)
     • Samples are taken periodically over application time; each sample records the reference type (L1 miss, L2 hit, etc.), the data address, and the instruction pointer
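
The slides do not say which software interface was used to program the PMU. As one possible illustration, the sketch below fills a Linux `perf_event_attr` so that each sample carries the fields listed above (instruction pointer, data address, thread, and data source). The helper name `make_sampling_attr` and the `raw_event` parameter are hypothetical; the raw event code is CPU-specific and has to be looked up for the target machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Build an attribute block for precise address sampling.
 * raw_event: hypothetical, CPU-specific raw event code chosen by the caller.
 * period:    take one sample every `period` occurrences of the event. */
static struct perf_event_attr make_sampling_attr(uint64_t raw_event, uint64_t period)
{
    struct perf_event_attr attr;

    assert(period > 0);
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = raw_event;
    attr.sample_period = period;
    /* Ask for instruction pointer, thread id, data address, and data source. */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                       PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC;
    attr.precise_ip = 2;      /* request precise (zero-skid) samples */
    attr.exclude_kernel = 1;  /* user-space accesses only */
    attr.disabled = 1;        /* start disabled; enable it later */
    return attr;
}
```

The returned structure would then be passed to the `perf_event_open` system call; whether precise sampling is actually available depends on the CPU, the kernel, and the chosen event.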

  5. Outline
     ✓ Lightweight profiling
     ✓ SMT-aware optimization
     • Detection of cache conflicts
     • Guiding data-structure layout transformation

  6. SMT-Aware Instantaneous Footprint Optimization [HPDC 2016]. Probir Roy, Shuaiwen Leon Song, Xu Liu

  7. SMT (Simultaneous Multi-Threading). [Figure: superscalar versus 2-way SMT execution over clock cycles; legend: Thread 1, Thread 2, idle cycle]

  8. SMT scalability. Runtime ratio = SMT runtime / non-SMT runtime. [Figure: runtime ratio of shared-memory SPMD applications; lower is better]

  9. SMT architecture: shared cache. [Figure: two cores with two hardware threads each (Threads 1-2 and Threads 3-4); cache levels shown: L1 cache per core, L2 cache, LLC]

  10. SMT: memory scalability. SMT scaling factor (F) = access latency with SMT / access latency without SMT. [Figure: SMT scaling factor; lower is better]

  12. Characterization based on sensitivity. L = memory access latency; F = scaling factor.

      | (L, F)       | Benchmarks                                                                                          | Characterization                                               |
      |--------------|-----------------------------------------------------------------------------------------------------|----------------------------------------------------------------|
      | (high, high) | srad, streamcluster1, Lulesh2.0, IRSmk, LU, 3D tensor, Stencil, streamcluster2, hotspot, Clomp      | potentially sensitive to mem-centric SMT optimizations         |
      | (high, low)  | lud, needle, bfs, nn, bp, canneal, Ferret                                                           | not clear if they can further benefit from SMT optimizations   |
      | (low, high)  | leukocyte, heartwall, pathfinder, myocyte                                                           | little benefit from mem-centric SMT optimization               |
      | (low, low)   | b+tree, cfd, kmeans, lavaMD, particle filter, hotspot3D, blackscholes, bodytrack, facesim, Swaptions | good memory performance with SMT enabled                       |
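
A minimal sketch of how the (L, F) characterization could be computed from measurements. The formula for F is the one given on the slides; the cut-off parameters `l_cut` and `f_cut` are hypothetical, since the slides do not state where "high" begins, and the type and function names are illustrative only.

```c
#include <assert.h>

typedef enum { HIGH_HIGH, HIGH_LOW, LOW_HIGH, LOW_LOW } smt_class;

/* F = access latency with SMT / access latency without SMT. */
static double smt_scaling_factor(double latency_smt, double latency_non_smt)
{
    assert(latency_non_smt > 0.0);
    return latency_smt / latency_non_smt;
}

/* Place a benchmark in one of the four (L, F) groups.
 * l_cut and f_cut are assumed thresholds, not values from the slides. */
static smt_class classify(double latency, double factor, double l_cut, double f_cut)
{
    int high_l = latency >= l_cut;
    int high_f = factor >= f_cut;

    if (high_l)
        return high_f ? HIGH_HIGH : HIGH_LOW;
    return high_f ? LOW_HIGH : LOW_LOW;
}
```

For example, with assumed cut-offs of 100 cycles and 1.5, a benchmark with a latency of 200 cycles that doubles under SMT lands in the (high, high) group.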

  20. Source of memory contention: little/no locality
      • Intra-thread. Optimization: improve cache line utilization
      • Inter-thread. Optimization: collaboration
      [Figures: space-time access plots for SMT thread 1 and SMT thread 2; the inter-thread plot marks cache line 1 for both threads]

  25. SMT locality (Stencil code). The loop starts as a plain "#pragma omp parallel for"; the optimized version adds schedule(static,1):

          #pragma omp parallel for schedule(static,1)
          for (int i = T; i < N - T; i++)
            for (int j = T; j < N - T; j++)
              for (int k = 0; k < T; k++)
                R[i][j] = matrix[i][j] + matrix[i - k][j] + matrix[i][j - k]
                        + matrix[i + k][j] + matrix[i][j + k];

      [Figures: assignment of iterations to Thread 1 and Thread 2, before and after adding the clause]
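
To make the effect of the clause concrete, the sketch below computes which thread runs a given iteration of the outer loop. Under `schedule(static, chunk)` OpenMP deals chunks to threads round-robin; with no chunk size, each thread gets at most one roughly equal contiguous block, which is modelled here with a chunk of ceil(n / nthreads). That model is an assumption, since the exact split is implementation-defined. The function names are illustrative.

```c
#include <assert.h>

/* Thread that executes iteration `iter` (0-based from the loop start)
 * under schedule(static, chunk): chunks are assigned round-robin. */
static int owner_static(int iter, int chunk, int nthreads)
{
    assert(iter >= 0 && chunk > 0 && nthreads > 0);
    return (iter / chunk) % nthreads;
}

/* Chunk size that gives each thread one contiguous block of n iterations.
 * Assumed model of the default static schedule. */
static int block_chunk(int n, int nthreads)
{
    assert(n >= 0 && nthreads > 0);
    return (n + nthreads - 1) / nthreads;
}
```

With two threads and eight rows, the block model gives rows 0-3 to one thread and rows 4-7 to the other, while `schedule(static,1)` alternates rows between them, so neighbouring rows of `matrix` are touched by both threads at about the same time. Whether those two threads are SMT siblings on one core depends on thread affinity settings, which the slides do not show.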

  29. SMT-Analyzer: analyzing memory access patterns. PMU samples (S) taken over application time carry the reference type, data address, instruction pointer, and thread. [Figure: memory range accessed by each thread (T1 to T5) within a loop]
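
One way to picture the per-thread memory ranges is to keep a minimum and maximum sampled data address for each thread and loop, then measure how much two threads' ranges overlap. The sketch below is a simplified illustration under that assumption, not SMT-Analyzer's actual data structure; `addr_range`, `range_add`, and `range_overlap` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive address range [lo, hi] seen by one thread in one loop. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
    int seen;   /* 0 until the first sample arrives */
} addr_range;

/* Extend the range so that it covers a newly sampled data address. */
static void range_add(addr_range *r, uint64_t addr)
{
    assert(r != 0);
    if (!r->seen) {
        r->lo = r->hi = addr;
        r->seen = 1;
        return;
    }
    if (addr < r->lo) r->lo = addr;
    if (addr > r->hi) r->hi = addr;
}

/* Bytes shared by two ranges; 0 if they are disjoint or either is empty. */
static uint64_t range_overlap(const addr_range *a, const addr_range *b)
{
    uint64_t lo, hi;

    if (!a->seen || !b->seen)
        return 0;
    lo = a->lo > b->lo ? a->lo : b->lo;
    hi = a->hi < b->hi ? a->hi : b->hi;
    return hi >= lo ? hi - lo + 1 : 0;
}
```

A large overlap between two SMT siblings suggests they work on the same data, while disjoint ranges suggest each thread brings its own footprint into the shared cache.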

  34. Benchmarks

      | Benchmark      | Bottleneck region     | % of total latency | Overhead | OPT method   | Speedup |
      |----------------|-----------------------|--------------------|----------|--------------|---------|
      | lulesh2.0      | lulesh.cc:604-609     | 3.6%               | +3.1%    | inter-thread | 1.43×   |
      | IRSmk          | rmatmult3.c:86-103    | 78.6%              | +3.2%    | intra-thread | 4.86×   |
      | needle         | needle.cpp:185-187    | 20%                | +2.99%   | inter-thread | 2.37×   |
      | srad           | srad.cpp:136-167      | 80.1%              | +2.47%   | intra-thread | 1.74×   |
      | LU             | rhs.f:318-328         | 8.4%               | +10.6%   | inter-thread | 1.36×   |
      | Stencil        | stencil.c:16-21       | 95.7%              | +1.55%   | inter-thread | 10.9×   |
      | 3D tensor      | mt.c:22-22            | 69.4%              | +2.4%    | inter-thread | 1.44×   |
      | streamcluster2 | streamcluster.cpp:653 | 14.1%              | +15.2%   | inter-thread | 6.72×   |

      Related work: MACPO (selective instrumentation): 2× - 5×

  35. Outline
      ✓ Lightweight profiling
      ✓ SMT-aware optimization
      • Detection of cache conflicts
      • Guiding data-structure layout transformation

  36. Lightweight Detection of Cache Conflicts [CGO 2018]. Probir Roy, Shuaiwen Leon Song, Sriram Krishnamoorthy, Xu Liu

  40. Set-associative cache. Intel Skylake L1 cache: 32 KB, 8-way, 64 sets (Set 0 to Set 63). A 64-bit address is split into tag, set index, and offset. [Figure: sets of 8 cache lines each]
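
The three address fields follow directly from the geometry on the slide: 32 KB divided by 8 ways and 64 sets gives 64-byte lines, so the low 6 bits are the offset and the next 6 bits are the set index. A minimal sketch, with illustrative function names and ignoring virtual-versus-physical addressing details:

```c
#include <assert.h>
#include <stdint.h>

enum {
    CACHE_BYTES = 32 * 1024,
    WAYS = 8,
    NUM_SETS = 64,
    LINE_BYTES = CACHE_BYTES / (WAYS * NUM_SETS)
};

static_assert(LINE_BYTES == 64, "32 KB / (8 ways * 64 sets) = 64-byte lines");

/* Byte position inside the cache line. */
static uint64_t line_offset(uint64_t addr)
{
    return addr % LINE_BYTES;
}

/* Set that the address maps to. */
static uint64_t set_index(uint64_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* Remaining high-order bits, compared against the tags stored in the set. */
static uint64_t tag_bits(uint64_t addr)
{
    return addr / ((uint64_t)LINE_BYTES * NUM_SETS);
}
```

Two addresses that are a multiple of 4096 bytes apart (64 sets times 64 bytes) share a set index and differ only in the tag, which is the root of the set conflicts discussed next.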

  49. Set conflict. Example: double Array[20,000][128];
      • Set mapping: each row of 128 elements spans 16 sets, so successive rows map to sets 0-15, 16-31, 32-47, 48-63, 0-15, 16-31, …, 48-63
      • Set mapping after padding each row (Pad): successive rows map to sets 0-16, 17-33, 34-50, 51-3, 4-20, 21-37, …, 55-7
      • [Figures: sets touched over time; without padding the accesses cycle through sets 0, 16, 32, 48, and with padding they spread across many sets]
      • Is your application suffering from conflict cache misses?
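
The two mappings can be checked with a little arithmetic. The sketch below counts how many distinct sets are touched when a loop reads the first element of each row, assuming 64-byte lines, 64 sets, and an array that starts at a set-0 boundary. The padded case assumes one extra cache line (8 doubles) per row, which matches the 17-set row span in the mapping above; the slides do not state the pad size explicitly.

```c
#include <assert.h>
#include <stddef.h>

enum { LINE = 64, SETS = 64 };

/* Count the distinct cache sets touched when reading element [i][0]
 * of `rows` consecutive rows, each `row_bytes` long.
 * Assumes the array base address maps to set 0. */
static int sets_touched(size_t row_bytes, int rows)
{
    int used[SETS] = {0};
    int distinct = 0;

    assert(rows >= 0);
    for (int i = 0; i < rows; i++) {
        size_t set = ((size_t)i * row_bytes / LINE) % SETS;
        if (!used[set]) {
            used[set] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

For 64 rows, `sets_touched(128 * sizeof(double), 64)` uses only 4 of the 64 sets, whereas `sets_touched((128 + 8) * sizeof(double), 64)` uses all 64, because a 17-line stride is coprime with the number of sets.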

  55. Trace-driven cache simulation (simulation methods)
      • Steps: collect a memory trace over time (A[0][0], A[1][0], A[2][0], A[0][0], A[3][0], A[2][0], …, A[0][0]), feed it to an L1 cache simulator, then classify the misses to find conflict cache misses
      • Memory trace: high overhead, 38 times on average. Xiang, Xiaoya, Chen Ding, Hao Luo, and Bin Bao. "HOTL: a higher order theory of locality." ACM SIGPLAN Notices 48, no. 4 (2013): 343-356.
      • Cache simulator: difficult to simulate hardware
      • Theoretically accurate, difficult in practice
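
For reference, a trace-driven classifier can be sketched in a few dozen lines. The version below uses the textbook definition, which is an assumption here since the slides do not spell out the classification rule: a miss in the set-associative cache counts as a conflict miss if the same access hits in a fully associative LRU cache of equal capacity. All names are illustrative, and real hardware replacement policies differ from plain LRU, which is part of why simulating hardware is difficult.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 64, NUM_SETS = 64, WAYS = 8, LINES = NUM_SETS * WAYS };

typedef struct {
    uint64_t line[LINES];
    uint64_t stamp[LINES];  /* last-use time; 0 means the slot is empty */
    uint64_t clock;
} lru_cache;

typedef struct { long hits, conflict_misses, other_misses; } miss_stats;

/* Look up `line` in slots [first, first + count) and replace the least
 * recently used slot on a miss. Returns 1 on a hit, 0 on a miss. */
static int lru_access(lru_cache *c, int first, int count, uint64_t line)
{
    int victim = first;

    c->clock++;
    for (int i = first; i < first + count; i++) {
        if (c->stamp[i] != 0 && c->line[i] == line) {
            c->stamp[i] = c->clock;
            return 1;
        }
        if (c->stamp[i] < c->stamp[victim])
            victim = i;
    }
    c->line[victim] = line;
    c->stamp[victim] = c->clock;
    return 0;
}

/* Replay a trace of byte addresses through both cache models. */
static miss_stats simulate(const uint64_t *addrs, int n)
{
    static lru_cache set_assoc, fully_assoc;
    miss_stats s = {0, 0, 0};

    assert(n >= 0);
    memset(&set_assoc, 0, sizeof set_assoc);
    memset(&fully_assoc, 0, sizeof fully_assoc);
    for (int i = 0; i < n; i++) {
        uint64_t line = addrs[i] / LINE_BYTES;
        int set = (int)(line % NUM_SETS);
        int hit_sa = lru_access(&set_assoc, set * WAYS, WAYS, line);
        int hit_fa = lru_access(&fully_assoc, 0, LINES, line);

        if (hit_sa) s.hits++;
        else if (hit_fa) s.conflict_misses++;   /* capacity was not the problem */
        else s.other_misses++;                  /* cold or capacity miss */
    }
    return s;
}
```

Replaying nine addresses spaced 4096 bytes apart twice in a row shows the pattern: the first pass produces only cold misses, and every access in the second pass is a conflict miss because nine lines compete for the eight ways of one set.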

  60. A practical low-overhead solution: CCProf
      • Simulation methods: memory trace, cache simulator, classifying misses
      • Measurement methods (CCProf): memory sampling, statistical analysis, classifying misses
      • Overhead: memory trace >> memory sampling
      • Accuracy: cache simulator ~ statistical analysis

  62. Hardware-based address sampling (cont.)
      • Memory references over time: A[0][0], A[1][0], A[4][0], A[0][0], A[1][0], A[2][0], …, A[0][0]
      • Outcomes shown for these references: L1 miss, L1 miss, L1 miss, L1 miss, L1 hit, L1 hit, L1 miss
      • The PMU, using precise event sampling (PEBS), captures a subset of the references: A[0][0], A[4][0], A[2][0]
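
Once sampled miss addresses are available, a simple first look is to count how many fall into each cache set. The sketch below does only that; it is an illustration of working with sampled addresses, not CCProf's statistical analysis, which the slides here do not detail. It assumes 64-byte lines and 64 sets, and `set_histogram` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Fill hist[set] with the number of sampled miss addresses per set.
 * Returns the largest per-set count. */
static int set_histogram(const uint64_t *miss_addrs, int n, int hist[NUM_SETS])
{
    int max = 0;

    assert(n >= 0);
    for (int s = 0; s < NUM_SETS; s++)
        hist[s] = 0;
    for (int i = 0; i < n; i++) {
        int set = (int)((miss_addrs[i] / LINE_BYTES) % NUM_SETS);
        if (++hist[set] > max)
            max = hist[set];
    }
    return max;
}
```

Taking the three sampled references A[0][0], A[4][0], and A[2][0] with 1024-byte rows and an array assumed to start at address 0, the byte offsets are 0, 4096, and 2048, so two samples land in set 0 and one in set 32. A histogram dominated by a few sets hints at conflicts, although a handful of samples is far too few to draw conclusions from.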
