Exploiting Modern Hardware Features via Lightweight Profiling

Lightweight memory profiling
• Hardware profiling
• Event-based sampling with the PMU
  • Intel: Precise Event-Based Sampling (PEBS)
  • AMD: Instruction-Based Sampling (IBS)
  • IBM: Marked Event Sampling (MRK)
[Figure: samples taken along the application timeline. Each sample records the reference type ({L1 miss, L2 hit, etc.}), the data address, and the instruction pointer.]

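The three fields recorded per sample can be pictured as a small record. The sketch below is illustrative only: the names `ref_type_t`, `mem_sample_t`, and `count_by_type` are hypothetical and do not come from any vendor interface, and real PMU records carry more fields in vendor-specific encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference types, following the slide's "{L1 miss, L2 hit, etc.}". */
typedef enum { REF_L1_HIT, REF_L1_MISS, REF_L2_HIT, REF_OTHER } ref_type_t;

/* One address sample: what kind of reference it was, what it touched,
 * and which instruction issued it. */
typedef struct {
    ref_type_t ref_type;   /* reference type */
    uint64_t   data_addr;  /* effective data address */
    uint64_t   ip;         /* instruction pointer */
} mem_sample_t;

/* Counts the samples of a given reference type. */
size_t count_by_type(const mem_sample_t *samples, size_t n, ref_type_t type)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (samples[i].ref_type == type)
            count++;
    return count;
}
```

On Linux, comparable fields are typically requested through `perf_event_open` with the `PERF_SAMPLE_IP`, `PERF_SAMPLE_ADDR`, and `PERF_SAMPLE_DATA_SRC` sample-type flags; the deck itself does not say which interface is used.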
Outline
✓ Lightweight profiling
✓ SMT-aware optimization
• Detection of cache conflicts
• Guiding data-structure layout transformation

SMT-Aware Instantaneous Footprint Optimization [HPDC – 2016]
Probir Roy, Shuaiwen Leon Song, Xu Liu

SMT (Simultaneous Multi-Threading)
[Figure: issue slots over clock cycles for a superscalar core and a 2-way SMT core. Legend: Thread 1, Thread 2, idle cycle.]

SMT scalability
Runtime ratio = SMT runtime / non-SMT runtime
[Figure: runtime ratio of shared-memory SPMD applications; lower is better.]

SMT architecture: shared cache
[Figure: two cores, with Threads 1 and 2 on one core and Threads 3 and 4 on the other. The threads on a core share that core's L1 cache; the figure also shows the L2 cache and the LLC.]

SMT: Memory scalability
SMT scaling factor (F) = memory access latency with SMT / memory access latency without SMT
[Figure: SMT scaling factor; lower is better.]

Characterization based on sensitivity
L = memory access latency; F = SMT scaling factor

(L, F) | Benchmarks | Characterization
(high, high) | srad, streamcluster1, Lulesh2.0, IRSmk, LU, 3D tensor, Stencil, streamcluster2, hotspot, Clomp | potentially sensitive to mem-centric SMT optimizations
(high, low) | lud, needle, bfs, nn, bp, canneal, Ferret | not clear if they can further benefit from SMT optimizations
(low, high) | leukocyte, heartwall, pathfinder, myocyte | little benefit from mem-centric SMT optimization
(low, low) | b+tree, cfd, kmeans, lavaMD, particle filter, hotspot3D, blackscholes, bodytrack, facesim, Swaptions | good memory performance with SMT enabled

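The four (L, F) quadrants can be expressed as a small classifier. This is a sketch under assumptions: the deck does not state where "high" begins for either axis, so the cut-off parameters `latency_cut` and `factor_cut` are hypothetical inputs, as are all the names used here.

```c
#include <assert.h>

/* The four characterizations from the table, in row order. */
typedef enum { SENSITIVE, UNCLEAR, LITTLE_BENEFIT, GOOD_WITH_SMT } smt_class_t;

/* F = memory access latency with SMT / memory access latency without SMT. */
double smt_scaling_factor(double latency_smt, double latency_non_smt)
{
    assert(latency_non_smt > 0.0);
    return latency_smt / latency_non_smt;
}

/* Maps (L, F) to a quadrant. The cut-offs are assumptions, not values
 * taken from the slides. */
smt_class_t characterize(double latency, double factor,
                         double latency_cut, double factor_cut)
{
    int high_l = latency >= latency_cut;
    int high_f = factor >= factor_cut;
    if (high_l && high_f) return SENSITIVE;      /* (high, high) */
    if (high_l)           return UNCLEAR;        /* (high, low)  */
    if (high_f)           return LITTLE_BENEFIT; /* (low, high)  */
    return GOOD_WITH_SMT;                        /* (low, low)   */
}
```

For example, a benchmark with a latency of 300 cycles under SMT and 200 cycles without it has F = 1.5; with illustrative cut-offs of 100 cycles and 1.2 it lands in the (high, high) quadrant.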
Source of memory contention
• Little/no locality
  • Intra-thread. Optimization: improve cache line utilization
  • Inter-thread. Optimization: collaboration
[Figure: space-time plots of the accesses made by SMT thread 1 and SMT thread 2, one plot per case; "Cache line 1" is marked for each thread.]

SMT locality (Stencil code)

    #pragma omp parallel for schedule(static,1)
    for (int i = T; i < N - T; i++)
      for (int j = T; j < N - T; j++)
        for (int k = 0; k < T; k++)
          R[i][j] = matrix[i][j] + matrix[i - k][j] + matrix[i][j - k]
                  + matrix[i + k][j] + matrix[i][j + k];

[Figure: assignment of matrix rows to Thread 1 and Thread 2, shown before and after the schedule(static,1) clause is added to the pragma.]

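A compilable version of the stencil loop may help when experimenting with the schedule clause. The sketch below assumes a square, row-major matrix stored in flat arrays, and the function name `stencil` and parameters `n` and `t` are illustrative stand-ins for the slide's `N` and `T`. With `schedule(static, 1)`, OpenMP deals iterations of the outer loop to threads round-robin, so with two threads neighboring rows go to different threads. Whether those two threads are SMT siblings on one core depends on thread placement (for example via `OMP_PLACES` and `OMP_PROC_BIND`), which this sketch does not control.

```c
#include <assert.h>

/* Stencil from the slide on an n x n row-major matrix with halo width t.
 * As on the slide, every k iteration overwrites r[i][j], so the value
 * that remains is the one computed for k = t - 1. */
void stencil(int n, int t, const double *matrix, double *r)
{
    assert(t >= 1 && n > 2 * t);
    /* static,1: outer-loop iterations are handed out one at a time,
     * round-robin across the threads of the team. */
    #pragma omp parallel for schedule(static, 1)
    for (int i = t; i < n - t; i++)
        for (int j = t; j < n - t; j++)
            for (int k = 0; k < t; k++)
                r[i * n + j] = matrix[i * n + j]
                             + matrix[(i - k) * n + j] + matrix[i * n + (j - k)]
                             + matrix[(i + k) * n + j] + matrix[i * n + (j + k)];
}
```

Built without OpenMP support, the pragma is ignored and the loop runs serially with the same result.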
SMT-Analyzer: Analyzing memory access pattern
[Figure: the PMU takes samples (S) along the application timeline. Each sample records the reference type, data address, instruction pointer, and thread. For a loop, the sampled memory range is plotted per thread (T1 to T5).]

Benchmarks

Benchmark | Bottleneck region | % of total latency | Overhead | OPT method | Speedup
lulesh2.0 | lulesh.cc:604-609 | 3.6% | +3.1% | inter-thread | 1.43×
IRSmk | rmatmult3.c:86-103 | 78.6% | +3.2% | intra-thread | 4.86×
needle | needle.cpp:185-187 | 20% | +2.99% | inter-thread | 2.37×
srad | srad.cpp:136-167 | 80.1% | +2.47% | intra-thread | 1.74×
LU | rhs.f:318-328 | 8.4% | +10.6% | inter-thread | 1.36×
Stencil | stencil.c:16-21 | 95.7% | +1.55% | inter-thread | 10.9×
3D tensor | mt.c:22-22 | 69.4% | +2.4% | inter-thread | 1.44×
streamcluster2 | streamcluster.cpp:653 | 14.1% | +15.2% | inter-thread | 6.72×

Related work: MACPO (selective instrumentation): 2× to 5×

Outline
✓ Lightweight profiling
✓ SMT-aware optimization
• Detection of cache conflicts
• Guiding data-structure layout transformation

Lightweight Detection of Cache Conflicts [CGO – 2018]
Probir Roy, Shuaiwen Leon Song, Sriram Krishnamoorthy, Xu Liu

Set-associative cache
• Example: Intel Skylake L1 cache, 32 KB, 8-way, Set 0 to Set 63
• A 64-bit address is split into tag, set index, and offset
[Figure: 64 sets of 8 cache lines each, with the address fields (tag, set index, offset) shown below.]

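The address split follows from the cache geometry: 32 KB / (64 sets × 8 ways) gives 64-byte lines, so the low 6 bits are the offset, the next 6 bits select the set, and the remaining bits form the tag. A minimal sketch of that split, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Geometry of the example L1 cache: 32 KB, 8-way, 64 sets, 64-byte lines. */
enum { LINE_BYTES = 64, NUM_SETS = 64, OFFSET_BITS = 6, INDEX_BITS = 6 };

/* Byte position within the cache line (low 6 bits). */
uint64_t line_offset(uint64_t addr) { return addr & (LINE_BYTES - 1); }

/* Set the line maps to (next 6 bits). */
uint64_t set_index(uint64_t addr) { return (addr >> OFFSET_BITS) & (NUM_SETS - 1); }

/* Remaining high bits, compared against the tags stored in the set. */
uint64_t tag_of(uint64_t addr) { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

Two addresses whose `set_index` is equal but whose `tag_of` differs compete for the same 8 ways.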
Set conflict
double Array[20000][128];
• Set mapping: each row of 128 doubles spans 16 cache lines, so consecutive rows map to sets 0-15, 16-31, 32-47, 48-63, then 0-15 again.
• Walking down one column over time therefore touches only sets 0, 16, 32, 48, 0, 16, 32, 48, …
• Set mapping after padding: with a pad at the end of each row, consecutive rows start at sets 0, 17, 34, 51, 4, 21, …, so a column walk is spread across the sets.
Is your application suffering from conflict cache misses?

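The effect of the pad can be checked with a few lines of arithmetic. The sketch below counts how many distinct sets a walk down column 0 touches for a given row length. It assumes 8-byte doubles, the 64-set, 64-byte-line geometry of the previous slide, an array base that maps to set 0, and a pad of 8 doubles (one cache line), which is what the "0, 17, 34, 51, …" mapping implies. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Number of distinct cache sets touched when reading element [i][0]
 * for i = 0 .. rows-1 of an array whose rows hold row_doubles doubles. */
unsigned distinct_sets_in_column_walk(size_t rows, size_t row_doubles)
{
    unsigned char seen[NUM_SETS] = {0};
    unsigned count = 0;
    for (size_t i = 0; i < rows; i++) {
        size_t byte_offset = i * row_doubles * sizeof(double);
        size_t set = (byte_offset / LINE_BYTES) % NUM_SETS;
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

With 128 doubles per row the walk uses 4 sets, which is 4 × 8 = 32 cache lines of usable L1 capacity; with 136 doubles per row it uses all 64 sets.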
Trace-driven cache simulation
Simulation methods: memory trace → L1 cache simulator → classifying miss (conflict cache miss)
[Figure: memory trace over time: A[0][0], A[1][0], A[2][0], A[0][0], A[3][0], A[2][0], …, A[0][0]]
• Overhead: 38 times on average (Xiang, Xiaoya, Chen Ding, Hao Luo, and Bin Bao. "HOTL: a higher order theory of locality." ACM SIGPLAN Notices 48, no. 4 (2013): 343-356.)
• High overhead
• Difficult to simulate hardware
• Theoretically accurate, difficult in practice

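To make the pipeline concrete, here is a minimal sketch of a trace-driven simulator that classifies misses. It is not the simulator evaluated in the cited work. It assumes true LRU replacement and the 64-set, 8-way, 64-byte-line geometry used earlier, and it uses the textbook rule that a miss is a conflict miss when a fully associative LRU cache of the same capacity would have hit. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 64, NUM_SETS = 64, WAYS = 8, NUM_LINES = NUM_SETS * WAYS };

typedef struct {
    uint64_t set_slots[NUM_SETS][WAYS]; /* per-set LRU lists, slot 0 = most recent */
    int      set_used[NUM_SETS];
    uint64_t full_slots[NUM_LINES];     /* shadow fully associative LRU list */
    int      full_used;
    unsigned long hits, conflict_misses, other_misses;
} sim_t;

/* Touches `line` in an LRU list of `capacity` slots; returns 1 on hit, 0 on miss. */
static int lru_touch(uint64_t *slots, int *used, int capacity, uint64_t line)
{
    int pos = -1;
    for (int i = 0; i < *used; i++)
        if (slots[i] == line) { pos = i; break; }
    int hit = (pos >= 0);
    if (!hit) {
        if (*used < capacity)
            (*used)++;       /* grow into a free slot */
        pos = *used - 1;     /* the free slot, or the LRU victim at the tail */
    }
    /* Shift the more recent entries back by one and put `line` in front. */
    memmove(&slots[1], &slots[0], (size_t)pos * sizeof slots[0]);
    slots[0] = line;
    return hit;
}

/* Feeds one byte address of the trace into both caches and classifies it. */
void sim_access(sim_t *s, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    int set = (int)(line % NUM_SETS);
    int hit_set_assoc = lru_touch(s->set_slots[set], &s->set_used[set], WAYS, line);
    int hit_fully_assoc = lru_touch(s->full_slots, &s->full_used, NUM_LINES, line);
    if (hit_set_assoc)
        s->hits++;
    else if (hit_fully_assoc)
        s->conflict_misses++; /* capacity was sufficient, but the set was full */
    else
        s->other_misses++;    /* cold or capacity miss */
}
```

A `sim_t` must be zero-initialized before use (for example `sim_t s = {0};`). Every access in the trace passes through `sim_access`, which is where the per-access cost of simulation comes from. Real replacement policies and prefetchers are not modeled here.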
A practical low-overhead solution
• Simulation methods: memory trace → cache simulator → classifying miss
• Measurement methods (CCProf): memory sampling → statistical analysis → classifying miss
• Overhead: simulation methods >> measurement methods
• Accuracy: simulation methods ~ measurement methods

Hardware-based address sampling (Cont.)
• Memory references over time: A[0][0], A[1][0], A[4][0], A[0][0], A[1][0], A[2][0], …, A[0][0]
• Outcome labels shown under the references: L1 Miss, L1 Miss, L1 Miss, L1 Miss, L1 Hit, L1 Hit, L1 Miss
• Precise event sampling (PEBS) in the PMU captures a subset of the references: A[0][0], A[4][0], A[2][0]

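One simple thing that can be done with sampled addresses is to bucket them by cache set and look at how unevenly they are spread. The sketch below does only that. It is an illustration, not CCProf's actual statistical analysis, which this part of the deck does not describe. It assumes the 64-set, 64-byte-line geometry used earlier, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Fills hist[s] with the number of sampled addresses that map to set s
 * and returns how many sets received at least one sample. */
size_t set_histogram(const uint64_t *addrs, size_t n, unsigned hist[NUM_SETS])
{
    size_t occupied = 0;
    for (size_t s = 0; s < NUM_SETS; s++)
        hist[s] = 0;
    for (size_t i = 0; i < n; i++) {
        size_t set = (size_t)((addrs[i] / LINE_BYTES) % NUM_SETS);
        if (hist[set]++ == 0)
            occupied++;
    }
    return occupied;
}
```

For the three sampled references above, taking the unpadded `double Array[20000][128]` layout with its base at offset 0, the byte offsets are 0, 4096, and 2048, which fall into sets 0, 0, and 32.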