Exploiting Modern Hardware Features via Lightweight Profiling - PowerPoint PPT Presentation

Lightweight memory profiling
• Hardware profiling
• Event-based sampling with the PMU
  • Intel: precise event-based sampling (PEBS)
  • AMD: instruction-based sampling (IBS)
  • IBM: marked event sampling (MRK)
[Figure: application timeline with periodic samples. Each sample records the reference type {L1 miss, L2 hit, etc.}, the data address, and the instruction pointer.]

Outline
✓ Lightweight profiling
✓ SMT-aware optimization
• Detection of cache conflicts
• Guiding data-structure layout transformation

SMT-Aware Instantaneous Footprint Optimization [HPDC 2016]
Probir Roy, Shuaiwen Leon Song, Xu Liu

SMT (Simultaneous Multi-Threading)
[Figure: execution over clock cycles on a superscalar core and on a 2-way SMT core; legend: Thread 1, Thread 2, idle cycle.]

SMT scalability
• Shared-memory SPMD application
• Runtime ratio = SMT runtime / non-SMT runtime (lower is better)
[Figure: plot of runtime ratio.]

SMT architecture: shared cache
[Figure: two cores, each running two SMT threads (Thread 1 to Thread 4) on top of an L1 cache, with the L2 cache and LLC below.]

SMT: memory scalability
• SMT scaling factor (F) = access latency with SMT / access latency without SMT (lower is better)
[Figure: plot of SMT scaling factor.]

Characterization based on sensitivity
L = memory access latency; F = scaling factor

• (L, F) = (high, high): potentially sensitive to mem-centric SMT optimizations
  Benchmarks: srad, streamcluster1, Lulesh2.0, IRSmk, LU, 3D tensor, Stencil, streamcluster2, hotspot, Clomp
• (L, F) = (high, low): not clear if they can further benefit from SMT optimizations
  Benchmarks: lud, needle, bfs, nn, bp, canneal, Ferret
• (L, F) = (low, high): little benefit from mem-centric SMT optimization
  Benchmarks: leukocyte, heartwall, pathfinder, myocyte
• (L, F) = (low, low): good memory performance with SMT enabled
  Benchmarks: b+tree, cfd, kmeans, lavaMD, particle filter, hotspot3D, blackscholes, bodytrack, facesim, Swaptions
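The two ratios and the (L, F) characterization can be sketched in C as follows. This is a minimal illustration, not the authors' tool: the slides do not say where "high" ends and "low" begins, so the function names and both thresholds are hypothetical and must be supplied by the caller.

```c
#include <assert.h>

/* The four (L, F) categories from the characterization slide. */
typedef enum {
    SMT_SENSITIVE,       /* (high, high): potentially sensitive to mem-centric SMT optimizations */
    SMT_UNCLEAR,         /* (high, low):  not clear if further benefit is possible */
    SMT_LITTLE_BENEFIT,  /* (low, high):  little benefit from mem-centric SMT optimization */
    SMT_GOOD             /* (low, low):   good memory performance with SMT enabled */
} smt_class_t;

/* Runtime ratio = SMT runtime / non-SMT runtime (lower is better). */
double runtime_ratio(double runtime_smt, double runtime_non_smt) {
    assert(runtime_non_smt > 0.0);
    return runtime_smt / runtime_non_smt;
}

/* SMT scaling factor F = access latency with SMT / access latency without SMT. */
double smt_scaling_factor(double latency_smt, double latency_non_smt) {
    assert(latency_non_smt > 0.0);
    return latency_smt / latency_non_smt;
}

/* Classify a benchmark by latency L and scaling factor F.
 * Both thresholds are assumptions of this sketch, not values from the slides. */
smt_class_t classify_smt(double latency, double factor,
                         double latency_threshold, double factor_threshold) {
    int high_l = latency >= latency_threshold;
    int high_f = factor >= factor_threshold;
    if (high_l && high_f) return SMT_SENSITIVE;
    if (high_l)           return SMT_UNCLEAR;
    if (high_f)           return SMT_LITTLE_BENEFIT;
    return SMT_GOOD;
}
```

For example, `classify_smt(200.0, smt_scaling_factor(300.0, 150.0), 100.0, 1.5)` lands in the (high, high) category under these made-up thresholds.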

Source of memory contention: little/no locality
• Intra-thread. Optimization: improve cache line utilization
• Inter-thread. Optimization: collaboration
[Figure: accesses of SMT thread 1 and SMT thread 2 plotted in space versus time for the intra-thread and inter-thread cases; in the inter-thread case both threads touch cache line 1.]

SMT locality (Stencil code)

    #pragma omp parallel for schedule(static,1)
    for (int i = T; i < N - T; i++)
      for (int j = T; j < N - T; j++)
        for (int k = 0; k < T; k++)
          R[i][j] = matrix[i][j] + matrix[i - k][j] + matrix[i][j - k]
                  + matrix[i + k][j] + matrix[i][j + k];

The loop is first shown with a plain "#pragma omp parallel for"; the schedule(static,1) clause is then added.
[Figure: parts of the matrix assigned to Thread 1 and Thread 2, before and after adding schedule(static,1).]
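To see what the schedule(static,1) clause changes, the sketch below computes which thread owns each iteration of the outer loop. With schedule(static,1) OpenMP hands out single iterations round-robin, so neighboring rows go to different threads. Without a chunk size the exact split is implementation-defined; the block partition used here is a common realization and is an assumption of this sketch, as are the function names. Whether two neighboring threads are SMT siblings also depends on thread pinning, which this sketch does not model.

```c
#include <assert.h>

/* Owner of logical iteration `it` (0-based) under schedule(static,1):
 * chunks of one iteration are assigned to threads round-robin. */
int owner_static_chunk1(int it, int nthreads) {
    assert(it >= 0 && nthreads > 0);
    return it % nthreads;
}

/* Owner of logical iteration `it` under a block partition, a common way
 * to realize schedule(static) without a chunk size (assumption: the
 * OpenMP specification only requires roughly equal chunks). */
int owner_static_block(int it, int niters, int nthreads) {
    assert(it >= 0 && it < niters && nthreads > 0);
    int chunk = (niters + nthreads - 1) / nthreads;  /* ceil(niters / nthreads) */
    return it / chunk;
}
```

With 8 iterations and 2 threads, the block partition gives iterations 0 to 3 to thread 0 and 4 to 7 to thread 1, while schedule(static,1) alternates 0, 1, 0, 1, and so on, so rows i and i+1 are processed by different threads.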

SMT-Analyzer: analyzing memory access pattern
• The PMU samples (S) the application over time; each sample records the reference type, data address, instruction pointer, and thread
[Figure: memory range accessed within a loop by each of the threads T1 to T5.]

Benchmarks

Benchmark       Bottleneck region        % of total latency   Overhead   OPT method     Speedup
lulesh2.0       lulesh.cc:604-609        3.6%                 +3.1%      inter-thread   1.43×
IRSmk           rmatmult3.c:86-103       78.6%                +3.2%      intra-thread   4.86×
needle          needle.cpp:185-187       20%                  +2.99%     inter-thread   2.37×
srad            srad.cpp:136-167         80.1%                +2.47%     intra-thread   1.74×
LU              rhs.f:318-328            8.4%                 +10.6%     inter-thread   1.36×
Stencil         stencil.c:16-21          95.7%                +1.55%     inter-thread   10.9×
3D tensor       mt.c:22-22               69.4%                +2.4%      inter-thread   1.44×
streamcluster2  streamcluster.cpp:653    14.1%                +15.2%     inter-thread   6.72×

Related work: MACPO (selective instrumentation): 2× to 5×


Lightweight Detection of Cache Conflicts [CGO 2018]
Probir Roy, Shuaiwen Leon Song, Sriram Krishnamoorthy, Xu Liu

Set-associative cache
• Intel Skylake L1 cache: 32 KB, 8-way, 64 sets (Set 0 to Set 63)
• A 64-bit address is split into tag, set index, and offset
[Figure: cache drawn as 64 sets of 8 cache lines each; address layout TAG | SET index | Offset.]
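The address split can be written out as a small C sketch. The 64-byte line size is not stated on the slide; it follows from 32 KB / (8 ways × 64 sets). The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* L1 geometry from the slide: 32 KB, 8-way, 64 sets. */
enum {
    CACHE_SIZE = 32 * 1024,
    NUM_WAYS   = 8,
    LINE_SIZE  = 64,  /* derived: 32 KB / (8 ways * 64 sets) */
    NUM_SETS   = CACHE_SIZE / (LINE_SIZE * NUM_WAYS)
};

static_assert(NUM_SETS == 64, "geometry should yield 64 sets");

/* Byte offset within the cache line: lowest 6 bits of the address. */
uint64_t addr_offset(uint64_t addr) { return addr % LINE_SIZE; }

/* Set index: the next 6 bits above the offset. */
uint64_t addr_set(uint64_t addr) { return (addr / LINE_SIZE) % NUM_SETS; }

/* Tag: all remaining upper bits. */
uint64_t addr_tag(uint64_t addr) { return addr / ((uint64_t)LINE_SIZE * NUM_SETS); }
```

Two addresses compete for the same 8 ways exactly when `addr_set` returns the same value for both and their tags differ.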

Set conflict
• double Array[20,000][128];
• Set mapping: each row of 128 elements covers 16 consecutive sets, so row 0 maps to sets 0-15, row 1 to 16-31, row 2 to 32-47, row 3 to 48-63, row 4 to 0-15 again, and so on
• Sets accessed over time when walking down a column: 0, 16, 32, 48, 0, 16, 32, 48
• Set mapping after padding each row: rows map to sets 0-16, 17-33, 34-50, 51-3, 4-20, 21-37, ..., 55-7, so a column walk is spread over many sets (0, 17, 34, 51, 4, 21, ...)
• Is your application suffering from conflict cache misses?
[Figure: array layout with and without the pad column, the resulting set mappings, and the sets touched over time.]
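The effect of padding can be checked with a few lines of C that count how many distinct L1 sets a walk down column 0 touches. This is a sketch under assumptions: 64-byte lines and 64 sets, an array that starts at the beginning of set 0 (as drawn on the slide), 8-byte doubles, and a pad of 8 doubles (one cache line) per row, which is inferred from the set numbers in the figure rather than stated on the slide.

```c
#include <assert.h>
#include <stddef.h>

enum { LINE_SIZE = 64, NUM_SETS = 64 };

/* Count the distinct cache sets touched when reading element [i][0] for
 * i = 0 .. nrows-1 of a row-major double array whose rows hold
 * `row_doubles` doubles (padding included). Assumes the array base is
 * aligned to the start of set 0. */
int distinct_sets_in_column_walk(size_t row_doubles, size_t nrows) {
    assert(row_doubles > 0);
    int seen[NUM_SETS] = {0};
    int count = 0;
    for (size_t i = 0; i < nrows; i++) {
        size_t byte_offset = i * row_doubles * sizeof(double);
        size_t set = (byte_offset / LINE_SIZE) % NUM_SETS;
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

With 128 doubles per row the walk keeps returning to the same 4 sets; with 136 doubles per row (the assumed padding) the row stride is 17 cache lines, which is coprime with 64, so the walk eventually reaches all 64 sets.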

Trace-driven cache simulation
• Simulation methods: memory trace -> L1 cache simulator -> classifying miss -> conflict cache miss
• Example memory trace over time: A[0][0], A[1][0], A[2][0], A[0][0], A[3][0], A[2][0], …, A[0][0]
• High overhead: 38 times on average (Xiang, Xiaoya, Chen Ding, Hao Luo, and Bin Bao. "HOTL: a higher order theory of locality." ACM SIGPLAN Notices 48, no. 4 (2013): 343-356.)
• Difficult to simulate hardware
• Theoretically accurate, difficult in practice
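A toy version of such a simulator is sketched below. It uses the classic textbook rule for labeling a miss as a conflict miss: the access misses in the set-associative cache but would have hit in a fully associative LRU cache of the same capacity. The slides do not say which classification rule the cited simulators use, so treat this rule, the pure-LRU replacement, the 64-byte line size, and all names as assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE_SIZE = 64, NUM_SETS = 64, NUM_WAYS = 8, NUM_LINES = NUM_SETS * NUM_WAYS };

/* LRU list of cache-line numbers, most recently used first.
 * Every list is sized for the largest case to keep the sketch simple. */
typedef struct {
    uint64_t line[NUM_LINES];
    int used;
    int capacity;
} lru_t;

typedef struct {
    lru_t sets[NUM_SETS];  /* set-associative cache: 8-way LRU per set */
    lru_t full;            /* fully associative cache of the same capacity */
    long conflict_misses;  /* miss in sets[], hit in full */
    long other_misses;     /* compulsory or capacity misses */
} sim_t;

/* Touch `line` in `c` and move it to the front; returns 1 on hit, 0 on miss. */
static int lru_access(lru_t *c, uint64_t line) {
    assert(c->capacity > 0 && c->capacity <= NUM_LINES);
    int pos = -1;
    for (int i = 0; i < c->used; i++)
        if (c->line[i] == line) { pos = i; break; }
    int hit = (pos >= 0);
    if (!hit) {
        if (c->used < c->capacity) c->used++;
        pos = c->used - 1;  /* a fresh slot, or the slot of the LRU victim */
    }
    memmove(&c->line[1], &c->line[0], (size_t)pos * sizeof c->line[0]);
    c->line[0] = line;
    return hit;
}

void sim_init(sim_t *s) {
    memset(s, 0, sizeof *s);
    for (int i = 0; i < NUM_SETS; i++) s->sets[i].capacity = NUM_WAYS;
    s->full.capacity = NUM_LINES;
}

/* Feed one address from the memory trace and classify the outcome. */
void sim_access(sim_t *s, uint64_t addr) {
    uint64_t line = addr / LINE_SIZE;
    int hit_sa = lru_access(&s->sets[line % NUM_SETS], line);
    int hit_fa = lru_access(&s->full, line);
    if (!hit_sa) {
        if (hit_fa) s->conflict_misses++;
        else        s->other_misses++;
    }
}
```

Every access pays for list searches in two cache models, which gives a feel for why simulating each reference of a real trace is expensive.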

A practical low-overhead solution: CCProf
• Simulation methods: memory trace -> cache simulator -> classifying miss
• Measurement methods (CCProf): memory sampling -> statistical analysis -> classifying miss
• Overhead: simulation >> measurement
• Accuracy: simulation ~ measurement

Hardware-based address sampling (Cont.)
• Memory references over time: A[0][0], A[1][0], A[4][0], A[0][0], A[1][0], A[2][0], …, A[0][0]
[Figure: each reference is labeled as an L1 miss or an L1 hit; the PMU, using precise event sampling (PEBS), captures the addresses A[0][0], A[4][0], and A[2][0].]
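The idea of event-based address sampling can be mimicked in software: count L1 miss events and record the reference each time the counter reaches the sampling period. The sketch below only simulates this on an in-memory trace; the `mem_ref_t` record and the function name are hypothetical, and real sampling is configured through the operating system's PMU interface rather than done in a loop like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One memory reference of a hypothetical trace, carrying the fields a
 * sample reports: data address, instruction pointer, and whether the
 * reference missed in L1. */
typedef struct {
    uint64_t data_addr;
    uint64_t ip;
    int l1_miss;
} mem_ref_t;

/* Record every `period`-th L1 miss from `trace` into `out` (at most
 * `out_cap` entries) and return the number of samples taken. The counter
 * reset mimics a PMU counter overflowing and being re-armed. */
size_t sample_l1_misses(const mem_ref_t *trace, size_t n, unsigned period,
                        mem_ref_t *out, size_t out_cap) {
    assert(period > 0);
    size_t nsamples = 0;
    unsigned counter = 0;
    for (size_t i = 0; i < n; i++) {
        if (!trace[i].l1_miss)
            continue;            /* only miss events advance the counter */
        if (++counter == period) {
            counter = 0;
            if (nsamples < out_cap)
                out[nsamples++] = trace[i];
        }
    }
    return nsamples;
}
```

Only the sampled references are ever inspected, so the cost per run scales with the number of samples rather than with the number of memory references, which is the source of the low overhead compared with tracing.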


Exploiting Modern Hardware Features via Lightweight Profiling, Probir Roy, Scalable Tools Workshop 19
