Automatic Identification and Precise Attribution of DRAM Bandwidth Contention
Christian Helm and Kenjiro Taura
The University of Tokyo
2017-01-12
Performance Optimization
● Applications rarely reach peak performance of hardware
● Memory bandwidth is a bottleneck for many applications
● The problems originate from the interaction of software and hardware
DRAM Contention
● DRAM contention
  – More demand than resources available
  – Multiple cores compete over the single bandwidth resource
● Consumed bandwidth cannot identify contention
● An example: An application uses 95% of the available bandwidth
  – Good (No contention): The application gets everything it needs; all of the available resources are used
  – ☹ Bad (Contention): The application requests more than the DRAM can provide
NUMA Systems
● Higher aggregated bandwidth
● Requires usage of all DRAMs
(Figure: four NUMA nodes, node 0 to node 3, each with its own cores and a local DRAM)
Contributions
● A method to identify DRAM contention and imbalanced NUMA resource usage
  – Differentiation from harmless high bandwidth
  – Precise attribution to instructions and objects
  – Contention severity metric
  – NUMA imbalance severity metric
● Practical lightweight profiler
  – Single socket and NUMA systems supported
  – Runs on default Linux OS
  – Works on unmodified code with debug information
Contents
● Introduction
● Related Work
● Contention Detection Method
● Evaluation
● Case Studies
● Conclusion
Related Work - Contention
● Measurement of consumed bandwidth [Intel PCM] [Intel Vtune] [Weyers et al. VPA 2014]
  – Does not identify contention
  – No precise attribution
● Latency as indicator for memory problems [Lachaize et al. USENIX ATC 2012] [Liu et al. SC 2013] [Liu et al. PPoPP 2014] [Liu et al. SC 2015]
  – Does not identify contention
  – Precise attribution through instruction sampling
● Performance counter based detection [Yasin ISPASS 2014] [Molka et al. ICPE 2017] [Eklov et al. CGO 2013] [Eyerman ISPASS 2012]
  – Performance counters that identify bandwidth boundness and exclude other problems
  – No precise attribution
● Instruction sampling based [Xu et al. IPDPS 2017]
  – Machine learning approach based on latency and other features
  – Only NUMA remote memory; no local memory contention, no single socket systems
  – Severity cannot be quantified
Related Work - NUMA
● Show the location of allocation, first touch, and use of data [Liu et al. PPoPP 2014]
  – No quantification of imbalance
● Visual detection of imbalance [Gimenez et al. SC 2014] [Gimenez et al. TVCG 2017] [Trahay et al. ICPP 2018]
● OS extension with imbalance metric [Fedorova et al. ASPLOS 2013]
  – Standard deviation of load across nodes
Contents
● Introduction
● Related Work
● Contention Detection Method
● Evaluation
● Case Studies
● Conclusion
Relation of Bandwidth and Latency
● Known in queuing theory
  – Few arrivals: no queuing delay, only in-store processing time
  – Many arrivals: queuing delay + in-store processing time
● Application to DRAM
  – Low demand: only DRAM processing time
  – High demand: queuing delay + DRAM processing time
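The queuing relation can be made concrete with a toy model. The sketch below uses an M/M/1 queue (an assumption of this example; the slides only appeal to queuing theory in general) to show how latency multiplies as utilization of the single DRAM resource approaches 100%:

```python
# Toy M/M/1 queue (an assumed model, not the slides' measurement method):
# mean time in system W = S / (1 - rho), where S is the uncontended
# service time and rho is the utilization of the single DRAM resource.
def mm1_latency(service_time_ns, utilization):
    assert 0.0 <= utilization < 1.0
    return service_time_ns / (1.0 - utilization)

base = mm1_latency(80.0, 0.0)  # 80 ns is an assumed uncontended DRAM latency
for rho in (0.5, 0.9, 0.95):
    print(f"utilization {rho:.0%}: latency x{mm1_latency(80.0, rho) / base:.1f}")
```

At 50% utilization latency merely doubles, but at 95% it is 20x the uncontended value, which is why high consumed bandwidth alone cannot distinguish harmless heavy usage from contention.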
Latency of an Application
● Application latency measurement with instruction sampling
● Selected samples:

  ID  IP  Data Address  Latency  Memory Level  TLB   Locked
  0   a   aa            50       L2            hit   No      ← excluded (cache hit)
  1   b   bb            330      DRAM          miss  No      ← excluded (TLB miss)
  2   c   cc            600      DRAM          hit   Yes     ← excluded (atomic access)
  3   d   dd            300      DRAM          hit   No
  4   e   ee            290      DRAM          hit   No

● Average latency of at least 25 samples
● Precise attribution
[C. Helm and K. Taura, "On The Correct Measurement of Application Memory Bandwidth and Memory Access Latency", HPC Asia 2020]
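The sample filtering described above can be sketched as follows. The fields and the 25-sample threshold come from the slide; the dictionary layout and function name are assumptions of this sketch:

```python
# Average DRAM latency from instruction samples: drop cache hits,
# TLB misses, and atomic (locked) accesses before averaging.
def dram_latency(samples, min_samples=25):
    kept = [s for s in samples
            if s["level"] == "DRAM"   # exclude cache hits
            and s["tlb"] == "hit"     # exclude TLB misses
            and not s["locked"]]      # exclude atomic accesses
    if len(kept) < min_samples:
        return None  # too few samples for a reliable average
    return sum(s["latency"] for s in kept) / len(kept)

# The five samples from the slide's table; only IDs 3 and 4 survive.
rows = [
    {"latency": 50,  "level": "L2",   "tlb": "hit",  "locked": False},
    {"latency": 330, "level": "DRAM", "tlb": "miss", "locked": False},
    {"latency": 600, "level": "DRAM", "tlb": "hit",  "locked": True},
    {"latency": 300, "level": "DRAM", "tlb": "hit",  "locked": False},
    {"latency": 290, "level": "DRAM", "tlb": "hit",  "locked": False},
]
print(dram_latency(rows, min_samples=1))  # 295.0 (threshold lowered for toy data)
```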
Relative Latency Metric
● System latency
  – Uncontended DRAM access
  – Determined with a pointer chasing benchmark
● Relative latency = Application latency / System latency
● Hardware independent severity of DRAM contention
● A value higher than one indicates a contention problem
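As a minimal sketch of the metric (the concrete latency numbers are assumed for illustration, not from the slides):

```python
# Relative latency: measured application DRAM latency divided by the
# uncontended system latency from the pointer chasing benchmark.
def relative_latency(app_latency_ns, system_latency_ns):
    return app_latency_ns / system_latency_ns

# Assumed numbers: 295 ns observed under load vs. 80 ns uncontended.
print(relative_latency(295.0, 80.0) > 1.0)  # True → contention problem
```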
NUMA Imbalance Metric
● Local or remote access per sample:

  ID  CPU node  Origin  Memory Level
  0   0         Local   DRAM
  1   0         Remote  DRAM
  3   1         Local   DRAM
  4   1         Local   DRAM

● Local ratio (per node) = Number of local accesses / Number of total DRAM accesses
● NUMA imbalance = max(local ratio) – min(local ratio)
● 1 → High imbalance
● 0 → Low imbalance
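A minimal sketch of the metric, using the accesses from the table above (the (node, is_local) tuple layout is an assumption of this sketch):

```python
from collections import defaultdict

# NUMA imbalance: spread between the highest and lowest per-node local ratio.
def numa_imbalance(samples):
    local = defaultdict(int)   # local DRAM accesses per node
    total = defaultdict(int)   # all DRAM accesses per node
    for node, is_local in samples:
        total[node] += 1
        local[node] += is_local
    ratios = [local[n] / total[n] for n in total]
    return max(ratios) - min(ratios)

# Accesses from the slide's table: node 0 has one local and one remote
# access (ratio 0.5), node 1 has two local accesses (ratio 1.0).
print(numa_imbalance([(0, True), (0, False), (1, True), (1, True)]))  # 0.5
```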
Profiling Tool Implementation
(Figure: a run script launches the profiled application under Linux Perf and an allocation tracker; a merger combines their data into an Sqlite database, which the analyzer queries)
PerfMemPlus is available online: https://github.com/helchr/PerfMemPlus
Contents
● Introduction
● Related Work
● Contention Detection Method
● Evaluation
● Case Studies
● Conclusion
Hardware Setup

  Name      Architecture  CPUs          DRAM Bandwidth
  Arcturus  Broadwell     2x E5-2699v4  43 GB/s
  Comet     Haswell       2x E5-2699v3  32 GB/s
  Rigel     Skylake       2x Xeon 8176  77 GB/s
  Spica     Broadwell     4x E7-8890v4  25 GB/s
Experiment Design
● A defined amount of contention to compare with the detection results
  – Benchmark to create an adjustable amount of contention
  – A quantification of the severity of contention
Adjustable Contention Benchmarks
● Simple memory intensive parallel vector operations [Xu et al. IPDPS 2017]
  – Countv
  – Dotv
  – Sumv
● Two data sizes
  – Smaller than L3 cache: optimal case when DRAM bandwidth is no limitation
  – Larger than L3 cache: the DRAM bandwidth limitation has an impact
● Variable number of threads
  – More threads will increase the bandwidth requirement
Contention Quantification
● Speedup Loss = Speedup (small array version) / Speedup (large array version)
● Expresses the severity of the DRAM contention
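The metric can be sketched with assumed timings (the function names and numbers are illustrative, not from the slides):

```python
# Speedup loss: parallel speedup of the cache-resident (small array)
# run divided by the speedup of the DRAM-bound (large array) run.
def speedup(serial_time, parallel_time):
    return serial_time / parallel_time

def speedup_loss(small, large):
    """small/large are (serial_time, parallel_time) pairs."""
    return speedup(*small) / speedup(*large)

# Assumed timings: the small version scales 16x on 16 threads,
# the large version only 4x because DRAM bandwidth saturates.
print(speedup_loss((16.0, 1.0), (16.0, 4.0)))  # 4.0
```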
Detection Results
● Each experiment is repeated 10 times
● The percentage of correct detection is recorded
(Chart: detection rate over the upper boundary of the speedup loss interval)
Advantage of Latency over Bandwidth
● Compare information from three sources
  – Bandwidth
  – Latency
  – Speedup loss
Information From Bandwidth
(Chart: all benchmarks suffer from limited DRAM bandwidth)

Information From Latency
(Chart: DRAM access latency in the uncontended state)

Information From Latency
(Chart: the DRAM contention problem differs between benchmarks)

True Information
(Chart: the DRAM contention problem differs between benchmarks)
Contents
● Introduction
● Related Work
● Contention Detection Method
● Evaluation
● Case Studies
● Conclusion
Applications
● All 13 PARSEC benchmarks
● N3LP
  – Neural machine translation [Eriguchi et al., WAT, 2016] [https://github.com/hassyGo/N3LP]
  – Implemented using the Eigen library [http://eigen.tuxfamily.org]
Bandwidth Contention Details
(Chart: relative latency of streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica; values above 1 indicate contention)
NUMA Imbalance
(Chart: NUMA imbalance of streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica; annotations mark a NUMA imbalance problem and a small NUMA imbalance problem)
Interleaved Allocation Speedup
● High relative latency, high NUMA imbalance → interleaved allocation gives a large speedup
● High relative latency only on Spica, high NUMA imbalance on all systems → speedup only on Spica
● High relative latency, low NUMA imbalance → no or low speedup
(Chart: speedup in % from interleaved allocation for streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica)
Profiling Overhead (PARSEC)
(Chart: profiling overhead for the PARSEC benchmarks)
Conclusion
● A new method to identify DRAM contention

  Relative Latency  NUMA Imbalance  Performance Problem
  Low               Any             No DRAM contention problem
  High              Low             Contention but not NUMA related
  High              High            Contention due to inefficient NUMA usage

● Future work
  – Include more hardware related reasons of DRAM contention