Memoro: Scaling an LLVM-Based Heap Profiler
Thierry Treyer, Performance & Capacity Intern
Mark Santaniello, Performance & Capacity Engineer
James Larus, EPFL IC School Dean
vector<BigT> getValues(
    map<Id, BigT>& largeMap,
    vector<Id>& keys) {
  vector<BigT> values;
  values.reserve(largeMap.size());
  for (const auto& key: keys)
    values.emplace_back(largeMap[key]);
  return values;
}
40 GiB of DRAM wasted per server
vector<BigT> getValues(
    map<Id, BigT>& largeMap,
    vector<Id>& keys) {
  vector<BigT> values;
  values.reserve(keys.size());  // was: largeMap.size()
  for (const auto& key: keys)
    values.emplace_back(largeMap[key]);
  return values;
}
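The effect of the one-line change can be made visible by comparing the unused capacity of the returned vector in both versions. The following is a minimal sketch, not the production code: `Id`, `BigT`, the function names, and the sizes (1000 map entries, 10 keys, 4 KiB per element) are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-ins for the Id and BigT types on the slide.
using Id = int;
struct BigT {
  char payload[4096];
};

// Version from the first slide: reserves one slot per map entry.
std::vector<BigT> getValuesOverReserved(std::map<Id, BigT>& largeMap,
                                        const std::vector<Id>& keys) {
  std::vector<BigT> values;
  values.reserve(largeMap.size());
  for (const auto& key : keys) values.emplace_back(largeMap[key]);
  return values;
}

// Fixed version: reserves one slot per requested key.
std::vector<BigT> getValuesFixed(std::map<Id, BigT>& largeMap,
                                 const std::vector<Id>& keys) {
  std::vector<BigT> values;
  values.reserve(keys.size());
  for (const auto& key : keys) values.emplace_back(largeMap[key]);
  return values;
}

// Bytes that were reserved but never filled with an element.
std::size_t unusedBytes(const std::vector<BigT>& values) {
  assert(values.capacity() >= values.size());
  return (values.capacity() - values.size()) * sizeof(BigT);
}

// Builds a 1000-entry map, looks up 10 keys, and reports the unused bytes.
std::size_t demoUnusedBytes(bool use_fixed_version) {
  std::map<Id, BigT> largeMap;
  for (Id id = 0; id < 1000; ++id) largeMap[id] = BigT{};
  const std::vector<Id> keys = {1, 2, 3, 5, 8, 13, 21, 34, 55, 89};
  const std::vector<BigT> values =
      use_fixed_version ? getValuesFixed(largeMap, keys)
                        : getValuesOverReserved(largeMap, keys);
  return unusedBytes(values);
}
```

With these assumed sizes, the over-reserved version holds at least 990 unused elements for as long as the result lives, while the fixed version reserves only what the lookup needs.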
LLVM-Based Profiler
- Memoro: Collecting and Displaying data
- Sanitizers: Infrastructure
- LLVM: Manipulate the IR
Outline: Run-Time Overhead, Memoro + Visualizer, Open Challenges
Overview
- Source Code: No modification
- Compile: INSTRUMENTATION PASS (LLVM): Instrument loads/stores, Instrument intrinsics, Collect types
- Run: RUN-TIME (COMPILER-RT): Intercept alloc/free, Intercept loads/stores, Intercept syscalls, Collect stats
- Analyze: VISUALIZER (ELECTRON): Score AP, Guide exploration
Run-Time Overhead
1,000x slowdown due to Memoro's run-time
Run-Time Sampling
  THREADLOCAL int sample_count = 0;
  void interceptLoadStore(…) {
    // Sample accesses
    if (sample_count++ % access_sampling_rate != 0)
      return;
    /* Process access... */
  }
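The sampling check above can be exercised in isolation. The following is a minimal sketch, assuming standard C++ `thread_local` in place of the `THREADLOCAL` macro and a plain counter in place of the real access processing; `processed_accesses`, `countProcessed`, and the rate of 4 are hypothetical names and values for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed configuration value; the name follows the slide.
std::uint64_t access_sampling_rate = 4;

// One counter per thread, so threads do not contend on shared state.
thread_local std::uint64_t sample_count = 0;

// Hypothetical stand-in for the work done on a sampled access.
thread_local std::uint64_t processed_accesses = 0;

void interceptLoadStore(const void* /*addr*/) {
  assert(access_sampling_rate > 0);
  // Sample accesses: only every access_sampling_rate-th access is processed.
  if (sample_count++ % access_sampling_rate != 0) return;
  ++processed_accesses;
}

// Feeds n accesses through the interceptor on the calling thread and
// returns how many of them were actually processed.
std::uint64_t countProcessed(int n) {
  sample_count = 0;
  processed_accesses = 0;
  int dummy = 0;
  for (int i = 0; i < n; ++i) interceptLoadStore(&dummy);
  return processed_accesses;
}
```

With a rate of 4, 100 intercepted accesses lead to 25 processed ones; every other access returns after a single increment and modulo.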
Power to the user!
  MEMORO_OPTIONS="…" ./myapp
  - access_sampling_rate
  - ...

  // Public API: memoro_interface.h
  #include <memoro_interface.h>
  void foo(…) {
    MemoroFlags *mflags = memoro::getFlags();
    mflags->access_sampling_rate = 50;
    /* ... */
  }
99%
[Figure: "Time spent by address type", scale 0% to 100%; categories: Primary Heap, Secondary Heap, Stack (labeled "Not Heap" in the first build); one segment is labeled 99%.]
The Allocators
Looking up a loaded address (ld 0x…):
- Primary: O(1)
- Secondary (large allocations): O(n), takes a lock
Metadata: Addr, Size, First Access Time, Access Range Low, …
Issue with non-heap addresses
1. Allocators only know about heap
2. Traverse all allocations to discard them
3. Takes a global lock
[Figure: time spent by address type, scale 0% to 100%; categories: Primary Heap, Secondary Heap, Stack.]
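The three points above can be illustrated with a toy model of a lookup that scans every allocation under one lock. This is a sketch under assumptions, not the sanitizer allocator: `SecondaryAllocatorModel`, `Chunk`, and `chunksVisitedFor` are hypothetical names, and the chunk addresses are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical record of one large allocation.
struct Chunk {
  std::uintptr_t begin;
  std::size_t size;
};

// Toy model: a list of chunks guarded by a single global lock.
class SecondaryAllocatorModel {
 public:
  void add(std::uintptr_t begin, std::size_t size) {
    std::lock_guard<std::mutex> guard(lock_);
    chunks_.push_back({begin, size});
  }

  // Linear search under the lock. `visited` reports how many chunks were
  // examined before the answer was known.
  bool contains(std::uintptr_t addr, std::size_t& visited) const {
    std::lock_guard<std::mutex> guard(lock_);
    visited = 0;
    for (const Chunk& chunk : chunks_) {
      ++visited;
      if (addr >= chunk.begin && addr - chunk.begin < chunk.size) return true;
    }
    return false;  // Not a heap address: every chunk was checked.
  }

 private:
  mutable std::mutex lock_;
  std::vector<Chunk> chunks_;
};

// Builds a model with n_chunks made-up chunks and returns how many of them
// a lookup of `addr` has to visit.
std::size_t chunksVisitedFor(std::uintptr_t addr, std::size_t n_chunks) {
  SecondaryAllocatorModel model;
  for (std::size_t i = 0; i < n_chunks; ++i)
    model.add(0x100000 + i * 0x10000, 0x1000);
  std::size_t visited = 0;
  model.contains(addr, visited);
  return visited;
}
```

In this model a heap address can be found early, but an address outside the heap always pays for the full traversal while holding the lock, which is the worst case for every stack access.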
Run-Time Filter
1. Thread start: store stack top (0xABCD)
2. Get current stack bottom (0xAAAA)
3. Discard if Addr. in this range (0xAABB, on the stack, is discarded; 0x1234, on the heap, is kept)
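The three filter steps can be sketched as follows. This is an illustration under assumptions rather than Memoro's run-time code: it assumes a stack that grows toward lower addresses, uses the address of a local variable as the "current stack bottom", and relies on a GCC/Clang `noinline` attribute so that each helper has its own frame. All names (`onThreadStart`, `isStackAddress`, `runFilterDemo`, and so on) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

// Step 1: per-thread stack top, recorded once when the thread starts.
thread_local std::uintptr_t stack_top = 0;

void onThreadStart(const volatile void* marker_in_entry_frame) {
  stack_top = reinterpret_cast<std::uintptr_t>(marker_in_entry_frame);
}

// Steps 2 and 3: take the address of a local as the current stack bottom,
// then report whether addr lies between bottom and top.
SKETCH_NOINLINE bool isStackAddress(const volatile void* addr) {
  volatile char here = 0;
  const auto stack_bottom = reinterpret_cast<std::uintptr_t>(&here);
  const auto a = reinterpret_cast<std::uintptr_t>(addr);
  return stack_bottom <= a && a <= stack_top;
}

// A local variable in a nested call lives between bottom and top.
SKETCH_NOINLINE bool localInNestedCallIsFiltered() {
  volatile int local = 0;
  return isStackAddress(&local);
}

struct FilterDemoResult {
  bool stack_local_filtered;
  bool heap_filtered;
};

// Plays the role of a thread entry point for the demo.
SKETCH_NOINLINE FilterDemoResult runFilterDemo() {
  volatile int marker = 0;
  onThreadStart(&marker);
  auto heap_value = std::make_unique<int>(0);
  return {localInNestedCallIsFiltered(), isStackAddress(heap_value.get())};
}
```

The check is two comparisons on thread-local data, so a stack access can be discarded without touching the allocators or taking a lock.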
0.58%
[Figure: "Time spent by address type", scale 0% to 100%; categories: Primary Heap, Secondary Heap, Not heap, Stack Filtered; segment labels: 0.58%, 99%, <2%.]
Before: 1,000x slowdown due to Memoro's run-time
After: 5x slowdown due to Memoro's run-time
Memoro + Visualizer
+ 100,000 Stack Traces
+ 1B Allocations
Truncate
[Figure: Score (0% to 40%) against Bin Size (100, 300, 1k, 3k, 10k, 30k, 100k); the second build marks a region "HIDE".]
[Figure: visualizer view, BEFORE and AFTER.]
Death by a thousand cuts
[Figure: two call-stack views side by side (VS.), drawn with foo() and bar() at the top and main() at the bottom.]
[Figure: the same stacks inverted, with main() at the top and foo() and bar() at the bottom.]
Memoro + Visualizer
Example revisited: getValues() with values.reserve(largeMap.size()), as shown at the start.
Demo
Open Challenges
Dumping Profile
- Your regular service: AtExit()
- Facebook service: call AtExit() from lldb
  a. Signal to dump (SIGPROF)
  b. Ring buffer + Periodic write
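Option a. could look roughly like the sketch below. It is one possible shape under assumptions, not Memoro's implementation: the handler only sets a flag, and a hypothetical `maybeDumpProfile()` is polled from ordinary code and does the actual write, so no file I/O happens inside the signal handler. `SIGPROF` requires a POSIX system; all function and variable names are made up for illustration.

```cpp
#include <cassert>
#include <csignal>

// Set by the signal handler, consumed by the polling function.
volatile std::sig_atomic_t dump_requested = 0;

// Hypothetical counter standing in for "a profile was written".
int dumps_written = 0;

// The handler only records the request; it does no I/O itself.
extern "C" void onDumpSignal(int /*signum*/) { dump_requested = 1; }

// Registers the handler for SIGPROF. Returns false if registration failed.
bool installDumpSignal() {
  return std::signal(SIGPROF, onDumpSignal) != SIG_ERR;
}

// Meant to be called regularly from normal (non-handler) code.
// Returns true if a dump was performed.
bool maybeDumpProfile() {
  if (!dump_requested) return false;
  dump_requested = 0;
  ++dumps_written;  // Stand-in for writing the profile to disk.
  return true;
}
```

A long-running service could then be asked for a profile with `kill -PROF <pid>` without having to exit or attach a debugger.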
Compile-Time Stack Analysis
For each ld/st, call llvm::GetUnderlyingObject() on the pointer operand.
[Figure: "Ratio Instrumented load/store" (axis 0 to 90000) against GetUnderlyingObject(depth = X) for X = 0, 1, 2, 4, 8; callouts: foo(), bar().]
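The idea behind the depth parameter can be shown with a toy model that needs no LLVM headers. This sketch does not use the LLVM API: `Value`, `Kind`, `underlyingObject`, and `needsInstrumentation` are hypothetical, and the meaning of the depth limit here (the number of derivation steps to look through) is an assumption that may differ from LLVM's own `MaxLookup` semantics.

```cpp
#include <cassert>
#include <vector>

// Toy pointer-producing values; GEP and Cast derive from another pointer.
enum class Kind { Alloca, HeapCall, Argument, GEP, Cast };

struct Value {
  Kind kind;
  const Value* operand;  // Pointer this value derives from, or nullptr.
};

// Walks up to max_depth derivation steps toward the base object.
const Value* underlyingObject(const Value* v, unsigned max_depth) {
  for (unsigned step = 0; step < max_depth; ++step) {
    const bool derived = v->kind == Kind::GEP || v->kind == Kind::Cast;
    if (!derived) break;
    v = v->operand;
  }
  return v;
}

// A load/store needs instrumentation unless its pointer is proven to be
// based on a stack slot within the depth limit.
bool needsInstrumentation(const Value* pointer_operand, unsigned max_depth) {
  return underlyingObject(pointer_operand, max_depth)->kind != Kind::Alloca;
}

// Counts instrumented accesses over a small, made-up set of pointers.
unsigned countInstrumented(unsigned max_depth) {
  const Value stack_slot{Kind::Alloca, nullptr};
  const Value field{Kind::GEP, &stack_slot};
  const Value casted{Kind::Cast, &field};
  const Value arg{Kind::Argument, nullptr};
  const Value arg_field{Kind::GEP, &arg};
  const Value heap{Kind::HeapCall, nullptr};
  const std::vector<const Value*> accessed = {&stack_slot, &field, &casted,
                                              &arg_field, &heap};
  unsigned count = 0;
  for (const Value* pointer : accessed)
    if (needsInstrumentation(pointer, max_depth)) ++count;
  return count;
}
```

In this toy set, a deeper search removes more stack accesses at compile time until it reaches the pointers whose origin cannot be decided inside the function, such as arguments; those still have to be handled at run-time.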
Thank you!
github.com/epfl-vlsc/memoro
Thierry Treyer, Performance & Capacity Intern
Mark Santaniello, Performance & Capacity Engineer
James Larus, EPFL IC School Dean