Memoro: Scaling an LLVM-Based Heap Profiler

  1. Memoro: Scaling an LLVM-Based Heap Profiler
       Thierry Treyer (Performance & Capacity Intern)
       Mark Santaniello (Performance & Capacity Engineer)
       James Larus (EPFL IC School Dean)

  2. vector<BigT> getValues(
         map<Id, BigT>& largeMap,
         vector<Id>& keys) {
       vector<BigT> values;
       values.reserve(largeMap.size());
       for (const auto& key: keys)
         values.emplace_back(largeMap[key]);
       return values;
     }

  3. 40 GiB of DRAM wasted per server

  6. vector<BigT> getValues(
         map<Id, BigT>& largeMap,
         vector<Id>& keys) {
       vector<BigT> values;
       values.reserve(keys.size());  // sized by keys, not by largeMap
       for (const auto& key: keys)
         values.emplace_back(largeMap[key]);
       return values;
     }
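
The difference between the two reserve() calls can be made concrete with a small sketch. It is only an illustration: Id and BigT are hypothetical stand-ins (the slides do not define them), and the function names getValuesOverReserved, getValuesFixed, and reservedBytes are invented here to put both variants side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-ins for the slide's Id and BigT types.
using Id = int;
struct BigT {
    char payload[256];
};

// Slide 2 variant: reserves one slot per map entry, however few keys are requested.
std::vector<BigT> getValuesOverReserved(std::map<Id, BigT>& largeMap,
                                        const std::vector<Id>& keys) {
    std::vector<BigT> values;
    values.reserve(largeMap.size());
    for (const auto& key : keys)
        values.emplace_back(largeMap[key]);
    return values;
}

// Slide 6 variant: reserves only as many slots as will be filled.
std::vector<BigT> getValuesFixed(std::map<Id, BigT>& largeMap,
                                 const std::vector<Id>& keys) {
    std::vector<BigT> values;
    values.reserve(keys.size());
    assert(values.capacity() >= keys.size());  // guaranteed by reserve()
    for (const auto& key : keys)
        values.emplace_back(largeMap[key]);
    return values;
}

// Bytes of backing storage the vector holds, whether used or not.
std::size_t reservedBytes(const std::vector<BigT>& values) {
    return values.capacity() * sizeof(BigT);
}
```

With a 1,000-entry map and three keys, the first variant holds storage for at least 1,000 BigT objects while only three are used; comparing reservedBytes() on both results shows the gap.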

  7. LLVM-Based Profiler
       Memoro: collecting and displaying data
       Sanitizers: infrastructure
       LLVM: manipulate the IR

  11. Run-Time Overhead | Memoro + Visualizer | Open Challenges

  12. Overview: Compile, Run, Analyze
       Source code: no modification
       Instrumentation pass (LLVM): instrument loads/stores, instrument intrinsics, collect types
       Run-time (compiler-rt): intercept alloc/free, intercept loads/stores, intercept syscalls, collect stats
       Visualizer (Electron): score AP, guide exploration

  18. Run-Time Overhead | Memoro + Visualizer | Open Challenges

  20. 1,000x slowdown due to Memoro's run-time

  21. Run-Time Sampling
       THREADLOCAL int sample_count = 0;  // one counter per thread
       void interceptLoadStore(…) {
         // Sample accesses
         if (sample_count++ % access_sampling_rate != 0)
           return;
         /* Process access... */
       }
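
A runnable sketch of the same sampling idea follows. It is not Memoro's run-time code: the signature of interceptLoadStore, the boolean return value, and the processed_count counter are assumptions added here so the effect of the sampling rate can be observed.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the access_sampling_rate option named on the slides.
static std::uint64_t access_sampling_rate = 50;

// Per-thread counters: each thread samples independently, so the hot path
// needs no atomics and no locks.
static thread_local std::uint64_t sample_count = 0;
static thread_local std::uint64_t processed_count = 0;  // hypothetical, for observation only

// Returns true when the access was sampled, false when it was skipped.
bool interceptLoadStore(const void* addr) {
    assert(access_sampling_rate > 0);
    // Only every access_sampling_rate-th access gets past this check.
    if (sample_count++ % access_sampling_rate != 0)
        return false;
    ++processed_count;  // placeholder for the real per-access bookkeeping
    (void)addr;
    return true;
}
```

With a rate of 50, a thread that performs 1,000 accesses processes 20 of them (counts 0, 50, ..., 950); the other 980 return after a single increment and modulo.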

  23. Power to the user!
       MEMORO_OPTIONS="…" ./myapp
         - access_sampling_rate
         - ...
       // Public API: memoro_interface.h
       #include <memoro_interface.h>
       void foo(…) {
         MemoroFlags *mflags = memoro::getFlags();
         mflags->access_sampling_rate = 50;
         /* ... */
       }

  24. [Chart: time spent by address type, axis 0% to 100%; categories: Primary Heap, Secondary Heap, Not Heap (identified as Stack in the last build); callout: 99%]

  27. The Allocators (lookup for ld 0x…)
       Primary: O(1)
       Secondary (large allocations): O(n) 🔓
       Metadata: Addr, Size, First Access Time, Access Range Low, …

  29. Issue with non-heap addresses (Stack, Heap, …)
       1. Allocators only know about heap
       2. Traverse all allocations to discard them
       3. Takes a global lock
       [Chart: axis 0% to 100%; categories: Primary Heap, Secondary Heap, Stack]

  34. Run-Time Filter
       1. Thread start: store stack top (0xABCD)
       2. Get current stack bottom (0xAAAA)
       3. Discard if Addr. in this range
       [Diagram: Stack, Heap, …; example address 0xAABB lies in the stack range, 0x1234 lies in the heap]
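
The three steps of the filter can be sketched as follows. This is a minimal illustration, not Memoro's implementation: the function names are invented, the stack is assumed to grow toward lower addresses (as on x86-64 and AArch64), and the noinline attribute assumes GCC or Clang. The current stack bottom is approximated by the address of a local variable in the checking function.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: GCC or Clang. noinline keeps each function in its own frame,
// which the address comparisons below rely on.
#define NOINLINE __attribute__((noinline))

// Highest stack address of interest for this thread, stored at thread start.
static thread_local std::uintptr_t stack_top = 0;

// Step 1: call once at thread entry with the address of a local variable
// that lives in the thread's entry function.
void recordStackTop(const void* anchor) {
    stack_top = reinterpret_cast<std::uintptr_t>(anchor);
}

// Steps 2 and 3: approximate the current stack bottom with the address of a
// local in this frame, then report whether addr lies in [bottom, top].
NOINLINE bool isOnCurrentStack(const void* addr) {
    assert(stack_top != 0 && "recordStackTop() must run first");
    volatile char marker = 0;
    const auto bottom = reinterpret_cast<std::uintptr_t>(&marker);
    const auto value = reinterpret_cast<std::uintptr_t>(addr);
    return bottom <= value && value <= stack_top;
}

// Hypothetical helper for illustration: checks a local from a deeper frame.
NOINLINE bool probeLocalIsOnStack() {
    int local = 42;
    return isOnCurrentStack(&local);
}
```

The check is two comparisons against per-thread values, so a stack address can be discarded without consulting either allocator and without taking a lock. A heap pointer falls outside the range and continues to the allocator lookup.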

  38. [Chart: time spent by address type, axis 0% to 100%; categories: Primary Heap, Secondary Heap, Not heap, Stack Filtered; callouts: 0.58%, 99%, <2%]

  39. 1,000x slowdown due to Memoro's run-time

  40. 5x slowdown due to Memoro's run-time

  41. Run-Time Overhead | Memoro + Visualizer | Open Challenges

  43. + 100,000 Stack Traces, + 1B Allocations

  45. Truncate / HIDE [Plot: Score (0% to 40%) vs. Bin Size (100, 300, 1k, 3k, 10k, 30k, 100k)]

  47. [Figure: BEFORE vs. AFTER]

  49. Death by a thousand cuts [Diagram: two call stacks compared (VS.), each with main() at the bottom, one with foo() at the top and the other with bar()]

  58. Death by a thousand cuts [Diagram: the two call stacks drawn with main() at the top, foo() and bar() at the bottom]

  59. Memoro +

  61. Demo

  62. Run-Time Overhead | Memoro + Visualizer | Open Challenges

  63. Dumping Profile
       Your regular service: AtExit()
       Facebook service: AtExit(); call lldb AtExit()
         a. Signal to dump (SIGPROF)
         b. Ring buffer + Periodic write
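
Option (a) can be sketched with a plain POSIX signal handler. This is an assumption-laden illustration rather than Memoro's code: the names installDumpHandler, onDumpSignal, and consumeDumpRequest are invented, and the handler only sets a flag so that the actual profile write happens later, outside signal context.

```cpp
#include <cassert>
#include <signal.h>

// Set by the handler, cleared by the polling side. A volatile sig_atomic_t
// is safe to write from inside a signal handler.
static volatile sig_atomic_t dump_requested = 0;
static bool handler_installed = false;

extern "C" void onDumpSignal(int /*signum*/) {
    dump_requested = 1;  // no I/O here: keep the handler async-signal-safe
}

// Registers the handler for SIGPROF. Returns false if registration failed.
bool installDumpHandler() {
    struct sigaction action = {};
    action.sa_handler = onDumpSignal;
    sigemptyset(&action.sa_mask);
    handler_installed = (sigaction(SIGPROF, &action, nullptr) == 0);
    return handler_installed;
}

// Called from normal (non-signal) context, for example once per iteration of
// the service loop. Returns true when a profile dump should be written now.
bool consumeDumpRequest() {
    assert(handler_installed);
    if (!dump_requested)
        return false;
    dump_requested = 0;
    return true;
}
```

A long-running service would call installDumpHandler() at startup and poll consumeDumpRequest() at a convenient point; sending SIGPROF to the process then requests a dump without waiting for the process to exit.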

  73. Compile-Time Stack Analysis
       ld/st: llvm::GetUnderlyingObject()
       [Plot: Instrumented load/store (0 to 90000) and Ratio vs. GetUnderlyingObject(depth = X), X = 0, 1, 2, 4, 8; labels: foo(), bar()]

  79. Thank you!
       github.com/epfl-vlsc/memoro
       Thierry Treyer (Performance & Capacity Intern)
       Mark Santaniello (Performance & Capacity Engineer)
       James Larus (EPFL IC School Dean)
