
Analyzing the Scalability of Managed Language Applications with Speedup Stacks - PowerPoint PPT Presentation



1. Analyzing the Scalability of Managed Language Applications with Speedup Stacks
   Jennifer B. Sartor, Kristof Du Bois, Stijn Eyerman, Lieven Eeckhout

2. Understanding Scalability Problems
- Multicore
- Managed languages
  - Service threads
- Speedup Stack
  - Bar graph that explains the causes of sublinear speedup
  - Ideal speedup of multi-threaded execution over single-threaded versus actual speedup
(Speedup Stacks: Stijn Eyerman, Kristof Du Bois, Lieven Eeckhout, ISPASS 2012)

3. Original Speedup Stacks
[Stacked-bar diagram: ideal speedup (# of threads) at the top; below it the speedup delimiters (imbalance, synchronization, memory or cache interference); actual speedup at the bottom.]
- Each delimiter segment shows how much that factor reduces speedup from the ideal.
- If a factor were completely removed, its segment indicates how much speedup could improve.

4. Original Speedup Stacks
- Scalability delimiters
  - Work imbalance
  - Spinning
  - Yielding
  - Last-level cache and memory interference (positive and negative)
- ❌ No managed components
- ❌ Dedicated hardware support

5. Our Contribution
- Managed service threads
- On native hardware
[Example speedup stack: y-axis speedup 0 to 4; stacked components: Garbage Collector, Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured.]

6. Managed Speedup Stacks
- Scalability delimiters
  - Garbage collector (10%*)
  - Managed runtime initialization
  - Synchronization
  - Thread imbalance
  - Other overheads
    - Parallelization overhead
    - Shared hardware resource interference
- On native hardware
  - Linux kernel modules
  - < 1% overhead on average
*T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley, "The yin and yang of power and performance for asymmetric hardware and managed software," ISCA, 2012

7. Background
$S = \frac{time_{single\text{-}threaded}}{time_{multi\text{-}threaded}} = \frac{T_s}{T_p}$
- Ideal speedup = # of threads ($N$)
- Example: $T_s = 20$, $T_p = 5$, four threads $T_0 \ldots T_3$ [timeline diagram, 0 to 20]
$T_s = \sum_{i=1}^{N} \Big( T_p - \sum_j O_{ij} \Big)$
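The definitions on this slide can be sanity-checked with a few lines of Python; the numbers are the slide's worked example (a four-thread run with $T_s = 20$ and $T_p = 5$):

```python
# Speedup is single-threaded time divided by multi-threaded time;
# ideal speedup equals the thread count N.
def speedup(t_single, t_multi):
    return t_single / t_multi

N = 4        # number of threads
T_s = 20.0   # single-threaded execution time
T_p = 5.0    # multi-threaded execution time (per-thread wall time)

# With no overheads, T_s = N * T_p, so actual speedup hits the ideal N.
print(speedup(T_s, T_p))  # 4.0
```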

8. Background
$S = \frac{time_{single\text{-}threaded}}{time_{multi\text{-}threaded}} = \frac{T_s}{T_p}$
- Ideal speedup = # of threads ($N$)
$T_s = \sum_{i=1}^{N} \Big( T_p - \sum_j O_{ij} \Big)$
$S = \frac{T_s}{T_p} = N - \frac{\sum_{i=1}^{N} \sum_j O_{ij}}{T_p}$
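A minimal sketch of this decomposition, with invented per-thread overhead values $O_{ij}$, shows how each delimiter becomes one segment of the stack and how the segments account for the gap between ideal and actual speedup:

```python
# Speedup-stack decomposition: S = N - sum_ij(O_ij) / T_p.
# Each delimiter's component is its total overhead across threads,
# normalized by the per-thread time T_p. Overhead values are invented.
T_p = 5.0
overheads = {                     # O_ij: time thread i loses to delimiter j
    "imbalance": [0.0, 0.5, 0.0, 0.5],
    "sync":      [0.5, 0.0, 0.5, 0.0],
    "memory":    [0.25, 0.25, 0.25, 0.25],
}
N = 4                             # number of threads

components = {name: sum(per_thread) / T_p
              for name, per_thread in overheads.items()}
S = N - sum(components.values())  # actual speedup after all delimiters

print(components)
print(S)
```

Stacking the components on top of the measured speedup S reconstructs the ideal speedup N, which is exactly how the bar graph is drawn.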

9. Managed: Garbage Collection
- When application paused
- In original speedup stacks: part of yielding
[Timeline diagram: threads $T_0 \ldots T_3$ over time 0 to 20, with GC pauses.]

10. Managed: Garbage Collection
- When application paused
- In original speedup stacks: part of yielding
$S = N - N \cdot \frac{T_{GC,MT} - T_{GC,ST}}{T_p} - \frac{\sum_i \sum_j O1_{ij}}{T_p}$
- If GC were perfectly scalable, this component would be 0
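The GC term of this equation is easy to compute in isolation; the timings below are invented for illustration:

```python
# GC component of the managed speedup stack:
#   N * (T_GC,MT - T_GC,ST) / T_p
# i.e. the extra garbage-collection time the multi-threaded run pays
# relative to the single-threaded run, scaled by N (GC pauses all
# application threads) and normalized by the per-thread time T_p.
def gc_component(n, t_gc_mt, t_gc_st, t_p):
    return n * (t_gc_mt - t_gc_st) / t_p

N, T_p = 4, 5.0
print(gc_component(N, t_gc_mt=1.0, t_gc_st=0.5, t_p=T_p))  # 0.4 of speedup lost
print(gc_component(N, t_gc_mt=0.5, t_gc_st=0.5, t_p=T_p))  # 0.0: component vanishes
```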

11. Managed: Runtime Initialization
- Java virtual machine initialization, compilation, shutdown
- Application threads not yet spawned, or paused
[Timeline diagram: threads $T_0 \ldots T_3$ over time 0 to 20, with initialization phases.]

12. Managed: Runtime Initialization
- Java virtual machine initialization, compilation, shutdown
- Application threads not yet spawned, or paused
$S = N - N \cdot \frac{T_{GC,MT} - T_{GC,ST}}{T_p} - N \cdot \frac{T_{init,MT} - T_{init,ST}}{T_p} - \frac{\sum_i \sum_j O2_{ij}}{T_p}$
- If initialization were perfectly scalable, this component would be 0

13. Other Speedup Delimiters
- Synchronization
  - When threads wait on each other
  - Measure wait time inside the futex syscall
- Thread Imbalance
  - When a thread executes longer than the other threads
  - Measure wait time inside the exit syscall
- Other Overhead
  - Parallelization overhead
  - Hardware interference
  - Estimated

14. Managed Speedup Stack
$S = N - N \cdot \frac{T_{GC,MT} - T_{GC,ST}}{T_p} - N \cdot \frac{T_{init,MT} - T_{init,ST}}{T_p} - \frac{\sum_i Sync_i}{T_p} - \frac{\sum_i Exit_i}{T_p} - \frac{\sum_i \sum_j O4_{ij}}{T_p}$
Components, top to bottom: ideal speedup (# of threads), garbage collector, initialization, thread imbalance, synchronization, other overheads, measured speedup.
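Putting the terms together, a sketch (all timings invented) that builds the full managed speedup stack as a dictionary of components:

```python
# Managed speedup stack:
#   S = N - N*(T_GC,MT - T_GC,ST)/T_p - N*(T_init,MT - T_init,ST)/T_p
#         - sum(Sync_i)/T_p - sum(Exit_i)/T_p - sum_ij(O4_ij)/T_p
def managed_speedup_stack(n, t_p, t_gc_mt, t_gc_st, t_init_mt, t_init_st,
                          sync, exit_wait, other):
    stack = {
        "garbage collector": n * (t_gc_mt - t_gc_st) / t_p,
        "initialization":    n * (t_init_mt - t_init_st) / t_p,
        "synchronization":   sum(sync) / t_p,
        "thread imbalance":  sum(exit_wait) / t_p,
        "other overheads":   sum(sum(row) for row in other) / t_p,
    }
    stack["measured speedup"] = n - sum(stack.values())
    return stack

# Hypothetical 4-thread run:
stack = managed_speedup_stack(
    n=4, t_p=5.0,
    t_gc_mt=1.0, t_gc_st=0.5,            # GC time, multi- vs single-threaded
    t_init_mt=0.75, t_init_st=0.5,       # runtime initialization time
    sync=[0.2, 0.1, 0.2, 0.1],           # per-thread futex wait time
    exit_wait=[0.0, 0.3, 0.0, 0.3],      # per-thread exit (imbalance) wait
    other=[[0.1], [0.1], [0.1], [0.1]],  # remaining per-thread overheads
)
for name, value in stack.items():
    print(f"{name}: {value:.3f}")
```

By construction, the components plus the measured speedup sum back to the ideal speedup N, which is what lets them be drawn as one stacked bar.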

15. Managed Speedup Stack
[Example speedup stack: y-axis speedup 0 to 4; segments GC, Initialize, Imbalance, Sync., Other, Measured, each annotated with its term in the speedup-stack equation.]

16. Experimental Methodology
- Java applications from the DaCapo 2009 suite
- Jikes Research Virtual Machine 3.1.2
- Garbage collector
  - 2 GC threads
  - 13th iteration for stable behavior
  - Heap size based on minimum with stop-the-world (STW) collector
  - STW generational Immix and concurrent collectors
- Intel Xeon E5, 8 cores per socket, 20 MB LLC
- Linux kernel 3.2.37

17. Speedup Stacks with STW GC
[Bar chart: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads; stacked components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector; y-axis speedup 0 to 8.]

18. Performance Counters for STW GC
[Bar chart: Instructions, L1-loads, L1-load-misses, LLC-loads, and LLC-load-misses, relative to one thread (y-axis 0 to 4.5), for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads.]

19. Concurrent GC, Same Heap Size
[Bar chart: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads; same components as the STW GC stacks; y-axis speedup 0 to 8.]

20. Concurrent GC, Large Heap
[Bar chart: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads; same components as the STW GC stacks; y-axis speedup 0 to 8.]

21. Performance Counters, Concurrent GC, Large Heap
[Bar chart: Instructions, L1-loads, L1-load-misses, LLC-loads, and LLC-load-misses, relative to one thread (y-axis 0 to 2), for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads.]

22. Comparison Across Collectors (8 threads)
[Bar chart: speedup stacks for the stw, conc, and conc-large configurations of lusearch, pmd, sunflow, and xalan; same components as the STW GC stacks; y-axis speedup 0 to 8.]

23. Related Work
- Commercial
  - Intel VTune Amplifier XE
  - Sun Studio Performance Analyzer
  - Rogue Wave/Acumem ThreadSpotter
  - PGPROF
- IBM WAIT
- Criticality stacks & bottle graphs
- None quantify gross scalability bottlenecks; most don't analyze service threads

24. Conclusions: Managed Speedup Stacks
- Visualize scalability bottlenecks
- Show relative contributions of components
  - Garbage collector
  - Managed runtime initialization
- On native hardware at low overhead
- Show where to focus optimization: application or service threads
[Example speedup stack: y-axis speedup 0 to 4; components Garbage Collector, Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured.]
