
Analyzing the Scalability of Managed Language Applications with Speedup Stacks - PowerPoint PPT Presentation



1. Analyzing the Scalability of Managed Language Applications with Speedup Stacks
   Jennifer B. Sartor, Kristof Du Bois, Stijn Eyerman, Lieven Eeckhout

2. Understanding Scalability Problems
- Multicore
- Managed languages
  - Service threads
- Speedup Stack
  - Bar graph that explains the causes of sublinear speedup
  - Ideal speedup of multi-threaded execution over single-threaded versus actual speedup
(Speedup Stacks: Stijn Eyerman, Kristof Du Bois, Lieven Eeckhout, ISPASS 2012)

3. Original Speedup Stacks
[Stacked-bar diagram: ideal speedup (# of threads) at the top; below it the speedup delimiters (imbalance, synchronization, memory or cache interference); actual speedup at the bottom.]
- Each delimiter segment shows how much that factor reduces speedup from the ideal.
- If a factor were completely removed, its segment indicates how much speedup could improve.

4. Original Speedup Stacks
- Scalability delimiters
  - Work imbalance
  - Spinning
  - Yielding
  - Last-level cache and memory interference (positive and negative)
- ❌ No managed components
- ❌ Dedicated hardware support

5. Our Contribution
- Managed service threads
- On native hardware
[Example speedup stack: y-axis speedup 0 to 4; stacked components: Garbage Collector, Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured.]

6. Managed Speedup Stacks
- Scalability delimiters
  - Garbage collector (10%*)
  - Managed runtime initialization
  - Synchronization
  - Thread imbalance
  - Other overheads
    - Parallelization overhead
    - Shared hardware resource interference
- On native hardware
  - Linux kernel modules
  - < 1% overhead on average
*T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley, "The yin and yang of power and performance for asymmetric hardware and managed software," ISCA, 2012

7. Background
$S = \frac{time_{single\text{-}threaded}}{time_{multi\text{-}threaded}} = \frac{T_s}{T_p}$
- Ideal speedup = # of threads ($N$)
- Example: $T_s = 20$, $T_p = 5$, four threads $T_0 \ldots T_3$ [timeline diagram, 0 to 20]
$T_s = \sum_{i=1}^{N} \Big( T_p - \sum_j O_{ij} \Big)$
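The definitions on this slide can be sanity-checked with a few lines of Python; the numbers are the slide's worked example (a four-thread run with $T_s = 20$ and $T_p = 5$):

```python
# Speedup is single-threaded time divided by multi-threaded time;
# ideal speedup equals the thread count N.
def speedup(t_single, t_multi):
    return t_single / t_multi

N = 4        # number of threads
T_s = 20.0   # single-threaded execution time
T_p = 5.0    # multi-threaded execution time (per-thread wall time)

# With no overheads, T_s = N * T_p, so actual speedup hits the ideal N.
print(speedup(T_s, T_p))  # 4.0
```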

8. Background
$S = \frac{time_{single\text{-}threaded}}{time_{multi\text{-}threaded}} = \frac{T_s}{T_p}$
- Ideal speedup = # of threads ($N$)
$T_s = \sum_{i=1}^{N} \Big( T_p - \sum_j O_{ij} \Big)$
$S = \frac{T_s}{T_p} = N - \frac{\sum_{i=1}^{N} \sum_j O_{ij}}{T_p}$
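A minimal sketch of this decomposition, with invented per-thread overhead values $O_{ij}$, shows how each delimiter becomes one segment of the stack and how the segments account for the gap between ideal and actual speedup:

```python
# Speedup-stack decomposition: S = N - sum_ij(O_ij) / T_p.
# Each delimiter's component is its total overhead across threads,
# normalized by the per-thread time T_p. Overhead values are invented.
T_p = 5.0
overheads = {                     # O_ij: time thread i loses to delimiter j
    "imbalance": [0.0, 0.5, 0.0, 0.5],
    "sync":      [0.5, 0.0, 0.5, 0.0],
    "memory":    [0.25, 0.25, 0.25, 0.25],
}
N = 4                             # number of threads

components = {name: sum(per_thread) / T_p
              for name, per_thread in overheads.items()}
S = N - sum(components.values())  # actual speedup after all delimiters

print(components)
print(S)
```

Stacking the components on top of the measured speedup S reconstructs the ideal speedup N, which is exactly how the bar graph is drawn.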

9. Managed: Garbage Collection
- When application paused
- In original speedup stacks: part of yielding
[Timeline diagram: threads $T_0 \ldots T_3$ over time 0 to 20, with GC pauses.]

10. Managed: Garbage Collection
- When application paused
- In original speedup stacks: part of yielding
$S = N - N \cdot \frac{T_{GC,MT} - T_{GC,ST}}{T_p} - \frac{\sum_i \sum_j O1_{ij}}{T_p}$
- If GC were perfectly scalable, this component would be 0
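The GC term of this equation is easy to compute in isolation; the timings below are invented for illustration:

```python
# GC component of the managed speedup stack:
#   N * (T_GC,MT - T_GC,ST) / T_p
# i.e. the extra garbage-collection time the multi-threaded run pays
# relative to the single-threaded run, scaled by N (GC pauses all
# application threads) and normalized by the per-thread time T_p.
def gc_component(n, t_gc_mt, t_gc_st, t_p):
    return n * (t_gc_mt - t_gc_st) / t_p

N, T_p = 4, 5.0
print(gc_component(N, t_gc_mt=1.0, t_gc_st=0.5, t_p=T_p))  # 0.4 of speedup lost
print(gc_component(N, t_gc_mt=0.5, t_gc_st=0.5, t_p=T_p))  # 0.0: component vanishes
```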

11. Managed: Runtime Initialization
- Java virtual machine initialization, compilation, shutdown
- Application threads not yet spawned, or paused
[Timeline diagram: threads $T_0 \ldots T_3$ over time 0 to 20, with initialization phases.]

12. Managed: Runtime Initialization
- Java virtual machine initialization, compilation, shutdown
- Application threads not yet spawned, or paused
$S = N - N \cdot \frac{T_{GC,MT} - T_{GC,ST}}{T_p} - N \cdot \frac{T_{init,MT} - T_{init,ST}}{T_p} - \frac{\sum_i \sum_j O2_{ij}}{T_p}$
- If initialization were perfectly scalable, this component would be 0

13. Other Speedup Delimiters
- Synchronization
  - When threads wait on each other
  - Measure wait time inside the futex syscall
- Thread Imbalance
  - When a thread executes longer than the other threads
  - Measure wait time inside the exit syscall
- Other Overhead
  - Parallelization overhead
  - Hardware interference
  - Estimated

14. Managed Speedup Stack
$S = N - N \cdot \frac{T_{GC,MT} - T_{GC,ST}}{T_p} - N \cdot \frac{T_{init,MT} - T_{init,ST}}{T_p} - \frac{\sum_i Sync_i}{T_p} - \frac{\sum_i Exit_i}{T_p} - \frac{\sum_i \sum_j O4_{ij}}{T_p}$
Components, top to bottom: ideal speedup (# of threads), garbage collector, initialization, thread imbalance, synchronization, other overheads, measured speedup.
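Putting the terms together, a sketch (all timings invented) that builds the full managed speedup stack as a dictionary of components:

```python
# Managed speedup stack:
#   S = N - N*(T_GC,MT - T_GC,ST)/T_p - N*(T_init,MT - T_init,ST)/T_p
#         - sum(Sync_i)/T_p - sum(Exit_i)/T_p - sum_ij(O4_ij)/T_p
def managed_speedup_stack(n, t_p, t_gc_mt, t_gc_st, t_init_mt, t_init_st,
                          sync, exit_wait, other):
    stack = {
        "garbage collector": n * (t_gc_mt - t_gc_st) / t_p,
        "initialization":    n * (t_init_mt - t_init_st) / t_p,
        "synchronization":   sum(sync) / t_p,
        "thread imbalance":  sum(exit_wait) / t_p,
        "other overheads":   sum(sum(row) for row in other) / t_p,
    }
    stack["measured speedup"] = n - sum(stack.values())
    return stack

# Hypothetical 4-thread run:
stack = managed_speedup_stack(
    n=4, t_p=5.0,
    t_gc_mt=1.0, t_gc_st=0.5,            # GC time, multi- vs single-threaded
    t_init_mt=0.75, t_init_st=0.5,       # runtime initialization time
    sync=[0.2, 0.1, 0.2, 0.1],           # per-thread futex wait time
    exit_wait=[0.0, 0.3, 0.0, 0.3],      # per-thread exit (imbalance) wait
    other=[[0.1], [0.1], [0.1], [0.1]],  # remaining per-thread overheads
)
for name, value in stack.items():
    print(f"{name}: {value:.3f}")
```

By construction, the components plus the measured speedup sum back to the ideal speedup N, which is what lets them be drawn as one stacked bar.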

15. Managed Speedup Stack
[Example speedup stack: y-axis speedup 0 to 4; segments GC, Initialize, Imbalance, Sync., Other, Measured, each annotated with its term in the speedup-stack equation.]

16. Experimental Methodology
- Java applications from the DaCapo 2009 suite
- Jikes Research Virtual Machine 3.1.2
- Garbage collector
  - 2 GC threads
  - 13th iteration for stable behavior
  - Heap size based on minimum with stop-the-world (STW) collector
  - STW generational Immix and concurrent collectors
- Intel Xeon E5, 8 cores per socket, 20 MB LLC
- Linux kernel 3.2.37

17. Speedup Stacks with STW GC
[Bar chart: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads; stacked components: Measured, Other Overheads, Synchronization, Thread Imbalance, Initialization, Garbage Collector; y-axis speedup 0 to 8.]

18. Performance Counters for STW GC
[Bar chart: Instructions, L1-loads, L1-load-misses, LLC-loads, and LLC-load-misses, relative to one thread (y-axis 0 to 4.5), for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads.]

19. Concurrent GC, Same Heap Size
[Bar chart: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads; same components as the STW GC stacks; y-axis speedup 0 to 8.]

20. Concurrent GC, Large Heap
[Bar chart: speedup stacks for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads; same components as the STW GC stacks; y-axis speedup 0 to 8.]

21. Performance Counters, Concurrent GC, Large Heap
[Bar chart: Instructions, L1-loads, L1-load-misses, LLC-loads, and LLC-load-misses, relative to one thread (y-axis 0 to 2), for lusearch, pmd, sunflow, and xalan at 2, 4, and 8 threads.]

22. Comparison Across Collectors (8 threads)
[Bar chart: speedup stacks for the stw, conc, and conc-large configurations of lusearch, pmd, sunflow, and xalan; same components as the STW GC stacks; y-axis speedup 0 to 8.]

23. Related Work
- Commercial
  - Intel VTune Amplifier XE
  - Sun Studio Performance Analyzer
  - Rogue Wave/Acumem ThreadSpotter
  - PGPROF
- IBM WAIT
- Criticality stacks & bottle graphs
- None quantify gross scalability bottlenecks; most don't analyze service threads

24. Conclusions: Managed Speedup Stacks
- Visualize scalability bottlenecks
- Show relative contributions of components
  - Garbage collector
  - Managed runtime initialization
- On native hardware at low overhead
- Show where to focus optimization: application or service threads
[Example speedup stack: y-axis speedup 0 to 4; components Garbage Collector, Initialization, Thread Imbalance, Synchronization, Other Overheads, Measured.]
