How’s the Parallel Computing Revolution Going?
Towards Parallel, Scalable VM Services
Kathryn S. McKinley
The University of Texas at Austin

20th Century: Simplistic Hardware View
• Faster processors: frequency scaling, speculation, OO
• Programs do not change; they just run faster

Programming Language Evolution
[Figure: managed programming languages and native programming languages]

20th Century: Simplistic Software View
• Larger, more capable software: managed languages
• Hardware does not change; it just runs faster

Processor Technology Evolution
• Pentium 4 NetBurst (130nm), 2003
• Power 5, 2 cores (90nm), 2004
• Pentium M Dothan (90nm), 2005
• Core 2 Duo Conroe (65nm), 2006
• Atom Diamondville (45nm), 2008
• i7 Bloomfield (45nm), 2008
• Core 2 Duo Wolfdale (45nm), 2009
• i5 Clarkdale (32nm), 2010

The 20th Century Virtuous Cycle ✓
• Larger, more capable software: managed languages
• Faster single processor: frequency scaling

The 21st Century Virtuous Cycle?
• Scalable software: scalable apps + scalable runtime
• More cores: chip multiprocessors (CMP)

How is this new virtuous cycle going?

Measured Power vs Performance
[Figure: power (W, log scale) vs. speedup over Atom 230 (log scale) for Pentium 4 (130nm), Core 2 Duo (65nm), i7 (45nm), Core 2 Duo (45nm), and i5 (32nm). Benchmarks: SPEC CPU 2006, DaCapo, SPEC jvm98]

How is this new virtuous cycle going for single-threaded Java?

Performance Scaling: Single-Threaded Java Benchmarks
Core i7: 4 cores, 2-way SMT
[Figure: speedup vs. number of hardware contexts for antlr, bloat, compress, db, fop, jack, javac, jess, mpegaudio, pmd, raytrace, and their geomean]

How is this new virtuous cycle going for multi-threaded Java?

Performance Scaling: Multi-Threaded Java Benchmarks
Core i7: 4 cores, 2-way SMT
[Figure: speedup vs. number of hardware contexts for avrora, batik, eclipse, h2, jython, luindex, lusearch, mtrt, pjbb2005, sunflow, tomcat, tradebeans, tradesoap, xalan, and their geomean]

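The scaling plots report per-benchmark speedup and a geomean across benchmarks. A minimal sketch of that arithmetic follows, assuming speedup is the run time on one hardware context divided by the run time on n contexts. The function names and any timings you pass in are hypothetical; they are not measurements from the talk.

```python
import math

def speedups(time_by_contexts, baseline_contexts=1):
    """Speedup at each context count, relative to the baseline run time."""
    base = time_by_contexts[baseline_contexts]
    return {n: base / t for n, t in time_by_contexts.items()}

def geomean(values):
    """Geometric mean of positive values (the usual summary for ratios)."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical run times in seconds, keyed by number of hardware contexts.
example = speedups({1: 12.0, 2: 6.0, 4: 4.0})
```

The geometric mean is used rather than the arithmetic mean because speedups are ratios, and the geomean of ratios does not depend on which machine is chosen as the baseline.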
Power, Performance, and Concurrency
[Figure: two panels, Native and Java; single-threaded benchmarks hollow, multithreaded solid]
• Microarchitecture changes from Pentium 4 (130nm) to i5 (32nm) favored parallelism: no surprise
• Multithreaded performance incurs a significant power cost

Is there hope?

Managed Languages: Challenges & Opportunities

Must Start with a Scalable Managed Runtime

Sequential Managed Programs
[Figure: timeline of the application and the managed runtime sharing a single core]
Managed runtime services:
• Profiling
• Dynamic analysis
• Compilation
• Garbage collection
• Other helper threads
• …

Steps towards scalability
Step 1. Parallel application
[Figure: timeline over cores 0 to 7; application threads occupy some cores, the rest are unused]
• Each thread has a different running time

Steps towards scalability
Step 2. Parallel runtime
[Figure: timeline over cores 0 to 7; application threads, then managed runtime threads, then application threads]
• Runtime waits for all application threads to pause

Steps towards scalability
Step 3. Parallel & concurrent runtime
[Figure: timeline over cores 0 to 7; managed runtime threads run alongside application threads]
• Managed runtime on the application’s critical path may perturb its performance

Steps towards scalability
Step 4. Minimize perturbation (ideal model)
[Figure: timeline over cores 0 to 7; application threads run uninterrupted]
• Whole runtime task taken off the critical path
• Application offloads work to concurrent runtime threads
• Worst case is parallel & concurrent

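The difference between the steps can be made concrete with a toy timing model. This is only a sketch under loudly stated assumptions: runtime work is a single lump, it parallelizes perfectly, a stop-the-world pause adds directly to wall time, and offloaded work runs only on cores the application leaves unused. The function names and numbers are hypothetical and do not come from the talk.

```python
def stop_the_world_time(app_work, runtime_work, cores, parallel_runtime):
    """Wall time when the runtime runs only while all app threads are paused.

    app_work: per-thread application work, one entry per thread.
    parallel_runtime=False models a sequential runtime on one core;
    parallel_runtime=True models Step 2 (runtime uses every core).
    """
    pause = runtime_work / cores if parallel_runtime else runtime_work
    # The slowest application thread plus the pause sets the finish time.
    return max(app_work) + pause

def offloaded_time(app_work, runtime_work, cores):
    """Wall time for Step 4: runtime work runs on otherwise unused cores."""
    spare = cores - len(app_work)
    if spare <= 0:
        raise ValueError("no unused cores to offload runtime work to")
    # Runtime work is off the critical path unless it outlasts the app.
    return max(max(app_work), runtime_work / spare)

app = [10, 8, 6, 4]  # hypothetical: 4 threads with different running times
sequential = stop_the_world_time(app, 8, cores=8, parallel_runtime=False)
parallel = stop_the_world_time(app, 8, cores=8, parallel_runtime=True)
offloaded = offloaded_time(app, 8, cores=8)
```

Under these assumptions the offloaded case finishes when the slowest application thread does, which is the sense in which the runtime task is off the critical path. Step 3 sits between the parallel and offloaded cases, depending on how much the concurrent runtime perturbs the application.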
Vision
• Scalable runtimes
  – Runtime & application parallelism & concurrency
  – CMP-aware runtime improves application scalability
• Communication
  – Cache coherency is expensive and performance sensitive
  – Memory bandwidth scaling is problematic
• Heterogeneity
  – Move non-critical path off power-hungry cores
  – Smarter, more aggressive analysis
• Specialization?
  – Tuned cores? Special-purpose cores?

Approach
• Profiling (feedback-directed optimization)
  – Concurrent analysis
  – More invasive analysis on low-power cores
• GC
  – High-performance concurrent GC
  – High-performance non-moving GC
  – Reduced synchronization overheads
  – Distributed & scratchpad GC
• JIT
  – Concurrent, parallel JIT
  – Cost-benefit shift as low-power cores are used
• Architecture
  – Tuned and/or specialized cores for runtime services
  – Coherence tailored for the restricted, common case of GC

Today
• Profiling (feedback-directed optimization)
  – Concurrent analysis
  – More invasive analysis on low-power cores
• GC
  – High-performance concurrent GC
  – High-performance non-moving GC
  – Reduced synchronization overheads
  – Distributed & scratchpad GC
• JIT
  – Concurrent, parallel JIT
  – Cost-benefit shift as low-power cores are used
• Architecture
  – Tuned and/or specialized cores for runtime services
  – Coherence tailored for the restricted, common case of GC

A Concurrent Dynamic Analysis Framework for CMP Hardware
• Jungwoo Ha, U. Texas & UCS/ICI-East
• Matthew Arnold, IBM Research
• Stephen M. Blackburn, Australian National University
• Kathryn S. McKinley, University of Texas at Austin

Generic Sequential Analysis
[Figure: application timeline in which instrumented code (== overhead) performs data collection and analysis inline]
• Difficult to optimize instrumented code
• Trade accuracy for overhead (sampling)

Generic Concurrent Analysis
[Figure: the application (producer) runs instrumented code (reduced overhead) that collects data and enqueues it into a buffer; the analysis thread (consumer) dequeues and analyzes it]
• Lower overhead & higher accuracy
• Must deal with microarchitectural side-effects

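The producer/consumer split on this slide can be sketched in a few lines. This is an illustration of the generic structure only, assuming a counting analysis; `run_concurrent_analysis` and the event names are hypothetical. Note that Python's `queue.Queue` is lock-based, so it shows the shape of the design and not the lock-free channel the framework itself uses.

```python
import queue
import threading
from collections import Counter

def run_concurrent_analysis(events):
    """Application thread enqueues records; a separate thread analyzes them."""
    channel = queue.Queue()   # buffering between producer and consumer
    counts = Counter()        # analysis result, written only by the analyzer
    done = object()           # sentinel that tells the analyzer to stop

    def analyzer():
        while True:
            record = channel.get()   # dequeue
            if record is done:
                break
            counts[record] += 1      # the (hypothetical) analysis

    consumer = threading.Thread(target=analyzer)
    consumer.start()
    for event in events:
        channel.put(event)    # instrumented code does nothing but enqueue
    channel.put(done)
    consumer.join()           # wait until every record has been analyzed
    return counts

profile = run_concurrent_analysis(["call:f", "call:g", "call:f"])
```

The application's instrumentation shrinks to a single enqueue per event, which is where the reduced overhead comes from; the cost moves to the channel itself, and that is why its cache behavior matters.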
Side-effects to Avoid
[Figure: Core A runs the application (producer) and Core B runs the analysis (consumer); each has its own L1 and they share lower-level cache(s)]
• False & true sharing
• High-latency memory operations
• Cache line ping-ponging

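False sharing arises when two cores touch different data that happen to live in the same cache line. The arithmetic below shows how to tell whether two byte addresses share a line, and how padding records to whole lines separates them. It is a sketch assuming 64-byte cache lines, which is common but not stated in the talk; all names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: typical line size, not from the talk

def cache_line(address):
    """Index of the cache line that holds this byte address."""
    return address // CACHE_LINE_BYTES

def share_line(addr_a, addr_b):
    """True if both addresses fall in the same cache line."""
    return cache_line(addr_a) == cache_line(addr_b)

def padded_offset(index, record_bytes):
    """Byte offset of record `index` when records are padded to whole lines."""
    lines_per_record = -(-record_bytes // CACHE_LINE_BYTES)  # ceiling division
    return index * lines_per_record * CACHE_LINE_BYTES

# Two adjacent 8-byte fields share a line, so writes by different cores
# would bounce that line between their L1 caches.
adjacent_conflict = share_line(0, 8)
# After padding, record 0 and record 1 sit on different lines.
padded_conflict = share_line(padded_offset(0, 8), padded_offset(1, 8))
```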
Cache-friendly Asymmetric Buffering
• Lock-free communication channel between application and analysis thread
• Cache-friendly asymmetric buffering
  – Actively avoids microarchitectural side-effects
  – Enqueue
    • Light-weight instrumentation
    • Produces one record at a time
  – Dequeue
    • Consumes one chunk (a fraction of the buffer) at a time

Cache-friendly Asymmetric Buffering
[Figure: Core A runs the application (producer) and Core B runs the analysis (consumer); each has its own L1 and they share lower-level cache(s). The buffer has slots 0 to 15. Labels: "application writes here", "analyzer reads here", "analyzer waits for application here"]
• 16 slots in the buffer
• 4 chunks, 4 slots in each chunk
• L1 size == chunk size

Cache-friendly Asymmetric Buffering
[Figure: same layout as the previous slide, showing buffer slots 0 to 15 and which slots are resident in each core's L1; the analyzer reads a chunk that trails the application's write position]
• Delay consumer dequeue operation until cache line is flushed
  – 2 chunks away (smiley location)
• Analyzer operates one chunk at a time
  – chunk_size > L1 size
  – In practice, chunk_size >= 2 * L1 works well

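The asymmetric policy on these slides (enqueue one record at a time, dequeue one chunk at a time, and only when the producer is at least two chunks ahead) can be sketched as index arithmetic. This is a single-threaded illustration of the policy, not the authors' implementation: the class name, the choice to return `False` when the buffer is full, and the default sizes (taken from the 16-slot example) are assumptions. A real lock-free version would also need memory-ordering guarantees that plain Python does not express.

```python
class ChunkedBuffer:
    """Hypothetical single-producer, single-consumer chunked ring buffer."""

    def __init__(self, num_chunks=4, chunk_slots=4, delay_chunks=2):
        if delay_chunks >= num_chunks:
            raise ValueError("delay must be smaller than the chunk count")
        self.num_chunks = num_chunks
        self.chunk_slots = chunk_slots
        self.delay_chunks = delay_chunks
        self.slots = [None] * (num_chunks * chunk_slots)
        self.written = 0      # records enqueued so far; producer-owned
        self.chunks_read = 0  # chunks dequeued so far; consumer-owned

    def enqueue(self, record):
        """Producer: write one record; refuse if it would overwrite unread data."""
        unread = self.written - self.chunks_read * self.chunk_slots
        if unread >= len(self.slots):
            return False
        self.slots[self.written % len(self.slots)] = record
        self.written += 1
        return True

    def dequeue_chunk(self):
        """Consumer: take a whole chunk, but only one the producer left behind."""
        producer_chunk = self.written // self.chunk_slots
        if producer_chunk - self.chunks_read < self.delay_chunks:
            return None  # producer is too close; wait instead of sharing lines
        start = (self.chunks_read % self.num_chunks) * self.chunk_slots
        chunk = self.slots[start:start + self.chunk_slots]
        self.chunks_read += 1
        return chunk

buf = ChunkedBuffer()
for record in range(8):
    buf.enqueue(record)
first_chunk = buf.dequeue_chunk()  # producer is now 2 chunks ahead of chunk 0
```

Each counter has exactly one writer, which is what lets a design like this avoid locks. The two-chunk delay keeps the consumer away from lines the producer has recently written, so those lines have likely left the producer's L1 before the consumer reads them.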