How’s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services

  1. How’s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services Kathryn S McKinley The University of Texas at Austin Kathryn McKinley Towards Parallel, Scalable VM Services 1

  2. 20th Century Simplistic Hardware View. Faster processors: frequency scaling, speculation, OO. Programs do not change; they just run faster.

  3. Programming Language Evolution: managed programming languages and native programming languages.

  4. 20th Century Simplistic Software View. Larger, more capable software: managed languages. Hardware does not change; it just runs faster.

  5. Processor Technology Evolution [figure: processor timeline]. Pentium 4 NetBurst (130nm, 2003); Power 5, 2 cores (90nm, 2004); Pentium M Dothan (90nm, 2005); Core 2 Duo Conroe (65nm, 2006); Atom Diamondville (45nm, 2008); i7 Bloomfield (45nm, 2008); Core 2 Duo Wolfdale (45nm, 2009); i5 Clarkdale (32nm, 2010).

  6. The 20th Century Virtuous Cycle ✓ Faster single processor (frequency scaling) and larger, more capable software (managed languages).

  7. The 21st Century Virtuous Cycle? More cores (chip multiprocessors, CMP) and scalable software (scalable apps + scalable runtime).

  8. How is this new virtuous cycle going?

  9. Measured Power vs Performance [plot: power (W, log scale) against speedup over Atom 230 (log scale), measured on SPEC CPU 2006, DaCapo, and SPEC jvm98; points for Pentium 4 (130nm), Core 2 Duo (65nm), i7 (45nm), Core 2 Duo (45nm), and i5 (32nm), labeled by year between 2003 and 2010]

  10. How is this new virtuous cycle going for single-threaded Java?

  11. Performance Scaling: Single-Threaded Java Benchmarks. Core i7: 4 cores, 2-way SMT [plot: speedup against number of hardware contexts for antlr, bloat, compress, db, fop, jack, javac, jess, mpegaudio, pmd, raytrace, and their geomean]

  12. How is this new virtuous cycle going for multi-threaded Java?

  13. Performance Scaling: Multi-Threaded Java Benchmarks. Core i7: 4 cores, 2-way SMT [plot: speedup against number of hardware contexts for avrora, batik, eclipse, h2, jython, luindex, lusearch, mtrt, pjbb2005, sunflow, tomcat, tradebeans, tradesoap, xalan, and their geomean]

  14. Power, Performance, and Concurrency [plot panels: Native and Java; single-threaded benchmarks hollow, multithreaded solid]
      • Microarchitecture changes from Pentium 4 (130nm) to i5 (32nm) favored parallelism: no surprise
      • Multithreaded performance incurs a significant power cost

  15. Is there hope?

  16. Managed Languages: Challenges & Opportunities

  17. Must Start with a Scalable Managed Runtime

  18. Sequential Managed Programs [diagram: a single core over time, shared by the application and the managed runtime]. Managed runtime services: • Profiling • Dynamic Analysis • Compilation • Garbage Collection • Other Helper Threads • ……

  19. Steps towards scalability. Step 1. Parallel application [diagram: cores 0 to 7 over time; application threads occupy some cores, the rest are unused]. Each thread has a different running time.

  20. Steps towards scalability. Step 2. Parallel runtime [diagram: cores 0 to 7 over time; application threads, then managed runtime threads, then application threads]. Runtime waits for all application threads to pause.

  21. Steps towards scalability. Step 3. Parallel & concurrent runtime [diagram: cores 0 to 7 over time; managed runtime threads run alongside application threads]. Managed runtime on the application’s critical path may perturb its performance.

  22. Steps towards scalability. Ideal model. Step 4. Minimize perturbation [diagram: cores 0 to 7 over time; application threads run uninterrupted]. Whole runtime task taken off critical path. Application offloads work to concurrent runtime threads.

  23. Steps towards scalability. Ideal model. Step 4. Minimize perturbation [diagram: cores 0 to 7 over time; application threads]. Worst case is parallel & concurrent.

  24. Vision
      • Scalable Runtimes
        – Runtime & application parallelism & concurrency
        – CMP-aware runtime improves application scalability
      • Communication
        – Cache coherency is expensive and performance sensitive
        – Memory bandwidth scaling is problematic
      • Heterogeneity
        – Move non-critical path off power-hungry cores
        – Smarter, more aggressive analysis
      • Specialization?
        – Tuned cores? Special purpose cores?

  25. Approach
      • Profiling (feedback directed optimization)
        – Concurrent analysis
        – More invasive analysis on low-power cores
      • GC
        – High performance concurrent GC
        – High performance non-moving GC
        – Reduced synchronization overheads
        – Distributed & scratchpad GC
      • JIT
        – Concurrent, parallel JIT
        – Cost-benefit shift as low-power cores used
      • Architecture
        – Tuned and/or specialized cores for runtime services
        – Coherence tailored for restricted, common case of GC

  26. Today
      • Profiling (feedback directed optimization)
        – Concurrent analysis
        – More invasive analysis on low-power cores
      • GC
        – High performance concurrent GC
        – High performance non-moving GC
        – Reduced synchronization overheads
        – Distributed & scratchpad GC
      • JIT
        – Concurrent, parallel JIT
        – Cost-benefit shift as low-power cores used
      • Architecture
        – Tuned and/or specialized cores for runtime services
        – Coherence tailored for restricted, common case of GC

  27. A Concurrent Dynamic Analysis Framework for CMP Hardware. Jungwoo Ha (U. Texas & UCS/ICI-East), Matthew Arnold (IBM Research), Stephen M. Blackburn (Australian National University), Kathryn S. McKinley (University of Texas at Austin)

  28. Generic Sequential Analysis [diagram: over time, the application runs instrumented code (== overhead) that performs data collection and analysis inline]
      • Difficult to optimize instrumented code
      • Trade accuracy for overhead (sampling)
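
The sampling trade-off on this slide can be illustrated with a minimal sketch. The function name `sampled_profile` and the parameter `sample_interval` are hypothetical, not taken from the talk: instrumentation fires only on every `sample_interval`-th event, and the counts are scaled back up afterwards.

```python
from collections import Counter

def sampled_profile(events, sample_interval):
    """Record only every `sample_interval`-th event, then scale counts up.

    Hypothetical helper for illustration; a larger interval means less
    instrumentation work but a less accurate profile.
    """
    counts = Counter()
    for i, event in enumerate(events):
        if i % sample_interval == 0:   # instrumentation runs on sampled events only
            counts[event] += 1
    # Scale the sampled counts to estimate the full profile.
    return {event: n * sample_interval for event, n in counts.items()}
```

On a strictly alternating stream such as `["a", "b"] * 50`, an interval of 1 recovers the exact counts, while an interval of 2 only ever observes `"a"`, which shows how reduced overhead can cost accuracy.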

  29. Generic Concurrent Analysis [diagram: over time, the application (producer) runs instrumented code (reduced overhead) that only enqueues; data collection is buffered; a separate analysis thread (consumer) dequeues and analyzes]
      • Lower overhead & higher accuracy
      • Must deal with microarchitectural side-effects
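
The producer/consumer split on this slide can be sketched as follows. This is a minimal illustration, assuming Python's standard `queue.Queue` (which uses locks internally) as the buffer; it is not the lock-free channel the talk goes on to describe, and `run_concurrent_analysis` is a hypothetical name.

```python
import queue
import threading
from collections import Counter

def run_concurrent_analysis(events):
    """Count events on a separate analysis thread (hypothetical example)."""
    channel = queue.Queue()   # buffering between producer and consumer
    counts = Counter()
    done = object()           # sentinel marking the end of the event stream

    def analyzer():
        # Consumer: dequeue records and do the analysis off the producer's path.
        while True:
            record = channel.get()
            if record is done:
                break
            counts[record] += 1

    worker = threading.Thread(target=analyzer)
    worker.start()
    for event in events:      # Producer: the instrumentation only enqueues.
        channel.put(event)
    channel.put(done)
    worker.join()             # Wait until the analyzer has drained the queue.
    return counts
```

The application side does no analysis work beyond the enqueue, which is the source of the reduced overhead; the shared queue is also exactly where the microarchitectural side-effects mentioned above would arise on real hardware.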

  30. Side-effects to Avoid [diagram: Core A runs the application (producer) and Core B runs the analysis (consumer); each core has its own L1 above shared lower-level cache(s)]
      • False & true sharing
      • High latency memory operation
      • Cache line ping-ponging

  31. Cache-friendly Asymmetric Buffering
      • Lock-free communication channel between application and analysis thread
      • Cache-friendly asymmetric buffering
        – Actively avoids microarchitectural side-effects
        – Enqueue: light-weight instrumentation; produces one record at a time
        – Dequeue: consumes one chunk (fraction of a buffer) at a time

  32. Cache-friendly Asymmetric Buffering [diagram: Core A (application, producer) and Core B (analysis, consumer), each with an L1 above shared lower-level cache(s); an application buffer with slots 0 to 15, annotated “application writes here”, “analyzer waits for application here”, and “analyzer reads here”, with one analyzer chunk highlighted]
      • 16 slots in the buffer
      • 4 chunks, 4 slots in each chunk
      • L1 size == chunk size
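
The slot and chunk layout on this slide reduces to simple index arithmetic. The sketch below uses the slide's numbers (16 slots, 4 chunks of 4 slots); the names `chunk_of` and `slots_in_chunk` are hypothetical helpers introduced for illustration.

```python
BUFFER_SLOTS = 16                            # slots in the buffer (from the slide)
CHUNK_SLOTS = 4                              # slots per chunk (from the slide)
NUM_CHUNKS = BUFFER_SLOTS // CHUNK_SLOTS     # 4 chunks

def chunk_of(slot):
    """Return the chunk index that holds `slot`, wrapping around the ring."""
    return (slot % BUFFER_SLOTS) // CHUNK_SLOTS

def slots_in_chunk(chunk):
    """Return the slot indices that make up `chunk`, wrapping around the ring."""
    start = (chunk % NUM_CHUNKS) * CHUNK_SLOTS
    return list(range(start, start + CHUNK_SLOTS))
```

For example, slot 5 falls in chunk 1, and chunk 2 covers slots 8 through 11.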

  33. Cache-friendly Asymmetric Buffering [diagram: Core A (application, producer) and Core B (analysis, consumer) with the buffer slots currently held in each L1; an application buffer with slots 0 to 15 and one analyzer chunk highlighted]
      • Delay consumer dequeue operation until cache line is flushed
        – 2 chunks away (smiley location)
      • Analyzer operates one chunk at a time
        – chunk_size > L1 size
        – In practice, chunk_size >= 2 * L1 works well.
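
The delayed-dequeue rule can be sketched as a single-threaded simulation. This is an illustration under stated assumptions, not the framework's implementation: the class `AsymmetricBuffer` and its fields are hypothetical, the default sizes follow the earlier 16-slot example, and real lock-free code would additionally need atomic counters and memory ordering, which are not modeled here.

```python
class AsymmetricBuffer:
    """Ring buffer: enqueue one record at a time, dequeue one chunk at a time."""

    def __init__(self, num_chunks=4, chunk_slots=4, delay_chunks=2):
        self.chunk_slots = chunk_slots
        self.delay_chunks = delay_chunks     # consumer stays this many chunks behind
        self.slots = [None] * (num_chunks * chunk_slots)
        self.written = 0                     # total records enqueued so far
        self.read_chunks = 0                 # total chunks dequeued so far

    def enqueue(self, record):
        """Producer side: write one record; return False if the buffer is full."""
        unread = self.written - self.read_chunks * self.chunk_slots
        if unread >= len(self.slots):
            return False
        self.slots[self.written % len(self.slots)] = record
        self.written += 1
        return True

    def dequeue_chunk(self):
        """Consumer side: return one whole chunk, or None if it must still wait."""
        producer_chunk = self.written // self.chunk_slots   # chunk being written
        if producer_chunk - self.read_chunks < self.delay_chunks:
            return None                      # producer is not yet 2 chunks away
        start = (self.read_chunks * self.chunk_slots) % len(self.slots)
        chunk = self.slots[start:start + self.chunk_slots]
        self.read_chunks += 1
        return chunk
```

With the defaults, the consumer gets nothing after the first four records and receives records 0 through 3 only once the producer has started on the third chunk. A complete design would also need a flush at the end of the stream, since the last chunks never satisfy the delay condition on their own.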
