
Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications - PowerPoint PPT Presentation


  1. Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications. Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark

  2. Motivation
     • Hardware trends (2009 → 201X): put more cores on a single chip
     • CPU-intensive programs exploit thread-level parallelism
     • Do more threads always win? NO!

  3. Optimal Number of Threads
     • Too many threads – more synchronization, more contention for system resources
     • Too few threads – resource underutilization
     • Who can decide the number? Not the programmer

  4. Why Not the Programmer?
     • Input changes – various working-set sizes
     • The system changes – various available resources
     • Hardware changes – various L2/L3 cache structures, sizes, etc.
     • Therefore the decision must be made at runtime

  5. Proposal
     • [Diagram: the developer compiles and distributes a binary that creates lots of threads (e.g. > 128); at run time Thread Tailor combines them into fewer new threads (e.g. 16).]
     • Combining threads – group several threads into a single thread
       – Threads in the same group are executed serially
       – Executed on the SAME core

  6. Details
     • [Diagram: Development – the binary is instrumented, run to collect profile information, and the profiler builds graphs. Distribution – at run time Thread Tailor collects system info, feeds the graphs into the combining algorithm, and the code generator turns the result and the > 128-thread binary into combined code.]

  7. Graph Construction
     • [Diagram: each thread (e.g. Thread 1, Thread 2) becomes a node annotated with its cycle count (e.g. cycles = 10M) and working-set size (e.g. working-set = 10K); the edge between two threads is weighted by their synchronization cost and communication cost.]
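A minimal sketch of the thread-interaction graph the slide describes, assuming a plain adjacency-list layout; the struct and field names (ThreadNode, ThreadEdge, workingSetBytes, etc.) are illustrative, not the paper's actual data structures.

```cpp
#include <cstdint>
#include <vector>

// One node per dynamic thread observed during profiling.
struct ThreadNode {
    std::uint64_t cycles;          // e.g. 10M cycles of work in this thread
    std::uint64_t workingSetBytes; // e.g. 10K bytes touched by this thread
};

// One edge per pair of threads that interact.
struct ThreadEdge {
    int a, b;                       // indices into the node vector
    std::uint64_t syncCost;         // cycles spent synchronizing with each other
    std::uint64_t communicationCost;// estimated coherence-miss cost (next slide)
};

struct ThreadGraph {
    std::vector<ThreadNode> nodes;
    std::vector<ThreadEdge> edges;
};
```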

  8. Communication Cost
     • Intuition: a STORE instruction causes coherence misses in the other thread's cache
     • Log memory accesses per thread (address → LD count / ST count):
       – Thread 1: 0x00001234 → 5 / 10, 0x00001338 → 4 / 9, 0x00004000 → 7 / 7, …
       – Thread 2: 0x00001234 → 0 / 7, 0x00002000 → 4 / 4, 0x00004000 → 3 / 8, …
     • For each address both threads touch, pair stores against the other thread's accesses:
       – 0x00001234: MIN(5, 7) + MIN(10, 0) + MIN(10, 7) = 12
       – 0x00004000: MIN(7, 8) + MIN(7, 3) + MIN(7, 8) = 17
     • Total communication cost = 12 + 17 = 29, the weight of the edge between threads 1 and 2 in the graph
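A sketch of the per-edge cost computation, assuming the pairing suggested by the slide's example (each shared address contributes MIN(LD1, ST2) + MIN(ST1, LD2) + MIN(ST1, ST2)); the AccessLog layout and function name are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>

// Per-thread access log: address -> (load count, store count).
struct Counts { std::uint64_t ld = 0, st = 0; };
using AccessLog = std::unordered_map<std::uint64_t, Counts>;

// Estimated communication cost between two threads, following the pairing
// used in the slide's example: stores by one thread that collide with loads
// or stores by the other thread cause coherence misses.
std::uint64_t communicationCost(const AccessLog& t1, const AccessLog& t2) {
    std::uint64_t cost = 0;
    for (const auto& [addr, c1] : t1) {
        auto it = t2.find(addr);
        if (it == t2.end()) continue;      // address not shared
        const Counts& c2 = it->second;
        cost += std::min(c1.ld, c2.st)     // T2 stores invalidate T1 loads
              + std::min(c1.st, c2.ld)     // T1 stores invalidate T2 loads
              + std::min(c1.st, c2.st);    // conflicting stores
    }
    return cost;
}

// With the slide's numbers (0x1234: 5/10 vs 0/7, 0x4000: 7/7 vs 3/8),
// this returns 12 + 17 = 29, the edge weight between Thread 1 and Thread 2.
```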

  9. Combining Algorithm
     • Kernighan-Lin (KL) graph partitioning heuristic
       – Goal: minimize execution cycles
       – Precondition: combined threads ≤ cores
     • [Diagram: example graph of eight thread nodes A–H with edge weights of 60 (and one of 10), to be partitioned across 2 cores.]
     • Iterations (partition 1 / partition 2, with estimated cycles and the next node to move):
       – {B, C, D, G} / {A, E, F, H}: 210 / 220 → move A from partition 2
       – {A, B, C, D, G} / {E, F, H}: 130 / 120 → move G from partition 1
       – {A, B, C, D} / {E, F, G, H}: 40 / 40 → (next candidate: D from partition 1)
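A greatly simplified sketch of a KL-style refinement over two partitions (one per core), in the spirit of the table above: estimate each partition's cycles, greedily move the node that most lowers the slower partition, and stop when no move helps. The cost model and all names are assumptions, not the paper's exact algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node { std::uint64_t cycles; };           // per-thread work
struct Edge { int a, b; std::uint64_t cost; };   // sync + communication cost

// Estimated cycles for one partition: its nodes' work plus the cost of
// edges that cross the cut (an illustrative cost model).
std::uint64_t estimate(const std::vector<Node>& nodes, const std::vector<Edge>& edges,
                       const std::vector<int>& part, int which) {
    std::uint64_t cycles = 0;
    for (std::size_t i = 0; i < nodes.size(); ++i)
        if (part[i] == which) cycles += nodes[i].cycles;
    for (const Edge& e : edges)
        if (part[e.a] != part[e.b]) cycles += e.cost;
    return cycles;
}

// Greedy KL-style refinement: keep moving the single node that most
// reduces the slower partition's estimate until no move improves it.
void refine(const std::vector<Node>& nodes, const std::vector<Edge>& edges,
            std::vector<int>& part) {
    for (;;) {
        std::uint64_t best = std::max(estimate(nodes, edges, part, 0),
                                      estimate(nodes, edges, part, 1));
        int bestNode = -1;
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            part[i] ^= 1;                               // tentative move
            std::uint64_t m = std::max(estimate(nodes, edges, part, 0),
                                       estimate(nodes, edges, part, 1));
            if (m < best) { best = m; bestNode = static_cast<int>(i); }
            part[i] ^= 1;                               // undo
        }
        if (bestNode < 0) break;   // no single move improves the estimate
        part[bestNode] ^= 1;       // commit the best move and repeat
    }
}
```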

  10. Thread Combining
     • The application's thread APIs are replaced with wrapper functions; the dynamic compiler translates code into a code cache
     • Wrapper function for thread creation, vm_thread_create():
       – Is this thread a target to combine? No: create a normal thread. Yes: create a user thread
     • User threads in the same group are context-switched by the dynamic compiler, so they execute serially inside one real thread
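A minimal sketch of the wrapper idea, using the slide's vm_thread_create name but otherwise illustrative types and a made-up grouping policy. The real system translates code through a dynamic compiler and context-switches user threads so they can interleave; this sketch simply defers the combined threads' work and runs it back to back in one real thread.

```cpp
#include <functional>
#include <thread>
#include <unordered_map>
#include <vector>

using ThreadFn = std::function<void()>;

// Work deferred for each combine group (no locking shown for brevity).
static std::unordered_map<int, std::vector<ThreadFn>> g_groups;

// Placeholder policy: in the real system this comes from the combining
// algorithm's result; here, odd thread ids go to group 0 (illustrative).
int combineGroupFor(int logicalThreadId) {
    return (logicalThreadId % 2 == 1) ? 0 : -1;   // -1 means "do not combine"
}

// Wrapper substituted for the application's thread-creation API.
// Returns a non-joinable std::thread when the work was deferred instead.
std::thread vm_thread_create(int logicalThreadId, ThreadFn fn) {
    int group = combineGroupFor(logicalThreadId);
    if (group < 0)
        return std::thread(std::move(fn));       // not a target: normal thread
    g_groups[group].push_back(std::move(fn));    // target: defer into its group
    return std::thread();
}

// One real thread per group later runs its members back to back.
void runGroupSerially(int group) {
    for (ThreadFn& fn : g_groups[group]) fn();
}
```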

  11. Experimental Setup
     • 2 cores – Intel Core 2 Duo 6600 (2.4 GHz)
     • 4 cores – Intel Core 2 Quad Q6600 (2.4 GHz)
     • 8 cores – 2 quad-core CPUs, Intel Xeon E5520 (2.26 GHz)
     • 16 cores (logical) – 2 quad-core CPUs with SMT (HyperThreading), Intel Xeon E5520 (2.26 GHz)

  12. Results
     • [Chart: speedup vs. core count (2, 4, 8, 16) for fluidanimate, transpose, blackscholes, twister, water_n^2, and swaptions; the y-axis spans 0.9–1.2, with off-scale bars labeled 1.31, 1.66, 2.36, and 1.83.]

  13. Result Analysis - Transpose
     • Transpose an m x n matrix into an n x m matrix (e.g. [[1 2 3], [4 5 6]] becomes [[1 4], [2 5], [3 6]])
     • Parallel transpose: threads work on strips of the matrices, with Thread 1 and Thread 2 operating 128 columns apart in the input matrix and 128 rows apart in the output matrix
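A minimal sketch of the kind of strip-based parallel transpose the analysis refers to; the 128-column strip width comes from the slide, while the float element type and round-robin strip assignment are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Transpose an m x n row-major matrix into an n x m row-major matrix.
// Each thread takes 128-column strips of the input in round-robin order, so
// neighbouring threads work 128 columns apart in the input and 128 rows
// apart in the output, as in the slide.
void parallelTranspose(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t m, std::size_t n, unsigned numThreads) {
    const std::size_t strip = 128;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t c0 = t * strip; c0 < n; c0 += numThreads * strip)
                for (std::size_t j = c0; j < std::min(c0 + strip, n); ++j)
                    for (std::size_t i = 0; i < m; ++i)
                        out[j * m + i] = in[i * n + j];   // out[j][i] = in[i][j]
        });
    }
    for (std::thread& w : workers) w.join();
}
```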

  14.–21. Result Analysis - Transpose (continued)
     • [Animated diagram, built up across slides 14–21: one core of an Intel Nehalem machine with a private 32K L1, a private 256K L2, and a shared 8M L3, operating on 16K x 16K input and output matrices.]
     • Cache blocks are 64 bytes; the two threads' accesses are 128 columns (512 bytes) apart in the input matrix and 128 rows apart in the output matrix
     • The inner loop touches an 8KB region (128 x 64-byte blocks) of the input and an 8KB region of the output, iterating 128 times over each
     • Writes to the output region hit in the cache (WRITE HIT!), and the working set fits into the L1 cache, so there are no capacity misses
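The cache arithmetic behind slides 14–21, written out as a tiny check; the 64-byte block, 128-block regions, and 32K L1 come from the slides, while treating the working set as one input region plus one output region is my reading of the diagram.

```cpp
#include <cstdio>

int main() {
    const unsigned cacheBlock = 64;       // bytes per cache line (slide 15)
    const unsigned blocksPerStrip = 128;  // iterated 128 times (slide 18)
    const unsigned stripBytes = cacheBlock * blocksPerStrip;   // 8 KB per region
    const unsigned l1Bytes = 32 * 1024;   // private L1 per core (slide 14)

    // One 8 KB region of the input plus one 8 KB region of the output.
    const unsigned workingSet = 2 * stripBytes;                // 16 KB

    std::printf("region = %u KB, working set = %u KB, fits in 32 KB L1: %s\n",
                stripBytes / 1024, workingSet / 1024,
                workingSet <= l1Bytes ? "yes" : "no");
    return 0;
}
```

At 4-byte elements, the 128-column spacing also works out to the 512-byte distance shown on slide 17.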

  22. Summary
     • Choosing the optimal number of threads is hard
     • Thread Tailor eases the pain
       – Graph representation of threads and their interactions
       – Combines threads at runtime

  23. Thank you
