
Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications - PowerPoint PPT Presentation


  1. Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications. Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark

  2. Motivation
     • Hardware trends (2009 → 201X): put more cores on a single chip
     • CPU-intensive programs exploit thread-level parallelism
     • Do more threads always win? NO!

  3. Optimal Number of Threads
     • Too many threads – more synchronization, more contention for system resources
     • Too few threads – resource underutilization
     • Who can decide the number? Not the programmer

  4. Why Not the Programmer?
     • Input changes – various working-set sizes
     • The system changes – various available resources
     • Hardware changes – various L2/L3 cache structures, sizes, etc.
     • Therefore the decision must be made at runtime

  5. Proposal
     • [Diagram: the developer compiles and distributes a binary that creates lots of threads (e.g. > 128); at run time Thread Tailor combines them into fewer new threads (e.g. 16).]
     • Combining threads – group several threads into a single thread
       – Threads in the same group are executed serially
       – Executed on the SAME core

  6. Details
     • [Diagram: Development – the binary is instrumented, run to collect profile information, and the profiler builds graphs. Distribution – at run time Thread Tailor collects system info, feeds the graphs into the combining algorithm, and the code generator turns the result and the > 128-thread binary into combined code.]

  7. Graph Construction
     • [Diagram: each thread (e.g. Thread 1, Thread 2) becomes a node annotated with its cycle count (e.g. cycles = 10M) and working-set size (e.g. working-set = 10K); the edge between two threads is weighted by their synchronization cost and communication cost.]
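A minimal sketch of the thread-interaction graph the slide describes, assuming a plain adjacency-list layout; the struct and field names (ThreadNode, ThreadEdge, workingSetBytes, etc.) are illustrative, not the paper's actual data structures.

```cpp
#include <cstdint>
#include <vector>

// One node per dynamic thread observed during profiling.
struct ThreadNode {
    std::uint64_t cycles;          // e.g. 10M cycles of work in this thread
    std::uint64_t workingSetBytes; // e.g. 10K bytes touched by this thread
};

// One edge per pair of threads that interact.
struct ThreadEdge {
    int a, b;                       // indices into the node vector
    std::uint64_t syncCost;         // cycles spent synchronizing with each other
    std::uint64_t communicationCost;// estimated coherence-miss cost (next slide)
};

struct ThreadGraph {
    std::vector<ThreadNode> nodes;
    std::vector<ThreadEdge> edges;
};
```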

  8. Communication Cost
     • Intuition: a STORE instruction causes coherence misses in the other thread's cache
     • Log memory accesses per thread (address → LD count / ST count):
       – Thread 1: 0x00001234 → 5 / 10, 0x00001338 → 4 / 9, 0x00004000 → 7 / 7, …
       – Thread 2: 0x00001234 → 0 / 7, 0x00002000 → 4 / 4, 0x00004000 → 3 / 8, …
     • For each address both threads touch, pair stores against the other thread's accesses:
       – 0x00001234: MIN(5, 7) + MIN(10, 0) + MIN(10, 7) = 12
       – 0x00004000: MIN(7, 8) + MIN(7, 3) + MIN(7, 8) = 17
     • Total communication cost = 12 + 17 = 29, the weight of the edge between threads 1 and 2 in the graph
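A sketch of the per-edge cost computation, assuming the pairing suggested by the slide's example (each shared address contributes MIN(LD1, ST2) + MIN(ST1, LD2) + MIN(ST1, ST2)); the AccessLog layout and function name are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>

// Per-thread access log: address -> (load count, store count).
struct Counts { std::uint64_t ld = 0, st = 0; };
using AccessLog = std::unordered_map<std::uint64_t, Counts>;

// Estimated communication cost between two threads, following the pairing
// used in the slide's example: stores by one thread that collide with loads
// or stores by the other thread cause coherence misses.
std::uint64_t communicationCost(const AccessLog& t1, const AccessLog& t2) {
    std::uint64_t cost = 0;
    for (const auto& [addr, c1] : t1) {
        auto it = t2.find(addr);
        if (it == t2.end()) continue;      // address not shared
        const Counts& c2 = it->second;
        cost += std::min(c1.ld, c2.st)     // T2 stores invalidate T1 loads
              + std::min(c1.st, c2.ld)     // T1 stores invalidate T2 loads
              + std::min(c1.st, c2.st);    // conflicting stores
    }
    return cost;
}

// With the slide's numbers (0x1234: 5/10 vs 0/7, 0x4000: 7/7 vs 3/8),
// this returns 12 + 17 = 29, the edge weight between Thread 1 and Thread 2.
```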

  9. Combining Algorithm
     • Kernighan-Lin (KL) graph partitioning heuristic
       – Goal: minimize execution cycles
       – Precondition: combined threads ≤ cores
     • [Diagram: example graph of eight thread nodes A–H with edge weights of 60 (and one of 10), to be partitioned across 2 cores.]
     • Iterations (partition 1 / partition 2, with estimated cycles and the next node to move):
       – {B, C, D, G} / {A, E, F, H}: 210 / 220 → move A from partition 2
       – {A, B, C, D, G} / {E, F, H}: 130 / 120 → move G from partition 1
       – {A, B, C, D} / {E, F, G, H}: 40 / 40 → (next candidate: D from partition 1)
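A greatly simplified sketch of a KL-style refinement over two partitions (one per core), in the spirit of the table above: estimate each partition's cycles, greedily move the node that most lowers the slower partition, and stop when no move helps. The cost model and all names are assumptions, not the paper's exact algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node { std::uint64_t cycles; };           // per-thread work
struct Edge { int a, b; std::uint64_t cost; };   // sync + communication cost

// Estimated cycles for one partition: its nodes' work plus the cost of
// edges that cross the cut (an illustrative cost model).
std::uint64_t estimate(const std::vector<Node>& nodes, const std::vector<Edge>& edges,
                       const std::vector<int>& part, int which) {
    std::uint64_t cycles = 0;
    for (std::size_t i = 0; i < nodes.size(); ++i)
        if (part[i] == which) cycles += nodes[i].cycles;
    for (const Edge& e : edges)
        if (part[e.a] != part[e.b]) cycles += e.cost;
    return cycles;
}

// Greedy KL-style refinement: keep moving the single node that most
// reduces the slower partition's estimate until no move improves it.
void refine(const std::vector<Node>& nodes, const std::vector<Edge>& edges,
            std::vector<int>& part) {
    for (;;) {
        std::uint64_t best = std::max(estimate(nodes, edges, part, 0),
                                      estimate(nodes, edges, part, 1));
        int bestNode = -1;
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            part[i] ^= 1;                               // tentative move
            std::uint64_t m = std::max(estimate(nodes, edges, part, 0),
                                       estimate(nodes, edges, part, 1));
            if (m < best) { best = m; bestNode = static_cast<int>(i); }
            part[i] ^= 1;                               // undo
        }
        if (bestNode < 0) break;   // no single move improves the estimate
        part[bestNode] ^= 1;       // commit the best move and repeat
    }
}
```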

  10. Thread Combining
     • The application's thread APIs are replaced with wrapper functions; the dynamic compiler translates code into a code cache
     • Wrapper function for thread creation, vm_thread_create():
       – Is this thread a target to combine? No: create a normal thread. Yes: create a user thread
     • User threads in the same group are context-switched by the dynamic compiler, so they execute serially inside one real thread
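A minimal sketch of the wrapper idea, using the slide's vm_thread_create name but otherwise illustrative types and a made-up grouping policy. The real system translates code through a dynamic compiler and context-switches user threads so they can interleave; this sketch simply defers the combined threads' work and runs it back to back in one real thread.

```cpp
#include <functional>
#include <thread>
#include <unordered_map>
#include <vector>

using ThreadFn = std::function<void()>;

// Work deferred for each combine group (no locking shown for brevity).
static std::unordered_map<int, std::vector<ThreadFn>> g_groups;

// Placeholder policy: in the real system this comes from the combining
// algorithm's result; here, odd thread ids go to group 0 (illustrative).
int combineGroupFor(int logicalThreadId) {
    return (logicalThreadId % 2 == 1) ? 0 : -1;   // -1 means "do not combine"
}

// Wrapper substituted for the application's thread-creation API.
// Returns a non-joinable std::thread when the work was deferred instead.
std::thread vm_thread_create(int logicalThreadId, ThreadFn fn) {
    int group = combineGroupFor(logicalThreadId);
    if (group < 0)
        return std::thread(std::move(fn));       // not a target: normal thread
    g_groups[group].push_back(std::move(fn));    // target: defer into its group
    return std::thread();
}

// One real thread per group later runs its members back to back.
void runGroupSerially(int group) {
    for (ThreadFn& fn : g_groups[group]) fn();
}
```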

  11. Experimental Setup
     • 2 cores – Intel Core 2 Duo 6600 (2.4 GHz)
     • 4 cores – Intel Core 2 Quad Q6600 (2.4 GHz)
     • 8 cores – 2 quad-core CPUs, Intel Xeon E5520 (2.26 GHz)
     • 16 cores (logical) – 2 quad-core CPUs with SMT (HyperThreading), Intel Xeon E5520 (2.26 GHz)

  12. Results
     • [Chart: speedup vs. core count (2, 4, 8, 16) for fluidanimate, transpose, blackscholes, twister, water_n^2, and swaptions; the y-axis spans 0.9–1.2, with off-scale bars labeled 1.31, 1.66, 2.36, and 1.83.]

  13. Result Analysis - Transpose
     • Transpose an m x n matrix into an n x m matrix (e.g. [[1 2 3], [4 5 6]] becomes [[1 4], [2 5], [3 6]])
     • Parallel transpose: threads work on strips of the matrices, with Thread 1 and Thread 2 operating 128 columns apart in the input matrix and 128 rows apart in the output matrix
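A minimal sketch of the kind of strip-based parallel transpose the analysis refers to; the 128-column strip width comes from the slide, while the float element type and round-robin strip assignment are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Transpose an m x n row-major matrix into an n x m row-major matrix.
// Each thread takes 128-column strips of the input in round-robin order, so
// neighbouring threads work 128 columns apart in the input and 128 rows
// apart in the output, as in the slide.
void parallelTranspose(const std::vector<float>& in, std::vector<float>& out,
                       std::size_t m, std::size_t n, unsigned numThreads) {
    const std::size_t strip = 128;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t c0 = t * strip; c0 < n; c0 += numThreads * strip)
                for (std::size_t j = c0; j < std::min(c0 + strip, n); ++j)
                    for (std::size_t i = 0; i < m; ++i)
                        out[j * m + i] = in[i * n + j];   // out[j][i] = in[i][j]
        });
    }
    for (std::thread& w : workers) w.join();
}
```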

  14.–21. Result Analysis - Transpose (continued)
     • [Animated diagram, built up across slides 14–21: one core of an Intel Nehalem machine with a private 32K L1, a private 256K L2, and a shared 8M L3, operating on 16K x 16K input and output matrices.]
     • Cache blocks are 64 bytes; the two threads' accesses are 128 columns (512 bytes) apart in the input matrix and 128 rows apart in the output matrix
     • The inner loop touches an 8KB region (128 x 64-byte blocks) of the input and an 8KB region of the output, iterating 128 times over each
     • Writes to the output region hit in the cache (WRITE HIT!), and the working set fits into the L1 cache, so there are no capacity misses
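The cache arithmetic behind slides 14–21, written out as a tiny check; the 64-byte block, 128-block regions, and 32K L1 come from the slides, while treating the working set as one input region plus one output region is my reading of the diagram.

```cpp
#include <cstdio>

int main() {
    const unsigned cacheBlock = 64;       // bytes per cache line (slide 15)
    const unsigned blocksPerStrip = 128;  // iterated 128 times (slide 18)
    const unsigned stripBytes = cacheBlock * blocksPerStrip;   // 8 KB per region
    const unsigned l1Bytes = 32 * 1024;   // private L1 per core (slide 14)

    // One 8 KB region of the input plus one 8 KB region of the output.
    const unsigned workingSet = 2 * stripBytes;                // 16 KB

    std::printf("region = %u KB, working set = %u KB, fits in 32 KB L1: %s\n",
                stripBytes / 1024, workingSet / 1024,
                workingSet <= l1Bytes ? "yes" : "no");
    return 0;
}
```

At 4-byte elements, the 128-column spacing also works out to the 512-byte distance shown on slide 17.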

  22. Summary
     • Choosing the optimal number of threads is hard
     • Thread Tailor eases the pain
       – Graph representation of threads and their interactions
       – Combines threads at runtime

  23. Thank you
