The Chapel Tasking Layer Over Qthreads
Kyle B. Wheeler, Richard C. Murphy, Dylan Stark, and Bradford L. Chamberlain
Wednesday, May 18, 2011
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

The Structure of Chapel's Runtime
• Chapel Runtime Support Libraries (written in C)
  – Tasks
  – Threads
  – Communication
  – Launchers
  – Memory
  – Timers
  – Standard

Chapel's Tasking Layer
• Role: Responsible for parallelism/synchronization
• Main Focus:
  – support begin/cobegin/coforall statements
  – support synchronization variables
• Main Features:
  – Startup/Teardown
  – Singleton Tasks
  – Task Lists
  – Synchronization
  – Control
  – Queries
  – ...serialization?

The FIFO Tasking Implementation
• Work-queue model
  – Function calls for work execution
  – Centralized queue
• Pros:
  – Simple, easy to debug
  – Very portable
  – Uses native state management
    • stacks
    • thread/task-specific data
• Cons:
  – Task synchronization (sync) using thread synchronization (pthread_mutex_t)
    • Compute/synch overlap requires oversubscribing (#threads > #cpus)
    • Difficult to provide non-native (non-mutex) synchronization behavior
  – #Task-to-#thread mismatch creates unexpected deadlock potential
  – Does not support work stealing
  – Does not support CPU pinning

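The work-queue model above can be sketched in a few lines. This is an illustration in Python, not the Chapel runtime's C code; the function name `run_fifo` and its parameters are hypothetical. It shows the two properties the slide names: one centralized queue, and tasks executed as plain function calls on a fixed pool of OS threads.

```python
import queue
import threading

def run_fifo(tasks, num_threads=4):
    """Run callables from one shared FIFO queue on a fixed pool of threads."""
    work = queue.Queue()              # the single, centralized queue
    results = [None] * len(tasks)
    for i, task in enumerate(tasks):
        work.put((i, task))

    def worker():
        while True:
            try:
                i, task = work.get_nowait()
            except queue.Empty:
                return                # queue drained: this thread retires
            results[i] = task()       # a task runs as an ordinary function call

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

squares = run_fifo([lambda i=i: i * i for i in range(10)], num_threads=3)
```

Because a task occupies its thread until the call returns, a task that blocks on work still sitting in the queue can stall the pool once every thread is blocked, which is the task-to-thread mismatch listed under Cons.
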
Challenges in Highly-Threaded Runtimes
• Per-thread state
  – State vs threads
• Locality
  – An afterthought in standard threading models
  – Communication and synchronization are expensive, easy to use accidentally
• Synchronization
  – Hard to make portable, maintain guarantees
• Every Machine is Different
  – Granularity of sharing (cacheline size)
  – Optimal number of threads (PU count)
  – Communication topology
  – Cache structure
  – Memory model
  – Synchronization Primitives (CMPXCHG vs TNS vs CASXA vs LDARX/STWCX)

Qthreads Highlights
• Lightweight User-level Threading (Tasking)
• Platform portability
  – IA32/64, AMD64, PPC32/64, SparcV9, SST, Tilera
  – Linux, BSD, Solaris, Mac OS X
• Locality awareness
  – "Shepherd" as thread mobility domain & locality
• Fine-grained synchronization semantics
  – Full/Empty Bits (64-bit & 60-bit)
  – Mutexes
  – Atomic operations (Incr & CAS)
• Locality-aware Work-stealing Model

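Full/empty-bit (FEB) semantics can be illustrated with a small sketch. This is a Python model of the semantics only, built on a condition variable; it is not the Qthreads implementation. The class name `FEB` is hypothetical, and the method names are chosen to mirror Chapel's sync-variable methods (readFE, writeEF, readFF).

```python
import threading

class FEB:
    """A value paired with a full/empty bit."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_ef(self, value):
        """Wait until empty, store the value, leave the word full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until full, take the value, leave the word empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

    def read_ff(self):
        """Wait until full, read the value, leave the word full."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value
```

A producer calling `write_ef` and a consumer calling `read_fe` on the same word hand values over one at a time, with each side blocking until the other has acted.
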
Chapel Single Locale Challenges
• Startup & Teardown
  – Functions with unspecified scope
  – Synchronization primitives of unspecified scope
• Unsupported Behavior
  – Limit on OS Threads
    • Default defined by hardware
  – Forced serialization of tasks
  – Task-local data

Chapel Multi-Locale Challenges
• Communication (via GASNet)
  – Blocking system calls
    • Dedicated OS thread
    • Possibility for proxying internally
    • Temporary solution: Forked initialization thread
    • Future solution: explicit progress thread creation
  – External Task Operations
    • Task creation from outside the task library
      – Memory management issue
      – Also: synchronization issue …
    • Task synchronization outside the task library
      – Proxy-task using thread-level synchronization (pthread_mutex_t)

Future Work
• Synchronization
  – Tasking interface assumes only mutex semantics
  – MTA/Qthreads interfaces provide fast FEB semantics
  – Implementing FEB semantics with a mutex that is itself implemented with FEB operations is silly and slow
• Stack Space
  – Problem common to all tasking interfaces
  – Currently requires guess-and-check
  – Potential directions:
    • Technically possible to calculate stack requirements (e.g. gcc 4.6)
    • Technically possible to move stack variables to heap
      – Moves the memory management problem

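The layering called silly and slow above can be made concrete with a sketch. Everything here is hypothetical and written in Python for illustration: `FEBWord` stands in for a native FEB primitive, `FEBMutex` is a lock built from it, and `MutexSyncVar` is a sync variable written against a lock/unlock-only interface. The real tasking layer is C and may be structured differently.

```python
import threading
import time

class FEBWord:
    """Minimal full/empty word, standing in for a native FEB primitive."""
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_ef(self, value):
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value, self._full = value, True
            self._cond.notify_all()

    def read_fe(self):
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

class FEBMutex:
    """A mutex built from one FEB word: full means unlocked."""
    def __init__(self):
        self._word = FEBWord()
        self._word.write_ef(True)

    def lock(self):
        self._word.read_fe()        # empties the word; later lockers block

    def unlock(self):
        self._word.write_ef(True)   # refills the word; one locker proceeds

class MutexSyncVar:
    """Sync variable that only has lock/unlock to work with."""
    def __init__(self):
        self._mutex = FEBMutex()
        self._full = False
        self._value = None

    def write_ef(self, value):
        while True:
            self._mutex.lock()
            if not self._full:
                self._value, self._full = value, True
                self._mutex.unlock()
                return
            self._mutex.unlock()
            time.sleep(0)           # yield, then poll the state again

    def read_fe(self):
        while True:
            self._mutex.lock()
            if self._full:
                value, self._full = self._value, False
                self._mutex.unlock()
                return value
            self._mutex.unlock()
            time.sleep(0)           # yield, then poll the state again
```

In this sketch each `MutexSyncVar` operation costs at least two FEB operations plus polling, while calling `FEBWord.write_ef` or `FEBWord.read_fe` directly would give the same semantics in one blocking operation.
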
Performance: Raw Tasking
• QuickSort
  – Naïve implementation (serial partitioning)
  – Uses recursive cobegin
  – Serialization threshold
    • For best comparison, set high to avoid serialization
[Figure: execution time (secs, log scale) vs. array elements (power of 2, 14 to 28) for Qthreads and FIFO; second panel: ratio FIFO/Qt over the same range]

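The shape of this benchmark can be sketched as follows. This is a Python illustration, not the Chapel source used for the measurements. Two OS threads stand in for the two branches of a cobegin, and the serialization threshold is assumed to be a recursion depth (parameter name `thresh` is hypothetical), which matches the slide's note that a high setting avoids serialization.

```python
import threading

def partition(a, lo, hi):
    """Serial Lomuto partition of a[lo..hi]; returns the pivot's final index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort(a, lo=0, hi=None, depth=0, thresh=3):
    """Sort a[lo..hi] in place, spawning tasks while depth < thresh."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = partition(a, lo, hi)
    if depth >= thresh:
        # Past the threshold: recurse serially in the current task.
        quicksort(a, lo, p - 1, depth + 1, thresh)
        quicksort(a, p + 1, hi, depth + 1, thresh)
    else:
        # cobegin: run both halves concurrently and wait for both.
        halves = [
            threading.Thread(target=quicksort, args=(a, lo, p - 1, depth + 1, thresh)),
            threading.Thread(target=quicksort, args=(a, p + 1, hi, depth + 1, thresh)),
        ]
        for t in halves:
            t.start()
        for t in halves:
            t.join()
```

Raising `thresh` creates more, smaller tasks, so the run measures task creation and scheduling rather than the sort itself.
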
Performance: Raw Tasking
• Tree Exploration
  – Constructs binary tree
  – Assigns Unique ID
  – Computes sum of IDs
  – Uses recursive cobegin
[Figure: execution time (secs, log scale) vs. tree elements (power of 2, 12 to 28) for Qthreads and FIFO; second panel: ratio FIFO/Qt over the same range]

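A minimal sketch of the tree-exploration pattern, in Python for illustration. The helper `cobegin`, the `par_levels` cutoff, and the heap-style ID numbering (children of node i are 2i+1 and 2i+2) are assumptions made here so the IDs are unique without shared state; the benchmark's actual ID scheme is not given on the slide.

```python
import threading

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.left = None
        self.right = None

def cobegin(*thunks):
    """Run the thunks concurrently, wait for all, return results in order."""
    results = [None] * len(thunks)

    def run(i, f):
        results[i] = f()

    threads = [threading.Thread(target=run, args=(i, f)) for i, f in enumerate(thunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def build(node_id, levels, par_levels):
    """Build a full binary tree with the given number of levels."""
    node = Node(node_id)
    if levels > 1:
        kids = (lambda: build(2 * node_id + 1, levels - 1, par_levels - 1),
                lambda: build(2 * node_id + 2, levels - 1, par_levels - 1))
        if par_levels > 0:
            node.left, node.right = cobegin(*kids)
        else:
            node.left, node.right = kids[0](), kids[1]()
    return node

def sum_ids(node, par_levels):
    """Sum the IDs in the subtree, using cobegin for the top par_levels levels."""
    if node is None:
        return 0
    kids = (lambda: sum_ids(node.left, par_levels - 1),
            lambda: sum_ids(node.right, par_levels - 1))
    parts = cobegin(*kids) if par_levels > 0 else [k() for k in kids]
    return node.id + sum(parts)
```

With this numbering a full tree of n nodes holds the IDs 0 through n-1, so the sum can be checked against n(n-1)/2.
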
Performance: Data Parallel
• HPCC RandomAccess
  – GUPS (random integer updates)
  – Stresses Memory System
  – Uses forall
[Figure: execution time (secs, log scale) vs. number of tasks (1 to 128) for Qthreads and FIFO; second panel: ratio FIFO/Qt over the same range]

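The update kernel behind RandomAccess can be sketched serially. This Python version follows the standard HPCC definition (a 64-bit shift-register sequence with polynomial 0x7, and an XOR update into a power-of-two table); the function names and the `seed` parameter are hypothetical. The Chapel version measured here distributes the updates with a forall, which this sketch does not attempt.

```python
POLY = 0x7
MASK64 = (1 << 64) - 1

def next_random(ran):
    """One step of the HPCC RandomAccess 64-bit shift-register sequence."""
    high_bit = ran >> 63
    return ((ran << 1) & MASK64) ^ (POLY if high_bit else 0)

def random_access(table, num_updates, seed=1):
    """Apply num_updates XOR updates to table, whose length must be a power of two."""
    size = len(table)
    ran = seed
    for _ in range(num_updates):
        ran = next_random(ran)
        table[ran & (size - 1)] ^= ran   # index comes from the low bits of ran

table = list(range(1 << 10))
random_access(table, 4 * len(table))     # scatter updates across the table
random_access(table, 4 * len(table))     # same stream again undoes every update
```

Because XOR is its own inverse, replaying the same update stream restores the original table, which is how the benchmark is commonly verified.
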
Performance: Data Parallel
• HPCC STREAM (-EP)
  – Memory Bandwidth & Vector Kernels
  – EP version avoids communication
  – Uses forall
  – Synchronization surprisingly important
[Figure: execution time (secs, 0 to 1) vs. number of tasks (1 to 128) for Qthreads, FIFO, Qthreads EP, and FIFO EP, covering STREAM and STREAM-EP; second panel: ratio FIFO/Qt over the same range]

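The STREAM triad kernel, a = b + alpha * c, can be sketched as a chunked data-parallel loop. This is a Python illustration of how a forall over the index range might be divided among a fixed number of tasks; the function name `triad` and the even chunking are assumptions, not the Chapel runtime's actual iteration strategy.

```python
import threading

def triad(a, b, c, alpha, num_tasks=4):
    """Compute a[i] = b[i] + alpha * c[i], splitting the range across tasks."""
    n = len(a)

    def work(lo, hi):
        for i in range(lo, hi):
            a[i] = b[i] + alpha * c[i]

    # Contiguous, near-equal chunks: task t handles bounds[t]..bounds[t+1]-1.
    bounds = [n * t // num_tasks for t in range(num_tasks + 1)]
    threads = [threading.Thread(target=work, args=(bounds[t], bounds[t + 1]))
               for t in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # the only synchronization point in this sketch

a = [0.0] * 1000
triad(a, [1.0] * 1000, [2.0] * 1000, 3.0, num_tasks=8)
```

The per-element work is tiny, so in a kernel of this shape the cost of starting the tasks and waiting at the final join is a visible share of the total time.
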
Thank You! Questions?