The Chapel Tasking Layer Over Qthreads



  1. The Chapel Tasking Layer Over Qthreads
     Kyle B. Wheeler, Richard C. Murphy, Dylan Stark, and Bradford L. Chamberlain
     Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
     Wednesday, May 18, 2011

  2. The Structure of Chapel’s Runtime
     [Diagram: Chapel Runtime Support Libraries (written in C), with components Tasks, Communication, Launchers, Standard, Memory, Timers, Threads]

  3. Chapel’s Tasking Layer
     • Role: Responsible for parallelism/synchronization
     • Main Focus:
       – Support begin/cobegin/coforall statements
       – Support synchronization variables
     • Main Features:
       – Startup/Teardown
       – Singleton Tasks
       – Task Lists
       – Synchronization
       – Control
       – Queries
       – ...serialization?

  4. The FIFO Tasking Implementation
     • Work-queue model
       – Function calls for work execution
       – Centralized queue
     • Pros:
       – Simple, easy to debug
       – Very portable
       – Uses native state management
         • stacks
         • thread/task-specific data
     • Cons:
       – Task synchronization (sync) using thread synchronization (pthread_mutex_t)
         • Compute/synch overlap requires oversubscribing (#threads > #cpus)
         • Difficult to provide non-native (non-mutex) synchronization behavior
       – #Task-to-#thread mismatch creates unexpected deadlock potential
       – Does not support work stealing
       – Does not support CPU pinning
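The work-queue model on this slide can be pictured with a small C sketch. It is not the Chapel runtime's code: the names (`fifo_t`, `fifo_push`, `fifo_worker`, `fifo_demo`) are hypothetical, and the only assumption taken from the slide is that tasks are plain function calls pulled from one centralized queue guarded by a `pthread_mutex_t`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* A task is a function call plus its argument. */
typedef struct task {
    void (*fn)(void *);
    void *arg;
    struct task *next;
} task_t;

/* One centralized queue shared by every worker thread (hypothetical type). */
typedef struct {
    task_t *head, *tail;
    int closed;                 /* set once no more tasks will be pushed */
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} fifo_t;

static void fifo_init(fifo_t *q) {
    q->head = q->tail = NULL;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

static void fifo_push(fifo_t *q, void (*fn)(void *), void *arg) {
    task_t *t = malloc(sizeof *t);
    if (!t) abort();
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

static void fifo_close(fifo_t *q) {
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Worker loop: pop the oldest task and run it as a plain function call. */
static void *fifo_worker(void *qp) {
    fifo_t *q = qp;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (!q->head && !q->closed)
            pthread_cond_wait(&q->nonempty, &q->lock);
        task_t *t = q->head;
        if (!t) { pthread_mutex_unlock(&q->lock); return NULL; }
        q->head = t->next;
        if (!q->head) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        t->fn(t->arg);          /* the task runs on this OS thread */
        free(t);
    }
}

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static long demo_total;

static void demo_add(void *arg) {
    pthread_mutex_lock(&demo_lock);
    demo_total += *(long *)arg;
    pthread_mutex_unlock(&demo_lock);
}

/* Run ntasks increments on nthreads workers and return the final count. */
long fifo_demo(int nthreads, int ntasks) {
    fifo_t q;
    pthread_t tid[16];
    long one = 1;
    assert(nthreads >= 1 && nthreads <= 16);
    demo_total = 0;
    fifo_init(&q);
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tid[i], NULL, fifo_worker, &q) != 0) abort();
    for (int i = 0; i < ntasks; i++) fifo_push(&q, demo_add, &one);
    fifo_close(&q);             /* workers drain the queue, then exit */
    for (int i = 0; i < nthreads; i++) pthread_join(tid[i], NULL);
    return demo_total;
}
```

Because a task occupies its OS thread until the function returns, a task that blocks on a mutex blocks the whole thread, which is consistent with the oversubscription and deadlock concerns listed under Cons.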

  5. Challenges in Highly-Threaded Runtimes
     • Per-thread state
       – State vs threads
     • Locality
       – An afterthought in standard threading models
       – Communication and synchronization are expensive, easy to use accidentally
     • Synchronization
       – Hard to make portable, maintain guarantees
     • Every Machine is Different
       – Granularity of sharing (cacheline size)
       – Optimal number of threads (PU count)
       – Communication topology
       – Cache structure
       – Memory model
       – Synchronization Primitives (CMPXCHG vs TNS vs CASXA vs LDARX/STWCX)
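The last bullet lists instruction-level primitives that differ by architecture (for example, CMPXCHG on x86 and the LDARX/STWCX pair on PowerPC). One way to write against a single interface is C11's `<stdatomic.h>`, as in the sketch below. Using C11 atomics is an assumption made for illustration; the talk does not say how the runtime wraps these instructions, and `cas_fetch_add` is a hypothetical helper name.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add delta to *p with a compare-and-swap retry loop and
 * return the value that was stored before the add. */
uint64_t cas_fetch_add(_Atomic uint64_t *p, uint64_t delta) {
    uint64_t old = atomic_load(p);
    /* On failure, atomic_compare_exchange_weak writes the current value
     * of *p into `old`, so the next attempt uses fresh data. */
    while (!atomic_compare_exchange_weak(p, &old, old + delta)) {
        /* retry */
    }
    return old;
}
```

The compiler lowers the same loop to whichever primitive the target provides, which is the portability property the slide is concerned with.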

  6. Qthreads Highlights
     • Lightweight User-level Threading (Tasking)
     • Platform portability
       – IA32/64, AMD64, PPC32/64, SparcV9, SST, Tilera
       – Linux, BSD, Solaris, MacOSX
     • Locality awareness
       – “Shepherd” as thread mobility domain & locality
     • Fine-grained synchronization semantics
       – Full/Empty Bits (64-bit & 60-bit)
       – Mutexes
       – Atomic operations (Incr & CAS)
     • Locality-aware Workstealing Model
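Full/empty-bit (FEB) semantics may be unfamiliar, so here is a small emulation in C. It is only a sketch of the semantics: it is built on a pthread mutex and condition variable, it blocks OS threads, and none of the names (`feb_t`, `feb_write_ef`, `feb_read_fe`, `feb_read_ff`) are the Qthreads API.

```c
#include <pthread.h>
#include <stdint.h>

/* A word paired with an emulated full/empty bit (hypothetical type). */
typedef struct {
    uint64_t value;
    int full;
    pthread_mutex_t lock;
    pthread_cond_t changed;
} feb_t;

void feb_init_empty(feb_t *f) {
    f->value = 0;
    f->full = 0;
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->changed, NULL);
}

/* Wait until the word is empty, store v, and mark it full. */
void feb_write_ef(feb_t *f, uint64_t v) {
    pthread_mutex_lock(&f->lock);
    while (f->full) pthread_cond_wait(&f->changed, &f->lock);
    f->value = v;
    f->full = 1;
    pthread_cond_broadcast(&f->changed);
    pthread_mutex_unlock(&f->lock);
}

/* Wait until the word is full, read it, and mark it empty. */
uint64_t feb_read_fe(feb_t *f) {
    pthread_mutex_lock(&f->lock);
    while (!f->full) pthread_cond_wait(&f->changed, &f->lock);
    uint64_t v = f->value;
    f->full = 0;
    pthread_cond_broadcast(&f->changed);
    pthread_mutex_unlock(&f->lock);
    return v;
}

/* Wait until the word is full and read it, leaving it full. */
uint64_t feb_read_ff(feb_t *f) {
    pthread_mutex_lock(&f->lock);
    while (!f->full) pthread_cond_wait(&f->changed, &f->lock);
    uint64_t v = f->value;
    pthread_mutex_unlock(&f->lock);
    return v;
}

typedef struct { feb_t *f; uint64_t n; } producer_arg_t;

static void *producer(void *p) {
    producer_arg_t *a = p;
    for (uint64_t i = 1; i <= a->n; i++) feb_write_ef(a->f, i);
    return NULL;
}

/* One producer hands 1..n to one consumer through a single FEB word. */
uint64_t feb_demo(uint64_t n) {
    feb_t f;
    producer_arg_t arg = { &f, n };
    pthread_t tid;
    uint64_t sum = 0;
    feb_init_empty(&f);
    if (pthread_create(&tid, NULL, producer, &arg) != 0) return 0;
    for (uint64_t i = 0; i < n; i++) sum += feb_read_fe(&f);
    pthread_join(tid, NULL);
    return sum;
}
```

The write-when-empty and read-when-full pairing gives producer/consumer hand-off on a single word without a separate lock visible to the caller, which is the behavior Chapel's synchronization variables rely on.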

  7. Chapel Single Locale Challenges
     • Startup & Teardown
       – Functions with unspecified scope
       – Synchronization primitives of unspecified scope
     • Unsupported Behavior
       – Limit on OS Threads
         • Default defined by hardware
       – Forced serialization of tasks
       – Task-local data

  8. Chapel Multi-Locale Challenges
     • Communication (via GASNet)
       – Blocking system calls
         • Dedicated OS thread
         • Possibility for proxying internally
         • Temporary solution: Forked initialization thread
         • Future solution: explicit progress thread creation
       – External Task Operations
         • Task creation from outside the task library
           – Memory management issue
           – Also: synchronization issue …
         • Task synchronization outside the task library
           – Proxy-task using thread-level synchronization (pthread_mutex_t)

  9. Future Work
     • Synchronization
       – Tasking interface assumes only mutex semantics
       – MTA/Qthreads interfaces provide fast FEB semantics
       – Implementing FEB semantics with a mutex implemented with FEB operations is silly and slow
     • Stack Space
       – Problem common to all tasking interfaces
       – Currently requires guess-and-check
       – Potential directions:
         • Technically possible to calculate stack requirements (e.g. gcc 4.6)
         • Technically possible to move stack variables to heap
           – Moves the memory management problem

  10. Performance: Raw Tasking
     • QuickSort
       – Naïve implementation (serial partitioning)
       – Uses recursive cobegin
       – Serialization threshold
         • For best comparison, set high to avoid serialization
     [Plot: Execution Time (secs), log scale, vs. Array Elements (power of 2), 14 to 28, for Qthreads and FIFO; second panel: Ratio FIFO/Qt]
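The benchmark itself is a Chapel program that is not shown in the slides. The C sketch below only mirrors its described shape: serial partitioning, two concurrent recursive calls (a pthread stands in for one branch of a `cobegin`), and a threshold below which recursion is serial. The function names and the reading of the threshold as a subarray length are assumptions.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct { int *a; size_t n; size_t threshold; } qs_job_t;

static void qs_run(int *a, size_t n, size_t threshold);

static void *qs_thread(void *p) {
    qs_job_t *j = p;
    qs_run(j->a, j->n, j->threshold);
    return NULL;
}

/* Serial Lomuto partition around the last element; returns the pivot's
 * final index. */
static size_t qs_partition(int *a, size_t n) {
    int pivot = a[n - 1];
    size_t store = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        if (a[i] < pivot) {
            int t = a[i]; a[i] = a[store]; a[store] = t;
            store++;
        }
    }
    int t = a[store]; a[store] = a[n - 1]; a[n - 1] = t;
    return store;
}

static void qs_run(int *a, size_t n, size_t threshold) {
    if (n < 2) return;
    size_t p = qs_partition(a, n);
    if (n > threshold) {
        /* Large subarray: sort the left half in a new thread and the
         * right half here, then wait, like a two-branch cobegin. */
        qs_job_t left = { a, p, threshold };
        pthread_t tid;
        if (pthread_create(&tid, NULL, qs_thread, &left) == 0) {
            qs_run(a + p + 1, n - p - 1, threshold);
            pthread_join(tid, NULL);
            return;
        }
        /* Thread creation failed: fall through to the serial path. */
    }
    qs_run(a, p, threshold);
    qs_run(a + p + 1, n - p - 1, threshold);
}

/* Sort a[0..n) in ascending order; subarrays of at most `threshold`
 * elements are sorted without creating new threads. */
void quicksort_cobegin(int *a, size_t n, size_t threshold) {
    qs_run(a, n, threshold);
}
```

With OS threads, every level above the threshold costs a thread creation and join, so task-creation overhead dominates as the threshold shrinks; that overhead is what a raw tasking comparison exercises.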

  11. Performance: Raw Tasking
     • Tree Exploration
       – Constructs binary tree
       – Assigns Unique ID
       – Computes sum of IDs
       – Uses recursive cobegin
     [Plot: Execution Time (secs), log scale, vs. Tree Elements (power of 2), 12 to 28, for Qthreads and FIFO; second panel: Ratio FIFO/Qt]
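As a rough picture of what the tree benchmark does, the C sketch below builds a complete binary tree, gives each node a unique ID, and sums the IDs. Several details are assumptions rather than facts from the talk: IDs use heap numbering (root 1, children 2i and 2i+1), only the top `par_levels` levels build their subtrees concurrently, and the sum is computed serially.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct node {
    uint64_t id;
    struct node *left, *right;
} node_t;

typedef struct {
    uint64_t id;
    int levels, par_levels;
    node_t *root;               /* result written by the builder */
} build_job_t;

static node_t *build(uint64_t id, int levels, int par_levels);

static void *build_thread(void *p) {
    build_job_t *j = p;
    j->root = build(j->id, j->levels, j->par_levels);
    return NULL;
}

/* Build a complete binary tree with `levels` levels rooted at `id`.
 * While par_levels > 0, the left subtree is built in its own thread. */
static node_t *build(uint64_t id, int levels, int par_levels) {
    if (levels == 0) return NULL;
    node_t *n = malloc(sizeof *n);
    if (!n) abort();
    n->id = id;
    build_job_t left = { 2 * id, levels - 1, par_levels - 1, NULL };
    pthread_t tid;
    int spawned = par_levels > 0 && levels > 1 &&
                  pthread_create(&tid, NULL, build_thread, &left) == 0;
    n->right = build(2 * id + 1, levels - 1, par_levels - 1);
    if (spawned)
        pthread_join(tid, NULL);
    else
        left.root = build(left.id, left.levels, left.par_levels);
    n->left = left.root;
    return n;
}

static uint64_t sum_ids(const node_t *n) {
    return n ? n->id + sum_ids(n->left) + sum_ids(n->right) : 0;
}

static void free_tree(node_t *n) {
    if (!n) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}

/* Build the tree, sum every node ID, free the tree, return the sum. */
uint64_t tree_demo(int levels, int par_levels) {
    node_t *root = build(1, levels, par_levels);
    uint64_t s = sum_ids(root);
    free_tree(root);
    return s;
}
```

With heap numbering, a tree of k levels holds IDs 1 through 2^k - 1, so the expected sum is n(n+1)/2 for n = 2^k - 1, which gives a cheap correctness check.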

  12. Performance: Data Parallel
     • HPCC RandomAccess
       – GUPS (random integer updates)
       – Stresses Memory System
       – Uses forall
     [Plot: Execution Time (secs), log scale, vs. Number of Tasks, 1 to 128, for Qthreads and FIFO; second panel: Ratio FIFO/Qt]
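For readers who have not seen RandomAccess, the update kernel can be sketched serially in C. The shift-and-XOR generator with polynomial 0x7 is modelled on the HPCC reference code; the seed of 1, the function names, and the serial loop (the benchmarked Chapel version uses `forall`) are simplifications assumed here.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY 0x7ULL

/* Advance the pseudo-random stream by one step: shift left and fold the
 * polynomial back in whenever the top bit was set. */
static uint64_t next_ran(uint64_t ran) {
    return (ran << 1) ^ ((ran >> 63) ? POLY : 0);
}

/* Apply nupdates XOR updates to a table whose size is a power of two.
 * Each update touches an effectively random slot, which is what
 * stresses the memory system. */
void gups_update(uint64_t *table, size_t size, uint64_t nupdates) {
    uint64_t ran = 1;           /* assumed seed for this sketch */
    for (uint64_t i = 0; i < nupdates; i++) {
        ran = next_ran(ran);
        table[ran & (size - 1)] ^= ran;
    }
}
```

Because every update is an XOR, running the same update stream a second time restores the original table, which is a convenient way to verify the kernel.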

  13. Performance: Data Parallel
     • HPCC STREAM (-EP)
       – Memory Bandwidth & Vector Kernels
       – EP version avoids communication
       – Uses forall
       – Synchronization surprisingly important
     [Plot: Execution Time (secs) vs. Number of Tasks, 1 to 128, for Qthreads, FIFO, Qthreads EP, and FIFO EP (STREAM and STREAM-EP); second panel: Ratio FIFO/Qt]
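The slide does not say which vector kernel was timed. As an illustration, the C sketch below uses Triad, one of the standard STREAM kernels, and splits the index range into contiguous chunks, one per task, roughly the way a `forall` over a block of indices might be divided. The pthread-per-task mapping, `MAX_TASKS`, and the function names are assumptions.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_TASKS 128

typedef struct {
    double *a;
    const double *b, *c;
    double alpha;
    size_t lo, hi;              /* half-open index range [lo, hi) */
} triad_job_t;

/* Triad on one chunk: a[i] = b[i] + alpha * c[i]. */
static void *triad_chunk(void *p) {
    triad_job_t *j = p;
    for (size_t i = j->lo; i < j->hi; i++)
        j->a[i] = j->b[i] + j->alpha * j->c[i];
    return NULL;
}

/* Run Triad over n elements using up to ntasks threads. */
void triad(double *a, const double *b, const double *c,
           double alpha, size_t n, int ntasks) {
    pthread_t tid[MAX_TASKS];
    triad_job_t job[MAX_TASKS];
    int started[MAX_TASKS];
    if (ntasks < 1) ntasks = 1;
    if (ntasks > MAX_TASKS) ntasks = MAX_TASKS;
    for (int t = 0; t < ntasks; t++) {
        size_t lo = n * (size_t)t / (size_t)ntasks;
        size_t hi = n * (size_t)(t + 1) / (size_t)ntasks;
        job[t] = (triad_job_t){ a, b, c, alpha, lo, hi };
        started[t] = pthread_create(&tid[t], NULL, triad_chunk, &job[t]) == 0;
        if (!started[t]) triad_chunk(&job[t]);   /* fall back to serial */
    }
    /* The join is the only synchronization point in this sketch. */
    for (int t = 0; t < ntasks; t++)
        if (started[t]) pthread_join(tid[t], NULL);
}
```

The kernel does very little arithmetic per element, so the cost of starting and joining tasks is a visible fraction of the run time, which fits the slide's remark that synchronization matters more than expected.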

  14. Thank You! Questions?
