The Chapel Tasking Layer Over Qthreads

Kyle B. Wheeler, Sandia National Laboratories*
Richard C. Murphy, Sandia National Laboratories
Dylan Stark, Sandia National Laboratories
Bradford L. Chamberlain, Cray Inc.†

* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
† This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0001.

ABSTRACT: This paper describes the applicability of the third-party qthread lightweight threading library for implementing the tasking layer for Chapel applications on conventional multisocket multicore computing platforms. A collection of Chapel benchmark codes was used to demonstrate the correctness of the qthread implementation and the performance gain provided by using an optimized threading/tasking layer. The experience of porting Chapel to use qthreads also provides insights into additional requirements imposed by a lightweight user-level threading library, some of which have already been integrated into Chapel, and others that are posed here as open issues for future work. The initial performance results indicate an immediate performance benefit from using qthreads over the native multithreading support in Chapel. Both task and data parallel applications benefit from lower overheads in thread management. Future work on improved synchronization semantics is likely to further increase the efficiency of the qthreads implementation.

KEYWORDS: Chapel, lightweight, threading, tasks

1. Introduction

It is increasingly recognized that, in order to obtain power and performance scalability, future hardware architectures will provide large amounts of parallelism. Taking full advantage of this parallelism requires an ability to specify the parallelism at multiple levels within a program. However, parallel programming is also widely recognized to be a difficult problem, and the set of programmers who can effectively leverage parallelism is a small fraction of those who are effective sequential programmers. Addressing the expressibility and programmability challenges is a problem of wide interest.

Chapel is a new parallel programming language being developed by Cray Inc. as part of DARPA's High Productivity Computing System program (HPCS). One of its main motivating themes is support for general parallel programming: data parallelism, task parallelism, concurrent programming, and arbitrary nestings of these styles. It also adopts a multiresolution language design in which higher-level features like arrays and data parallel loops are implemented in terms of lower-level features like classes and task parallelism. To this end, having a good implementation of Chapel's task parallel concepts is crucial, since all parallelism is built in terms of them.

Task parallelism, in this case, refers not to the task/data parallelism distinction, but to the idea of a user-level threading concept, wherein tasks that can be executed in parallel are relatively short-lived and are created and destroyed rapidly. To maximize performance, applications must not only find parallel work, but must also match the amount of parallel work expressed to the available hardware. This latter task, however, is one best carried out by a runtime rather than the application itself.

Qthreads is a new lightweight threading, or tasking, library being developed by Sandia National Laboratories. The Qthreads runtime is designed to support dynamic programming and performance features not typically seen in either OpenMP or MPI systems. Parallel work is specified and the Qthreads runtime maps the work onto available hardware resources.

By comparing Qthreads' dynamic mapping of tasks to hardware against the default "FIFO" scheduling mechanism of the Chapel runtime, an accurate picture of the benefits of the Qthreads model can be obtained. In task parallelism situations, where cobegin is used, Qthreads can outperform the FIFO tasking layer by as much as 45%. In data parallelism situations, where forall and coforall are used, Qthreads can outperform the FIFO tasking layer by as much as 30%. Further work is planned to improve synchronization performance and eliminate additional bottlenecks.

Qthreads is described in more detail in Section 2. It is followed by a discussion of the Chapel tasking layer in Section 3. A discussion of the difficulties in mapping the Chapel tasking layer to the Qthreads API on single-node systems is in Section 4 and on multi-node systems is in Section 5. The results of our performance experiments are in Section 7.

2. Qthreads

Qthreads [4] is a cross-platform general purpose parallel runtime designed to support lightweight threading and synchronization within a flexible integrated locality framework. Qthreads directly supports programming with lightweight threads and a variety of synchronization methods, including both non-blocking atomic operations and potentially blocking full/empty bit (FEB) operations.

The Qthreads lightweight threading concept is intended to match future hardware threading environments more closely than existing concepts in three crucial aspects: anonymity, introspectable limited resources, and inherent localization. Unlike heavyweight threads, these threads do not support expensive features like per-thread identifiers, per-thread signal vectors, or preemptive multitasking. The thread scheduler in Qthreads presumes a cooperative-multitasking approach, which provides the flexibility to run threads in locations most convenient to the scheduler and the code. There are two scheduling regimes within Qthreads: the single-threaded location mode, which does not use work-stealing, and the multi-threaded hierarchical location mode, which uses a shared work-queue between multiple workers in a single location and work-stealing between locations.

Blocking synchronization, such as when performing a FEB operation, triggers a user-space context switch. This context switch is done via function calls without trapping into the kernel, and therefore does not require saving as much state as a preemptive context switch, which must save items such as signal masks and the full set of registers. This technique allows threads to process largely uninterrupted until data is needed that is not yet available, and allows the scheduler to attempt to hide communication latency by switching tasks when data is needed. Logically, this only hides communication latencies that take longer than a context switch.

3. Chapel Tasking Layer

Like many implementations of higher-level languages, the Chapel [2] compiler is implemented by compiling Chapel source code down to standard C. This permits the Chapel compiler to focus on high-level transformations and optimizations while leaving platform-specific targeting and optimizations to the native C compiler on each platform. Most of the lower-level code required to execute Chapel is implemented using Chapel's runtime libraries, which are also implemented in C and then linked to the generated code.

The Chapel runtime libraries are organized as a number of sub-interfaces, each of which implements a specific subset of functionality such as communication, task management, memory management, or timing routines. Each sub-interface is designed such that several distinct implementations can be supplied as long as each supports the interface's semantics. An end-user can select from among the implementation options via an environment variable.
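The sub-interface arrangement described above, in which several interchangeable implementations sit behind one set of semantics and an environment variable picks among them, can be illustrated with a minimal sketch. This is not the Chapel runtime's code (which is written in C). The variable name DEMO_TASKS, the class names, and the run_all method are hypothetical, chosen only for illustration, and the two layers are simple stand-ins rather than the actual FIFO or Qthreads tasking layers.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadPerTask:
    """Hypothetical layer: every task runs on its own OS thread."""
    name = "thread_per_task"

    def run_all(self, fns):
        results = [None] * len(fns)

        def call(i, fn):
            # Store each result at its task's index so order is preserved.
            results[i] = fn()

        threads = [threading.Thread(target=call, args=(i, fn))
                   for i, fn in enumerate(fns)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results


class PooledTasks:
    """Hypothetical layer: tasks are multiplexed onto a fixed set of workers."""
    name = "pooled"

    def __init__(self, workers=2):
        self.workers = workers

    def run_all(self, fns):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            futures = [pool.submit(fn) for fn in fns]
            # Collect in submission order, matching ThreadPerTask.run_all.
            return [f.result() for f in futures]


# Registry of implementations that all honor the same run_all semantics.
LAYERS = {"thread_per_task": ThreadPerTask, "pooled": PooledTasks}


def select_tasking_layer(env=None, var="DEMO_TASKS", default="thread_per_task"):
    """Pick an implementation by name from an environment-style mapping."""
    env = os.environ if env is None else env
    choice = env.get(var, default)
    if choice not in LAYERS:
        raise ValueError(
            f"unknown tasking layer {choice!r}; expected one of {sorted(LAYERS)}")
    return LAYERS[choice]()
```

Calling select_tasking_layer() with no arguments reads the process environment; passing a dictionary instead makes the selection easy to exercise without changing global state. Because both layers return results in task order, code written against run_all does not need to know which implementation was selected, which is the property the sub-interface design relies on.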