Concise parallelism
Natural C/C++ Parallelism

• A single operator to control multiple parallel programming paradigms
• Natural C/C++ semantics and variable visibility rules and scopes
• A single operator to control parallel synchronization
• Clear means of parallel identification and interaction

    void salute()
    {
        parallel()
        {
            int idx = pix();
            serial()
            {
                parallel(3)
                {
                    printf("Hello, world, from task %d-%d\n", idx, pix());
                }
            }
        }
    }
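The `parallel`/`serial`/`pix()` operators above are C= extensions, not standard C++. As a rough equivalent, here is a sketch in plain C++ using `std::thread` and a mutex; the fixed pool of two outer tasks and the `greetings` counter are illustrative assumptions, not part of the slide's code:

```cpp
#include <cassert>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::mutex serial_gate;   // plays the role of serial(): one task at a time
int greetings = 0;        // counts printed lines (for checking only)

void salute()
{
    std::vector<std::thread> outer;
    for (int idx = 0; idx < 2; ++idx)              // parallel(): here, 2 tasks
        outer.emplace_back([idx] {
            std::lock_guard<std::mutex> gate(serial_gate);   // serial()
            std::vector<std::thread> inner;
            for (int p = 0; p < 3; ++p)            // parallel(3): pix() -> p
                inner.emplace_back([idx, p] {
                    std::printf("Hello, world, from task %d-%d\n", idx, p);
                });
            for (std::thread& t : inner) t.join();
            greetings += 3;                        // still under the gate
        });
    for (std::thread& t : outer) t.join();
}
```

The comparison shows what the single C= operator hides: thread creation, joining, and an explicit synchronization object.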
Elegant Multitasking

• Synchronized access to any data element without introducing synchronization objects
• Each thread from a pool decrements the task counter and "creates" a job to execute from a single execution state:
  { Task No. = 5000000; Code pointer; Registers; }
• No CPU oversubscription
• Dynamic work balancing
• Minimal memory footprint
• No task queue management overhead

    std::vector<Data> data;
    parallel(5000000)
    {
        int i = pix();
        serial(&data[i])
        {
            data[i].process();
        }
    }
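The counter-decrement scheme described above can be sketched in standard C++: worker threads share one atomic task counter and claim indices by decrementing it, so no task queue exists. The `Data` type, worker count, and `process` body are illustrative assumptions:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct Data { int value = 0; void process() { value += 1; } };

void process_all(std::vector<Data>& data, int workers)
{
    std::atomic<int> task_no(static_cast<int>(data.size())); // Task No.
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            // Each thread "creates" its next job from the shared state:
            // fetch_sub returns the old counter, so i walks N-1 .. 0.
            for (int i = task_no.fetch_sub(1) - 1; i >= 0;
                     i = task_no.fetch_sub(1) - 1)
                data[i].process();   // each i is claimed exactly once
        });
    for (std::thread& t : pool) t.join();
}
```

Because every index is handed out exactly once, no per-element lock is needed here; the slide's `serial(&data[i])` covers the more general case where several tasks may touch the same element.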
Language-Friendly Multithreading

• A single operator to control multithreading and multitasking
• A real independent thread in a class constructor!
• Getting a global ID promotes a task to an independent thread
• Thread-0 returns; thread-1 waits until woken up by another thread/task
• Reaching the break demotes a thread back to a task

    class X
    {
        void* volatile id;
        X()
        {
            parallel(2)
            {
                void* pid = pid();
                if(pix())
                {
                    id = pid;
                    while(id)
                    {
                        wait();
                        getMoreData();
                        processData();
                    }
                    break;
                }
            }
        }
    };

    void X::read()
    {
        wake(id);
    }
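The `wait()`/`wake(id)` pairing can be approximated in standard C++ with a condition variable. This is a sketch under stated assumptions: the `Worker` type, its flags, and the `processed` counter are illustrative, and C= thread IDs are modeled by the object itself rather than a global handle:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

struct Worker {
    std::mutex m;
    std::condition_variable cv;
    bool signalled = false;   // set by wake-equivalent
    bool running = true;      // cleared to let the thread reach its "break"
    int processed = 0;
    std::thread th;

    Worker() : th([this] {                 // "thread-1" from parallel(2)
        std::unique_lock<std::mutex> lock(m);
        while (true) {
            cv.wait(lock, [this] { return signalled || !running; }); // wait()
            if (signalled) { signalled = false; ++processed; continue; }
            if (!running) break;           // the slide's break: back to a task
        }
    }) {}

    void read() {                          // plays the role of wake(id)
        { std::lock_guard<std::mutex> g(m); signalled = true; }
        cv.notify_one();
    }

    void stop() {                          // let the thread exit, then join
        if (!th.joinable()) return;
        { std::lock_guard<std::mutex> g(m); running = false; }
        cv.notify_one();
        th.join();
    }

    ~Worker() { stop(); }
};
```

Note how much machinery (mutex, condition variable, flags, join) the single C= `parallel(2)` block with `wait()`/`wake()` replaces.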
Easy Software Analysis

• Use the same compiler, debugger and profiler tools as for sequential software
• C= source code is a perfect performance model by itself: a C= profiler can annotate each parallel, sequential and cyclic region with timings, contention, iterations, balance, etc., exactly in alignment with the corresponding operator

    std::vector<Data> data;

    void f(int n)
    {
        parallel(data.size())   /// Timing: 5 sec; Parallelism = 95%; Time per CPU: CPU0 = 30%, CPU1 = 30%...
        {
            for(int i = 0; i < n; i++)   /// Avrg iterations = 100
            {
                int j = pix();
                parallel()   /// Timing: 4.5 sec; Parallelism = 80%; Time per CPU: CPU0 = 30%, CPU1 = 30%...
                {
                    data[j].process();
                    serial()   /// Timing: 4.5 sec; Contention = 30%
                    {
                        data[j].reduce();
                    }
                }
            }
        }
    }
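The per-operator `/// Timing` annotations are produced by the C= profiler automatically. As a hand-rolled approximation in standard C++, one can time a region with `<chrono>`; the helper name and the measured body are illustrative assumptions:

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Measure one region, print an annotation in the slide's style,
// and return the elapsed time in microseconds.
long long time_region_us(void (*body)())
{
    auto t0 = std::chrono::steady_clock::now();
    body();                                  // the region under measurement
    auto t1 = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(
                       t1 - t0).count();
    std::printf("/// Timing: %lld us\n", us);
    return us;
}
```

Example use: `time_region_us([] { /* region body */ });` with a capture-less lambda, which converts to a plain function pointer.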
Software Implications

• A powerful parallel programming language …and a unified parallel runtime
• Re-writing parallel runtimes (OpenMP, TBB, Cilk, CRT, PPL, AMP @CPU, OpenCL @CPU) in C= will eliminate CPU oversubscription and guarantee efficient resource management, especially in complex, multi-module applications using several parallel runtimes simultaneously
Hardware Implications

• Slide a tablet into an accelerator box and get faster software, vivid graphics, detailed scenes, real-time video encoding right away!
• C= programs are designed for massive parallelism without incurring extra overhead, by forming a single execution state for any number of parallel tasks:
  { Task No. = data.size; Code pointer; Registers; }
• Co-processors fetch the state transparently to the CPU and OS and smoothly accelerate execution of existing programs
• Truly mobile, data-consistent, cheap and powerful architecture!

    std::vector<Data> data;
    parallel(data.size())
    {
        data[pix()].process();
    }
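The "single execution state" above can be sketched as one small descriptor shared by every processing unit. This is only an illustration in standard C++ under assumed names, not the actual C= runtime layout:

```cpp
#include <atomic>
#include <cassert>

struct ExecutionState {
    std::atomic<long> task_no;   // Task No.: how many indices remain
    void (*code)(long index);    // Code pointer: body of the parallel region
    // The spawning thread's registers would complete the state.
};

// Every processing unit, CPU core or co-processor, runs the same loop:
// claim the next index by decrementing the shared counter, then execute
// the region body with it.
void run_worker(ExecutionState& s)
{
    for (long i = s.task_no.fetch_sub(1) - 1; i >= 0;
              i = s.task_no.fetch_sub(1) - 1)
        s.code(i);
}

// Illustrative region body for a quick check (hypothetical, not C=):
// accumulate i+1 for every claimed index i.
static std::atomic<long> g_sum{0};
static void add_index(long i) { g_sum.fetch_add(i + 1); }
```

Because the whole state is a counter plus a code pointer, attaching another device means nothing more than pointing one extra worker loop at the same descriptor.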
One Program Fits All

• Remote agents may concurrently "steal" work from C= execution states and utilize their CPUs and GPUs
• C= programs are executed concurrently by CPUs and GPUs
• Single Execution State: { Task No. = data.size; Code pointer; Registers; }
• The Unified Semantic Concept of Parallelism enables distributed heterogeneous programming with a single parallel operator

    std::vector<Data> data;
    parallel(data.size())
    {
        coload()
        {
            data[pix()].process();
        }
    }