CSL 860: Modern Parallel Computation
Hello OpenMP

Parallel construct:

#pragma omp parallel
{
   // I am now thread i of n
   switch(omp_get_thread_num()) {
      case 0: blah1..
      case 1: blah2..
   }
}
// Back to normal

• Extremely simple to use and incredibly powerful
• Fork-Join model
• Every thread has its own execution context
• Variables can be declared shared or private
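A complete, compilable version of the hello-world sketch above (the printed message is illustrative; compile with a flag such as gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
   #pragma omp parallel              /* fork: a team of threads executes this block */
   {
      int id = omp_get_thread_num();
      int n  = omp_get_num_threads();
      printf("Hello from thread %d of %d\n", id, n);
   }                                 /* implicit barrier; only the master continues */
   printf("Back to normal (sequential) execution\n");
   return 0;
}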
Execution Model

• Encountering thread creates a team:
   – Itself (the master) + zero or more additional threads
• Applies to the structured block immediately following
   – Each thread executes a copy of the code in { }
   – But also see: work-sharing constructs
• There is an implicit barrier at the end of the block
• Only the master continues beyond the barrier
• May be nested
   – Sometimes disabled by default
Memory Model

• Notion of a temporary view of memory
   – Allows local caching
   – Need to flush memory
   – T1 writes -> T1 flushes -> T2 flushes -> T2 reads
   – Same order seen by all threads
• Supports threadprivate memory
• Variables declared before the parallel construct:
   – Shared by default
   – May be designated as private
      • n - 1 copies of the original variable are created
      • May not be initialized by the system
Shared Variables

• Heap-allocated storage
• Static data members
• const-qualified (no mutable members)
• Private:
   – Variables declared in a scope inside the construct
   – Loop variable in a for construct
      • private to the construct
• Others are shared unless declared private
   – You can change the default
• Arguments passed by reference inherit the attribute from the original
Beware of Compiler Re-ordering

a = b = 0

thread 1:                    thread 2:
   b = 1                        a = 1
   flush(b); flush(a);          flush(a); flush(b);
   if (a == 0) {                if (b == 0) {
      critical section             critical section
   }                            }
Beware More of Compiler Re-ordering

// Parallel construct
{
   int b = initialSalary;
   printf("Initial Salary was %d\n", initialSalary);
   Book-keeping();   // No read of b or write of initialSalary
   if (b < 10000) {
      raiseSalary(500);
   }
}
Thread Control

Environment variable   Ways to modify value   Way to retrieve value   Initial value
OMP_NUM_THREADS        omp_set_num_threads    omp_get_max_threads     Implementation defined *
OMP_DYNAMIC            omp_set_dynamic        omp_get_dynamic         Implementation defined
OMP_NESTED             omp_set_nested         omp_get_nested          false
OMP_SCHEDULE           -                      -                       Implementation defined *

* Also see the construct clauses: num_threads, schedule
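A minimal sketch exercising the runtime routines from the table (the printed labels and the choice of 4 threads are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
   omp_set_num_threads(4);          /* same effect as OMP_NUM_THREADS=4 */
   omp_set_dynamic(0);              /* disable dynamic adjustment of team size */

   printf("max threads: %d\n", omp_get_max_threads());
   printf("dynamic:     %d\n", omp_get_dynamic());
   printf("nested:      %d\n", omp_get_nested());

   #pragma omp parallel num_threads(2)   /* the clause overrides the global setting */
   {
      #pragma omp single
      printf("team size here: %d\n", omp_get_num_threads());
   }
   return 0;
}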
Parallel Construct

#pragma omp parallel \
   if(boolean) \
   private(var1, var2, var3) \
   firstprivate(var1, var2, var3) \
   default(shared | none) \
   shared(var1, var2) \
   copyin(var2) \
   reduction(operator : list) \
   num_threads(n)
{
}
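A sketch combining several of these clauses on one parallel construct; the variable names (base, total) and the value 406 printed with 4 threads are illustrative, not from the slides:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int n = 8;                 /* hypothetical problem size */
   int base = 100;            /* copied into each thread by firstprivate */
   int total = 0;             /* combined across threads by reduction */

   #pragma omp parallel if(n > 4) default(none) \
           firstprivate(base) shared(n) reduction(+:total) num_threads(4)
   {
      int id = omp_get_thread_num();   /* private: declared inside the construct */
      total += base + id;              /* each thread updates its own copy of total */
   }
   printf("total = %d\n", total);      /* private copies combined with +; 406 with 4 threads */
   return 0;
}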
Parallel Loop

#pragma omp parallel for
for (i = 0; i < N; ++i) {
   blah ...
}

• Number of iterations must be known when the construct is encountered
   – Must be the same for each thread
• Compiler puts a barrier at the end of the parallel for
   – But see nowait
Parallel For

#pragma omp for \
   private(var1, var2, var3) \
   firstprivate(var1, var2, var3) \
   lastprivate(var1, var2) \
   reduction(operator : list) \
   ordered \
   schedule(kind[, chunk_size]) \
   nowait

• Canonical for loop
• No loop break
schedule(kind[, chunk_size])

• Divides iterations into contiguous sets, chunks
   – Chunks are assigned transparently to threads
• static: iterations are divided among threads in a round-robin fashion
   – When no chunk_size is specified, approximately equal chunks are made
• dynamic: iterations are assigned to threads in 'request order'
   – When no chunk_size is specified, it defaults to 1
• guided: like dynamic, but the size of each chunk is proportional to the
  number of unassigned iterations divided by the number of threads
   – If chunk_size = k, chunks have at least k iterations (except the last)
   – When no chunk_size is specified, it defaults to 1
• runtime: taken from the OMP_SCHEDULE environment variable
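A sketch of a loop whose cost grows with the iteration index, where dynamic scheduling helps balance the work; the function work() and the chunk size 16 are illustrative:

#include <stdio.h>
#include <omp.h>

#define N 1000

/* cost grows with i, so equal static chunks would be imbalanced */
double work(int i) {
   double s = 0.0;
   for (int k = 0; k < i; k++) s += k * 0.5;
   return s;
}

int main(void) {
   double sum = 0.0;
   /* dynamic,16: each thread grabs the next chunk of 16 iterations on demand */
   #pragma omp parallel for schedule(dynamic, 16) reduction(+:sum)
   for (int i = 0; i < N; i++)
      sum += work(i);
   printf("sum = %f\n", sum);
   return 0;
}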
Single

#pragma omp parallel
{
   #pragma omp for
   for (int i = 0; i < N; i++)
      a[i] = f0(i);
   #pragma omp single
   x = f1(a);
   #pragma omp for
   for (int i = 0; i < N; i++)
      b[i] = x * f2(i);
}

• Only one of the threads executes the single block
• Other threads wait for it
   – unless nowait is specified
• Hidden complexity
   – Threads may be at different instructions
Sections

#pragma omp sections
{
   #pragma omp section
   {
      // do this …
   }
   #pragma omp section
   {
      // do that …
   }
   // …
}

• The omp section directives must be closely nested in a sections construct,
  where no other work-sharing construct may appear.
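A compilable sketch of the skeleton above; the two stage functions doThis() and doThat() are hypothetical stand-ins for independent pieces of work:

#include <stdio.h>
#include <omp.h>

void doThis(void) { printf("this on thread %d\n", omp_get_thread_num()); }
void doThat(void) { printf("that on thread %d\n", omp_get_thread_num()); }

int main(void) {
   #pragma omp parallel
   {
      #pragma omp sections      /* each section is executed once, by some thread */
      {
         #pragma omp section
         doThis();
         #pragma omp section
         doThat();
      }                         /* implicit barrier at the end (unless nowait) */
   }
   return 0;
}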
Private Variables

#pragma omp parallel for private(size, …)
for (int i = 0; i < numThreads; i++) {
   int size = numTasks/numThreads;
   int extra = numTasks - numThreads*size;
   if (i < extra) size++;
   doTask(i, size, numThreads);
}

doTask(int start, int count)
{
   // Each thread's instance has its own activation record
   for (int i = 0, t = start; i < count; i++, t += stride)
      doit(t);
}
Firstprivate and Lastprivate

• Initial value of a private variable is unspecified
   – firstprivate initializes the copies with the original
   – Once per thread (not once per iteration)
   – Original exists before the construct
• Only the original copy is retained after the construct
• lastprivate forces sequential-like behavior
   – The thread executing the sequentially last iteration (or the last listed
     section) writes its copy back to the original
Firstprivate and Lastprivate

#pragma omp parallel for firstprivate(simple)
for (int i = 0; i < N; i++) {
   simple += a[f1(i, omp_get_thread_num())];
   f2(simple);
}

#pragma omp parallel for lastprivate(doneEarly)
for (i = 0; i < N && !doneEarly; i++) {
   doneEarly = f0(i);
}
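A small self-contained sketch (not from the slides; the names offset and last are illustrative) showing firstprivate initialization and the lastprivate write-back with a deterministic result:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int offset = 10;     /* firstprivate: each thread's copy starts at 10 */
   int last = -1;       /* lastprivate: the original receives the copy from i == 7 */

   #pragma omp parallel for firstprivate(offset) lastprivate(last)
   for (int i = 0; i < 8; i++) {
      int v = offset + i;   /* offset is the per-thread copy, initialized from the original */
      last = v;             /* the copy from the sequentially last iteration survives */
   }
   printf("last = %d\n", last);   /* prints 17: value written by iteration i == 7 */
   return 0;
}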
Other Synchronization Directives

#pragma omp master
{
}

– Binds to the innermost enclosing parallel region
– Only the master thread executes the block
– No implied barrier
Master Directive

#pragma omp parallel
{
   #pragma omp for
   for (int i = 0; i < 100; i++)
      a[i] = f0(i);
   #pragma omp master
   x = f1(a);      // Only the master executes; no synchronization
}
Critical Section

#pragma omp critical (accessBankBalance)
{
}

– A single thread at a time
– Applies to all threads
– The name is optional; no name implies a global critical region
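A sketch of a named critical section protecting a shared counter; the variable balance and the loop counts are illustrative:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int balance = 0;   /* shared by all threads */

   #pragma omp parallel num_threads(4)
   {
      for (int k = 0; k < 1000; k++) {
         /* only one thread at a time executes this block; the name lets
            unrelated critical sections elsewhere proceed concurrently */
         #pragma omp critical (accessBankBalance)
         balance += 1;
      }
   }
   printf("balance = %d\n", balance);   /* 4000 with 4 threads */
   return 0;
}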
Barrier Directive

#pragma omp barrier

– Stand-alone
– Binds to the innermost parallel region
– All threads in the team must execute it
   • They will all wait for each other at this instruction
   • Dangerous:
        if (!ready)
           #pragma omp barrier
   • Requires the same sequence of work-sharing and barrier regions for the
     entire team
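A sketch of a two-phase computation where every thread must pass the barrier before phase 2 reads what phase 1 wrote; the arrays and the manual strided partition are illustrative:

#include <stdio.h>
#include <omp.h>

#define N 1024

int main(void) {
   static double a[N], b[N];

   #pragma omp parallel
   {
      int id = omp_get_thread_num();
      int nt = omp_get_num_threads();

      /* phase 1: every thread fills its share of a */
      for (int i = id; i < N; i += nt)
         a[i] = i * 0.5;

      /* all of a must be written before anyone reads neighbours in phase 2 */
      #pragma omp barrier

      /* phase 2: b depends on entries of a written by other threads */
      for (int i = id; i < N; i += nt)
         b[i] = (a[i] + a[(i + 1) % N]) * 0.5;
   }
   printf("b[0] = %f\n", b[0]);
   return 0;
}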
Ordered Directive

#pragma omp ordered
{
}

• Binds to the innermost enclosing loop
• The structured block is executed in sequential order
• The loop must declare the ordered clause
• Each iteration may encounter at most one ordered region
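A sketch where the bulk of each iteration runs out of order but the output is printed in sequential order; the squaring computation is illustrative:

#include <stdio.h>
#include <omp.h>

int main(void) {
   #pragma omp parallel for ordered schedule(dynamic)
   for (int i = 0; i < 8; i++) {
      int sq = i * i;                 /* this part may run in any order */
      #pragma omp ordered
      printf("%d^2 = %d\n", i, sq);   /* printed in loop order: 0, 1, 2, ... */
   }
   return 0;
}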
Flush Directive

#pragma omp flush (var1, var2)

– Stand-alone, like barrier
– Only directly affects the encountering thread
– The list of variables ensures that any compiler re-ordering moves all
  flushes together
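A producer/consumer sketch using flush to publish a value through a flag; the names data and ready are illustrative, and flush is used without a list (flushing everything) to keep the sketch simple and sidestep the re-ordering subtleties noted above:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int data = 0, ready = 0;

   #pragma omp parallel num_threads(2)
   {
      if (omp_get_thread_num() == 0) {     /* producer */
         data = 42;
         #pragma omp flush                  /* publish data before raising the flag */
         ready = 1;
         #pragma omp flush
      } else {                              /* consumer */
         int seen = 0;
         while (!seen) {
            #pragma omp flush               /* refresh the temporary view of ready */
            seen = ready;
         }
         #pragma omp flush                  /* make the published data visible */
         printf("data = %d\n", data);
      }
   }
   return 0;
}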
Atomic Directive

#pragma omp atomic
i++;

• Light-weight critical section
• Only for some expressions
   – x = expr (no mutual exclusion on the evaluation of expr)
   – x++
   – ++x
   – x--
   – --x
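A histogram sketch where atomic protects a single array-element update, which is cheaper than a critical section; the binning expression and sizes are illustrative:

#include <stdio.h>
#include <omp.h>

#define N 100000
#define BINS 16

int main(void) {
   int hist[BINS] = {0};

   #pragma omp parallel for
   for (int i = 0; i < N; i++) {
      int b = (i * 31) % BINS;   /* hypothetical binning function */
      /* atomic covers only this one update to hist[b] */
      #pragma omp atomic
      hist[b]++;
   }
   printf("hist[0] = %d\n", hist[0]);
   return 0;
}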
Reductions

• Reductions are so common that OpenMP provides support for them
• A reduction clause may be added to the parallel for pragma
• Specify the reduction operation and the reduction variable
• OpenMP takes care of storing partial results in private variables and
  combining the partial results after the loop
reduction Clause

• reduction(<op> : <variable>)
   – +   Sum
   – *   Product
   – &   Bitwise and
   – |   Bitwise or
   – ^   Bitwise exclusive or
   – &&  Logical and
   – ||  Logical or
• Add to parallel for
   – OpenMP creates a loop to combine copies of the variable
   – The resulting combining loop may not be parallel
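A complete sum-reduction sketch; the array contents are illustrative:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
   static double a[N];
   for (int i = 0; i < N; i++) a[i] = 1.0 / (i + 1);

   double sum = 0.0;
   /* each thread accumulates into a private copy of sum initialized to 0;
      the copies are combined with + after the loop */
   #pragma omp parallel for reduction(+:sum)
   for (int i = 0; i < N; i++)
      sum += a[i];

   printf("sum = %f\n", sum);
   return 0;
}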
Nesting Restrictions

• A work-sharing region may not be closely nested inside a work-sharing,
  critical, ordered, or master region.
• A barrier region may not be closely nested inside a work-sharing, critical,
  ordered, or master region.
• A master region may not be closely nested inside a work-sharing region.
• An ordered region may not be closely nested inside a critical region.
• An ordered region must be closely nested inside a loop region (or parallel
  loop region) with an ordered clause.
• A critical region may not be nested (closely or otherwise) inside a critical
  region with the same name.
   – Note that this restriction is not sufficient to prevent deadlock.
EXAMPLES
OpenMP Matrix Multiply

#pragma omp parallel for
for (int i = 0; i < n; i++)
   for (int j = 0; j < n; j++) {
      c[i][j] = 0.0;
      for (int k = 0; k < n; k++)
         c[i][j] += a[i][k]*b[k][j];
   }

• a, b, c are shared
• i, j, k are private
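A compilable version of the same kernel as a sketch, using fixed-size global arrays and an identity matrix for b so the result is easy to check; the size N = 256 is illustrative:

#include <stdio.h>
#include <omp.h>

#define N 256

static double a[N][N], b[N][N], c[N][N];

int main(void) {
   for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++) {
         a[i][j] = i + j;
         b[i][j] = (i == j) ? 1.0 : 0.0;   /* identity, so c should equal a */
      }

   /* outer loop parallelized: i is private, j and k are private because they
      are declared inside the construct; a, b, c are shared */
   #pragma omp parallel for
   for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++) {
         c[i][j] = 0.0;
         for (int k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
      }

   printf("c[3][5] = %f (expected %f)\n", c[3][5], a[3][5]);
   return 0;
}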