  1. Introduce parallelism: OpenMP & OpenACC. March 5-9, 2018 | Santiago de Compostela, Spain | #cesgahack18

  2. PARALLWARE SW DEVELOPMENT CYCLE
     1. Understanding the sequential code and profiling
        a. Analyze your code
        b. Focus on profiling and on where to parallelize your code correctly
     2. Identifying opportunities for parallelization
        c. Figure out where the code is suitable for parallelization
        d. Often the hardest step!
     3. Introduce parallelism
        e. Decide how to implement the parallelism discovered in your code
     4. Test the correctness of your parallel implementation
        f. Compile & run the parallel versions of your code to check that the numerical result is correct
     5. Test the performance of your parallel implementation
        g. Run the parallel versions of your code to measure the performance increase for real-world workloads
     6. Performance tuning
        h. Repeat steps 1-5 until you meet your performance requirements.

  3. Race Conditions
     Race conditions are a programmer's nightmare:
     ● They make the result of your parallel code unpredictable.
     ● What are "race conditions"? How can we handle them?
     ● What is the value of variable "x" at the end of the parallel region?

       x = 0;
       #pragma omp parallel
       {
           #pragma omp for
           for (int i = 0; i < N; i++) {
               x = x + 1;
           }
       }

  4. Race Conditions: Scenario 1
     What is the value of variable "x" at the end of the parallel region?
     Thread0 ("x = x + 1") finishes before Thread1 begins its computation ("x = x + 1"), and the value is "x = 2".

       x = 0;
       #pragma omp parallel
       {
           #pragma omp for
           for (int i = 0; i < N; i++) {
               x = x + 1;
           }
       }

     Interleaving:
       Thread0: r1 = 0 + 1
       Thread0: x = r1
       Thread1: r2 = 1 + 1
       Thread1: x = r2

     Correct result!

  5. Race Conditions: Scenario 2
     What is the value of variable "x" at the end of the parallel region?
     Thread0 ("x = x + 1") does not finish before Thread1 begins ("x = x + 1"), and the value is "x = 1".

       x = 0;
       #pragma omp parallel
       {
           #pragma omp for
           for (int i = 0; i < N; i++) {
               x = x + 1;
           }
       }

     Interleaving:
       Thread0: r1 = 0 + 1
       Thread1: r2 = 0 + 1
       Thread0: x = r1
       Thread1: x = r2

     Wrong result!
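
     A minimal sketch of how this race can be removed, assuming the intent is simply to count the N iterations: either protect the update with an atomic operation, or let OpenMP combine thread-private copies with a reduction clause (both appear again on later slides).

       /* Option 1: atomic update -- correct, but serializes every increment. */
       x = 0;
       #pragma omp parallel for
       for (int i = 0; i < N; i++) {
           #pragma omp atomic
           x = x + 1;
       }

       /* Option 2: reduction -- each thread accumulates a private copy of x,
          and OpenMP combines the copies once at the end of the loop. */
       x = 0;
       #pragma omp parallel for reduction(+:x)
       for (int i = 0; i < N; i++) {
           x = x + 1;
       }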

  6. A Code-Oriented Approach
     DEFINITION OF THE PARALLEL REGION: Identify the code fragment that can be executed concurrently. Typically for loops.
     PRIVATIZATION: Variables that store thread-local temporary results. Typically loop temporaries whose value can be discarded at the end of the parallel region.
     SHARING: Read-only variables that save input data. Typically code parameters.
     REDUCTION: Identify computations with associative, commutative operators that require additional synchronization. Typically sum, product, max, min.
     WORK SHARING: Map the computational workload to threads. Typically loop iterations mapped in a block or cyclic manner.

       w = 1.0 / N;
       /* Compute the sum */
       sum = 0.0;
       for (i = 0; i < N; i++) {
           x = (i - 0.5) / N;
           sum = sum + ( 4.0 / ( 1 + x * x ) );
       }
       /* Compute value PI */
       pi = sum * w;
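
     As a rough sketch (not taken verbatim from the slide), the same loop with each of the five concepts expressed as an OpenMP clause might look like this:

       w = 1.0 / N;
       sum = 0.0;
       /* parallel region + work sharing: the for loop is split across threads */
       #pragma omp parallel for shared(N) private(x) reduction(+:sum)
       for (i = 0; i < N; i++) {
           x = (i - 0.5) / N;              /* privatization: x is a loop temporary */
           sum = sum + 4.0 / (1 + x * x);  /* reduction: per-thread sums combined  */
       }                                   /* sharing: N is read-only and shared   */
       pi = sum * w;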

  7. OpenMP Execution Model
     ● Exploit parallelism across several HW processors.
     ● Exploit parallelism within one SW process.
       ○ The master process is split into parallel threads.
       ○ Information is shared via shared memory.

  8. OpenMP Execution Model
     ● Host-driven execution: the HPC application is split into parallelizable phases.
     ● The master thread starts and finishes execution.
     ● Parallelism is exploited within each phase.
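
     A minimal sketch of this fork-join pattern (illustrative only; the arrays and the per-element work are assumptions, not from the slides): serial code runs on the master thread, each parallelizable phase forks its own team, and the master thread continues alone afterwards.

       /* master thread: serial setup */
       double a[1000], b[1000];
       int n = 1000;

       #pragma omp parallel for           /* phase 1: fork a team of threads */
       for (int i = 0; i < n; i++)
           a[i] = i * 0.5;                /* illustrative per-element work   */

       /* implicit join: the master thread continues alone here */

       #pragma omp parallel for           /* phase 2: fork another team      */
       for (int i = 0; i < n; i++)
           b[i] = 2.0 * a[i];

       /* master thread finishes execution */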

  9. GPU Execution Model
     Code transfers and data transfers between host and device; stream programming model.

  10. GPU Execution Model
                          OpenACC    OpenMP              CUDA hardware
      ● Coarse-grain:     gangs      teams distribute    #Blocks
      ● Fine-grain:       workers    parallel            #Threads/Block
      ● Finest-grain:     vector     simd                #WarpThreads
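
      A hedged sketch of how these levels appear as directives on a loop nest (illustrative only; the arrays a, b, c and the bounds n, m are assumptions, and data-movement clauses are omitted for brevity):

        /* OpenACC: gang / worker / vector */
        #pragma acc parallel loop gang
        for (int i = 0; i < n; i++) {
            #pragma acc loop worker vector
            for (int j = 0; j < m; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }

        /* OpenMP offload: teams distribute / parallel for / simd */
        #pragma omp target teams distribute
        for (int i = 0; i < n; i++) {
            #pragma omp parallel for simd
            for (int j = 0; j < m; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }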

  11. Case Study: Algorithm π
      ● Approximate the value of π by integrating 4/(1+x²) over the interval [0,1].
      ● Divide the interval into N subintervals of length 1/N:
        ■ For each subinterval, compute the area of the rectangle whose height is the value of 4/(1+x²) at its midpoint.
      ● The sum of the areas of the N rectangles approximates the area under the curve.
      ● The approximation of π becomes more accurate as N → ∞.

        sum = 0.0;
        for (i = 0; i < N; i++) {
            sum = sum + ( 4.0 / ( 1 + ((i - 0.5)/N) * ((i - 0.5)/N) ) );
        }
        pi = sum / N;
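
      Stated as a formula (the midpoint rule the slide describes; note that the midpoint of the i-th subinterval is (i + 0.5)/N, whereas the slide's code uses (i - 0.5)/N, an off-by-one variant that converges to the same limit):

        \pi = \int_0^1 \frac{4}{1+x^2}\,dx
            \approx \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + x_i^2},
        \qquad x_i = \frac{i + 0.5}{N}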

  12. Parallelization of Algorithm π
      Stage 1:
        ■ Broadcast N, P to all P processors.
      Stage 2:
        ■ Distribute loop iterations (N_p = N/P iterations per processor).
        ■ Compute a partial sum S_p at each processor.
      Stage 3:
        ■ Gather all partial sums S_0, S_1, ..., S_{P-1}.
        ■ Compute the global sum S = S_0 + ... + S_{P-1}.
      These stages map directly onto the MPI version on slide 15.

  13. OpenMP
      Variant 1: loop worksharing with a reduction clause.

        sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++) {
            sum = sum + ( 4.0 / ( 1 + ((i - 0.5)/N) * ((i - 0.5)/N) ) );
        }
        pi = sum / N;

      Variant 2: manual privatization (sum_aux) plus one atomic update per thread.

        sum = 0.0;
        #pragma omp parallel shared(N) private(i) private(sum_aux)
        {
            sum_aux = 0;
            #pragma omp for schedule(static)
            for (i = 0; i < N; i++) {
                sum_aux = sum_aux + ( 4.0 / ( 1 + ((i - 0.5)/N) * ((i - 0.5)/N) ) );
            }
            #pragma omp atomic
            sum = sum + sum_aux;
        }
        pi = sum / N;

      Variant 3: parallel region with a reduction clause and explicit worksharing.

        sum = 0.0;
        #pragma omp parallel shared(N) private(i) reduction(+:sum)
        {
            sum = 0.0;
            #pragma omp for schedule(static)
            for (i = 0; i < N; i++) {
                sum = sum + ( 4.0 / ( 1 + ((i - 0.5)/N) * ((i - 0.5)/N) ) );
            }
        }
        pi = sum / N;

      Variant 4: one atomic update per iteration.

        sum = 0.0;
        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            #pragma omp atomic
            sum = sum + ( 4.0 / ( 1 + ((i - 0.5)/N) * ((i - 0.5)/N) ) );
        }
        pi = sum / N;

  14. OpenACC
      Annotations on the π code: definition of the parallel region, work sharing, privatization, reduction.
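
      A hedged sketch of what an OpenACC version of the π loop typically looks like (standard OpenACC directives; not necessarily the exact code on the original slide):

        sum = 0.0;
        /* parallel region + work sharing: the loop is offloaded and split across gangs/workers/vectors */
        #pragma acc parallel loop private(x) reduction(+:sum)
        for (i = 0; i < N; i++) {
            x = (i - 0.5) / N;              /* privatization: x is a per-iteration temporary  */
            sum = sum + 4.0 / (1 + x * x);  /* reduction: partial sums combined by the runtime */
        }
        pi = sum / N;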

  15. MPI
      ● Definition of the parallel region.
      ● Data scoping PRIVATE: variable declarations are process-local.
      ● Data scoping REDUCTION: implicit message passing to compute the approximation of the number π.
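
      A hedged sketch of an MPI implementation following the three stages of slide 12 (broadcast, distribute and compute partial sums, reduce); the cyclic iteration distribution and variable names are illustrative, not necessarily the slide's exact code:

        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char **argv)
        {
            int rank, size, N = 1000000;
            double local_sum = 0.0, sum = 0.0;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Stage 1: broadcast the problem size to every process */
            MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

            /* Stage 2: cyclic distribution of iterations; local_sum is
               process-local ("private") by construction */
            for (int i = rank; i < N; i += size) {
                double x = (i - 0.5) / (double)N;   /* evaluation point used throughout these slides */
                local_sum += 4.0 / (1.0 + x * x);
            }

            /* Stage 3: reduction with implicit message passing */
            MPI_Reduce(&local_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (rank == 0)
                printf("pi ~= %.6f\n", sum / N);

            MPI_Finalize();
            return 0;
        }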

  16. CUDA
      Annotations on the π code: worksharing, definition of the parallel region, data scoping (private, reduction) with implicit synchronization.
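
      A hedged CUDA sketch of the same computation (not the slide's exact code): grid-stride worksharing, thread-private accumulation, a shared-memory block reduction, and one atomicAdd per block. Double-precision atomicAdd assumes a GPU of compute capability 6.0 or newer.

        #include <stdio.h>
        #include <cuda_runtime.h>

        __global__ void pi_kernel(int n, double *sum)
        {
            extern __shared__ double partial[];          /* per-block scratch space          */
            int tid = threadIdx.x;
            int gid = blockIdx.x * blockDim.x + tid;
            int stride = blockDim.x * gridDim.x;

            double local = 0.0;                          /* data scoping: thread-private sum */
            for (int i = gid; i < n; i += stride) {      /* worksharing: grid-stride loop    */
                double x = (i - 0.5) / (double)n;        /* evaluation point from the slides */
                local += 4.0 / (1.0 + x * x);
            }

            partial[tid] = local;
            __syncthreads();                             /* synchronization inside the block */
            for (int s = blockDim.x / 2; s > 0; s >>= 1) {   /* block-level tree reduction   */
                if (tid < s) partial[tid] += partial[tid + s];
                __syncthreads();
            }
            if (tid == 0) atomicAdd(sum, partial[0]);    /* reduction: combine block results */
        }

        int main(void)
        {
            const int N = 1 << 24;
            const int threads = 256, blocks = 256;
            double h_sum = 0.0, *d_sum;

            cudaMalloc((void **)&d_sum, sizeof(double));                         /* data transfer */
            cudaMemcpy(d_sum, &h_sum, sizeof(double), cudaMemcpyHostToDevice);
            pi_kernel<<<blocks, threads, threads * sizeof(double)>>>(N, d_sum);  /* code transfer */
            cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
            cudaFree(d_sum);

            printf("pi ~= %.6f\n", h_sum / N);
            return 0;
        }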

  17. Data Scoping
      ● Temporary variables
        ○ OpenMP: private, firstprivate, lastprivate; thread-local declaration inside the parallel region (C/C++, not Fortran)
        ○ OpenACC: private, firstprivate, lastprivate, create, copyin
      ● Read-only variables
        ○ OpenMP: shared
        ○ OpenACC: copyin
      ● Output variables
        ○ OpenMP: shared foralls; reduction/atomic/critical reductions
        ○ OpenACC: copy, copyout; reduction/atomic reductions
      ● Work-sharing
        ○ OpenMP: schedule(static), schedule(static,1), schedule(dynamic)
        ○ OpenACC: hardware controlled
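
      A small hedged illustration (not from the slides) of the clauses in this table that have not appeared in the earlier examples, namely firstprivate, lastprivate, and the schedule clause; the arrays a and b are assumed shared input/output arrays:

        double scale = 2.0, last_y = 0.0;
        /* firstprivate: each thread starts with a copy of scale initialized to 2.0
           lastprivate:  after the loop, last_y holds the value from the final iteration
           schedule:     iterations are handed out in chunks of 1 (cyclic distribution) */
        #pragma omp parallel for schedule(static, 1) firstprivate(scale) lastprivate(last_y)
        for (int i = 0; i < N; i++) {
            last_y = scale * a[i];        /* a[] is a shared, read-only input array */
            b[i] = last_y;                /* b[] is a shared output array           */
        }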
