  1. OpenMP 4.0 and Beyond! Aidan Chalk, Hartree Centre, STFC

  2. What is OpenMP? • OpenMP is an API & standard for shared-memory parallel computing. • Works with C, C++ and Fortran. • It was first released in 1997, and version 4.5 was released in 2015. • It can now also be used with accelerators such as GPUs, Xeon Phi and FPGAs.
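  The accelerator support comes from the target constructs introduced in OpenMP 4.0. The C sketch below is illustrative only (the array names are made up, and whether the loop actually runs on a device depends on your compiler and hardware); it offloads a simple vector add.

      #include <stdio.h>

      #define N 100000

      int main(void)
      {
          static double a[N], b[N], c[N];
          for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

          /* Offload the loop to an accelerator if one is present; the map
             clauses say which arrays are copied to and from the device.
             Without a device, the loop falls back to running on the host. */
          #pragma omp target teams distribute parallel for \
              map(to: a, b) map(from: c)
          for (int i = 0; i < N; i++)
              c[i] = a[i] + b[i];

          printf("c[10] = %f\n", c[10]);
          return 0;
      }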

  3. The Basics • The OpenMP API uses pragmas (compiler directives) to tell the compiler what to parallelise. • OpenMP is most commonly used for fork-join parallelism.

  4. The Basics • All parallel code is performed inside a parallel region:

     C:
         #pragma omp parallel
         {
             // Parallel code goes here.
         }

     Fortran:
         !$omp parallel
         ! Parallel code goes here
         !$omp end parallel

  5. The Basics • The number of threads to use in a parallel region can be controlled in three ways: – export OMP_NUM_THREADS=x – void omp_set_num_threads(int x) – the num_threads clause, C: #pragma omp parallel num_threads(x) / Fortran: !$omp parallel num_threads(x)
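  A small, self-contained C illustration of the last two options (the thread counts used here are arbitrary):

      #include <omp.h>
      #include <stdio.h>

      int main(void)
      {
          /* Option 2: set the default team size through the runtime API. */
          omp_set_num_threads(4);
          #pragma omp parallel
          {
              #pragma omp single
              printf("default team size: %d\n", omp_get_num_threads());
          }

          /* Option 3: override the team size for one region only. */
          #pragma omp parallel num_threads(2)
          {
              #pragma omp single
              printf("num_threads(2) team size: %d\n", omp_get_num_threads());
          }
          return 0;
      }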

  6. The Basics • The most common use case is a parallel loop:

     C:
         #pragma omp parallel for
         for(int i = 0; i < 100000; i++)
             c[i] = a[i] + b[i];

     Fortran:
         !$omp parallel do
         do i = 1, 100000
             c(i) = a(i) + b(i)
         end do
         !$omp end parallel do

  7. Data-Sharing Clauses • One of the most important things to get right when using OpenMP is the data-sharing clauses. • Always start with #pragma omp parallel default(none) – it forces you to give every variable an explicit clause. • This makes bugs less likely and easier to track down.
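  A minimal C sketch of what default(none) buys you (the function and array names are made up): every variable used in the region must be listed, so a forgotten clause becomes a compile-time error instead of a silent race.

      /* With default(none), leaving a, s or n out of the clause list below
         is a compile error, rather than a variable quietly becoming shared. */
      void scale(double *a, double s, int n)
      {
          int i;
          #pragma omp parallel for default(none) shared(a) firstprivate(s, n) private(i)
          for (i = 0; i < n; i++)
              a[i] = s * a[i];
      }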

  8. Commonly Used Data-Sharing Clauses • shared: allows a variable to be accessed by all of the threads inside a parallel region – beware of race conditions. • private: creates an uninitialised copy of the variable for each thread. At the end of the region, the data is lost. • reduction: creates a copy of the variable for each thread, initialised according to the type of reduction chosen. Example operators are +, *, -, &, etc. At the end of the region, the original variable contains the reduction of all of the threads' copies.

  9. Using Data-Sharing Clauses

     C:
         int i, sum = 0;
         int a[100000], b[100000], c[100000];
         #pragma omp parallel for default(none) \
             shared(a, b, c) private(i) reduction(+:sum)
         for(i = 0; i < 100000; i++) {
             c[i] = a[i] + b[i];
             sum += c[i];
         }

     Fortran:
         Integer, Dimension(100000) :: a, b, c
         Integer :: i, sums
         sums = 0
         !$omp parallel do default(none) &
         !$omp shared(a, b, c) private(i) &
         !$omp reduction(+:sums)
         do i = 1, 100000
             c(i) = a(i) + b(i)
             sums = sums + c(i)
         end do
         !$omp end parallel do

  10. Controlling Loop Scheduling • OpenMP also allows the user to specify how the loop is executed, using the schedule clause. • The default option is static: the loop is broken into nr_threads equal chunks, and each thread executes one chunk. • You can specify the size of the chunks manually: schedule(static, 100) will create chunks of 100 iterations. • Other options: guided, dynamic, auto, runtime. • Usually static or guided will give the best performance.

  11. Controlling Loop Scheduling • The other commonly used options are: • guided, chunksize: the iterations are assigned to threads in chunks. Each chunk is proportional to the number of remaining iterations, and no smaller than the chunk size. • dynamic, chunksize: the iterations are distributed to threads in chunks. Each thread executes a chunk, then requests another chunk once it has completed. • Usually static or guided will give the best performance.

  12. Controlling Loop Scheduling

     C:
         #pragma omp parallel for default(none) \
             shared(a, b, c) private(i) \
             schedule(guided, 1000)
         for(i = 0; i < 100000; i++) {
             c[i] = a[i] + b[i];
         }

     Fortran:
         !$omp parallel do default(none) &
         !$omp shared(a, b, c) private(i) &
         !$omp schedule(dynamic, 1000)
         do i = 1, 100000
             c(i) = b(i) + a(i)
         end do
         !$omp end parallel do

  13. Thread ID • Each thread has its own ID, which can be retrieved with omp_get_thread_num() • The total number of threads can also be retrieved with omp_get_num_threads()
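  A short C example using both calls (the message printed is just for illustration):

      #include <omp.h>
      #include <stdio.h>

      int main(void)
      {
          #pragma omp parallel
          {
              int tid = omp_get_thread_num();        /* 0 .. nthreads-1 */
              int nthreads = omp_get_num_threads();  /* size of the current team */
              printf("Hello from thread %d of %d\n", tid, nthreads);
          }
          return 0;
      }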

  14. First exercise - setup • Copy the exercises from: – cp /home/aidanchalk/md.tar.gz . – cp /home/aidanchalk/OpenMP_training.pdf . • Extract them to a folder: – tar -xvf md.tar.gz • Load the Intel compiler using source /opt/intel/parallel_studio_xe_2017.2.050/psxevars.sh • Compile the initial code with make • Test it on the Xeon using the jobscript: – qsub initial.pbs • Check the output by looking at output.txt. Record the runtime.

  15. First exercise • Copy the original code to ex_1: – cp initial/md.* ex_1/. • Add OpenMP loop-based parallelism to the compute_step and update routines. • To build it, use make ex1 • Test it on the Xeon and on the Xeon Phi (KNC): – Xeon (copy from /home/aidanchalk/ex1_xeon.pbs): qsub ex1_xeon.pbs – Xeon Phi: qsub ex1_phi.pbs • How does the runtime compare to the original serial version? The runtimes.out file contains the runtime for each core count from 1 to 32.

  16. First exercise • Add schedule(runtime) to the OpenMP loop in compute_step, add export OMP_SCHEDULE="guided,8" to the jobscript, and compare the runtime. • Try other schedules (dynamic, auto) and other chunk sizes to see how it affects the runtime.
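  A hedged sketch of the source-side change (the loop and array names below stand in for the real compute_step loop, which is not reproduced here):

      #define N 100000

      /* schedule(runtime) defers the schedule choice to the OMP_SCHEDULE
         environment variable, so the jobscript's
         export OMP_SCHEDULE="guided,8" takes effect without recompiling. */
      void compute_step_like_loop(const double *a, const double *b, double *c)
      {
          #pragma omp parallel for default(none) shared(a, b, c) schedule(runtime)
          for (int i = 0; i < N; i++)
              c[i] = a[i] + b[i];
      }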

  17. First exercise – potential issue • If the performance is worse, use make ex1_opt and check the optrpt to see if the compute_step loop was vectorised. • If not, rewrite the loop without using a reduction variable; this should allow the compiler to vectorise the code and give better performance than the serial version (one possible approach is sketched below).
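  The md loop itself is not shown here, so the sketch below is only one possible reading of that hint, using a made-up per-element energy contribution: keep the hot loop free of the scalar accumulation (so it vectorises cleanly) by writing contributions to a temporary array and summing them in a separate, cheap pass.

      #include <stdlib.h>

      #define N 100000

      /* Hypothetical stand-in for the compute_step energy sum. */
      double energy_without_reduction(const double *f, const double *v)
      {
          double *contrib = malloc(N * sizeof(double));
          double energy = 0.0;
          int i;

          /* Main loop: no scalar dependence, so it is easy to vectorise. */
          #pragma omp parallel for default(none) shared(f, v, contrib) private(i)
          for (i = 0; i < N; i++)
              contrib[i] = 0.5 * f[i] * v[i];

          /* Cheap serial pass to combine the per-element contributions. */
          for (i = 0; i < N; i++)
              energy += contrib[i];

          free(contrib);
          return energy;
      }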

  18. Task-Based Parallelism • Task-Based Parallelism is an alternative methodology for shared-memory parallel computing. • Rather than managing threads explicitly, we break the work into parallelisable chunks, known as tasks. • Between tasks, we keep track of data flow/dependencies (and potential race conditions). • With this information, we can safely execute independent tasks in parallel.
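  The data flow/dependency bookkeeping maps onto the depend clause that OpenMP 4.0 added to tasks. A minimal C sketch with two made-up stages: the second task reads what the first writes, so the runtime will not start it until the first has finished, while unrelated tasks remain free to run in parallel.

      #include <stdio.h>

      int main(void)
      {
          double x = 0.0, y = 0.0;

          #pragma omp parallel
          #pragma omp single
          {
              /* Task A produces x. */
              #pragma omp task shared(x) depend(out: x)
              x = 42.0;

              /* Task B consumes x and produces y, so it runs after A. */
              #pragma omp task shared(x, y) depend(in: x) depend(out: y)
              y = 2.0 * x;

              #pragma omp taskwait    /* wait for both tasks before printing */
              printf("y = %f\n", y);
          }
          return 0;
      }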

  19-21. Diagram of Task-Based Parallelism (three diagram-only slides; the figures are not reproduced in this transcript)

  22. OpenMP Tasks • OpenMP added tasks in 3.0, and additions to them have been included in both 4.0 and 4.5. • The original construct was the task keyword. • OpenMP 4.5 added an easier option – the taskloop. • In OpenMP, tasks are not (usually) guaranteed to have executed until the next barrier, or until you use a taskwait.

  23. Taskloop • Used to execute a loop using tasks. • Note: taskloop is not a worksharing construct (unlike OpenMP for) – you need to run it inside a single region unless you want to perform the loop multiple times. • You can use either num_tasks or grainsize to control the amount of work in each task. • (gcc-6.3 & gcc-7 bug – always declare the loop variable in the loop header, i.e. for(int i = 0; …), when using taskloop.)

  24. Taskloop

     C:
         #pragma omp parallel default(none) \
             shared(a, b, c)
         {
             #pragma omp single
             #pragma omp taskloop grainsize(1000)
             for(int i = 0; i < 100000; i++)
             {
                 c[i] = a[i] + b[i];
             }
         }

     Fortran:
         !$omp parallel default(none) &
         !$omp shared(a, b, c) private(i)
         !$omp single
         !$omp taskloop num_tasks(1000)
         do i = 1, 100000
             c(i) = b(i) + a(i)
         end do
         !$omp end taskloop
         !$omp end single
         !$omp end parallel

  25. Second Exercise • Create a new directory called ex_2 and copy the ex_1/md.XXX files to it. • Alter your implementation of compute_step to use taskloop rather than the do/for loop you used before. • Note – you can't use a reduction variable for the energy with taskloop (one possible workaround is sketched below). • Build with make ex2 • How does the runtime compare to your previous version? • How does altering the grainsize (or number of tasks) affect your results?
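  Since a reduction clause is ruled out here, one possible workaround (a sketch only, with invented names rather than the md code) is to give each task its own partial sum and commit it to the shared total with a single atomic update per task:

      #define N 100000

      /* Hypothetical taskloop energy sum without a reduction clause: the
         outer loop walks the range in blocks, each block becomes one task,
         and each task touches the shared total exactly once, atomically. */
      double taskloop_energy_sketch(const double *contrib)
      {
          double total = 0.0;

          #pragma omp parallel
          #pragma omp single
          #pragma omp taskloop grainsize(1) shared(total, contrib)
          for (int start = 0; start < N; start += 1000) {
              int end = (start + 1000 < N) ? start + 1000 : N;
              double local = 0.0;
              for (int i = start; i < end; i++)
                  local += contrib[i];
              #pragma omp atomic
              total += local;
          }
          return total;
      }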

  26. Second Exercise • If your new code is substantially slower, use make ex2_opt and look at the optimization report. • If the code doesn't vectorise, avoid updating the arrays directly in the inner loop – instead accumulate into a temporary in the inner loop and add it to the array in the outer loop, as sketched below.
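  A minimal illustration of that transformation on a made-up pairwise loop (not the md code): the inner loop accumulates into a scalar, and the array is updated once per outer iteration.

      #define N 1000

      /* The inner loop sums into the temporary fi (vectorisable); the
         array element f[i] is updated once, in the outer loop. */
      void pairwise_sketch(const double r[N][N], double f[N])
      {
          for (int i = 0; i < N; i++) {
              double fi = 0.0;
              for (int j = 0; j < N; j++)
                  fi += r[i][j];
              f[i] += fi;
          }
      }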

  27. Explicit Tasks • Taskloop is handy when you just want tasks for an unbalanced loop, but sometimes we want more control over how our tasks are spawned. • We can spawn tasks ourselves, using the task pragma. • To create an explicit task, we put the task pragma around the section of code we want executed as a task, and apply the relevant data-sharing clauses. • The firstprivate clause is useful here: any task-private data that needs to be input to a task region should be declared firstprivate.

  28. Explicit Tasks • Usually we will spawn tasks from inside a single region (and certain OpenMP features only work if we do). • If we have completely independent tasks, we may be better off spawning them inside a parallel for. • Note: we cannot use reduction variables inside tasks (task reductions are under discussion for OpenMP 5.0).

  29. Explicit Tasks

     C:
         #pragma omp parallel default(none) \
             shared(a, b, c) private(i)
         {
             #pragma omp single
             for(i = 0; i < 100000; i += 1000) {
                 #pragma omp task default(none) \
                     shared(a, b, c) firstprivate(i)
                 for(int j = 0; j < 1000; j++)
                     c[i+j] = a[i+j] + b[i+j];
             }
         }

     Fortran:
         !$omp parallel default(none) &
         !$omp shared(a, b, c) private(i, j)
         !$omp do
         do i = 1, 100000, 1000
             !$omp task default(none) &
             !$omp shared(a, b, c) private(j) &
             !$omp firstprivate(i)
             do j = 0, 999
                 c(i+j) = a(i+j) + b(i+j)
             end do
             !$omp end task
         end do
         !$omp end do
         !$omp end parallel

  30. Third Exercise • Create a new folder (ex_3) and copy the original files (initial/md.XX) to ex_3/md.XX • Break down the outer loop in compute_step and create explicit tasks. Copy your code from ex_1 to parallelise the update function. • Build with make ex3 • How does the runtime compare to your previous versions? • What size of task performs best for explicit tasks? • Now parallelise the update function using explicit tasks. • Does this improve the performance?
