GPU-Accelerated Particle-in-Cell Code on Minsky

GPU-Accelerated Particle-in-Cell Code on Minsky
IWOPH17, ISC, Frankfurt a. M.
Andreas Herten, Forschungszentrum Jülich, 22 July 2017


  1. Poster overview: Accelerating Plasma Physics with GPUs (Jülich Scalable Particle-in-Cell Code)
Andreas Herten, Dirk Pleiter, Dirk Brömmel, Jülich Supercomputing Centre

JuSPIC
• Based on the plasma simulation code PSC (by H. Ruhl)
• 3D electromagnetic Particle-in-Cell code; arbitrary number of particle species; Cartesian geometry
• Solves the relativistic Vlasov equation, coupled to Maxwell's equations in a finite-difference time-domain scheme
• Modern Fortran; scales to the full JUQUEEN supercomputer
• Fully distributed with MPI (domain decomposition: tiles); CPU-parallelized with OpenMP (local decomposition: slices; A and B slices processed independently)
• Parallelization on slice/tile level

Workflow: three parts
• Solve Maxwell equations with OpenACC (not shown here)
• Update of particle position and momentum (pusher)
• Update of densities (reduction)

Techniques: OpenACC and Fortran
• Directive-based, well supported through the PGI compiler, example:
    !$acc parallel loop private(pp,root,qi,mi,wi) present(e,b)
    do i=1, n
      ! …
    end do
    !$acc end parallel
• JuSPIC: many issues during parallelization; structured data types and array operations are challenging for the compiler → why not use CUDA Fortran?
• Portable with preprocessor guards:
    #ifdef _CUDA
      i = blockDim%x * blockIdx%x + threadIdx%x
    #else
      do i=0,N
    #endif
    ! …

Particle Pusher: porting steps
• Initial OpenACC port: !$acc parallel loop private(pp,root,qi,mi,wi) present(e,b) copy(list_of_particles); not running, breaks at first encounter; Fortran style and complex kernel challenging for the OpenACC compiler
• Working OpenACC port: unroll some array operations (x_(1)=list_of_particles(i_particle)%vec(1), x_(2)=list_of_particles(i_particle)%vec(2), …); limit the number of gangs/vectors (slow!); few code changes necessary
• Fast OpenACC: rewrite of the entire computing kernel; original data structure (linked list) temporarily moved to an array; pinned host data (attributes(pinned) :: list_of_particles)
• CUDA Fortran + OpenACC: translation to a CUDA Fortran kernel; helper data (scalars, 3D vectors) handled by OpenACC; particle positions and momenta via CUDA; GPU-compatible through preprocessor guards; two changes necessary; few Fortran features used (arrays, …):
    !$acc enter data copyin(list_of_particles, xyzl, di, dqs, one1, one2, lbounds, ubounds)
    !$acc host_data use_device(list_of_particles, e, b, xyzl, di, dqs, one1, one2, lbounds, ubounds)
    type(particle_type), dimension(:), allocatable :: list_of_particles
    call gpupusher<<<dim3(nBlocks, 1, 1), dim3(nThreads, 1, 1)>>>(...)
• CUDA Fortran, SoA: Structure-of-Arrays data type for coalesced memory access; allocated once, resized dynamically; speedup over a single CPU core: 24×
    type :: posmom
      real, dimension(:), allocatable :: x, y, z, px, …
    end type posmom
    type(posmom) :: soa_list_of_particles
    real, dimension(:), allocatable, device :: d_x, …

Performance Model
• Based on information exchange: t(N_part) = α + I(N_part)/β
• Speedup: kernel (only compute) w.r.t. CPU loop (single core); full pusher (incl. all overhead) w.r.t. initial OpenACC
• CPU reference: single core (OpenMP disabled), Intel Xeon E5-2650 (2 GHz, Sandy Bridge); GPUs: NVIDIA Tesla K40, K80, P100; ECC on, clock fixed to max. value → lower limit of exploited bandwidth

Status of Porting and Acceleration
• Kernel duration vs. information exchanged for the kernel; effective bandwidths vs. clock frequencies for K40, K80, P100
• Effective bandwidth: K80 about 100 GB/s; P100 about 317 GB/s
• K80: two regions (left: performance depends on clock; right: nearly constant); P100: JuSPIC benefits from the new GPU design; GPU kernel possibly latency-limited (many registers)

Conclusion & Outlook
• First progress made in the GPU acceleration of JuSPIC; hybrid code: OpenACC and CUDA Fortran
• Changes in data layout necessary (expensive!); operations on whole fields (it's Fortran after all) → overhead → many manual code adaptions → Fortran programming
• Benefit from the P100 architecture
• Outlook: parallelization on multiple GPUs; minimization of host/device copies; reduction on GPU; lowering the overhead of data-layout transformations; evaluate the data-layout change for the rest of JuSPIC
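
The !$acc fragments above are not complete on their own; the following is a minimal, self-contained sketch of the same OpenACC loop pattern (illustrative arrays x and v, not the JuSPIC fields; compile e.g. with pgfortran -acc, and it still runs correctly as plain Fortran without OpenACC):

    module acc_push
      implicit none
    contains
      subroutine push(n, dt, x, v)
        integer, intent(in)    :: n
        real,    intent(in)    :: dt, v(n)
        real,    intent(inout) :: x(n)
        integer :: i
        ! x and v are expected to be present on the device already
        !$acc parallel loop present(x, v)
        do i = 1, n
          x(i) = x(i) + dt * v(i)
        end do
      end subroutine push
    end module acc_push

    program demo
      use acc_push
      implicit none
      integer, parameter :: n = 1024
      real :: x(n), v(n)
      x = 0.0
      v = 1.0
      !$acc enter data copyin(x, v)
      call push(n, 0.5, x, v)
      !$acc exit data copyout(x) delete(v)
      print *, 'max deviation from 0.5:', maxval(abs(x - 0.5))
    end program demo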

  2. OpenACC Pusher: Complicated Structures
At the start of porting, the pusher kernel was too complicated for parsing by the compiler:
— Large routine (many registers)
— Structured data types (with allocatables)
— Operations on whole fields (it's Fortran after all)
Long investigation to get runnable code; good performance complicated. Reported in another publication (beyond the scope here, see appendix). → Use CUDA Fortran
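
For illustration, a hedged, runnable sketch of the kind of pattern meant here: an allocatable list of a derived type plus whole-array operations on its components (the type definition is an assumption; only the %vec component accesses appear elsewhere in these slides):

    module particle_types
      implicit none
      type :: particle_type
        real, dimension(3) :: vec   ! position components
        real, dimension(3) :: mom   ! momentum components
      end type particle_type
    end module particle_types

    program whole_array_ops
      use particle_types
      implicit none
      type(particle_type), dimension(:), allocatable :: list_of_particles
      real, parameter :: dt = 0.5
      integer :: i_particle
      allocate (list_of_particles(10))
      do i_particle = 1, size(list_of_particles)
        list_of_particles(i_particle)%vec = 0.0
        list_of_particles(i_particle)%mom = 1.0
        ! whole-array operation on a derived-type component, typical Fortran style
        list_of_particles(i_particle)%vec = list_of_particles(i_particle)%vec &
                                          + dt * list_of_particles(i_particle)%mom
      end do
      print *, list_of_particles(1)%vec
    end program whole_array_ops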

  3. CUDA Fortran (Acceleration for GPUs)

  4. Introduction to CUDA Fortran
It's like CUDA C/C++, but for Fortran: available in the PGI Fortran compiler, it adds the CUDA extensions to Fortran. Examples (from JuSPIC):
— Define a device variable alongside the host variable:
    type(particle_type), dimension(slice(1)%n) :: list_of_particles, list_of_particles_d
    attributes(device) :: list_of_particles_d
— Copy to device:
    list_of_particles_d = list_of_particles
— Define kernel:
    attributes(global) subroutine gpupusher(list_of_particles, ...)
— Call kernel:
    call gpupusher<<<dim3(nBlocks, 1, 1), dim3(nThreads, 1, 1)>>>(list_of_particles_d, ...)
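
Putting those four pieces together, a minimal, self-contained CUDA Fortran sketch (an illustrative push kernel, not the JuSPIC gpupusher; compile as a .cuf file with the PGI compiler):

    module push_mod
      use cudafor
      implicit none
    contains
      attributes(global) subroutine push_kernel(n, dt, x, v)
        integer, value :: n
        real,    value :: dt
        real           :: x(n), v(n)   ! kernel array arguments live in device memory
        integer        :: i
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        if (i <= n) x(i) = x(i) + dt * v(i)
      end subroutine push_kernel
    end module push_mod

    program demo_cuf
      use cudafor
      use push_mod
      implicit none
      integer, parameter :: n = 1024
      real         :: x(n), v(n)
      real, device :: x_d(n), v_d(n)   ! device counterparts
      integer      :: istat
      x = 0.0
      v = 1.0
      x_d = x                          ! host-to-device copy by assignment
      v_d = v
      call push_kernel<<<(n + 255)/256, 256>>>(n, 0.5, x_d, v_d)
      istat = cudaDeviceSynchronize()
      x = x_d                          ! device-to-host copy by assignment
      print *, 'max deviation from 0.5:', maxval(abs(x - 0.5))
    end program demo_cuf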

  5. CUDA Fortran Portability
CUDA Fortran is the more powerful approach, and OpenACC mixes well with CUDA Fortran. Portability suffers… but that can be mitigated:
1 Use OpenACC as much as possible, e.g. for data movements:
    !$acc enter data copyin(list_of_particles, ...)
2 Use pre-processor directives for the rest:
    #ifdef _CUDA
      i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    #else
      do i = lbound(a, 1), ubound(a, 1)
    #endif
Not as portable as OpenACC, but it's alright.
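
A hedged, compilable sketch of this guard pattern (an illustrative routine, not the JuSPIC source): when the file is built as CUDA Fortran the _CUDA macro is defined and the kernel variant is compiled, otherwise the same source yields a plain host loop.

    module pusher_mod
    #ifdef _CUDA
      use cudafor
    #endif
      implicit none
    contains
    #ifdef _CUDA
      attributes(global) subroutine pusher(n, dt, x, v)
    #else
      subroutine pusher(n, dt, x, v)
    #endif
        integer, value :: n
        real,    value :: dt
        real           :: x(n), v(n)
        integer        :: i
    #ifdef _CUDA
        ! one thread per particle
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        if (i > n) return
    #else
        ! plain host loop over all particles
        do i = 1, n
    #endif
          x(i) = x(i) + dt * v(i)
    #ifndef _CUDA
        end do
    #endif
      end subroutine pusher
    end module pusher_mod

Launching the kernel variant still needs a guarded chevron call on the caller side, as in the JuSPIC snippet above.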

  6. Data Layout Analysis (Acceleration for GPUs)

  7. Strategies for Data Layout
Because data is not solely data: benchmark different data layouts and transfer strategies (benchmarking on JURON). Sub-parts of the pusher:
Allocate: allocate host-side data structures
LL2F: convert the linked-list data structure to a flat field
H2D: copy data from host to device
Kernel: run the kernel
D2H: copy data from device to host
F2LL: copy the flat field back to the linked list
Others: left-over time (synchronization, etc.)
∑: everything
(One way such sub-timings can be obtained is sketched below.)
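
As a hedged illustration only (the slides do not say how the sub-timings were taken), CUDA events in CUDA Fortran are one way to time the H2D, Kernel, and D2H parts separately; push_kernel is the illustrative kernel from the earlier sketch:

    program time_subparts
      use cudafor
      use push_mod                       ! illustrative kernel module from the earlier sketch
      implicit none
      integer, parameter :: n = 1000000
      real              :: x(n), v(n)
      real, device      :: x_d(n), v_d(n)
      type(cudaEvent)   :: ev0, ev1, ev2, ev3
      real              :: t_h2d, t_kernel, t_d2h
      integer           :: istat
      x = 0.0
      v = 1.0
      istat = cudaEventCreate(ev0)
      istat = cudaEventCreate(ev1)
      istat = cudaEventCreate(ev2)
      istat = cudaEventCreate(ev3)
      istat = cudaEventRecord(ev0, 0)
      x_d = x                                                      ! H2D
      v_d = v
      istat = cudaEventRecord(ev1, 0)
      call push_kernel<<<(n + 255)/256, 256>>>(n, 0.5, x_d, v_d)   ! Kernel
      istat = cudaEventRecord(ev2, 0)
      x = x_d                                                      ! D2H
      istat = cudaEventRecord(ev3, 0)
      istat = cudaEventSynchronize(ev3)
      istat = cudaEventElapsedTime(t_h2d,    ev0, ev1)
      istat = cudaEventElapsedTime(t_kernel, ev1, ev2)
      istat = cudaEventElapsedTime(t_d2h,    ev2, ev3)
      print *, 'H2D, Kernel, D2H times in ms:', t_h2d, t_kernel, t_d2h
    end program time_subparts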

  8. Data Layout Experiments: Description of Experiments
Initial: all particles stored in a single field, one particle after another; data copied to/from the GPU with Fortran (baseline)
Exp 1: as Initial, but data copied with OpenACC copy directives
Exp 2: as Exp 1, but data copied from pinned host memory
SoA: data copied with Fortran, but instead of one field with all particle data, one field for each spatial and momentum component of the particles

Timings on JURON, in µs:
            ∑      Allocate  LL2F   H2D   Kernel  D2H   Others  F2LL
  Initial   8040   –         567    82    84      62    350     6885
  Exp 1     10435  –         353    80    82      91    380     9440
  Exp 2     9695   564       527    79    83      72    108     7973
  SoA       7811   1         844    66    77      53    376     6386

  9. Data Layout Experiments: Discussion of Results
(Timings as in the table above.)
SoA: fastest, looking (also) at the raw GPU runtimes, but slowest for the change of data structures (six fields vs. one)
Exp 1: also OK for the raw GPU times, but large F2LL overhead (more on that later)
Exp 2: least overhead; pinned memory allows for direct data access, but the allocation overhead is not fully resolved
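
The SoA layout behind the last row is the posmom type already quoted in the overview; the following hedged sketch fills in the obvious details (the resize routine and the py, pz components are assumptions) to show how one allocatable field per component can be allocated once and resized:

    module soa_mod
      implicit none
      type :: posmom
        real, dimension(:), allocatable :: x, y, z, px, py, pz
      end type posmom
    contains
      ! grow the SoA container only when needed ("allocated once, resized dynamically")
      subroutine ensure_capacity(p, n)
        type(posmom), intent(inout) :: p
        integer,      intent(in)    :: n
        if (allocated(p%x)) then
          if (size(p%x) >= n) return
          deallocate (p%x, p%y, p%z, p%px, p%py, p%pz)
        end if
        allocate (p%x(n), p%y(n), p%z(n), p%px(n), p%py(n), p%pz(n))
      end subroutine ensure_capacity
    end module soa_mod

Each component array can then be mirrored by a device array (real, dimension(:), allocatable, device :: d_x, …) and copied with a plain assignment, which is what gives the coalesced accesses in the kernel.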

  10. Data Layout Experiments: Architecture Comparison
The same experiments on JUHYDRA, compared with the JURON numbers (all values in µs):

            ∑      Allocate  LL2F   H2D   Kernel  D2H   Others  F2LL
  JURON
  Initial   8040   –         567    82    84      62    350     6885
  Exp 1     10435  –         353    80    82      91    380     9440
  Exp 2     9695   564       527    79    83      72    108     7973
  SoA       7811   1         844    66    77      53    376     6386
  JUHYDRA
  Initial   4956   –         908    267   229     198   736     2600
  Exp 1     4687   –         764    232   229     208   804     2455
  Exp 2     5328   577       1027   224   230     192   23      2651
  SoA       4880   1         786    204   208     173   827     2674

Slide annotations comparing the two systems: 2.8× and 3.2× (from the table, the GPU-side steps are roughly that much faster on JURON), but 0.6× and 0.3× (total time and host-side list conversion are slower on JURON). Why!?

  11. Data Layout Conversion (Acceleration for GPUs)

  12. Conversion of Data Layouts
Why is F2LL so slow? Parts of F2LL:
— Kill the old linked list of particles (start with the first particle, progress along the links, remove each particle)
— Initialize a new, empty linked list of particles
— Loop through the field(s) of particle information…
— … and add each particle to the linked list, updating the pointers (add_one_to_list; see the sketch below):
    allocate (list%tail%next)
    nullify (list%tail%next%next)
    list%tail%next%particle = particle
    list%tail => list%tail%next
⇒ Benchmark
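
A self-contained sketch of this append step; the node and list types and the empty-list handling are assumptions filled in around the four lines quoted above:

    module list_mod
      implicit none
      type :: particle_type
        real :: vec(3), mom(3)
      end type particle_type
      type :: node
        type(particle_type) :: particle
        type(node), pointer :: next => null()
      end type node
      type :: plist
        type(node), pointer :: head => null(), tail => null()
      end type plist
    contains
      subroutine add_one_to_list(list, particle)
        type(plist),         intent(inout) :: list
        type(particle_type), intent(in)    :: particle
        if (.not. associated(list%head)) then   ! first particle: create the head node
          allocate (list%head)
          nullify (list%head%next)
          list%head%particle = particle
          list%tail => list%head
        else                                    ! append behind the current tail
          allocate (list%tail%next)
          nullify (list%tail%next%next)
          list%tail%next%particle = particle
          list%tail => list%tail%next
        end if
      end subroutine add_one_to_list
    end module list_mod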

  13. Compiler/MPI Timings
Figure: normalized runtimes of the linked-list benchmark on different hosts; runtime per particle (in units of 1e-7 s, up to about 2.5) over the number of particles (100000 to 100000000) for JUHYDRA, JURECA, and JURON with GCC, GCCMPI, PGI, PGIMPI, and (on JURON) XLF. Successive builds of the slide highlight the x86 PGI, POWER PGI, and POWER PGI/MPI curves.

  14. Compiler Investigation: Is MPI Slow? And, by the way, which MPI!?
MPI is not actively used in the GPU version of JuSPIC, but will be in the future. Replacing pgfortran by mpifort leads to a performance decrease, although the add_one_to_list benchmark does not use MPI at all! (PGIMPI: the MPI version shipped with PGI.)
→ Benchmark the compilers, with PAPI [3] instrumentation (see the appendix for some more counters):

  System    Compiler  Time pP / ns  Instructions pP
  JURON     GCC       36            121
  JURON     GCCMPI    37            121
  JURON     PGI       41            243
  JURON     PGIMPI    154           462
  JURON     PGIMPI*   46            243
  JURON     XLF       48            121
  JUHYDRA   PGI       32            210
  JUHYDRA   PGIMPI    32            210
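
A hedged sketch of how such a time-per-particle figure can be measured with standard Fortran timing around the append loop (the actual study added PAPI [3] hardware-counter instrumentation on top; list_mod is the illustrative module from the conversion sketch above):

    program bench_append
      use list_mod
      implicit none
      integer, parameter :: n = 1000000
      type(plist)         :: list
      type(particle_type) :: p
      integer(8)          :: t0, t1, rate
      integer             :: i
      p%vec = 0.0
      p%mom = 0.0
      call system_clock(count_rate=rate)
      call system_clock(t0)
      do i = 1, n
        call add_one_to_list(list, p)
      end do
      call system_clock(t1)
      print '(a,f8.1,a)', 'time per particle: ', &
            real(t1 - t0) / real(rate) / real(n) * 1.0e9, ' ns'
    end program bench_append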

  15. Further Investigation/Mitigation
The MPI version shipped with PGI on POWER is slow because it issues many instructions. Further study: identical assembly code is generated as for the MPI-less version… but it includes a call to malloc()! Different libraries are linked in the PGI and PGIMPI cases; LD_PRELOAD=/lib64/libc.so.6 solves the problem. ⇒ A slow MPI-aware malloc()?
Mitigation:
— Bug reported
— For now: consider it anomalous overhead

  16. Performance Modelling

  17. Effective Bandwidth: Defining the Model
Goal: compare different GPU architectures and understand the behavior of JuSPIC. The model is based on the information exchanged by the GPU kernel:
    t(N_part) = α + I(N_part) / β
N_part: number of particles processed
I: information exchanged (572 B read + 40 B written), i.e. the amount of exchanged information for a given number of particles
t: kernel runtime, the time for the exchange
α, β: fit parameters; β is the effective bandwidth

  18. Effective Bandwidth: Measurements
Figure: minimum kernel duration t (µs) against information exchange I (MB, up to 50 MB), with fits
    K20:   t = 19.05 + I / 0.077
    K40:   t = 14.97 + I / 0.095
    ½ K80: t = 14.5  + I / 0.1
    P100:  t = 21.26 + I / 0.285
Resulting effective bandwidths: K20: 77 GB/s (31 %), K40: 95 GB/s (33 %), ½ K80: 100 GB/s (42 %), P100: 285 GB/s (40 %).
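
As a small worked example (a sketch: it assumes the 572 B + 40 B of exchanged information are per particle and uses the P100 fit parameters above, with I in MB and t in µs):

    program eval_model
      implicit none
      real, parameter :: alpha = 21.26, beta = 0.285        ! P100 fit: t = alpha + I/beta
      real, parameter :: bytes_per_particle = 572.0 + 40.0  ! read + write
      integer :: n_part
      real    :: i_mb, t_us
      n_part = 50000
      i_mb = real(n_part) * bytes_per_particle / 1.0e6      ! exchanged information in MB
      t_us = alpha + i_mb / beta                            ! predicted kernel duration in us
      print '(a,i0,a,f6.1,a,f7.1,a)', 'N_part = ', n_part, ':  I = ', i_mb, &
            ' MB  ->  t = ', t_us, ' us'
    end program eval_model

For 50000 particles this gives I of about 30.6 MB and t of about 129 µs; the 0.285 MB/µs fit parameter is the 285 GB/s effective bandwidth quoted above.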

  19. Clock Dependency: Defining the Relation
Another free parameter is the GPU clock rate, which varies significantly across GPU architecture generations and models → incorporate the clock into the performance model:
    β(C) = γ + δ C
C: GPU clock rate
β: effective bandwidth (from before)
γ, δ: fit parameters

  20. Clock Dependency: Measurements
Figure: effective bandwidth (GB/s) over the graphics clock frequency (600 to 1400 MHz) for K40, ½ K80, and P100. Fitted slopes: K40: 0.106 (GB/s)/MHz; ½ K80: 0.138 (GB/s)/MHz and 0.037 (GB/s)/MHz (the two regions noted earlier); P100: 0.146 (GB/s)/MHz.
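As a worked example with these slopes: raising the graphics clock by 100 MHz gains roughly 15 GB/s of effective bandwidth on the P100 (0.146 (GB/s)/MHz) and about 11 GB/s on the K40, but only about 4 GB/s in the flat ½ K80 region (0.037 (GB/s)/MHz).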

  21. Summary, Conclusion
Summary
— Enabled JuSPIC for GPU with OpenACC & CUDA Fortran
— Particle data layout: SoA fastest
— Slow memory allocation with PGI+MPI on POWER → bug filed
— Performance model based on information exchange (P100: 285 GB/s)
— Studied the model with different clock rates: P100 scales most efficiently
Future
— Port the Reducer to the GPU as well
— Enable MPI again
— Alternatives to the linked list
Thank you for your attention! a.herten@fz-juelich.de

  22. Appendix
Acknowledgements, Related Work, OpenACC Performance Progression, Linked List: Remove on JURON, Selected Performance Counters on JURON, References, Glossary

  23. Acknowledgements
The work was done in the context of two groups: the POWER Acceleration and Design Centre, a collaboration of IBM, NVIDIA, and Forschungszentrum Jülich, and the NVIDIA Application Lab, a collaboration of NVIDIA and Forschungszentrum Jülich. Many thanks to Jiri Kraus from NVIDIA, who helped tremendously along the way. JURON, a prototype system for the Human Brain Project, received co-funding from the European Union (Grant Agreement No. 604102).

  24. Related Work
Selection of other GPU PiC codes:
— PSC: the code JuSPIC is based on has been reimplemented in C and ported to GPU [4]
— PIConGPU: a PiC code specifically developed for GPUs [5]
Minsky porting experiences:
— "Addressing Materials Science Challenges Using GPU-accelerated POWER8 Nodes" [6]
— "A Performance Model for GPU-Accelerated FDTD Applications" [7]
… more in the paper!

  25. OpenACC Performance Progression
Figure: runtime (µs) of the successive pusher versions, annotated with the speedup of the kernel relative to the no-GPU version and the speedup of the full pusher relative to OpenACC (min. unrolled); the annotated factors range from 3×, 5×, 10×, and 15× up to 20×-24×. See the GTC poster for details [8].

  26. Linked List: Time for Remove on JURON
Figure: normalized runtimes of the linked-list remove step for the PGI compiler with and without MPI (runtime per particle in units of 1e-7 s, up to about 1.75) on JUHYDRA, JURECA, and JURON, for 100000 to 100000000 particles.

  27. Selected Performance Counters on JURON
Figure: PAPI counter values per particle (PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L1_DCM, PAPI_STL_ICY) for gfortran, mpifort, and pgfortran on JURON, for 10000 to 10000000 particles.

  28. References: Images, Graphics I
[1] Forschungszentrum Jülich. Hightech made in 1960: A view into the control room of DIDO. URL: http://historie.fz-juelich.de/60jahre/DE/Geschichte/1956-1960/Dekade/_node.html.
[2] Forschungszentrum Jülich. Forschungszentrum Bird's Eye.

  29. References I
[3] Phil Mucci and The ICL Team. PAPI, the Performance Application Programming Interface. URL: http://icl.utk.edu/papi/ (visited on 04/30/2017).
[4] K. Germaschewski et al. "The Plasma Simulation Code: A modern particle-in-cell code with load-balancing and GPU support". In: ArXiv e-prints (Oct. 2013). arXiv: 1310.7866 [physics.plasm-ph].
[5] M. Bussmann et al. "Radiative signature of the relativistic Kelvin-Helmholtz Instability". In: 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC). Nov. 2013, pp. 1–12. DOI: 10.1145/2503210.2504564.

  30. References II
[6] Paul F. Baumeister et al. "Addressing Materials Science Challenges Using GPU-accelerated POWER8 Nodes". In: Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings. Ed. by Pierre-François Dutot and Denis Trystram. Cham: Springer International Publishing, 2016, pp. 77–89. ISBN: 978-3-319-43659-3. DOI: 10.1007/978-3-319-43659-3_6. URL: http://dx.doi.org/10.1007/978-3-319-43659-3_6.

  31. References III
[7] P. F. Baumeister et al. "A Performance Model for GPU-Accelerated FDTD Applications". In: 2015 IEEE 22nd International Conference on High Performance Computing (HiPC). Dec. 2015, pp. 185–193. DOI: 10.1109/HiPC.2015.24.
[8] Andreas Herten, Dirk Pleiter, and Dirk Brömmel. Accelerating Plasma Physics with GPUs (Poster). Tech. rep. GPU Technology Conference, 2017.
[9] Philip J. Mucci et al. "PAPI: A Portable Interface to Hardware Performance Counters". In: Proceedings of the Department of Defense HPCMP Users Group Conference. 1999, pp. 7–10.

  32. Glossary I
CUDA: computing platform for GPUs from NVIDIA; provides, among others, CUDA C/C++.
FZJ: Forschungszentrum Jülich, a research center in the west of Germany.
JSC: Jülich Supercomputing Centre, operating a number of large and small supercomputers and connected infrastructure at FZJ.
JuSPIC: Jülich Scalable Particle-in-Cell Code.

  33. Glossary II
MPI: the Message Passing Interface, a communication/message-passing application programmer interface.
NVIDIA: US technology company creating GPUs.
NVLink: NVIDIA's communication protocol connecting CPU ↔ GPU and GPU ↔ GPU with 80 GB/s (PCI-Express: 16 GB/s).
OpenACC: directive-based programming, primarily for many-core machines.

  34. Glossary III
P100: a large GPU with the Pascal architecture from NVIDIA; it employs NVLink as its interconnect and has fast HBM2 memory.
PAPI: the Performance API, an interface for accessing performance counters, also with aliased names; cross-platform [3, 9].
Pascal: the latest available GPU architecture from NVIDIA.
PGI: formerly The Portland Group, Inc.; since 2013 part of NVIDIA.
PiC: Particle in Cell; a method applied in a group of (plasma) physics simulations to solve partial differential equations.
