

  1. How to Write a Parallel GPU Application Using CUDA and Charm++
     Presented by Lukasz Wesolowski

  2. Outline
     • GPGPUs and CUDA
     • Requirements for a GPGPU API (from a Charm++ standpoint)
     • CUDA stream approach
     • Charm++ GPU Manager

  3. General Purpose GPUs
     • Graphics chips adapted for general-purpose programming
     • Impressive floating point performance
       – 4.6 Tflop/s single precision (AMD Radeon HD 5970)
       – Compared to about 100 Gflop/s for a 3 GHz quad-core, quad-issue CPU
     • Throughput oriented
     • Good for large-scale data parallelism

  4. CUDA
     • A popular hardware/software architecture for GPGPUs
     • Supported on NVIDIA GPUs
     • Programmed in C with extensions for large-scale data parallelism
     • The CPU offloads units of work to the GPU and manages their execution (see the sketch below)
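  As a point of reference, here is a minimal sketch of the offload model the slide describes: the host allocates device memory, copies data over, launches a data-parallel kernel, and copies the result back. The kernel and function names are illustrative, not from the presentation.

     #include <cuda_runtime.h>

     /* One GPU thread per array element: the data-parallel unit of work. */
     __global__ void scaleKernel(float *data, float factor, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) data[i] *= factor;
     }

     /* The CPU side: allocate, copy in, launch, copy out. */
     void scaleOnGPU(float *hostData, int n) {
         float *devData;
         size_t bytes = n * sizeof(float);
         cudaMalloc(&devData, bytes);
         cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);
         int threads = 256, blocks = (n + threads - 1) / threads;
         scaleKernel<<<blocks, threads>>>(devData, 2.0f, n);
         cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);
         cudaFree(devData);
     }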

  5. API Requirements
     • GPU operations should not block the CPU (see the contrast sketched below)
       – blocking wastes CPU cycles and delays the processing of incoming messages
     • Chares should be able to share the GPU without synchronizing with each other
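  To make the first requirement concrete, here is a small contrast between a blocking and a non-blocking transfer (buffer names are illustrative): the blocking call stalls the CPU for the duration of the copy, while the asynchronous variant returns as soon as the copy is queued, leaving the CPU free to process messages.

     #include <cuda_runtime.h>

     /* Blocking: the CPU stalls inside cudaMemcpy until the copy finishes. */
     void blockingTransfer(float *dev, const float *host, size_t bytes) {
         cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
     }

     /* Non-blocking: returns once the copy is queued in the stream.
        The host buffer must be pinned (cudaMallocHost) for the copy
        to be truly asynchronous. */
     void nonBlockingTransfer(float *dev, const float *host, size_t bytes,
                              cudaStream_t stream) {
         cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
     }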

  6. Direct Approach
     • User makes CUDA calls directly in Charm++ code
     • CUDA streams
       – allow specifying an order of execution for a set of asynchronous GPU operations
       – operations in different streams can overlap in execution
     • User assigns a unique CUDA stream to each chare and makes polling or synchronization calls to determine completion of operations (a sketch follows)
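  A minimal sketch of the direct approach, assuming a hypothetical chare that owns its own buffers: the chare creates a private stream, issues its transfers and kernel launch asynchronously into that stream, and periodically polls for completion instead of blocking. The struct and kernel names are illustrative, not part of Charm++.

     #include <cuda_runtime.h>

     __global__ void myKernel(float *data, int n) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) data[i] += 1.0f;   /* placeholder computation */
     }

     /* Hypothetical per-chare GPU state: a private stream lets chares
        share the GPU without synchronizing with each other. */
     struct ChareGPUState {
         cudaStream_t stream;
         float *hostBuf;   /* pinned via cudaMallocHost */
         float *devBuf;
         size_t bytes;
         int n;
     };

     /* Issue all of the chare's GPU work asynchronously; returns at once. */
     void issueWork(ChareGPUState &s) {
         int threads = 256, blocks = (s.n + threads - 1) / threads;
         cudaMemcpyAsync(s.devBuf, s.hostBuf, s.bytes,
                         cudaMemcpyHostToDevice, s.stream);
         myKernel<<<blocks, threads, 0, s.stream>>>(s.devBuf, s.n);
         cudaMemcpyAsync(s.hostBuf, s.devBuf, s.bytes,
                         cudaMemcpyDeviceToHost, s.stream);
     }

     /* Poll from a periodically invoked entry method; never blocks. */
     bool workDone(const ChareGPUState &s) {
         return cudaStreamQuery(s.stream) == cudaSuccess;
     }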

  7. Problems with Direct Approach
     • Each chare must poll for completion of GPU operations
       – tedious
       – inefficient
     • Streams need to be carefully managed to allow overlap of GPU operations

  8. Stream Management
     • Common stream usage:
       1. CPU → GPU data transfer
       2. kernel_call
       3. GPU → CPU data transfer
     • The third operation blocks the DMA engine until the kernel has finished
     • This can be avoided by delaying the GPU → CPU transfer until the kernel completes (as sketched below)
       – requires an additional polling call
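  A sketch of the delayed-transfer pattern, under the same illustrative names as above: the first call enqueues only the host-to-device copy and the kernel; once polling shows the stream has drained, a second call enqueues the device-to-host copy, so the DMA engine is not left waiting behind the kernel.

     #include <cuda_runtime.h>

     __global__ void work(float *data, int n) { /* ... kernel body ... */ }

     /* Phase 1: enqueue only the upload and the kernel. */
     void startWork(cudaStream_t s, float *dev, float *host,
                    size_t bytes, int n) {
         cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, s);
         work<<<(n + 255) / 256, 256, 0, s>>>(dev, n);
     }

     /* Phase 2: the extra polling step; enqueue the download only after
        the kernel has finished, keeping the DMA engine free meanwhile. */
     bool tryFinishWork(cudaStream_t s, float *dev, float *host, size_t bytes) {
         if (cudaStreamQuery(s) != cudaSuccess)
             return false;                       /* kernel still running */
         cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, s);
         return true;
     }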

  9. Overview of GPU Manager
     • User submits requests specifying the work to be executed on the GPU, the associated buffers, and a callback (a hypothetical sketch follows)
     • System transfers memory between CPU and GPU, executes the request, and returns control through the callback
     • GPU operations are performed asynchronously
     • Pipelined execution
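  The slide describes the GPU Manager at the level of its interface. The sketch below is a hypothetical rendering of that interface, not the actual GPU Manager API: the user fills in a work request naming the kernel configuration, the buffers to stage, and a Charm++ callback, then hands the request to the runtime and returns immediately.

     #include <charm++.h>      /* for CkCallback */
     #include <cuda_runtime.h> /* for dim3 */

     /* Hypothetical work-request descriptor; field names are illustrative. */
     struct WorkRequest {
         dim3 grid, block;      /* kernel launch configuration */
         void *hostBuf;         /* user buffer to stage to/from the GPU */
         size_t bytes;
         bool copyToDevice;     /* transfer before the kernel runs */
         bool copyToHost;       /* transfer after the kernel finishes */
         CkCallback callback;   /* invoked when the result is back */
     };

     /* Assumed runtime hook: queues the request and returns immediately.
        The GPU Manager pipelines the upload, launch, and download across
        requests and fires the callback on completion. */
     void submitToGPUManager(const WorkRequest &wr);

  In use, a chare would fill in the request, set its callback to one of its own entry methods, and call submitToGPUManager(); the chare then continues handling messages until the callback is delivered, with no polling in user code.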

  10. Execution of Work Requests (diagram)

  11. GPU Manager Advantages
     • No polling calls in user code
       – simpler code
       – more efficient
     • System ensures overlap of GPU operations
       – scheduling of pinned memory allocations
     • GPU profiling in Projections
