• /home/ytang/slides
• /home/ytang/exercise – make your own copy!
• /home/ytang/solution
• http://docs.nvidia.com/cuda/index.html
• P ∝ V²·f  (dynamic power of a CMOS circuit scales with supply voltage squared times clock frequency)
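A quick illustrative calculation based on the P ∝ V²·f relation above (the specific voltage values are assumptions, not from the slides): lowering the supply voltage from 1.0 V to 0.8 V at the same frequency reduces dynamic power to 0.8² = 0.64 of its original value, a 36% saving. This is why many slower, lower-voltage cores can deliver more throughput per watt than one fast, high-voltage core.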
• a = b + c;  d = c + a;  f = c + e;
  (d = c + a must wait for the result of a = b + c, while f = c + e depends on neither and can execute in parallel)
• Energy cost per operation:
  64-bit DP FMA             20 pJ
  256-bit on-chip SRAM      50 pJ
  256-bit off-chip DRAM     16 nJ
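To put the table in perspective: a 256-bit off-chip DRAM access (16 nJ = 16 000 pJ) costs roughly 800× the energy of a 20 pJ double-precision FMA, so keeping data on chip and doing as much arithmetic as possible per byte moved is central to energy-efficient GPU code.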
Rank  Name                          GFLOPS/W  Configuration
1     L-CSC                         5.3       ASUS ESC4000 FDR/G2S, Intel Xeon E5-2690v2 10C 3GHz, Infiniband FDR, AMD FirePro S9150
2     Suiren                        4.9       ExaScaler 32U256SC Cluster, Intel Xeon E5-2660v2 10C 2.2GHz, Infiniband FDR, PEZY-SC
3     Tsubame-KFC                   4.5       Intel Xeon E5-2620v2 6C 2.100GHz, Infiniband FDR, NVIDIA K20x
4     Storm1                        4.0       Cray CS-Storm, Intel Xeon E5-2660v2 10C 2.2GHz, Infiniband FDR, NVIDIA K40m
5     Wilkes                        3.6       Intel Xeon E5-2630v2 6C 2.600GHz, Infiniband FDR, NVIDIA K20
6     iDataPlex DX360M4             3.5       Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
7     HA-PACS TCA                   3.5       Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband QDR, NVIDIA K20x
8     Cartesius Accelerator Island  3.5       Bullx B515 cluster, Intel Xeon E5-2450v2 8C 2.5GHz, InfiniBand 4× FDR, NVIDIA K40m
9     Piz Daint                     3.2       Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x
[Chart: peak single- and double-precision FLOPS and off-chip memory bandwidth (GB/s) — Kepler K80 (GeForce/Tesla) vs. Xeon E5-2699 v3 (CPU)]
[Diagram: a GPU is built from many streaming multiprocessors (SMs)]
"Know the enemy and know yourself, and you will never be in peril in a hundred battles." (知己知彼,百战不殆)
↗ (plays to GPU strengths): data parallel, intensive FP arithmetic, fine-grained parallelism
↘ (better left to the CPU): task parallel, thread dependencies, serial work, coarse-grained parallelism
• Language extensions: C, C++, Fortran, …
• Directives: OpenACC, OpenMP 4, …
• Libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, …
• Scripting: PyCUDA, MATLAB, …
• MATLAB example:
  m = magic(64);      % m is on CPU
  M = gpuArray( m );  % M is on GPU now
  n = fft2( m );      % FFT on CPU
  N = fft2( M );      % FFT on GPU
  L = gather( N );    % transfer N back to CPU
  find( abs( L - n ) > 1e-9 );
Feature                                       Availability   Remark
Control flow                                  Y
Built-in data types (char, int, float, etc.)  Y              vector types: int2, float4, …
Built-in operators                            Y              including new/delete
Overloading                                   Y
Object-oriented programming                   Y              inheritance, virtual methods
Templates                                     Y
C standard library                            Partial        printf, malloc, free supported
C++ standard library                          N
C++11 extensions                              Y              variadic templates, lambda
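As a small illustration of several rows in the table (a minimal sketch written for this tutorial, not taken from the slides): a templated kernel that uses device-side printf.

#include <cstdio>
#include <cuda_runtime.h>

// templates, control flow, and device-side printf are all usable in device code
template<typename T>
__global__ void show( T value )
{
    printf( "value = %g\n", (double) value );
}

int main()
{
    show<float> <<< 1, 1 >>>( 3.14f );
    show<double><<< 1, 1 >>>( 2.718 );
    cudaDeviceSynchronize();
    return 0;
}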
• Compile with nvcc: nvcc -arch=sm_35 hello.cu -o hello.x
• -arch=sm_XX selects the target compute capability: 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 3.0, 3.5*, 5.0
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void hello_gpu()
{
    printf( "\"Hello, world!\", says the GPU.\n" );
}

void hello_cpu()
{
    printf( "\"Hello, world!\", says the CPU.\n" );
}

// host code entrance
int main( int argc, char **argv )
{
    hello_cpu();
    hello_gpu<<< 1, 1 >>>();
    cudaDeviceSynchronize();
}
[Diagram: hardware/software view of heterogeneous execution — the CPU with its RAM runs the serial parts (init, serial work 1, serial work 2, finalize), while the GPU with its GRAM runs the parallel parts (parallel work 1, parallel work 2)]
• divide et impera (divide and conquer)
[Diagram: a grid consists of blocks — here Block(0,0) … Block(2,1) in a 3×2 layout; each block consists of threads — here Thread(0,0) … Thread(3,3) in a 4×4 layout]
• A kernel is a function marked __global__; every launched thread executes it:

  // each thread will print once
  __global__ void hello()
  {
      printf( "\"Hello, world!\", says the GPU.\n" );
  }

• Launch configuration: kernel<<< numBlocks, threadsPerBlock >>>( args );
• Each thread can locate itself via threadIdx and the related built-in variables (next slides).
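A minimal launch sketch (illustrative, not from the original slides): two blocks of three threads each, so the kernel body runs six times.

#include <cstdio>
#include <cuda_runtime.h>

// each of the numBlocks * threadsPerBlock threads prints exactly once
__global__ void hello()
{
    printf( "\"Hello, world!\", says the GPU.\n" );
}

int main()
{
    int numBlocks = 2, threadsPerBlock = 3;
    hello<<< numBlocks, threadsPerBlock >>>();   // 2 * 3 = 6 lines of output
    cudaDeviceSynchronize();
    return 0;
}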
• __global__ — a kernel: runs on the device, launched from the host
• __device__ — runs on the device, callable only from device code
• __host__ — runs on the host (the default for unqualified functions)
• __device__ __host__ — compiled for both host and device, e.g.:

  __inline__ __host__ __device__
  double force( double x ) {
      return -0.5 * K * ( x - x0 );
  }
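A small sketch showing that a __host__ __device__ function such as force() can be called from both a kernel and ordinary host code (the constants K and X0 are assumed here purely for illustration):

#include <cstdio>
#include <cuda_runtime.h>

#define K  1.0   // assumed spring constant, for illustration
#define X0 0.0   // assumed equilibrium position, for illustration

__inline__ __host__ __device__ double force( double x ) { return -0.5 * K * ( x - X0 ); }

__global__ void force_on_gpu( double x ) { printf( "GPU: force(%g) = %g\n", x, force( x ) ); }

int main()
{
    printf( "CPU: force(%g) = %g\n", 2.0, force( 2.0 ) );   // same function on the host
    force_on_gpu<<< 1, 1 >>>( 2.0 );                        // same function on the device
    cudaDeviceSynchronize();
    return 0;
}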
• struct dim3 { uint x, y, z; };
• Built-in index variables available inside a kernel:
  threadIdx — thread index within the current block
  blockIdx  — block index within the current grid
  blockDim  — block size
  gridDim   — grid size, i.e. number of blocks in each dimension
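A common use of these variables is to compute a global 1D thread index. The formula below is the standard one; the course's util.h helper global_thread_id() (used in the later example) presumably does something equivalent.

__device__ int global_thread_id_1d()
{
    return blockIdx.x * blockDim.x + threadIdx.x;   // block offset + position within the block
}

__global__ void scale( double *y, double a, int n )
{
    int i = global_thread_id_1d();
    if ( i < n ) y[i] *= a;    // guard against launching more threads than elements
}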
• Allocation / deallocation of device memory:
  cudaError_t cudaMalloc ( void** devPtr, size_t size );
  cudaError_t cudaFree ( void* devPtr );
• Device-side malloc/new/free/delete are also available inside kernels
• Device memory is addressed like ordinary memory in device code: ptr[ index ] = value;
• Transfers and initialization:
  cudaError_t cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind );
  cudaError_t cudaMemset ( void* devPtr, int value, size_t count );
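A minimal round-trip sketch tying these calls together (illustrative only):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one( int *p, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) p[i] += 1;
}

int main()
{
    const int n = 256;
    int hst[n], *dev;
    for ( int i = 0; i < n; i++ ) hst[i] = i;

    cudaMalloc( &dev, n * sizeof( int ) );                              // allocate on GPU
    cudaMemcpy( dev, hst, n * sizeof( int ), cudaMemcpyHostToDevice );  // host -> device
    add_one<<< 1, n >>>( dev, n );                                      // each thread updates one element
    cudaMemcpy( hst, dev, n * sizeof( int ), cudaMemcpyDeviceToHost );  // device -> host
    cudaFree( dev );                                                    // release GPU memory

    printf( "hst[0] = %d, hst[%d] = %d\n", hst[0], n - 1, hst[n - 1] ); // expect 1 and 256
    return 0;
}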
• f(x) = sin(2x) · cos(7x) · eˣ,  x ∈ [0,1]

#include <cstdio>
#include <iostream>
#include <vector>
#include <limits>
#include <cuda.h>
#include <cuda_runtime.h>
#include <omp.h>
#include "../util/util.h"

__inline__ __host__ __device__ double f( double x ) {
    return sin( 2.0*x ) * cos( 7.0*x ) * exp( x );
}

__global__ void evaluate( double *y, const int n )
{
    int i = global_thread_id();
    y[i] = f( (double)i / (double)n );
}

// host code entrance
int main( int argc, char **argv )
{
    int N = 128 * 1024 * 1024;
    // timing register
    double t_CPU_0, t_CPU_1, t_GPU_0, t_GPU_1, t_GPU_2;

    // allocate host memory
    double *hst_y, *ref_y;
    hst_y = new double[N];
    ref_y = new double[N];
    // allocate device memory
    double *dev_y;
    cudaMalloc( &dev_y, N * sizeof( double ) );

    t_GPU_0 = get_time();
    // do computation on GPU
    evaluate <<< N / 1024, 1024 >>> ( dev_y, N );
    cudaDeviceSynchronize();
    t_GPU_1 = get_time();
    // copy result back to CPU
    cudaMemcpy( hst_y, dev_y, N * sizeof( double ), cudaMemcpyDefault );
    t_GPU_2 = get_time();

    t_CPU_0 = get_time();
    // calculate reference value
    #pragma omp parallel for
    for( int i = 0; i < N; i++ ) ref_y[i] = f( (double)i / (double)N );
    t_CPU_1 = get_time();

    // compare
    bool match = true;
    for( int i = 0; i < N; i++ ) {
        match = match && ( fabs( ref_y[i] - hst_y[i] ) < 8 * std::numeric_limits<double>::epsilon() );
    }

    // output
    std::cout << "Computation on CPU took " << t_CPU_1 - t_CPU_0 << " secs." << std::endl;
    std::cout << "Computation on GPU took " << t_GPU_1 - t_GPU_0 << " secs." << std::endl;
    std::cout << "Data transfer from GPU took " << t_GPU_2 - t_GPU_1 << " secs." << std::endl;
    std::cout << "CPU/GPU result match: " << ( match ? "YES" : "NO" ) << std::endl;

    // free up resources
    delete [] hst_y;
    delete [] ref_y;
    cudaDeviceReset();
}
• Exercise: compute a·x + y (AXPY)
• a: a scalar
• x, y: vectors
• How would you implement this on the GPU?
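One possible sketch of the kernel (a straightforward outline, not necessarily the code in /home/ytang/solution):

__global__ void axpy( double a, const double *x, double *y, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) y[i] = a * x[i] + y[i];   // each thread handles one element
}

// launch example: n elements, 256 threads per block
// axpy<<< ( n + 255 ) / 256, 256 >>>( a, dev_x, dev_y, n );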
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void hello_gpu()
{
    printf( "\"Hello, world!\", says GPU block (%d,%d) thread (%d,%d).\n",
            blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y );
}

void hello_cpu()
{
    printf( "\"Hello, world!\", says the CPU.\n" );
}

// host code entrance
int main( int argc, char **argv )
{
    hello_cpu();

    printf( "launching 2x2 blocks each containing 4 threads\n" );
    hello_gpu <<< dim3( 2, 2, 1 ), dim3( 4, 1, 1 ) >>> ();
    cudaDeviceSynchronize();

    printf( "launching 2x2 blocks each containing 2x2 threads\n" );
    hello_gpu <<< dim3( 2, 2, 1 ), dim3( 2, 2, 1 ) >>> ();
    cudaDeviceSynchronize();
}
• Exercise: f(x, y) = sin(5x) · cos(16y) · eˣ,  x ∈ [0,1], y ∈ [0,1] — how would you evaluate this on a 2D grid of points?
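One way to sketch it (an outline assuming the function is sampled on an nx × ny grid; the names are illustrative, not from the solutions):

__inline__ __host__ __device__ double f2( double x, double y )
{
    return sin( 5.0 * x ) * cos( 16.0 * y ) * exp( x );
}

__global__ void evaluate2d( double *z, int nx, int ny )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // index along x
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // index along y
    if ( i < nx && j < ny )
        z[ j * nx + i ] = f2( (double)i / nx, (double)j / ny );
}

// launch example: 16x16 threads per block, enough blocks to cover the grid
// dim3 block( 16, 16, 1 );
// dim3 grid( ( nx + 15 ) / 16, ( ny + 15 ) / 16, 1 );
// evaluate2d<<< grid, block >>>( dev_z, nx, ny );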
• On a __shared__ variable, sum += b; is not one instruction — the compiler emits a read-modify-write sequence:

  __shared__ int sum;
  int b = ...;
  sum += b;

  // what actually executes:
  register r = sum;   // read
  r += b;             // modify
  sum = r;            // write

• Two threads doing this concurrently can interleave and lose an update:

  thread 0: int b0 = ...;  register r0 = sum;  r0 += b0;                      sum = r0;
  thread 1: int b1 = ...;                      register r1 = sum;  r1 += b1;            sum = r1;   // overwrites thread 0's update

• Atomic operations perform the whole read-modify-write as one indivisible step
• modify = add, sub, exchange, etc. (atomicAdd, atomicSub, atomicExch, …)
• Also available for float (e.g. atomicAdd on float)
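A minimal sketch of the fix using an atomic read-modify-write (the kernel and its names are illustrative):

__global__ void count_positive( const float *x, int n, int *counter )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n && x[i] > 0.0f )
        atomicAdd( counter, 1 );   // hardware serializes the read-modify-write, so no update is lost
}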
• Reduction: S = Σ_{i=0..n−1} a_i
• Serial version:

  for( int i = 0; i < n; i++ ) sum += a[i];

• Every iteration depends on the previous one through sum — the loop cannot be parallelized as written.
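A common parallel approach is a block-level tree reduction in shared memory. The sketch below is one standard pattern (assuming 256 threads per block, a power of two), not necessarily the exact code in the solutions:

__global__ void block_reduce( const double *a, double *partial, int n )
{
    __shared__ double buf[256];                       // one slot per thread; assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    buf[tid] = ( i < n ) ? a[i] : 0.0;                // load one element (or 0 past the end)
    __syncthreads();

    for ( int stride = blockDim.x / 2; stride > 0; stride /= 2 ) {
        if ( tid < stride ) buf[tid] += buf[tid + stride];   // pairwise tree reduction
        __syncthreads();
    }

    if ( tid == 0 ) partial[blockIdx.x] = buf[0];     // one partial sum per block
}

// host side: launch with 256 threads per block, then combine the per-block partial sums
// (e.g. on the CPU, with a second reduction pass, or with atomicAdd).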