Applications of Programming the GPU Directly from Python Using NumbaPro
Travis E. Oliphant, Ph.D.
Supercomputing 2013, November 20, 2013
Introduction
• Focus: data analysis, visualization, data processing, and scalable scientific computing
• Products: Anaconda (free Python distribution; Enterprise version available, which includes GPU support) and Wakari (scientific Python in your browser; available to install in your data center)
• Services: training, support, and consulting
Big Picture
Empower domain experts, subject matter experts, and other occasional programmers with high-level tools that exploit modern hardware: Array-Oriented Computing.
Why Array-Oriented Computing
• Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in the domain
• Can take advantage of parallelism and accelerators like the GPU
[Diagram: a collection of objects, each with Attr1/Attr2/Attr3, rearranged as per-attribute arrays indexed by Object1...Object6]
Why Python
• License: free and open source, permissive license
• IPython: an interactive prompt on steroids (Notebook); allows less working memory; allows failing quickly for exploration
• Community: broad and friendly; over 36,000 packages on PyPI; commercial support; many conferences (PyData, SciPy, PyCon...)
• Modern constructs: list comprehensions; iterator protocol and generators; meta-programming; introspection; JIT compiler and concurrency (Numba)
• Readable syntax: executable pseudo-code; can understand and edit code a year later; fun to develop; use of indentation
• Batteries included: Internet (FTP, HTTP, SMTP, XMLRPC); compression and databases; great visualization tools (Bokeh, Matplotlib, etc.); powerful libraries for STEM; integration with C/C++/Fortran
Breaking the Speed Barrier (Numba!)
Rapid iteration and development + fast code execution = the ideal combination!
• Python syntax, but no GIL
• Native code speed for numerical computing (NumPy code)
Numba aims to be the world's best array-oriented compiler.
NumPy + Mamba = Numba
[Diagram: Python Function → LLVM-PY → LLVM Library → Machine Code; the same LLVM infrastructure underlies CLANG, ISPC, OpenCL, OpenMP, and CUDA, targeting Intel, AMD, Nvidia, Apple, and ARM hardware]
Example

    from numba import jit
    from math import sin, pi

    @jit('f8(f8)')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)
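The compiled function is called like any Python function; a quick sanity check against NumPy's normalized sinc (a sketch, assuming NumPy is installed):

    import numpy as np
    # the jit-compiled sinc should match NumPy's normalized sinc
    assert abs(sinc(0.5) - np.sinc(0.5)) < 1e-12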
Compiler Overview
Like the C, C++, and Fortran front ends, Numba emits LLVM IR, which LLVM then lowers to x86, ARM, or PTX machine code. Numba turns Python into a "compiled language".
~150x speed-up: real-time image processing in Python (50 fps Mandelbrot)
Anaconda Accelerate
Fast development and execution: the Python and NumPy stack compiled to parallel architectures (GPUs and multi-core machines).
• Compile NumPy array expressions for the CPU and GPU
• Create parallel-for loops
• Parallel execution of ufuncs
• Run ufuncs on the GPU
• Write CUDA directly in Python!
• Requires CUDA 5.5

    $ conda install accelerate
NumbaPro Features
• CUDA Python
• Vectorize --- NumPy functions on the GPU
• Array expressions
• Parallel-for loops
• Access to fast libraries (cuRAND, cuFFT, cuBLAS)
Compile NumPy array expressions

    import numbapro  # import first to enable NumbaPro features
    from numba import autojit

    @autojit
    def formula(a, b, c):
        a[1:, 1:] = a[1:, 1:] + b[1:, :-1] + c[1:, :-1]

    @autojit
    def express(m1, m2):
        m2[1:-1:2, 0, ..., ::2] = (m1[1:-1:2, ..., ::2] *
                                   m1[-2:1:-2, ..., ::2])
        return m2
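A minimal driver for the functions above (a sketch; the array shapes are illustrative, chosen so the strided slices line up):

    import numpy as np

    a = np.ones((5, 5)); b = np.ones((5, 5)); c = np.ones((5, 5))
    formula(a, b, c)              # updates a[1:, 1:] in place

    m1 = np.arange(36.0).reshape(6, 6)
    m2 = np.zeros((6, 2, 6))
    express(m1, m2)               # fills strided slices of m2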
Create parallel-for loops
The "prange" directive spawns compiled tasks in threads (like the OpenMP parallel-for pragma). A usage sketch follows the code.

    import numbapro  # import first to make prange available
    from numba import autojit, prange

    @autojit
    def parallel_sum2d(a):
        sum = 0.0
        for i in prange(a.shape[0]):
            for j in range(a.shape[1]):
                sum += a[i, j]
        return sum
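Calling it is identical to the serial version; a small usage sketch (np.isclose because the parallel summation order may differ from NumPy's):

    import numpy as np
    a = np.arange(12.0).reshape(3, 4)
    total = parallel_sum2d(a)         # rows summed across threads
    assert np.isclose(total, a.sum())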
Fast vectorize
NumPy's ufuncs take "kernels" and apply the kernel element-by-element over entire arrays. Write kernels in Python!

    import numpy
    from numbapro import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'])
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    x = numpy.linspace(-5, 5, 100)
    y = sinc(x)
Ufuncs in parallel (multi-thread or GPU)

    import numpy
    from numbapro import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    @vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
    def sinc2(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    x = numpy.linspace(-5, 5, 1000)
    y = sinc(x)   # On GPU
    z = sinc2(x)  # Multiple CPUs
Example Benchmark
The overhead of memory transfer is overcome after about 1 million floats for a simple computation; there is about 1 ms of overhead for memory transfer and set-up.
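One way to observe this crossover is to time the gpu-target sinc from the previous slide across array sizes (a rough sketch; exact numbers depend on hardware):

    import time
    import numpy as np

    for n in (10**4, 10**5, 10**6, 10**7):
        x = np.linspace(-5, 5, n)
        t0 = time.time()
        y = sinc(x)  # includes host->device and device->host transfer
        print(n, time.time() - t0)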
Using Vectorize

    from numbapro import vectorize

    sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

    @vectorize([sig], target='gpu')
    def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
        pixel_size_x = (max_x - min_x) / width
        pixel_size_y = (max_y - min_y) / height
        x = tid % width
        y = tid / width
        real = min_x + x * pixel_size_x
        imag = min_y + y * pixel_size_y
        c = complex(real, imag)
        z = 0.0j
        for i in range(iters):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) >= 4:
                return i
        return 255

Kind          Time (s)   Speed-up
Python        263.6      1.0x
CPU           2.639      100x
Tesla S2050   0.1676     1573x
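To render a frame, the ufunc is broadcast over one thread id per pixel; a usage sketch (the view rectangle and iteration count are illustrative):

    import numpy as np
    width, height, iters = 1024, 768, 20
    tids = np.arange(width * height, dtype=np.uint32)
    pixels = mandel(tids, -2.0, 1.0, -1.0, 1.0, width, height, iters)
    image = pixels.reshape(height, width)  # uint8 escape count per pixel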
Grouping Calculations
Create "generalized ufuncs" whose elements are "arrays".

    import numpy as np
    from numbapro import guvectorize

    prototype = "void(float32[:,:], float32[:,:], float32[:,:])"

    @guvectorize([prototype], '(m,n),(n,p)->(m,p)', target='gpu')
    def matmul(A, B, C):
        m, n = A.shape
        n, p = B.shape
        for i in range(m):
            for j in range(p):
                C[i, j] = 0
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]

    matrix_ct = 1000

    # creates an array of 1000 x 2 x 4
    A = np.arange(matrix_ct * 2 * 4,
                  dtype=np.float32).reshape(1000, 2, 4)

    # creates an array of 1000 x 4 x 5
    B = np.arange(matrix_ct * 4 * 5,
                  dtype=np.float32).reshape(1000, 4, 5)

    # outputs an array of 1000 x 2 x 5
    C = matmul(A, B)
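Because the core multiplies each 2x4 slice by the matching 4x5 slice, the result can be checked slice-by-slice against NumPy (an illustrative check):

    import numpy as np
    expected = np.array([np.dot(A[i], B[i]) for i in range(A.shape[0])])
    assert np.allclose(C, expected)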
Using cuBLAS

    import numpy as np
    from numbapro.cudalib import cublas

    N = 128  # matrix dimension (illustrative; N was left undefined on the slide)

    A = np.array(np.arange(N ** 2, dtype=np.float32).reshape(N, N))
    B = np.array(np.arange(N) + 10, dtype=A.dtype)
    D = np.zeros_like(A, order='F')

    # NumPy
    E = np.dot(A, np.diag(B))

    # cuBLAS
    blas = cublas.Blas()
    blas.gemm('T', 'T', N, N, N, 1.0, A, np.diag(B), 1.0, D)
FFT Convolution with cuFFT
Works with 1d, 2d, and 3d arrays.

    from numbapro import vectorize, cuda
    from numbapro.cudalib import cufft

    @vectorize(['complex64(complex64, complex64)'], target='gpu')
    def vmult(a, b):
        return a * b

    # host -> device
    d_img = cuda.to_device(img)    # image
    d_fltr = cuda.to_device(fltr)  # filter

    # FFT forward
    cufft.fft_inplace(d_img)
    cufft.fft_inplace(d_fltr)

    # multiply
    vmult(d_img, d_fltr, out=d_img)  # in place

    # FFT inverse
    cufft.ifft_inplace(d_img)

    # device -> host
    filtered_img = d_img.copy_to_host()
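The snippet assumes img and fltr already exist as complex arrays of the same shape; a minimal setup sketch (sizes and filter are illustrative):

    import numpy as np
    img = np.random.rand(256, 256).astype(np.complex64)
    fltr = np.zeros((256, 256), dtype=np.complex64)
    fltr[:5, :5] = 1.0 / 25.0  # 5x5 box filter, zero-padded to image size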
Monte-Carlo Pricing and cuRAND

    import math
    import numpy as np
    from numbapro import vectorize, cuda
    from numbapro.cudalib import curand

    @vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')
    def step(last, dt, c0, c1, noise):
        return last * math.exp(c0 * dt + c1 * noise)

    def monte_carlo_pricer(paths, dt, interest, volatility):
        n = paths.shape[0]
        # Instantiate cuRAND PRNG
        prng = curand.PRNG(curand.PRNG.MRG32K3A)
        # Allocate device side array
        d_normdist = cuda.device_array(n, dtype=np.double)
        c0 = interest - 0.5 * volatility ** 2
        c1 = volatility * math.sqrt(dt)
        # Simulation loop
        d_last = cuda.to_device(paths[:, 0])
        for j in range(1, paths.shape[1]):
            prng.normal(d_normdist, mean=0, sigma=1)
            d_paths = cuda.to_device(paths[:, j])
            step(d_last, dt, c0, c1, d_normdist, out=d_paths)
            d_paths.copy_to_host(paths[:, j])
            d_last = d_paths

Tuned kernel: the same step written as a CUDA kernel, launched with
blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK and
gridsz = int(math.ceil(float(n) / blksz)):

    from numbapro import jit

    @jit('void(double[:], double[:], double, double, double, double[:])',
         target='gpu')
    def step(last, paths, dt, c0, c1, normdist):
        # expands to i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        i = cuda.grid(1)
        if i >= paths.shape[0]:
            return
        noise = normdist[i]
        paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)
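A hedged driver for the vectorize-based pricer above (path count, step count, and market parameters are illustrative):

    import numpy as np
    n_paths, n_steps = 100000, 100
    dt = 1.0 / n_steps
    paths = np.zeros((n_paths, n_steps), dtype=np.double)
    paths[:, 0] = 100.0  # initial asset price
    monte_carlo_pricer(paths, dt, interest=0.02, volatility=0.3)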
Benchmark http://continuum.io/blog/monte-carlo-pricer
CUDA Python
CUDA development directly in Python.

    import numpy as np
    from numbapro import cuda
    from numba import autojit

    @autojit(target='gpu')
    def array_scale(src, dst, scale):
        tid = cuda.threadIdx.x
        blkid = cuda.blockIdx.x
        blkdim = cuda.blockDim.x
        i = tid + blkid * blkdim
        if i >= dst.shape[0]:
            return
        dst[i] = src[i] * scale

    N = 2 ** 20                      # illustrative size
    block = 256                      # threads per block (illustrative)
    grid = (N + block - 1) // block  # enough blocks to cover N elements

    src = np.arange(N, dtype=np.float64)
    dst = np.empty_like(src)
    array_scale[grid, block](src, dst, 5.0)