Applications of Programming the GPU Directly from Python Using NumbaPro
Travis E. Oliphant, Ph.D.
Supercomputing 2013, November 20, 2013
Introduction
• Focus: data analysis, visualization, data processing, and scalable scientific computing
• Products: Anaconda (free Python distribution; Enterprise version available, which includes GPU support) and Wakari (scientific Python in your browser; available to install in your data center)
• Services: training, support, and consulting
Big Picture
Empower domain experts, subject matter experts, and other occasional programmers with high-level tools that exploit modern hardware: Array-Oriented Computing.
Why Array-Oriented Computing
• Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in the domain
• Can take advantage of parallelism and accelerators like the GPU
[Diagram: a collection of objects, each with Attr1/Attr2/Attr3, rearranged as per-attribute arrays indexed by Object1...Object6]
Why Python
• License: free and open source, permissive license
• IPython: an interactive prompt on steroids (Notebook); allows less working memory; allows failing quickly for exploration
• Community: broad and friendly; over 36,000 packages on PyPI; commercial support; many conferences (PyData, SciPy, PyCon...)
• Modern constructs: list comprehensions; iterator protocol and generators; meta-programming; introspection; JIT compiler and concurrency (Numba)
• Readable syntax: executable pseudo-code; can understand and edit code a year later; fun to develop; use of indentation
• Batteries included: Internet (FTP, HTTP, SMTP, XMLRPC); compression and databases; great visualization tools (Bokeh, Matplotlib, etc.); powerful libraries for STEM; integration with C/C++/Fortran
Breaking the Speed Barrier (Numba!)
Rapid iteration and development + fast code execution = the ideal combination!
• Python syntax, but no GIL
• Native code speed for numerical computing (NumPy code)
Numba aims to be the world's best array-oriented compiler.
NumPy + Mamba = Numba
[Diagram: Python Function → LLVM-PY → LLVM Library → Machine Code; the same LLVM infrastructure underlies CLANG, ISPC, OpenCL, OpenMP, and CUDA, targeting Intel, AMD, Nvidia, Apple, and ARM hardware]
Example

    from numba import jit
    from math import sin, pi

    @jit('f8(f8)')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)
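The compiled function is called like any Python function; a quick sanity check against NumPy's normalized sinc (a sketch, assuming NumPy is installed):

    import numpy as np
    # the jit-compiled sinc should match NumPy's normalized sinc
    assert abs(sinc(0.5) - np.sinc(0.5)) < 1e-12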
Compiler Overview
Like the C, C++, and Fortran front ends, Numba emits LLVM IR, which LLVM then lowers to x86, ARM, or PTX machine code. Numba turns Python into a "compiled language".
~150x speed-up: real-time image processing in Python (50 fps Mandelbrot)
Anaconda Accelerate
Fast development and execution: the Python and NumPy stack compiled to parallel architectures (GPUs and multi-core machines).
• Compile NumPy array expressions for the CPU and GPU
• Create parallel-for loops
• Parallel execution of ufuncs
• Run ufuncs on the GPU
• Write CUDA directly in Python!
• Requires CUDA 5.5

    $ conda install accelerate
NumbaPro Features
• CUDA Python
• Vectorize --- NumPy functions on the GPU
• Array expressions
• Parallel-for loops
• Access to fast libraries (cuRAND, cuFFT, cuBLAS)
Compile NumPy array expressions

    import numbapro  # import first to enable NumbaPro features
    from numba import autojit

    @autojit
    def formula(a, b, c):
        a[1:, 1:] = a[1:, 1:] + b[1:, :-1] + c[1:, :-1]

    @autojit
    def express(m1, m2):
        m2[1:-1:2, 0, ..., ::2] = (m1[1:-1:2, ..., ::2] *
                                   m1[-2:1:-2, ..., ::2])
        return m2
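A minimal driver for the functions above (a sketch; the array shapes are illustrative, chosen so the strided slices line up):

    import numpy as np

    a = np.ones((5, 5)); b = np.ones((5, 5)); c = np.ones((5, 5))
    formula(a, b, c)              # updates a[1:, 1:] in place

    m1 = np.arange(36.0).reshape(6, 6)
    m2 = np.zeros((6, 2, 6))
    express(m1, m2)               # fills strided slices of m2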
Create parallel-for loops
The "prange" directive spawns compiled tasks in threads (like the OpenMP parallel-for pragma). A usage sketch follows the code.

    import numbapro  # import first to make prange available
    from numba import autojit, prange

    @autojit
    def parallel_sum2d(a):
        sum = 0.0
        for i in prange(a.shape[0]):
            for j in range(a.shape[1]):
                sum += a[i, j]
        return sum
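Calling it is identical to the serial version; a small usage sketch (np.isclose because the parallel summation order may differ from NumPy's):

    import numpy as np
    a = np.arange(12.0).reshape(3, 4)
    total = parallel_sum2d(a)         # rows summed across threads
    assert np.isclose(total, a.sum())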
Fast vectorize
NumPy's ufuncs take "kernels" and apply the kernel element-by-element over entire arrays. Write kernels in Python!

    import numpy
    from numbapro import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'])
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    x = numpy.linspace(-5, 5, 100)
    y = sinc(x)
Ufuncs in parallel (multi-thread or GPU)

    import numpy
    from numbapro import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    @vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
    def sinc2(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    x = numpy.linspace(-5, 5, 1000)
    y = sinc(x)   # On GPU
    z = sinc2(x)  # Multiple CPUs
Example Benchmark
The overhead of memory transfer is overcome after about 1 million floats for a simple computation; there is about 1 ms of overhead for memory transfer and set-up.
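One way to observe this crossover is to time the gpu-target sinc from the previous slide across array sizes (a rough sketch; exact numbers depend on hardware):

    import time
    import numpy as np

    for n in (10**4, 10**5, 10**6, 10**7):
        x = np.linspace(-5, 5, n)
        t0 = time.time()
        y = sinc(x)  # includes host->device and device->host transfer
        print(n, time.time() - t0)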
Using Vectorize

    from numbapro import vectorize

    sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

    @vectorize([sig], target='gpu')
    def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
        pixel_size_x = (max_x - min_x) / width
        pixel_size_y = (max_y - min_y) / height
        x = tid % width
        y = tid / width
        real = min_x + x * pixel_size_x
        imag = min_y + y * pixel_size_y
        c = complex(real, imag)
        z = 0.0j
        for i in range(iters):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) >= 4:
                return i
        return 255

Kind          Time (s)   Speed-up
Python        263.6      1.0x
CPU           2.639      100x
Tesla S2050   0.1676     1573x
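To render a frame, the ufunc is broadcast over one thread id per pixel; a usage sketch (the view rectangle and iteration count are illustrative):

    import numpy as np
    width, height, iters = 1024, 768, 20
    tids = np.arange(width * height, dtype=np.uint32)
    pixels = mandel(tids, -2.0, 1.0, -1.0, 1.0, width, height, iters)
    image = pixels.reshape(height, width)  # uint8 escape count per pixel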
Grouping Calculations
Create "generalized ufuncs" whose elements are "arrays".

    import numpy as np
    from numbapro import guvectorize

    prototype = "void(float32[:,:], float32[:,:], float32[:,:])"

    @guvectorize([prototype], '(m,n),(n,p)->(m,p)', target='gpu')
    def matmul(A, B, C):
        m, n = A.shape
        n, p = B.shape
        for i in range(m):
            for j in range(p):
                C[i, j] = 0
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]

    matrix_ct = 1000

    # creates an array of 1000 x 2 x 4
    A = np.arange(matrix_ct * 2 * 4,
                  dtype=np.float32).reshape(1000, 2, 4)

    # creates an array of 1000 x 4 x 5
    B = np.arange(matrix_ct * 4 * 5,
                  dtype=np.float32).reshape(1000, 4, 5)

    # outputs an array of 1000 x 2 x 5
    C = matmul(A, B)
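Because the core multiplies each 2x4 slice by the matching 4x5 slice, the result can be checked slice-by-slice against NumPy (an illustrative check):

    import numpy as np
    expected = np.array([np.dot(A[i], B[i]) for i in range(A.shape[0])])
    assert np.allclose(C, expected)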
Using cuBLAS

    import numpy as np
    from numbapro.cudalib import cublas

    N = 128  # matrix dimension (illustrative; N was left undefined on the slide)

    A = np.array(np.arange(N ** 2, dtype=np.float32).reshape(N, N))
    B = np.array(np.arange(N) + 10, dtype=A.dtype)
    D = np.zeros_like(A, order='F')

    # NumPy
    E = np.dot(A, np.diag(B))

    # cuBLAS
    blas = cublas.Blas()
    blas.gemm('T', 'T', N, N, N, 1.0, A, np.diag(B), 1.0, D)
FFT Convolution with cuFFT
Works with 1d, 2d, and 3d arrays.

    from numbapro import vectorize, cuda
    from numbapro.cudalib import cufft

    @vectorize(['complex64(complex64, complex64)'], target='gpu')
    def vmult(a, b):
        return a * b

    # host -> device
    d_img = cuda.to_device(img)    # image
    d_fltr = cuda.to_device(fltr)  # filter

    # FFT forward
    cufft.fft_inplace(d_img)
    cufft.fft_inplace(d_fltr)

    # multiply
    vmult(d_img, d_fltr, out=d_img)  # in place

    # FFT inverse
    cufft.ifft_inplace(d_img)

    # device -> host
    filtered_img = d_img.copy_to_host()
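The snippet assumes img and fltr already exist as complex arrays of the same shape; a minimal setup sketch (sizes and filter are illustrative):

    import numpy as np
    img = np.random.rand(256, 256).astype(np.complex64)
    fltr = np.zeros((256, 256), dtype=np.complex64)
    fltr[:5, :5] = 1.0 / 25.0  # 5x5 box filter, zero-padded to image size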
Monte-Carlo Pricing and cuRAND

    import math
    import numpy as np
    from numbapro import vectorize, cuda
    from numbapro.cudalib import curand

    @vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')
    def step(last, dt, c0, c1, noise):
        return last * math.exp(c0 * dt + c1 * noise)

    def monte_carlo_pricer(paths, dt, interest, volatility):
        n = paths.shape[0]
        # Instantiate cuRAND PRNG
        prng = curand.PRNG(curand.PRNG.MRG32K3A)
        # Allocate device side array
        d_normdist = cuda.device_array(n, dtype=np.double)
        c0 = interest - 0.5 * volatility ** 2
        c1 = volatility * math.sqrt(dt)
        # Simulation loop
        d_last = cuda.to_device(paths[:, 0])
        for j in range(1, paths.shape[1]):
            prng.normal(d_normdist, mean=0, sigma=1)
            d_paths = cuda.to_device(paths[:, j])
            step(d_last, dt, c0, c1, d_normdist, out=d_paths)
            d_paths.copy_to_host(paths[:, j])
            d_last = d_paths

Tuned kernel: the same step written as a CUDA kernel, launched with
blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK and
gridsz = int(math.ceil(float(n) / blksz)):

    from numbapro import jit

    @jit('void(double[:], double[:], double, double, double, double[:])',
         target='gpu')
    def step(last, paths, dt, c0, c1, normdist):
        # expands to i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        i = cuda.grid(1)
        if i >= paths.shape[0]:
            return
        noise = normdist[i]
        paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)
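A hedged driver for the vectorize-based pricer above (path count, step count, and market parameters are illustrative):

    import numpy as np
    n_paths, n_steps = 100000, 100
    dt = 1.0 / n_steps
    paths = np.zeros((n_paths, n_steps), dtype=np.double)
    paths[:, 0] = 100.0  # initial asset price
    monte_carlo_pricer(paths, dt, interest=0.02, volatility=0.3)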
Benchmark http://continuum.io/blog/monte-carlo-pricer
CUDA Python
CUDA development directly in Python.

    import numpy as np
    from numbapro import cuda
    from numba import autojit

    @autojit(target='gpu')
    def array_scale(src, dst, scale):
        tid = cuda.threadIdx.x
        blkid = cuda.blockIdx.x
        blkdim = cuda.blockDim.x
        i = tid + blkid * blkdim
        if i >= dst.shape[0]:
            return
        dst[i] = src[i] * scale

    N = 2 ** 20                      # illustrative size
    block = 256                      # threads per block (illustrative)
    grid = (N + block - 1) // block  # enough blocks to cover N elements

    src = np.arange(N, dtype=np.float64)
    dst = np.empty_like(src)
    array_scale[grid, block](src, dst, 5.0)