  1. Applications of Programming the GPU Directly from Python Using NumbaPro
     Travis E. Oliphant, Ph.D.
     Supercomputing 2013, November 20, 2013

  2. Introduction
     Data Analysis, Visualisation, Data Processing, Scalable Scientific Computing.
     • Products: Anaconda, a free Python distribution (an Enterprise version, which includes GPU support, is available); Wakari, Scientific Python in your browser (an Enterprise version is available to install in your data-center).
     • Services: Training, Support, Consulting.

  3. Big Picture
     Empower domain experts, subject matter experts, and other occasional programmers with high-level tools that exploit modern hardware.
     Array Oriented Computing

  4. Why Array-Oriented Computing
     • Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in the domain (see the sketch below)
     • Can take advantage of parallelism and accelerators like the GPU
     [Diagram: a collection of objects, each with Attr1, Attr2, Attr3, reorganized into per-attribute arrays indexed by Object1...Object6]
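
     To make the contrast concrete, here is a minimal plain-NumPy sketch (not from the slides) of the same computation written object-by-object and array-at-a-time; the whole-array form is what compilers and GPUs can parallelize.

        import numpy as np

        # Rows are Object1..Object6; columns are Attr1..Attr3.
        objects = np.arange(18, dtype=np.float64).reshape(6, 3)

        # Object-by-object style: visit each object in a Python loop.
        norms_loop = np.empty(6)
        for i in range(6):
            norms_loop[i] = np.sqrt((objects[i] ** 2).sum())

        # Array-oriented style: one expression over all objects at once.
        norms = np.sqrt((objects ** 2).sum(axis=1))

        assert np.allclose(norms, norms_loop)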

  5. Why Python
     License: Free and Open Source, Permissive License
     Community:
     • Broad and friendly community
     • Over 36,000 packages on PyPI
     • Commercial Support
     • Many conferences (PyData, SciPy, PyCon...)
     Readable Syntax:
     • Executable pseudo-code
     • Can understand and edit code a year later
     • Fun to develop
     • Use of Indentation
     IPython:
     • Interactive prompt on steroids (Notebook)
     • Allows less working memory
     • Allows failing quickly for exploration
     Modern Constructs:
     • List comprehensions
     • Iterator protocol and generators
     • Meta-programming
     • Introspection
     • JIT Compiler and Concurrency (Numba)
     Batteries Included:
     • Internet (FTP, HTTP, SMTP, XMLRPC)
     • Compression and Databases
     • Great Visualization tools (Bokeh, Matplotlib, etc.)
     • Powerful libraries for STEM
     • Integration with C/C++/Fortran

  6. Breaking the Speed Barrier (Numba!)
     rapid iteration and development + fast code execution = ideal combination!
     Python syntax, but no GIL; native code speed for numerical computing (NumPy code).
     Numba aims to be the world's best array-oriented compiler.

  7. NumPy + Mamba = Numba
     [Diagram: Python Function → LLVM-PY → LLVM Library → Machine Code; LLVM back-ends (ISPC, OpenCL, OpenMP, CUDA, CLANG) target Intel, AMD, Nvidia, ARM, and Apple hardware]

  8. Example

        from numba import jit
        from math import sin, pi

        @jit('f8(f8)')
        def sinc(x):
            if x == 0.0:
                return 1.0
            else:
                return sin(x * pi) / (pi * x)

  9. Compiler Overview
     [Diagram: C++, C, Fortran, and Numba front ends compile to LLVM IR, which targets x86, ARM, and PTX]
     Numba turns Python into a "compiled language"

  10. ~150x speed-up: real-time image processing in Python (50 fps Mandelbrot)

  11. Anaconda Accelerate
     Fast development and execution: the Python and NumPy stack compiled to parallel architectures (GPUs and multi-core machines).
     • Compile NumPy array expressions for the CPU and GPU
     • Create parallel-for loops
     • Parallel execution of ufuncs
     • Run ufuncs on the GPU
     • Write CUDA directly in Python!
     • Requires CUDA 5.5

        $ conda install accelerate

  12. NumbaPro Features
     • CUDA Python
     • Vectorize --- NumPy functions on the GPU
     • Array expressions
     • Parallel for loops
     • Access to fast libraries (cuRAND, cuFFT, cuBLAS)

  13. Compile NumPy array expressions

        import numbapro
        from numba import autojit

        @autojit
        def formula(a, b, c):
            a[1:, 1:] = a[1:, 1:] + b[1:, :-1] + c[1:, :-1]

        @autojit
        def express(m1, m2):
            m2[1:-1:2, 0, ..., ::2] = (m1[1:-1:2, ..., ::2] *
                                       m1[-2:1:-2, ..., ::2])
            return m2
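
     A possible invocation of formula (hypothetical arrays, not on the slide):

        import numpy as np

        a = np.ones((5, 5))
        b = np.ones((5, 5))
        c = np.ones((5, 5))
        formula(a, b, c)   # updates a[1:, 1:] in place to 3.0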

  14. Create parallel-for loops
     A "prange" directive that spawns compiled tasks in threads (like the OpenMP parallel-for pragma).

        import numbapro  # import first to make prange available
        from numba import autojit, prange

        @autojit
        def parallel_sum2d(a):
            sum = 0.0
            for i in prange(a.shape[0]):
                for j in range(a.shape[1]):
                    sum += a[i, j]
            return sum
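
     For example (hypothetical input, not on the slide), the outer loop over rows runs across threads:

        import numpy as np

        a = np.arange(12, dtype=np.float64).reshape(3, 4)
        assert parallel_sum2d(a) == a.sum()   # 66.0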

  15. Fast vectorize
     NumPy's ufuncs take "kernels" and apply the kernel element-by-element over entire arrays. Write kernels in Python!

        import numpy
        from numbapro import vectorize
        from math import sin, pi

        @vectorize(['f8(f8)', 'f4(f4)'])
        def sinc(x):
            if x == 0.0:
                return 1.0
            else:
                return sin(x * pi) / (pi * x)

        x = numpy.linspace(-5, 5, 100)
        y = sinc(x)

  16. Ufuncs in parallel (multi-thread or GPU)

        import numpy
        from numbapro import vectorize
        from math import sin, pi

        @vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
        def sinc(x):
            if x == 0.0:
                return 1.0
            else:
                return sin(x * pi) / (pi * x)

        @vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
        def sinc2(x):
            if x == 0.0:
                return 1.0
            else:
                return sin(x * pi) / (pi * x)

        x = numpy.linspace(-5, 5, 1000)
        y = sinc(x)   # On GPU
        z = sinc2(x)  # Multiple CPUs

  17. Example Benchmark
     The overhead of memory transfer is overcome after about 1 million floats for a simple computation; there is about 1 ms of overhead for memory transfer and set-up.
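
     A sketch of how one might measure this crossover (not from the slides; it reuses the gpu-target sinc ufunc from the previous slide and a NumPy expression as the CPU baseline):

        import timeit
        import numpy as np

        for n in (10**4, 10**5, 10**6, 10**7):
            x = np.linspace(-5, 5, n)
            t_cpu = timeit.timeit(lambda: np.sin(x * np.pi) / (np.pi * x),
                                  number=10)
            t_gpu = timeit.timeit(lambda: sinc(x), number=10)
            print(n, t_cpu, t_gpu)  # GPU should win past ~1e6 elements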

  18. Using Vectorize

        from numbapro import vectorize

        sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

        @vectorize([sig], target='gpu')
        def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
            pixel_size_x = (max_x - min_x) / width
            pixel_size_y = (max_y - min_y) / height
            x = tid % width
            y = tid / width
            real = min_x + x * pixel_size_x
            imag = min_y + y * pixel_size_y
            c = complex(real, imag)
            z = 0.0j
            for i in range(iters):
                z = z * z + c
                if (z.real * z.real + z.imag * z.imag) >= 4:
                    return i
            return 255

     Kind              Time     Speed-up
     Python            263.6    1.0x
     CPU               2.639    100x
     Tesla S2050 GPU   0.1676   1573x
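
     A host-side driver for this kernel might look like the following (a sketch, not on the slide; the image size and complex-plane bounds are example values). Each thread id corresponds to one pixel of the output image:

        import numpy as np

        width, height, iters = 1024, 768, 20
        tids = np.arange(width * height, dtype=np.uint32)
        # One ufunc element per pixel; the scalar arguments broadcast.
        pixels = mandel(tids, -2.0, 1.0, -1.0, 1.0, width, height, iters)
        image = pixels.reshape(height, width)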

  19. Grouping Calculations
     Create "generalized ufuncs" whose elements are "arrays".

        import numpy as np
        from numbapro import guvectorize

        prototype = "void(float32[:,:], float32[:,:], float32[:,:])"

        @guvectorize([prototype], '(m,n),(n,p)->(m,p)', target='gpu')
        def matmul(A, B, C):
            m, n = A.shape
            n, p = B.shape
            for i in range(m):
                for j in range(p):
                    C[i, j] = 0
                    for k in range(n):
                        C[i, j] += A[i, k] * B[k, j]

        matrix_ct = 1000

        # creates an array of 1000 x 2 x 4
        A = np.arange(matrix_ct * 2 * 4,
                      dtype=np.float32).reshape(1000, 2, 4)
        # creates an array of 1000 x 4 x 5
        B = np.arange(matrix_ct * 4 * 5,
                      dtype=np.float32).reshape(1000, 4, 5)
        # outputs an array of 1000 x 2 x 5
        C = matmul(A, B)

  20. Using cuBLAS

        import numpy as np
        from numbapro.cudalib import cublas

        N = 128  # matrix size; the slide leaves N unspecified

        A = np.array(np.arange(N ** 2, dtype=np.float32).reshape(N, N))
        B = np.array(np.arange(N) + 10, dtype=A.dtype)
        D = np.zeros_like(A, order='F')

        # NumPy
        E = np.dot(A, np.diag(B))

        # cuBLAS
        blas = cublas.Blas()
        blas.gemm('T', 'T', N, N, N, 1.0, A, np.diag(B), 1.0, D)
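
     cuBLAS assumes column-major (Fortran) storage, which is why both transpose flags are 'T' and D is allocated with order='F'; a quick consistency check (not on the slide) against the NumPy result:

        assert np.allclose(D, E)  # GPU gemm matches np.dot(A, np.diag(B))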

  21. FFT Convolution with cuFFT
     Works with 1d, 2d, and 3d arrays.

        from numbapro import vectorize, cuda
        from numbapro.cudalib import cufft

        @vectorize(['complex64(complex64, complex64)'], target='gpu')
        def vmult(a, b):
            return a * b

        # host -> device
        d_img = cuda.to_device(img)    # image
        d_fltr = cuda.to_device(fltr)  # filter

        # FFT forward
        cufft.fft_inplace(d_img)
        cufft.fft_inplace(d_fltr)

        # multiply
        vmult(d_img, d_fltr, out=d_img)  # inplace

        # FFT inverse
        cufft.ifft_inplace(d_img)

        # device -> host
        filtered_img = d_img.copy_to_host()
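
     The slide assumes img and fltr already exist as host arrays; one way to prepare them (example shapes and filter, not from the slide) before running the steps above:

        import numpy as np

        # cuFFT works on complex data; pad the filter to the image shape
        # so the pointwise multiply implements circular convolution.
        img = np.random.rand(256, 256).astype(np.complex64)
        fltr = np.zeros_like(img)
        fltr[:8, :8] = 1.0 / 64.0   # simple 8x8 box-filter kernel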

  22. Monte-Carlo Pricing and cuRAND

        import math
        import numpy as np
        from numbapro import vectorize, cuda
        from numbapro.cudalib import curand

        @vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')
        def step(last, dt, c0, c1, noise):
            return last * math.exp(c0 * dt + c1 * noise)

        def monte_carlo_pricer(paths, dt, interest, volatility):
            n = paths.shape[0]
            blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK
            gridsz = int(math.ceil(float(n) / blksz))
            # Instantiate cuRAND PRNG
            prng = curand.PRNG(curand.PRNG.MRG32K3A)
            # Allocate device side array
            d_normdist = cuda.device_array(n, dtype=np.double)
            c0 = interest - 0.5 * volatility ** 2
            c1 = volatility * math.sqrt(dt)
            # Simulation loop
            d_last = cuda.to_device(paths[:, 0])
            for j in range(1, paths.shape[1]):
                prng.normal(d_normdist, mean=0, sigma=1)
                d_paths = cuda.to_device(paths[:, j])
                step(d_last, dt, c0, c1, d_normdist, out=d_paths)
                d_paths.copy_to_host(paths[:, j])
                d_last = d_paths

     Tuned kernel:

        from numbapro import jit, cuda

        @jit('void(double[:], double[:], double, double, double, double[:])',
             target='gpu')
        def step(last, paths, dt, c0, c1, normdist):
            # cuda.grid(1) expands to
            # i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
            i = cuda.grid(1)
            if i >= paths.shape[0]:
                return
            noise = normdist[i]
            paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)
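
     A hypothetical driver for the vectorize-based pricer (all parameter values are examples, not from the slide):

        import numpy as np

        n_paths, n_steps = 100000, 252
        dt, spot = 1.0 / 252, 100.0
        paths = np.zeros((n_paths, n_steps), dtype=np.double)
        paths[:, 0] = spot
        monte_carlo_pricer(paths, dt, interest=0.02, volatility=0.3)
        print(paths[:, -1].mean())   # Monte-Carlo mean terminal price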

  23. Benchmark http://continuum.io/blog/monte-carlo-pricer

  24. CUDA Python
     CUDA development directly in Python.

        import numpy as np
        from numbapro import cuda
        from numba import autojit

        N = 4096                         # example array length (not on the slide)
        n = N                            # element count for the bounds check
        block = 256                      # example threads per block
        grid = (N + block - 1) // block  # enough blocks to cover N elements

        @autojit(target='gpu')
        def array_scale(src, dst, scale):
            tid = cuda.threadIdx.x
            blkid = cuda.blockIdx.x
            blkdim = cuda.blockDim.x
            i = tid + blkid * blkdim
            if i >= n:
                return
            dst[i] = src[i] * scale

        src = np.arange(N, dtype=np.float)
        dst = np.empty_like(src)
        array_scale[grid, block](src, dst, 5.0)
