Simple Optimizations for Applicative Array Programs for Graphics Processors

Bradford Larsen, Tufts University
blarsen@cs.tufts.edu

This work was supported in part by the NASA Space Grant Graduate Fellowship and NSF grants IIS-0082577 and OCI-0749125.

GPUs are powerful, but difficult to program

- Modern GPUs deliver on the order of 1 TFLOP/s, several times the throughput of CPUs.
- Simple operations take a lot of code. This C loop:

      float sum = 0;
      for (int i = 0; i < n; i += 1)
          sum += arr[i];

  takes roughly 150 lines of CUDA.
- GPU code is data-parallel: you must decompose the problem's data.

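For contrast, the equivalent reduction written with the Barracuda primitives introduced on the next slide is a one-liner. A minimal sketch; the exact type signature is my assumption, not code from the talk:

    gpuSum :: VExp Float -> SExp Float
    gpuSum xs = vreduce (+) 0 xs   -- or simply: vsum xs
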
Applicative array programming allows easy GPU use

- vmap f xs: element-wise transformation; [a, b, c, d] becomes [f(a), f(b), f(c), f(d)].
- vzipWith f xs ys: element-wise transformation of two vectors; [a, b, c, d] and [a', b', c', d'] become [f(a, a'), f(b, b'), f(c, c'), f(d, d')].
- vreduce (⊕) i xs: element-wise accumulation; [a, b, c, d] becomes i ⊕ a ⊕ b ⊕ c ⊕ d.
- vslice (1, 2) xs: subvector extraction; [a, b, c, d] becomes [b, c].

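A minimal sketch of what these primitives mean, using ordinary Haskell lists in place of GPU vectors. The list functions and the inclusive (start, end) convention for vslice are my reading of the examples above, not code from the talk:

    -- List-based reference semantics for the Barracuda primitives.
    vmapL :: (a -> b) -> [a] -> [b]
    vmapL = map

    vzipWithL :: (a -> b -> c) -> [a] -> [b] -> [c]
    vzipWithL = zipWith

    -- Accumulate with an operator, starting from an initial value i.
    vreduceL :: (b -> a -> b) -> b -> [a] -> b
    vreduceL = foldl

    -- Extract the subvector from index b to index e, inclusive.
    vsliceL :: (Int, Int) -> [a] -> [a]
    vsliceL (b, e) xs = take (e - b + 1) (drop b xs)
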
The Barracuda language supports these primitives on the GPU

- Applicative: no side effects.
- Compositional: primitives can be freely nested.
- Deeply embedded within Haskell.
- Functions on vectors, matrices, and scalars are the unit of compilation.

Barracuda code resembles Haskell code on lists

Haskell, on lists:

    rmse :: [Float] -> [Float] -> Float
    rmse x y = sqrt (sumDiff / fromIntegral (length x))
      where sumDiff = sum (map (^2) (zipWith (-) x y))

Barracuda:

    rmse :: VExp Float -> VExp Float -> SExp Float
    rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
      where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))

The differences: Barracuda code works on GPU vectors, not lists, and the vector functions are named differently (vmap, vzipWith, vsum, vlength).

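As a quick sanity check of the list version, a minimal runnable sketch; the sample inputs and expected output are mine:

    main :: IO ()
    main = print (rmse [1, 2, 3, 4] [1, 2, 3, 6])   -- prints 1.0
      where
        rmse :: [Float] -> [Float] -> Float
        rmse x y = sqrt (sumDiff / fromIntegral (length x))
          where sumDiff = sum (map (^2) (zipWith (-) x y))
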
Barracuda functions construct abstract syntax trees

    rmse :: VExp Float -> VExp Float -> SExp Float
    rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
      where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))

Evaluating this Haskell function does not compute a number; it builds an abstract syntax tree. That is, Barracuda is deeply embedded within Haskell:

    Prim1 FSqrt
      (Prim2 FDiv
        (VReduce (+) (FConst 0) (VMap (^2) (VZipWith (-) x y)))
        (Prim1 I2F (VLength x)))

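A minimal sketch of what such an AST might look like, using the constructor names from the tree above. The GADT type indices, the use of Haskell functions for the mapped operations, and the VArg constructor are my assumptions, not Barracuda's actual definitions:

    {-# LANGUAGE GADTs #-}

    -- Hypothetical unary and binary scalar primitives.
    data Prim1Op a b where
      FSqrt :: Prim1Op Float Float
      I2F   :: Prim1Op Int   Float

    data Prim2Op a where
      FDiv :: Prim2Op Float
      FSub :: Prim2Op Float

    -- Scalar expressions.
    data SExp a where
      FConst  :: Float -> SExp Float
      Prim1   :: Prim1Op a b -> SExp a -> SExp b
      Prim2   :: Prim2Op a -> SExp a -> SExp a -> SExp a
      VLength :: VExp a -> SExp Int
      VReduce :: (SExp a -> SExp a -> SExp a) -> SExp a -> VExp a -> SExp a

    -- Vector expressions.
    data VExp a where
      VArg     :: String -> VExp a       -- a named input vector
      VMap     :: (SExp a -> SExp b) -> VExp a -> VExp b
      VZipWith :: (SExp a -> SExp b -> SExp c) -> VExp a -> VExp b -> VExp c
      VSlice   :: (SExp Int, SExp Int) -> VExp a -> VExp a
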
Barracuda ASTs are compiled into optimized CUDA code

The user writes the Barracuda functions and the C++ code of the GPGPU application. The Barracuda compiler translates the Barracuda ASTs into CUDA kernels and C++ wrapper functions, which nvcc compiles together with the user's C++ code and the Barracuda runtime into the GPGPU application.

Efficient GPU code exploits the memory hierarchy

Figure: the memory hierarchy of an NVIDIA Tesla C2050: 14 chips, each with 32 GPU cores and 48 KB of shared memory; 3 GB of device memory; and host main memory. Access latency grows from roughly 1 cycle for shared memory to 100's of cycles for device memory and 1000's for main memory.

Nested array expressions are potentially troublesome

    rmse :: VExp Float -> VExp Float -> SExp Float
    rmse x y = sqrt (sumDiff / fromIntegral (vlength x))
      where sumDiff = vsum (vmap (^2) (vzipWith (-) x y))

Naive compilation of this nested expression uses temporary arrays and multiple passes over the data.

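To make the cost concrete, a minimal sketch on lists (not GPU vectors) of what naive compilation corresponds to, one temporary and one pass per primitive, versus the single fused pass we would like. The helper names are mine:

    import Data.List (foldl')

    -- Naive: one temporary array and one pass over the data per primitive.
    sumDiffNaive :: [Float] -> [Float] -> Float
    sumDiffNaive x y =
      let t1 = zipWith (-) x y   -- temporary, pass 1
          t2 = map (^2) t1       -- temporary, pass 2
      in  sum t2                 -- pass 3

    -- Fused: a single pass, no temporary arrays.
    sumDiffFused :: [Float] -> [Float] -> Float
    sumDiffFused x y = foldl' (\acc (a, b) -> acc + (a - b) ^ 2) 0 (zip x y)
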
CUDA computes on elements, not arrays

CUDA code is data-parallel: kernels describe what happens at one location.

Array indexing laws allow for fusion:

    (vmap f xs)!i        = f (xs!i)
    (vzipWith f xs ys)!i = f (xs!i) (ys!i)
    (vslice (b, e) xs)!i = xs!(b + i)

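A minimal sketch checking these laws against a list model of the primitives (the property functions, and the modelling of vslice as take/drop with inclusive bounds, are mine; each property is meant to hold for any in-range index i):

    propMapLaw :: Eq b => (a -> b) -> [a] -> Int -> Bool
    propMapLaw f xs i = map f xs !! i == f (xs !! i)

    propZipWithLaw :: Eq c => (a -> b -> c) -> [a] -> [b] -> Int -> Bool
    propZipWithLaw f xs ys i = zipWith f xs ys !! i == f (xs !! i) (ys !! i)

    -- vslice (b, e) modelled as take (e - b + 1) . drop b (inclusive bounds).
    propSliceLaw :: Eq a => (Int, Int) -> [a] -> Int -> Bool
    propSliceLaw (b, e) xs i =
      take (e - b + 1) (drop b xs) !! i == xs !! (b + i)
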
Barracuda always applies the array indexing laws

Array fusion comes naturally during code generation, e.g.:

    (vmap f (vmap g xs))!i                 → f (g (xs!i))
    (vmap f (vzipWith g xs ys))!i          → f (g (xs!i) (ys!i))
    (vslice (b, e) (vmap f xs))!i          → f (xs!(b + i))
    (vmap f (vslice (b, e) xs))!i          → f (xs!(b + i))
    (vslice (b, e) (vslice (b', e') xs))!i → xs!(b + b' + i)

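A minimal sketch of how fusion falls out of code generation: computing the expression for element i pushes the index through vmap, vzipWith, and vslice, so no temporary arrays are produced. The tiny AST, the indexAt function, and the ThreadIdx and Sq placeholders are my inventions for illustration:

    -- A tiny, hypothetical model of scalar code and vector expressions.
    data Exp
      = IntLit Int
      | ThreadIdx              -- the thread's global index in the CUDA kernel
      | Index String Exp       -- read the named input vector at a scalar index
      | Add Exp Exp
      | Sub Exp Exp
      | Sq  Exp                -- squaring, standing in for (^2)
      deriving Show

    data Vec
      = Arg String                           -- a named input vector
      | Map (Exp -> Exp) Vec                 -- vmap
      | ZipWith (Exp -> Exp -> Exp) Vec Vec  -- vzipWith
      | Slice (Exp, Exp) Vec                 -- vslice, inclusive (start, end)

    -- The scalar expression for element i of a vector expression.
    indexAt :: Vec -> Exp -> Exp
    indexAt (Arg name)         i = Index name i
    indexAt (Map f xs)         i = f (indexAt xs i)
    indexAt (ZipWith f xs ys)  i = f (indexAt xs i) (indexAt ys i)
    indexAt (Slice (b, _e) xs) i = indexAt xs (Add b i)

    -- Element i of vmap (^2) (vzipWith (-) x y) fuses into one expression,
    --   Sq (Sub (Index "x" ThreadIdx) (Index "y" ThreadIdx)),
    -- with one read per input and no temporaries.
    fusedBody :: Exp
    fusedBody = indexAt (Map Sq (ZipWith Sub (Arg "x") (Arg "y"))) ThreadIdx
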
Efficient GPU code exploits the memory hierarchy (revisited)

Recall the NVIDIA Tesla C2050 hierarchy from earlier: 48 KB of fast shared memory per chip versus 3 GB of much slower device memory. The next optimization targets shared memory.

Stencil operations involve redundant reads

A data-parallel CUDA kernel is run by many threads on the GPU's 14 chips; a block of threads is mapped onto a range of vector locations (in the figure, elements a–h at locations 0–7).

Stencil operations involve array elements in a neighborhood, so several threads read the same elements.

Barracuda automatically uses shared memory when useful

When multiple array subexpressions overlap, there is read redundancy, e.g.:

    as = vzipWith (-) zs ys
    ys = vslice (0, 6) xs
    zs = vslice (1, 7) xs

With xs = [a, b, c, d, e, f, g, h], ys is [a, b, c, d, e, f, g] and zs is [b, c, d, e, f, g, h]: elements b–g are read twice in the computation of as.

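A minimal sketch of the overlap reasoning for two slices of the same vector, assuming inclusive (start, end) bounds; the function is my illustration:

    -- The range of indices read by both slices, if any; those elements are
    -- read at least twice and are candidates for staging in shared memory.
    sliceOverlap :: (Int, Int) -> (Int, Int) -> Maybe (Int, Int)
    sliceOverlap (b1, e1) (b2, e2)
      | lo <= hi  = Just (lo, hi)
      | otherwise = Nothing
      where
        lo = max b1 b2
        hi = min e1 e2

For the example above, sliceOverlap (0, 6) (1, 7) is Just (1, 6): indices 1–6, i.e. elements b–g.
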
Use of shared memory is only useful when

- array elements are read at least two times;
- it is known at compile time that elements are read multiple times; and
- there are enough elements to amortize the added indexing costs.

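A minimal sketch of such a decision for two slices of the same vector. The threshold constant is invented, and the sketch assumes the slice bounds are already known at compile time (constants or vector-length expressions, per the next slide):

    -- Hypothetical decision rule, with inclusive (start, end) slice bounds.
    shouldUseSharedMem :: (Int, Int) -> (Int, Int) -> Bool
    shouldUseSharedMem (b1, e1) (b2, e2) = overlap >= threshold
      where
        overlap   = min e1 e2 - max b1 b2 + 1  -- elements read at least twice
        threshold = 512  -- invented: "enough elements" to amortize the indexing
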
Shared memory optimization examples

Optimized; there are enough elements:

    as = vzipWith (-) zs ys
    ys = vslice (0, 1022) xs
    zs = vslice (1, 1023) xs

Not optimized; no elements are read multiple times:

    as = vzipWith (-) zs ys
    ys = vslice (0, 511) xs
    zs = vslice (512, 1023) xs

Not optimized; no elements are read multiple times:

    as = vzipWith (-) zs ys
    ys = vslice (0, 511) xs

Optimized; the slices use only constant and vector-length expressions, and there are enough elements:

    as = vzipWith (-) zs ys
    ys = vslice (0, 1022) xs
    zs = vslice (1, vlength xs) xs

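Applying the hypothetical shouldUseSharedMem sketch from above to the two constant-bound cases (the expected results are mine, following the annotations):

    ex1, ex2 :: Bool
    ex1 = shouldUseSharedMem (0, 1022) (1, 1023)    -- True: indices 1–1022 overlap
    ex2 = shouldUseSharedMem (0, 511) (512, 1023)   -- False: the slices are disjoint
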
A mix of existing and new benchmarks was used

- BLAS operations and Black-Scholes, as seen in Lee et al. (2009) and Mainland and Morrisett (2010).
- Weighted moving average, RMSE, and forward difference, used to show the impact of the optimizations.
- Test system: NVIDIA GeForce 8800 GT with 512 MB, CUDA 3.2.

Barracuda performance is good

Figure: runtime of Barracuda-generated code relative to hand-coded solutions for SDOT, Black-Scholes call options, and SAXPY, over 2^8 to 2^24 array elements. Relative runtimes range from about 1.05 (slightly slower than hand-coded) down to about 0.75 (faster than hand-coded).

Array fusion is essential for good performance

Figure: average RMSE kernel runtime (µs, log scale, roughly 10^2 to 10^4) over 2^8 to 2^24 array elements, comparing a manually unfused version against the version with fusion. Depending on array size, the fused version is between 1.1x and 1.7–2.9x faster.

Use of shared memory greatly improves performance

Figure: speedup from the shared memory optimization over 2^8 to 2^24 array elements for the weighted moving average, forward difference, and Jacobi iteration stencil benchmarks; speedups reach roughly 8x.

Speedups are enabled by careful use of declarative programming

- Barracuda gets its speedups through better use of GPU memory: computation is moved into fast memory through array fusion and the shared memory optimization.
- These optimizations are easy to implement because the source language is applicative and has few primitives.