Many-core Computing: Can compilers and tools do the heavy lifting?
Wen-mei Hwu, University of Illinois, Urbana-Champaign


  1. Many-core Computing: Can compilers and tools do the heavy lifting?
     Wen-mei Hwu
     FCRP GSRC, Illinois UPCRC, Illinois CUDA CoE, IACAT, IMPACT
     University of Illinois, Urbana-Champaign
     MPSoC, August 3, 2009

  2. Outline
     • Parallel application outlook
     • Heavy lifting in “simple” parallel applications
     • Promising tool strategies and early evidence
     • Challenges and opportunities
       • SoC-specific opportunities and challenges?

  3. The Energy Behind the Parallel Revolution
     • GPU in every PC: massive volume and potential impact
     [Figure, courtesy of John Owens: GPU vs. CPU performance trends, annotated with a 3-year shift]

  4. My Predictions
     • Mass-market parallel apps will focus on many-core GPUs in the next three to four years
       • NVIDIA GeForce, ATI Radeon, Intel Larrabee
     • “Simple” (vector) parallelism
       • Dense matrix, single/multi-grids, stencils, etc.
     • Even “simple” parallelism can be challenging
       • Memory bandwidth limitation
       • Portability and scalability
       • Heterogeneity and data affinity

  5. DRAM Bandwidth Trends
     • Random-access bandwidth is 1.2% of peak for DDR3-1600 and 0.8% for GDDR4-1600 (and falling)
     • 3D stacking and optical interconnects are unlikely to help

  6. Dense Matrix Multiplication Example (G80), Ryoo et al., PPoPP 2008
     [Chart: GFLOPS (0 to 140) for normal vs. prefetch code at 1x1, 1x2, and 1x4 granularity,
      with unroll factors 1, 2, 4, and complete unroll (cannot run), for 8x8 and 16x16 tiles]
     • 8x8 tiles: memory bandwidth limited
     • 16x16 tiles: instruction throughput limited
     • Register tiling allows 200 GFLOPS (Volkov and Demmel, SC’08)
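For reference, the sketch below shows a plain shared-memory tiled matrix multiply in CUDA, in the spirit of the 16x16-tile configurations above. It is illustrative rather than the code measured in the chart: it assumes square row-major N x N matrices with N a multiple of TILE, and it leaves out the register tiling and prefetching that the 200 GFLOPS result relies on.

```cuda
#define TILE 16

// C = A * B for square N x N row-major matrices, N a multiple of TILE.
// Each block computes one TILE x TILE tile of C; tiles of A and B are staged
// through shared memory so each element is read from DRAM once per tile pass.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
// Launch: dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE);
```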

  7. Example: Convolution – Base Parallel Code
     • Each parallel task calculates an output element
     • Figure shows 1D convolution with a K=5 kernel and the calculation of 3 output elements
     • Highly parallel but memory bandwidth inefficient
       • Uses massive threading to tolerate memory latency
       • Each input element is loaded up to K times
     [Figure: input elements read directly from main memory]
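A minimal sketch of this baseline scheme for the 1D case, assuming the input array is padded with K-1 halo elements so that out[o] = sum_k in[o+k] * w[k] never reads out of bounds; the kernel name and signature are illustrative, not taken from the talk.

```cuda
// One thread per output element. Every thread fetches its K inputs directly
// from global memory, so each input element is loaded up to K times in total.
__global__ void conv1d_naive(const float* in,   // length n + K - 1 (K-1 halo)
                             const float* w,    // kernel weights, length K
                             float* out, int n, int K) {
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += in[o + k] * w[k];   // redundant global loads across neighboring threads
    out[o] = acc;
}
```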

  8. Example: Convolution Using On-chip Caching
     • Output elements are calculated from cache contents
     • Each input element is loaded only once
     • Cache pressure: (K-1+N) input elements are needed for N output elements
       • For K=5, N=3: 7/3 ≈ 2.3 in 1D, 7²/3² ≈ 5.4 in 2D, 7³/3³ ≈ 12.7 in 3D
     • For small caches, the benefit can be significantly reduced due to the high ratio of additional elements loaded
     [Figure: input elements first loaded into the on-chip cache]
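A sketch of the cached variant for the same 1D setup: each block first stages the TILE + K - 1 inputs it needs (tile plus halo) into shared memory, then all TILE outputs are computed from the on-chip copy, so each input element is loaded from DRAM only once per block. Launch with blockDim.x = TILE and (TILE + K - 1) * sizeof(float) bytes of dynamic shared memory; names are again illustrative.

```cuda
#define TILE 256

__global__ void conv1d_tiled(const float* in,   // length n + K - 1 (K-1 halo)
                             const float* w, float* out, int n, int K) {
    extern __shared__ float s_in[];              // TILE + K - 1 floats

    int base = blockIdx.x * TILE;

    // Cooperative load of the block's tile plus its K-1 halo elements.
    for (int i = threadIdx.x; i < TILE + K - 1; i += blockDim.x) {
        int g = base + i;
        s_in[i] = (g < n + K - 1) ? in[g] : 0.0f;
    }
    __syncthreads();

    int o = base + threadIdx.x;
    if (o < n) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += s_in[threadIdx.x + k] * w[k]; // all reuse comes from shared memory
        out[o] = acc;
    }
}
```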

  9. Example: Streaming for Reduced Cache Pressure
     • Each input element is loaded into the cache in turn
       • Or an (n-1)-D slice in n-D convolution
     • All threads consume that input element
     • “Loop skewing” is needed to align the consumption of input elements
     • This stretches the effective size of the on-chip cache
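A 1D sketch of the streaming idea (the slides apply it to (n-1)-D slices of an n-D convolution): one input element at a time is brought on chip, and every thread folds it into its running sum at the tap position assigned by the skewed schedule k = j - threadIdx.x, which is the loop skewing mentioned above. The single-element buffer and the names are illustrative simplifications; in practice a whole slice would be staged per step to amortize the synchronization.

```cuda
__global__ void conv1d_stream(const float* in,  // length n + K - 1 (K-1 halo)
                              const float* w, float* out, int n, int K) {
    __shared__ float x;                  // the one input element currently "in flight"

    int base = blockIdx.x * blockDim.x;
    int o    = base + threadIdx.x;       // output element owned by this thread
    float acc = 0.0f;

    // Stream the blockDim.x + K - 1 inputs this block needs, one per step.
    for (int j = 0; j < blockDim.x + K - 1; ++j) {
        if (threadIdx.x == 0) {
            int g = base + j;
            x = (g < n + K - 1) ? in[g] : 0.0f;   // each input loaded exactly once
        }
        __syncthreads();

        int k = j - threadIdx.x;          // skewed tap index for this thread
        if (k >= 0 && k < K)
            acc += x * w[k];
        __syncthreads();
    }
    if (o < n) out[o] = acc;
}
```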

  10. Many-core GPU Timing Results
      • Time to compute a 3D k³-kernel convolution on 4 frames of a 720x560 video sequence
      • All times are in milliseconds
      • Timed on a Tesla S1070 using one G280 GPU

  11. Multi-core CPU Timing Results
      • Time to compute a 3D k³-kernel convolution on 4 frames of a 720x560 video sequence
      • All times are in milliseconds
      • Timed on a dual-socket dual-core 2.4 GHz Opteron system, all four cores used

  12. Application Example: Up-resolution of Video
      • Nearest-neighbor and bilinear interpolation: fast but low quality
      • Bicubic interpolation: higher quality but computationally intensive

  13. Implementation Overview
      • Step 1: Find the coefficients of the shifted B-splines
        • Two single-pole IIR filters along each dimension
        • Implemented with recursion along scan lines
      • Step 2: Use the coefficients to interpolate the image
        • FIR filter for bicubic interpolation, implemented as a k=4 2D convolution with (2+16+2)² input tiles with halos
        • Streaming is not required due to the small 2D kernel; the on-chip cache works well as is
      • Step 3: DirectX displays the result from the GPU
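A sketch of the Step 1 recursion for the horizontal pass, with one thread per scan line so the recursion stays sequential along x while rows run in parallel. The pole z depends on the particular shifted B-spline basis and is left as a parameter, and the boundary initialization of the causal and anticausal passes is omitted; the column pass would be the analogous recursion along y.

```cuda
// In-place single-pole IIR prefilter along each row of a width x height image.
// Causal pass:     c[x] += z * c[x-1]
// Anticausal pass: c[x]  = z * (c[x+1] - c[x])
__global__ void bspline_prefilter_rows(float* img, int width, int height, float z) {
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height) return;

    float* row = img + y * width;
    for (int x = 1; x < width; ++x)                 // causal (left to right)
        row[x] += z * row[x - 1];
    for (int x = width - 2; x >= 0; --x)            // anticausal (right to left)
        row[x] = z * (row[x + 1] - row[x]);
}
```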

  14. Upconversion Results
      • Parallelized bicubic B-spline interpolation
      • Interpolates QCIF (176x144) to nearly HDTV (1232x1008)
      • Improved quality over typical bilinear interpolation
      • Improved speed over typical CPU implementations
        • Measured 350x speedup over un-optimized CPU code
        • Estimated 50x speedup over optimized CPU code, from inspection of the CPU code
      • Real-time!

      Hardware                        IIR     FIR
      CPU: Intel Pentium D            5 ms    1689 ms
      GPU: NVIDIA GeForce 8800 GTX    1 ms    4 ms

  15. Application Example: Depth-Image-Based Rendering
      • Three main steps:
        • Depth propagation
        • Color-based depth enhancement
        • Rendering

  16. Color-based Depth Enhancement
      • Pipeline: propagated depth image at the color view → depth-color bilateral filtering →
        directional disocclusion filling → occlusion removal → depth edge enhancement →
        enhanced depth image
      [Figure: before/after comparison of naïve vs. directional disocclusion filling,
       and of propagated vs. enhanced depth]

  17. Depth-Color Bilateral Filtering
      The filter weight between pixels A and B combines a spatial Gaussian and a color-range Gaussian:
      w(A, B) = G_{\sigma_s}(x_A - x_B) \cdot G_{\sigma_r}(I_A - I_B), \quad G_\sigma(t) \propto e^{-t^2 / (2\sigma^2)}
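A sketch of this filter in kernel form, assuming a single-channel (grayscale) color image, a square window of radius R, and clamped image borders; the names and signature are illustrative. Each neighbor's depth is weighted by the product of the spatial Gaussian (sigma_s) and the color-range Gaussian (sigma_r) above, and the sum is normalized by the total weight.

```cuda
__global__ void depth_color_bilateral(const float* depth, const float* color,
                                      float* out, int w, int h,
                                      int R, float sigma_s, float sigma_r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float cA  = color[y * w + x];
    float num = 0.0f, den = 0.0f;

    for (int dy = -R; dy <= R; ++dy) {
        for (int dx = -R; dx <= R; ++dx) {
            int xb = min(max(x + dx, 0), w - 1);   // clamp to image borders
            int yb = min(max(y + dy, 0), h - 1);
            float cB = color[yb * w + xb];
            // Spatial weight from pixel distance, range weight from color difference.
            float ws = expf(-(float)(dx * dx + dy * dy) / (2.0f * sigma_s * sigma_s));
            float wr = expf(-(cA - cB) * (cA - cB) / (2.0f * sigma_r * sigma_r));
            float wgt = ws * wr;
            num += wgt * depth[yb * w + xb];
            den += wgt;
        }
    }
    out[y * w + x] = num / den;
}
```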

  18. DIBR Visual Results
      [Figure: left view, right view, middle view, and the rendered view]
