Tuning Sparse Matrix Vector Multiplication for multi-core SMPs


  1. Tuning Sparse Matrix Vector Multiplication for multi-core SMPs
     BIPS, Computational Research Division
     Samuel Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)
     (1) University of California Berkeley
     (2) Lawrence Berkeley National Laboratory
     (3) Georgia Institute of Technology
     samw@cs.berkeley.edu

  2. Overview
     - Multicore is the de facto performance solution for the next decade
     - Examined the Sparse Matrix Vector Multiplication (SpMV) kernel
       - Important HPC kernel
       - Memory intensive
       - Challenging for multicore
     - Present two autotuned threaded implementations:
       - Pthread, cache-based implementation
       - Cell local store-based implementation
     - Benchmarked performance across 4 diverse multicore architectures
       - Intel Xeon (Clovertown)
       - AMD Opteron
       - Sun Niagara2
       - IBM Cell Broadband Engine
     - Compare with a leading MPI implementation (PETSc) with an autotuned serial kernel (OSKI)

  3. Sparse Matrix Vector Multiplication
     - Sparse matrix
       - Most entries are 0.0
       - Performance advantage in only storing/operating on the nonzeros
       - Requires significant metadata
     - Evaluate y = Ax
       - A is a sparse matrix
       - x and y are dense vectors
     - Challenges
       - Difficult to exploit ILP (bad for superscalar)
       - Difficult to exploit DLP (bad for SIMD)
       - Irregular memory access to source vector
       - Difficult to load balance
       - Very low computational intensity (often >6 bytes/flop)
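To make the kernel on this slide concrete, here is a minimal sketch of y = Ax in compressed sparse row (CSR) form. CSR is one common sparse storage format, picked here only for illustration; the slides do not say which format the tuned implementations use. The names `spmv_csr`, `bytes_per_flop`, `row_ptr`, `col_idx`, and `values` are hypothetical, and the 8-byte value and 4-byte index sizes are assumptions.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for A stored in CSR form (illustrative only)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx: the irregular source-vector access.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def bytes_per_flop(value_bytes=8, index_bytes=4):
    """Matrix bytes moved per flop: one value and one column index
    per nonzero, and each nonzero costs one multiply plus one add."""
    return (value_bytes + index_bytes) / 2.0


# 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Under the assumed sizes, each nonzero moves 12 bytes of matrix data for 2 flops, which gives 6 bytes/flop before any vector or row-pointer traffic is counted. That is consistent with the "often >6 bytes/flop" figure above. The `col_idx` array is the metadata cost the slide refers to.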

  4. Test Suite
     - Dataset (Matrices)
     - Multicore SMPs

  5. Matrices Used
     - Dense: 2K x 2K dense matrix stored in sparse format
     - Well structured (sorted by nonzeros/row): Protein, FEM / Spheres, FEM / Cantilever, Wind Tunnel, FEM / Harbor, QCD, FEM / Ship, Economics, Epidemiology
     - Poorly structured hodgepodge: FEM / Accelerator, Circuit, webbase
     - Extreme aspect ratio (linear programming): LP

     - Pruned original SPARSITY suite down to 14
     - None should fit in cache
     - Subdivided them into 4 categories
     - Rank ranges from 2K to 1M
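As a rough check on the "none should fit in cache" criterion, the following sketch estimates the storage footprint of the dense test matrix when it is held in a sparse format. It assumes CSR storage with 8-byte values and 4-byte indices, and reads "2K" as 2048; none of these are stated on the slide. The function name `csr_footprint_bytes` is hypothetical, and the 16MB comparison point is the largest cache figure that appears in the system slides of this deck.

```python
def csr_footprint_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Approximate CSR storage: values + column indices + row pointers."""
    values = nnz * value_bytes
    col_idx = nnz * index_bytes
    row_ptr = (n_rows + 1) * index_bytes
    return values + col_idx + row_ptr


n = 2048                                   # assumption: "2K" read as 2048
dense_csr = csr_footprint_bytes(n, n * n)  # every entry stored as a nonzero
dense_raw = n * n * 8                      # same matrix as a plain dense array
largest_cache = 16 * 1024 * 1024           # 16MB figure from the system slides

print(f"CSR footprint:   {dense_csr / 2**20:.1f} MiB")
print(f"dense footprint: {dense_raw / 2**20:.1f} MiB")
print(f"fits in 16MB cache: {dense_csr <= largest_cache}")
```

Under these assumptions the dense case takes about 48 MiB in CSR, versus 32 MiB as a plain array, so it exceeds the 16MB cache figure and the index metadata adds half again to the data volume.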

  6. Multicore SMP Systems (block diagrams of the four test systems)
     - Intel Clovertown: 8 Core2 cores; four 4MB shared L2 caches; two FSBs at 10.6GB/s each into a chipset (4x64b controllers); 21.3 GB/s (read) and 10.6 GB/s (write) to Fully Buffered DRAM
     - AMD Opteron: 4 Opteron cores, each with a 1MB victim cache; two Memory Controller / HT blocks linked at 4GB/s (each direction); 10.6GB/s from each to its DDR2 DRAM
     - Sun Niagara2: 8 MT UltraSparc cores, each with an FPU and an 8K D$; crossbar switch at 179 GB/s (fill) and 90 GB/s (writethru); 4MB shared L2 (16 way); 4x128b FBDIMM memory controllers; 42.7GB/s (read) and 21.3 GB/s (write) to Fully Buffered DRAM
     - IBM Cell Blade: 2 PPEs, each with a 512K L2; 16 SPEs, each with 256K and an MFC; EIB (ring network) on each chip; BIF link at <<20GB/s each direction; 25.6GB/s from each chip to its XDR DRAM

  7. Multicore SMP Systems (memory hierarchy)
     Same system diagram as slide 6, annotated:
     - Intel Clovertown, AMD Opteron, Sun Niagara2: conventional cache-based memory hierarchy
     - IBM Cell Blade: disjoint local store memory hierarchy

  8. Multicore SMP Systems (cache)
     Same system diagram as slide 6, annotated:
     - Intel Clovertown: 16MB (vectors fit)
     - AMD Opteron: 4MB
     - Sun Niagara2: 4MB
     - IBM Cell Blade: 4MB (local store)

  9. Multicore SMP Systems (peak flops)
     Same system diagram as slide 6, annotated:
     - Intel Clovertown: 75 Gflop/s (w/SIMD)
     - AMD Opteron: 17 Gflop/s
     - Sun Niagara2: 11 Gflop/s
     - IBM Cell Blade: 29 Gflop/s (w/SIMD)

  10. Multicore SMP Systems (peak read bandwidth)
     Same system diagram as slide 6, annotated:
     - Intel Clovertown: 21 GB/s
     - AMD Opteron: 21 GB/s
     - Sun Niagara2: 43 GB/s
     - IBM Cell Blade: 51 GB/s
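The peak flop and peak read bandwidth figures on the last two slides can be combined with the ">6 bytes/flop" intensity from slide 3 into a simple bandwidth bound on SpMV. The sketch below is back-of-the-envelope arithmetic, not a measured result. It assumes exactly 6 bytes/flop, and the pairing of each figure with a system follows the diagram layout as read from the slides. The names `SYSTEMS` and `spmv_bound_gflops` are hypothetical.

```python
# (peak Gflop/s, peak read GB/s) per system, as read from the slides.
SYSTEMS = {
    "Intel Clovertown": (75.0, 21.0),
    "AMD Opteron": (17.0, 21.0),
    "Sun Niagara2": (11.0, 43.0),
    "IBM Cell Blade": (29.0, 51.0),
}


def spmv_bound_gflops(peak_gflops, read_gbs, bytes_per_flop=6.0):
    """Upper bound on SpMV rate: the lower of the compute peak and
    the rate at which memory can feed bytes_per_flop bytes per flop."""
    return min(peak_gflops, read_gbs / bytes_per_flop)


bounds = {name: spmv_bound_gflops(*spec) for name, spec in SYSTEMS.items()}

for name, bound in bounds.items():
    peak = SYSTEMS[name][0]
    print(f"{name:17s} bound {bound:4.1f} Gflop/s ({100 * bound / peak:4.1f}% of peak)")
```

Under these assumptions every system is bound by memory bandwidth rather than by compute: the Clovertown bound is 3.5 Gflop/s against a 75 Gflop/s peak, which is why the deck treats SpMV as memory intensive.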
