Tuning Sparse Matrix Vector Multiplication for multi-core SMPs

Computational Research Division

Tuning Sparse Matrix Vector Multiplication for multi-core SMPs

Samuel Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)

(1) University of California Berkeley
(2) Lawrence Berkeley National Laboratory
(3) Georgia Institute of Technology

samw@cs.berkeley.edu

Overview
- Multicore is the de facto performance solution for the next decade
- Examined the Sparse Matrix Vector Multiplication (SpMV) kernel
  - Important HPC kernel
  - Memory intensive
  - Challenging for multicore
- Present two autotuned threaded implementations:
  - Pthread, cache-based implementation
  - Cell local store-based implementation
- Benchmarked performance across 4 diverse multicore architectures
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2
  - IBM Cell Broadband Engine
- Compare with leading MPI implementation (PETSc) with an autotuned serial kernel (OSKI)

Sparse Matrix Vector Multiplication
- Sparse matrix
  - Most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant metadata
- Evaluate y = Ax
  - A is a sparse matrix
  - x and y are dense vectors
- Challenges
  - Difficult to exploit ILP (bad for superscalar)
  - Difficult to exploit DLP (bad for SIMD)
  - Irregular memory access to source vector
  - Difficult to load balance
  - Very low computational intensity (often >6 bytes/flop)
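As a minimal sketch of the kernel described on this slide, the following assumes compressed sparse row (CSR) storage, which the slide itself does not name. The names `spmv_csr`, `row_ptr`, `col_idx`, `values`, and `matrix_bytes_per_flop` are illustrative only. It also shows one way a figure like 6 bytes/flop can arise, assuming 8-byte values and 4-byte column indices: each nonzero costs 12 bytes of matrix traffic for 2 flops, before any vector or row-pointer traffic is counted.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A*x for a matrix A held in CSR form (assumed format)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # The nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx: the irregular source-vector access.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def matrix_bytes_per_flop(value_bytes=8, index_bytes=4):
    """Matrix traffic per flop: one value and one index per multiply-add."""
    return (value_bytes + index_bytes) / 2.0


# 3x3 example: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
```

The inner loop does one multiply and one add per stored nonzero, and the length of that loop varies from row to row, which is one reason ILP, DLP, and load balance are hard to exploit.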

Test Suite
- Dataset (Matrices)
- Multicore SMPs

Matrices Used
- Dense: 2K x 2K dense matrix stored in sparse format
- Well structured (sorted by nonzeros/row): Protein, FEM / Spheres, FEM / Cantilever, Wind Tunnel, FEM / Harbor, QCD, FEM / Ship, Economics, Epidemiology
- Poorly structured hodgepodge: FEM / Accelerator, Circuit, webbase
- Extreme aspect ratio (linear programming): LP

- Pruned original SPARSITY suite down to 14
- None should fit in cache
- Subdivided them into 4 categories
- Rank ranges from 2K to 1M
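A rough check of the "none should fit in cache" criterion for the smallest matrix in the suite, the dense one, might look like the sketch below. It rests on assumptions the slide does not state: CSR storage with 8-byte values and 4-byte indices, and "2K" read as 2000. The 16MB figure is the largest aggregate cache size shown in the system diagrams of this deck. The helper name `csr_footprint_bytes` is hypothetical.

```python
def csr_footprint_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Approximate CSR storage: values + column indices + row pointers."""
    values_and_indices = nnz * (value_bytes + index_bytes)
    row_pointers = (n_rows + 1) * index_bytes
    return values_and_indices + row_pointers


dense_rank = 2000  # assumption: "2K" taken as 2000
dense_bytes = csr_footprint_bytes(dense_rank, dense_rank * dense_rank)

largest_cache_bytes = 16 * 2**20  # 16MB aggregate cache
fits_in_cache = dense_bytes <= largest_cache_bytes
```

Under these assumptions the dense matrix needs about 48 MB, roughly three times the largest cache, so even the 2K x 2K case has to stream from memory.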

Multicore SMP Systems
[Figure: block diagrams of the four systems. Labels recovered from the diagrams:]
- Intel Clovertown: 8 Core2 cores; 4 x 4MB shared L2; 2 FSBs at 10.6GB/s each; chipset (4x64b controllers); 21.3 GB/s (read), 10.6 GB/s (write); Fully Buffered DRAM
- AMD Opteron: 4 Opteron cores, each with a 1MB victim cache; 2 x memory controller / HT; 4GB/s (each direction); 10.6GB/s to each DDR2 DRAM
- Sun Niagara2: 8 MT UltraSparc cores, each with an FPU and 8K D$; crossbar switch, 179 GB/s (fill) and 90 GB/s (writethru); 4MB shared L2 (16 way); 4x128b FBDIMM memory controllers; 42.7GB/s (read), 21.3 GB/s (write); Fully Buffered DRAM
- IBM Cell Blade: 2 PPEs, each with 512K L2; 16 SPEs, each with 256K and an MFC; 2 EIBs (ring network); BIF, <<20GB/s each direction; XDR at 25.6GB/s to each XDR DRAM

Multicore SMP Systems (memory hierarchy)
[Figure: the same four system diagrams, annotated by memory hierarchy type.]
- Intel Clovertown, AMD Opteron, Sun Niagara2: conventional cache-based memory hierarchy
- IBM Cell Blade: disjoint local store memory hierarchy

Multicore SMP Systems (cache)
[Figure: the same four system diagrams, annotated with aggregate cache capacity.]
- Intel Clovertown: 16MB (vectors fit)
- AMD Opteron: 4MB
- Sun Niagara2: 4MB
- IBM Cell Blade: 4MB (local store)

Multicore SMP Systems (peak flops)
[Figure: the same four system diagrams, annotated with peak floating-point rate.]
- Intel Clovertown: 75 Gflop/s (w/SIMD)
- AMD Opteron: 17 Gflop/s
- Sun Niagara2: 11 Gflop/s
- IBM Cell Blade: 29 Gflop/s (w/SIMD)

Multicore SMP Systems (peak read bandwidth)
[Figure: the same four system diagrams, annotated with peak read bandwidth.]
- Intel Clovertown: 21 GB/s
- AMD Opteron: 21 GB/s
- Sun Niagara2: 43 GB/s
- IBM Cell Blade: 51 GB/s
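Combining the peak flop and peak read bandwidth annotations with the roughly 6 bytes/flop intensity quoted earlier gives a back-of-the-envelope ceiling on SpMV performance. The sketch below is an estimate, not a result from the talk: it assumes exactly 6 bytes/flop, assumes peak read bandwidth is fully achievable, and relies on the pairing of each figure with its system as read from the diagram layout. The names `PEAK` and `spmv_bound_gflops` are hypothetical.

```python
# (peak Gflop/s, peak read GB/s) per system, as annotated on the slides.
PEAK = {
    "Intel Clovertown": (75.0, 21.0),
    "AMD Opteron": (17.0, 21.0),
    "Sun Niagara2": (11.0, 43.0),
    "IBM Cell Blade": (29.0, 51.0),
}


def spmv_bound_gflops(peak_gflops, read_gbs, bytes_per_flop=6.0):
    """Upper bound: the lesser of the compute peak and the bandwidth limit."""
    return min(peak_gflops, read_gbs / bytes_per_flop)


bounds = {name: spmv_bound_gflops(f, b) for name, (f, b) in PEAK.items()}

for name, gflops in bounds.items():
    peak = PEAK[name][0]
    print(f"{name}: <= {gflops:.1f} Gflop/s ({100 * gflops / peak:.0f}% of peak)")
```

Under these assumptions every system is bandwidth-limited well below its compute peak, which is consistent with the deck's description of SpMV as memory intensive.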
