
Fast sparse matrix–vector multiplication by partitioning and reordering
Albert-Jan Yzelman, June 2011


  1. Bulk Synchronous Parallel: Example: sparse matrix, dense vector multiplication. Step 1 (fan-out): request the needed elements of x from the appropriate processors. Step 2 (mv): use the received elements of x for multiplication. Step 3 (fan-in): send local results to the correct processors; here, y is distributed cyclically, obviously a bad choice. [Figure: the three supersteps on a distributed sparse matrix.]

  2. Bulk Synchronous Parallel: Example: sparse matrix, dense vector multiplication. The algorithm:
     1. For all nonzeroes k of A: if the column of k is not local, request the corresponding element of x from the appropriate processor. Synchronise.
     2. For all nonzeroes k of A: do the SpMV for k. Send all non-local row sums to the correct processor. Synchronise.
     3. Add all incoming row sums to the corresponding y[i].
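The three supersteps above can be sketched as a sequential simulation. This is a minimal illustrative model, not the talk's software: the round-robin nonzero distribution and the way messages are delivered are assumptions for the sketch.

```python
# A sequential sketch of the three BSP supersteps for y = A*x.
# The round-robin nonzero distribution is an illustrative assumption,
# not the partitioning discussed on the slides.
from collections import defaultdict

def bsp_spmv(triplets, x, P):
    n = len(x)
    local = [triplets[s::P] for s in range(P)]      # nonzeroes per processor

    # Superstep 1 (fan-out): each processor requests the x-entries it needs.
    got_x = [{j: x[j] for (_, j, _) in local[s]} for s in range(P)]

    # Superstep 2 (mv): local multiplication into per-processor row sums,
    # then "send" each partial row sum to the owner of that row of y.
    inbox = defaultdict(float)
    for s in range(P):
        sums = defaultdict(float)
        for (i, j, v) in local[s]:
            sums[i] += v * got_x[s][j]
        for i, val in sums.items():
            inbox[i] += val                          # fan-in message

    # Superstep 3 (fan-in): add all incoming row sums into y.
    y = [0.0] * n
    for i, val in inbox.items():
        y[i] += val
    return y
```

Regardless of how the nonzeroes are distributed, the partial row sums add up to the full product, which is exactly what the fan-in superstep guarantees.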

  3. Bulk Synchronous Parallel > MulticoreBSP: For multicore processors, the original model (P processor–memory pairs connected by a communication network) may no longer apply. [Figure: the classic BSP machine model.]

  4. Bulk Synchronous Parallel > MulticoreBSP: The AMD Phenom II 945e processor has uniform memory access: four cores, each with a 64kB L1 cache and a 512kB L2 cache, sharing a 6MB L3 cache. This system interface is modelled well by BSP.

  5. Bulk Synchronous Parallel > MulticoreBSP: The Intel Core 2 Q6600 processor has cache-coherent non-uniform memory access (cc-NUMA): four cores, each with a 32kB L1 cache, and two 4MB L2 caches, each shared by a pair of cores. This system interface is not modelled well by BSP. Leslie G. Valiant, A bridging model for multi-core computing, Lecture Notes in Computer Science, vol. 5193, Springer (2008), pp. 13–28.

  6. Bulk Synchronous Parallel > MulticoreBSP: The primitives.
     Ask for environment variables: bsp_nprocs(), bsp_pid().
     Synchronise: bsp_sync().
     Perform "direct" remote memory access (DRMA): bsp_put(source, dest, dest_PID), bsp_get(source, source_PID, dest), bsp_direct_get(source, source_PID, dest).
     Send messages, synchronously (BSMP): bsp_send(data, dest_PID), bsp_qsize(), bsp_move().

  7. Bulk Synchronous Parallel > MulticoreBSP: MulticoreBSP brings BSP programming to shared-memory architectures. Programmed in standard Java (5 and up), it is a fully object-oriented library containing only 10 primitives, 2 purely virtual functions (parallel_part and sequential_part), and 2 interfaces. Data types that can be communicated are defined through an interface. This makes MulticoreBSP transparent and easy to learn, gives it predictable performance, makes it robust (no data races, no deadlocks), and makes it potentially usable for both shared- and distributed-memory systems.

  8. Bulk Synchronous Parallel > MulticoreBSP: Alternative (2-step) SpMV algorithm in MulticoreBSP:
     1. For all nonzeroes k of A: if both the row and column of k are local, do the SpMV for k; if the column of k is not local, direct-get the element from x and then do the SpMV for k. Send all non-local row sums to the correct processor. Synchronise.
     2. Add all incoming row sums to the corresponding y[i].

  9. Bulk Synchronous Parallel > MulticoreBSP: Software is available at http://www.multicorebsp.com. Yzelman and Bisseling, An Object-Oriented BSP Library for Multicore Programming, Concurrency and Computation: Practice and Experience, 2011 (accepted for publication).

  10. Partitioning: Outline: 1. Bulk Synchronous Parallel; 2. Partitioning; 3. Sequential SpMV; 4. Parallel cache-friendly SpMV; 5. Experimental results.

  11. Partitioning > Automatic nonzero partitioning: What causes the communication? Nonzeroes on the same column distributed to different processors: fan-out communication. [Figure: a column shared between two processors.]

  12. Partitioning > Automatic nonzero partitioning: What causes the communication? Nonzeroes on the same row distributed to different processors: fan-in communication. [Figure: a row shared between two processors.]

  13. Partitioning > Automatic nonzero partitioning: Load balancing. Let w_i(s) be the workload of processor s ∈ [0, P−1] in superstep i, and let w̄_i = (1/P) Σ_s w_i(s) be the average workload. The load-balance constraint is
      max_{i,s} | w̄_i − w_i(s) | ≤ ε w̄_i,
      where ε is the maximum load-imbalance parameter.
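As a concrete check, the constraint above can be evaluated directly on a table of per-superstep workloads. This helper is illustrative only; its name and input layout are assumptions, not part of the talk's software.

```python
# Check the load-balance constraint max_{i,s} |w̄_i - w_i(s)| <= eps * w̄_i,
# where workloads[i][s] is the work of processor s in superstep i and
# w̄_i is the average over the P processors in that superstep.

def load_balanced(workloads, eps):
    for per_step in workloads:
        avg = sum(per_step) / len(per_step)   # w̄_i for this superstep
        if any(abs(avg - w) > eps * avg for w in per_step):
            return False
    return True
```

For example, workloads (10, 20, 10, 0) average to 10, so the deviation of 10 violates any imbalance parameter ε below 1.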

  14. Partitioning > Automatic nonzero partitioning: "Shared" columns: communication during fan-out. Column-net model; a cut net means a shared column. [Figure: the matrix as a column-net hypergraph.]

  15. Partitioning > Automatic nonzero partitioning: "Shared" rows: communication during fan-in. Row-net model; a cut net means a shared row. [Figure: the matrix as a row-net hypergraph.]

  16. Partitioning > Automatic nonzero partitioning: Catch communication both ways: fine-grain model; a cut net means either a shared row or a shared column. [Figure: the matrix as a fine-grain hypergraph, one vertex per nonzero.]

  17. Partitioning: A cut net n_i means communication. The number of processors involved is λ_i = #{ j : V_j ∩ n_i ≠ ∅ }. So the quantity to minimise is C = Σ_i (λ_i − 1). Partitioning strategy: model the sparse matrix using a hypergraph; partition the vertices of that hypergraph in two so that C is minimised under the load-balance constraint.
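The (λ − 1) metric above is easy to compute once each vertex has been assigned a part. The sketch below is generic; the representation of nets and parts is an assumption for illustration.

```python
# The (lambda - 1) communication metric from the slide: for each net n_i,
# lambda_i is the number of distinct parts its vertices fall into, and
# the total cost is C = sum_i (lambda_i - 1).

def cut_cost(nets, part):
    """nets: iterable of vertex lists; part: maps vertex -> part id."""
    return sum(len({part[v] for v in net}) - 1 for net in nets)
```

An uncut net touches one part and contributes zero; a net cut over two parts contributes one, matching the number of extra processors that must communicate for it.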

  18. Partitioning: Partitioning strategy: model the sparse matrix using a hypergraph; partition the vertices of that hypergraph in two.
      Kernighan & Lin, An efficient heuristic procedure for partitioning graphs, Bell System Technical Journal 49 (1970).
      Fiduccia & Mattheyses, A linear-time heuristic for improving network partitions, Proceedings of the 19th IEEE Design Automation Conference (1982).
      Çatalyürek & Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems 10 (1999).
      Bisseling & Vastenhouw, A two-dimensional data distribution method for parallel sparse matrix-vector multiplication, SIAM Review 47(1) (2005).

  19. Partitioning: Mondriaan partitioning strategy: model the sparse matrix using a hypergraph; partition the vertices of the hypergraph (in two); try both the row-net and the column-net model, and choose the best. [Figure: candidate row-wise and column-wise splits.]

  20. Partitioning: Mondriaan partitioning strategy: model the sparse matrix using a hypergraph; partition the vertices of the hypergraph (in two); recursively keep partitioning the vertex parts. [Figure: a recursive bisection step.]

  21. Partitioning: Mondriaan partitioning strategy, continued. [Figure: the next recursive bisection step.]

  22. Partitioning: Mondriaan partitioning strategy, continued. [Figure: a further recursive bisection step.]

  23. Partitioning: Mondriaan: model the sparse matrix using a hypergraph; partition the vertices of the hypergraph (in two); recursively keep partitioning the vertex parts. [Figure: the resulting Mondriaan-style block distribution.] Brendan Vastenhouw and Rob H. Bisseling, A Two-Dimensional Data Distribution Method for Parallel Sparse Matrix-Vector Multiplication, SIAM Review, Vol. 47, No. 1 (2005), pp. 67–95.

  24. Partitioning: Mondriaan: model the sparse matrix using a hypergraph; partition the vertices of the hypergraph (in two); recursively keep partitioning the vertex parts; partition the vector elements. [Figure: distribution of the input and output vectors over the processors.] Rob H. Bisseling and Wouter Meesen, Communication balancing in parallel sparse matrix-vector multiplication, Electronic Transactions on Numerical Analysis, Vol. 21 (2005), pp. 47–65.

  25. Sequential SpMV: Outline: 1. Bulk Synchronous Parallel; 2. Partitioning; 3. Sequential SpMV; 4. Parallel cache-friendly SpMV; 5. Experimental results.

  26. Sequential SpMV: Realistic cache, 1 < k < L, combining modulo-mapping and the LRU policy. [Figure: main memory (RAM) modulo-mapped onto subcaches, each managed as an LRU stack.]

  27. Sequential SpMV > The dense case: Dense matrix–vector multiplication, y = A·x with A = (a_ij) a 4×4 matrix. Example with k = L = 3: after accessing x_0, the LRU stack holds [x_0].

  28. Sequential SpMV > The dense case: Dense matrix–vector multiplication, y = A·x. Example with k = L = 3: after accessing x_0 and a_00, the LRU stack holds [a_00, x_0].

  29. Sequential SpMV > The dense case: Dense matrix–vector multiplication, y = A·x. Example with k = L = 3: after accessing x_0, a_00 and y_0, the LRU stack holds [y_0, a_00, x_0].

  30. Sequential SpMV > The dense case: Dense matrix–vector multiplication, y = A·x. Example with k = L = 3: the next access, x_1, evicts x_0; the LRU stack holds [x_1, y_0, a_00].

  31. Sequential SpMV > The dense case: Dense matrix–vector multiplication, y = A·x. Example with k = L = 3: accessing a_01 then evicts a_00; the LRU stack holds [a_01, x_1, y_0].

  32. Sequential SpMV > The dense case: Dense matrix–vector multiplication, y = A·x. Example with k = L = 3: accessing y_0 again is a cache hit; y_0 moves to the top, and the LRU stack holds [y_0, a_01, x_1].

  33. Sequential SpMV > The sparse case: Standard data structure: Compressed Row Storage (CRS).

  34. Sequential SpMV > The sparse case: Sparse matrix–vector multiplication (SpMV). [Figure: the first access, x_?, at an unpredictable index.]

  35. Sequential SpMV > The sparse case: Sparse matrix–vector multiplication (SpMV). [Figure: the LRU stack fills with x_?, a_0?, y_0, the vector indices being unpredictable.]

  36. Sequential SpMV > The sparse case: Sparse matrix–vector multiplication (SpMV). [Figure: further accesses x_?, a_??, again at unpredictable indices.]

  37. Sequential SpMV > The sparse case: Sparse matrix–vector multiplication (SpMV). [Figure: the LRU stack under unpredictable accesses.] We cannot predict memory accesses in the sparse case! Adapt sparse matrix data structures for locality and lower bandwidth?

  38. Sequential SpMV > CRS:
      A = ( 4 1 3 0
            0 0 2 3
            1 0 0 2
            7 0 1 1 )
      Stored as:
      nzs: [4 1 3 2 3 1 2 7 1 1]
      col: [0 1 2 2 3 0 3 0 2 3]   (2 nnz + (m + 1) accesses)
      row: [0 3 5 7 10]
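The CRS arrays above drive the standard SpMV kernel: row[i]..row[i+1] delimits the nonzeroes of row i. A minimal textbook sketch, not the talk's implementation:

```python
# CRS SpMV: nzs holds the nonzero values row by row, col their column
# indices, and row[i]..row[i+1] the index range of row i.

def crs_spmv(nzs, col, row, x):
    m = len(row) - 1
    y = [0.0] * m
    for i in range(m):
        for k in range(row[i], row[i + 1]):   # nonzeroes of row i
            y[i] += nzs[k] * x[col[k]]
    return y
```

Each nonzero costs one access to nzs and one to col, plus one row-pointer access per row, giving the 2 nnz + (m + 1) count from the slide.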

  39. Sequential SpMV > Incremental CRS:
      A = ( 4 1 3 0
            0 0 2 3
            1 0 0 2
            7 0 1 1 )
      Stored as:
      nzs: [4 1 3 2 3 1 2 7 1 1]
      col increment: [0 1 1 4 1 1 3 1 2 1]   (2 nnz + m accesses)
      row increment: [0 1 1 1]
      Note: accesses like plain CRS, but the SpMV requires fewer instructions.
      Reference: Joris Koster, Parallel templates for numerical linear algebra, a high-performance computation library (Master's thesis), 2002.
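In incremental CRS the column cursor advances by the stored increments; when it overflows past n it wraps around and the row cursor advances by the next row increment. A sketch under that reading of the slide's arrays (the exact kernel in the cited thesis may differ in detail):

```python
# Incremental CRS SpMV: col_inc stores column-index differences, with a
# wrap-around of n encoding a jump to the next row; row_inc stores the
# row-index differences consumed at each wrap (row_inc[0] sets the start).

def icrs_spmv(nzs, col_inc, row_inc, x, m):
    n = len(x)
    y = [0.0] * m
    i, j, r = row_inc[0], 0, 1        # row cursor, column cursor, row_inc pointer
    for k in range(len(nzs)):
        j += col_inc[k]
        while j >= n:                 # column overflow: move to the next row
            j -= n
            i += row_inc[r]
            r += 1
        y[i] += nzs[k] * x[j]
    return y
```

The inner loop needs only pointer bumps and one comparison, which is where the instruction savings over index-array CRS come from.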

  40. Sequential SpMV > Blocked CRS:
      A = ( 4 1 3 0
            0 0 2 3
            1 0 0 2
            7 0 1 1 ),  dense blocks: (4, 1, 3), (2, 3), (1), (2), (7, 0, 1, 1)
      Stored as:
      nzs: [4 1 3 2 3 1 2 7 0 1 1]
      blk: [0 3 5 6 7 11]   (nnz + (2 nblk + 1) + (m + 1) accesses)
      col: [0 2 0 3 0]
      row: [0 1 2 4 5]
      Reference: Pinar and Heath, Improving Performance of Sparse Matrix-Vector Multiplication, 1999.
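A block here is a contiguous run of (possibly explicitly stored zero) values within a row, so only its starting column is indexed. A generic sketch over the slide's arrays, not the cited data structure verbatim:

```python
# Blocked CRS SpMV: blk[b]..blk[b+1] delimits the values of dense block b,
# bcol[b] is the block's starting column, and brow[i]..brow[i+1] lists the
# blocks belonging to row i.

def bcrs_spmv(nzs, blk, bcol, brow, x):
    m = len(brow) - 1
    y = [0.0] * m
    for i in range(m):
        for b in range(brow[i], brow[i + 1]):     # dense blocks in row i
            for t in range(blk[b], blk[b + 1]):   # contiguous columns
                y[i] += nzs[t] * x[bcol[b] + t - blk[b]]
    return y
```

Note the explicit zero in block (7, 0, 1, 1): storing it costs one multiply but saves a column index, the trade-off blocking exploits.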

  41. Sequential SpMV > Fractal data structures (triplets):
      A = ( 4 1 0 2
            0 2 0 3
            1 0 0 2
            7 0 1 0 )
      Stored as:
      nzs: [7 1 4 1 2 2 3 2 1]
      i:   [3 2 0 0 1 0 1 2 3]   (3 accesses per nonzero, 3 nnz in total)
      j:   [0 0 0 1 1 3 3 3 2]
      Reference: Haase, Liebmann and Plank, A Hilbert-Order Multiplication Scheme for Unstructured Sparse Matrices, 2005.
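Because each triplet carries its own row and column index, the nonzeroes can be stored in any order, e.g. along a Hilbert curve as in the cited scheme. The kernel below is the plain triplet SpMV; the Hilbert ordering is assumed to have been applied to the arrays already:

```python
# Triplet (coordinate) SpMV: nonzero k has value nzs[k] at position
# (rows[k], cols[k]); the storage order is arbitrary.

def triplet_spmv(nzs, rows, cols, x, m):
    y = [0.0] * m
    for v, i, j in zip(nzs, rows, cols):
        y[i] += v * x[j]
    return y
```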

  42. Sequential SpMV > Zig-zag CRS: Change the traversal order of CRS. [Figure: alternating left-to-right and right-to-left row traversal.]

  43. Sequential SpMV > Zig-zag CRS:
      A = ( 4 1 3 0
            0 0 2 3
            1 0 0 2
            7 0 1 1 )
      Stored as:
      nzs: [4 1 3 3 2 1 2 1 1 7]
      col: [0 1 2 3 2 0 3 3 2 0]   (2 nnz + (m + 1) accesses)
      row: [0 3 5 7 10]
      Reference: Yzelman and Bisseling, Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, SISC, 2009.
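The zig-zag arrays above are plain CRS with every odd row reversed, so a column index accessed at the end of one row is reused at the start of the next. An illustrative conversion helper (name assumed) reproducing the slide's example:

```python
# Convert plain CRS into zig-zag CRS order: even rows keep their
# left-to-right order, odd rows are reversed. The row pointers and the
# SpMV kernel itself are unchanged; only the traversal order differs.

def to_zigzag(nzs, col, row):
    zz_nzs, zz_col = [], []
    for i in range(len(row) - 1):
        ks = list(range(row[i], row[i + 1]))
        if i % 2 == 1:
            ks.reverse()                 # odd rows traversed right-to-left
        zz_nzs += [nzs[k] for k in ks]
        zz_col += [col[k] for k in ks]
    return zz_nzs, zz_col, row
```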

  44. Sequential SpMV: Why not also change the input matrix structure? Assume zig-zag CRS ordering (theoretically); allow only row and column permutations.

  45. Sequential SpMV > Separated Block Diagonal form. [Figure: a matrix permuted into SBD form.]

  46. Sequential SpMV > Separated Block Diagonal form. [Figure: SBD blocks annotated with their cost: no cache misses, 1 cache miss per row, 3 cache misses per row, 1 cache miss per row.]

  47. Sequential SpMV > Separated Block Diagonal form. [Figure: after recursion, the blocks incur no cache misses, 1, 3, 1, 7, 1, 3, and 1 cache misses per row, respectively.]

  48. Sequential SpMV > Separated Block Diagonal form. [Figure: SBD form with column parts 1–4.] (Upper bound on) the number of cache misses: Σ_i (λ_i − 1).

  49. Sequential SpMV > Separated Block Diagonal form: In 1D, row and column permutations bring the original matrix A into Separated Block Diagonal (SBD) form as follows. A is modelled as a hypergraph H = (V, N), with V the set of columns of A and N the set of hyperedges; each hyperedge is a subset of V and corresponds to a row of A. A partitioning V_1, V_2 of V is constructed, and from it three hyperedge categories: N_row^−, the set of hyperedges with vertices only in V_1; N_row^c, the set of hyperedges with vertices in both V_1 and V_2; and N_row^+, the set of remaining hyperedges.
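The three row categories can be computed directly from the column partition. A small illustrative helper (names and input layout assumed):

```python
# Classify the rows (hyperedges) of A into the SBD categories:
# N_row^- (columns only in V1), N_row^c (columns in both V1 and V2, the
# cut rows), and N_row^+ (columns only in V2). 'rows' lists the column
# indices of each row; V1 is the first column part.

def sbd_row_categories(rows, V1):
    minus, cut, plus = [], [], []
    for r, cols in enumerate(rows):
        in1 = any(c in V1 for c in cols)
        in2 = any(c not in V1 for c in cols)
        (cut if in1 and in2 else minus if in1 else plus).append(r)
    return minus, cut, plus
```

Permuting the rows in the order N_row^−, N_row^c, N_row^+ (and the columns as V_1, V_2) yields the SBD form shown on the next slide.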

  50. Sequential SpMV > Separated Block Diagonal form. [Figure: the SBD form, with rows ordered N_row^−, N_row^c, N_row^+ and columns ordered V_1, V_2.]

  51. Sequential SpMV > Permuting to SBD form: Input. [Figure: the original sparse matrix.]

  52. Sequential SpMV > Permuting to SBD form: Column partitioning. [Figure: the columns assigned to two parts.]

  53. Sequential SpMV > Permuting to SBD form: Column permutation. [Figure: the columns of each part grouped together.]

  54. Sequential SpMV > Permuting to SBD form: Mixed row detection. [Figure: rows with nonzeroes in both column parts marked.]

  55. Sequential SpMV > Permuting to SBD form: Row permutation. [Figure: the mixed rows permuted to the middle separator.]

  56. Sequential SpMV > Permuting to SBD form: Column subpartitioning. [Figure: each column part partitioned again.]

  57. Sequential SpMV > Permuting to SBD form: Column permutation. [Figure: the columns regrouped within each part.]

  58. Sequential SpMV > Permuting to SBD form: No mixed rows; row permutation. [Figure: the rows reordered within each part.]

  59–69. Sequential SpMV > Permuting to SBD form: Continued. [Figures: the column partitioning, column permutation, mixed-row detection, and row permutation steps applied recursively within each part, yielding the final recursive SBD structure.]

  70. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Permuting to SBD form Continued �� �� � � � � �� �� � � � � �� �� � � � � �� �� � � � � �� �� � � � � �� �� �� �� � � � � �� �� � � � � �� �� � � � � �� �� �� �� �� �� � � � � �� �� �� �� �� �� � � � � �� �� �� �� �� �� �� �� � � � � � �� �� � � � �� � � �� � � �� �� �� �� � � � � �� �� �� �� � � � � �� �� �� �� � � � �� �� � �� �� � � �� �� � � �� �� � � �� � �� � �� �� �� �� � � � � �� �� � � � � �� �� � � � � �� �� � � � � �� �� � � � � �� �� �� �� � � � � �� �� � � � � �� �� Albert-Jan Yzelman

  71. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Reordering parameters Taking p = n S , the number of cache misses is strictly bounded by ∑_{i : n_i ∈ N} (λ_i − 1); taking p → ∞ yields a cache-oblivious method with the same bound. References: Yzelman and Bisseling, Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, SIAM Journal on Scientific Computing, 2009 Albert-Jan Yzelman
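The connectivity-minus-one bound above can be made concrete with a small sketch (my own illustration, not code from the talk): after rows are partitioned, each matrix column acts as a net, λ_i is the number of parts that net touches, and the bound is the sum of λ_i − 1 over all nets.

```python
from collections import defaultdict

def cache_miss_bound(nonzeros, row_part):
    """nonzeros: list of (i, j) coordinates of A;
    row_part: dict mapping row i to its part index.
    Returns the connectivity-minus-one metric, sum over column nets of (lambda - 1)."""
    parts_per_col = defaultdict(set)
    for i, j in nonzeros:
        parts_per_col[j].add(row_part[i])
    return sum(len(parts) - 1 for parts in parts_per_col.values())

# toy 4x4 example: rows {0,1} form part 0, rows {2,3} form part 1;
# columns 0 and 1 are each touched by both parts, so the bound is 2
nzs  = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 3), (3, 0), (3, 3)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(cache_miss_bound(nzs, part))  # → 2
```

Only the cut nets (those with λ_i > 1) contribute, which is exactly what the hypergraph partitioner minimises.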

  72. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Two-dimensional SBD (doubly separated block diagonal) 1D 2D Yzelman and Bisseling, Two-dimensional cache-oblivious sparse matrix–vector multiplication , April 2011 (Revised pre-print); http://www.math.uu.nl/people/yzelman/publications/#pp Albert-Jan Yzelman

  73. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Two-dimensional SBD (doubly separated block diagonal) Using a fine-grain model of the input sparse matrix, individual nonzeros each correspond to a vertex; each row and column has a corresponding net. The nets fall into the categories N_row^−, N_row^c, N_row^+ and N_col^−, N_col^c, N_col^+. The quantity minimised remains ∑_i (λ_i − 1). Albert-Jan Yzelman
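The fine-grain cost can be sketched as follows (my own illustration under the model described on this slide): every nonzero is its own vertex and is assigned to a part, every row and every column is a net, and the cost sums λ − 1 over both kinds of nets.

```python
from collections import defaultdict

def fine_grain_cost(nonzeros, nz_part):
    """nonzeros: list of (i, j) coordinates; nz_part[k]: part of the k-th
    nonzero (fine-grain model: each nonzero is a separate vertex).
    Each row and each column is a net; cost = sum of (lambda - 1) over all nets."""
    row_nets, col_nets = defaultdict(set), defaultdict(set)
    for k, (i, j) in enumerate(nonzeros):
        row_nets[i].add(nz_part[k])
        col_nets[j].add(nz_part[k])
    return (sum(len(s) - 1 for s in row_nets.values())
            + sum(len(s) - 1 for s in col_nets.values()))

# 2x2 dense example: splitting by rows cuts both column nets, cost 2
print(fine_grain_cost([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 0, 1, 1]))  # → 2
```

Unlike the 1D case, a cut row net and a cut column net both cost here, which is what drives the doubly separated (row and column separator) structure.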

  74. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Two-dimensional SBD (doubly separated block diagonal) Zig-zag CRS is not suitable for handling 2D SBD! Albert-Jan Yzelman

  75. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Two-dimensional SBD [figure: sparsity plot of a matrix permuted to doubly separated block diagonal form] Albert-Jan Yzelman

  76. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Two-dimensional SBD Albert-Jan Yzelman

  77. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Two-dimensional SBD; block ordering [figure: candidate traversal orders of the 2D SBD blocks (numbered 1–7), annotated with the x- and y-vector data movement each ordering incurs] Albert-Jan Yzelman

  78. Fast sparse matrix–vector multiplication by partitioning and reordering > Sequential SpMV Bi-directional Incremental CRS (BICRS)

          ( 4 1 3 0 )
          ( 0 0 2 3 )
    A  =  ( 1 0 0 2 )
          ( 7 0 1 1 )

  Stored as:
  nzs: [3 2 3 1 1 2 1 7 4 1],
  col increment: [2 4 1 4 -1 5 -3 4 4 1],
  row increment: [0 1 2 -1 1 -3]

  2 nnz + (row jumps + 1) accesses Albert-Jan Yzelman
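The decoding rule behind these arrays can be sketched in Python (my own illustration of the scheme, not code from the talk): the first row and column increments give the starting position, and a column increment that overflows past the number of columns signals that the next row increment must be consumed. Running it on the arrays above reproduces A·x.

```python
def bicrs_spmv(n_rows, n_cols, nzs, col_incr, row_incr, x):
    """Multiply a BICRS-stored matrix with x. Column increments may be
    negative (bi-directional); an increment that pushes the column index
    past n_cols encodes a jump to the next row increment."""
    y = [0.0] * n_rows
    row = row_incr[0]            # first row increment = starting row
    col = col_incr[0]            # first column increment = starting column
    r = 1                        # next row increment to consume
    y[row] += nzs[0] * x[col]
    for k in range(1, len(nzs)):
        col += col_incr[k]
        if col >= n_cols:        # overflowed column index: row jump
            col -= n_cols
            row += row_incr[r]
            r += 1
        y[row] += nzs[k] * x[col]
    return y

# the 4x4 example matrix from the slide, multiplied with x = (1, 1, 1, 1):
nzs      = [3, 2, 3, 1, 1, 2, 1, 7, 4, 1]
col_incr = [2, 4, 1, 4, -1, 5, -3, 4, 4, 1]
row_incr = [0, 1, 2, -1, 1, -3]
print(bicrs_spmv(4, 4, nzs, col_incr, row_incr, [1.0] * 4))
# → [8.0, 5.0, 3.0, 9.0], the row sums of A
```

For this matrix the access count works out to 2 · 10 + (5 + 1) = 26, matching the 2 nnz + (row jumps + 1) bound: there are 5 row jumps, so the row-increment array holds 6 entries.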

  79. Fast sparse matrix–vector multiplication by partitioning and reordering > Parallel cache-friendly SpMV Parallel cache-friendly SpMV 1 Bulk Synchronous Parallel 2 Partitioning 3 Sequential SpMV 4 Parallel cache-friendly SpMV 5 Experimental results Albert-Jan Yzelman

  80. Fast sparse matrix–vector multiplication by partitioning and reordering > Parallel cache-friendly SpMV On distributed-memory architectures Directly use partitioner output: Albert-Jan Yzelman

  81. Fast sparse matrix–vector multiplication by partitioning and reordering > Parallel cache-friendly SpMV On distributed-memory architectures Or: use both partitioner and reordering output: partition for p → ∞ , but distribute only over the actual number of processors: Albert-Jan Yzelman

  82. Fast sparse matrix–vector multiplication by partitioning and reordering > Parallel cache-friendly SpMV On shared-memory architectures Use: global version of the matrix A , stored in BICRS, global input vector x , global output vector y . Albert-Jan Yzelman

  83. Fast sparse matrix–vector multiplication by partitioning and reordering > Parallel cache-friendly SpMV On shared-memory architectures Use: global version of the matrix A , stored in BICRS, global input vector x , global output vector y . Multiple threads work simultaneously on contiguous blocks in the BICRS data structure; conflicts only arise on the row-wise separator areas. Use t − 1 synchronisation steps to prevent concurrent writes. Albert-Jan Yzelman
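The turn-taking idea can be sketched as follows (a hedged sketch under my own assumptions: names, data layout, and Python threads stand in for the real implementation; conceptually, only the barrier pattern matters). Each thread first processes the rows it owns exclusively, then the threads take turns writing into the shared separator rows, separated by t − 1 barrier steps so no two threads update the same y[i] concurrently.

```python
import threading

def parallel_spmv(n, t, local_nzs, sep_nzs, x):
    """local_nzs[s]: (i, j, v) triples whose rows only thread s touches;
    sep_nzs[s]: thread s's nonzeros inside the shared separator rows."""
    y = [0.0] * n
    barrier = threading.Barrier(t)

    def worker(s):
        # conflict-free part: each thread owns these rows, no locking needed
        for i, j, v in local_nzs[s]:
            y[i] += v * x[j]
        # separator rows are shared: serialise writes with t - 1 barriers,
        # thread s writing its separator contributions in turn s
        for turn in range(t):
            if turn == s:
                for i, j, v in sep_nzs[s]:
                    y[i] += v * x[j]
            if turn < t - 1:
                barrier.wait()

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(t)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return y

# toy example, t = 2: rows 0-1 owned by thread 0, row 2 by thread 1,
# row 3 is a shared separator row written by both threads in turn
result = parallel_spmv(4, 2,
                       [[(0, 0, 1.0), (1, 1, 2.0)], [(2, 2, 3.0)]],
                       [[(3, 0, 1.0)], [(3, 3, 1.0)]],
                       [1.0] * 4)
print(result)  # → [1.0, 2.0, 3.0, 2.0]
```

The barriers, not Python's GIL, are what make the separator updates race-free: each thread's separator writes happen strictly before the next thread's turn begins.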

  84. Fast sparse matrix–vector multiplication by partitioning and reordering > Experimental results Experimental results 1 Bulk Synchronous Parallel 2 Partitioning 3 Sequential SpMV 4 Parallel cache-friendly SpMV 5 Experimental results Albert-Jan Yzelman
