Cache-oblivious sparse matrix–vector multiplication
Albert-Jan Yzelman, April 3, 2009
Joint work with Rob Bisseling
Motivations

Basic implementations can suffer up to a 2x slowdown. Even worse: dedicated libraries may in some cases still show a similar level of inefficiency.
Outline

1. Memory and multiplication
2. Cache-friendly data structures
3. Cache-oblivious sparse matrix structure
4. Obtaining SBD form using partitioners
5. Experimental results
6. Conclusions & Future Work
1. Memory and multiplication
Cache parameters

- Size S (in bytes)
- Line size L_S (in bytes)
- Number of cache lines L = S / L_S
- Number of subcaches k
- Number of levels
Naive cache

k = 1: a modulo-mapped cache. A memory line (of length L_S) from RAM with start address x is stored in cache line number x mod L.

[Figure: main memory (RAM) mapped modulo L onto the cache lines]
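As a small illustration (not part of the original slides), the modulo mapping can be written as a one-liner, assuming x counts whole memory lines, i.e., the address has already been divided by L_S:

```c
#include <stddef.h>

/* Direct-mapped (k = 1) cache: the memory line with index x
 * (its start address divided by the line size L_S) is stored
 * in cache line x mod L.                                      */
static size_t cache_line_index(size_t x, size_t L)
{
    return x % L;
}
```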
'Ideal' cache

Instead of using a naive modulo mapping, we use a smarter policy. We take k = L = 4 with a 'Least Recently Used' (LRU) replacement policy:

- After requesting x_1, ..., x_4 the cache holds, from most to least recently used: x_4, x_3, x_2, x_1.
- Requesting x_2 (a hit) moves it to the top: x_2, x_4, x_3, x_1.
- Requesting x_5 (a miss) evicts the least recently used element x_1: x_5, x_2, x_4, x_3.
Realistic cache

1 < k < L, combining modulo mapping and the LRU policy.

[Figure: main memory (RAM) is modulo-mapped onto the cache's subcaches; each subcache is maintained as an LRU stack]
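A hypothetical sketch (not from the slides) of such a k-way set-associative cache with LRU replacement; the constants K and NUM_SETS are illustrative placeholders for k and L/k:

```c
#include <stdbool.h>
#include <stddef.h>

#define K        4        /* slots per subcache (associativity k) */
#define NUM_SETS 256      /* number of subcaches, i.e. L / k      */

typedef struct {
    size_t line[K];       /* cached memory-line indices, MRU first */
    bool   valid[K];
} cache_set_t;

static cache_set_t sets[NUM_SETS];

/* Access memory line x; returns true on a hit, false on a miss. */
static bool access_line(size_t x)
{
    cache_set_t *s = &sets[x % NUM_SETS];      /* modulo mapping   */
    size_t i;

    /* Hit: move the line to the most-recently-used position.     */
    for (i = 0; i < K; ++i) {
        if (s->valid[i] && s->line[i] == x) {
            for (; i > 0; --i) {
                s->line[i]  = s->line[i - 1];
                s->valid[i] = s->valid[i - 1];
            }
            s->line[0]  = x;
            s->valid[0] = true;
            return true;
        }
    }

    /* Miss: the least-recently-used line (last slot) is evicted. */
    for (i = K - 1; i > 0; --i) {
        s->line[i]  = s->line[i - 1];
        s->valid[i] = s->valid[i - 1];
    }
    s->line[0]  = x;
    s->valid[0] = true;
    return false;
}
```

With K = 1 this degenerates to the naive modulo-mapped cache of the earlier slide; with NUM_SETS = 1 and K = L it models the 'ideal' fully associative LRU cache.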
Multilevel caches

[Figure: CPU, L1 cache, L2 cache, main memory (RAM)]

Intel Core2:  L1: S = 32 kB, k = 8;   L2: S = 4 MB, k = 16
AMD K8:       L1: S = 16 kB, k = 2;   L2: S = 1 MB, k = 16
The dense case

Dense matrix–vector multiplication:

    [ a_00 a_01 a_02 a_03 ]   [ x_0 ]   [ y_0 ]
    [ a_10 a_11 a_12 a_13 ] · [ x_1 ] = [ y_1 ]
    [ a_20 a_21 a_22 a_23 ]   [ x_2 ]   [ y_2 ]
    [ a_30 a_31 a_32 a_33 ]   [ x_3 ]   [ y_3 ]

Example with k = L = 2: computing y_0 = a_00 x_0 + a_01 x_1 + ... touches, in turn, x_0, a_00, y_0, x_1, a_01, y_0, ...; with only two cache lines available, each new access evicts a previously cached element.
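For reference, a plain row-by-row dense matrix–vector multiply (a minimal sketch, not taken from the slides) that produces exactly this access pattern:

```c
#include <stddef.h>

/* Naive dense matrix-vector multiplication y = A*x.
 * Each row i touches a[i][0..n-1] together with all of x,
 * the access pattern traced on the slide above.           */
void dense_mv(size_t m, size_t n,
              const double *a,          /* m x n, row-major */
              const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```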
When k, L are a bit larger, we can predict the following: the lower elements of the vector x (that is, x_0, x_1, ..., x_i for some i < n) are evicted while processing the entire first row. This causes O(n) cache misses on the remaining m - 1 rows.

Fix: stop processing a row before an element of x would be evicted and first continue with the remaining rows; i.e., process Ax by doing MVs on m × q submatrices (column blocks): y = A_0 x_0 + A_1 x_1 + ..., where A_j is the j-th column block of A and x_j the matching part of x.

Unwanted side effect: now the lower elements of the vector y can be prematurely evicted...

Fix: stop processing a submatrix before an element of y would be evicted; the MV routine is now applied to p × q submatrices.

This approach is cache-aware; it is implemented in, e.g., GotoBLAS.
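A sketch of the resulting blocked multiplication (an illustrative reconstruction under the assumptions above, not the GotoBLAS implementation; the block sizes p and q would be derived from the cache parameters S, L_S, k, which is what makes the approach cache-aware):

```c
#include <stddef.h>

/* Cache-aware dense MV: y = A*x processed in p x q blocks, so that
 * the active q entries of x and p entries of y stay in cache.      */
void blocked_mv(size_t m, size_t n, size_t p, size_t q,
                const double *a,          /* m x n, row-major       */
                const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i)
        y[i] = 0.0;

    for (size_t jb = 0; jb < n; jb += q)          /* column blocks  */
        for (size_t ib = 0; ib < m; ib += p)      /* row blocks     */
            for (size_t i = ib; i < ib + p && i < m; ++i)
                for (size_t j = jb; j < jb + q && j < n; ++j)
                    y[i] += a[i * n + j] * x[j];
}
```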
The sparse case

Standard data structure: Compressed Row Storage (CRS)

    A = [ 4 1 3 0 ]
        [ 0 0 2 3 ]
        [ 1 0 0 2 ]
        [ 7 0 1 1 ]

Stored as:
    nzs: [4 1 3 2 3 1 2 7 1 1]
    col: [0 1 2 2 3 0 3 0 2 3]
    row: [0 3 5 7 10]
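The corresponding SpMV kernel is the standard CRS double loop (a minimal sketch using the array names from the slide; m is the number of rows, so row has m + 1 entries):

```c
#include <stddef.h>

/* SpMV y = A*x with A stored in CRS (arrays nzs, col, row as above). */
void spmv_crs(size_t m, const double *nzs, const size_t *col,
              const size_t *row, const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        for (size_t k = row[i]; k < row[i + 1]; ++k)
            sum += nzs[k] * x[col[k]];  /* x is accessed through col[k]:
                                           irregular and data-dependent */
        y[i] = sum;
    }
}
```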
The sparse case

Sparse matrix–vector multiplication (SpMV): processing the nonzeros touches, in turn, x_?, a_0?, y_0, x_?, a_??, y_?, ..., where the question marks indicate that the accessed positions of x (and y) depend on where the nonzeros happen to lie.

We cannot predict memory accesses in the sparse case!
2. Cache-friendly data structures