Automatic Performance Tuning and Analysis of Sparse Triangular Solve
Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James W. Demmel, Katherine A. Yelick
June 22, 2002
Berkeley Benchmarking and OPtimization (BeBOP) Project
www.cs.berkeley.edu/~richie/bebop
Computer Science Division, U.C. Berkeley
Automatic Performance Tuning and Analysis of Sparse Triangular Solve - p.1/31
Context: High-Performance Libraries
- Application performance is dominated by a few computational kernels:
  - Solving PDEs (linear algebra ops)
  - Google (sparse matrix-vector multiply)
  - Multimedia (signal processing)
- Performance tuning today:
  - Vendor-tuned standardized libraries (e.g., BLAS)
  - User tunes by hand
- Automated tuning for dense linear algebra and FFTs:
  - PHiPAC/ATLAS (dense linear algebra)
  - FFTW/SPIRAL/UHFFT (signal processing)
Problem Area: Sparse Matrix Kernels
- Performance issues in sparse linear algebra:
  - High bandwidth requirements and poor instruction mix
  - Performance depends on architecture, kernel, and matrix
  - How to select data structures and algorithms, possibly at run-time?
- Approach to automatic tuning: for each kernel,
  - Identify and generate a space of implementations
  - Search (using models and experiments) to find the fastest one
- Early success: SPARSITY (Im & Yelick '99) for sparse matrix-vector multiply (SpMxV)
- This talk: sparse triangular solve (SpTS), arising in sparse Cholesky and LU factorization (uniprocessor)
Sparse Triangular Matrix Example
raefsky4 (structural problem) + SuperLU 2.0 + colmmd ordering
- Dimension: 19779
- No. of non-zeros: 12.6 M
- Dense trailing triangle: dim = 2268, containing 20% of the total non-zeros
Idea: Sparse/Dense Partitioning
Partition L into a sparse part and the dense trailing triangle:

    L = [ L_11    0    ]
        [ L_21   L_22  ]

where L_11 is sparse and L_22 is the dense trailing triangle. Solving L x = b then leads to 1 SpTS, 1 SpMxV, and 1 dense TS:

    L_11 x_1 = b_1               (1)
    b_2' = b_2 - L_21 x_1        (2)
    L_22 x_2 = b_2'              (3)

Apply SPARSITY optimizations to (1)-(2); use a tuned BLAS for (3).
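The partitioned solve can be sketched as follows. This is a minimal illustration using plain dense row-lists and naive forward substitution; the actual implementation would call a SPARSITY-tuned SpTS/SpMxV for the sparse steps and a tuned BLAS triangular solve for the dense step. All function names here are our own.

```python
def solve_lower(L, b):
    """Naive forward substitution: solve L x = b for lower-triangular L."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x

def partitioned_solve(L, b, k):
    """Solve L x = b by splitting at row/column k:
         (1) L11 x1 = b1           (sparse triangular solve)
         (2) b2' = b2 - L21 x1     (sparse matrix-vector multiply)
         (3) L22 x2 = b2'          (dense triangular solve)
    """
    L11 = [row[:k] for row in L[:k]]
    L21 = [row[:k] for row in L[k:]]
    L22 = [row[k:] for row in L[k:]]
    x1 = solve_lower(L11, b[:k])                        # step (1)
    b2p = [b[k + i] - sum(L21[i][j] * x1[j] for j in range(k))
           for i in range(len(L21))]                    # step (2)
    x2 = solve_lower(L22, b2p)                          # step (3)
    return x1 + x2
```

By construction, the partitioned result agrees with a single full forward substitution; the payoff is that each of the three steps can use the data structure best suited to its part of the matrix.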
Register Blocking (SPARSITY)
- Store dense r x c blocks; multiply/solve block-by-block
- Fill in explicit zeros where needed
- Block operations are fully unrolled, improving register reuse
- Reduced storage overhead over, e.g., CSR
- Trade off extra computation for efficiency
- 1.3x-2.5x speedup on FEM matrices (SpMxV)
[Figure: 4x3 register blocking example on a 50x50 spy plot, nz = 598]
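A minimal sketch of the blocked storage idea for SpMxV with fixed 2x2 blocks, in the style of BCSR (block compressed sparse row). The array names (b_row_ptr, b_col_idx, b_vals) are our own, not SPARSITY's; note how each stored block is a small dense tile, including any explicit zeros, and the per-block multiply is fully unrolled.

```python
def bcsr_matvec_2x2(b_row_ptr, b_col_idx, b_vals, x, n_rows):
    """y = A x for A stored in BCSR with 2x2 blocks (blocks row-major in b_vals)."""
    y = [0.0] * n_rows
    for bi in range(n_rows // 2):                # loop over block rows
        i = 2 * bi
        for k in range(b_row_ptr[bi], b_row_ptr[bi + 1]):
            j = 2 * b_col_idx[k]                 # first column of this block
            a = b_vals[4 * k : 4 * k + 4]        # the 2x2 dense block
            # fully unrolled block multiply: x[j], x[j+1] stay in registers
            y[i]     += a[0] * x[j] + a[1] * x[j + 1]
            y[i + 1] += a[2] * x[j] + a[3] * x[j + 1]
    return y
```

In the second block below (values 1, 2, 0, 3), the 0 is an explicitly stored zero: the fill that register blocking trades for unrolled, regular inner loops.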
Tuning Parameter Selection
Parameters: the switch point s, and the register block size r x c
- Off-line profiling:
  - Benchmark routines on synthetic data
  - Only needed once per architecture
- At run-time (when the matrix is known):
  - Determine or estimate matrix properties (e.g., fill ratio, size of trailing triangle)
  - Combine with the data collected off-line
  - Convert to the new data structure
In practice, the total run-time cost to select parameters and reorganize: e.g., 10-30 naive solves on Itanium
Performance Bounds
Upper bounds on performance (Mflop/s)?
- Flops: 2 x (number of non-zeros) - (dimension)
- Full latency cost model of execution time, where h_i and alpha_i are the hits and access latency at memory level i, and m_mem, alpha_mem are the misses to and latency of main memory:

    T = sum_i h_i * alpha_i + m_mem * alpha_mem    (4)

- Lower bound on misses: count only compulsory traffic, i.e., ignore conflict misses on the source and destination vectors    (5)
Performance Results: Intel Itanium
[Figure: Sparse Triangular Solve performance summary on itanium-linux-ecc. Y-axis: performance (Mflop/s), 50-350. Series: Reference; Reg. Blocking (RB); Switch-to-Dense (S2D); RB + S2D; analytic upper bound; analytic lower bound; PAPI upper bound. Matrices: dense, memplus, wang4, ex11, raefsky4, goodwin, lhr10.]
Conclusions and Directions
- Limits of "low-level" tuning are near
  - Can we approach bandwidth limits?
- Other kernels? Other structures? (multiple vectors, symmetry, reordering)
- Interfaces from/to libraries and applications?
- Leverage existing generators (e.g., Bernoulli)
- Hybrid on-line/off-line optimizations
- SpTS-specific future work:
  - Exploit symbolic structure; other fill-reducing orderings
  - Refinements to switch-point selection
  - Incomplete Cholesky and LU preconditioners
Related Work (1/2)
- Automatic tuning systems:
  PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00], FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00], MPI collective ops (Vadhiyar et al. [VFD01])
- Code generation:
  FLAME [GGHvdG01]; sparse compilers (Bik [BW99], Bernoulli [Sto97]); generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], ...)
- Sparse performance modeling:
  Temam and Jalby [TJ92]; White and Sadayappan [WS97]; Navarro [NGLPJ96]; Heras [HPDR99]; Fraguela [FDZ99]
Related Work (2/2)
- Compilers (analysis and models) and run-time selection:
  CROPS (UCSD/Carter, Ferrante, et al.); TUNE (Chatterjee et al.); iterative compilation (O'Boyle et al., 1998); Broadway (Guyer and Lin, '99); Brewer ('95); ADAPT (Voss, 2000)
- Interfaces: Sparse BLAS; PSBLAS; PETSc
- Sparse triangular solve:
  SuperLU/MUMPS/SPOOLES/UMFPACK/PSPASES...
  Approximation: Alvarado ('93); Raghavan ('98)
  Scalability: Rothberg ('92, '95); Gupta ('95); Li, Coleman ('88)
—End—
Tuning Parameter Selection
First, select the switch point s at run-time:
- Assume the matrix is in CSR format on input
- Scan the bottom row leftward from the diagonal until two consecutive zeros are found
- Fill vs. efficiency trade-off
Then, select the register block size r x c:
- Maximize, over all r and c, the profiled dense performance relative to the estimated fill:

    max_{r,c}  P_dense(r, c) / fill(r, c)    (6)

Total cost to select and reorganize: e.g., 10-30 naive solves on Itanium
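Both heuristics above can be sketched compactly. This is a simplified reading of the slide's description, with our own function names: the switch-point scan looks at the column indices of the last row's nonzeros, and the block-size pick applies the ratio in (6) to an off-line dense profile and a run-time fill estimate.

```python
def find_switch_point(last_row_cols, n):
    """Scan the bottom row of L leftward from the diagonal; stop at the
    first run of two consecutive structural zeros. Columns from the
    returned index to n-1 form the dense trailing triangle."""
    present = set(last_row_cols)         # nonzero columns of row n-1 (CSR)
    zeros = 0
    for j in range(n - 1, -1, -1):
        if j in present:
            zeros = 0
        else:
            zeros += 1
            if zeros == 2:
                return j + 2             # dense part starts after the gap
    return 0                             # no gap found: whole matrix dense

def choose_block_size(dense_profile, fill_ratio):
    """Equation (6): pick (r, c) maximizing profiled dense Mflop/s
    divided by the estimated fill ratio for this matrix."""
    return max(dense_profile, key=lambda rc: dense_profile[rc] / fill_ratio[rc])
```

The dense_profile dictionary is the once-per-architecture register profile; fill_ratio is estimated from the matrix at run-time, so only the cheap scan and a table lookup happen per solve.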
Matrix Benchmark Suite

                                                        Dense Trailing Triangle
  Name      Application Area       Dim.    Nnz in L     Dim.    Density   % Total Nnz
  dense     Dense matrix           1000    500k         1000    100.0%    100.0%
  memplus   Circuit simulation     17758   2.0M         1978    97.7%     96.8%
  wang4     Device simulation      26068   15.1M        2810    95.0%     24.8%
  ex11      Fluid flow             16614   9.8M         2207    88.0%     22.0%
  raefsky4  Structural mechanics   19779   12.6M        2268    100.0%    20.4%
  goodwin   Fluid mechanics        7320    1.0M         456     65.9%     6.97%
  lhr10     Chemical processes     10672   369k         104     96.3%     1.43%
Register Profile (Intel Itanium)
[Figure: register blocking performance (Mflop/s) for a dense n=1000 matrix in sparse format on itanium-linux-ecc, over row block sizes r and column block sizes c from 1 to 12; color scale roughly 120-240 Mflop/s, with annotated speedups over 1x1 up to 1.55x.]
Register Profile (IBM Power3)
[Figure: register blocking performance (Mflop/s) for a dense n=1000 matrix in sparse format on power3-aix, over row and column block sizes 1-12; color scale roughly 140-260 Mflop/s, with annotated speedups over 1x1 up to 1.59x.]
Register Profile (Sun Ultra 2i)
[Figure: register blocking performance (Mflop/s) for a dense n=1000 matrix in sparse format on ultra-solaris, over row and column block sizes 1-12; color scale roughly 40-70 Mflop/s, with annotated speedups over 1x1 up to 2.03x.]
Register Profile (Intel Pentium III)
[Figure: register blocking performance (Mflop/s) for a dense n=1000 matrix in sparse format on pentium3-linux-icc, over row and column block sizes 1-12; color scale roughly 50-100 Mflop/s, with annotated speedups over 1x1 up to 2.54x.]