

1. BLASFEO. Gianluca Frison, University of Freiburg. BLIS retreat, September 19, 2017

2. BLASFEO
◮ Basic Linear Algebra Subroutines For Embedded Optimization

3. BLASFEO performance
Plots: dgemm_nt and dpotrf_l, Gflops vs. matrix size n (up to 300), on an Intel Core i7 4800MQ. Libraries compared:
◮ BLASFEO HP
◮ OpenBLAS 0.2.19
◮ MKL 2017.2.174
◮ ATLAS 3.10.3
◮ BLIS 0.1.6
◮ CAVEAT: BLASFEO is not API compatible with BLAS & LAPACK

4. Framework: embedded optimization
◮ Framework: embedded optimization and control

5. Embedded control
Model Predictive Control (and Moving Horizon Estimation)
◮ solve an optimization problem on-line
◮ sampling times in the milli- or microsecond range

6. Karush-Kuhn-Tucker optimality conditions
KKT system (for N = 2):
$$
\begin{bmatrix}
Q_0 & S_0^T & A_0^T &     &       &       &     \\
S_0 & R_0   & B_0^T &     &       &       &     \\
A_0 & B_0   &       & -I  &       &       &     \\
    &       & -I    & Q_1 & S_1^T & A_1^T &     \\
    &       &       & S_1 & R_1   & B_1^T &     \\
    &       &       & A_1 & B_1   &       & -I  \\
    &       &       &     &       & -I    & Q_2
\end{bmatrix}
\begin{bmatrix} x_0 \\ u_0 \\ \lambda_0 \\ x_1 \\ u_1 \\ \lambda_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} -q_0 \\ -r_0 \\ -b_0 \\ -q_1 \\ -r_1 \\ -b_1 \\ -q_2 \end{bmatrix}
$$
◮ Large, structured system of linear equations
◮ Sub-matrices are assumed dense or diagonal
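For context (a standard derivation step, not spelled out on the slide): this is the stationarity-plus-feasibility system of the equality-constrained MPC quadratic program
$$
\min_{x,u} \;\; \sum_{k=0}^{N-1} \frac{1}{2}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}^T
\begin{bmatrix} Q_k & S_k^T \\ S_k & R_k \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}
+
\begin{bmatrix} q_k \\ r_k \end{bmatrix}^T
\begin{bmatrix} x_k \\ u_k \end{bmatrix}
+ \frac{1}{2} x_N^T Q_N x_N + q_N^T x_N
\quad \text{s.t.} \quad
x_{k+1} = A_k x_k + B_k u_k + b_k,
$$
with $\lambda_k$ the multipliers of the dynamics and no input at the terminal stage (hence the last block row contains only $Q_2$).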

7. Framework: embedded optimization
Assumptions about embedded optimization:
◮ Computational speed is a key factor: solve optimization problems in real-time on resource-constrained hardware.
◮ Data matrices are reused several times (e.g. at each optimization algorithm iteration): look for a good data structure.
◮ Structure-exploiting algorithms can exploit the high-level sparsity pattern: data matrices are assumed dense.
◮ The matrices are relatively small (tens or a few hundreds of rows/columns): they generally fit in cache.
◮ Limitations of embedded hardware and toolchains: no external libraries and no (static/dynamic) memory allocation.

8. The origin
◮ HPMPC: library for the High-Performance implementation of solvers for Model Predictive Control

9. How to optimize syrk+potrf (for embedded optimization)
Plots: test dsyrk + dpotrf, Gflops vs. matrix size n (one plot up to n = 300, one zoom up to n = 20).
◮ How to optimize dsyrk + dpotrf (for embedded optimization)
◮ Test operation: $L = \left( Q + A \cdot A^T \right)^{1/2}$, i.e. the Cholesky factorization of $Q + A \cdot A^T$ (a reference sketch follows below)
◮ Netlib BLAS
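As a reference point, the test operation maps onto two standard calls. A minimal sketch in C, assuming the CBLAS and LAPACKE interfaces are available (link against any BLAS/LAPACK implementation):

    #include <cblas.h>
    #include <lapacke.h>

    /* Compute L = chol(Q + A*A^T), lower triangular, for n-by-n Q and A
     * in column-major order. The lower triangle of Q is overwritten by L.
     * Returns LAPACK's info value (0 on success). */
    int syrk_potrf_ref(int n, double *Q, double *A)
    {
        /* Q <- 1.0*A*A^T + 1.0*Q, lower triangle only */
        cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                    n, n, 1.0, A, n, 1.0, Q, n);
        /* in-place Cholesky factorization: Q <- L with L*L^T = Q */
        return LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, Q, n);
    }

Calling dsyrk and dpotrf back to back is exactly what the unmerged implementations benchmarked on the following slides do.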

10. How to optimize syrk+potrf (for embedded optimization)
Code generation
◮ e.g. fix the size of the loops: the compiler can unroll loops and avoid branches (see the sketch below)
◮ need to generate the code for each problem size
(Plots: test dsyrk + dpotrf, Gflops vs. matrix size n.)
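A minimal sketch of the idea, with a hypothetical fixed size N standing in for the value a code generator would emit per problem instance:

    #define N 4  /* hypothetical fixed problem size, emitted by the generator */

    /* C += A*B^T for fixed N-by-N matrices: with compile-time constant
     * bounds, the compiler can fully unroll all three loops and remove
     * loop-control branches. */
    static void dgemm_nt_fixed(const double A[N][N], const double B[N][N],
                               double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[j][k];
    }

The price is a recompilation for every problem size, which is the drawback noted above.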

11. How to optimize syrk+potrf (for embedded optimization)
OpenBLAS
(Plots: test dsyrk + dpotrf, Gflops vs. matrix size n.)

12. How to optimize syrk+potrf (for embedded optimization)
HPMPC - register blocking (a schematic micro-kernel follows below)
(Plots: test dsyrk + dpotrf, Gflops vs. matrix size n.)
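A schematic 2x2 register-blocked gemm 'NT' micro-kernel in plain C (illustrative, not HPMPC's actual kernel): the four accumulators stay in registers across the whole inner loop, so every loaded element of A and B is used twice, and each element of C is read and written only once.

    /* compute one 2x2 block of C += A*B^T, column-major storage;
     * A is not transposed, B is transposed in the product */
    static void kernel_dgemm_nt_2x2(int kmax,
                                    const double *A, int lda,
                                    const double *B, int ldb,
                                    double *C, int ldc)
    {
        double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
        for (int k = 0; k < kmax; k++) {
            double a0 = A[0 + k*lda], a1 = A[1 + k*lda];
            double b0 = B[0 + k*ldb], b1 = B[1 + k*ldb];
            c00 += a0 * b0;  c01 += a0 * b1;
            c10 += a1 * b0;  c11 += a1 * b1;
        }
        C[0 + 0*ldc] += c00;  C[0 + 1*ldc] += c01;
        C[1 + 0*ldc] += c10;  C[1 + 1*ldc] += c11;
    }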

13. How to optimize syrk+potrf (for embedded optimization)
HPMPC - SIMD instructions
◮ performance drop for n multiple of 32: limited cache associativity
(Plots: test dsyrk + dpotrf, Gflops vs. matrix size n.)
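For illustration, the same accumulation pattern expressed with AVX/FMA intrinsics (a sketch assuming an x86_64 target compiled with -mavx2 -mfma; HPMPC's real kernels are wider and hand-written in assembly):

    #include <immintrin.h>

    /* 4x1 gemm 'NT' micro-kernel sketch: C(0:4,j) += A(0:4,:) * B(j,:)^T.
     * Four rows of one column of C are accumulated in a single 256-bit
     * register; B must point at element (j,0) of the right operand. */
    static void kernel_dgemm_nt_4x1_avx(int kmax,
                                        const double *A, int lda,
                                        const double *B, int ldb,
                                        double *C)
    {
        __m256d c0 = _mm256_loadu_pd(C);
        for (int k = 0; k < kmax; k++) {
            __m256d a0 = _mm256_loadu_pd(&A[k*lda]);     /* A(0:4,k)    */
            __m256d b0 = _mm256_broadcast_sd(&B[k*ldb]); /* B(j,k)      */
            c0 = _mm256_fmadd_pd(a0, b0, c0);            /* c0 += a0*b0 */
        }
        _mm256_storeu_pd(C, c0);
    }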

14. How to optimize syrk+potrf (for embedded optimization)
HPMPC - panel-major matrix format
◮ panel-major matrix format
◮ smooth performance
(Plots: test dsyrk + dpotrf, Gflops vs. matrix size n.)

15. Access pattern in optimized BLAS
Figure: Access pattern of data in the different cache levels for the dgemm routine in GotoBLAS/OpenBLAS/BLIS. Data is packed (on-line) into buffers following the access pattern.

16. Panel-major matrix format
Figure: panel-major layout of the gemm update $D \mathrel{+}= A \cdot B^T$, with panel size $p_s$.
◮ matrix format: can perform any operation, also on sub-matrices
◮ matrix elements are stored in (almost) the same order in which the gemm kernel accesses them (a packing sketch follows below)
◮ optimal 'NT' gemm variant ($A$ not transposed, $B$ transposed)
◮ the panel size $p_s$ is the same for the left and the right matrix operand, as well as for the result matrix
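A sketch of packing a column-major matrix into this panel-major layout (illustrative, not BLASFEO's packing routine; for simplicity the padded number of columns cn is taken equal to n):

    /* Pack column-major A (m-by-n, leading dimension lda) into panel-major
     * pA: rows are grouped in panels of ps, and each panel is stored column
     * by column, matching the order in which the gemm kernel streams it.
     * pA must hold ceil(m/ps)*ps*n doubles. */
    static void pack_panel_major(int m, int n, const double *A, int lda,
                                 double *pA, int ps)
    {
        for (int i = 0; i < m; i++) {
            int panel = i / ps;  /* which panel row i falls into     */
            int row   = i % ps;  /* offset of row i inside the panel */
            for (int j = 0; j < n; j++)
                pA[(panel*n + j)*ps + row] = A[i + j*lda];
        }
    }

Element (i, j) thus sits at offset (i/ps)*ps*n + j*ps + i%ps, which is the addressing formula that reappears on the matrix-structure slide below.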

17. Optimized BLAS vs HPMPC software stack
Figure: Structure of a Riccati-based IPM for linear MPC problems when implemented using linear algebra from either optimized BLAS or HPMPC. Both stacks share the same layers: a high-level wrapper and IPM solver for linear MPC (Part III), a Riccati solver for unconstrained MPC (Part II), and linear algebra routines and kernels (Part I); the optimized-BLAS-based stack additionally packs matrices between the routines and the kernels. Routines in the orange boxes use matrices in column-major format; routines in the green boxes use matrices in panel-major format.

18. How to optimize syrk+potrf (for embedded optimization)
HPMPC - merging of linear algebra routines
◮ specialized kernels for complex operations
◮ improves small-scale performance
◮ worsens large-scale performance
(Plots: test dsyrk + dpotrf, Gflops vs. matrix size n.)

19. Merging of linear algebra routines: syrk + potrf
$$
L = \left( Q + A \cdot A^T \right)^{1/2}
$$
$$
\begin{bmatrix}
L_{00} & * & * \\
L_{10} & L_{11} & * \\
L_{20} & L_{21} & L_{22}
\end{bmatrix}
=
\left(
\begin{bmatrix}
Q_{00} & * & * \\
Q_{10} & Q_{11} & * \\
Q_{20} & Q_{21} & Q_{22}
\end{bmatrix}
+
\begin{bmatrix} A_0 \\ A_1 \\ A_2 \end{bmatrix}
\cdot
\begin{bmatrix} A_0^T & A_1^T & A_2^T \end{bmatrix}
\right)^{1/2}
=
$$
$$
\begin{bmatrix}
(Q_{00} + A_0 A_0^T)^{1/2} & * & * \\
(Q_{10} + A_1 A_0^T) L_{00}^{-T} & (Q_{11} + A_1 A_1^T - L_{10} L_{10}^T)^{1/2} & * \\
(Q_{20} + A_2 A_0^T) L_{00}^{-T} & (Q_{21} + A_2 A_1^T - L_{20} L_{10}^T) L_{11}^{-T} & (Q_{22} + A_2 A_2^T - L_{20} L_{20}^T - L_{21} L_{21}^T)^{1/2}
\end{bmatrix}
$$
◮ each sub-matrix computed using a single specialized kernel (see the sketch below)
◮ reduces the number of function calls
◮ reduces the number of loads and stores of the same data
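An unblocked sketch of the fused operation in C (illustrative; BLASFEO's kernels work on register-sized blocks): every element of L is produced in a single pass that adds the A·A^T contribution and applies the Cholesky updates, so the intermediate Q + A·A^T is never stored and reloaded.

    #include <math.h>

    /* Fused L = chol(Q + A*A^T), unblocked, column-major.
     * Q is n-by-n (lower part referenced, overwritten by L), A is n-by-k;
     * Q + A*A^T is assumed symmetric positive definite. */
    static void syrk_potrf_fused(int n, int k, double *Q, int ldq,
                                 const double *A, int lda)
    {
        for (int j = 0; j < n; j++) {
            for (int i = j; i < n; i++) {
                double v = Q[i + j*ldq];
                for (int p = 0; p < k; p++)   /* syrk term: + A(i,:)*A(j,:)^T */
                    v += A[i + p*lda] * A[j + p*lda];
                for (int p = 0; p < j; p++)   /* Cholesky: - L(i,:)*L(j,:)^T  */
                    v -= Q[i + p*ldq] * Q[j + p*ldq];
                Q[i + j*ldq] = (i == j) ? sqrt(v) : v / Q[j + j*ldq];
            }
        }
    }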

20. The present
◮ BLASFEO

21. BLASFEO
◮ aim: satisfy the need for high-performance linear algebra routines for small dense matrices
◮ mean: adapt high-performance computing techniques to the embedded optimization framework
◮ keep:
  ◮ LA kernels (register blocking, SIMD)
  ◮ optimized data layout
◮ drop:
  ◮ cache blocking
  ◮ on-line data packing
  ◮ multi-threading (at least for now)
◮ add:
  ◮ optimized (panel-major) matrix format
  ◮ off-line data packing
  ◮ assembly subroutines: tailored LA kernels & code reuse

22. Linear algebra level
◮ LA level definition:
  ◮ level 1: O(n) storage, O(n) flops
  ◮ level 2: O(n^2) storage, O(n^2) flops
  ◮ level 3: O(n^2) storage, O(n^3) flops
◮ in level 1 and level 2:
  ◮ reuse factor O(1)
  ◮ memory-bound
◮ in level 3:
  ◮ reuse factor O(n) (see the worked count below)
  ◮ compute-bound for large n: disregard O(n^2) terms
  ◮ typically memory-bound for small n: minimize O(n^2) terms ⇒ BLASFEO playground!
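As a worked count for level 3 (standard arithmetic, not spelled out on the slide): a square gemm performs about $2n^3$ flops while touching $3n^2$ matrix elements, so
$$
\text{reuse factor} \approx \frac{2n^3}{3n^2} = \frac{2}{3}\,n = O(n),
$$
which is why level-3 routines can approach peak flops for large $n$, while for small $n$ the $O(n^2)$ data movement is not yet amortized.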

23. BLASFEO RF
BLASFEO ReFerence
◮ 2x2 blocking for registers
◮ column-major matrix format
◮ small code size
◮ ANSI C code, no external dependencies
◮ good performance for tiny matrices

24. BLASFEO HP
BLASFEO High-Performance
◮ optimal blocking for registers + vectorization
◮ no blocking for cache
◮ panel-major matrix format
◮ hand-written assembly kernels
◮ optimized for the highest performance for matrices fitting in cache

25. BLASFEO WR
BLASFEO WRapper to BLAS and LAPACK
◮ provides a performance baseline
◮ column-major matrix format
◮ optimized for many architectures
◮ possibly multi-threaded
◮ good performance for large matrices

26. Matrix structure
◮ problem: how to switch between the panel-major and column-major formats?
◮ solution: use a C struct for the matrix "object":

    (..., struct d_strmat *sA, int ai, int aj, ...)

◮ if column-major, the first matrix element is at

    int lda = sA->m;
    double *ptr = sA->pA + ai + aj*lda;

◮ if panel-major, the first matrix element is at

    int ps = sA->ps;
    int sda = sA->cn;
    int air = ai&(ps-1);
    double *ptr = sA->pA + (ai-air)*sda + air + aj*ps;

◮ the custom matrix struct allows other tricks
  ◮ extra linear storage for the inverse of the diagonal in factorizations
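A self-contained sketch of such a struct and the panel-major addressing above (field names mirror the slide; this is an illustrative stand-in, not BLASFEO's actual header):

    /* illustrative stand-in for the matrix "object"; the real BLASFEO
     * struct has more fields and its naming has changed across versions */
    struct d_strmat_sketch {
        int m, n;    /* matrix dimensions                            */
        int ps;      /* panel height (panel-major layout)            */
        int cn;      /* padded number of columns                     */
        double *pA;  /* matrix elements                              */
        double *dA;  /* extra storage, e.g. inverse of the diagonal  */
    };

    /* address of element (ai, aj) in the panel-major layout, as on the
     * slide; assumes ps is a power of two, so ai&(ps-1) equals ai%ps */
    static double *pm_elem(struct d_strmat_sketch *sA, int ai, int aj)
    {
        int ps  = sA->ps;
        int sda = sA->cn;
        int air = ai & (ps - 1);
        return sA->pA + (ai - air)*sda + air + aj*ps;
    }

Keeping the offsets ai, aj in the routine interface lets the same call operate on any sub-matrix, regardless of the underlying storage format.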
