Implementation and performance evaluation of an extended precision floating-point arithmetic library for high-accuracy semidefinite programming
Mioara Joldes, Jean-Michel Muller and Valentina Popescu
ARITH 24, July 2017
CAMPARY: CudA Multiple Precision ARithmetic librarY
When do we need more precision?
– Computing correctly rounded transcendental functions (e.g., CRLIBM).
– Dynamical systems: computing periodic orbits (e.g., finding sinks in the Hénon map, iterating the Lorenz attractor); celestial mechanics (e.g., long-term stability of the solar system).
[Figure: plot in the (x1, x2) plane, with x1 from -1.5 to 1.5 and x2 from -0.4 to 0.4]
– Optimization problems in experimental mathematics: computation of kissing numbers, bounds for binary codes, problems in control theory and structural design (e.g., the wing of the Airbus A380), problems in quantum chemistry/information, etc.
⇒ solved using Semi-Definite Programming (SDP)
Outline
– Overview on Semi-Definite Programming
– SDPA-CAMPARY
– Performance and Numerical Results
What is SDP?
– a convex optimization problem;
– an extension of linear programming;
– applied to the cone of symmetric matrices with non-negative eigenvalues;
– the linear vector inequalities are replaced by linear matrix inequalities (LMI).
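Formal definition
– $\mathbb{R}^{n \times n}$: the space of $n \times n$ real matrices;
– $\mathbb{S}^n \subset \mathbb{R}^{n \times n}$: the subspace of real symmetric matrices, equipped with the inner product $\langle A, B \rangle_{\mathbb{S}^n} = \mathrm{tr}(A^T B)$, where $\mathrm{tr}(A)$ is the trace of $A$;
– $A \succeq O$ denotes a positive semidefinite matrix $A$.

\[
(P) \qquad p^* = \sup_{X \in \mathbb{S}^n} \langle C, X \rangle_{\mathbb{S}^n}
\quad \text{s.t.} \quad \langle A_i, X \rangle_{\mathbb{S}^n} = b_i, \ i = 1, \ldots, m, \qquad X \succeq O,
\]
\[
(D) \qquad d^* = \inf_{y \in \mathbb{R}^m} b^T y
\quad \text{s.t.} \quad Y := \sum_{i=1}^{m} y_i A_i - C \succeq O,
\]

for given $C, A_i \in \mathbb{S}^n$, $i = 1, \ldots, m$, and $b \in \mathbb{R}^m$.

Classical solving: Primal-Dual Interior-Point Method (PDIPM) algorithm.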
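A short worked step, not on the slide, shows why (P) and (D) bound each other. Assuming $X$ is feasible for (P) and $y$ is feasible for (D), the duality gap equals an inner product of two positive semidefinite matrices:

\[
b^T y - \langle C, X \rangle_{\mathbb{S}^n}
= \sum_{i=1}^{m} y_i \langle A_i, X \rangle_{\mathbb{S}^n} - \langle C, X \rangle_{\mathbb{S}^n}
= \Big\langle \sum_{i=1}^{m} y_i A_i - C, \ X \Big\rangle_{\mathbb{S}^n}
= \langle Y, X \rangle_{\mathbb{S}^n} \ \ge \ 0 ,
\]

since $Y \succeq O$ and $X \succeq O$ imply $\mathrm{tr}(Y X) \ge 0$. Hence $p^* \le d^*$ (weak duality). An interior-point method drives this gap toward zero, which is one place where limited working precision can matter.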
Existing SDP solvers
– in double precision: SeDuMi, SDPT3, CSDP, MOSEK (proprietary software);
– exact rational arithmetic: SPECTRA;
– uses interval arithmetic: VSDP;
– supports higher extended precision: SDPA family (DD, QD and GMP versions).

SDPA features
– written in C/C++;
– starting with v6.0 it incorporates LAPACK for dense matrix computations;
– more recently it integrated MPACK (a multiple-precision linear algebra package based on BLAS and LAPACK);
– MPACK also offers a GPU-tuned implementation in double-double of the Rgemm routine.
What is CAMPARY?
CudA Multiple-Precision ARithmetic librarY
– uses the multiple-term approach for extending the available precision → floating-point expansions;
– moderate arbitrary precision (a few hundred bits);
– targets both CPU and GPU (compilers: GCC, NVCC);
– underlying FP format: binary32 (up to 12 terms) or binary64 (up to 39 terms);
◦ sequential algorithms: all basic operations (+/−, ×, ÷, √)
  – accurate algorithms: tight error bounds
  – "quick-and-dirty" algorithms: do not consider corner cases
  ⋆ optimized algorithms for double-word arithmetic
◦ GPU-tuned parallel algorithms: +/−, ×
– thorough correctness proofs and error analysis
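To make the multiple-term idea concrete, here is a minimal C++ sketch of the classical error-free transformations that floating-point expansions are usually built from, plus one double-word operation. It assumes binary64 with round-to-nearest and no `-ffast-math`. The names `DoubleWord`, `two_sum`, `fast_two_sum`, `two_prod` and `dw_add_double` are hypothetical and chosen for illustration; this is not CAMPARY's API.

```cpp
#include <cmath>

// Hypothetical helper type: the unevaluated sum hi + lo of two doubles.
struct DoubleWord { double hi; double lo; };

// TwoSum (Knuth): hi = fl(a + b), lo = exact rounding error,
// so hi + lo == a + b exactly (round-to-nearest, no overflow).
inline DoubleWord two_sum(double a, double b) {
    double s = a + b;
    double a_virtual = s - b;
    double b_virtual = s - a_virtual;
    return {s, (a - a_virtual) + (b - b_virtual)};
}

// Fast2Sum (Dekker): same result with fewer operations, but only
// valid when the exponent of a is at least that of b (e.g. |a| >= |b|).
inline DoubleWord fast_two_sum(double a, double b) {
    double s = a + b;
    double z = s - a;
    return {s, b - z};
}

// TwoProd via fused multiply-add: hi = fl(a * b), lo = exact error,
// so hi + lo == a * b exactly (barring overflow and underflow).
inline DoubleWord two_prod(double a, double b) {
    double p = a * b;
    return {p, std::fma(a, b, -p)};
}

// Double-word plus double: add y to the leading term, fold the
// rounding error into the trailing term, then renormalize.
// Assumes x is normalized, i.e. x.hi == fl(x.hi + x.lo).
inline DoubleWord dw_add_double(DoubleWord x, double y) {
    DoubleWord s = two_sum(x.hi, y);
    double v = x.lo + s.lo;
    return fast_two_sum(s.hi, v);
}
```

For example, adding 2^-60 to the double-word (1.0, 0.0) yields hi = 1.0 and lo = 2^-60, whereas a plain double addition would discard the small term. Expansions with more terms extend the same pattern to longer chains of non-overlapping doubles.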
Integrating CAMPARY with MPACK
Reminder: MPACK provides a GPU-tuned implementation in double-double (DD) for matrix multiplication.
1. we replaced the underlying arithmetic for all CPU routines in DD;
2. we re-implemented the GPU-tuned Rgemm using CAMPARY:
– the classical blocking algorithm is employed;
– a thread is created for each element of a block;
– a specific number of threads is also allocated per block;
– shared memory is used for each block;
– reading is done from global memory.
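The blocking scheme listed above can be sketched on the CPU in C++. This is only an illustration under stated assumptions: plain `double` stands in for the multiple-word element type, the two tile buffers stand in for GPU shared memory, and the innermost (i, j) loop body stands in for the work of one GPU thread. The function name `blocked_gemm` and its signature are hypothetical and do not match the Rgemm interface or the actual CAMPARY kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C += A * B for square n x n row-major matrices,
// one tile x tile block of C at a time.
void blocked_gemm(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n, std::size_t tile) {
    assert(tile > 0);
    // Staging buffers: the CPU analogue of per-block shared memory.
    std::vector<double> a_tile(tile * tile), b_tile(tile * tile);

    for (std::size_t bi = 0; bi < n; bi += tile) {        // block row of C
        for (std::size_t bj = 0; bj < n; bj += tile) {    // block column of C
            for (std::size_t bk = 0; bk < n; bk += tile) { // shared dimension
                // Edge tiles may be smaller than tile x tile.
                const std::size_t mi = std::min(tile, n - bi);
                const std::size_t mj = std::min(tile, n - bj);
                const std::size_t mk = std::min(tile, n - bk);

                // Copy the input tiles (on a GPU: global -> shared memory).
                for (std::size_t i = 0; i < mi; ++i)
                    for (std::size_t k = 0; k < mk; ++k)
                        a_tile[i * tile + k] = A[(bi + i) * n + (bk + k)];
                for (std::size_t k = 0; k < mk; ++k)
                    for (std::size_t j = 0; j < mj; ++j)
                        b_tile[k * tile + j] = B[(bk + k) * n + (bj + j)];

                // Each (i, j) pair corresponds to one thread's element of C.
                for (std::size_t i = 0; i < mi; ++i) {
                    for (std::size_t j = 0; j < mj; ++j) {
                        double acc = 0.0;
                        for (std::size_t k = 0; k < mk; ++k)
                            acc += a_tile[i * tile + k] * b_tile[k * tile + j];
                        C[(bi + i) * n + (bj + j)] += acc;
                    }
                }
            }
        }
    }
}
```

Staging tiles pays off because every element of a tile is reused by a whole row or column of threads, so slow global-memory reads are amortized. With multiple-word elements each multiply-add is itself a sequence of floating-point operations, which raises the arithmetic cost per element loaded.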
[Figure: MFLOPs versus matrix dimension (0 to 2000); series: [25], CAMPARY]
Performance of RGEMM with CAMPARY vs [Nakata2012] in DD on GPU.
Max. performance:
– 14.8 GFlops for CAMPARY,
– 16.4 GFlops for [Nakata2012].
[Figure: MFLOPs versus matrix dimension (0 to 1000); series: 3D, 4D, 5D, 6D, 8D]
Performance of RGEMM with CAMPARY for n-double on GPU.
Max. performance:
– 1.6 GFlops for 3D,
– 976 MFlops for 4D,
– 660 MFlops for 5D,
– 453 MFlops for 6D,
– 200 MFlops for 8D.
SDPA-CAMPARY
1. started from the SDPA-DD package, in which we changed the underlying arithmetic;
2. linked the CAMPARY version of MPACK with it;
3. tested performance using standard problems from the SDPLIB package;
4. tested accuracy on binary codes problems from Sotirov's collection.