

  1. On the Linear Algebra Employed in the MOSEK Conic Optimizer. Monday, Jul 13. Erling D. Andersen, www.mosek.com

  2. MOSEK summary
• General: LP, convex QP, CQP, SDP, MIP.
• Version 8: Work in progress.

  3. The conic optimization problem

$$
\begin{array}{rll}
\min & \displaystyle\sum_{j=1}^{n} c_j x_j + \sum_{j=1}^{\bar n} \langle \bar C_j, \bar X_j \rangle & \\[1ex]
\text{subject to} & \displaystyle\sum_{j=1}^{n} a_{ij} x_j + \sum_{j=1}^{\bar n} \langle \bar A_{ij}, \bar X_j \rangle = b_i, & i = 1, \ldots, m, \\[1ex]
& x \in K, \quad \bar X_j \succeq 0, & j = 1, \ldots, \bar n.
\end{array}
$$

Explanation:
• $x_j$ is a scalar variable.
• $\bar X_j$ is a square matrix variable.
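To fix the notation, here is a minimal illustrative instance (my example, not from the slides) with $n = 2$, $\bar n = 1$, $m = 1$:

$$
\min \; x_1 + \langle I, \bar X_1 \rangle
\quad \text{subject to} \quad
2 x_1 + x_2 + \langle e_1 e_1^T, \bar X_1 \rangle = 1,
\qquad x \in K, \; \bar X_1 \succeq 0,
$$

i.e. $c = (1, 0)$, $\bar C_1 = I$, $a_{11} = 2$, $a_{12} = 1$, $\bar A_{11} = e_1 e_1^T$, and $b_1 = 1$.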

  4. • $K$ represents a Cartesian product of conic quadratic constraints, e.g. $x_1 \ge \| x_{2:n} \|$.
• $\bar X_j \succeq 0$ means $\bar X_j = \bar X_j^T$ and $\bar X_j$ is PSD.
• $\bar C_j$ and $\bar A_{ij}$ are required to be symmetric.
• $\langle A, B \rangle := \mathrm{tr}(A^T B)$.
• Dimensions are large.
• Data matrices are typically sparse.
• $A$ usually has at most 10 nonzeros per column on average.
• $\bar A_{ij}$ contains few nonzeros and/or is low rank.
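Spelled out (a standard definition consistent with the first bullet; the block sizes $n_t$ are notation introduced here), $K$ is a Cartesian product of quadratic cones:

$$
K = Q^{n_1} \times \cdots \times Q^{n_p},
\qquad
Q^{n_t} := \{ x \in \mathbb{R}^{n_t} : x_1 \ge \| x_{2:n_t} \|_2 \}.
$$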

  5. The algorithm: Simplified version
• Step 1: Set up the homogeneous and self-dual model.
• Step 2: Choose a starting point.
• Step 3: Compute the Nesterov-Todd search direction.
• Step 4: Take a step.
• Step 5: Stop if the trial solution is good enough.
• Step 6: Go to Step 3.
A code sketch of this loop follows below.
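To make the control flow concrete, here is a hypothetical skeleton of this loop; every type and function name (Point, compute_nt_direction, and so on) is an illustrative placeholder, not MOSEK's actual internals:

```cpp
// Skeleton of the simplified homogeneous self-dual IPM loop.
// All names here are hypothetical placeholders.
struct Point {};      // current iterate (x, X-bar, y, tau, kappa, ...)
struct Direction {};  // Nesterov-Todd search direction

static Point     initial_point()                             { return {}; }   // Step 2
static Direction compute_nt_direction(const Point&)          { return {}; }   // Step 3
static double    step_size(const Point&, const Direction&)   { return 1.0; }  // Step 4
static void      take_step(Point&, const Direction&, double) {}
static bool      good_enough(const Point&)                   { return true; } // Step 5

int main() {
    // Step 1: the homogeneous self-dual model is assumed to be
    // set up inside the routines above.
    Point p = initial_point();                 // Step 2
    while (!good_enough(p)) {                  // Step 5
        Direction d = compute_nt_direction(p); // Step 3
        double alpha = step_size(p, d);        // Step 4
        take_step(p, d, alpha);                // then go to Step 3 (Step 6)
    }
}
```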

  6. Search direction computation
Requires the solution of:

$$
\begin{bmatrix}
-(W W^T)^{-1} & 0 & A^T \\
0 & -(\bar W \bar W^T)^{-1} & \bar A^T \\
A & \bar A & 0
\end{bmatrix}
\begin{bmatrix} d_x \\ d_{\bar x} \\ d_y \end{bmatrix}
=
\begin{bmatrix} r_x \\ r_{\bar x} \\ r_y \end{bmatrix}
$$

where
• $W$ and $\bar W$ are nonsingular block diagonal matrices.
• $W W^T$ is a diagonal matrix plus low-rank terms.

  7. Reduced Schur system approach
We have

$$
\big( (AW)(AW)^T + (\bar A \bar W)(\bar A \bar W)^T \big) d_y = \cdots
$$

and

$$
d_x = -(W W^T)(r_x - A^T d_y),
\qquad
d_{\bar x} = -(\bar W \bar W^T)(r_{\bar x} - \bar A^T d_y).
$$

Cons:
• Dense columns cause issues.
• Numerical stability: bad condition number.
Pros:
• A symmetric positive definite system.
• Use Cholesky with no pivoting.
• Employed in major commercial solvers.
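For completeness (this step is implicit on the slide), the reduced system follows by block elimination in the system on the previous slide: the first two block rows give the expressions for $d_x$ and $d_{\bar x}$ above, and substituting them into $A d_x + \bar A d_{\bar x} = r_y$ yields

$$
\big( (AW)(AW)^T + (\bar A \bar W)(\bar A \bar W)^T \big) d_y
= r_y + A W W^T r_x + \bar A \bar W \bar W^T r_{\bar x}.
$$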

  8. Computing the Schur matrix
Assumptions:
• Let us focus on $(\bar A \bar W)(\bar A \bar W)^T = \bar A \bar W \bar W^T \bar A^T$.
• Only one matrix variable. The general case follows easily.
• The NT search direction implies $\bar W = R \otimes R$ and $\bar W^T = R^T \otimes R^T$, where the Kronecker product $\otimes$ is defined as

$$
R \otimes R =
\begin{bmatrix}
R_{11} R & R_{12} R & \cdots \\
R_{21} R & R_{22} R & \\
\vdots & & \ddots
\end{bmatrix}.
$$

  9. Fact:

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l
= \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T).
$$
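This fact follows from the mixed-product property $(R \otimes R)(R^T \otimes R^T) = (R R^T) \otimes (R R^T)$ and the standard identity $(B \otimes C)\,\mathrm{vec}(X) = \mathrm{vec}(C X B^T)$; since row $k$ of $\bar A$ is $\mathrm{vec}(\bar A_k)^T$,

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l
= \mathrm{vec}(\bar A_k)^T \big( (R R^T) \otimes (R R^T) \big) \mathrm{vec}(\bar A_l)
= \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T),
$$

using the symmetry of $R R^T$.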

  10. Exploiting sparsity 1
Compute the lower triangular part of

$$
\bar A \bar W \bar W^T \bar A^T =
\begin{bmatrix}
\mathrm{vec}(\bar A_1)^T \\
\vdots \\
\mathrm{vec}(\bar A_m)^T
\end{bmatrix}
\begin{bmatrix}
\mathrm{vec}(R R^T \bar A_1 R R^T) & \cdots & \mathrm{vec}(R R^T \bar A_m R R^T)
\end{bmatrix},
$$

so the $l$th column is computed as

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l
= \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T),
\quad \text{for } k \ge l.
$$

Avoid computing

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l
= \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T)
= 0
\quad \text{if } \bar A_k = 0 \text{ or } \bar A_l = 0.
$$

  11. Exploiting sparsity 2
Moreover,
• $R$ is a dense square matrix.
• $\bar A_i$ is typically extremely sparse, e.g. $\bar A_i = e_k e_k^T$, as observed by J. Sturm, for instance.
• Wlog assume $\bar A_i = U_i V_i^T + (U_i V_i^T)^T$, because $U_i = \bar A_i / 2$ and $V_i = I$ is a valid choice.
• In practice $U_i$ and $V_i$ are sparse and low rank, i.e. they have few columns.
• The new idea!
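A concrete illustration (my example, not from the slides): for $\bar A_i = e_k e_l^T + e_l e_k^T$ the rank-one factors $U_i = e_k$, $V_i = e_l$ work, and for $\bar A_i = e_k e_k^T$ one can take $U_i = \tfrac{1}{2} e_k$ and $V_i = e_k$, since in both cases $U_i V_i^T + (U_i V_i^T)^T = \bar A_i$.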

  12. Recall

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l
= \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T)
$$

must be computed for all $k \ge l$, and

$$
R R^T \bar A_l R R^T
= R R^T \big( U_l V_l^T + (U_l V_l^T)^T \big) R R^T
= \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T,
$$

where

$$
\hat U_l := R R^T U_l,
\qquad
\hat V_l := R R^T V_l.
$$

  13. • $\hat U_l$ and $\hat V_l$ are dense matrices.
• Sparsity in $U_l$ and $V_l$ is exploited.
• Low-rank structure is exploited.
• Is all of $\hat U_l$ and $\hat V_l$ required?

  14. Observe

$$
e_i^T \big( U_k V_k^T + (U_k V_k^T)^T \big) = 0,
\quad \forall i \notin I_k,
$$

where $I_k := \{ i \mid U_{k,i:} \ne 0 \vee V_{k,i:} \ne 0 \}$. Now

$$
\begin{aligned}
\mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T)
&= \mathrm{vec}\big( U_k V_k^T + (U_k V_k^T)^T \big)^T \mathrm{vec}\big( \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T \big) \\
&= \sum_i 2 (U_k e_i)^T \big( \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T \big) (V_k e_i).
\end{aligned}
$$

Therefore, only the rows of $\hat U_l$ and $\hat V_l$ corresponding to $\bigcup_{k \ge l} I_k$ are needed.
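The factor 2 can be seen (a small step the slide leaves implicit) by expanding $U_k V_k^T = \sum_i (U_k e_i)(V_k e_i)^T$ over columns and using the symmetry of $M_l := \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T$:

$$
\big\langle U_k V_k^T + (U_k V_k^T)^T, \; M_l \big\rangle
= \sum_i \Big( (U_k e_i)^T M_l (V_k e_i) + (V_k e_i)^T M_l (U_k e_i) \Big)
= \sum_i 2 (U_k e_i)^T M_l (V_k e_i).
$$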

  15. Proposed algorithm:
• Compute $I := \bigcup_{k \ge l} I_k$.
• Compute the rows of $\hat U_l$ and $\hat V_l$ indexed by $I$.
• Compute $\sum_i 2 (U_k e_i)^T \big( \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T \big) (V_k e_i)$ for $k \ge l$.
Possible improvements:
• Exploit the special case $U_{k:j} = \alpha V_{k:j}$.
• Exploit dense computations, e.g. level 3 BLAS, when possible and worthwhile.
A code sketch of the entry computation follows below.
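A minimal dense sketch of the entry formula (ignoring the row-subset optimization and any BLAS); the layout and all names are illustrative assumptions, not MOSEK code:

```cpp
// One Schur-matrix entry via the low-rank representation
// A_k = U_k V_k^T + (U_k V_k^T)^T, with Uhat = R R^T U_l and
// Vhat = R R^T V_l precomputed.  Illustrative only; a real
// implementation would restrict work to the row support sets I_k
// and use level 3 BLAS.
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>; // dense, Mat[row][col]

double schur_entry(const Mat& Uk, const Mat& Vk,
                   const Mat& Uhat, const Mat& Vhat) {
    const std::size_t n = Uk.size();      // matrix-variable dimension
    const std::size_t p = Uk[0].size();   // #columns (rank) of U_k, V_k
    const std::size_t q = Uhat[0].size(); // #columns (rank) of Uhat, Vhat
    double entry = 0.0;
    // entry = sum_i 2 (U_k e_i)^T (Uhat Vhat^T + Vhat Uhat^T) (V_k e_i),
    // expanded column by column so the full n-by-n matrix is never formed.
    for (std::size_t i = 0; i < p; ++i) {
        for (std::size_t j = 0; j < q; ++j) {
            double uU = 0, vV = 0, uV = 0, vU = 0;
            for (std::size_t r = 0; r < n; ++r) {
                uU += Uk[r][i] * Uhat[r][j]; // (U_k e_i)^T (Uhat e_j)
                vV += Vk[r][i] * Vhat[r][j]; // (V_k e_i)^T (Vhat e_j)
                uV += Uk[r][i] * Vhat[r][j];
                vU += Vk[r][i] * Uhat[r][j];
            }
            entry += 2.0 * (uU * vV + uV * vU);
        }
    }
    return entry;
}
```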

  16. Summary:
• Exploit sparsity as done in SeDuMi by Sturm.
• Also able to exploit low-rank structure.
• Not implemented yet!

  17. Linear algebra summary
• Sparse matrix operations, e.g. multiplications.
• Large sparse matrix factorization, e.g. Cholesky.
  • Including ordering (AMD, GP).
• Dense column detection and handling.
• Dense sequential level 1, 2, 3 BLAS operations.
  • Inside sparse Cholesky, for instance.
• The sequential Intel Math Kernel Library is employed extensively.
• Eigenvalue computations.
• What about parallelization?
  • Modern computers have many cores, typically from 4 to 12.
  • A recent customer example had 80.

  18. The parallelization challenge on shared memory
• A computer has many cores.
• Parallelization using native threads is cumbersome and error prone.
• Employ a parallelization framework, e.g. Cilk or OpenMP.
Other issues:
• Exploit caches.
• Do not overload the memory bus.
• Avoid overly fine-grained tasks because of threading overhead.

  19. Cilk summary:
• An extension to C and C++.
• Has a runtime environment that executes tasks in parallel on a number of workers.
• Handles the load balancing.
• Allows nested/recursive parallelism, e.g.
  • parallel dense matrix multiplication within a parallelized sparse Cholesky;
  • a parallel IPM within B&B.
• Is run-to-run deterministic.
  • Care must be taken in floating-point computations.
• Supported by the Intel C compiler, GCC, and Clang.

  20. Example: parallelized dense syrk
The dense level 3 BLAS syrk operation computes C = AA^T. Parallelized version using Cilk:

    if C is small
        C = A A^T
    else
        cilk_spawn syrk: C_11 = A_1: A_1:^T
        cilk_spawn gemm: C_21 = A_2: A_1:^T
        cilk_spawn syrk: C_22 = A_2: A_2:^T
        cilk_sync

Usage of recursion is allowed! A compilable sketch follows below.
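For concreteness, here is a compilable sketch of this recursion using the Cilk keywords (cilk_spawn, cilk_sync); the row-major layout, the naive base-case loops, and the cutoff value are illustrative assumptions, not the MOSEK implementation:

```cpp
// Recursive, Cilk-parallelized C += A*A^T (lower triangle only).
// Row-major dense storage; base case and cutoff are illustrative.
#include <cilk/cilk.h>
#include <cstddef>

// C (m x n, leading dim ldc) += A (m x k) * B^T (B is n x k).
static void gemm_nt(double* C, std::size_t ldc,
                    const double* A, std::size_t lda,
                    const double* B, std::size_t ldb,
                    std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                C[i * ldc + j] += A[i * lda + p] * B[j * ldb + p];
}

// C (m x m, leading dim ldc) += A (m x k) * A^T, lower triangle.
static void syrk(double* C, std::size_t ldc,
                 const double* A, std::size_t lda,
                 std::size_t m, std::size_t k) {
    const std::size_t CUTOFF = 64;      // "C is small": sequential base case
    if (m <= CUTOFF) {
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j <= i; ++j)
                for (std::size_t p = 0; p < k; ++p)
                    C[i * ldc + j] += A[i * lda + p] * A[j * lda + p];
        return;
    }
    const std::size_t m1 = m / 2, m2 = m - m1;
    const double* A1 = A;               // top m1 rows of A
    const double* A2 = A + m1 * lda;    // bottom m2 rows of A
    cilk_spawn syrk(C, ldc, A1, lda, m1, k);             // C11 = A1*A1^T
    cilk_spawn gemm_nt(C + m1 * ldc, ldc,
                       A2, lda, A1, lda, m2, m1, k);     // C21 = A2*A1^T
    syrk(C + m1 * ldc + m1, ldc, A2, lda, m2, k);        // C22 = A2*A2^T
    cilk_sync;  // the three block updates touch disjoint parts of C
}
```

The recursion itself expresses the parallelism; nested spawns inside the recursive calls are handled by the Cilk runtime's load balancing.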

  21. Our experience with Cilk
• Cilk is easy to learn, i.e. 3 new keywords.
• Nested/recursive parallelism is allowed.
• Useful for both sparse and dense matrix computations.
• Efficient parallelization is nevertheless hard.

  22. Summary and conclusions
• I am behind schedule with MOSEK version 8.
• Proposed a new algorithm for computing the Schur matrix in the semidefinite case.
• Discussed the use of a task-based parallelization framework, exemplified by Cilk.
• Slides: https://mosek.com/resources/presentations
