

  1. Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage
     Madeleine Udell
     Operations Research and Information Engineering, Cornell University
     Based on joint work with Alp Yurtsever (EPFL), Volkan Cevher (EPFL), and Joel Tropp (Caltech)
     LCCC, June 15, 2017

  2. Desiderata
     Suppose that the solution to a convex optimization problem has a compact representation.
        Problem data: O(n)
              ↓
        Working memory: O(???)
              ↓
        Solution: O(n)
     Can we develop algorithms that provably solve the problem using storage bounded by the size of the problem data and the size of the solution?

  3. Model problem: low-rank matrix optimization
     Consider a convex problem with decision variable X ∈ R^{m×n}, the compact matrix optimization problem:
        minimize    f(AX)                (CMOP)
        subject to  ‖X‖_S1 ≤ α
     ◮ A : R^{m×n} → R^d
     ◮ f : R^d → R convex and smooth
     ◮ ‖X‖_S1 is the Schatten-1 norm: the sum of singular values

  4. Model problem: low-rank matrix optimization
     Consider a convex problem with decision variable X ∈ R^{m×n}, the compact matrix optimization problem:
        minimize    f(AX)                (CMOP)
        subject to  ‖X‖_S1 ≤ α
     ◮ A : R^{m×n} → R^d
     ◮ f : R^d → R convex and smooth
     ◮ ‖X‖_S1 is the Schatten-1 norm: the sum of singular values
     Assume:
     ◮ compact specification: the problem data use O(n) storage
     ◮ compact solution: rank X⋆ = r is constant

  5. Model problem: low-rank matrix optimization
     Consider a convex problem with decision variable X ∈ R^{m×n}, the compact matrix optimization problem:
        minimize    f(AX)                (CMOP)
        subject to  ‖X‖_S1 ≤ α
     ◮ A : R^{m×n} → R^d
     ◮ f : R^d → R convex and smooth
     ◮ ‖X‖_S1 is the Schatten-1 norm: the sum of singular values
     Assume:
     ◮ compact specification: the problem data use O(n) storage
     ◮ compact solution: rank X⋆ = r is constant
     Note: the same ideas work for X ⪰ 0.

  6. Are desiderata achievable?
        minimize    f(AX)
        subject to  ‖X‖_S1 ≤ α
     CMOP, using any first-order method:
        Problem data: O(n)
              ↓
        Working memory: O(n^2)
              ↓
        Solution: O(n)

  7. Are desiderata achievable?
     CMOP, using ???:
        Problem data: O(n)
              ↓
        Working memory: O(???)
              ↓
        Solution: O(n)

  8. Application: matrix completion
     Find X matching M on the observed entries (i, j) ∈ Ω:
        minimize    Σ_{(i,j)∈Ω} (X_ij − M_ij)^2
        subject to  ‖X‖_S1 ≤ α
     ◮ m = rows, n = columns of the matrix to complete
     ◮ d = |Ω| is the number of observations
     ◮ A selects the observed entries X_ij, (i, j) ∈ Ω
     ◮ f(AX) = ‖AX − AM‖^2
     Compact if d = O(n) observations and rank(X⋆) is constant.

  9. Application: matrix completion
     Find X matching M on the observed entries (i, j) ∈ Ω:
        minimize    Σ_{(i,j)∈Ω} (X_ij − M_ij)^2
        subject to  ‖X‖_S1 ≤ α
     ◮ m = rows, n = columns of the matrix to complete
     ◮ d = |Ω| is the number of observations
     ◮ A selects the observed entries X_ij, (i, j) ∈ Ω
     ◮ f(AX) = ‖AX − AM‖^2
     Compact if d = O(n) observations and rank(X⋆) is constant.
     Theorem: the ε-rank of M grows as log(m + n) if the rows and columns are iid (under some technical conditions). (Udell and Townsend, 2017)
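To make the compact specification concrete, here is a minimal NumPy sketch of this matrix-completion instance of (CMOP): the sampling operator A, the smooth loss f, and its gradient. The function names (A_op, grad_f, G_sparse) and the data layout (index arrays rows, cols and values M_obs) are illustrative assumptions, not from the talk; everything stored is O(d).

```python
import numpy as np
from scipy.sparse import coo_matrix

# Assumed data layout: rows, cols are integer index arrays of length d listing
# the observed positions (i, j) in Omega, and M_obs holds the values M_ij.

def A_op(X, rows, cols):
    """A X: select the observed entries X_ij for (i, j) in Omega."""
    return X[rows, cols]

def f(z, M_obs):
    """Smooth loss f(A X) = ||A X - A M||^2 restricted to the observed entries."""
    return np.sum((z - M_obs) ** 2)

def grad_f(z, M_obs):
    """Gradient of f with respect to z = A X (a length-d vector)."""
    return 2.0 * (z - M_obs)

def G_sparse(z, M_obs, rows, cols, m, n):
    """G = A^*(grad_f(z)): an m x n matrix with only d nonzeros, stored sparsely."""
    return coo_matrix((grad_f(z, M_obs), (rows, cols)), shape=(m, n))
```

The problem data (rows, cols, M_obs, α) occupy O(d) numbers, which matches the compact-specification assumption when d = O(n).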

  10. Application: phase retrieval
     ◮ image with n pixels: x♮ ∈ C^n
     ◮ acquire noisy nonlinear measurements b_i = |⟨a_i, x♮⟩|^2 + ω_i
     ◮ relax: if X = x♮ x♮^∗, then |⟨a_i, x♮⟩|^2 = x♮^∗ a_i a_i^∗ x♮ = tr(a_i a_i^∗ x♮ x♮^∗) = tr(a_i a_i^∗ X)
     ◮ recover the image by solving
        minimize    f(AX; b)
        subject to  tr X ≤ α,  X ⪰ 0
     Compact if d = O(n) observations and rank(X⋆) is constant.
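In the same spirit, a hedged sketch of the lifted measurement map for phase retrieval: the i-th measurement of X is tr(a_i a_i^∗ X), and on a rank-one X = u u^∗ it reduces to |⟨a_i, u⟩|^2, so a storage-conscious method never has to form the n × n matrix. The array name A_rows (measurement vectors stacked as rows) is an assumption for illustration.

```python
import numpy as np

def A_lifted(X, A_rows):
    """A(X)_i = tr(a_i a_i^* X) = a_i^* X a_i for each measurement vector a_i (row of A_rows)."""
    return np.real(np.einsum('di,ij,dj->d', A_rows.conj(), X, A_rows))

def A_lifted_rank1(u, A_rows):
    """The same map applied to a rank-one X = u u^*, without forming X:
    tr(a_i a_i^* u u^*) = |<a_i, u>|^2. This is the only form a
    storage-optimal method ever needs."""
    return np.abs(A_rows.conj() @ u) ** 2
```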

  11. Optimal storage
     What kind of storage bounds can we hope for?
     ◮ Assume black-box implementations of the maps
          (u, v) ↦ A(uv^∗),   (u, z) ↦ u^∗(A^∗z),   (v, z) ↦ (A^∗z)v
       where u ∈ R^m, v ∈ R^n, and z ∈ R^d
     ◮ Need Ω(m + n + d) storage to apply the linear map
     ◮ Need Θ(r(m + n)) storage for a rank-r approximate solution
     Definition. An algorithm for the model problem has optimal storage if its working storage is Θ(d + r(m + n)).
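To make the black-box assumption concrete, here is a sketch of the three primitives for the matrix-completion operator A (entry sampling on Ω). The class and method names are hypothetical; each call touches only Θ(m + n + d) numbers, matching the lower bound on this slide, and the m × n matrix A^∗z is never formed.

```python
import numpy as np

class SamplingOperator:
    """Black-box access to A for matrix completion.
    rows, cols: integer index arrays of length d listing the observed entries."""

    def __init__(self, rows, cols, m, n):
        self.rows, self.cols, self.m, self.n = rows, cols, m, n

    def apply_rank1(self, u, v):
        """A(u v^*): evaluate (u v^*)_{ij} = u_i v_j only on the observed entries."""
        return u[self.rows] * v[self.cols]

    def adjoint_left(self, u, z):
        """u^* (A^* z): length-n vector whose j-th entry sums u_i z_k over observations (i, j)."""
        return np.bincount(self.cols, weights=u[self.rows] * z, minlength=self.n)

    def adjoint_right(self, v, z):
        """(A^* z) v: length-m vector whose i-th entry sums z_k v_j over observations (i, j)."""
        return np.bincount(self.rows, weights=v[self.cols] * z, minlength=self.m)
```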

  12. Goal: optimal storage
     We can specify the problem using O(n) ≪ mn units of storage. Can we solve the problem using only O(n) units of storage?

  13. Goal: optimal storage
     We can specify the problem using O(n) ≪ mn units of storage. Can we solve the problem using only O(n) units of storage?
     If we write down X, we’ve already failed.

  14. A brief biased history of matrix optimization
     ◮ 1990s: interior-point methods
        ◮ storage cost Θ((m + n)^4) for the Hessian
     ◮ 2000s: convex first-order methods
        ◮ (accelerated) proximal gradient and others
        ◮ store the matrix variable: Θ(mn)
     (Interior-point: Nemirovski & Nesterov 1994; ...; First-order: Rockafellar 1976; Auslender & Teboulle 2006; ...)

  15. A brief biased history of matrix optimization
     ◮ 2008–present: storage-efficient convex first-order methods
        ◮ conditional gradient method (CGM) and extensions
        ◮ store the matrix in low-rank form: O(t(m + n)) after t iterations
        ◮ requires Θ(mn) storage once t ≥ min(m, n)
     ◮ 2003–present: nonconvex heuristics
        ◮ Burer–Monteiro factorization idea + various optimization algorithms
        ◮ store low-rank matrix factors: Θ(r(m + n))
        ◮ for a guaranteed solution, need unrealistic and unverifiable statistical assumptions
     (CGM: Frank & Wolfe 1956; Levitin & Poljak 1967; Hazan 2008; Clarkson 2010; Jaggi 2013; ...; Heuristics: Burer & Monteiro 2003; Keshavan et al. 2009; Jain et al. 2012; Bhojanapalli et al. 2015; Candès et al. 2014; Boumal et al. 2015; ...)

  16. The dilemma
     ◮ convex methods: slow memory hogs with guarantees
     ◮ nonconvex methods: fast and lightweight, but brittle

  17. The dilemma
     ◮ convex methods: slow memory hogs with guarantees
     ◮ nonconvex methods: fast and lightweight, but brittle
     Low memory or guaranteed convergence ... but not both?

  18. Conditional Gradient Method
        minimize    f(AX) = g(X)
        subject to  ‖X‖_S1 ≤ α
     [Figure: one CGM step on the Schatten-1 ball ‖X‖_S1 ≤ α. From the iterate X_t, move toward the atom H_t = argmax_{‖X‖_S1 ≤ α} ⟨X, −∇g(X_t)⟩ along −∇g(X_t) to obtain X_{t+1}.]

  19. Conditional Gradient Method
        minimize    f(AX)
        subject to  ‖X‖_S1 ≤ α
     CGM. Set X_0 = 0. For t = 0, 1, ...:
     ◮ compute G_t = A^∗ ∇f(AX_t)
     ◮ set the search direction H_t = argmax_{‖X‖_S1 ≤ α} ⟨X, −G_t⟩
     ◮ set the stepsize η_t = 2/(t + 2)
     ◮ update X_{t+1} = (1 − η_t) X_t + η_t H_t

  20. Conditional gradient method (CGM)
     Features:
     ◮ relies on an efficient linear optimization oracle to compute
          H_t = argmax_{‖X‖_S1 ≤ α} ⟨X, −G_t⟩
     ◮ a bound on suboptimality follows from the subgradient inequality,
          f(AX_t) − f(AX⋆) ≤ ⟨X_t − X⋆, G_t⟩ = ⟨AX_t − AX⋆, ∇f(AX_t)⟩ ≤ ⟨AX_t − AH_t, ∇f(AX_t)⟩,
       which provides a stopping condition
     ◮ faster variants: linesearch, away steps, ...
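Written out with the reason for each step (a restatement of the bound above in the slide's notation; the final quantity is exactly the stopping test in the algorithm below):

```latex
\begin{align*}
f(A X_t) - f(A X_\star)
  &\le \langle X_t - X_\star,\, G_t \rangle
     && \text{convexity of } g(X) = f(A X),\ G_t = A^* \nabla f(A X_t) \\
  &= \langle A X_t - A X_\star,\, \nabla f(A X_t) \rangle
     && \text{definition of the adjoint } A^* \\
  &\le \langle A X_t - A H_t,\, \nabla f(A X_t) \rangle
     && X_\star \text{ feasible and } H_t = \operatorname*{argmax}_{\|X\|_{S_1} \le \alpha} \langle X, -G_t \rangle .
\end{align*}
```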

  21. Linear optimization oracle for MOP
     Compute the search direction:
        argmax_{‖X‖_S1 ≤ α} ⟨X, −G⟩

  22. Linear optimization oracle for MOP
     Compute the search direction:
        argmax_{‖X‖_S1 ≤ α} ⟨X, −G⟩
     ◮ the solution is given by the maximum singular vectors of −G:
          −G = Σ_{i=1}^n σ_i u_i v_i^∗  ⟹  X = α u_1 v_1^∗
     ◮ use the Lanczos method: only need to apply G and G^∗ to vectors
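A sketch of this oracle in SciPy: svds runs a Lanczos-type iteration that touches G only through products with G and G^∗, so G can be passed as a LinearOperator and never materialized. The wrapper name lmo and the calling convention (black-box handles G_matvec, G_rmatvec) are assumptions for illustration.

```python
from scipy.sparse.linalg import LinearOperator, svds

def lmo(G_matvec, G_rmatvec, m, n, alpha):
    """Return (u, v) so that H = -alpha * u v^* maximizes <X, -G> over ||X||_S1 <= alpha.
    G_matvec(x) computes G x; G_rmatvec(y) computes G^* y."""
    G = LinearOperator((m, n), matvec=G_matvec, rmatvec=G_rmatvec, dtype=float)
    u, s, vt = svds(G, k=1)          # Lanczos-type computation of the top singular triple of G
    # With (u1, v1) the top singular pair of G, the maximizer of <X, -G> on the
    # Schatten-1 ball of radius alpha is the rank-one atom H = -alpha * u1 v1^*.
    return u[:, 0], vt[0, :]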

  23. Conditional gradient descent
     Algorithm 1  CGM for the model problem (CMOP)
     Input: problem data for (CMOP); suboptimality ε
     Output: solution X⋆
     function CGM
         X ← 0
         for t ← 0, 1, ... do
             (u, v) ← MaxSingVec(A^∗(∇f(AX)))
             H ← −α uv^∗
             if ⟨AX − AH, ∇f(AX)⟩ ≤ ε then break
             η ← 2/(t + 2)
             X ← (1 − η) X + η H
         return X
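A minimal runnable instantiation of this algorithm for the matrix-completion loss, keeping the full dense iterate X to mirror the pseudocode exactly (this is the Θ(mn)-storage baseline that the following slides remove). The function name and the tiny synthetic instance are illustrative only.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds

def cgm_matrix_completion(rows, cols, M_obs, m, n, alpha, eps=1e-3, max_iter=500):
    """CGM (Algorithm 1) for min ||A X - A M||^2 s.t. ||X||_S1 <= alpha."""
    X = np.zeros((m, n))                                   # dense iterate: Theta(mn) storage
    for t in range(max_iter):
        z = X[rows, cols]                                  # A X
        grad = 2.0 * (z - M_obs)                           # nabla f(A X)
        G = coo_matrix((grad, (rows, cols)), shape=(m, n)).tocsr()   # A^* nabla f(A X)
        u, s, vt = svds(G, k=1)                            # MaxSingVec of G
        u, v = u[:, 0], vt[0, :]
        H_obs = -alpha * u[rows] * v[cols]                 # A H for H = -alpha u v^*
        if np.dot(z - H_obs, grad) <= eps:                 # stopping condition <A X - A H, nabla f(A X)>
            break
        eta = 2.0 / (t + 2.0)
        X = (1.0 - eta) * X + eta * (-alpha) * np.outer(u, v)
    return X

# Illustrative usage on a small synthetic instance (not from the talk):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r, d = 60, 50, 2, 1500
    M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
    rows, cols = rng.integers(0, m, d), rng.integers(0, n, d)
    alpha = np.sum(np.linalg.svd(M, compute_uv=False))     # nuclear norm of the ground truth
    X_hat = cgm_matrix_completion(rows, cols, M[rows, cols], m, n, alpha)
    print("observed-entry error:", np.linalg.norm(X_hat[rows, cols] - M[rows, cols]))
```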

  24. Two crucial ideas
     To solve the problem using optimal storage:
     ◮ use the low-dimensional “dual” variable z_t = AX_t ∈ R^d to drive the iteration
     ◮ recover the solution from a small (randomized) sketch

  25. Two crucial ideas
     To solve the problem using optimal storage:
     ◮ use the low-dimensional “dual” variable z_t = AX_t ∈ R^d to drive the iteration
     ◮ recover the solution from a small (randomized) sketch
     Never write down X until it has converged to low rank.
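This slide leaves the sketch abstract. One concrete possibility, in the spirit of the randomized matrix sketches of Tropp, Yurtsever, Udell, and Cevher, is outlined below: keep two small random projections of X, update them under each rank-one CGM step in O(r(m + n)) arithmetic, and reconstruct a rank-r factorization only at the end. The class name, sketch sizes, and reconstruction formula are assumptions for illustration, not necessarily the exact choices in the talk.

```python
import numpy as np

class LowRankSketch:
    """Maintain Y = X @ Omega (m x k) and W = Psi @ X (l x n) for a matrix X
    that is only ever touched through updates X <- beta * X + eta * u v^*.
    Storage is O(k m + l n) = O(r (m + n)); X itself is never formed."""

    def __init__(self, m, n, r, seed=0):
        rng = np.random.default_rng(seed)
        k, l = 2 * r + 1, 4 * r + 3              # sketch sizes: one common heuristic choice
        self.Omega = rng.standard_normal((n, k))
        self.Psi = rng.standard_normal((l, m))
        self.Y = np.zeros((m, k))
        self.W = np.zeros((l, n))

    def rank1_update(self, beta, eta, u, v):
        """Track X <- beta * X + eta * u v^* through both sketches."""
        self.Y = beta * self.Y + eta * np.outer(u, v @ self.Omega)
        self.W = beta * self.W + eta * np.outer(self.Psi @ u, v)

    def reconstruct(self):
        """Return factors (U, V) with U @ V.T ~= X, via X_hat = Q (Psi Q)^+ W, Q = orth(Y)."""
        Q, _ = np.linalg.qr(self.Y)
        B = np.linalg.lstsq(self.Psi @ Q, self.W, rcond=None)[0]   # (Psi Q)^+ W
        return Q, B.T
```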

  26. Conditional gradient descent
     Algorithm 2  CGM for the model problem (CMOP)
     Input: problem data for (CMOP); suboptimality ε
     Output: solution X⋆
     function CGM
         X ← 0
         for t ← 0, 1, ... do
             (u, v) ← MaxSingVec(A^∗(∇f(AX)))
             H ← −α uv^∗
             if ⟨AX − AH, ∇f(AX)⟩ ≤ ε then break
             η ← 2/(t + 2)
             X ← (1 − η) X + η H
         return X

  27. Conditional gradient descent
     Introduce the “dual variable” z = AX ∈ R^d; eliminate X.
     Algorithm 3  Dual CGM for the model problem (CMOP)
     Input: problem data for (CMOP); suboptimality ε
     Output: solution X⋆
     function dualCGM
         z ← 0
         for t ← 0, 1, ... do
             (u, v) ← MaxSingVec(A^∗(∇f(z)))
             h ← A(−α uv^∗)
             if ⟨z − h, ∇f(z)⟩ ≤ ε then break
             η ← 2/(t + 2)
             z ← (1 − η) z + η h
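Putting the two ideas together, a hedged sketch of the z-driven iteration for the matrix-completion loss: per iteration it stores only z ∈ R^d, the d-sparse gradient matrix A^∗∇f(z), and the low-rank sketch, so the working storage is Θ(d + r(m + n)). It assumes the LowRankSketch class from the previous block; all names are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds

def sketchy_cgm_completion(rows, cols, M_obs, m, n, alpha, r, eps=1e-3, max_iter=500):
    """Dual CGM driven by z = A X, with X tracked only through a low-rank sketch
    (LowRankSketch from the previous code block; an assumption here)."""
    z = np.zeros(len(M_obs))                         # z = A X, starts at A 0 = 0
    sketch = LowRankSketch(m, n, r)
    for t in range(max_iter):
        grad = 2.0 * (z - M_obs)                     # nabla f(z)
        G = coo_matrix((grad, (rows, cols)), shape=(m, n)).tocsr()   # A^* nabla f(z), d nonzeros
        u, s, vt = svds(G, k=1)                      # MaxSingVec via a Lanczos-type solver
        u, v = u[:, 0], vt[0, :]
        h = -alpha * u[rows] * v[cols]               # h = A(-alpha u v^*)
        if np.dot(z - h, grad) <= eps:               # stopping condition <z - h, nabla f(z)>
            break
        eta = 2.0 / (t + 2.0)
        z = (1.0 - eta) * z + eta * h
        sketch.rank1_update(1.0 - eta, -eta * alpha, u, v)   # same convex update applied to X
    return sketch.reconstruct()                      # rank-r factors of the final iterate
```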
