

  1. Fast Coordinate Descent Methods for Non-Negative Matrix Factorization. Inderjit S. Dhillon, University of Texas at Austin. SIAM Conference on Applied Linear Algebra, Valencia, Spain, June 19, 2012. Joint work with Cho-Jui Hsieh.

  2. Outline
     - Non-negative Matrix Factorization (NMF)
     - Greedy Coordinate Descent (GCD) for least squares NMF
     - NMF with KL-divergence
     - Non-negative Tensor Factorization (NTF)

  3. Outline (repeated section-divider slide)

  4. Problem Definition
     Input: a non-negative matrix V ∈ R^{m×n} and the target rank k.
     Output: two non-negative matrices W ∈ R^{m×k} and H ∈ R^{n×k} such that WH^T is a good approximation to V. Usually m, n ≫ k.
     How to measure goodness of approximation? Two widely used choices:
     Least squares NMF:
       min_{W,H ≥ 0} f(W, H) ≡ ‖V − WH^T‖_F^2 = Σ_{i,j} (V_{ij} − (WH^T)_{ij})^2
     KL-divergence NMF:
       min_{W,H ≥ 0} L(W, H) ≡ Σ_{i,j} [ V_{ij} log(V_{ij}/(WH^T)_{ij}) − V_{ij} + (WH^T)_{ij} ]
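The two objectives above can be written in a few lines of NumPy. This is an illustrative sketch, not code from the talk: the function names are mine, zero entries of V are masked so that 0·log 0 is treated as 0, and a small `eps` guards the logarithm's denominator.

```python
import numpy as np

def ls_objective(V, W, H):
    """Least squares NMF objective: ||V - W H^T||_F^2."""
    R = V - W @ H.T
    return np.sum(R * R)

def kl_objective(V, W, H, eps=1e-12):
    """Generalized KL-divergence NMF objective.

    Zero entries of V contribute nothing to the log term
    (the convention 0 * log 0 = 0)."""
    WH = W @ H.T
    mask = V > 0
    div = np.sum(V[mask] * np.log(V[mask] / (WH[mask] + eps)))
    return div - V.sum() + WH.sum()
```

Both objectives vanish when V = WH^T exactly, which is an easy sanity check.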

  5. Problem Definition (Cont'd)
     Applications: text mining, image processing, ... NMF can give a more interpretable basis than the SVD.
     To achieve better sparsity, researchers have proposed adding L1 regularization terms on W and H:
       (W, H) = argmin_{W,H ≥ 0} (1/2)‖V − WH^T‖_F^2 + ρ1 Σ_{i,r} W_{ir} + ρ2 Σ_{j,r} H_{jr}
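The regularized objective is a small variation on the least-squares one. A minimal sketch (function name mine); note that because W and H are elementwise non-negative, the L1 penalty is simply the sum of the entries:

```python
import numpy as np

def l1_nmf_objective(V, W, H, rho1, rho2):
    """(1/2)||V - W H^T||_F^2 + rho1 * sum(W) + rho2 * sum(H).

    Assumes W >= 0 and H >= 0 elementwise, so the L1 norm of each
    factor reduces to the plain sum of its entries."""
    R = V - W @ H.T
    return 0.5 * np.sum(R * R) + rho1 * W.sum() + rho2 * H.sum()
```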

  6. Existing Optimization Methods
     NMF is nonconvex, but it is convex when W or H is fixed. Recent methods follow the alternating minimization framework: iteratively solve min_{W ≥ 0} f(W, H) and min_{H ≥ 0} f(W, H) until convergence. For least squares NMF, each sub-problem can be exactly or approximately solved by:
     1. Multiplicative rule (Lee and Seung, 2001)
     2. Projected gradient method (Lin, 2007)
     3. Newton-type updates (Kim, Sra and Dhillon, 2007)
     4. Active set method (Kim and Park, 2008)
     5. Cyclic coordinate descent method (Cichocki and Phan, 2009)
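As an illustration of the alternating minimization framework, here is a minimal NumPy sketch using the multiplicative rule of Lee and Seung (method 1 above). The function name, random initialization, and the small `eps` added to the denominators (to avoid division by zero) are my own choices, not from the slides.

```python
import numpy as np

def nmf_multiplicative(V, k, iters=200, seed=0, eps=1e-12):
    """Alternating minimization for least squares NMF, V ≈ W @ H.T,
    via the multiplicative rule (Lee and Seung, 2001).

    Each update multiplies the current factor elementwise, so a
    non-negative initialization stays non-negative throughout."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((n, k))
    for _ in range(iters):
        W *= (V @ H) / (W @ (H.T @ H) + eps)     # fix H, improve W
        H *= (V.T @ W) / (H @ (W.T @ W) + eps)   # fix W, improve H
    return W, H
```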

  7. Coordinate Descent Method
     Update one variable at a time until convergence: (W, H) ← (W + s E_{ir}, H). Get s by solving a one-variable problem:
       min_{s : W_{ir} + s ≥ 0} g^W_{ir}(s) ≡ f(W + s E_{ir}, H).
     For the square loss, g^W_{ir} has a closed-form solution:
       s* = max(0, W_{ir} − g'_{ir}(0)/g''_{ir}(0)) − W_{ir},
     where g'_{ir}(0) = ∇_{W_{ir}} f(W, H) = (WH^T H − VH)_{ir} and g''_{ir}(0) = ∇²_{W_{ir}} f(W, H) = (H^T H)_{rr}.
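The closed-form update can be sketched as follows (a hypothetical helper, not the authors' code; in a real implementation H^T H would be cached across updates rather than recomputed, as later slides discuss):

```python
import numpy as np

def cd_update_W(V, W, H, i, r):
    """One closed-form coordinate update of W[i, r] for least squares NMF.

    Minimizes the one-variable quadratic g(s) = f(W + s*E_ir, H) subject
    to W[i, r] + s >= 0, then applies the step in place."""
    HtH = H.T @ H                              # cached in practice
    g1 = W[i] @ HtH[:, r] - V[i] @ H[:, r]     # g'(0) = (W H^T H - V H)[i, r]
    g2 = HtH[r, r]                             # g''(0) = (H^T H)[r, r]
    s = max(0.0, W[i, r] - g1 / g2) - W[i, r]  # projected Newton step
    W[i, r] += s
    return s
```

Because g is exactly quadratic in s, one projected Newton step lands on the constrained minimizer: afterwards either the partial derivative is zero or the variable sits at the boundary W[i, r] = 0.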

  8. Outline (repeated section-divider slide)

  9. Cyclic Coordinate Descent for Least Squares NMF (FastHals)
     Recently, Cichocki and Phan (2009) proposed a cyclic coordinate descent algorithm (FastHals) for least squares NMF.
     Fixed update sequence: W_{1,1}, W_{1,2}, ..., W_{1,k}, W_{2,1}, ..., W_{m,k}, H_{1,1}, ..., H_{n,k}, W_{1,1}, ...
     Each update has time complexity O(k).
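A full cyclic sweep over W in this fixed order might look like the sketch below (function name mine, not from the talk). The O(k) cost per update comes from computing H^T H and VH once per sweep and maintaining the affected gradient row incrementally; the loss is taken as (1/2)‖V − WH^T‖_F^2 so the gradient matches the earlier slide.

```python
import numpy as np

def fasthals_sweep_W(V, W, H):
    """One cyclic coordinate-descent sweep over all entries of W.

    With HtH and VH precomputed, each single-variable update costs
    O(k): a projected Newton step plus an O(k) gradient-row fix."""
    HtH = H.T @ H            # k x k Gram matrix, once per sweep
    VH = V @ H               # m x k, once per sweep
    G = W @ HtH - VH         # gradient of (1/2)||V - W H^T||_F^2 w.r.t. W
    m, k = W.shape
    for i in range(m):
        for r in range(k):   # fixed order W[i,1], ..., W[i,k]
            s = max(0.0, W[i, r] - G[i, r] / HtH[r, r]) - W[i, r]
            W[i, r] += s
            G[i, :] += s * HtH[r, :]   # O(k) gradient maintenance
    return W
```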

  10. Variable Selection
     FastHals updates variables uniformly. However, an efficient algorithm should update variables with frequency proportional to their "importance"! We propose a Greedy Coordinate Descent method (GCD) for NMF.
     [Figures: objective value vs. number of coordinate updates for CD and FastHals; panels "The behavior of FastHals" and "The behavior of GCD" plotting the number of updates and the values in the solution across the variables in H.]

  11. Greedy Coordinate Descent (GCD)
     Strategy: select the variables that maximally reduce the objective function.
     When W_{ir} is selected, the objective function can be reduced by
       D^W_{ir} ≡ f(W, H) − f(W + s* E_{ir}, H) = −G^W_{ir} s* − (1/2)(H^T H)_{rr} (s*)²,
     where G^W ≡ ∇_W f(W, H) = WH^T H − VH, and s* is the optimal step size.
     If D^W can be easily maintained, we can choose the variables with the largest objective-function reduction according to D^W.

  12. How to Maintain D^W (Objective Value Reduction)
     s* can be computed from G^W and H^T H (from the one-variable update rule):
       D^W_{ir} = −G^W_{ir} s* − (1/2)(H^T H)_{rr} (s*)², where G^W = WH^T H − VH.
     Therefore, we can maintain D^W if G^W and H^T H are known. When W_{ir} ← W_{ir} + s*, only the i-th row of G^W changes:
       G^W_{ij} ← G^W_{ij} + s* (H^T H)_{rj}, for all j = 1, ..., k.
     Therefore, the time for maintaining D^W is only O(k), the same time complexity as cyclic coordinate descent!
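The O(k) maintenance can be sketched for a single greedy update of one row of W (a hypothetical helper, not the authors' code; `w` and `g` are the i-th rows of W and G^W, modified in place):

```python
import numpy as np

def greedy_update_row(w, g, HtH):
    """One greedy coordinate update within a row of W.

    w:   row W[i, :]                         (updated in place)
    g:   gradient row G^W[i, :] = (W H^T H - V H)[i, :]  (updated in place)
    HtH: cached k x k Gram matrix H^T H
    Picks the coordinate with the largest decrease D, applies its
    optimal step, and maintains g at O(k) cost."""
    d2 = np.diag(HtH)                        # g''(0) for each coordinate
    s = np.maximum(0.0, w - g / d2) - w      # optimal steps s* (all coords)
    D = -g * s - 0.5 * d2 * s * s            # objective decrease per coord
    r = int(np.argmax(D))                    # greedy choice
    w[r] += s[r]
    g += s[r] * HtH[r, :]                    # O(k) gradient maintenance
    return r, D[r]
```

Since the one-variable problem is exactly quadratic, D[r] is the true decrease in f, which the test below checks against the full objective.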

  13. Greedy Coordinate Descent (GCD)
     Following the alternating minimization framework, our algorithm GCD alternately updates the variables in W and the variables in H. When updating one variable in W, we can maintain D^W in O(k) time. We conduct a sequence of updates on W: W^(0), W^(1), ..., with a corresponding sequence (D^W)^(0), (D^W)^(1), ...
     When should we switch from W's updates to H's updates? We update variables in W until the maximum function-value decrease is small enough:
       max_{i,j} D^W_{ij} < ε · p_init, where p_init = max_{i,j} (D^W)^(0)_{ij}.

  14. Greedy Coordinate Descent (GCD)
     Initialize H^T H, W^T W.
     While (not converged):
       1. Compute G^W = W(H^T H) − VH.
       2. Compute D^W according to G^W.
       3. Compute p_init = max_{i,r} D^W_{ir}.
       4. For each row i of W:
          q_i = argmax_r D^W_{i,r}
          While D^W_{i,q_i} > ε · p_init:
            4.1 Update W_{i,q_i}.
            4.2 Update W^T W and D^W.
            4.3 q_i ← argmax_r D^W_{ir}.
       5. For updates to H, repeat steps analogous to Steps 1 through 4.
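Putting the pieces together, a minimal NumPy sketch of the algorithm above might look like this. It is an illustrative reimplementation, not the authors' code: Gram matrices are recomputed at each half-step instead of being maintained incrementally (Step 4.2), a small floor on the Gram diagonal guards against zero columns, and the H half-step reuses the same routine on V^T by symmetry (Step 5).

```python
import numpy as np

def gcd_half_step(V, W, H, eps=1e-3):
    """One GCD pass over the rows of W for least squares NMF."""
    HtH = H.T @ H
    G = W @ HtH - V @ H                      # Step 1: gradient G^W
    d2 = np.maximum(np.diag(HtH), 1e-12)     # guard against zero columns
    S = np.maximum(0.0, W - G / d2) - W      # optimal steps s*
    D = -G * S - 0.5 * S * S * d2            # Step 2: decreases D^W
    p_init = D.max()                         # Step 3
    if p_init <= 0:
        return W
    for i in range(W.shape[0]):              # Step 4
        q = int(np.argmax(D[i]))
        while D[i, q] > eps * p_init:
            s = S[i, q]
            W[i, q] += s                     # 4.1 update chosen variable
            G[i, :] += s * HtH[q, :]         # 4.2 maintain gradient, O(k)
            S[i, :] = np.maximum(0.0, W[i, :] - G[i, :] / d2) - W[i, :]
            D[i, :] = -G[i, :] * S[i, :] - 0.5 * S[i, :] ** 2 * d2
            q = int(np.argmax(D[i]))         # 4.3 re-pick best coordinate
    return W

def gcd_nmf(V, k, outer_iters=20, eps=1e-3, seed=0):
    """Alternate GCD half-steps on W and H (Step 5, by symmetry)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k))
    H = rng.random((V.shape[1], k))
    for _ in range(outer_iters):
        W = gcd_half_step(V, W, H, eps)
        H = gcd_half_step(V.T, H, W, eps)
    return W, H
```

The inner loop terminates because every accepted update decreases the objective by more than ε·p_init, and the objective is bounded below by zero.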

  15. Comparisons
     Time (in seconds) to reach the given relative error:

     dataset | m      | n     | k  | rel. error | GCD  | FHals | PGrad | BPivot
     --------|--------|-------|----|------------|------|-------|-------|-------
     Synth03 | 500    | 1,000 | 10 | 10^-4      | 2.3  | 2.1   | 1.7   | 0.6
     Synth03 | 500    | 1,000 | 30 | 10^-4      | 9.3  | 26.6  | 12.4  | 4.0
     Synth08 | 500    | 1,000 | 10 | 10^-4      | 0.21 | 0.43  | 0.53  | 0.56
     Synth08 | 500    | 1,000 | 30 | 10^-4      | 0.43 | 0.77  | 2.54  | 2.86
     CBCL    | 361    | 2,429 | 49 | 0.0410     | 2.3  | 4.0   | 13.5  | 10.6
     CBCL    | 361    | 2,429 | 49 | 0.0376     | 8.9  | 18.0  | 45.6  | 30.9
     CBCL    | 361    | 2,429 | 49 | 0.0373     | 14.6 | 29.0  | 84.6  | 51.5
     ORL     | 10,304 | 400   | 25 | 0.0365     | 1.8  | 6.5   | 9.0   | 7.4
     ORL     | 10,304 | 400   | 25 | 0.0335     | 14.1 | 30.3  | 98.6  | 33.9
     ORL     | 10,304 | 400   | 25 | 0.0332     | 33.0 | 63.3  | 256.8 | 76.5

  16. Comparisons
     Results on MNIST (m = 780, n = 60,000, #nz = 8,994,156, k = 10).
     [Figure: relative function value difference (log scale, 10^0 down to 10^-5) vs. time in seconds (0 to 350) for GCD, FastHals, and BlockPivot.]

  17. Outline (repeated section-divider slide)
