Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms


  1. Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms. François Caron, Department of Statistics, Oxford. STATLEARN 2014, Paris, April 7, 2014. Joint work with Adrien Todeschini and Marie Chavent (INRIA, U. of Bordeaux).

  2. Outline: Introduction, Complete case, Matrix completion, Experiments, Conclusion and perspectives

  3. Matrix Completion ◮ Netflix prize ◮ 480k users and 18k movies providing 1-5 ratings ◮ 99% of the ratings are missing ◮ Objective: predict missing entries in order to make recommendations [Illustration: partially observed users × movies rating matrix, with most entries missing]

  4. Matrix Completion. Objective: complete a matrix X of size m × n from a subset of its entries. Applications ◮ Recommender systems ◮ Image inpainting ◮ Imputation of missing data [Illustration: partially observed matrix with missing entries to be filled in]

  5. Matrix Completion ◮ Potentially large matrices (each dimension of order 10^4 to 10^6) ◮ Very sparsely observed (1%-10%)

  6. Low rank Matrix Completion ◮ Assume that the complete matrix Z is of low rank: Z ≃ A Bᵀ, where Z is m × n, A is m × k and Bᵀ is k × n, with k ≪ min(m, n) [Illustration: low-rank factorization of the complete matrix]

  7. Low rank Matrix Completion ◮ Assume that the complete matrix Z is of low rank: Z ≃ A Bᵀ, where Z is m × n, A is m × k and Bᵀ is k × n, with k ≪ min(m, n) [Illustration: same factorization, with rows interpreted as users, columns as items, and k latent features]

  8. Low rank Matrix Completion ◮ Let Ω ⊂ {1, ..., m} × {1, ..., n} be the subset of observed entries ◮ For (i, j) ∈ Ω: X_ij = Z_ij + ε_ij, with ε_ij iid ∼ N(0, σ²), where σ² > 0
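As an illustration of this observation model, here is a minimal sketch (assuming NumPy; the matrix sizes, rank k, noise level sigma and observation rate are made-up values) that simulates a low-rank matrix observed with noise on a random subset Ω of entries:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 150, 5        # made-up sizes and rank
sigma = 0.1                  # noise standard deviation
obs_rate = 0.2               # fraction of observed entries

# Low-rank complete matrix Z = A B^T
A = rng.normal(size=(m, k))
B = rng.normal(size=(n, k))
Z = A @ B.T

# Observation mask Omega and noisy observations X on Omega (NaN elsewhere)
mask = rng.random((m, n)) < obs_rate
X = np.where(mask, Z + sigma * rng.normal(size=(m, n)), np.nan)
```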

  9. Low rank Matrix Completion ◮ Optimization problem: minimize over Z the objective (1/(2σ²)) Σ_{(i,j)∈Ω} (X_ij − Z_ij)² + λ rank(Z), i.e. negative log-likelihood plus penalty, where λ > 0 is some regularization parameter ◮ Non-convex ◮ Computationally hard for general subset Ω

  10. Low rank Matrix Completion ◮ Matrix completion with nuclear norm penalty: minimize over Z the objective (1/(2σ²)) Σ_{(i,j)∈Ω} (X_ij − Z_ij)² + λ ‖Z‖_∗, i.e. negative log-likelihood plus penalty, where ‖Z‖_∗ is the nuclear norm of Z, i.e. the sum of the singular values of Z ◮ Convex relaxation of the rank penalty optimization [Fazel, 2002, Candès and Recht, 2009, Candès and Tao, 2010]

  11. Low rank Matrix Completion. Soft-Impute algorithm ◮ Start with an initial matrix Z^(0) ◮ At each iteration t = 1, 2, ... ◮ Replace the missing elements in X with those in Z^(t−1) ◮ Perform a soft-thresholded SVD on the completed matrix, with shrinkage λ, to obtain the low rank matrix Z^(t) [Mazumder et al., 2010]
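A minimal sketch of this iteration (assuming NumPy and the synthetic X and mask from above; lam, the tolerance and the iteration cap are made-up values, and a dense SVD is used here even though the reference algorithm relies on cheaper large-scale SVD schemes):

```python
import numpy as np

def soft_thresholded_svd(M, thresh):
    """SVD of M with all singular values shrunk by thresh and clipped at zero."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    d_shrunk = np.maximum(d - thresh, 0.0)
    return (U * d_shrunk) @ Vt

def soft_impute(X, mask, lam, n_iter=100, tol=1e-5):
    """Soft-Impute: alternate imputation of missing entries and soft-thresholded SVD."""
    Z = np.zeros_like(X)                      # initial matrix Z^(0)
    X_obs = np.where(mask, X, 0.0)            # observed entries, zeros elsewhere
    for _ in range(n_iter):
        completed = np.where(mask, X_obs, Z)  # fill missing entries with Z^(t-1)
        Z_new = soft_thresholded_svd(completed, lam)
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0):
            return Z_new
        Z = Z_new
    return Z
```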

  12. Low rank Matrix Completion. Soft-Impute algorithm ◮ Soft-thresholded SVD yields a low-rank representation ◮ Each iteration decreases the value of the nuclear norm objective function towards its minimum ◮ Various strategies proposed to scale the algorithm to problems where n, m are of order 10^6 ◮ Same shrinkage applied to all singular values [Mazumder et al., 2010]

  13. Contributions ◮ Probabilistic interpretation of the nuclear norm objective function ◮ Maximum A Posteriori estimation assuming exponential priors on the singular values ◮ Soft-Impute = Expectation-Maximization algorithm ◮ Construction of alternative non-convex objective functions building on hierarchical priors ◮ Bridge the gap between the rank penalty and the nuclear norm penalty ◮ EM: adaptive algorithm that iteratively adjusts the shrinkage coefficients for each singular value ◮ Similar to the adaptive lasso in multivariate regression ◮ Numerical results show the interest of the approach on various datasets

  14. Outline: Introduction, Complete case (Hierarchical adaptive spectral penalty, EM algorithm for MAP estimation), Matrix completion, Experiments, Conclusion and perspectives

  15. Nuclear Norm penalty ◮ Complete matrix X ◮ Nuclear norm objective function: minimize over Z the objective (1/(2σ²)) ‖X − Z‖²_F + λ ‖Z‖_∗, where ‖·‖_F is the Frobenius norm ◮ Global solution given by a soft-thresholded SVD: Ẑ = S_{λσ²}(X), where S_λ(X) = U D_λ Vᵀ, X = U D Vᵀ is the SVD of X, D_λ = diag((d_1 − λ)_+, ..., (d_r − λ)_+) and t_+ = max(t, 0) [Cai et al., 2010, Mazumder et al., 2010]
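For the complete case this solution is just the soft-thresholding operator applied with threshold λσ². A minimal sketch (assuming NumPy; lam, sigma and the test matrix are made-up values):

```python
import numpy as np

def nuclear_norm_map(X, lam, sigma):
    """Closed-form solution for fully observed X: Z_hat = S_{lam*sigma^2}(X)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d_shrunk = np.maximum(d - lam * sigma**2, 0.0)   # soft-threshold the singular values
    return (U * d_shrunk) @ Vt

# Usage sketch with made-up values
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 40))
Z_hat = nuclear_norm_map(X, lam=5.0, sigma=1.0)
print(np.linalg.matrix_rank(Z_hat))   # typically much lower than min(m, n)
```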

  16. Nuclear Norm penalty ◮ Maximum A Posteriori (MAP) estimate: Ẑ = arg max_Z [log p(X | Z) + log p(Z)] under the prior p(Z) ∝ exp(−λ ‖Z‖_∗), where Z = U D Vᵀ with D = diag(d_1, d_2, ..., d_r), U, V ∼ Haar uniform prior on unitary matrices, and d_i iid ∼ Exp(λ)

  17. Hierarchical adaptive spectral penalty ◮ Each singular value has its own random shrinkage coefficient ◮ Hierarchical model, for each singular value i = 1, ..., r: d_i | γ_i ∼ Exp(γ_i), γ_i ∼ Gamma(a, b) ◮ Marginal distribution over d_i: p(d_i) = ∫_0^∞ Exp(d_i; γ_i) Gamma(γ_i; a, b) dγ_i = a b^a / (d_i + b)^(a+1), a Pareto-type distribution with heavier tails than the exponential distribution [Todeschini et al., 2013]
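A quick numerical sanity check of this marginal, as a minimal sketch (assuming NumPy; a, b and the evaluation point are made-up values): averaging the exponential density over draws of γ_i should match the closed-form density above.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 1.5          # made-up hyperparameters
d = 0.7                  # point at which to evaluate the marginal density

# gamma_i ~ Gamma(a, rate=b); NumPy parameterizes the Gamma by shape and scale = 1/rate
gam = rng.gamma(shape=a, scale=1.0 / b, size=1_000_000)

mc_estimate = np.mean(gam * np.exp(-gam * d))     # Monte Carlo estimate of E[Exp(d; gamma_i)]
closed_form = a * b**a / (d + b) ** (a + 1)

print(mc_estimate, closed_form)   # the two values should agree closely
```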

  18. Hierarchical adaptive spectral penalty. Figure: Marginal distribution p(d_i) with a = b = β, for β = ∞, β = 2 and β = 0.1 ◮ HASP penalty: pen(Z) = −log p(Z) = (a + 1) Σ_{i=1}^r log(b + d_i) ◮ Admits as special case the nuclear norm penalty λ ‖Z‖_∗ when a = λb and b → ∞
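A minimal sketch of this penalty and of its nuclear-norm limit (assuming NumPy; λ and the test matrix are made-up values; the penalties are compared after removing the additive constant (a+1) r log b, which does not depend on Z):

```python
import numpy as np

def hasp_penalty(Z, a, b):
    """HASP penalty (a+1) * sum_i log(b + d_i) over the singular values of Z."""
    d = np.linalg.svd(Z, compute_uv=False)
    return (a + 1) * np.sum(np.log(b + d))

rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 20))
lam = 3.0
nuclear = lam * np.sum(np.linalg.svd(Z, compute_uv=False))   # lam * ||Z||_*
r = min(Z.shape)

for b in [1e1, 1e3, 1e5]:
    a = lam * b                                              # a = lam * b, let b grow
    diff = hasp_penalty(Z, a, b) - (a + 1) * r * np.log(b)   # drop the Z-independent constant
    print(b, diff, nuclear)                                  # diff approaches lam * ||Z||_*
```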

  19. Hierarchical adaptive spectral penalty. Figure: Top: manifolds of constant penalty for a symmetric 2 × 2 matrix Z = [x, y; y, z], for (a) the nuclear norm, the hierarchical adaptive spectral penalty with a = b = β, (b) β = 1 and (c) β = 0.1, and (d) the rank penalty. Bottom: contours of constant penalty for a diagonal matrix [x, 0; 0, z], where one recovers the classical (e) lasso, (f-g) hierarchical adaptive lasso and (h) ℓ_0 penalties.

  20. EM algorithm for MAP estimation. Expectation Maximization (EM) algorithm to obtain a MAP estimate Ẑ = arg max_Z [log p(X | Z) + log p(Z)], i.e. to minimize L(Z) = (1/(2σ²)) ‖X − Z‖²_F + (a + 1) Σ_{i=1}^r log(b + d_i)
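A minimal sketch of this objective (assuming NumPy; it is the quantity monitored for convergence in the algorithm on the later slide):

```python
import numpy as np

def hasp_objective(X, Z, sigma, a, b):
    """L(Z) = ||X - Z||_F^2 / (2 sigma^2) + (a+1) * sum_i log(b + d_i)."""
    fit = np.sum((X - Z) ** 2) / (2.0 * sigma**2)
    d = np.linalg.svd(Z, compute_uv=False)
    return fit + (a + 1) * np.sum(np.log(b + d))
```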

  21. EM algorithm for MAP estimation ◮ Latent variables: γ = (γ_1, ..., γ_r) ◮ E step: Q(Z, Z*) = E[log p(X, Z, γ) | Z*, X] = C − (1/(2σ²)) ‖X − Z‖²_F − Σ_{i=1}^r ω_i d_i, where ω_i = E[γ_i | d*_i] = (a + 1)/(b + d*_i)

  22. EM algorithm for MAP estimation ◮ M step: minimize over Z the objective (1/(2σ²)) ‖X − Z‖²_F + Σ_{i=1}^r ω_i d_i (1). Problem (1) is an adaptive spectral penalty regularized optimization problem, with weights ω_i = (a + 1)/(b + d*_i). Since d*_1 ≥ d*_2 ≥ ... ≥ d*_r, we have 0 ≤ ω_1 ≤ ω_2 ≤ ... ≤ ω_r (2). Given condition (2), the solution is given by a weighted soft-thresholded SVD Ẑ = S_{σ²ω}(X) (3), where S_ω(X) = U D_ω Vᵀ, X = U D Vᵀ is the SVD of X, and D_ω = diag((d_1 − ω_1)_+, ..., (d_r − ω_r)_+) [Gaïffas and Lecué, 2011]
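A minimal sketch of the weighted soft-thresholding operator and of one E step followed by one M step on a complete matrix (assuming NumPy; X, Z*, the hyperparameters and sigma are made-up values):

```python
import numpy as np

def weighted_soft_thresholded_svd(X, thresholds):
    """S_w(X): shrink the i-th singular value of X by thresholds[i] (thresholds nondecreasing)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d_shrunk = np.maximum(d - thresholds[: len(d)], 0.0)
    return (U * d_shrunk) @ Vt

# One EM iteration on a complete matrix X (made-up values)
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 30))
a, b, sigma = 1.0, 1.0, 0.5
Z_star = X.copy()                                    # current estimate Z*
d_star = np.linalg.svd(Z_star, compute_uv=False)     # singular values of Z*, nonincreasing
omega = (a + 1) / (b + d_star)                       # E step: weights, nondecreasing
Z_new = weighted_soft_thresholded_svd(X, sigma**2 * omega)   # M step
```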

  23. EM algorithm for MAP estimation. Figure: Thresholding rules on the singular values d̃_i of X, for the nuclear norm and the HASP penalty with β = 2 and β = 0.1. The weights penalize higher singular values less heavily, hence reducing bias.

  24. Low rank estimation of complete matrices. Hierarchical Adaptive Soft Thresholded (HAST) algorithm for low rank estimation of complete matrices: Initialize Z^(0). At iteration t ≥ 1 • For i = 1, ..., r, compute the weights ω_i^(t) = (a + 1)/(b + d_i^(t−1)) • Set Z^(t) = S_{σ²ω^(t)}(X) • If (L(Z^(t−1)) − L(Z^(t))) / L(Z^(t−1)) < ε then return Ẑ = Z^(t) ◮ Admits the soft-thresholded SVD operator as a special case when a = bλ and b = β → ∞
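Putting the pieces together, a minimal sketch of HAST for a complete matrix (assuming NumPy; the initialization at X, the tolerance and the iteration cap are made-up choices, and this is only a sketch of the slide's pseudocode, not the authors' implementation):

```python
import numpy as np

def hast(X, a, b, sigma, eps=1e-6, max_iter=500):
    """Hierarchical Adaptive Soft Thresholded (HAST) estimation of a complete matrix X."""

    def objective(Z):
        d = np.linalg.svd(Z, compute_uv=False)
        return np.sum((X - Z) ** 2) / (2 * sigma**2) + (a + 1) * np.sum(np.log(b + d))

    def weighted_svd_threshold(M, thresholds):
        U, d, Vt = np.linalg.svd(M, full_matrices=False)
        return (U * np.maximum(d - thresholds, 0.0)) @ Vt

    Z = X.copy()                                    # initialize Z^(0) (here: at X itself)
    L_prev = objective(Z)
    for _ in range(max_iter):
        d_prev = np.linalg.svd(Z, compute_uv=False)
        omega = (a + 1) / (b + d_prev)              # adaptive weights from the E step
        Z = weighted_svd_threshold(X, sigma**2 * omega)   # weighted soft-thresholded SVD
        L_curr = objective(Z)
        if (L_prev - L_curr) / L_prev < eps:        # relative decrease of the objective
            break
        L_prev = L_curr
    return Z
```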
