Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
François Caron, Department of Statistics, Oxford
STATLEARN 2014, Paris, April 7, 2014
Joint work with Adrien Todeschini and Marie Chavent (INRIA, U. of Bordeaux)
Outline
Introduction
Complete case
Matrix completion
Experiments
Conclusion and perspectives
Matrix Completion
◮ Netflix prize
◮ 480k users and 18k movies providing 1–5 ratings
◮ 99% of the ratings are missing
◮ Objective: predict missing entries in order to make recommendations
[Illustration: Users × Movies matrix with a few observed ratings (1–5) and missing entries (×)]
Matrix Completion
Objective: complete a matrix X of size m × n from a subset of its entries
Applications
◮ Recommender systems
◮ Image inpainting
◮ Imputation of missing data
[Illustration: partially observed matrix with observed (×) and missing (□) entries]
Matrix Completion
◮ Potentially large matrices (each dimension of order $10^4$–$10^6$)
◮ Very sparsely observed (1%–10%)
Low rank Matrix Completion
◮ Assume that the complete matrix Z is of low rank:
$$ \underbrace{Z}_{m \times n} \simeq \underbrace{A}_{m \times k} \underbrace{B^T}_{k \times n} $$
with $k \ll \min(m, n)$.
◮ Interpretation for recommender systems: rows of Z correspond to users, columns to items, and the k columns of A (rows of $B^T$) to latent features.
[Illustration: Users × Items matrix factorized into user and item feature matrices]
Low rank Matrix Completion
◮ Let $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$ be the subset of observed entries
◮ For $(i, j) \in \Omega$:
$$ X_{ij} = Z_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2) $$
where $\sigma^2 > 0$
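As an illustrative sketch of this observation model (the sizes, rank, noise level and sampling rate below are arbitrary choices, not values from the slides), one can simulate a low-rank matrix observed with noise on a random subset of entries:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 150, 5          # illustrative sizes and rank
sigma = 0.1                    # noise standard deviation
obs_rate = 0.1                 # fraction of observed entries (|Omega| / (m n))

# Low-rank ground truth Z = A B^T
A = rng.normal(size=(m, k))
B = rng.normal(size=(n, k))
Z_true = A @ B.T

# Observation mask Omega and noisy observations X_ij = Z_ij + eps_ij on Omega
mask = rng.random((m, n)) < obs_rate
X = np.where(mask, Z_true + sigma * rng.normal(size=(m, n)), np.nan)
```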
Low rank Matrix Completion
◮ Optimization problem
$$ \underset{Z}{\text{minimize}} \; \underbrace{\frac{1}{2\sigma^2} \sum_{(i,j) \in \Omega} (X_{ij} - Z_{ij})^2}_{-\text{loglikelihood}} + \underbrace{\lambda \, \mathrm{rank}(Z)}_{\text{penalty}} $$
where $\lambda > 0$ is some regularization parameter.
◮ Non-convex
◮ Computationally hard for general subset Ω
Low rank Matrix Completion
◮ Matrix completion with nuclear norm penalty
$$ \underset{Z}{\text{minimize}} \; \underbrace{\frac{1}{2\sigma^2} \sum_{(i,j) \in \Omega} (X_{ij} - Z_{ij})^2}_{-\text{loglikelihood}} + \underbrace{\lambda \|Z\|_*}_{\text{penalty}} $$
where $\|Z\|_*$ is the nuclear norm of Z, i.e. the sum of the singular values of Z.
◮ Convex relaxation of the rank penalty optimization [Fazel, 2002, Candès and Recht, 2009, Candès and Tao, 2010]
Low rank Matrix Completion
Soft-Impute algorithm
◮ Start with an initial matrix $Z^{(0)}$
◮ At each iteration $t = 1, 2, \ldots$
  ◮ Replace the missing elements in X with those in $Z^{(t-1)}$
  ◮ Perform a soft-thresholded SVD on the completed matrix, with shrinkage λ, to obtain the low rank matrix $Z^{(t)}$
[Mazumder et al., 2010]
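A minimal NumPy sketch of the Soft-Impute iteration, assuming missing entries of X are marked with NaN; the function names, stopping rule and fixed shrinkage `lam` are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def svd_soft_threshold(M, lam):
    """SVD of M with all singular values soft-thresholded by lam."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    d_shrunk = np.maximum(d - lam, 0.0)
    return (U * d_shrunk) @ Vt

def soft_impute(X, lam, n_iter=100, tol=1e-4):
    """X: matrix with np.nan at unobserved entries; returns a low-rank completion."""
    mask = ~np.isnan(X)
    Z = np.zeros_like(X)
    for _ in range(n_iter):
        # Fill the missing entries of X with the current estimate Z^(t-1)
        X_filled = np.where(mask, X, Z)
        # Soft-thresholded SVD of the completed matrix
        Z_new = svd_soft_threshold(X_filled, lam)
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0):
            return Z_new
        Z = Z_new
    return Z
```

With a matrix X of this form (missing entries as NaN), a call such as `soft_impute(X, lam=1.0)` returns a completion whose rank is controlled by the shrinkage level.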
Low rank Matrix Completion
Soft-Impute algorithm
◮ The soft-thresholded SVD yields a low-rank representation
◮ Each iteration decreases the value of the nuclear norm objective function towards its minimum
◮ Various strategies have been proposed to scale the algorithm to problems where n, m are of order $10^6$
◮ The same shrinkage is applied to all singular values
[Mazumder et al., 2010]
Contributions
◮ Probabilistic interpretation of the nuclear norm objective function
  ◮ Maximum A Posteriori estimation assuming exponential priors on the singular values
  ◮ Soft-Impute = Expectation-Maximization algorithm
◮ Construction of alternative non-convex objective functions building on hierarchical priors
  ◮ Bridge the gap between the rank penalty and the nuclear norm penalty
  ◮ EM: adaptive algorithm that iteratively adjusts the shrinkage coefficient for each singular value
  ◮ Similar to the adaptive lasso in multivariate regression
◮ Numerical results show the interest of the approach on various datasets
Outline
Introduction
Complete case
  Hierarchical adaptive spectral penalty
  EM algorithm for MAP estimation
Matrix completion
Experiments
Conclusion and perspectives
Nuclear Norm penalty
◮ Complete matrix X
◮ Nuclear norm objective function
$$ \underset{Z}{\text{minimize}} \; \frac{1}{2\sigma^2} \|X - Z\|_F^2 + \lambda \|Z\|_* $$
where $\|\cdot\|_F$ is the Frobenius norm
◮ The global solution is given by a soft-thresholded SVD
$$ \widehat{Z} = S_{\lambda\sigma^2}(X) $$
where $S_\lambda(X) = \widetilde{U} \widetilde{D}_\lambda \widetilde{V}^T$ with $\widetilde{D}_\lambda = \mathrm{diag}\big((\widetilde{d}_1 - \lambda)_+, \ldots, (\widetilde{d}_r - \lambda)_+\big)$ and $t_+ = \max(t, 0)$.
[Cai et al., 2010, Mazumder et al., 2010]
Nuclear Norm penalty
◮ Maximum A Posteriori (MAP) estimate
$$ \widehat{Z} = \underset{Z}{\arg\max} \; \big[\log p(X \mid Z) + \log p(Z)\big] $$
under the prior $p(Z) \propto \exp(-\lambda \|Z\|_*)$, where $Z = UDV^T$ with $D = \mathrm{diag}(d_1, d_2, \ldots, d_r)$, and
$$ U, V \sim \text{Haar uniform prior on unitary matrices}, \qquad d_i \overset{iid}{\sim} \mathrm{Exp}(\lambda) $$
Hierarchical adaptive spectral penalty
◮ Each singular value has its own random shrinkage coefficient
◮ Hierarchical model: for each singular value $i = 1, \ldots, r$
$$ d_i \mid \gamma_i \sim \mathrm{Exp}(\gamma_i), \qquad \gamma_i \sim \mathrm{Gamma}(a, b) $$
◮ Marginal distribution over $d_i$:
$$ p(d_i) = \int_0^\infty \mathrm{Exp}(d_i; \gamma_i) \, \mathrm{Gamma}(\gamma_i; a, b) \, d\gamma_i = \frac{a b^a}{(d_i + b)^{a+1}} $$
a Pareto distribution with heavier tails than the exponential distribution
[Todeschini et al., 2013]
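As a quick sanity check of this marginal (a sketch; the hyperparameter values a = b = 2 are arbitrary), numerically integrating the Exponential–Gamma mixture should reproduce the closed-form density:

```python
import numpy as np
from math import gamma as gamma_fn
from scipy import integrate

a, b = 2.0, 2.0  # illustrative hyperparameters

def marginal_closed_form(d):
    # a * b^a / (d + b)^(a+1)
    return a * b**a / (d + b) ** (a + 1)

def marginal_numeric(d):
    # Exp(d; g) = g exp(-g d),  Gamma(g; a, b) = b^a g^(a-1) exp(-b g) / Gamma(a)
    integrand = lambda g: g * np.exp(-g * d) * b**a * g ** (a - 1) * np.exp(-b * g) / gamma_fn(a)
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return val

for d in [0.1, 1.0, 5.0]:
    print(d, marginal_closed_form(d), marginal_numeric(d))  # the two values should agree
```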
Hierarchical adaptive spectral penalty
[Figure: marginal distribution $p(d_i)$ with $a = b = \beta$, for $\beta = \infty$, $\beta = 2$ and $\beta = 0.1$]
◮ HASP penalty
$$ \mathrm{pen}(Z) = -\log p(Z) = \sum_{i=1}^r (a + 1) \log(b + d_i) $$
◮ Admits the nuclear norm penalty $\lambda \|Z\|_*$ as a special case when $a = \lambda b$ and $b \to \infty$.
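This limiting behaviour can be checked numerically: with $a = \lambda b$, the HASP penalty (up to an additive constant, removed below by subtracting its value at Z = 0) approaches $\lambda \|Z\|_*$ as b grows. The matrix size and λ in this sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 20))
d = np.linalg.svd(Z, compute_uv=False)   # singular values of Z
lam = 0.7                                # illustrative lambda

def hasp_penalty(d, a, b):
    return np.sum((a + 1) * np.log(b + d))

nuclear = lam * d.sum()                  # lambda * ||Z||_*
for b in [1.0, 1e2, 1e4, 1e6]:
    a = lam * b
    diff = hasp_penalty(d, a, b) - hasp_penalty(np.zeros_like(d), a, b)
    print(b, diff, nuclear)              # diff -> lambda * ||Z||_* as b grows
```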
Hierarchical adaptive spectral penalty
[Figure. Top: manifolds of constant penalty for a symmetric 2 × 2 matrix $Z = [x, y; y, z]$: (a) the nuclear norm, the hierarchical adaptive spectral penalty with $a = b = \beta$ for (b) $\beta = 1$ and (c) $\beta = 0.1$, and (d) the rank penalty. Bottom: contours of constant penalty for a diagonal matrix $[x, 0; 0, z]$, where one recovers the classical (e) lasso ($\ell_1$ norm), (f, g) hierarchical adaptive lasso (HAL, $\beta = 1$ and $\beta = 0.1$) and (h) $\ell_0$ penalties.]
EM algorithm for MAP estimation
Expectation Maximization (EM) algorithm to obtain a MAP estimate
$$ \widehat{Z} = \underset{Z}{\arg\max} \; \big[\log p(X \mid Z) + \log p(Z)\big] $$
i.e. to minimize
$$ L(Z) = \frac{1}{2\sigma^2} \|X - Z\|_F^2 + (a + 1) \sum_{i=1}^r \log(b + d_i) $$
EM algorithm for MAP estimation
◮ Latent variables: $\gamma = (\gamma_1, \ldots, \gamma_r)$
◮ E step:
$$ Q(Z, Z^*) = \mathbb{E}\big[\log p(X, Z, \gamma) \mid Z^*, X\big] = C - \frac{1}{2\sigma^2} \|X - Z\|_F^2 - \sum_{i=1}^r \omega_i d_i $$
where $\omega_i = \mathbb{E}[\gamma_i \mid d_i^*] = \dfrac{a + 1}{b + d_i^*}$.
EM algorithm for MAP estimation
◮ M step:
$$ \underset{Z}{\text{minimize}} \; \frac{1}{2\sigma^2} \|X - Z\|_F^2 + \sum_{i=1}^r \omega_i d_i \qquad (1) $$
(1) is an adaptive spectral penalty regularized optimization problem, with weights $\omega_i = \dfrac{a + 1}{b + d_i^*}$.
$$ d_1^* \geq d_2^* \geq \ldots \geq d_r^* \;\Rightarrow\; 0 \leq \omega_1 \leq \omega_2 \leq \ldots \leq \omega_r \qquad (2) $$
Given condition (2), the solution is given by a weighted soft-thresholded SVD
$$ \widehat{Z} = S_{\sigma^2 \omega}(X) \qquad (3) $$
where $S_\omega(X) = \widetilde{U} \widetilde{D}_\omega \widetilde{V}^T$ with $\widetilde{D}_\omega = \mathrm{diag}\big((\widetilde{d}_1 - \omega_1)_+, \ldots, (\widetilde{d}_r - \omega_r)_+\big)$.
[Gaïffas and Lecué, 2011]
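A small sketch of the weighted soft-thresholded SVD used in the M step, together with the E-step weights; the function names and this NumPy-based implementation are illustrative assumptions.

```python
import numpy as np

def weighted_svd_soft_threshold(M, weights):
    """SVD of M with the i-th singular value shrunk by weights[i] (weights non-decreasing)."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    d_shrunk = np.maximum(d - weights, 0.0)
    return (U * d_shrunk) @ Vt

def em_weights(d_star, a, b):
    """E-step weights omega_i = (a + 1) / (b + d_i^*)."""
    return (a + 1.0) / (b + np.asarray(d_star))
```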
EM algorithm for MAP estimation
[Figure: thresholding rules on the singular values $\widetilde{d}_i$ of X, for the nuclear norm and the HASP penalty with $\beta = 2$ and $\beta = 0.1$]
The weights penalize higher singular values less heavily, hence reducing bias.
Low rank estimation of complete matrices
Hierarchical Adaptive Soft Thresholded (HAST) algorithm for low rank estimation of complete matrices
• Initialize $Z^{(0)}$. At iteration $t \geq 1$:
  • For $i = 1, \ldots, r$, compute the weights $\omega_i^{(t)} = \dfrac{a + 1}{b + d_i^{(t-1)}}$
  • Set $Z^{(t)} = S_{\sigma^2 \omega^{(t)}}(X)$
  • If $\dfrac{L(Z^{(t-1)}) - L(Z^{(t)})}{L(Z^{(t-1)})} < \varepsilon$, return $\widehat{Z} = Z^{(t)}$
◮ Admits the soft-thresholded SVD operator as a special case when $a = b\lambda$ and $b = \beta \to \infty$.
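Putting the pieces together, a sketch of the HAST iteration for a complete matrix X, reusing the `weighted_svd_soft_threshold` and `em_weights` helpers sketched above; the initialization, convergence threshold and function names are illustrative, and the objective L follows the expression given two slides earlier.

```python
import numpy as np

def hast(X, a, b, sigma2, n_iter=100, eps=1e-5):
    """Hierarchical Adaptive Soft Thresholded (HAST) estimation for a complete matrix X (sketch)."""
    def objective(Z):
        d = np.linalg.svd(Z, compute_uv=False)
        return np.sum((X - Z) ** 2) / (2 * sigma2) + np.sum((a + 1) * np.log(b + d))

    Z = X.copy()                                           # one possible initialization Z^(0)
    L_old = objective(Z)
    for _ in range(n_iter):
        d_prev = np.linalg.svd(Z, compute_uv=False)        # singular values d_i^(t-1)
        omega = em_weights(d_prev, a, b)                   # weights omega_i^(t) = (a+1)/(b + d_i^(t-1))
        Z = weighted_svd_soft_threshold(X, sigma2 * omega) # Z^(t): weighted soft-thresholded SVD of X
        L_new = objective(Z)
        if (L_old - L_new) / L_old < eps:                  # relative decrease of the objective L
            return Z
        L_old = L_new
    return Z
```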