

  1. Matrix Factorization DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html Carlos Fernandez-Granda

  2. Low-rank models · Matrix completion · Structured low-rank models

  3. Motivation
     A quantity y[i, j] depends on indices i and j. We observe examples and want to predict new instances.
     In collaborative filtering, y[i, j] is the rating given to movie i by user j.

  4. Collaborative filtering
     Rating matrix Y (rows: movies; columns: users Bob, Molly, Mary, Larry):

                                    Bob  Molly  Mary  Larry
           The Dark Knight            1      1     5      4
           Spiderman 3                2      1     4      5
     Y :=  Love Actually              4      5     2      1
           Bridget Jones's Diary      5      4     2      1
           Pretty Woman               4      5     1      2
           Superman 2                 1      2     5      5

  5. Simple model
     Assumptions:
     - Some movies are more popular in general
     - Some users are more generous in general

         y[i, j] ≈ a[i] b[j]

     - a[i] quantifies the popularity of movie i
     - b[j] quantifies the generosity of user j

  6. Simple model
     Problem: fitting a and b to the data yields a nonconvex problem.
     Example: 1 movie, 1 user, rating 1 yields the cost function (1 − ab)².
     To fix the scale, set |a| = 1.

  7. [Figure: surface plot of the cost (1 − ab)² over (a, b); with |a| = 1 there are two global minima, at a = −1 and a = +1]
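A quick numeric check of this picture (my own NumPy illustration, not from the slides): with the scale fixed to |a| = 1, the cost has exactly two global minima, at (a, b) = (+1, +1) and (−1, −1).

    import numpy as np

    b = np.linspace(-3, 3, 601)            # grid over b, step 0.01
    for a in (-1.0, 1.0):
        cost = (1 - a * b) ** 2
        i = cost.argmin()
        print(f"a = {a:+.0f}: min cost {cost[i]:.4f} at b = {b[i]:+.2f}")
    # a = -1: min cost 0.0000 at b = -1.00
    # a = +1: min cost 0.0000 at b = +1.00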

  8–9. Rank-1 model
     Assume m movies are all rated by n users. The model becomes

         Y ≈ a b^T

     We can fit it by solving

         min_{a ∈ R^m, b ∈ R^n} ||Y − a b^T||_F   subject to ||a||_2 = 1

     Equivalent to

         min_{X ∈ R^{m×n}} ||Y − X||_F   subject to rank(X) = 1

  10. Best rank-k approximation
      Let U S V^T be the SVD of a matrix A ∈ R^{m×n}. The truncated SVD U_{:,1:k} S_{1:k,1:k} V^T_{:,1:k} is the best rank-k approximation:

          U_{:,1:k} S_{1:k,1:k} V^T_{:,1:k} = argmin_{Ã : rank(Ã) = k} ||A − Ã||_F
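As a concrete sketch (my own NumPy illustration of the Eckart–Young statement above, not course code), the truncated SVD is a few lines, and the Frobenius error of the best rank-k approximation equals the norm of the discarded singular values:

    import numpy as np

    def best_rank_k(A, k):
        # best rank-k approximation in Frobenius norm: truncate the SVD
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] * s[:k] @ Vt[:k, :]

    # sanity check: the error equals the tail singular values
    A = np.random.randn(6, 4)
    s = np.linalg.svd(A, compute_uv=False)
    err = np.linalg.norm(A - best_rank_k(A, 2), "fro")
    assert np.isclose(err, np.sqrt((s[2:] ** 2).sum()))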

  11–12. Rank-1 model

      σ_1 u_1 v_1^T = argmin_{X ∈ R^{m×n}} ||Y − X||_F   subject to rank(X) = 1

      The solution to

          min_{a ∈ R^m, b ∈ R^n} ||Y − a b^T||_F   subject to ||a||_2 = 1

      is a_min = u_1 and b_min = σ_1 v_1.
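In NumPy this recipe is immediate (a minimal sketch of the SVD solution above, not course-provided code):

    import numpy as np

    def rank1_fit(Y):
        # solve min ||Y - a b^T||_F s.t. ||a||_2 = 1: a = u_1, b = sigma_1 * v_1
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        return U[:, 0], s[0] * Vt[0]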

  13. Rank-r model
      Certain people like certain movies: r factors

          y[i, j] ≈ Σ_{l=1}^r a_l[i] b_l[j]

      For each factor l:
      - a_l[i]: movie i is positively (> 0), negatively (< 0), or not (≈ 0) associated with factor l
      - b_l[j]: user j likes (> 0), hates (< 0), or is indifferent (≈ 0) to factor l

  14. Rank-r model
      Equivalent to Y ≈ AB, with A ∈ R^{m×r} and B ∈ R^{r×n}. The SVD solves

          min_{A ∈ R^{m×r}, B ∈ R^{r×n}} ||Y − AB||_F   subject to ||a_1||_2 = 1, ..., ||a_r||_2 = 1

      Problem: there are many possible ways of choosing a_1, ..., a_r and b_1, ..., b_r;
      the SVD constrains them to be orthogonal.

  15. Collaborative filtering (the rating matrix Y from slide 4, shown again for reference)

  16. SVD
      Centering the ratings by the mean µ (1 denotes the all-ones vector) and factoring:

          A − µ 1 1^T = U S V^T,   S = diag(7.79, 1.62, 1.55, 0.62)

      where

          µ := (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n A_ij

  17. Rank-1 model
      Approximation Ā + σ_1 u_1 v_1^T, where Ā = µ 1 1^T is the mean matrix from slide 16; true ratings in parentheses:

                                Bob       Molly      Mary      Larry
      The Dark Knight        1.34 (1)   1.19 (1)   4.66 (5)   4.81 (4)
      Spiderman 3            1.55 (2)   1.42 (1)   4.45 (4)   4.58 (5)
      Love Actually          4.45 (4)   4.58 (5)   1.55 (2)   1.42 (1)
      B.J.'s Diary           4.43 (5)   4.56 (4)   1.57 (2)   1.44 (1)
      Pretty Woman           4.43 (4)   4.56 (5)   1.57 (1)   1.44 (2)
      Superman 2             1.34 (1)   1.19 (2)   4.66 (5)   4.81 (5)
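These numbers can be reproduced directly (a sketch assuming the centering from slide 16, i.e. subtracting the mean rating µ, which is 3.0 for this matrix):

    import numpy as np

    A = np.array([[1, 1, 5, 4],    # The Dark Knight
                  [2, 1, 4, 5],    # Spiderman 3
                  [4, 5, 2, 1],    # Love Actually
                  [5, 4, 2, 1],    # Bridget Jones's Diary
                  [4, 5, 1, 2],    # Pretty Woman
                  [1, 2, 5, 5]],   # Superman 2
                 dtype=float)

    mu = A.mean()                        # mean rating, 3.0 here
    U, s, Vt = np.linalg.svd(A - mu)     # center, then factor
    print(np.round(s, 2))                # singular values, ~[7.79 1.62 1.55 0.62]

    rank1 = mu + s[0] * np.outer(U[:, 0], Vt[0])
    print(np.round(rank1, 2))            # the rank-1 approximations above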

  18. Movies

                  D. Knight   Sp. 3   Love Act.   B.J.'s Diary   P. Woman   Sup. 2
      a_1 = (       −0.45     −0.39      0.39         0.39          0.39     −0.45 )

      The coefficients cluster the movies into romantic (+) and action (−).

  19. Users

                  Bob    Molly    Mary    Larry
      b_1 = (    3.74     4.05   −3.74   −4.05 )

      The coefficients cluster the users into romantic (+) and action (−).

  20. Low-rank models · Matrix completion · Structured low-rank models

  21. Netflix Prize
      [Figure: ratings matrix dominated by missing entries, marked "?"]

  22. Matrix completion

                                    Bob  Molly  Mary  Larry
           The Dark Knight            1      ?     5      4
           Spiderman 3                ?      1     4      5
           Love Actually              4      5     2      ?
           Bridget Jones's Diary      5      4     2      1
           Pretty Woman               4      5     1      2
           Superman 2                 1      2     ?      5

  23. Matrix completion as an inverse problem
      Example: Y = [1 ? 5; ? 3 2]. For a fixed sampling pattern, the revealed entries give an underdetermined system of equations in the vectorized unknowns:

          S x = y,   x := (Y_11, Y_21, Y_12, Y_22, Y_13, Y_23)^T,   y := (1, 3, 5, 2)^T

          S = [1 0 0 0 0 0
               0 0 0 1 0 0
               0 0 0 0 1 0
               0 0 0 0 0 1]
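The selection system can be built mechanically (a minimal NumPy sketch for this 2×3 example; column-major vectorization matches the ordering above):

    import numpy as np

    Y = np.array([[1.0, np.nan, 5.0],      # nan marks the unobserved entries
                  [np.nan, 3.0, 2.0]])
    observed = ~np.isnan(Y).flatten(order="F")   # column-major vectorization

    S = np.eye(Y.size)[observed]           # selection matrix (4 x 6)
    y = Y.flatten(order="F")[observed]     # revealed entries: [1. 3. 5. 2.]
    print(S.astype(int))
    print(y)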

  24. Isn't this completely ill posed?
      Assumption: the matrix is low rank, so it depends on ≈ r(m + n) parameters.
      As long as data > parameters, recovery is possible (in principle):

          [1 1 1 1 ? 1]
          [1 1 1 1 1 1]
          [1 1 1 1 1 1]
          [? 1 1 1 1 1]

  25. Matrix cannot be sparse

          [0 0 0  0 0 0]
          [0 0 0 23 0 0]
          [0 0 0  0 0 0]
          [0 0 0  0 0 0]

      (the entry 23 cannot be inferred from the rest; it must be observed directly)

  26. Singular vectors cannot be sparse

          (1, 1, 1, 1)^T (1, 1, 1, 1) + (0, 0, 0, 1)^T (1, 2, 3, 4)  =  [1 1 1 1]
                                                                        [1 1 1 1]
                                                                        [1 1 1 1]
                                                                        [2 3 4 5]

      (the second rank-1 term only affects the last row, so those entries cannot be inferred from the rest of the matrix)

  27. Incoherence
      The matrix must be incoherent: its singular vectors must be spread out.
      For 1/√n ≤ µ ≤ 1,

          max_{1≤i≤r, 1≤j≤m} |U_ij| ≤ µ,   max_{1≤i≤r, 1≤j≤n} |V_ij| ≤ µ

      for the left singular vectors U_1, ..., U_r and right singular vectors V_1, ..., V_r.
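A simple empirical proxy (my own sketch, not the formal quantity used in the recovery proofs) is to look at the largest entry of the top-r singular vectors:

    import numpy as np

    def max_entry_of_singular_vectors(A, r):
        # spread-out (incoherent) vectors have entries near 1/sqrt(dimension);
        # a value near 1 means some singular vector is concentrated (spiky)
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return np.abs(U[:, :r]).max(), np.abs(Vt[:r]).max()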

  28. Measurements
      We must see at least one entry in each row/column. For example, a rank-1 matrix with an entirely unobserved row cannot be completed:

          [1 1 1 1 1]
          [? ? ? ? ?]
          [1 1 1 1 1]
          [1 1 1 1 1]

      Assumption: random sampling (usually does not hold in practice!)

  29. Low-rank matrix estimation
      First idea:

          min_{X ∈ R^{m×n}} rank(X)   such that X_Ω = y

      Ω: indices of revealed entries; y: revealed entries
      Computationally intractable (rank minimization is nonconvex).
      Tractable alternative:

          min_{X ∈ R^{m×n}} ||X||_*   such that X_Ω = y
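A direct way to try this numerically (a sketch assuming the cvxpy package, which is not part of the slides):

    import cvxpy as cp
    import numpy as np

    def complete(Y_obs, M):
        # M is a 0/1 array marking revealed entries; Y_obs holds their values
        X = cp.Variable(Y_obs.shape)
        constraints = [cp.multiply(M, X) == M * Y_obs]   # X_Omega = y
        cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
        return X.value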

  30. Exact recovery
      Guarantees by Gross 2011, Candès and Recht 2008, Candès and Tao 2009:

          min_{X ∈ R^{m×n}} ||X||_*   such that X_Ω = y

      achieves exact recovery with high probability as long as the number of samples is proportional to r(n + m), up to log terms.
      The proof is based on the construction of a dual certificate.

  31. Low-rank matrix estimation
      If the data are noisy:

          min_{X ∈ R^{m×n}} ||X_Ω − y||_2^2 + λ ||X||_*

      where λ > 0 is a regularization parameter.

  32. Matrix completion via nuclear-norm minimization
      Recovered entries shown as estimate (true value):

                                    Bob    Molly   Mary   Larry
           The Dark Knight            1    2 (1)      5       4
           Spiderman 3            2 (2)        1      4       5
           Love Actually              4        5      2   2 (1)
           Bridget Jones's Diary      5        4      2       1
           Pretty Woman               4        5      1       2
           Superman 2                 1        2  5 (5)       5

  33. Proximal gradient method
      Method to solve the optimization problem

          minimize f(x) + h(x)

      where f is differentiable and prox_h is tractable.
      Proximal-gradient iteration:

          x^(0) = arbitrary initialization
          x^(k+1) = prox_{α_k h}( x^(k) − α_k ∇f(x^(k)) )
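In code the iteration is just a loop (a minimal sketch with a fixed step size α_k = step; the prox is passed in as a function of the point and the step):

    def proximal_gradient(grad_f, prox_h, x0, step, n_iters=500):
        # x^(k+1) = prox_{step * h}( x^(k) - step * grad f(x^(k)) )
        x = x0
        for _ in range(n_iters):
            x = prox_h(x - step * grad_f(x), step)
        return x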

  34. Proximal operator of the nuclear norm
      The solution to

          min_{X ∈ R^{m×n}} (1/2) ||Y − X||_F^2 + τ ||X||_*

      is obtained by soft-thresholding the SVD of Y:

          X_prox = D_τ(Y)
          D_τ(M) := U S_τ(S) V^T,   where M = U S V^T
          S_τ(S)_ii := S_ii − τ  if S_ii > τ,  0 otherwise
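A NumPy sketch of D_τ (singular-value soft-thresholding, my own illustration of the definition above):

    import numpy as np

    def svd_soft_threshold(M, tau):
        # D_tau(M): soft-threshold the singular values of M
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt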

  35. Subdifferential of the nuclear norm
      Let X ∈ R^{m×n} be a rank-r matrix with SVD U S V^T, where U ∈ R^{m×r}, V ∈ R^{n×r}, and S ∈ R^{r×r}.
      A matrix G is a subgradient of the nuclear norm at X if and only if

          G = U V^T + W

      where W satisfies

          ||W|| ≤ 1,   U^T W = 0,   W V = 0

  36. Proximal operator of the nuclear norm
      The subgradients of (1/2) ||Y − X||_F^2 + τ ||X||_* at X are of the form

          X − Y + τ G

      where G is a subgradient of the nuclear norm at X.
      D_τ(Y) is a minimizer if and only if G = (1/τ)(Y − D_τ(Y)) is a subgradient of the nuclear norm at D_τ(Y).
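Putting the pieces together (my own sketch, not from the slides): the noisy problem from slide 31 can be attacked with the proximal gradient method from slide 33, using D_τ from slide 34 as the prox. The gradient of the data term (1/2)||X_Ω − y||_2^2 is the masked residual, which is 1-Lipschitz, so a step size of 1 is safe:

    import numpy as np

    def svt(M, tau):
        # singular-value soft-thresholding, D_tau from slide 34
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt

    def complete_noisy(Y_obs, M, lam, step=1.0, n_iters=200):
        # proximal gradient for 0.5 * ||X_Omega - y||_2^2 + lam * ||X||_*
        # M is a 0/1 mask of revealed entries; Y_obs holds their values
        X = np.zeros_like(Y_obs, dtype=float)
        for _ in range(n_iters):
            grad = M * (X - Y_obs)               # gradient of the data term
            X = svt(X - step * grad, step * lam)
        return X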
