Large Scale Matrix Analysis and Inference
  1. Large Scale Matrix Analysis and Inference. Wouter M. Koolen, Manfred Warmuth, Reza Bosagh Zadeh, Gunnar Carlsson, Michael Mahoney. NIPS, Dec 9, 2013.

  2. Introductory musing: what is a matrix $(a_{i,j})$? (1) A vector of $n^2$ parameters; (2) a covariance; (3) a generalized probability distribution; (4) ...

  3. 1. A vector of $n^2$ parameters. When you regularize with the squared Frobenius norm: $\min_W \|W\|_F^2 + \sum_n \mathrm{loss}(\mathrm{tr}(W X_n))$.

  4. 1. A vector of $n^2$ parameters. Regularizing with the squared Frobenius norm, $\min_W \|W\|_F^2 + \sum_n \mathrm{loss}(\mathrm{tr}(W X_n))$, is equivalent to $\min_{\mathrm{vec}(W)} \|\mathrm{vec}(W)\|_2^2 + \sum_n \mathrm{loss}(\mathrm{vec}(W) \cdot \mathrm{vec}(X_n))$. No structure: $n^2$ independent variables.
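
A minimal numpy sketch (my own check, not from the slides) of the two identities behind this equivalence: the squared Frobenius norm of $W$ equals the squared 2-norm of $\mathrm{vec}(W)$, and $\mathrm{tr}(WX)$ is a dot product of vectorized matrices, so the regularizer treats $W$ as an unstructured vector of $n^2$ parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))

# ||W||_F^2 equals ||vec(W)||_2^2
assert np.isclose(np.linalg.norm(W, "fro") ** 2,
                  np.linalg.norm(W.ravel()) ** 2)

# tr(W X) equals vec(W) . vec(X^T); for symmetric X_n the transpose is irrelevant
assert np.isclose(np.trace(W @ X), W.ravel() @ X.T.ravel())
```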

  5. 2. A covariance. View the symmetric positive definite matrix $C$ as the covariance matrix of some random feature vector $c \in \mathbb{R}^n$, i.e. $C = E\big[(c - E(c))(c - E(c))^\top\big]$: $n$ features plus their pairwise interactions.
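
A small illustration (the distribution and sample size are my own choices) of this view: the empirical covariance of random feature vectors is symmetric and positive semidefinite, with diagonal entries for the features and off-diagonal entries for their pairwise interactions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 10_000                       # n features, m samples
samples = rng.multivariate_normal(mean=np.zeros(n),
                                  cov=np.diag([3.0, 1.0, 0.5]), size=m)

centered = samples - samples.mean(axis=0)
C = centered.T @ centered / m          # empirical covariance, n x n

assert np.allclose(C, C.T)                          # symmetric
assert np.all(np.linalg.eigvalsh(C) >= -1e-12)      # positive semidefinite
```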

  6. Symmetric matrices as ellipses. Ellipse $= \{Cu : \|u\|_2 = 1\}$. Dotted lines connect a point $u$ on the unit ball with the point $Cu$ on the ellipse.

  7. Symmetric matrices as ellipses. Eigenvectors form the axes; eigenvalues are the lengths.
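
A sketch of the ellipse picture for an example matrix of my own choosing: the image of the unit circle under a symmetric positive definite $C$ is an ellipse whose longest axis points along the top eigenvector and has length equal to the largest eigenvalue.

```python
import numpy as np

C = np.array([[2.0, 0.5],
              [0.5, 1.0]])             # symmetric positive definite example

theta = np.linspace(0, 2 * np.pi, 200)
unit_circle = np.stack([np.cos(theta), np.sin(theta)])   # 2 x 200 points u
ellipse = C @ unit_circle                                 # points Cu

eigvals, eigvecs = np.linalg.eigh(C)
# The farthest ellipse point from the origin lies along the top eigenvector,
# at distance (approximately, up to sampling) the largest eigenvalue.
radii = np.linalg.norm(ellipse, axis=0)
assert np.isclose(radii.max(), eigvals.max(), atol=1e-3)
```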

  8. Dyads $uu^\top$, where $u$ is a unit vector. One eigenvalue is one, all others are zero: a rank-one projection matrix.

  9. Directional variance along direction $u$: $\mathrm{V}(c^\top u) = u^\top C u = \mathrm{tr}(C\, uu^\top) \ge 0$. The outer figure eight is the direction $u$ scaled by the variance $u^\top C u$. PCA: find the direction of largest variance.
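
A minimal numerical check (my own sketch) of the identity on this slide, plus the PCA statement that the direction of largest variance is the top eigenvector of $C$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
C = A @ A.T                                  # symmetric PSD "covariance"
u = rng.standard_normal(n)
u /= np.linalg.norm(u)                       # unit direction

quad_form = u @ C @ u                        # u^T C u
trace_form = np.trace(C @ np.outer(u, u))    # tr(C u u^T)
assert np.isclose(quad_form, trace_form) and quad_form >= 0

# Direction of largest variance = eigenvector of the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
top_direction = eigvecs[:, -1]
assert np.isclose(top_direction @ C @ top_direction, eigvals[-1])
```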

  10. 3-dimensional variance plots. $\mathrm{tr}(C\, uu^\top)$ is a generalized probability when $\mathrm{tr}(C) = 1$.

  11. 3. Generalized probability distributions. Probability vector $\omega = (.2, .1, .6, .1)^\top = \sum_i \omega_i\, e_i$: mixture coefficients times pure events. Density matrix $W = \sum_i \omega_i\, w_i w_i^\top$: mixture coefficients times pure density matrices.

  12. 3. Generalized probability distributions. Probability vector $\omega = (.2, .1, .6, .1)^\top = \sum_i \omega_i\, e_i$: mixture coefficients times pure events. Density matrix $W = \sum_i \omega_i\, w_i w_i^\top$: mixture coefficients times pure density matrices. Matrices as generalized distributions.

  13. 3. Generalized probability distributions. Probability vector $\omega = (.2, .1, .6, .1)^\top = \sum_i \omega_i\, e_i$: mixture coefficients times pure events. Density matrix $W = \sum_i \omega_i\, w_i w_i^\top$: mixture coefficients times pure density matrices. Matrices as generalized distributions. Many mixtures lead to the same density matrix; there always exists a decomposition into $n$ eigendyads. Density matrix: a symmetric positive matrix of trace one.
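
An illustrative sketch (mixture weights taken from the slide, the dyad vectors are random placeholders) of a density matrix built as a mixture of dyads: it is symmetric, positive semidefinite, has trace one, and its eigendecomposition yields another, not necessarily identical, decomposition into $n$ eigendyads.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
omega = np.array([0.2, 0.1, 0.6, 0.1])          # mixture coefficients (sum to 1)
ws = rng.standard_normal((n, n))
ws /= np.linalg.norm(ws, axis=0)                # columns w_i are unit vectors

# W = sum_i omega_i w_i w_i^T
W = sum(om * np.outer(ws[:, i], ws[:, i]) for i, om in enumerate(omega))

assert np.allclose(W, W.T)                      # symmetric
assert np.isclose(np.trace(W), 1.0)             # trace one
eigvals, eigvecs = np.linalg.eigh(W)
assert np.all(eigvals >= -1e-12)                # positive semidefinite
# The eigendecomposition gives another valid decomposition into n eigendyads.
W_from_eigs = (eigvecs * eigvals) @ eigvecs.T
assert np.allclose(W, W_from_eigs)
```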

  14. It’s like a probability! The total variance along an orthogonal set of directions is 1: $u_1^\top W u_1 + u_2^\top W u_2 = 1$, $a + b + c = 1$.

  15. Uniform density? The uniform density matrix is $\frac{1}{n} I$: every dyad has generalized probability $\frac{1}{n}$, since $\mathrm{tr}(\frac{1}{n} I\, uu^\top) = \frac{1}{n}\mathrm{tr}(uu^\top) = \frac{1}{n}$. The generalized probabilities of $n$ orthogonal dyads sum to 1.

  16. Conventional Bayes Rule: $P(M_i \mid y) = \dfrac{P(M_i)\, P(y \mid M_i)}{P(y)}$. 4 updates with the same data likelihood. The update maintains uncertainty information about the maximum likelihood: a soft max.

  17.-19. Conventional Bayes Rule (repeated builds of the previous slide, with identical text).
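
A hedged numpy sketch (the prior, likelihood values, and number of updates are my own) of the point on these slides: repeating the conventional Bayes update with the same likelihood concentrates the posterior on the maximum-likelihood model, i.e. it behaves like a soft max over models.

```python
import numpy as np

prior = np.full(4, 0.25)                         # uniform prior over 4 models
likelihood = np.array([0.2, 0.5, 0.9, 0.7])      # P(y | M_i), same data reused

posterior = prior.copy()
for _ in range(4):                               # 4 updates with the same data
    posterior = posterior * likelihood
    posterior /= posterior.sum()                 # divide by P(y)

print(posterior)          # mass piles up on model 3, the max-likelihood model
# After t updates the posterior is proportional to prior * likelihood**t,
# a soft max over log-likelihoods with "inverse temperature" t.
```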

  20. Bayes Rule for density matrices: $D(M \mid y) = \dfrac{\exp\big(\log D(M) + \log D(y \mid M)\big)}{\mathrm{tr}\big(\exp(\log D(M) + \log D(y \mid M))\big)}$, using the matrix exponential and logarithm. 1 update with the data likelihood matrix $D(y \mid M)$. The update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.

  21.-25. Bayes Rule for density matrices (continued): 2, 3, 4, 10, and 20 updates with the same data likelihood matrix $D(y \mid M)$. Each update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.
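
A sketch of the density-matrix Bayes rule from these slides using scipy's matrix exponential and logarithm. The likelihood matrix below is an invented positive definite example; repeated updates with the same $D(y \mid M)$ concentrate the density matrix on the dyad of its top eigenvector, the soft max eigenvalue behavior described above.

```python
import numpy as np
from scipy.linalg import expm, logm

def density_bayes(prior, likelihood):
    """D(M|y) = exp(log D(M) + log D(y|M)) / tr(of that matrix)."""
    unnormalized = expm(logm(prior) + logm(likelihood))
    return unnormalized / np.trace(unnormalized)

n = 3
prior = np.eye(n) / n                              # uniform density matrix
A = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])                    # made-up PD likelihood matrix

posterior = prior
for _ in range(10):                                # e.g. 10 updates, as on the slides
    posterior = density_bayes(posterior, A)

top_eigvec = np.linalg.eigh(A)[1][:, -1]
# The posterior approaches the dyad of A's top eigenvector.
assert np.allclose(posterior, np.outer(top_eigvec, top_eigvec), atol=1e-2)
```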

  26. Bayes’ rules, vector versus matrix. Vector: $P(M_i \mid y) = \dfrac{P(M_i) \cdot P(y \mid M_i)}{\sum_j P(M_j) \cdot P(y \mid M_j)}$. Matrix: $D(M \mid y) = \dfrac{D(M) \odot D(y \mid M)}{\mathrm{tr}\big(D(M) \odot D(y \mid M)\big)}$, where $A \odot B := \exp(\log A + \log B)$.

  27. Bayes’ rules, vector versus matrix. Vector: $P(M_i \mid y) = \dfrac{P(M_i) \cdot P(y \mid M_i)}{\sum_j P(M_j) \cdot P(y \mid M_j)}$. Matrix: $D(M \mid y) = \dfrac{D(M) \odot D(y \mid M)}{\mathrm{tr}\big(D(M) \odot D(y \mid M)\big)}$, where $A \odot B := \exp(\log A + \log B)$. Regularizer: entropy (vector) versus quantum entropy (matrix).

  28. Vector case as a special case of the matrix case. Vectors as diagonal matrices; all matrices share the same eigensystem; the fancy $\odot$ becomes $\cdot$. Often the vector case is the hardest problem: its bounds “lift” to the matrix case.

  29. Vector case as a special case of the matrix case. Vectors as diagonal matrices; all matrices share the same eigensystem; the fancy $\odot$ becomes $\cdot$. Often the vector case is the hardest problem: its bounds “lift” to the matrix case. This phenomenon has been dubbed the “free matrix lunch”. Size of matrix = size of vector = $n$.
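
A quick sketch (vectors chosen by me) of the reduction described here: for diagonal matrices, or more generally matrices sharing one eigensystem, the fancy product $A \odot B = \exp(\log A + \log B)$ collapses to the ordinary componentwise product of the diagonals.

```python
import numpy as np
from scipy.linalg import expm, logm

a = np.array([0.5, 0.3, 0.2])              # a probability vector
b = np.array([0.9, 0.4, 0.1])              # a likelihood vector
A, B = np.diag(a), np.diag(b)              # embed as diagonal matrices

fancy = expm(logm(A) + logm(B))            # A (fancy product) B
plain = np.diag(a * b)                     # ordinary componentwise product

assert np.allclose(fancy, plain)
```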

  30. PCA setup. Data vectors, $C = \sum_n x_n x_n^\top$. $\max_{\text{unit } u}\, u^\top C u = \max_{\text{dyad } uu^\top}\, \mathrm{tr}(C\, uu^\top)$: not convex in $u$, but linear in $uu^\top$. Corresponding vector problem: $\max_{e_i}\, c^\top e_i$, linear in $e_i$. The vector problem is the matrix problem when everything happens in the same eigensystem. Uncertainty over units: probability vector. Uncertainty over dyads: density matrix. Uncertainty over $k$-sets of units: capped probability vector. Uncertainty over rank-$k$ projection matrices: capped density matrix.
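
A short numpy sketch (data and dimensions are mine) of the PCA objective on this slide: with $C = \sum_n x_n x_n^\top$, maximizing $u^\top C u$ over unit vectors equals maximizing $\mathrm{tr}(C\, uu^\top)$ over dyads, and the optimum is the largest eigenvalue of $C$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))            # 100 data vectors x_n in R^5
C = X.T @ X                                  # C = sum_n x_n x_n^T

eigvals, eigvecs = np.linalg.eigh(C)
u = eigvecs[:, -1]                           # best unit direction
best_dyad = np.outer(u, u)                   # best rank-one projection

assert np.isclose(u @ C @ u, np.trace(C @ best_dyad))
assert np.isclose(u @ C @ u, eigvals[-1])    # optimum = largest eigenvalue
```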

  31. For PCA: solve the vector problem first and do all the bounds, then lift to the matrix case by essentially replacing $\cdot$ with $\odot$. The regret bounds stay the same: the free matrix lunch.

  32. Questions. When can you “lift” the vector case to the matrix case? When is there a free matrix lunch? Lifting matrices to tensors? Efficient algorithms for large matrices? Approximations of $\odot$? Avoiding the eigenvalue decomposition by sampling? ...
