
Scaling the Hierarchical Topic Modeling Mountain: Neural NMF and Iterative Projection Methods. Jamie Haddock. Harvey Mudd College, January 28, 2020. Computational and Applied Mathematics, UCLA. Research Overview: Data, Math, Data Science.


  1. Our method: Neural NMF. Goal: develop true forward and back propagation algorithms for hNMF. ⊲ Regard the A matrices as independent variables; determine the S matrices from the A matrices. ⊲ Define $q(X, A) := \operatorname*{argmin}_{S \geq 0} \| X - AS \|_F^2$ (least squares). ⊲ Pin the values of S to those of A by recursively setting $S^{(\ell)} := q(S^{(\ell-1)}, A^{(\ell)})$.

  2. Our method: Neural NMF. [Diagram: $X \xrightarrow{\,q(\cdot,\, A^{(0)})\,} S^{(0)} \xrightarrow{\,q(\cdot,\, A^{(1)})\,} S^{(1)}$; goal and definitions as on the previous slide.]

  3. Our method: Neural NMF. Goal: develop true forward and back propagation algorithms for hNMF. [Diagram: $X \xrightarrow{\,q(\cdot,\, A^{(0)})\,} S^{(0)} \xrightarrow{\,q(\cdot,\, A^{(1)})\,} S^{(1)}$.]

  4. Our method: Neural NMF. Training: [Diagram: $X \xrightarrow{\,q(\cdot,\, A^{(0)})\,} S^{(0)} \xrightarrow{\,q(\cdot,\, A^{(1)})\,} S^{(1)}$.]

  5. Our method: Neural NMF. Goal: develop true forward and back propagation algorithms for hNMF. Training: ⊲ forward propagation: $S^{(0)} = q(X, A^{(0)})$, $S^{(1)} = q(S^{(0)}, A^{(1)})$, ..., $S^{(L)} = q(S^{(L-1)}, A^{(L)})$; ⊲ back propagation: update $\{A^{(i)}\}$ with $\nabla E(\{A^{(i)}\})$.
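To make the forward pass concrete, here is a minimal Python sketch (an illustration, not the authors' implementation): each $q(\cdot, A^{(\ell)})$ is a nonnegative least-squares solve, carried out column by column with scipy.optimize.nnls. The helper names, matrix shapes, and random data are assumptions chosen only to mirror the two-layer setup.

```python
import numpy as np
from scipy.optimize import nnls

def q(X, A):
    """q(X, A) = argmin_{S >= 0} ||X - A S||_F^2, solved one column at a time."""
    S = np.zeros((A.shape[1], X.shape[1]))
    for j in range(X.shape[1]):
        S[:, j], _ = nnls(A, X[:, j])
    return S

def forward(X, A_list):
    """Forward propagation: S(0) = q(X, A(0)), S(l) = q(S(l-1), A(l))."""
    S_list, S = [], X
    for A in A_list:
        S = q(S, A)
        S_list.append(S)
    return S_list

# Illustrative two-layer example (shapes are assumptions):
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((100, 50)))          # data matrix (features x samples)
A_list = [np.abs(rng.standard_normal((100, 9))),    # A(0): features x k(0)
          np.abs(rng.standard_normal((9, 4)))]      # A(1): k(0) x k(1)
S0, S1 = forward(X, A_list)
print(S0.shape, S1.shape)                           # (9, 50), (4, 50)
```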

  6. Least-squares Subroutine. ⊲ Least squares is a fundamental subroutine in forward propagation.

  7. Least-squares Subroutine. ⊲ Least squares is a fundamental subroutine in forward propagation.

  8. Least-squares Subroutine. ⊲ Least squares is a fundamental subroutine in forward propagation. ⊲ Iterative projection methods can solve these problems.

  9. Iterative Projection Methods

  10. General Setup

  11. General Setup. We are interested in solving highly overdetermined systems of equations, $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $m \gg n$. Rows are denoted $a_i^T$.

  12. General Setup. We are interested in solving highly overdetermined systems of equations, $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $m \gg n$. Rows are denoted $a_i^T$.

  13. Iterative Projection Methods. If $\{x \in \mathbb{R}^n : Ax = b\}$ is nonempty, these methods construct an approximation to a solution: 1. Randomized Kaczmarz Method. Applications: 1. Tomography (Algebraic Reconstruction Technique).

  14. Iterative Projection Methods. If $\{x \in \mathbb{R}^n : Ax = b\}$ is nonempty, these methods construct an approximation to a solution: 1. Randomized Kaczmarz Method; 2. Motzkin's Method. Applications: 1. Tomography (Algebraic Reconstruction Technique); 2. Linear programming.

  15. Iterative Projection Methods. If $\{x \in \mathbb{R}^n : Ax = b\}$ is nonempty, these methods construct an approximation to a solution: 1. Randomized Kaczmarz Method; 2. Motzkin's Method; 3. Sampling Kaczmarz-Motzkin Methods (SKM). Applications: 1. Tomography (Algebraic Reconstruction Technique); 2. Linear programming; 3. Average consensus (greedy gossip with eavesdropping).

  16. Kaczmarz Method. Given $x_0 \in \mathbb{R}^n$: 1. Choose $i_k \in [m]$ with probability $\frac{\|a_{i_k}\|_2^2}{\|A\|_F^2}$. 2. Define $x_k := x_{k-1} + \frac{b_{i_k} - a_{i_k}^T x_{k-1}}{\|a_{i_k}\|_2^2}\, a_{i_k}$. 3. Repeat. [Kaczmarz 1937], [Strohmer, Vershynin 2009] [Figure: initial iterate $x_0$.]

  17. Kaczmarz Method (continued). [Figure: iterates $x_0, x_1$.]

  18. Kaczmarz Method (continued). [Figure: iterates $x_0, x_1, x_2$.]

  19. Kaczmarz Method (continued). [Figure: iterates $x_0, x_1, x_2, x_3$.]
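A minimal Python sketch of the randomized Kaczmarz iteration just described; the function name, system size, and iteration count are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, iters=5000, seed=0):
    """Randomized Kaczmarz: project the iterate onto the hyperplane of row i,
    where i is drawn with probability ||a_i||_2^2 / ||A||_F^2."""
    rng = np.random.default_rng(seed)
    m, _ = A.shape
    probs = np.linalg.norm(A, axis=1) ** 2 / np.linalg.norm(A, "fro") ** 2
    x = x0.astype(float)
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a   # projection onto {y : a^T y = b_i}
    return x

# Usage on a small consistent system (sizes are illustrative):
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
print(np.linalg.norm(randomized_kaczmarz(A, b, np.zeros(20)) - x_true))
```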

  20. Motzkin's Method. Given $x_0 \in \mathbb{R}^n$: 1. Choose $i_k \in [m]$ as $i_k := \operatorname*{argmax}_{i \in [m]} |a_i^T x_{k-1} - b_i|$. 2. Define $x_k := x_{k-1} + \frac{b_{i_k} - a_{i_k}^T x_{k-1}}{\|a_{i_k}\|_2^2}\, a_{i_k}$. 3. Repeat. [Motzkin, Schoenberg 1954] [Figure: initial iterate $x_0$.]

  21. Motzkin's Method (continued). [Figure: iterates $x_0, x_1$.]

  22. Motzkin's Method (continued). [Figure: iterates $x_0, x_1, x_2$.]
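A corresponding sketch of Motzkin's method: the only change from the Kaczmarz sketch above is the greedy row selection. The function name and test sizes are again assumptions.

```python
import numpy as np

def motzkin(A, b, x0, iters=2000):
    """Motzkin's method: at each step, project onto the hyperplane of the row
    with the largest absolute residual (the most violated equation)."""
    x = x0.astype(float)
    for _ in range(iters):
        r = A @ x - b                    # full residual
        i = np.argmax(np.abs(r))         # greedy row selection
        a = A[i]
        x -= r[i] / (a @ a) * a          # projection step
    return x

# Usage on a small consistent system (sizes are illustrative):
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
print(np.linalg.norm(motzkin(A, b, np.zeros(20)) - x_true))
```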

  23. Our Hybrid Method (SKM). Given $x_0 \in \mathbb{R}^n$: 1. Choose $\tau_k \subset [m]$ to be a sample of $\beta$ constraints chosen uniformly at random among the rows of $A$. 2. From the $\beta$ rows, choose $i_k := \operatorname*{argmax}_{i \in \tau_k} |a_i^T x_{k-1} - b_i|$. 3. Define $x_k := x_{k-1} + \frac{b_{i_k} - a_{i_k}^T x_{k-1}}{\|a_{i_k}\|_2^2}\, a_{i_k}$. 4. Repeat. [De Loera, H., Needell '17] [Figure: initial iterate $x_0$.]

  24. Our Hybrid Method (SKM) (continued). [Figure: iterates $x_0, x_1$.]

  25. Our Hybrid Method (SKM) (continued). [Figure: iterates $x_0, x_1, x_2$.]
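A sketch of one SKM iteration, combining the uniform sampling step with the greedy projection within the sample; the function name and default parameters are assumptions.

```python
import numpy as np

def skm(A, b, x0, beta, iters=2000, seed=0):
    """Sampling Kaczmarz-Motzkin: sample beta rows uniformly at random, then take
    a Motzkin (greedy) projection within the sample."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    x = x0.astype(float)
    for _ in range(iters):
        tau = rng.choice(m, size=beta, replace=False)   # random sample tau_k
        r = A[tau] @ x - b[tau]                          # residual on the sample
        j = np.argmax(np.abs(r))                         # most violated row in tau_k
        a = A[tau[j]]
        x -= r[j] / (a @ a) * a                          # project onto that hyperplane
    return x
```

With $\beta = 1$ this step reduces to Kaczmarz with uniform row sampling, and with $\beta = m$ it reduces to Motzkin's method, which is the sense in which SKM interpolates between the two.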

  26. Experimental Convergence. ⊲ $\beta$: sample size. ⊲ $A$ is a $50000 \times 100$ Gaussian matrix, consistent system. ⊲ 'Faster' convergence for larger sample size.

  27. Experimental Convergence (continued).

  28. Experimental Convergence (continued).
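A self-contained sketch in the spirit of this experiment, with the system shrunk from $50000 \times 100$ to $5000 \times 100$ so it runs quickly; the sample sizes and iteration budget are assumptions. Larger $\beta$ typically gives a smaller error after a fixed number of iterations, at a higher cost per iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 100                             # shrunk from the talk's 50000 x 100 system
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                               # consistent system

for beta in (1, 10, 100, 1000):              # sample sizes to compare
    x = np.zeros(n)
    for _ in range(2000):                    # fixed iteration budget
        tau = rng.choice(m, size=beta, replace=False)
        r = A[tau] @ x - b[tau]              # residual on the sample
        j = np.argmax(np.abs(r))             # most violated row in the sample
        a = A[tau[j]]
        x -= r[j] / (a @ a) * a              # SKM projection step
    print(f"beta = {beta:5d}   ||x - x*|| = {np.linalg.norm(x - x_true):.3e}")
```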

  29. Convergence Rates. Below are the convergence rates for the methods on a system $Ax = b$ that is consistent with unique solution $x$ and whose rows have been normalized to have unit norm. ⊲ RK (Strohmer, Vershynin '09): $\mathbb{E}\,\|x_k - x\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{m}\right)^k \|x_0 - x\|_2^2$.

  30. Convergence Rates (continued). ⊲ MM (Agmon '54): $\|x_k - x\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{m}\right)^k \|x_0 - x\|_2^2$.

  31. Convergence Rates (continued). ⊲ SKM (De Loera, H., Needell '17): $\mathbb{E}\,\|x_k - x\|_2^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{m}\right)^k \|x_0 - x\|_2^2$.

  32. Convergence Rates (continued). Why are these all the same?

  33. A Pathological Example. [Figure: initial iterate $x_0$.]

  34. Structure of the Residual. Several works have used sparsity of the residual to improve the convergence rate of greedy methods. [De Loera, H., Needell '17], [Bai, Wu '18], [Du, Gao '19]

  35. Structure of the Residual. Several works have used sparsity of the residual to improve the convergence rate of greedy methods [De Loera, H., Needell '17], [Bai, Wu '18], [Du, Gao '19]. However, not much sparsity can be expected in most cases. Instead, we'd like to use the dynamic range of the residual to guarantee faster convergence: $\gamma_k := \dfrac{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_k - b_\tau\|_2^2}{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_k - b_\tau\|_\infty^2}$.

  36. Accelerated Convergence Rate. Theorem (H., Ma 2019). Let $A$ be normalized so $\|a_i\|_2 = 1$ for all rows $i = 1, \ldots, m$. If the system $Ax = b$ is consistent with the unique solution $x^*$, then the SKM method converges at least linearly in expectation, and the rate depends on the dynamic range of the random sample of rows of $A$, $\tau_j$. Precisely, in the $(j+1)$st iteration of SKM, we have $\mathbb{E}_{\tau_j} \|x_{j+1} - x^*\|_2^2 \le \left(1 - \frac{\beta\, \sigma_{\min}^2(A)}{\gamma_j\, m}\right) \|x_j - x^*\|_2^2$, where $\gamma_j := \dfrac{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_j - b_\tau\|_2^2}{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_j - b_\tau\|_\infty^2}$.
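A sketch illustrating the quantities in the theorem: it estimates $\gamma_j$ by Monte Carlo over random size-$\beta$ row samples (the exact definition sums over all $\binom{m}{\beta}$ samples) and evaluates the per-iteration contraction bound $1 - \beta\sigma_{\min}^2(A)/(\gamma_j m)$. The dimensions, $\beta$, iterate, and number of Monte Carlo samples are assumptions.

```python
import numpy as np

def dynamic_range(A, b, x, beta, n_samples=200, seed=0):
    """Monte Carlo estimate of gamma_j: (sum of squared 2-norms) / (sum of squared
    inf-norms) of the residual over random size-beta row samples tau.
    (The exact definition sums over all (m choose beta) samples.)"""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    num = den = 0.0
    for _ in range(n_samples):
        tau = rng.choice(m, size=beta, replace=False)
        r = A[tau] @ x - b[tau]
        num += np.sum(r ** 2)
        den += np.max(np.abs(r)) ** 2
    return num / den

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 50))
A /= np.linalg.norm(A, axis=1, keepdims=True)     # unit-norm rows, as in the theorem
x_true = rng.standard_normal(50)
b = A @ x_true                                    # consistent system
x = np.zeros(50)                                  # current iterate x_j

beta = 100
gamma = dynamic_range(A, b, x, beta)
sigma_min_sq = np.linalg.svd(A, compute_uv=False)[-1] ** 2
bound = 1 - beta * sigma_min_sq / (gamma * A.shape[0])
print(f"gamma ~ {gamma:.2f} (must lie in [1, beta] = [1, {beta}])")
print(f"per-iteration contraction bound ~ {bound:.4f}")
```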

  37. Accelerated Convergence Rate. ⊲ $A$ is a $50000 \times 100$ Gaussian matrix, consistent system. ⊲ The bound uses the dynamic range of the sample of $\beta$ rows.

  38. What can we say about $\gamma_j$? Recall $\gamma_j := \dfrac{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_j - b_\tau\|_2^2}{\sum_{\tau \in \binom{[m]}{\beta}} \|A_\tau x_j - b_\tau\|_\infty^2}$. We have $1 \le \gamma_j \le \beta$.

  39. What can we say about $\gamma_j$? (continued)

  40. What can we say about $\gamma_j$? (continued)

  41. What can we say about $\gamma_j$? Recall $1 \le \gamma_j \le \beta$. Per-iteration contraction: $\mathbb{E}_{\tau_k} \|x_k - x^*\|_2^2 \le \alpha\, \|x_{k-1} - x^*\|_2^2$. Previous bounds [H., Needell 2019]: RK: $\alpha = 1 - \frac{\sigma_{\min}^2(A)}{m}$; SKM: $\alpha = 1 - \frac{\sigma_{\min}^2(A)}{m}$; MM: $1 - \frac{\sigma_{\min}^2(A)}{4} \le \alpha \le 1 - \frac{\sigma_{\min}^2(A)}{m}$.

  42. What can we say about $\gamma_j$? Previous [H., Needell 2019] vs. current [H., Ma 2019] bounds on the contraction factor $\alpha$: RK: $\alpha = 1 - \frac{\sigma_{\min}^2(A)}{m}$ (unchanged); SKM: previously $\alpha = 1 - \frac{\sigma_{\min}^2(A)}{m}$, now $1 - \frac{\beta\,\sigma_{\min}^2(A)}{m} \le \alpha \le 1 - \frac{\sigma_{\min}^2(A)}{m}$; MM: previously $1 - \frac{\sigma_{\min}^2(A)}{4} \le \alpha \le 1 - \frac{\sigma_{\min}^2(A)}{m}$, now $1 - \sigma_{\min}^2(A) \le \alpha \le 1 - \frac{\sigma_{\min}^2(A)}{m}$.

  43. What can we say about $\gamma_j$? Recall $1 \le \gamma_j \le \beta$. ⊲ Nontrivial bounds on $\gamma_k$ are available for Gaussian and average-consensus systems.

  44. Now can we determine the optimal $\beta$?

  45. Now can we determine the optimal $\beta$? Roughly, if we know the value of $\gamma_j$, we can (just) do it.

  46. Now can we determine the optimal $\beta$? (continued)

  47. Back to Hierarchical NMF

  48. Back to Hierarchical NMF (continued).

  49. Back to Hierarchical NMF (continued).

  50. Back to Hierarchical NMF. Compare: ⊲ hNMF (sequential NMF).

  51. Back to Hierarchical NMF. Compare: ⊲ hNMF (sequential NMF); ⊲ Deep NMF [Flenner, Hunter '18].

  52. Back to Hierarchical NMF. Compare: ⊲ hNMF (sequential NMF); ⊲ Deep NMF [Flenner, Hunter '18]; ⊲ Neural NMF.

  53. Applications

  54. Experimental results: synthetic data

  55. Experimental results: synthetic data. ⊲ Unsupervised reconstruction with two-layer structure ($k^{(0)} = 9$, $k^{(1)} = 4$).

  56. Experimental results: synthetic data (continued).
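For context, here is a sketch of the hNMF (sequential NMF) baseline on synthetic data with the two-layer structure $k^{(0)} = 9$, $k^{(1)} = 4$ from the slides; this is not the Neural NMF code, and the data-generating model, dimensions, and noise level are assumptions. It uses sklearn.decomposition.NMF for each layer, factoring $X$ first and then factoring the resulting $S^{(0)}$.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic hierarchy (dimensions, block structure, and noise level are assumptions):
# 9 fine topics that group into 4 coarse topics, mirroring k(0) = 9, k(1) = 4.
rng = np.random.default_rng(0)
d, N = 120, 400                                   # vocabulary size, number of documents
A0 = np.abs(rng.standard_normal((d, 9)))          # A(0): words x fine topics
A1 = np.zeros((9, 4))                             # A(1): fine topics x coarse topics
for j, block in enumerate(np.array_split(np.arange(9), 4)):
    A1[block, j] = 1.0                            # each coarse topic owns a block of fine topics
S1 = np.abs(rng.standard_normal((4, N)))          # coarse representation of the documents
X = A0 @ A1 @ S1 + 0.01 * np.abs(rng.standard_normal((d, N)))

# Sequential (hierarchical) NMF: factor X, then factor the resulting S(0).
nmf0 = NMF(n_components=9, init="nndsvda", max_iter=500)
A0_hat = nmf0.fit_transform(X)                    # ~ A(0)
S0_hat = nmf0.components_                         # ~ S(0)
nmf1 = NMF(n_components=4, init="nndsvda", max_iter=500)
A1_hat = nmf1.fit_transform(S0_hat)               # ~ A(1)
S1_hat = nmf1.components_                         # ~ S(1)

print("layer-0 relative error:", np.linalg.norm(X - A0_hat @ S0_hat) / np.linalg.norm(X))
print("two-layer relative error:", np.linalg.norm(X - A0_hat @ A1_hat @ S1_hat) / np.linalg.norm(X))
```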
