Our method: Neural NMF

Goal: Develop true forward and back propagation algorithms for hNMF.

⊲ Regard the A matrices as independent variables; determine the S matrices from the A matrices.
⊲ Define q(X, A) := argmin_{S ≥ 0} ||X − AS||_F^2 (nonnegative least squares).
⊲ Pin the values of S to those of A by recursively setting S^(ℓ) := q(S^(ℓ−1), A^(ℓ)).

[diagram: X → S^(0) → S^(1) via q(·, A^(0)), q(·, A^(1))]
Training:

⊲ forward propagation: S^(0) = q(X, A^(0)), S^(1) = q(S^(0), A^(1)), ..., S^(L) = q(S^(L−1), A^(L))
⊲ back propagation: update {A^(i)} with ∇E({A^(i)})
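To make the forward pass concrete, here is a minimal Python sketch, assuming the layer matrices A^(ℓ) are given; it uses scipy.optimize.nnls column by column as the nonnegative least-squares map q, and the sizes and data below are purely illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def q(X, A):
    """Nonnegative least squares: argmin_{S >= 0} ||X - A S||_F^2, solved column by column."""
    k, n = A.shape[1], X.shape[1]
    S = np.zeros((k, n))
    for j in range(n):
        S[:, j], _ = nnls(A, X[:, j])
    return S

def forward(X, A_list):
    """Forward propagation: S^(0) = q(X, A^(0)), then S^(l) = q(S^(l-1), A^(l))."""
    S_list, S = [], X
    for A in A_list:
        S = q(S, A)
        S_list.append(S)
    return S_list

# Hypothetical two-layer example with widths k^(0) = 9, k^(1) = 4 (sizes are illustrative).
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((20, 30)))
A_list = [np.abs(rng.standard_normal((20, 9))), np.abs(rng.standard_normal((9, 4)))]
S0, S1 = forward(X, A_list)
```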
Least-squares Subroutine

⊲ least-squares is a fundamental subroutine in forward propagation
⊲ iterative projection methods can solve these problems
Iterative Projection Methods
General Setup

We are interested in solving highly overdetermined systems of equations, Ax = b, where A ∈ R^{m×n}, b ∈ R^m, and m ≫ n. Rows of A are denoted a_i^T.
Iterative Projection Methods

If {x ∈ R^n : Ax = b} is nonempty, these methods construct an approximation to a solution:
1. Randomized Kaczmarz Method
2. Motzkin's Method
3. Sampling Kaczmarz-Motzkin Methods (SKM)

Applications:
1. Tomography (Algebraic Reconstruction Technique)
2. Linear programming
3. Average consensus (greedy gossip with eavesdropping)
Kaczmarz Method

Given x_0 ∈ R^n:
1. Choose i_k ∈ [m] with probability ||a_{i_k}||^2 / ||A||_F^2.
2. Define x_k := x_{k−1} + (b_{i_k} − a_{i_k}^T x_{k−1}) / ||a_{i_k}||^2 · a_{i_k}.
3. Repeat.

[figure: iterates x_0, x_1, x_2, x_3 projecting onto successive hyperplanes]

[Kaczmarz 1937], [Strohmer, Vershynin 2009]
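A short sketch of one way to implement the iteration above; the matrix, right-hand side, starting point, and iteration count are placeholders supplied by the caller.

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, iters=1000, seed=None):
    """Randomized Kaczmarz: at each step project onto the hyperplane a_i^T x = b_i,
    with row i drawn with probability ||a_i||^2 / ||A||_F^2."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    probs = np.sum(A**2, axis=1) / np.linalg.norm(A, 'fro')**2
    for _ in range(iters):
        i = rng.choice(len(b), p=probs)
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a   # projection onto the chosen hyperplane
    return x
```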
Motzkin's Method

Given x_0 ∈ R^n:
1. Choose i_k := argmax_{i ∈ [m]} |a_i^T x_{k−1} − b_i|.
2. Define x_k := x_{k−1} + (b_{i_k} − a_{i_k}^T x_{k−1}) / ||a_{i_k}||^2 · a_{i_k}.
3. Repeat.

[figure: iterates x_0, x_1, x_2 projecting onto the most violated hyperplane]

[Motzkin, Schoenberg 1954]
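The greedy variant changes only the row-selection step; a sketch under the same assumptions as the Kaczmarz code above:

```python
import numpy as np

def motzkin(A, b, x0, iters=1000):
    """Motzkin's method: project onto the most violated constraint,
    i.e. the row with the largest absolute residual |a_i^T x - b_i|."""
    x = x0.astype(float).copy()
    for _ in range(iters):
        r = A @ x - b
        i = np.argmax(np.abs(r))        # greedy row choice
        a = A[i]
        x -= r[i] / (a @ a) * a
    return x
```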
Our Hybrid Method (SKM)

Given x_0 ∈ R^n:
1. Choose τ_k ⊂ [m] to be a sample of β constraints chosen uniformly at random among the rows of A.
2. From the β rows, choose i_k := argmax_{i ∈ τ_k} |a_i^T x_{k−1} − b_i|.
3. Define x_k := x_{k−1} + (b_{i_k} − a_{i_k}^T x_{k−1}) / ||a_{i_k}||^2 · a_{i_k}.
4. Repeat.

[figure: iterates x_0, x_1, x_2]

[De Loera, H., Needell '17]
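A sketch of the hybrid iteration, again with illustrative inputs. Note that β = 1 recovers a (uniformly sampled) Kaczmarz step and β = m recovers Motzkin's method.

```python
import numpy as np

def skm(A, b, x0, beta, iters=1000, seed=None):
    """Sampling Kaczmarz-Motzkin: draw beta rows uniformly at random, then
    project onto the most violated constraint within the sample."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    m = A.shape[0]
    for _ in range(iters):
        tau = rng.choice(m, size=beta, replace=False)   # uniform sample of beta rows
        r = A[tau] @ x - b[tau]
        i = tau[np.argmax(np.abs(r))]                   # most violated constraint in the sample
        a = A[i]
        x += (b[i] - a @ x) / (a @ a) * a
    return x
```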
Experimental Convergence

⊲ β: sample size
⊲ A is a 50000 × 100 Gaussian matrix, consistent system
⊲ 'faster' convergence for larger sample size
Convergence Rates

Below are the convergence rates for the methods on a system Ax = b which is consistent with unique solution x and whose rows have been normalized to have unit norm.

⊲ RK (Strohmer, Vershynin '09):
  E||x_k − x||_2^2 ≤ (1 − σ_min^2(A)/m)^k ||x_0 − x||_2^2

⊲ MM (Agmon '54):
  ||x_k − x||_2^2 ≤ (1 − σ_min^2(A)/m)^k ||x_0 − x||_2^2

⊲ SKM (De Loera, H., Needell '17):
  E||x_k − x||_2^2 ≤ (1 − σ_min^2(A)/m)^k ||x_0 − x||_2^2

Why are these all the same?
A Pathological Example

[figure: a pathological system, starting iterate x_0]
Structure of the Residual

Several works have used sparsity of the residual to improve the convergence rate of greedy methods.
[De Loera, H., Needell '17], [Bai, Wu '18], [Du, Gao '19]

However, not much sparsity can be expected in most cases. Instead, we'd like to use the dynamic range of the residual to guarantee faster convergence:

γ_k := ( Σ_{τ ∈ ([m] choose β)} ||A_τ x_k − b_τ||_2^2 ) / ( Σ_{τ ∈ ([m] choose β)} ||A_τ x_k − b_τ||_∞^2 )
Accelerated Convergence Rate

Theorem (H., Ma 2019). Let A be normalized so ||a_i||_2 = 1 for all rows i = 1, ..., m. If the system Ax = b is consistent with unique solution x*, then the SKM method converges at least linearly in expectation, and the rate depends on the dynamic range of the random sample of rows of A, τ_j. Precisely, in the (j+1)-st iteration of SKM we have

  E_{τ_j} ||x_{j+1} − x*||_2^2 ≤ (1 − β σ_min^2(A) / (γ_j m)) ||x_j − x*||_2^2,

where γ_j := ( Σ_{τ ∈ ([m] choose β)} ||A_τ x_j − b_τ||_2^2 ) / ( Σ_{τ ∈ ([m] choose β)} ||A_τ x_j − b_τ||_∞^2 ).
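As an illustration of the quantities in the theorem, here is a small sketch that computes γ_j exactly (feasible only for toy sizes, since the sums run over all size-β subsets) and evaluates the resulting contraction factor 1 − βσ_min^2(A)/(γ_j m); the system below is a made-up example.

```python
import numpy as np
from itertools import combinations

def dynamic_range(A, b, x, beta):
    """gamma_j: sum over size-beta subsets tau of ||A_tau x - b_tau||_2^2
       divided by the sum over the same subsets of ||A_tau x - b_tau||_inf^2."""
    r = A @ x - b
    num = den = 0.0
    for tau in combinations(range(len(b)), beta):
        r_tau = r[list(tau)]
        num += np.sum(r_tau**2)
        den += np.max(np.abs(r_tau))**2
    return num / den

# Toy consistent system with unit-norm rows (kept small so the subset sums are cheap).
rng = np.random.default_rng(0)
m, n, beta = 12, 3, 3
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)
x_star = rng.standard_normal(n)
b = A @ x_star
x = rng.standard_normal(n)

gamma = dynamic_range(A, b, x, beta)   # satisfies 1 <= gamma <= beta
sigma_min_sq = np.linalg.svd(A, compute_uv=False)[-1]**2
rate = 1 - beta * sigma_min_sq / (gamma * m)
print(gamma, rate)
```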
Accelerated Convergence Rate

⊲ A is a 50000 × 100 Gaussian matrix, consistent system
⊲ bound uses the dynamic range of the sample of β rows
What can we say about γ_j?

Recall γ_j := ( Σ_{τ ∈ ([m] choose β)} ||A_τ x_j − b_τ||_2^2 ) / ( Σ_{τ ∈ ([m] choose β)} ||A_τ x_j − b_τ||_∞^2 ), and note that 1 ≤ γ_j ≤ β.

Per-iteration contraction: E_{τ_k} ||x_k − x*||_2^2 ≤ α ||x_{k−1} − x*||_2^2

Method | Previous                                    | Current
RK     | α = 1 − σ_min^2(A)/m                        | α = 1 − σ_min^2(A)/m
SKM    | α = 1 − σ_min^2(A)/m                        | 1 − β σ_min^2(A)/m ≤ α ≤ 1 − σ_min^2(A)/m
MM     | 1 − σ_min^2(A)/4 ≤ α ≤ 1 − σ_min^2(A)/m     | 1 − σ_min^2(A) ≤ α ≤ 1 − σ_min^2(A)/m

[H., Needell 2019], [H., Ma 2019]

⊲ nontrivial bounds on γ_k for Gaussian and average consensus systems
Now can we determine the optimal β?

Roughly, if we know the value of γ_j, we can (just) do it.
Back to Hierarchical NMF

Compare:
⊲ hNMF (sequential NMF)
⊲ Deep NMF [Flenner, Hunter '18]
⊲ Neural NMF
Applications
Experimental results: synthetic data

⊲ unsupervised reconstruction with two-layer structure (k^(0) = 9, k^(1) = 4)