Independent Component Analysis

Independent sources, unknown mixing: blind source separation. Applications: speech, image, video. k sources, d dimensions.
[Figure: sources h_1, ..., h_k mixed through A into observations x_1, ..., x_d]

x = Ah + z,  z ∼ N(0, σ² I).  Sources h_i are independent.

Form the fourth-order cumulant tensor
M_4 := E[x^{⊗4}] − (products of pairwise covariances E[x_{i1} x_{i2}] E[x_{i3} x_{i4}] over the pairings)
     = Σ_i κ_i a_i ⊗ a_i ⊗ a_i ⊗ a_i.

Kurtosis: κ_i := E[h_i^4] − 3.  Assumption: sources have non-zero kurtosis (κ_i ≠ 0).
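A minimal NumPy sketch of forming the empirical fourth-order cumulant tensor from simulated ICA data. The sizes, the uniform source distribution, and the noise level are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 6, 3, 200_000                                   # dimensions, sources, samples (assumed)
A = rng.normal(size=(d, k))                               # unknown mixing matrix
H = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(k, n))     # unit-variance sources with kurtosis -1.2
X = A @ H + 0.1 * rng.normal(size=(d, n))                 # x = Ah + z

# Empirical fourth moment E[x⊗4]
M4 = np.einsum('in,jn,kn,ln->ijkl', X, X, X, X) / n

# Subtract the three pairings of second moments; the Gaussian part cancels,
# leaving (approximately) sum_i kappa_i a_i⊗a_i⊗a_i⊗a_i.
S = X @ X.T / n                                           # E[x x^T]
M4 -= (np.einsum('ij,kl->ijkl', S, S)
       + np.einsum('ik,jl->ijkl', S, S)
       + np.einsum('il,jk->ijkl', S, S))
```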
Outline
1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion
Social Networks & Recommender Systems

Social Networks: network of social ties, e.g. friendships, co-authorships. Hidden: communities of actors.

Recommender Systems: observed ratings of users for various products. Goal: new recommendations. Modeling: user/product groups.
Network Community Models

How are communities formed? How do communities interact?

[Figure: nodes grouped into communities, with within- and across-community connection probabilities such as 0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
Mixed Membership Model (Airoldi et al.)

k communities and n nodes. Graph G ∈ R^{n×n} (adjacency matrix).

Fractional memberships: π_x ∈ R^k is the membership vector of node x,
Δ^{k−1} := { π_x ∈ R^k : π_x(i) ∈ [0, 1], Σ_i π_x(i) = 1 }, for all x ∈ [n].

Node memberships {π_u} are drawn from a Dirichlet distribution.

Edges are conditionally independent given community memberships:
G_{i,j} ⊥⊥ G_{a,b} | π_i, π_j, π_a, π_b.

Edge probability averaged over community memberships:
P[G_{i,j} = 1 | π_i, π_j] = E[G_{i,j} | π_i, π_j] = π_i^⊤ P π_j,
where P ∈ R^{k×k} is the average edge connectivity for pure communities.

Airoldi, Blei, Fienberg, and Xing. Mixed membership stochastic blockmodels. J. of Machine Learning Research, June 2008.
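A small simulation sketch of this generative model. The community count, Dirichlet concentration, and connectivity matrix P below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3                               # nodes, communities (assumed)
alpha = np.full(k, 0.1)                     # Dirichlet concentration (assumed)
P = 0.05 + 0.4 * np.eye(k)                  # connectivity of pure communities (assumed)

Pi = rng.dirichlet(alpha, size=n)           # rows are the membership vectors pi_x

# Edge probabilities pi_i^T P pi_j; edges are conditionally independent Bernoulli draws
probs = Pi @ P @ Pi.T
G = (rng.random((n, n)) < probs).astype(int)
G = np.triu(G, 1); G = G + G.T              # keep a simple undirected graph
```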
Networks under Community Models

[Figures: sample networks under the Stochastic Block Model (α_0 = 0) and the Mixed Membership Model (α_0 = 1, α_0 = 10)]

Unifying Assumption: edges are conditionally independent given community memberships.
Subgraph Counts as Graph Moments

3-star counts are sufficient for identifiability and learning of the MMSB.

3-Star Count Tensor (stars centered in a node set X, with leaves in sets A, B, C):
M̃_3(a, b, c) = (1/|X|) · (# of common neighbors of a, b, c in X)
             = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c),
i.e.
M̃_3 = (1/|X|) Σ_{x∈X} [ G_{x,A}^⊤ ⊗ G_{x,B}^⊤ ⊗ G_{x,C}^⊤ ].
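As a concrete sketch, the 3-star count tensor reduces to a single einsum over the rows of the adjacency matrix. The random graph and the node partition below are placeholders, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
G = (rng.random((n, n)) < 0.1).astype(float)      # placeholder adjacency matrix (assumed)

# Hypothetical partition of the nodes into disjoint sets X, A, B, C
X, A, B, C = np.array_split(rng.permutation(n), 4)

# M3(a, b, c) = (1/|X|) * sum_{x in X} G[x, a] G[x, b] G[x, c]
M3 = np.einsum('xa,xb,xc->abc',
               G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]) / len(X)
print(M3.shape)                                   # (|A|, |B|, |C|)
```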
Multi-view Representation

Conditional independence of the three views, where π_x is the community membership vector of node x.

[Figure: a 3-star centered at x ∈ X with leaves in A, B, C, and the corresponding graphical model: hidden π_x generating the views G_{x,A}^⊤, G_{x,B}^⊤, G_{x,C}^⊤ through matrices U, V, W]

Linear multiview model: E[G_{x,A}^⊤ | Π] = Π_A^⊤ P^⊤ π_x = U π_x.
Subgraph Counts as Graph Moments

Second and third order moments:
M̂_2 := (1/|X|) Σ_x Z_C G_{x,C}^⊤ G_{x,B} Z_B^⊤ − shift,
M̂_3 := (1/|X|) Σ_x [ G_{x,A}^⊤ ⊗ Z_B G_{x,B}^⊤ ⊗ Z_C G_{x,C}^⊤ ] − shift.

Symmetrizing transition matrices:
Pairs_{C,B} := G_{X,C}^⊤ ⊗ G_{X,B}^⊤,
Z_B := Pairs(A, C) (Pairs(B, C))^†,  Z_C := Pairs(A, B) (Pairs(C, B))^†.

Linear multiview model: E[G_{x,A}^⊤ | Π] = U π_x.

E[M̂_2 | Π_{A,B,C}] = Σ_i (α_i/α_0) u_i ⊗ u_i,   E[M̂_3 | Π_{A,B,C}] = Σ_i (α_i/α_0) u_i ⊗ u_i ⊗ u_i.
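A rough NumPy sketch of the symmetrization step. The shift terms are omitted, the graph and partition are placeholders (regenerated so the snippet stands alone), and the 1/|X| normalization inside Pairs cancels in the pseudoinverse ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
G = (rng.random((n, n)) < 0.1).astype(float)          # placeholder adjacency matrix (assumed)
X, A, B, C = np.array_split(rng.permutation(n), 4)
GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]

def pairs(U, V):
    """Empirical cross moment Pairs(U, V) = (1/|X|) sum_x U[x]^T ⊗ V[x]^T."""
    return U.T @ V / U.shape[0]

# Symmetrizing transition matrices via pseudoinverses of the cross moments
Z_B = pairs(GA, GC) @ np.linalg.pinv(pairs(GB, GC))
Z_C = pairs(GA, GB) @ np.linalg.pinv(pairs(GC, GB))

# Symmetrized second- and third-order moments (shift terms omitted in this sketch)
M2 = sum(np.outer(Z_C @ GC[x], Z_B @ GB[x]) for x in range(len(X))) / len(X)
M3 = sum(np.einsum('a,b,c->abc', GA[x], Z_B @ GB[x], Z_C @ GC[x])
         for x in range(len(X))) / len(X)
```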
Outline
1. Introduction
2. Latent Variable Models and Moments
3. Community Detection in Graphs
4. Analysis of Tensor Power Method
5. Advanced Topics
6. Conclusion
Recap of Tensor Method

M_2 = Σ_i w_i a_i ⊗ a_i,   M_3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.

Whitening matrix W obtained from the SVD of M_2.
Multilinear transform: T = M_3(W, W, W).
[Figure: whitening maps the components a_1, a_2, a_3 to orthonormal v_1, v_2, v_3; tensor M_3 becomes tensor T]
Eigenvectors of T are found through the power method and deflation:
v ↦ T(I, v, v) / ‖T(I, v, v)‖.
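A minimal sketch of this pipeline in NumPy with synthetic components. Deflation is omitted, whitening uses an eigendecomposition of M_2 (equivalent to the SVD for a symmetric PSD matrix), and the component matrix, weights, and iteration counts are assumptions for illustration.

```python
import numpy as np

def whiten(M2, k):
    """Whitening matrix W with W^T M2 W = I_k, from the top-k eigendecomposition of M2."""
    vals, vecs = np.linalg.eigh(M2)
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    return vecs / np.sqrt(vals)

def tensor_power_iteration(T, iters=50, seed=0):
    """Repeat v <- T(I, v, v) / ||T(I, v, v)|| from a random start; return (v, lambda)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T.shape[0]); v /= np.linalg.norm(v)
    for _ in range(iters):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    return v, np.einsum('ijk,i,j,k->', T, v, v)

# Synthetic M2, M3 from known components, then recover one (whitened) component
rng = np.random.default_rng(1)
d, k = 8, 3
A = rng.normal(size=(d, k)); w = np.array([1.0, 2.0, 3.0])
M2 = (A * w) @ A.T
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

W = whiten(M2, k)
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)     # T = M3(W, W, W)
v, lam = tensor_power_iteration(T)
```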
Orthogonal Tensor Eigen Decomposition

T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i,   ⟨v_i, v_j⟩ = δ_{i,j} for all i, j.

T(I, v_1, v_1) = Σ_i λ_i ⟨v_i, v_1⟩² v_i = λ_1 v_1, so the v_i are eigenvectors of the tensor T.

Tensor Power Method: start from an initial vector v and iterate
v ↦ T(I, v, v) / ‖T(I, v, v)‖.

Questions
Is there convergence?
Does the convergence depend on the initialization?
What about performance under noise?
Recap of Matrix Eigen Analysis

For symmetric M ∈ R^{k×k}, eigendecomposition: M = Σ_i λ_i v_i v_i^⊤.
Eigenvectors are fixed points: Mv = λv (in our notation, M(I, v) = λv).
Uniqueness (identifiability): holds if and only if the λ_i are distinct.

Power method: v ↦ M(I, v) / ‖M(I, v)‖.

Convergence properties
Let λ_1 > λ_2 > ⋯ > λ_d, and let {v_i} form a basis.
Write the initialization as v = Σ_i c_i v_i. If c_1 ≠ 0, the power method converges to v_1.

Perturbation analysis (Davis-Kahan): for a perturbed matrix M + E, require ‖E‖ < min_{i≠j} |λ_i − λ_j|.
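For reference, the matrix power method in a few lines of NumPy. This is a sketch: the iteration count is arbitrary and a positive spectrum with a gap is assumed.

```python
import numpy as np

def matrix_power_iteration(M, iters=200, seed=0):
    """Repeat v <- M v / ||M v||; converges to the top eigenvector when c_1 = v_1^T v != 0."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[0]); v /= np.linalg.norm(v)
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v, v @ M @ v          # (eigenvector, Rayleigh quotient)
```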
Optimization viewpoint of matrix analysis

M = Σ_{i∈[k]} λ_i v_i ⊗ v_i,   λ_1 > λ_2 > ⋯.

Rayleigh quotient at v: M(v, v) = v^⊤ M v = Σ_i λ_i ⟨v_i, v⟩².

Optimization problem: max_v M(v, v) s.t. ‖v‖ = 1.

Non-convex problem. The global maximizer is v_1 (the top eigenvector).
What are the local optimizers?
Optimization viewpoint of matrix analysis

Optimization: max_v M(v, v) s.t. ‖v‖ = 1.
Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤ v − 1).
First derivative: ∇L(v, λ) = 2(M(I, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v ↦ M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 2(M − λI).

Local optimality condition for constrained optimization:
w^⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v.

Verify: v_1 is the only local optimum. Verify: all other eigenvectors are saddle points.

Power method recovers v_1 whenever the initialization v satisfies ⟨v, v_1⟩ ≠ 0.
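A quick check of the two "verify" claims (a sketch, assuming distinct eigenvalues): at the stationary point v = v_j the multiplier is λ = λ_j, and for w = v_i with i ≠ j,
w^⊤ ∇²L(v_j, λ_j) w = 2 v_i^⊤ (M − λ_j I) v_i = 2(λ_i − λ_j).
This is negative for every i ≠ j only when j = 1, so v_1 is a local maximum; for any j > 1 the direction w = v_1 gives a positive value, so v_j is a saddle point.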
Analysis of Tensor Power Method

T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Bad news about tensors
A decomposition may not always exist for general tensors.
Finding the decomposition is NP-hard in general.
We will see that a tractable case is when we are promised that an orthogonal decomposition exists.

Characterization of components {v_i}
The {v_i} are eigenvectors: T(I, v_i, v_i) = λ_i v_i.
Bad news: there can be other eigenvectors (unlike the matrix case).
With λ_i ≡ 1, v = (v_1 + v_2)/√2 satisfies T(I, v, v) = (1/√2) v.

How do we avoid such spurious solutions (eigenvectors that are not part of the decomposition)?
Optimization viewpoint of tensor analysis

Optimization: max_v T(v, v, v) s.t. ‖v‖ = 1.
Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤ v − 1).
First derivative: ∇L(v, λ) = 3(T(I, v, v) − λv).
Stationary points are eigenvectors: ∇L(v, λ) = 0.
Power method v ↦ T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
Second derivative: ∇²L(v, λ) = 3(2 T(I, I, v) − λI).

Local optimality condition for constrained optimization:
w^⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v.

Verify: the {v_i} are the only local optima. Verify: all other eigenvectors are saddle points.

For an orthogonal tensor, no spurious local optima!
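A quick check of these claims in the same spirit as the matrix case (a sketch, assuming λ_i > 0): at v = v_j we have T(I, I, v_j) = λ_j v_j v_j^⊤, so for any w ⊥ v_j,
w^⊤ ∇²L(v_j, λ_j) w = 3(2λ_j ⟨v_j, w⟩² − λ_j ‖w‖²) = −3λ_j ‖w‖² < 0,
and every v_j is a local optimum. For the spurious eigenvector v = (v_1 + v_2)/√2 with λ_1 = λ_2 = 1 (eigenvalue 1/√2), take w = (v_1 − v_2)/√2 ⊥ v; then w^⊤ T(I, I, v) w = 1/√2 and
w^⊤ ∇²L w = 3(2/√2 − 1/√2) = 3/√2 > 0,
so the spurious eigenvector is a saddle point, not a local optimum.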
Review: matrix power iteration

Recall matrix power iteration for M := Σ_i λ_i v_i v_i^⊤: start with some v, and for t = 1, 2, ...,
v ↦ Mv = Σ_i λ_i (v_i^⊤ v) v_i,
i.e., the component in the v_i direction is scaled by λ_i.

If λ_1 > λ_2 ≥ ⋯, then in t iterations
(v_1^⊤ v)² / Σ_i (v_i^⊤ v)² ≥ 1 − k (λ_2/λ_1)^{2t}.

Converges linearly to v_1, assuming the gap λ_2/λ_1 < 1.
Tensor power iteration convergence analysis

Let c_i := v_i^⊤ v be the initial component in the v_i direction; assume WLOG
λ_1 |c_1| > λ_2 |c_2| ≥ λ_3 |c_3| ≥ ⋯.

Then
v ↦ Σ_i λ_i (v_i^⊤ v)² v_i = Σ_i λ_i c_i² v_i,
i.e., the component in the v_i direction is squared and then scaled by λ_i.

By induction, after t iterations
v ∝ Σ_i λ_i^{2^t − 1} c_i^{2^t} v_i,
so
(v_1^⊤ v)² / Σ_i (v_i^⊤ v)² ≥ 1 − k · max_{i≠1} (λ_1/λ_i)² · |λ_2 c_2 / (λ_1 c_1)|^{2^{t+1}}.
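A small numerical check of this analysis (the sizes, eigenvalues, and seed below are arbitrary): the iteration converges in a handful of steps to the component i maximizing λ_i |c_i|, which depends on the initialization and need not be i = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
lam = np.array([4.0, 3.0, 2.0, 1.0])
V = np.linalg.qr(rng.normal(size=(k, k)))[0]         # orthonormal columns v_i
T = np.einsum('i,ai,bi,ci->abc', lam, V, V, V)       # orthogonal tensor

v = rng.normal(size=k); v /= np.linalg.norm(v)
c = V.T @ v
print('argmax_i lambda_i|c_i| :', np.argmax(lam * np.abs(c)))

for t in range(8):                                   # a few iterations suffice
    v = np.einsum('ijk,j,k->i', T, v, v)
    v /= np.linalg.norm(v)
print('converged to component :', np.argmax(np.abs(V.T @ v)))
```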
Matrix vs. tensor power iteration

Matrix power iteration:
1. Requires a gap between the largest and second-largest eigenvalue (a property of the matrix only).
2. Converges to the top eigenvector.
3. Linear convergence: needs O(log(1/ǫ)) iterations.

Tensor power iteration:
1. Requires a gap between the largest and second-largest λ_i |c_i| (a property of the tensor and the initialization v).
2. Converges to the v_i for which λ_i |c_i| is largest; this could be any of them.
3. Quadratic convergence: needs O(log log(1/ǫ)) iterations.