Orthogonal tensor decomposition

Daniel Hsu, Columbia University

Largely based on the 2012 arXiv report "Tensor decompositions for learning latent variable models", with Anandkumar, Ge, Kakade, and Telgarsky.
The basic decomposition problem

Notation: For a vector $\vec{x} = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$, $\vec{x} \otimes \vec{x} \otimes \vec{x}$ denotes the 3-way array (call it a "tensor") in $\mathbb{R}^{n \times n \times n}$ whose $(i,j,k)$-th entry is $x_i x_j x_k$.

Problem: Given $T \in \mathbb{R}^{n \times n \times n}$ with the promise that
\[ T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t \]
for some orthonormal basis $\{\vec{v}_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(\vec{v}_t, \lambda_t)\}$ (up to some desired precision).
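As a concrete illustration, such a promised $T$ can be synthesized in a few lines of numpy (a minimal sketch; the dimension, weight range, and random seed are arbitrary choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Random orthonormal basis {v_t} via QR, and positive scalars {lambda_t}.
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # columns are v_1, ..., v_n
lam = rng.uniform(0.5, 1.5, size=n)                # lambda_t > 0

# T = sum_t lambda_t * v_t (x) v_t (x) v_t, an n x n x n symmetric tensor.
T = np.einsum('t,it,jt,kt->ijk', lam, V, V, V)

# Sanity check: the (i,j,k) entry of v (x) v (x) v is v_i v_j v_k.
v = V[:, 0]
assert np.isclose(np.einsum('i,j,k->ijk', v, v, v)[1, 2, 3], v[1] * v[2] * v[3])
```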
Basic questions

1. Is $\{(\vec{v}_t, \lambda_t)\}$ uniquely determined?
2. If so, is there an efficient algorithm for finding the decomposition?
3. What if $T$ is perturbed by some small amount?

Perturbed problem: Same as the original problem, except instead of $T$, we are given $T + E$ for some "error tensor" $E$. How "large" can $E$ be if we want $\varepsilon$ precision?
Analogous matrix problem

Matrix problem: Given $M \in \mathbb{R}^{n \times n}$ with the promise that
\[ M = \sum_{t=1}^n \lambda_t \, \vec{v}_t \vec{v}_t^\top \]
for some orthonormal basis $\{\vec{v}_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(\vec{v}_t, \lambda_t)\}$ (up to some desired precision).
Analogous matrix problem

◮ We're promised that $M$ is symmetric and positive definite, so the requested decomposition is an eigendecomposition. In this case, an eigendecomposition always exists and can be found efficiently. It is unique if and only if the $\{\lambda_t\}$ are distinct.

◮ What if $M$ is perturbed by some small amount?

Perturbed matrix problem: Same as the original problem, except instead of $M$, we are given $M + E$ for some "error matrix" $E$ (assumed to be symmetric). The answer is provided by matrix perturbation theory (e.g., Davis-Kahan), which requires $\|E\|_2 < \min_{i \neq j} |\lambda_i - \lambda_j|$.
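For the matrix case, standard library routines already solve both the exact and the perturbed problem (a sketch; the perturbation scale is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.sort(rng.uniform(0.5, 1.5, size=n))
M = (V * lam) @ V.T                      # M = sum_t lambda_t v_t v_t^T

# Symmetric perturbation E with small spectral norm.
A = rng.standard_normal((n, n))
E = 1e-6 * (A + A.T) / 2

w, U = np.linalg.eigh(M + E)             # eigendecomposition of M + E
# Recovered eigenvalues are within ||E||_2 of the true ones (Weyl's inequality);
# eigenvectors are stable when eigenvalue gaps exceed ||E||_2 (Davis-Kahan).
print(np.max(np.abs(w - lam)))
```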
Back to the original problem

Problem: Given $T \in \mathbb{R}^{n \times n \times n}$ with the promise that
\[ T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t \]
for some orthonormal basis $\{\vec{v}_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(\vec{v}_t, \lambda_t)\}$ (up to some desired precision).

Such decompositions do not necessarily exist, even for symmetric tensors. Where the decompositions do exist, the perturbed problem asks whether they are "robust".
Main ideas

Easy claim: Repeated application of a certain quadratic operator based on $T$ (a "power iteration") recovers a single $(\vec{v}_t, \lambda_t)$ up to any desired precision.

Self-reduction: Replace $T$ with $T - \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t$ (a sketch of this deflation step follows below).

◮ Why?: $T - \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t = \sum_{\tau \neq t} \lambda_\tau \, \vec{v}_\tau \otimes \vec{v}_\tau \otimes \vec{v}_\tau$.
◮ Catch: We don't recover $(\vec{v}_t, \lambda_t)$ exactly, so we can actually only replace $T$ with $T - \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t + E_t$ for some "error tensor" $E_t$.
◮ Therefore, we must deal with perturbations anyway.
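The deflation step is one line in numpy (a sketch; the recovered pair $(\vec{v}, \lambda)$ would come from the power iteration discussed later in the talk):

```python
import numpy as np

def deflate(T, v, lam_t):
    """Subtract the recovered rank-1 term lam_t * v (x) v (x) v from T.

    Since (v, lam_t) is only approximate, the result equals
    sum_{tau != t} lam_tau * v_tau (x) v_tau (x) v_tau plus an error tensor E_t.
    """
    return T - lam_t * np.einsum('i,j,k->ijk', v, v, v)
```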
Rest of this talk

1. Identifiability of the decomposition $\{(\vec{v}_t, \lambda_t)\}$ from $T$.
2. A decomposition algorithm based on tensor power iteration.
3. Error analysis of the decomposition algorithm.
Identifiability of the decomposition

Orthonormal basis $\{\vec{v}_t\}$ of $\mathbb{R}^n$, positive scalars $\{\lambda_t > 0\}$:
\[ T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t . \]
In what sense is $\{(\vec{v}_t, \lambda_t)\}$ uniquely determined?

Claim: $\{\vec{v}_t\}$ are the $n$ isolated local maximizers of a certain cubic form $f_T \colon B^n \to \mathbb{R}$, and $f_T(\vec{v}_t) = \lambda_t$.
Aside: multilinear form

There is a natural trilinear form associated with $T$:
\[ (\vec{x}, \vec{y}, \vec{z}) \mapsto \sum_{i,j,k} T_{i,j,k} \, x_i y_j z_k . \]
For matrices $M$, it looks like
\[ (\vec{x}, \vec{y}) \mapsto \sum_{i,j} M_{i,j} \, x_i y_j = \vec{x}^\top M \vec{y} . \]
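In numpy, this trilinear form is a single contraction (a sketch):

```python
import numpy as np

def trilinear(T, x, y, z):
    # T(x, y, z) = sum_{i,j,k} T_{ijk} x_i y_j z_k; the matrix analogue
    # of this contraction is x @ M @ y.
    return np.einsum('ijk,i,j,k->', T, x, y, z)
```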
Review: Rayleigh quotient

Recall the Rayleigh quotient for the matrix $M := \sum_{t=1}^n \lambda_t \vec{v}_t \vec{v}_t^\top$ (assuming $\vec{x} \in S^{n-1}$):
\[ R_M(\vec{x}) := \vec{x}^\top M \vec{x} = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^2 . \]
Every $\vec{v}_t$ such that $\lambda_t = \max_\tau \lambda_\tau$ is a maximizer of $R_M$. (These are also the only local maximizers.)
The natural cubic form

Consider the function $f_T \colon B^n \to \mathbb{R}$ given by
\[ f_T(\vec{x}) = \sum_{i,j,k} T_{i,j,k} \, x_i x_j x_k . \]
For our promised $T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t$, $f_T$ becomes
\[ f_T(\vec{x}) = \sum_{t=1}^n \lambda_t \sum_{i,j,k} (\vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t)_{i,j,k} \, x_i x_j x_k = \sum_{t=1}^n \lambda_t \sum_{i,j,k} (\vec{v}_t)_i (\vec{v}_t)_j (\vec{v}_t)_k \, x_i x_j x_k = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^3 . \]
Observation: $f_T(\vec{v}_t) = \lambda_t$.
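The last identity can be sanity-checked numerically (a sketch, reusing T, V, lam, rng, and n from the construction sketch after the problem statement):

```python
# Reusing T, V, lam, rng, n from the earlier construction sketch.
x = rng.standard_normal(n)
x /= np.linalg.norm(x)

f_direct = np.einsum('ijk,i,j,k->', T, x, x, x)  # sum_{ijk} T_ijk x_i x_j x_k
f_sum = np.sum(lam * (V.T @ x) ** 3)             # sum_t lambda_t (v_t^T x)^3
assert np.isclose(f_direct, f_sum)
```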
Variational characterization

Claim: The isolated local maximizers of $f_T$ on $B^n$ are $\{\vec{v}_t\}$.

Objective function (with the constraint folded in via a Lagrangian):
\[ \vec{x} \mapsto \inf_{\lambda \ge 0} \; \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^3 - 1.5 \, \lambda \, (\|\vec{x}\|_2^2 - 1) . \]
First-order condition for local maxima:
\[ \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^2 \, \vec{v}_t = \lambda \vec{x} . \]
Second-order condition for isolated local maxima:
\[ \vec{w}^\top \Big( 2 \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x}) \, \vec{v}_t \vec{v}_t^\top - \lambda I \Big) \vec{w} < 0 , \quad \forall \vec{w} \perp \vec{x} . \]
Intuition behind variational characterization

We may as well assume $\vec{v}_t$ is the $t$-th coordinate basis vector, so
\[ \max_{\vec{x} \in \mathbb{R}^n} f_T(\vec{x}) = \sum_{t=1}^n \lambda_t x_t^3 \quad \text{s.t.} \quad \sum_{t=1}^n x_t^2 \le 1 . \]
Intuition: Suppose $\mathrm{supp}(\vec{x}) = \{1, 2\}$ and $x_1, x_2 > 0$. Then
\[ f_T(\vec{x}) = \lambda_1 x_1^3 + \lambda_2 x_2^3 < \lambda_1 x_1^2 + \lambda_2 x_2^2 \le \max\{\lambda_1, \lambda_2\} . \]
It is better to have $|\mathrm{supp}(\vec{x})| = 1$, i.e., to pick $\vec{x}$ to be a coordinate basis vector.
Aside: canonical polyadic decomposition

Rank-$K$ canonical polyadic decomposition (CPD) of $T$ (also called PARAFAC, CANDECOMP, or CP):
\[ T = \sum_{i=1}^K \sigma_i \, \vec{u}_i \otimes \vec{v}_i \otimes \vec{w}_i . \]
[Harshman/Jennrich, 1970; Kruskal, 1977; Leurgans et al., 1993]

Number of parameters: $K \cdot (3n + 1)$ (compared to $n^3$ in general).

Fact: Our promised $T$ has a rank-$n$ CPD.

N.B.: Overcomplete ($K > n$) CPD is also interesting and a possibility as long as $K(3n+1) \ll n^3$.
The quadratic operator

Easy claim: Repeated application of a certain quadratic operator (based on $T$) recovers a single $(\lambda_t, \vec{v}_t)$ up to any desired precision.

For any $T \in \mathbb{R}^{n \times n \times n}$ and $\vec{x} = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$, define the quadratic operator
\[ \phi_T(\vec{x}) := \sum_{i,j,k} T_{i,j,k} \, x_j x_k \, \vec{e}_i \in \mathbb{R}^n , \]
where $\vec{e}_i \in \mathbb{R}^n$ is the $i$-th coordinate basis vector.

If $T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t$, then $\phi_T(\vec{x}) = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^2 \, \vec{v}_t$.
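$\phi_T$ is again a single contraction in numpy; the closed form above can be verified directly (a sketch, with the tensor built as in the first sketch):

```python
import numpy as np

def phi(T, x):
    """phi_T(x)_i = sum_{j,k} T_{ijk} x_j x_k."""
    return np.einsum('ijk,j,k->i', T, x, x)

# Check the closed form phi_T(x) = sum_t lambda_t (v_t^T x)^2 v_t.
rng = np.random.default_rng(0)
n = 8
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = rng.uniform(0.5, 1.5, size=n)
T = np.einsum('t,it,jt,kt->ijk', lam, V, V, V)
x = rng.standard_normal(n)
assert np.allclose(phi(T, x), V @ (lam * (V.T @ x) ** 2))
```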
An algorithm?

Recall the first-order condition for local maxima of $f_T(\vec{x}) = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^3$ for $\vec{x} \in B^n$:
\[ \phi_T(\vec{x}) = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^2 \, \vec{v}_t = \lambda \vec{x} , \]
i.e., an "eigenvector"-like condition.

Algorithm: Find $\vec{x} \in B^n$ fixed under $\vec{x} \mapsto \phi_T(\vec{x}) / \|\phi_T(\vec{x})\|$. (Ignoring numerical issues, we can just repeatedly apply $\phi_T$ and defer normalization until later.)

N.B.: Gradient ascent also works [Kolda & Mayo, 2011].
Tensor power iteration [De Lathauwer et al., 2000]

Start with some $\vec{x}^{(0)}$, and for $j = 1, 2, \dots$:
\[ \vec{x}^{(j)} := \phi_T(\vec{x}^{(j-1)}) = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x}^{(j-1)})^2 \, \vec{v}_t . \]
Claim: For almost all initial $\vec{x}^{(0)}$, the sequence $\{\vec{x}^{(j)} / \|\vec{x}^{(j)}\|\}_{j=1}^\infty$ converges quadratically fast to some $\vec{v}_t$.
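A minimal sketch of the iteration in numpy (the function name and default iteration count are arbitrary choices; quadratic convergence means a small fixed iteration count usually suffices in the noiseless case):

```python
import numpy as np

def tensor_power_iteration(T, x0, n_iters=20):
    """Repeatedly apply x -> phi_T(x) / ||phi_T(x)||; return the limit
    point x and lambda = f_T(x), which at a fixed point v_t equals lambda_t."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iters):
        x = np.einsum('ijk,j,k->i', T, x, x)     # phi_T(x)
        x /= np.linalg.norm(x)
    lam = np.einsum('ijk,i,j,k->', T, x, x, x)   # f_T(x)
    return x, lam
```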
Review: matrix power iteration

Recall matrix power iteration for the matrix $M := \sum_{t=1}^n \lambda_t \vec{v}_t \vec{v}_t^\top$: start with some $\vec{x}^{(0)}$, and for $j = 1, 2, \dots$:
\[ \vec{x}^{(j)} := M \vec{x}^{(j-1)} = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x}^{(j-1)}) \, \vec{v}_t , \]
i.e., the component in the $\vec{v}_t$ direction is scaled by $\lambda_t$. If $\lambda_1 > \lambda_2 \ge \cdots$, then
\[ \frac{(\vec{v}_1^\top \vec{x}^{(j)})^2}{\sum_{t=1}^n (\vec{v}_t^\top \vec{x}^{(j)})^2} \ge 1 - k \cdot \Big( \frac{\lambda_2}{\lambda_1} \Big)^{2j} \]
(for some constant $k$ depending on $\vec{x}^{(0)}$), i.e., it converges linearly to $\vec{v}_1$ (assuming the gap $\lambda_2 / \lambda_1 < 1$).
Tensor power iteration convergence analysis

Let $c_t := \vec{v}_t^\top \vec{x}^{(0)}$ (the initial component in the $\vec{v}_t$ direction); assume WLOG $\lambda_1 |c_1| > \lambda_2 |c_2| \ge \lambda_3 |c_3| \ge \cdots$. Then
\[ \vec{x}^{(1)} = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x}^{(0)})^2 \, \vec{v}_t = \sum_{t=1}^n \lambda_t c_t^2 \, \vec{v}_t , \]
i.e., the component in the $\vec{v}_t$ direction is squared and then scaled by $\lambda_t$. It is easy to show
\[ \frac{(\vec{v}_1^\top \vec{x}^{(j)})^2}{\sum_{t=1}^n (\vec{v}_t^\top \vec{x}^{(j)})^2} \ge 1 - k \cdot \Big( \frac{\lambda_1}{\max_{t \neq 1} \lambda_t} \Big)^2 \cdot \Big| \frac{\lambda_2 c_2}{\lambda_1 c_1} \Big|^{2^{j+1}} . \]
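The quadratic convergence is easy to see numerically (a sketch; the dimension, weights, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = rng.uniform(0.5, 1.5, size=n)
T = np.einsum('t,it,jt,kt->ijk', lam, V, V, V)

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
t_star = np.argmax(lam * np.abs(V.T @ x))  # winner: largest lambda_t |c_t|
for j in range(1, 7):
    x = np.einsum('ijk,j,k->i', T, x, x)   # phi_T(x)
    x /= np.linalg.norm(x)
    # Squared component along the winning v_t approaches 1 quadratically fast.
    print(j, (V[:, t_star] @ x) ** 2)
```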
Example

$n = 1024$, $\lambda_t \sim$ u.a.r. $[0, 1]$.

[Plot: value of $(\vec{v}_t^\top \vec{x}^{(0)})^2$ for $t = 1, 2, \dots, 1024$.]
Example

$n = 1024$, $\lambda_t \sim$ u.a.r. $[0, 1]$.

[Plot: value of $(\vec{v}_t^\top \vec{x}^{(1)})^2$ for $t = 1, 2, \dots, 1024$.]
Example

$n = 1024$, $\lambda_t \sim$ u.a.r. $[0, 1]$.

[Plot: value of $(\vec{v}_t^\top \vec{x}^{(2)})^2$ for $t = 1, 2, \dots, 1024$.]