Orthogonal tensor decomposition

Daniel Hsu, Columbia University

Largely based on the 2012 arXiv report "Tensor decompositions for learning latent variable models", with Anandkumar, Ge, Kakade, and Telgarsky.
The basic decomposition problem

Notation: For a vector $\vec{x} = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$, $\vec{x} \otimes \vec{x} \otimes \vec{x}$ denotes the 3-way array (call it a "tensor") in $\mathbb{R}^{n \times n \times n}$ whose $(i,j,k)$-th entry is $x_i x_j x_k$.

Problem: Given $T \in \mathbb{R}^{n \times n \times n}$ with the promise that
\[ T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t \]
for some orthonormal basis $\{\vec{v}_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(\vec{v}_t, \lambda_t)\}$ (up to some desired precision).
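As a concrete illustration, such a promised $T$ can be synthesized in a few lines of numpy (a minimal sketch; the dimension, weight range, and random seed are arbitrary choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Random orthonormal basis {v_t} via QR, and positive scalars {lambda_t}.
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # columns are v_1, ..., v_n
lam = rng.uniform(0.5, 1.5, size=n)                # lambda_t > 0

# T = sum_t lambda_t * v_t (x) v_t (x) v_t, an n x n x n symmetric tensor.
T = np.einsum('t,it,jt,kt->ijk', lam, V, V, V)

# Sanity check: the (i,j,k) entry of v (x) v (x) v is v_i v_j v_k.
v = V[:, 0]
assert np.isclose(np.einsum('i,j,k->ijk', v, v, v)[1, 2, 3], v[1] * v[2] * v[3])
```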
Basic questions

1. Is $\{(\vec{v}_t, \lambda_t)\}$ uniquely determined?
2. If so, is there an efficient algorithm for finding the decomposition?
3. What if $T$ is perturbed by some small amount?

Perturbed problem: Same as the original problem, except instead of $T$, we are given $T + E$ for some "error tensor" $E$. How "large" can $E$ be if we want $\varepsilon$ precision?
Analogous matrix problem

Matrix problem: Given $M \in \mathbb{R}^{n \times n}$ with the promise that
\[ M = \sum_{t=1}^n \lambda_t \, \vec{v}_t \vec{v}_t^\top \]
for some orthonormal basis $\{\vec{v}_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(\vec{v}_t, \lambda_t)\}$ (up to some desired precision).
Analogous matrix problem

◮ We're promised that $M$ is symmetric and positive definite, so the requested decomposition is an eigendecomposition. In this case, an eigendecomposition always exists and can be found efficiently. It is unique if and only if the $\{\lambda_t\}$ are distinct.

◮ What if $M$ is perturbed by some small amount?

Perturbed matrix problem: Same as the original problem, except instead of $M$, we are given $M + E$ for some "error matrix" $E$ (assumed to be symmetric). The answer is provided by matrix perturbation theory (e.g., Davis-Kahan), which requires $\|E\|_2 < \min_{i \neq j} |\lambda_i - \lambda_j|$.
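For the matrix case, standard library routines already solve both the exact and the perturbed problem (a sketch; the perturbation scale is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.sort(rng.uniform(0.5, 1.5, size=n))
M = (V * lam) @ V.T                      # M = sum_t lambda_t v_t v_t^T

# Symmetric perturbation E with small spectral norm.
A = rng.standard_normal((n, n))
E = 1e-6 * (A + A.T) / 2

w, U = np.linalg.eigh(M + E)             # eigendecomposition of M + E
# Recovered eigenvalues are within ||E||_2 of the true ones (Weyl's inequality);
# eigenvectors are stable when eigenvalue gaps exceed ||E||_2 (Davis-Kahan).
print(np.max(np.abs(w - lam)))
```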
Back to the original problem

Problem: Given $T \in \mathbb{R}^{n \times n \times n}$ with the promise that
\[ T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t \]
for some orthonormal basis $\{\vec{v}_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(\vec{v}_t, \lambda_t)\}$ (up to some desired precision).

Such decompositions do not necessarily exist, even for symmetric tensors. Where the decompositions do exist, the perturbed problem asks whether they are "robust".
Main ideas

Easy claim: Repeated application of a certain quadratic operator based on $T$ (a "power iteration") recovers a single $(\vec{v}_t, \lambda_t)$ up to any desired precision.

Self-reduction: Replace $T$ with $T - \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t$ (a sketch of this deflation step follows below).

◮ Why?: $T - \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t = \sum_{\tau \neq t} \lambda_\tau \, \vec{v}_\tau \otimes \vec{v}_\tau \otimes \vec{v}_\tau$.
◮ Catch: We don't recover $(\vec{v}_t, \lambda_t)$ exactly, so we can actually only replace $T$ with $T - \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t + E_t$ for some "error tensor" $E_t$.
◮ Therefore, we must deal with perturbations anyway.
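The deflation step is one line in numpy (a sketch; the recovered pair $(\vec{v}, \lambda)$ would come from the power iteration discussed later in the talk):

```python
import numpy as np

def deflate(T, v, lam_t):
    """Subtract the recovered rank-1 term lam_t * v (x) v (x) v from T.

    Since (v, lam_t) is only approximate, the result equals
    sum_{tau != t} lam_tau * v_tau (x) v_tau (x) v_tau plus an error tensor E_t.
    """
    return T - lam_t * np.einsum('i,j,k->ijk', v, v, v)
```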
Rest of this talk

1. Identifiability of the decomposition $\{(\vec{v}_t, \lambda_t)\}$ from $T$.
2. A decomposition algorithm based on tensor power iteration.
3. Error analysis of the decomposition algorithm.
Identifiability of the decomposition

Orthonormal basis $\{\vec{v}_t\}$ of $\mathbb{R}^n$, positive scalars $\{\lambda_t > 0\}$:
\[ T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t . \]
In what sense is $\{(\vec{v}_t, \lambda_t)\}$ uniquely determined?

Claim: $\{\vec{v}_t\}$ are the $n$ isolated local maximizers of a certain cubic form $f_T \colon B^n \to \mathbb{R}$, and $f_T(\vec{v}_t) = \lambda_t$.
Aside: multilinear form

There is a natural trilinear form associated with $T$:
\[ (\vec{x}, \vec{y}, \vec{z}) \mapsto \sum_{i,j,k} T_{i,j,k} \, x_i y_j z_k . \]
For matrices $M$, it looks like
\[ (\vec{x}, \vec{y}) \mapsto \sum_{i,j} M_{i,j} \, x_i y_j = \vec{x}^\top M \vec{y} . \]
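In numpy, this trilinear form is a single contraction (a sketch):

```python
import numpy as np

def trilinear(T, x, y, z):
    # T(x, y, z) = sum_{i,j,k} T_{ijk} x_i y_j z_k; the matrix analogue
    # of this contraction is x @ M @ y.
    return np.einsum('ijk,i,j,k->', T, x, y, z)
```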
Review: Rayleigh quotient

Recall the Rayleigh quotient for the matrix $M := \sum_{t=1}^n \lambda_t \vec{v}_t \vec{v}_t^\top$ (assuming $\vec{x} \in S^{n-1}$):
\[ R_M(\vec{x}) := \vec{x}^\top M \vec{x} = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^2 . \]
Every $\vec{v}_t$ such that $\lambda_t = \max_\tau \lambda_\tau$ is a maximizer of $R_M$. (These are also the only local maximizers.)
The natural cubic form

Consider the function $f_T \colon B^n \to \mathbb{R}$ given by
\[ f_T(\vec{x}) = \sum_{i,j,k} T_{i,j,k} \, x_i x_j x_k . \]
For our promised $T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t$, $f_T$ becomes
\[ f_T(\vec{x}) = \sum_{t=1}^n \lambda_t \sum_{i,j,k} (\vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t)_{i,j,k} \, x_i x_j x_k = \sum_{t=1}^n \lambda_t \sum_{i,j,k} (\vec{v}_t)_i (\vec{v}_t)_j (\vec{v}_t)_k \, x_i x_j x_k = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^3 . \]
Observation: $f_T(\vec{v}_t) = \lambda_t$.
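The last identity can be sanity-checked numerically (a sketch, reusing T, V, lam, rng, and n from the construction sketch after the problem statement):

```python
# Reusing T, V, lam, rng, n from the earlier construction sketch.
x = rng.standard_normal(n)
x /= np.linalg.norm(x)

f_direct = np.einsum('ijk,i,j,k->', T, x, x, x)  # sum_{ijk} T_ijk x_i x_j x_k
f_sum = np.sum(lam * (V.T @ x) ** 3)             # sum_t lambda_t (v_t^T x)^3
assert np.isclose(f_direct, f_sum)
```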
Variational characterization

Claim: The isolated local maximizers of $f_T$ on $B^n$ are $\{\vec{v}_t\}$.

Objective function (with the constraint folded in via a Lagrangian):
\[ \vec{x} \mapsto \inf_{\lambda \ge 0} \; \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^3 - 1.5 \, \lambda \, (\|\vec{x}\|_2^2 - 1) . \]
First-order condition for local maxima:
\[ \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^2 \, \vec{v}_t = \lambda \vec{x} . \]
Second-order condition for isolated local maxima:
\[ \vec{w}^\top \Big( 2 \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x}) \, \vec{v}_t \vec{v}_t^\top - \lambda I \Big) \vec{w} < 0 , \quad \forall \vec{w} \perp \vec{x} . \]
Intuition behind variational characterization

We may as well assume $\vec{v}_t$ is the $t$-th coordinate basis vector, so
\[ \max_{\vec{x} \in \mathbb{R}^n} f_T(\vec{x}) = \sum_{t=1}^n \lambda_t x_t^3 \quad \text{s.t.} \quad \sum_{t=1}^n x_t^2 \le 1 . \]
Intuition: Suppose $\mathrm{supp}(\vec{x}) = \{1, 2\}$ and $x_1, x_2 > 0$. Then
\[ f_T(\vec{x}) = \lambda_1 x_1^3 + \lambda_2 x_2^3 < \lambda_1 x_1^2 + \lambda_2 x_2^2 \le \max\{\lambda_1, \lambda_2\} . \]
It is better to have $|\mathrm{supp}(\vec{x})| = 1$, i.e., to pick $\vec{x}$ to be a coordinate basis vector.
Aside: canonical polyadic decomposition

Rank-$K$ canonical polyadic decomposition (CPD) of $T$ (also called PARAFAC, CANDECOMP, or CP):
\[ T = \sum_{i=1}^K \sigma_i \, \vec{u}_i \otimes \vec{v}_i \otimes \vec{w}_i . \]
[Harshman/Jennrich, 1970; Kruskal, 1977; Leurgans et al., 1993]

Number of parameters: $K \cdot (3n + 1)$ (compared to $n^3$ in general).

Fact: Our promised $T$ has a rank-$n$ CPD.

N.B.: Overcomplete ($K > n$) CPD is also interesting and a possibility as long as $K(3n+1) \ll n^3$.
The quadratic operator

Easy claim: Repeated application of a certain quadratic operator (based on $T$) recovers a single $(\lambda_t, \vec{v}_t)$ up to any desired precision.

For any $T \in \mathbb{R}^{n \times n \times n}$ and $\vec{x} = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$, define the quadratic operator
\[ \phi_T(\vec{x}) := \sum_{i,j,k} T_{i,j,k} \, x_j x_k \, \vec{e}_i \in \mathbb{R}^n , \]
where $\vec{e}_i \in \mathbb{R}^n$ is the $i$-th coordinate basis vector.

If $T = \sum_{t=1}^n \lambda_t \, \vec{v}_t \otimes \vec{v}_t \otimes \vec{v}_t$, then $\phi_T(\vec{x}) = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^2 \, \vec{v}_t$.
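$\phi_T$ is again a single contraction in numpy; the closed form above can be verified directly (a sketch, with the tensor built as in the first sketch):

```python
import numpy as np

def phi(T, x):
    """phi_T(x)_i = sum_{j,k} T_{ijk} x_j x_k."""
    return np.einsum('ijk,j,k->i', T, x, x)

# Check the closed form phi_T(x) = sum_t lambda_t (v_t^T x)^2 v_t.
rng = np.random.default_rng(0)
n = 8
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = rng.uniform(0.5, 1.5, size=n)
T = np.einsum('t,it,jt,kt->ijk', lam, V, V, V)
x = rng.standard_normal(n)
assert np.allclose(phi(T, x), V @ (lam * (V.T @ x) ** 2))
```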
An algorithm?

Recall the first-order condition for local maxima of $f_T(\vec{x}) = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^3$ for $\vec{x} \in B^n$:
\[ \phi_T(\vec{x}) = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x})^2 \, \vec{v}_t = \lambda \vec{x} , \]
i.e., an "eigenvector"-like condition.

Algorithm: Find $\vec{x} \in B^n$ fixed under $\vec{x} \mapsto \phi_T(\vec{x}) / \|\phi_T(\vec{x})\|$. (Ignoring numerical issues, we can just repeatedly apply $\phi_T$ and defer normalization until later.)

N.B.: Gradient ascent also works [Kolda & Mayo, 2011].
Tensor power iteration [De Lathauwer et al., 2000]

Start with some $\vec{x}^{(0)}$, and for $j = 1, 2, \dots$:
\[ \vec{x}^{(j)} := \phi_T(\vec{x}^{(j-1)}) = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x}^{(j-1)})^2 \, \vec{v}_t . \]
Claim: For almost all initial $\vec{x}^{(0)}$, the sequence $\{\vec{x}^{(j)} / \|\vec{x}^{(j)}\|\}_{j=1}^\infty$ converges quadratically fast to some $\vec{v}_t$.
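A minimal sketch of the iteration in numpy (the function name and default iteration count are arbitrary choices; quadratic convergence means a small fixed iteration count usually suffices in the noiseless case):

```python
import numpy as np

def tensor_power_iteration(T, x0, n_iters=20):
    """Repeatedly apply x -> phi_T(x) / ||phi_T(x)||; return the limit
    point x and lambda = f_T(x), which at a fixed point v_t equals lambda_t."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iters):
        x = np.einsum('ijk,j,k->i', T, x, x)     # phi_T(x)
        x /= np.linalg.norm(x)
    lam = np.einsum('ijk,i,j,k->', T, x, x, x)   # f_T(x)
    return x, lam
```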
Review: matrix power iteration

Recall matrix power iteration for the matrix $M := \sum_{t=1}^n \lambda_t \vec{v}_t \vec{v}_t^\top$: start with some $\vec{x}^{(0)}$, and for $j = 1, 2, \dots$:
\[ \vec{x}^{(j)} := M \vec{x}^{(j-1)} = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x}^{(j-1)}) \, \vec{v}_t , \]
i.e., the component in the $\vec{v}_t$ direction is scaled by $\lambda_t$. If $\lambda_1 > \lambda_2 \ge \cdots$, then
\[ \frac{(\vec{v}_1^\top \vec{x}^{(j)})^2}{\sum_{t=1}^n (\vec{v}_t^\top \vec{x}^{(j)})^2} \ge 1 - k \cdot \Big( \frac{\lambda_2}{\lambda_1} \Big)^{2j} \]
(for some constant $k$ depending on $\vec{x}^{(0)}$), i.e., it converges linearly to $\vec{v}_1$ (assuming the gap $\lambda_2 / \lambda_1 < 1$).
Tensor power iteration convergence analysis

Let $c_t := \vec{v}_t^\top \vec{x}^{(0)}$ (the initial component in the $\vec{v}_t$ direction); assume WLOG $\lambda_1 |c_1| > \lambda_2 |c_2| \ge \lambda_3 |c_3| \ge \cdots$. Then
\[ \vec{x}^{(1)} = \sum_{t=1}^n \lambda_t (\vec{v}_t^\top \vec{x}^{(0)})^2 \, \vec{v}_t = \sum_{t=1}^n \lambda_t c_t^2 \, \vec{v}_t , \]
i.e., the component in the $\vec{v}_t$ direction is squared and then scaled by $\lambda_t$. It is easy to show
\[ \frac{(\vec{v}_1^\top \vec{x}^{(j)})^2}{\sum_{t=1}^n (\vec{v}_t^\top \vec{x}^{(j)})^2} \ge 1 - k \cdot \Big( \frac{\lambda_1}{\max_{t \neq 1} \lambda_t} \Big)^2 \cdot \Big| \frac{\lambda_2 c_2}{\lambda_1 c_1} \Big|^{2^{j+1}} . \]
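The quadratic convergence is easy to see numerically (a sketch; the dimension, weights, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = rng.uniform(0.5, 1.5, size=n)
T = np.einsum('t,it,jt,kt->ijk', lam, V, V, V)

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
t_star = np.argmax(lam * np.abs(V.T @ x))  # winner: largest lambda_t |c_t|
for j in range(1, 7):
    x = np.einsum('ijk,j,k->i', T, x, x)   # phi_T(x)
    x /= np.linalg.norm(x)
    # Squared component along the winning v_t approaches 1 quadratically fast.
    print(j, (V[:, t_star] @ x) ** 2)
```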
Example

$n = 1024$, $\lambda_t \sim$ u.a.r. $[0, 1]$.

[Plot: value of $(\vec{v}_t^\top \vec{x}^{(0)})^2$ for $t = 1, 2, \dots, 1024$.]
Example

$n = 1024$, $\lambda_t \sim$ u.a.r. $[0, 1]$.

[Plot: value of $(\vec{v}_t^\top \vec{x}^{(1)})^2$ for $t = 1, 2, \dots, 1024$.]
Example

$n = 1024$, $\lambda_t \sim$ u.a.r. $[0, 1]$.

[Plot: value of $(\vec{v}_t^\top \vec{x}^{(2)})^2$ for $t = 1, 2, \dots, 1024$.]