Low-Rank Tensor Techniques for High-Dimensional Problems


Low-Rank Tensor Techniques for High-Dimensional Problems
Daniel Kressner
CADMOS Chair for Numerical Algorithms and HPC, MATHICSE, EPFL

Contents: What is a tensor? Applications. Matrices and low rank. CP and Tucker.


1. High-dimensional elliptic PDEs: 3D model problem
◮ Discretization of the 1D Laplace operator:
$-\partial_{xx} \approx \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix} =: A.$
◮ Application in each coordinate direction:
$-\partial_{\xi_1\xi_1} u(\xi_1,\xi_2,\xi_3) \approx A \circ_1 U,\quad -\partial_{\xi_2\xi_2} u(\xi_1,\xi_2,\xi_3) \approx A \circ_2 U,\quad -\partial_{\xi_3\xi_3} u(\xi_1,\xi_2,\xi_3) \approx A \circ_3 U.$
◮ Hence $-\Delta u \approx A\circ_1 U + A\circ_2 U + A\circ_3 U$, or in vectorized form with $u = \mathrm{vec}(U)$:
$-\Delta u \approx (I\otimes I\otimes A + I\otimes A\otimes I + A\otimes I\otimes I)\, u.$
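A minimal MATLAB sketch (illustrative, not from the original slides) of how this Kronecker-structured 3D Laplacian can be assembled from the 1D finite-difference matrix $A$; the grid size n = 50 is an arbitrary choice and the $1/h^2$ scaling is omitted, as on the slide.

```matlab
n = 50;
e = ones(n, 1);
A = spdiags([-e 2*e -e], -1:1, n, n);   % 1D Laplace stencil tridiag(-1, 2, -1)
I = speye(n);
A3 = kron(I, kron(I, A)) ...            % mode 1
   + kron(I, kron(A, I)) ...            % mode 2
   + kron(A, kron(I, I));               % mode 3
```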

2. High-dimensional elliptic PDEs: 3D model problem
Finite difference discretization of the model problem $-\Delta u = f$ in $\Omega$, $u|_{\partial\Omega} = 0$, for $\Omega = [0,1]^3$ takes the form
$(I\otimes I\otimes A + I\otimes A\otimes I + A\otimes I\otimes I)\, u = f.$
Similar structure for finite element discretization with tensorized finite elements:
$V\otimes W\otimes Z = \Big\{ \sum_{i,j,k} \alpha_{ijk}\, v_i(\xi_1)\, w_j(\xi_2)\, z_k(\xi_3) : \alpha_{ijk}\in\mathbb{R} \Big\}$
with $V = \{v_1(\xi_1),\dots,v_n(\xi_1)\}$, $W = \{w_1(\xi_2),\dots,w_n(\xi_2)\}$, $Z = \{z_1(\xi_3),\dots,z_n(\xi_3)\}$.
Galerkin discretization:
$(K_V\otimes M_W\otimes M_Z + M_V\otimes K_W\otimes M_Z + M_V\otimes M_W\otimes K_Z)\, u = f,$
with 1D mass/stiffness matrices $M_V, M_W, M_Z$ and $K_V, K_W, K_Z$.

3. High-dimensional elliptic PDEs: Arbitrary dimensions
Finite difference discretization of the model problem $-\Delta u = f$ in $\Omega$, $u|_{\partial\Omega} = 0$, for $\Omega = [0,1]^d$ takes the form
$\Big( \sum_{j=1}^{d} I\otimes\cdots\otimes I\otimes A\otimes I\otimes\cdots\otimes I \Big)\, u = f,$
with $A$ acting in the $j$-th coordinate direction (a sketch of this assembly follows below).
To obtain such Kronecker structure in general:
◮ tensorized domain;
◮ highly structured grid;
◮ coefficients that can be written/approximated as a sum of separable functions.
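A small sketch of this assembly, assuming the same 1D matrix $A$ in every coordinate direction; `kron_sum` is a hypothetical helper name, not code from the slides. Mode $j$ corresponds to the $j$-th Kronecker factor counted from the right, matching the vectorization used above.

```matlab
function L = kron_sum(A, d)
% KRON_SUM  Assemble sum_{j=1}^d I x ... x A x ... x I, with A acting in mode j.
  n = size(A, 1);  I = speye(n);
  L = sparse(n^d, n^d);
  for j = 1:d
      T = 1;
      for k = d:-1:1            % leftmost Kronecker factor corresponds to mode d
          if k == j, T = kron(T, A); else, T = kron(T, I); end
      end
      L = L + T;
  end
end
```

For `d = 3` and the matrix `A` from the previous sketch, `kron_sum(A, 3)` reproduces `A3`.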

4. High-dimensional PDE-eigenvalue problems
PDE-eigenvalue problem in $\Omega = [0,1]^d$:
$-\Delta u(\xi) + V(\xi)\, u(\xi) = \lambda\, u(\xi),\qquad u(\xi) = 0$ on $\partial\Omega$.
Assumption: the potential can be represented as
$V(\xi) = \sum_{j=1}^{s} V^{(1)}_j(\xi_1)\, V^{(2)}_j(\xi_2)\cdots V^{(d)}_j(\xi_d).$
⇒ Finite difference discretization $A u = (A_L + A_V)\, u = \lambda u$, with
$A_L = \sum_{j=1}^{d} \underbrace{I\otimes\cdots\otimes I}_{d-j\ \text{times}}\otimes\, A\,\otimes\underbrace{I\otimes\cdots\otimes I}_{j-1\ \text{times}},\qquad A_V = \sum_{j=1}^{s} A^{(d)}_{V,j}\otimes\cdots\otimes A^{(2)}_{V,j}\otimes A^{(1)}_{V,j},$
where $A$ is the 1D finite-difference Laplace matrix and $A^{(\mu)}_{V,j}$ discretizes $V^{(\mu)}_j$.

5. Quantum many-body problems
◮ spin-1/2 particles: proton, neutron, electron, and quark
◮ two states: spin-up, spin-down
◮ quantum state of each spin represented by a vector in $\mathbb{C}^2$ (spinor)
◮ quantum state of a system of $d$ spins represented by a vector in $\mathbb{C}^{2^d}$
◮ quantum mechanical operators expressed in terms of the Pauli matrices
$P_x = \begin{pmatrix} 0 & 1\\ 1 & 0\end{pmatrix},\qquad P_y = \begin{pmatrix} 0 & -\mathrm{i}\\ \mathrm{i} & 0\end{pmatrix},\qquad P_z = \begin{pmatrix} 1 & 0\\ 0 & -1\end{pmatrix}.$
◮ spin Hamiltonian: sum of Kronecker products of Pauli matrices and identities ⇒ each term describes a physical (inter)action of spins
◮ interaction of spins described by a graph
◮ Goal: compute the ground state of the spin Hamiltonian.

6. Quantum many-body problems
Example: 1D chain of 5 spins (sites 1, 2, 3, 4, 5) with periodic boundary conditions.
Hamiltonian describing pairwise interaction between nearest neighbors:
$H = P_z\otimes P_z\otimes I\otimes I\otimes I + I\otimes P_z\otimes P_z\otimes I\otimes I + I\otimes I\otimes P_z\otimes P_z\otimes I + I\otimes I\otimes I\otimes P_z\otimes P_z + P_z\otimes I\otimes I\otimes I\otimes P_z.$
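A minimal MATLAB sketch (illustrative, not from the slides) assembling exactly this 5-spin Hamiltonian as a sum of Kronecker products and computing its ground state with a dense eigenvalue decomposition, which is feasible here only because $2^5 = 32$ is tiny:

```matlab
Pz = [1 0; 0 -1];  I2 = eye(2);  d = 5;
H = zeros(2^d);
for k = 1:d                              % interaction of sites (k, k+1), site d+1 = 1
    factors = repmat({I2}, 1, d);
    factors{k} = Pz;
    factors{mod(k, d) + 1} = Pz;
    term = 1;
    for j = 1:d, term = kron(term, factors{j}); end
    H = H + term;
end
[V, D] = eig(H);                         % ground state = eigenvector of smallest eigenvalue
[~, idx] = min(diag(D));
ground_state = V(:, idx);
```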

7. Quantum many-body problems
◮ Ising (ZZ) model for a 1D chain of $d$ spins with open boundary conditions:
$H = \sum_{k=1}^{d-1} I\otimes\cdots\otimes I\otimes \underbrace{P_z\otimes P_z}_{\text{sites }k,\,k+1}\otimes I\otimes\cdots\otimes I \;+\; \lambda \sum_{k=1}^{d} I\otimes\cdots\otimes I\otimes \underbrace{P_x}_{\text{site }k}\otimes I\otimes\cdots\otimes I,$
where $\lambda$ is the ratio between the strength of the magnetic field and the pairwise interactions.
◮ 1D Heisenberg (XY) model
◮ Current research: 2D models.
◮ More details in: Huckle/Waldherr/Schulte-Herbrüggen: Computations in Quantum Tensor Networks. Schollwöck: The density-matrix renormalization group in the age of matrix product states.

8. Stochastic Automata Networks (SANs)
◮ 3 stochastic automata $A_1, A_2, A_3$, each having 3 states.
◮ Vector $x^{(i)}_t\in\mathbb{R}^3$ describes the probabilities of the states (1), (2), (3) of $A_i$ at time $t$.
◮ No coupling between automata ⇒ local transition $x^{(i)}_t\mapsto x^{(i)}_{t+1}$ described by a Markov chain: $x^{(i)}_{t+1} = E_i\, x^{(i)}_t$, with a stochastic matrix $E_i$.
◮ Stationary distribution of $A_i$ = Perron vector of $E_i$ (eigenvector for eigenvalue 1).
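A small MATLAB sketch of the uncoupled case (illustrative, not from the slides; the matrices $E_i$ are random stand-ins). It uses the standard fact that for independent chains the joint transition matrix is the Kronecker product $E_3\otimes E_2\otimes E_1$, and that its stationary distribution is the Kronecker product of the local Perron vectors:

```matlab
rng(0);
col_stoch = @(M) M ./ sum(M, 1);                 % random column-stochastic matrices
E = {col_stoch(rand(3)), col_stoch(rand(3)), col_stoch(rand(3))};

pi_loc = cell(1, 3);                             % local stationary distributions
for i = 1:3
    [V, D] = eig(E{i});
    [~, k] = min(abs(diag(D) - 1));              % Perron eigenvalue 1
    pi_loc{i} = abs(V(:, k)) / sum(abs(V(:, k)));
end

Ejoint   = kron(E{3}, kron(E{2}, E{1}));         % joint (uncoupled) chain
pi_joint = kron(pi_loc{3}, kron(pi_loc{2}, pi_loc{1}));
norm(Ejoint * pi_joint - pi_joint)               % ~ machine precision
```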

9. Stochastic Automata Networks (SANs)
◮ 3 stochastic automata $A_1, A_2, A_3$, each having 3 states.
◮ Coupling between automata ⇒ local transition $x^{(i)}_t\mapsto x^{(i)}_{t+1}$ not described by a Markov chain.
◮ Need to consider all possible combinations of states in $(A_1, A_2, A_3)$: $(1,1,1), (1,1,2), (1,1,3), (1,2,1), (1,2,2), \dots$
◮ Vector $x_t\in\mathbb{R}^{3^3}$ (or tensor $\mathcal{X}(t)\in\mathbb{R}^{3\times3\times3}$) describes the probabilities of the combined states.

10. Stochastic Automata Networks (SANs)
◮ Transition $x_t\mapsto x_{t+1}$ described by a Markov chain: $x_{t+1} = E\, x_t$, with a large stochastic matrix $E$.
◮ Oversimplified example:
$E = \underbrace{I\otimes I\otimes\widetilde E_1 + I\otimes\widetilde E_2\otimes I + \widetilde E_3\otimes I\otimes I}_{\text{local transitions}} + \underbrace{I\otimes E_{21}\otimes E_{12}}_{\text{interaction between }A_1,\,A_2} + \underbrace{E_{32}\otimes E_{23}\otimes I}_{\text{interaction between }A_2,\,A_3}.$
◮ Goal: compute the stationary distribution = Perron vector of $E$.
◮ More details in: Stewart: Introduction to the Numerical Solution of Markov Chains, Chapter 9. Buchholz: Product Form Approximations for Communicating Markov Processes.

11. Further applications
Other applications in scientific computing featuring low-rank tensor concepts:
◮ Boltzmann equation [Ibragimov/Rjasanow'2009]
◮ Dynamical systems [Koch/Lubich'2009]
◮ Parabolic PDEs [Andreev/Tobler'2011], [Khoromskij'2009]
◮ Stochastic PDEs [Khoromskij/Schwab'2010], [Matthies/Zander'2011], [Kressner/Tobler'2011], [Ballani/Grasedyck/Kluge'2011], ...
◮ Electronic structure calculation [Chinnamsetty et al.'2007], [Flad et al.'2009], [Khoromskij/Khoromskaja'2009], [Limpanuparb/Gill'2009], [Benedikt et al.'2011], [Mohlenkamp'2011], ...
◮ Evaluation of boundary integrals (in BEM): [Grasedyck], [Khoromskij/Sauter/Veit'2011]
◮ ...

12. Summary
◮ Large diversity of applications leading to linear systems / eigenvalue problems with Kronecker product structure.
◮ For many problems of practical interest: explicit storage/computation of the solution is infeasible.
◮ Increasing use of low-rank tensor techniques. Heaviest use currently: DMRG for quantum many-body problems.
◮ Remark: for PDE-related applications, high dimensionality can also be addressed during the discretization phase (sparse grids, adaptive sparse discretization, ...). This has advantages and disadvantages.

13. Approximate low-rank matrices
◮ Singular value decomposition
◮ Separability and low rank
◮ Separability by polynomial interpolation
◮ Separability by exponential sums
◮ Low rank of snapshot matrices

14. Low-rank approximation
Setting: matrix $X\in\mathbb{R}^{n\times m}$, with $m$ and $n$ too large to compute/store $X$ explicitly.
Idea: replace $X$ by $RS^T$ with $R\in\mathbb{R}^{n\times r}$, $S\in\mathbb{R}^{m\times r}$ and $r\ll m, n$.
Memory: $nm$ for $X$ vs. $nr + rm$ for $RS^T$; cost: $\mathrm{ops}(m,n)$ vs. roughly $\mathrm{ops}(m,n)\times r/\min\{m,n\}$ (?).
Best approximation error:
$\min\big\{\|X - RS^T\|_2 : R\in\mathbb{R}^{n\times r},\ S\in\mathbb{R}^{m\times r}\big\} = \sigma_{r+1},$
with the singular values $\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_{\min\{m,n\}}$ of $X$.

15. Construction from the singular value decomposition
SVD: Let $X\in\mathbb{R}^{n\times m}$ and $k = \min\{m,n\}$. Then there exist orthonormal matrices
$U = [u_1, u_2,\dots,u_k]\in\mathbb{R}^{n\times k},\qquad V = [v_1, v_2,\dots,v_k]\in\mathbb{R}^{m\times k}$
such that $X = U\Sigma V^T$ with $\Sigma = \mathrm{diag}(\sigma_1,\sigma_2,\dots,\sigma_k)$.
Choose $r\le k$ and partition
$X = [U_1, U_2]\begin{pmatrix}\Sigma_1 & 0\\ 0 & \Sigma_2\end{pmatrix}[V_1, V_2]^T = \underbrace{U_1\Sigma_1}_{=:R}\underbrace{V_1^T}_{=:S^T} + U_2\Sigma_2 V_2^T.$
Then $\|X - RS^T\|_2 = \|\Sigma_2\|_2 = \sigma_{r+1}$.
Good low-rank approximation if the singular values decay sufficiently fast.
Also: $\mathrm{span}(X)\approx\mathrm{span}(R)$, $\mathrm{span}(X^T)\approx\mathrm{span}(S)$.
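A short MATLAB sketch of this truncation on a synthetic, numerically low-rank matrix (illustrative only; the sizes and the perturbation level 1e-8 are arbitrary choices). The spectral-norm error indeed matches $\sigma_{r+1}$:

```matlab
rng(1);
n = 200; m = 150; r = 10;
X = randn(n, r) * randn(r, m) + 1e-8 * randn(n, m);  % numerically low-rank matrix
[U, S, V] = svd(X, 'econ');
R  = U(:, 1:r) * S(1:r, 1:r);
St = V(:, 1:r)';                                     % this is S^T in the slide's notation
sigma = diag(S);
[norm(X - R * St, 2), sigma(r + 1)]                  % the two values agree
```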

16. Discretization of a bivariate function
◮ Bivariate function $f(x,y): [x_{\min}, x_{\max}]\times[y_{\min}, y_{\max}]\to\mathbb{R}$.
◮ Function values on the tensor grid $[x_1,\dots,x_m]\times[y_1,\dots,y_n]$:
$F = \begin{pmatrix} f(x_1,y_1) & f(x_1,y_2) & \cdots & f(x_1,y_n)\\ f(x_2,y_1) & f(x_2,y_2) & \cdots & f(x_2,y_n)\\ \vdots & \vdots & & \vdots\\ f(x_m,y_1) & f(x_m,y_2) & \cdots & f(x_m,y_n)\end{pmatrix}.$
Basic but crucial observation: if $f(x,y) = g(x)h(y)$, then
$F = \begin{pmatrix} g(x_1)h(y_1) & \cdots & g(x_1)h(y_n)\\ \vdots & & \vdots\\ g(x_m)h(y_1) & \cdots & g(x_m)h(y_n)\end{pmatrix} = \begin{pmatrix} g(x_1)\\ \vdots\\ g(x_m)\end{pmatrix}\big[h(y_1)\ \cdots\ h(y_n)\big].$
Separability implies rank 1.

17. Separability and low rank
Approximation by a sum of separable functions:
$f(x,y) = \underbrace{g_1(x)h_1(y) + \cdots + g_r(x)h_r(y)}_{=:f_r(x,y)} + \text{error}.$
Define
$F_r = \begin{pmatrix} f_r(x_1,y_1) & \cdots & f_r(x_1,y_n)\\ \vdots & & \vdots\\ f_r(x_m,y_1) & \cdots & f_r(x_m,y_n)\end{pmatrix}.$
Then $F_r$ has rank $\le r$ and $\|F - F_r\|_F\le\sqrt{mn}\times\text{error}$.
⇒ $\sigma_{r+1}(F)\le\|F - F_r\|_2\le\|F - F_r\|_F\le\sqrt{mn}\times\text{error}.$
Semi-separable approximation implies low-rank approximation.

18. Semi-separable approximation by polynomials
The solution of the approximation problem
$f(x,y) = g_1(x)h_1(y) + \cdots + g_r(x)h_r(y) + \text{error}$
is not trivial; $g_j, h_j$ can be chosen arbitrarily!
General construction by polynomial interpolation:
1. Lagrange interpolation of $f(x,y)$ in the $y$-coordinate:
$I_y[f](x,y) = \sum_{j=1}^{r} f(x,\theta_j)\, L_j(y)$
with Lagrange polynomials $L_j$ of degree $r-1$ on $[y_{\min}, y_{\max}]$.
2. Interpolation of $I_y[f]$ in the $x$-coordinate:
$I_x[I_y[f]](x,y) = \sum_{i,j=1}^{r} f(\xi_i,\theta_j)\, L_i(x)\, L_j(y) \;\hat=\; \sum_{i=1}^{r}\hat L_{i,x}(x)\,\hat L_{i,y}(y),$
where the coefficient matrix $[f(\xi_i,\theta_j)]_{i,j}$ is "diagonalized" by an SVD.

19. Semi-separable approximation by polynomials
$\text{error}\le\|f - I_x[I_y[f]]\|_\infty = \|f - I_x[f] + I_x[f] - I_x[I_y[f]]\|_\infty \le \|f - I_x[f]\|_\infty + \|I_x\|_\infty\,\|f - I_y[f]\|_\infty,$
with Lebesgue constant $\|I_x\|_\infty\sim\log r$ when using Chebyshev interpolation nodes.
The polynomial interpolation error bound is typically much too pessimistic.
◮ Lebesgue constants hit hard in high dimensions: $(\log r)^{d-1}$.
◮ Severe theoretical barriers for general smooth multivariate functions: E. Novak and H. Woźniakowski: Tractability of Multivariate Problems, Volumes I and II. EMS.

20. Semi-separable approximation of $1/(x+y)$
Consider
$f(x,y) = \frac{1}{x+y},\qquad x, y\in[\alpha,\beta],\quad 0 < \alpha < \beta.$
Apply numerical quadrature to the integral representation
$\frac{1}{z} = \int_0^\infty e^{-tz}\,\mathrm{d}t = \sum_{j=1}^{r}\omega_j\, e^{-\gamma_j z} + \text{error}.$
Inserting $z = x + y$:
$\frac{1}{x+y} = \sum_{j=1}^{r}\omega_j\, e^{-\gamma_j(x+y)} + \text{error} = \sum_{j=1}^{r}\omega_j\, e^{-\gamma_j x}\, e^{-\gamma_j y} + \text{error}.$
Choice of nodes $\gamma_j > 0$ and weights $\omega_j > 0$ as in [Stenger'93, Braess'86, Braess/Hackbusch'05]:
$\text{error}\le\frac{8}{|\alpha|}\exp\Big(-\frac{r\,\pi^2}{\log(8\beta/\alpha)}\Big).$
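A quick MATLAB sketch (illustrative, not from the slides; the interval and grid size are arbitrary choices) that exhibits the resulting singular value decay numerically: sample $1/(x_i+y_j)$ on a grid and plot the singular values, which decay rapidly, in line with the exponential-sum bound above. Implicit expansion requires MATLAB R2016b or later.

```matlab
alpha = 1; beta = 100; n = 200;
x = linspace(alpha, beta, n)';           % column vector
y = linspace(alpha, beta, n);            % row vector
F = 1 ./ (x + y);                        % F(i,j) = 1/(x(i) + y(j)) via implicit expansion
s = svd(F);
semilogy(s / s(1), 'o-');
xlabel('index r + 1'); ylabel('\sigma_{r+1} / \sigma_1');
```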

21. Semi-separable approximation by exponential sums
◮ Consider the more general case of a function $f(x,y) := g(x+y)$.
◮ Approximation of $g(z)$ with $z := x+y$ by an exponential sum
$g(z)\approx\sum_{j=1}^{r}\omega_j\exp(\gamma_j z)\qquad(1)$
for some coefficients $\gamma_j, \omega_j\in\mathbb{R}$.
◮ (1) gives a semi-separable approximation of $f$:
$f(x,y) = g(x+y)\approx\sum_{j=1}^{r}\omega_j\exp(\gamma_j(x+y)) = \sum_{j=1}^{r}\omega_j\exp(\gamma_j x)\exp(\gamma_j y).$
◮ Naturally extends to arbitrarily many variables.
◮ Problem: (1) is a nontrivial approximation problem [Braess'1986], [Hackbusch'2006], ...

22. Low-rank approximation of snapshot matrices
Vector-valued function $x(\alpha): [\alpha_{\min}, \alpha_{\max}]\to\mathbb{R}^n$.
Sampling at $\alpha_1,\dots,\alpha_m\in[\alpha_{\min},\alpha_{\max}]$ yields the snapshot matrix
$X = \big[x(\alpha_1),\ x(\alpha_2),\ \dots,\ x(\alpha_m)\big].$

23. Example: Baking 1 cookie
Stationary heat equation with piecewise constant heat conductivity $\sigma(x,\alpha)$:
$-\nabla\cdot(\sigma(x,\alpha)\nabla u) = f$ in $\Omega = [-1,1]^2$, $u = 0$ on $\partial\Omega$,
◮ $\sigma(\text{baking tray}) = 1$
◮ $\sigma(\text{cookie}) = 1 + \alpha$
◮ undetermined parameter $\alpha\in[\alpha_{\min},\alpha_{\max}]$.
[Figure: FE mesh of the domain; # vertices: 455, # elements: 825, # edges: 1279.]
Standard FE discretization results in the linearly parameter-dependent linear system
$(A_0 + \alpha A_1)\, x(\alpha) = b.$
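A toy MATLAB sketch that reproduces the qualitative behavior; the cookie FE matrices are not given here, so A0, A1, b, and the parameter range are synthetic stand-ins. It builds a linearly parameter-dependent system, collects snapshots, and plots the singular values of the snapshot matrix.

```matlab
rng(2);
n = 371; m = 101;                        % sizes matching the next slide
A0 = gallery('tridiag', n);              % SPD stand-in for the stiffness matrix (assumption)
A1 = spdiags(rand(n, 1), 0, n, n);       % stand-in for the parameter-dependent part
b  = ones(n, 1);
alphas = linspace(0, 10, m);
X = zeros(n, m);
for k = 1:m
    X(:, k) = (A0 + alphas(k) * A1) \ b; % snapshot x(alpha_k)
end
semilogy(svd(X), 'o-');
title('Singular values of the snapshot matrix');
```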

24. Singular value decay – observation
◮ 1 cookie: $n = 371$, $m = 101$.
[Figure: $\log_{10}$ of the singular values of the snapshot matrix, plotted for indices 1 to 100.]
◮ Foundation of Proper Orthogonal Decomposition and Reduced Basis Methods.

25. Singular value decay – explanation
Polynomial approximation:
$x(\alpha) = x_0 + \alpha x_1 + \alpha^2 x_2 + \cdots + \alpha^{k-1}x_{k-1} + \text{error}.$
Approximation error:
◮ Assume $b(\cdot)$, $A(\cdot)$ analytic ⇒ $x(\cdot)$ analytic.
◮ Then $\text{error}\lesssim\rho^{-k}$, where $\rho > 1$ depends on the domain of analyticity of $A$, $b$.
(Proof: direct extension of the classical result for scalar-valued functions.)

26. Singular value decay – explanation
Polynomial approximation:
$x(\alpha) = x_0 + \alpha x_1 + \alpha^2 x_2 + \cdots + \alpha^{k-1}x_{k-1} + \text{error}.$
Snapshot matrix:
$X = \big[x(\alpha_1), x(\alpha_2),\dots,x(\alpha_m)\big] = \big[x_0, x_1,\dots,x_{k-1}\big]\begin{pmatrix}1 & 1 & \cdots & 1\\ \alpha_1 & \alpha_2 & \cdots & \alpha_m\\ \vdots & \vdots & & \vdots\\ \alpha_1^{k-1} & \alpha_2^{k-1} & \cdots & \alpha_m^{k-1}\end{pmatrix} + \text{error} = \text{matrix of rank }k + \text{error}.$
⇒ $\sigma_{k+1}(X)\le\text{error}\lesssim\rho^{-k}$.
Remark: trivially extends to the piecewise analytic case.

27. Singular value decay – pw analytic case
Example: consider the smallest singular value $\sigma(z)$ and the corresponding right singular vector $v(z)$ of $B(z) = A - \mathrm{i}zI$ for $z\in[-1,1]$.
◮ $A$ = 2×2 block diagonal randn matrix, $n = 400$.
◮ $\sigma(z)$ only Lipschitz continuous, but piecewise analytic.
◮ $v(z)$ discontinuous, but piecewise analytic.
◮ Snapshot matrix of singular vectors: $X = \big[v(z_1), v(z_2),\dots,v(z_{100})\big]$ for equidistant samples $z_j\in[-1,1]$.
[Figures: $\sigma(z)$ over $z\in[-1,1]$, and the singular values of $X$ on a semilogarithmic scale.]

28. Summary
Need strong singular value decay for good low-rank approximations. For function-related matrices/tensors: strong link to semi-separable approximations. Smoothness seems to be important... at least somehow.
◮ Fortunately, smoothness is not necessary. Piecewise smoothness can be enough.
◮ Unfortunately, smoothness is not sufficient for higher-order tensors.
◮ Need to impose stronger regularity as the dimension/order $d$ increases, based, e.g., on mixed weak derivatives [Yserentant: Regularity and approximability of electronic wave functions, 2010].

29. Low-rank tensors: CP and Tucker
◮ CP
◮ Tucker
◮ Higher-order SVD
◮ Tensor networks

30. CP decomposition
◮ Aim: generalize the concept of low rank from matrices to tensors.
◮ One possibility is motivated by
$X = [a_1, a_2,\dots,a_R]\,[b_1, b_2,\dots,b_R]^T = a_1 b_1^T + a_2 b_2^T + \cdots + a_R b_R^T.$
⇒ vectorization
$\mathrm{vec}(X) = b_1\otimes a_1 + b_2\otimes a_2 + \cdots + b_R\otimes a_R.$
Canonical Polyadic (CP) decomposition of a tensor $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times n_3}$ defined via
$\mathrm{vec}(\mathcal{X}) = c_1\otimes b_1\otimes a_1 + c_2\otimes b_2\otimes a_2 + \cdots + c_R\otimes b_R\otimes a_R$
for vectors $a_j\in\mathbb{R}^{n_1}$, $b_j\in\mathbb{R}^{n_2}$, $c_j\in\mathbb{R}^{n_3}$.
CP directly corresponds to semi-separable approximation.
Tensor rank of $\mathcal{X}$ = minimal possible $R$.
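A minimal MATLAB sketch (illustrative, not from the slides; the sizes are arbitrary) that assembles a rank-$R$ CP tensor directly from the vectorized definition above and checks one entry against the elementwise formula $\mathcal{X}_{i_1,i_2,i_3} = \sum_j a_j(i_1)\, b_j(i_2)\, c_j(i_3)$:

```matlab
n1 = 4; n2 = 5; n3 = 6; R = 3;
A = randn(n1, R); B = randn(n2, R); C = randn(n3, R);   % columns a_j, b_j, c_j
x = zeros(n1 * n2 * n3, 1);
for j = 1:R
    x = x + kron(C(:, j), kron(B(:, j), A(:, j)));      % c_j (x) b_j (x) a_j
end
X = reshape(x, [n1, n2, n3]);                           % the tensor itself
i = [2 3 4];
abs(X(i(1), i(2), i(3)) - sum(A(i(1), :) .* B(i(2), :) .* C(i(3), :)))   % ~ 0
```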

31. CP decomposition
Illustration of the CP decomposition
$\mathrm{vec}(\mathcal{X}) = c_1\otimes b_1\otimes a_1 + c_2\otimes b_2\otimes a_2 + \cdots + c_R\otimes b_R\otimes a_R.$
[Figure: the tensor $\mathcal{X}$ as a sum of $R$ rank-one terms with factors $a_j$, $b_j$, $c_j$.]

32. CP decomposition
◮ The CP decomposition offers low data complexity; for constant $R$: linear complexity in $d$.
◮ For matrices:
  ◮ rank $r$ is upper semi-continuous ⇒ closedness property: a sequence of rank-$r$ matrices can only converge to a matrix of rank $\le r$;
  ◮ the best low-rank approximation can be obtained by successive rank-1 approximations;
  ◮ robust black-box algorithms/software available (svd, Lanczos).
For tensors of order $d\ge 3$:
  ◮ tensor rank $R$ is not upper semi-continuous ⇒ lack of closedness;
  ◮ successive rank-1 approximations fail;
  ◮ all algorithms are based on optimization techniques (ALS, Gauss–Newton).
Picture taken from [Kolda/Bader'2009].

33. Tucker decomposition
◮ Aim: generalize the concept of low rank from matrices to tensors.
◮ An alternative possibility is motivated by
$A = U\cdot\Sigma\cdot V^T,\qquad U\in\mathbb{R}^{n_1\times r},\ V\in\mathbb{R}^{n_2\times r},\ \Sigma\in\mathbb{R}^{r\times r}.$
⇒ vectorization
$\mathrm{vec}(A) = (V\otimes U)\cdot\mathrm{vec}(\Sigma).$
Ignore the diagonal structure of $\Sigma$ and call it $C$.
Tucker decomposition of a tensor $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times n_3}$ defined via
$\mathrm{vec}(\mathcal{X}) = (W\otimes V\otimes U)\cdot\mathrm{vec}(\mathcal{C})$
with $U\in\mathbb{R}^{n_1\times r_1}$, $V\in\mathbb{R}^{n_2\times r_2}$, $W\in\mathbb{R}^{n_3\times r_3}$, and core tensor $\mathcal{C}\in\mathbb{R}^{r_1\times r_2\times r_3}$.
In terms of $\mu$-mode matrix products: $\mathcal{X} = U\circ_1 V\circ_2 W\circ_3\,\mathcal{C} =: (U, V, W)\circ\mathcal{C}$.
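A short MATLAB sketch (illustrative, not from the slides; sizes arbitrary) forming a Tucker tensor directly from the vectorized identity above; MATLAB's column-major vec makes the Kronecker ordering $W\otimes V\otimes U$ come out naturally:

```matlab
n = [6 7 8];  r = [2 3 4];
U = randn(n(1), r(1));  V = randn(n(2), r(2));  W = randn(n(3), r(3));
C = randn(r);                              % core tensor of size 2 x 3 x 4
x = kron(W, kron(V, U)) * C(:);            % vec(X) = (W (x) V (x) U) vec(C)
X = reshape(x, n);                         % the tensor X, of size 6 x 7 x 8
```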

34. Tucker decomposition
Illustration of the Tucker decomposition $\mathcal{X} = (U, V, W)\circ\mathcal{C}$.
[Figure: the tensor $\mathcal{X}$ obtained from the core tensor $\mathcal{C}$ multiplied by $U$, $V$, $W$ in modes 1, 2, 3.]

35. Tucker decomposition
Consider all three matricizations:
$X^{(1)} = U\cdot C^{(1)}\cdot(W\otimes V)^T,\qquad X^{(2)} = V\cdot C^{(2)}\cdot(W\otimes U)^T,\qquad X^{(3)} = W\cdot C^{(3)}\cdot(V\otimes U)^T.$
These are low-rank decompositions:
$\mathrm{rank}\big(X^{(1)}\big)\le r_1,\qquad\mathrm{rank}\big(X^{(2)}\big)\le r_2,\qquad\mathrm{rank}\big(X^{(3)}\big)\le r_3.$
Multilinear rank of a tensor $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times n_3}$ defined by the tuple $(r_1, r_2, r_3)$ with $r_i = \mathrm{rank}\big(X^{(i)}\big)$.

36. Higher-order SVD (HOSVD)
Goal: approximate a given tensor $\mathcal{X}$ by a Tucker decomposition with prescribed multilinear rank $(r_1, r_2, r_3)$.
1. Calculate SVDs of the matricizations: $X^{(\mu)} = \widehat U_\mu\widehat\Sigma_\mu\widehat V_\mu^T$ for $\mu = 1, 2, 3$.
2. Truncate the basis matrices: $U_\mu := \widehat U_\mu(:, 1{:}r_\mu)$ for $\mu = 1, 2, 3$.
3. Form the core tensor: $\mathrm{vec}(\mathcal{C}) := (U_3^T\otimes U_2^T\otimes U_1^T)\cdot\mathrm{vec}(\mathcal{X})$.
Truncated tensor produced by the HOSVD [Lathauwer/De Moor/Vandewalle'2000]:
$\mathrm{vec}\big(\widehat{\mathcal{X}}\big) := (U_3\otimes U_2\otimes U_1)\cdot\mathrm{vec}(\mathcal{C}).$
Remark: $\widehat{\mathcal{X}} = \pi_1\circ\pi_2\circ\pi_3\,\mathcal{X}$ with orthogonal projections $\pi_\mu\mathcal{X} := U_\mu U_\mu^T\circ_\mu\mathcal{X}$.
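A plain-MATLAB sketch of steps 1–3 for a third-order tensor (`hosvd3` is a hypothetical helper of mine, not toolbox code); the matricizations are formed with permute/reshape and the core via the Kronecker identity of step 3:

```matlab
function [C, U] = hosvd3(X, r)
% HOSVD3  Truncated higher-order SVD of a 3rd-order tensor X to multilinear rank r.
  n = size(X);  U = cell(1, 3);
  for mu = 1:3
      Xmu = reshape(permute(X, [mu, setdiff(1:3, mu)]), n(mu), []);  % mode-mu matricization
      [Uhat, ~, ~] = svd(Xmu, 'econ');
      U{mu} = Uhat(:, 1:r(mu));                                      % truncated basis
  end
  c = kron(U{3}', kron(U{2}', U{1}')) * X(:);                        % vec(C)
  C = reshape(c, r);
end
```

Usage, including reconstruction of the truncated tensor $\widehat{\mathcal{X}}$:

```matlab
X = randn(10, 12, 14);
[C, U] = hosvd3(X, [5 6 7]);
Xhat = reshape(kron(U{3}, kron(U{2}, U{1})) * C(:), size(X));
```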

37. Higher-order SVD (HOSVD)
The tensor $\widehat{\mathcal{X}}$ resulting from the HOSVD satisfies the quasi-optimality condition
$\|\mathcal{X} - \widehat{\mathcal{X}}\|\le\sqrt{d}\,\|\mathcal{X} - \mathcal{X}_{\mathrm{best}}\|,$
where $\mathcal{X}_{\mathrm{best}}$ is the best approximation of $\mathcal{X}$ with multilinear ranks $(r_1,\dots,r_d)$.
Proof:
$\|\mathcal{X} - \widehat{\mathcal{X}}\|^2 = \|\mathcal{X} - (\pi_1\circ\pi_2\circ\pi_3)\mathcal{X}\|^2 = \|\mathcal{X} - \pi_1\mathcal{X}\|^2 + \|\pi_1\mathcal{X} - (\pi_1\circ\pi_2)\mathcal{X}\|^2 + \|(\pi_1\circ\pi_2)\mathcal{X} - (\pi_1\circ\pi_2\circ\pi_3)\mathcal{X}\|^2 \le \|\mathcal{X} - \pi_1\mathcal{X}\|^2 + \|\mathcal{X} - \pi_2\mathcal{X}\|^2 + \|\mathcal{X} - \pi_3\mathcal{X}\|^2.$
Using $\|\mathcal{X} - \pi_\mu\mathcal{X}\|\le\|\mathcal{X} - \mathcal{X}_{\mathrm{best}}\|$ for $\mu = 1, 2, 3$ leads to
$\|\mathcal{X} - \widehat{\mathcal{X}}\|^2\le 3\cdot\|\mathcal{X} - \mathcal{X}_{\mathrm{best}}\|^2.$
Best approximation: see [Kolda/Bader'09].

38. Tucker decomposition – Summary
For general tensors:
◮ the multilinear rank is upper semi-continuous ⇒ closedness property;
◮ HOSVD – a simple and robust algorithm to obtain a quasi-optimal low-rank approximation;
◮ quasi-optimality is good enough for most applications in scientific computing;
◮ robust black-box algorithms/software available (e.g., Tensor Toolbox).
Drawback: storage of the core tensor $\sim r^d$ ⇒ curse of dimensionality.

39. Tensor network diagrams
Tensor network = undirected graph with:
◮ each node is a tensor;
◮ each outgoing (free) edge is a mode;
◮ each connecting edge represents a contraction; example:
$\mathcal{Z}_{i_1, i_2, i_3, i_4} = \sum_{j=1}^{r}\mathcal{X}_{i_1, i_2, j}\,\mathcal{Y}_{j, i_3, i_4};$
◮ number of free edges = order of the tensor represented by the entire network.
Researchers on quantum many-body problems think (and dream) in terms of tensor networks!

40. Tensor network diagrams
Examples: (i) vector; (ii) matrix; (iii) matrix-matrix multiplication; (iv) Tucker decomposition; (v) hierarchical Tucker decomposition.
[Figure: the five corresponding tensor network diagrams.]

41. Low-rank tensors: Hierarchical Tucker
◮ Introduction of the Hierarchical Tucker Decomposition (HTD)
◮ MATLAB toolbox htucker
◮ Basic operations: $\mu$-mode matrix multiplication, addition, ...
◮ Advanced operations: inner product, elementwise multiplication, ...

42. Introduction
◮ CP offers low data complexity but difficult truncation;
◮ Tucker offers simple truncation but high data complexity.
Recently developed formats:
◮ Matrix Product States (MPS),
◮ TT decomposition,
◮ Hierarchical Tucker decomposition (HTD).
They aim to offer a compromise between CP and Tucker. Focus in this lecture: HTD.
◮ L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl., 31(4):2029–2054, 2010.
◮ W. Hackbusch and S. Kühn. A new scheme for the tensor representation. J. Fourier Anal. Appl., 15(5):706–722, 2009.
◮ D. Kressner and C. Tobler. htucker – A MATLAB toolbox for the hierarchical Tucker decomposition. In preparation. See http://www.math.ethz.ch/~ctobler.

43. More general matricizations
Recall the $\mu$-mode matricization of a tensor $\mathcal{X}$:
$X^{(\mu)}\in\mathbb{R}^{n_\mu\times(n_1\cdots n_{\mu-1}n_{\mu+1}\cdots n_d)},\qquad\mu = 1,\dots,d.$
It is getting ugly... General matricization for a mode decomposition $\{1,\dots,d\} = t\cup s$:
$X^{(t)}\in\mathbb{R}^{(n_{t_1}\cdots n_{t_k})\times(n_{s_1}\cdots n_{s_{d-k}})}$
with
$\big(X^{(t)}\big)_{(i_{t_1},\dots,i_{t_k}),\,(i_{s_1},\dots,i_{s_{d-k}})} := \mathcal{X}_{i_1,\dots,i_d}.$
[Figure: the tensor $\mathcal{X}$ and its matricizations $X^{(1)}$ and $X^{(1,2)}$.]
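A tiny MATLAB helper (`matricize` is my own hypothetical name, not part of htucker) that forms $X^{(t)}$ for an arbitrary mode subset $t$ via permute/reshape:

```matlab
function Xt = matricize(X, t)
% MATRICIZE  Matricization X^(t): modes in t become rows, the rest become columns.
  d = ndims(X);  n = size(X);
  s = setdiff(1:d, t);                                     % complementary modes
  Xt = reshape(permute(X, [t, s]), prod(n(t)), prod(n(s)));
end
```

For example, `matricize(randn(3,4,5,6), [1 2])` returns a 12 x 30 matrix.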

44. Hierarchical construction
Singular value decomposition: $X^{(t)} = U_t\Sigma_t U_s^T$.
Column spaces are nested:
$t = t_1\cup t_2\ \Rightarrow\ \mathrm{span}(U_t)\subset\mathrm{span}(U_{t_2}\otimes U_{t_1})\ \Rightarrow\ \exists\, B_t:\ U_t = (U_{t_2}\otimes U_{t_1})\, B_t.$
Size of $U_t$: $U_t\in\mathbb{R}^{n_{t_1}\cdots n_{t_k}\times r_t}$ with $r_t = \mathrm{rank}\big(X^{(t)}\big)$.
For $d = 4$:
$U_{12} = (U_2\otimes U_1)\, B_{12},\qquad U_{34} = (U_4\otimes U_3)\, B_{34},\qquad \mathrm{vec}(\mathcal{X}) = X^{(1234)} = (U_{34}\otimes U_{12})\, B_{1234}$
⇒ $\mathrm{vec}(\mathcal{X}) = (U_4\otimes U_3\otimes U_2\otimes U_1)(B_{34}\otimes B_{12})\, B_{1234}.$

45. Dimension tree
Tree structure for $d = 4$: the root holds $B_{1234}$ ($r_{12}r_{34}\times 1$), its children hold $B_{12}$ ($r_1 r_2\times r_{12}$) and $B_{34}$ ($r_3 r_4\times r_{34}$), and the leaves hold $U_1$ ($n_1\times r_1$), $U_2$ ($n_2\times r_2$), $U_3$ ($n_3\times r_3$), $U_4$ ($n_4\times r_4$).
Reshape the transfer matrices into transfer tensors:
$B_{12}\in\mathbb{R}^{r_1 r_2\times r_{12}}\Rightarrow\mathcal{B}_{12}\in\mathbb{R}^{r_1\times r_2\times r_{12}},\qquad B_{34}\in\mathbb{R}^{r_3 r_4\times r_{34}}\Rightarrow\mathcal{B}_{34}\in\mathbb{R}^{r_3\times r_4\times r_{34}},\qquad B_{1234}\in\mathbb{R}^{r_{12}r_{34}\times 1}\Rightarrow\mathcal{B}_{1234}\in\mathbb{R}^{r_{12}\times r_{34}}.$

46. Dimension tree
[Figure: the dimension tree with leaves $U_1, U_2, U_3, U_4$ and transfer tensors $B_{12}$, $B_{34}$, $B_{1234}$.]
◮ Often, $U_1, U_2, U_3, U_4$ are orthonormal. This is advantageous but not required.
◮ Storage requirements for general $d$: $O(dnr) + O(dr^3)$, where $r = \max\{r_t\}$, $n = \max\{n_\mu\}$.

47. Constructors for the MATLAB class htensor
x = htensor([4 5 6 7]) constructs a zero htensor of size 4 × 5 × 6 × 7, with a balanced dimension tree.
x = htensor([4 5 6 7], 'TT') constructs a zero htensor of size 4 × 5 × 6 × 7, with a TT-style dimension tree.
x = htensor({U1, U2, U3}) constructs an htensor from a tensor in CP decomposition, $\mathcal{X}(i_1, i_2, i_3) = \sum_j U_1(i_1, j)\, U_2(i_2, j)\, U_3(i_3, j)$.
x = htenrandn([4 5 6 7]) constructs an htensor of size 4 × 5 × 6 × 7, with random ranks and random entries.
x = htenones([4 5 6 7]) constructs an htensor of size 4 × 5 × 6 × 7 with all entries equal to one.
...

48. Basic functionality for the MATLAB class htensor
Example: x is an htensor of order 4.
x(1, 3, 4, 2) returns an entry of $\mathcal{X}$.
x(1, 3, :, :) returns a slice of $\mathcal{X}$ as an htensor.
full(x) returns the full tensor represented by $\mathcal{X}$ (use with care).
disp_tree(htenrand([5 4 6 3])) prints the tree structure and the ranks:
[console output: "ans is an htensor of size 5 x 4 x 6 x 3", followed by the list of tree nodes with their dimensions and ranks]
spy(x) displays spy plots of $U_t$, $B_t$ on the dimension tree.
change_root(x, i) switches the root node.

49. Singular value tree
plot_sv(x) plots the singular values of the corresponding matricizations in the dimension tree of a tensor $\mathcal{X}$.
Example: singular value tree of the solution to an elliptic PDE with 4 parameters.
[Figure: singular value plots for the tree nodes with modes {1,2}, {3,4,5}, {1}, {2}, {3}, {4,5}, {4}, {5}.]
Remark: the singular values are computed from Gramians.

50. Basic ops: µ-mode matrix multiplication
Application of a matrix $A\in\mathbb{R}^{m\times n_\mu}$ to mode $\mu$ of $\mathcal{X}\in\mathbb{R}^{n_1\times\cdots\times n_d}$:
$\mathcal{Y} = A\circ_\mu\mathcal{X}\quad\Leftrightarrow\quad Y^{(\mu)} = A\, X^{(\mu)}.$
Nearly trivial if $\mathcal{X}$ is in $\mathcal{H}$-Tucker format:
$A\circ_\mu\mathcal{X} = A\circ_\mu\big((U_1,\dots,U_d)\circ\mathcal{C}\big) = (U_1,\dots,U_{\mu-1}, AU_\mu, U_{\mu+1},\dots,U_d)\circ\mathcal{C}.$
◮ Almost no operations required.
◮ Ranks stay the same.
◮ Orthogonality destroyed.
ttm(x, A, 2) applies the matrix A to the htensor $\mathcal{X}$ in mode 2.
y = ttm(x, {A, B, C}, [2, 3, 4]) applies the matrices A, B, C in modes 2, 3, 4.
y = ttm(x, @(x)(fft(x)), 2) applies an FFT in mode 2.
y = ttm(x, {A, B, C}, [2, 3, 4], 'h') successively applies the matrices $A^T$, $B^T$, $C^T$ in modes 2, 3, 4.

51. Addition of low-rank matrices
Addition of two matrices in low-rank format, $A = U_1\Sigma_A U_2^T$ and $B = V_1\Sigma_B V_2^T$:
$A + B = [U_1\ V_1]\begin{pmatrix}\Sigma_A & 0\\ 0 & \Sigma_B\end{pmatrix}[U_2\ V_2]^T.$
◮ No operations required.
◮ Rank increases.
◮ Orthogonality destroyed.
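A quick MATLAB sketch of the factored addition (illustrative only, random factors); the dense matrices are formed here only to verify the result:

```matlab
rng(3);
n = 500; m = 400; rA = 5; rB = 7;
U1 = randn(n, rA); U2 = randn(m, rA); SA = diag(rand(rA, 1));
V1 = randn(n, rB); V2 = randn(m, rB); SB = diag(rand(rB, 1));

W1 = [U1, V1];                % n x (rA + rB)
W2 = [U2, V2];                % m x (rA + rB)
S  = blkdiag(SA, SB);

err = norm(W1 * S * W2' - (U1 * SA * U2' + V1 * SB * V2'), 'fro')   % ~ 0
```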

52. Addition of low-rank tensors
Addition of four tensors $\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3, \mathcal{X}_4$ in $\mathcal{H}$-Tucker format: $\mathcal{X}_1 + \mathcal{X}_2 + \mathcal{X}_3 + \mathcal{X}_4$.
Proceed as in the matrix case by embedding the factors in larger matrices (block structure shown on the next slide).
◮ No operations required.
◮ $\mathcal{H}$-Tucker rank increases.
◮ Orthogonality destroyed.
Command in htucker: x1 + x2 + x3 + x4

53. [Figure: factors of the sum of four $\mathcal{H}$-Tucker tensors — the leaf factors $U_\mu^{[1]},\dots,U_\mu^{[4]}$ are concatenated and the transfer tensors $B_{12}^{[i]}$, $B_{34}^{[i]}$, $B_{1234}^{[i]}$ are embedded block-wise.]

54. Orthogonalization
Any tensor $\mathcal{X}$ in $\mathcal{H}$-Tucker format can be orthogonalized, in the sense that all factors in the dimension tree, except for the root node, have orthonormal columns.
Example: $\mathrm{vec}(\mathcal{X}) = (U_4\otimes U_3\otimes U_2\otimes U_1)(B_{34}\otimes B_{12})\, B_{1234}$.
Step 1: QR decompositions $U_t = Q_t R_t$ ⇒
$\mathrm{vec}(\mathcal{X}) = (Q_4\otimes Q_3\otimes Q_2\otimes Q_1)(\widetilde B_{34}\otimes\widetilde B_{12})\, B_{1234}$ with $\widetilde B_{34} := (R_4\otimes R_3)B_{34}$, $\widetilde B_{12} := (R_2\otimes R_1)B_{12}$.
Step 2: QR decompositions $\widetilde B_{34} = Q_{34}R_{34}$, $\widetilde B_{12} = Q_{12}R_{12}$ ⇒
$\mathrm{vec}(\mathcal{X}) = (Q_4\otimes Q_3\otimes Q_2\otimes Q_1)(Q_{34}\otimes Q_{12})\,\widetilde B_{1234}$ with $\widetilde B_{1234} := (R_{34}\otimes R_{12})B_{1234}$.
Computational requirements for general $d$: $O(dnr^2) + O(dr^4)$.
Command in htucker: x = orthog(x)

55. Norms and inner products
Inner product of two tensors $\mathcal{X}, \mathcal{Y}\in\mathbb{R}^{n_1\times\cdots\times n_d}$:
$\langle\mathcal{X},\mathcal{Y}\rangle = \langle\mathrm{vec}(\mathcal{X}),\mathrm{vec}(\mathcal{Y})\rangle = \sum_{i_1=1}^{n_1}\cdots\sum_{i_d=1}^{n_d} x_{i_1,\dots,i_d}\, y_{i_1,\dots,i_d}.$
Can be performed efficiently in $\mathcal{H}$-Tucker, provided that $\mathcal{X}$, $\mathcal{Y}$ have compatible dimension trees.
Example, for two tensors of order 4:
$\langle\mathcal{X},\mathcal{Y}\rangle = (B^x_{1234})^T(B^x_{34}\otimes B^x_{12})^T(U^x_4\otimes U^x_3\otimes U^x_2\otimes U^x_1)^T\,(U^y_4\otimes U^y_3\otimes U^y_2\otimes U^y_1)(B^y_{34}\otimes B^y_{12})\, B^y_{1234}.$
Norm: after $\mathcal{X}$ has been orthogonalized,
$\|\mathcal{X}\| = \sqrt{\langle\mathcal{X},\mathcal{X}\rangle} = \|B^x_{12\cdots d}\|_F.$
Possibly the most accurate way to compute the norm. Used in norm(x).

56.–60. Computation of inner products
$\langle\mathcal{X},\mathcal{Y}\rangle = \sum_{i_1=1}^{n_1}\cdots\sum_{i_d=1}^{n_d} x_{i_1,\dots,i_d}\, y_{i_1,\dots,i_d}.$
[Figure-only slides: the inner product is evaluated by contracting the two dimension trees node by node, proceeding from the leaves to the root.]

61. Computation of inner products – contraction step
At a node $t$ with children $t_1$, $t_2$:
$(U^x_t)^T U^y_t = (B^x_t)^T\big((U^x_{t_2})^T U^y_{t_2}\otimes(U^x_{t_1})^T U^y_{t_1}\big)\, B^y_t.$
◮ htucker command: innerprod(x,y)
◮ Overall cost: $O(dnr^2) + O(dr^4)$.

62. Reduced Gramians in $\mathcal{H}$-Tucker
$X^{(t)} = U_t V_t^T\quad\Rightarrow\quad X^{(t)}\big(X^{(t)}\big)^T = U_t\underbrace{V_t^T V_t}_{=:G_t}U_t^T.$
If $U_t$ is orthonormal ⇒ the singular values of $X^{(t)}$ are the square roots of the eigenvalues of $G_t$ (used in plot_sv).

63.–67. Reduced Gramians in $\mathcal{H}$-Tucker
[Figure-only slides: successive steps of the recursive computation of the reduced Gramians $G_t$ in the dimension tree.]

68. Reduced Gramians in $\mathcal{H}$-Tucker
Implemented in the htucker command gramians(x).

69. Advanced operations
◮ Truncation
◮ Combined addition + truncation
◮ Elementwise multiplication
◮ Elementwise reciprocal

70. Truncation of an explicit tensor
Let $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times\cdots\times n_d}$ be explicitly given.
◮ For each tree node $t$, let $W_t$ contain the $r_t$ dominant left singular vectors of $X^{(t)}$ and define the projection
$\pi_t\mathcal{X} = W_t W_t^T\circ_t\mathcal{X}\quad\Leftrightarrow\quad\pi_t X^{(t)} = W_t W_t^T X^{(t)}.$
◮ Truncated tensor:
$\widetilde{\mathcal{X}} := \Big(\prod_{t\in T_L}\pi_t\Big)\cdots\Big(\prod_{t\in T_1}\pi_t\Big)\,\mathcal{X},$
where $T_\ell$ contains all nodes on level $\ell$.
◮ [Grasedyck'2010]: $\|\mathcal{X} - \widetilde{\mathcal{X}}\|\le\sqrt{2d-3}\,\|\mathcal{X} - \mathcal{X}_{\mathrm{best}}\|$. Proof similar to the HOSVD case.

71. Truncation of an explicit tensor
Example:
$\mathrm{vec}(\widetilde{\mathcal{X}}) = (W_4 W_4^T\otimes W_3 W_3^T\otimes W_2 W_2^T\otimes W_1 W_1^T)(W_{34}W_{34}^T\otimes W_{12}W_{12}^T)\,\mathrm{vec}(\mathcal{X}) = (W_4\otimes W_3\otimes W_2\otimes W_1)\,\big(\underbrace{[W_4^T\otimes W_3^T]W_{34}}_{=:B_{34}}\otimes\underbrace{[W_2^T\otimes W_1^T]W_{12}}_{=:B_{12}}\big)\,\underbrace{\big([W_{34}^T\otimes W_{12}^T]\,\mathrm{vec}(\mathcal{X})\big)}_{=:B_{1234}}.$
opts.max_rank = 10 — maximal rank at truncation.
opts.rel_eps = 1e-6 — maximal relative truncation error.
opts.abs_eps = 1e-6 — maximal absolute truncation error.
The condition max_rank takes precedence over rel_eps and abs_eps.
xt = htensor.truncate_rtl(x, opts) returns the truncated tensor $\widetilde{\mathcal{X}}$ of a multidimensional array.
Remark: there is also a significantly faster htensor.truncate_ltr (which proceeds successively from the leaves to the root), for which the same error bound holds [Tobler'10].

72. Truncation of an $\mathcal{H}$-Tucker tensor
Let $\mathcal{X}\in\mathbb{R}^{n_1\times n_2\times\cdots\times n_d}$ be in $\mathcal{H}$-Tucker format and orthogonalized.
◮ Compute the left singular vectors of $X^{(t)} = U_t V_t^T$ from the eigenvectors of
$X^{(t)}\big(X^{(t)}\big)^T = U_t\underbrace{V_t^T V_t}_{=G_t}U_t^T,$
with the reduced Gramian $G_t$. If $S_t$ contains the $r_t$ dominant eigenvectors of $G_t$ ⇒ $W_t = U_t S_t$.
◮ Traverse the tree from the root to the leaves. In each step, the transfer tensor $B_t$ is compressed with $S_t^T$ and the corresponding mode of the parent transfer tensor $B_{t_p}$ is updated with $S_t$.
◮ In htucker: truncate(x, opts). Complexity $O(dnr^2 + dr^4)$.

73. Combined addition + truncation
Sum of more than two tensors: $\mathcal{Y} = \mathcal{X}_1 + \mathcal{X}_2 + \cdots + \mathcal{X}_s$.
Two possibilities to incorporate the truncation operator $\mathcal{T}$:
1. $\mathcal{Y}\approx\mathcal{T}(\mathcal{X}_1 + \mathcal{X}_2 + \mathcal{X}_3 + \cdots + \mathcal{X}_s)$
2. $\mathcal{Y}\approx\mathcal{T}(\cdots\mathcal{T}(\mathcal{T}(\mathcal{X}_1 + \mathcal{X}_2) + \mathcal{X}_3) + \cdots + \mathcal{X}_s)$
Option 2 is usually significantly cheaper but may suffer from severe cancellation.
Artificial example: $\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3\in\mathbb{R}^{101\times101\times101}$ are truncated tensor-grid discretizations of the summands of
$f(x_1, x_2, x_3) = \tan(x_1+x_2+x_3) + (x_1+x_2+x_3)^{-1} - \tan(x_1+x_2+x_3).$
Error(Option 1) ≈ $10^{-7}$. Error(Option 2) ≈ 1.3.
What is wrong with Option 1?

74. Combined addition + truncation
[Figure: block structure of the leaf factors $U_\mu^{[i]}$ and transfer tensors $B_t^{[i]}$ of the sum $\mathcal{X}_1 + \cdots + \mathcal{X}_4$.]
◮ Orthogonalization (needed before truncation) destroys the block-diagonal structure.
◮ Complexity $O(dns^2r^2 + ds^4r^4)$ for $s$ summands.

75. Combined addition + truncation
Idea: a new variant delays orthogonalization in order to keep the block-diagonal structure in the transfer tensors as long as possible. This reduces $O(dns^2r^2 + ds^4r^4)$ to $O(dns^2r^2 + ds^2r^4 + ds^3r^3)$.
[Figure: runtime vs. number of summands for standard truncation, truncation of the sum, and successive truncation, together with $O(t)$, $O(t^2)$, $O(t^4)$ reference lines.]
◮ htucker command: add_truncate(x1 x2 x3 x4, opts).

76. Elementwise multiplication
Elementwise multiplication (also called Hadamard or Schur product) of two low-rank matrices $A = U_1\Sigma_A U_2^T$, $B = V_1\Sigma_B V_2^T$:
$A\star B = (U_1\mathbin{\widetilde\odot}V_1)(\Sigma_A\otimes\Sigma_B)(U_2\mathbin{\widetilde\odot}V_2)^T,$
with the row-wise Khatri–Rao product
$C\mathbin{\widetilde\odot}D = \begin{pmatrix}c_1^T\\ \vdots\\ c_n^T\end{pmatrix}\mathbin{\widetilde\odot}\begin{pmatrix}d_1^T\\ \vdots\\ d_n^T\end{pmatrix} = \begin{pmatrix}c_1^T\otimes d_1^T\\ \vdots\\ c_n^T\otimes d_n^T\end{pmatrix}.$
◮ Orthogonality destroyed.
◮ Rank increases significantly. But: the singular value decay of $\Sigma_A\otimes\Sigma_B$ may become significantly stronger ⇒ additional opportunities for truncation.
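A short MATLAB sketch (illustrative only, random factors) verifying the Hadamard-product formula; the row-wise Khatri–Rao product is implemented with repelem/repmat:

```matlab
rng(4);
n = 60; m = 50; rA = 3; rB = 4;
U1 = randn(n, rA); U2 = randn(m, rA); SA = diag(rand(rA, 1));
V1 = randn(n, rB); V2 = randn(m, rB); SB = diag(rand(rB, 1));
A = U1 * SA * U2';  B = V1 * SB * V2';

% Row-wise Khatri-Rao product: row i of rkr(C, D) equals kron(C(i,:), D(i,:)).
rkr = @(C, D) repelem(C, 1, size(D, 2)) .* repmat(D, 1, size(C, 2));

AB = rkr(U1, V1) * kron(SA, SB) * rkr(U2, V2)';
norm(AB - A .* B, 'fro')                 % ~ machine precision
```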

77. Elementwise multiplication
Elementwise multiplication of two tensors $\mathcal{X}$, $\mathcal{Y}$ in $\mathcal{H}$-Tucker format:
◮ row-wise Khatri–Rao product of the leaf matrices;
◮ "Kronecker product" of the non-leaf transfer tensors;
◮ optional: products are only formed after suitable truncation to avoid excessive memory requirements.
Commands in htucker:
x.*y (without truncation)
x.^2 (without truncation)
elem_mult(x, y, opt) (with truncation)
