Shared Memory Parallelization of MTTKRP for Dense Tensors
BLIS Retreat 2017, September 18th
Koby Hayashi, Grey Ballard, Yujie Jiang, Michael Tobia
{hayakb13, ballard, jiany14, tobiamj}@wfu.edu
Neuroimaging Application
Tensor: Time × Subjects × Voxel Correlation Matrix
Test: Rest × Activity × Recovery
Subjects: Control, MDD, SAD, COMO
Quick Introduction to Tensors
Tensors are multidimensional arrays; an N-dimensional tensor is said to be N-way or order-N.
[Figure: examples of 2-way, 3-way, 4-way, and 5-way tensors]
CP Decomposition
Canonical Polyadic Decomposition (CP): decomposes a tensor into a sum of rank-1 tensors.
X ≈ Σ_{r=0}^{R−1} u_r ∘ v_r ∘ w_r = ⟦U, V, W⟧
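As a concrete illustration (not from the talk), here is a minimal NumPy sketch of the CP model above: reassembling a 3-way tensor from its factor matrices, where each column triple contributes one rank-1 outer product. The helper name `cp_reconstruct` is hypothetical.

```python
import numpy as np

def cp_reconstruct(U, V, W):
    # Sum of outer products: X[i,j,k] = sum_r U[i,r] * V[j,r] * W[k,r]
    # which is exactly X = [[U, V, W]] from the slide.
    return np.einsum('ir,jr,kr->ijk', U, V, W)

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
U = rng.standard_normal((I, R))
V = rng.standard_normal((J, R))
W = rng.standard_normal((K, R))
X = cp_reconstruct(U, V, W)   # a rank-(at most R) 4x5x6 tensor
```

A single `einsum` call performs all R outer products and the sum at once.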
CP via Alternating Least Squares
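The slide names the standard fitting procedure; a minimal sketch of one CP-ALS sweep for a 3-way tensor follows, assuming the naïve MTTKRP (unfold, form the full KRP, multiply) and normal-equation solves via the Hadamard product of Gram matrices. Helper names (`khatri_rao`, `als_sweep`) are hypothetical, not from the talk.

```python
import numpy as np

def khatri_rao(B, C):
    # Column-wise Kronecker product
    return np.column_stack([np.kron(B[:, r], C[:, r]) for r in range(B.shape[1])])

def als_sweep(X, U, V, W):
    # Update each factor in turn: M = X_(n) * KRP, solved against the
    # Hadamard product of the other factors' Gram matrices.
    M = X.reshape(X.shape[0], -1) @ khatri_rao(V, W)
    U = M @ np.linalg.pinv((V.T @ V) * (W.T @ W))
    M = np.moveaxis(X, 1, 0).reshape(X.shape[1], -1) @ khatri_rao(U, W)
    V = M @ np.linalg.pinv((U.T @ U) * (W.T @ W))
    M = np.moveaxis(X, 2, 0).reshape(X.shape[2], -1) @ khatri_rao(U, V)
    W = M @ np.linalg.pinv((U.T @ U) * (V.T @ V))
    return U, V, W

# On an exactly rank-R tensor, a sweep starting from the true factors
# leaves the fit exact.
rng = np.random.default_rng(1)
I, J, K, R = 4, 3, 5, 2
U0, V0, W0 = (rng.standard_normal((d, R)) for d in (I, J, K))
X = np.einsum('ir,jr,kr->ijk', U0, V0, W0)
U1, V1, W1 = als_sweep(X, U0, V0, W0)
```

Each least-squares solve needs only an R × R inverse because the KRP's Gram matrix factors into a Hadamard product, as the next slide notes.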
Hadamard Product
Element-wise matrix product, denoted *:
D = B * C,  D_{ij} = B_{ij} · C_{ij}
In ALS it forms the R × R Gram matrix
V = (U_0^T U_0) * … * (U_{n−1}^T U_{n−1}) * (U_{n+1}^T U_{n+1}) * … * (U_{N−1}^T U_{N−1})
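A short sketch of the Gram-matrix formula above (the helper name `gram_hadamard` is hypothetical): accumulate the Hadamard product of U_m^T U_m over every mode except n.

```python
import numpy as np

def gram_hadamard(factors, n):
    # V = hadamard product over m != n of (U_m^T U_m), an R x R matrix
    R = factors[0].shape[1]
    V = np.ones((R, R))
    for m, U in enumerate(factors):
        if m != n:
            V *= U.T @ U   # element-wise accumulate each Gram matrix
    return V

rng = np.random.default_rng(2)
factors = [rng.standard_normal((d, 3)) for d in (4, 5, 6)]
V1 = gram_hadamard(factors, 1)   # skips mode 1's factor
```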
Khatri-Rao Product
Khatri-Rao Product (KRP): K = B ⊙ C, where B is I_B × R and C is I_C × R, so K is (I_B · I_C) × R.
Column-wise Kronecker product: K(:, r) = B(:, r) ⊗ C(:, r)
Or Hadamard product of rows: K(i_C + i_B · I_C, :) = B(i_B, :) * C(i_C, :)
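The two equivalent definitions on this slide can be checked directly in NumPy; the helper `khatri_rao` is a hypothetical name for a straightforward column-wise implementation.

```python
import numpy as np

def khatri_rao(B, C):
    # Definition 1: K(:, r) = B(:, r) kron C(:, r)
    return np.column_stack([np.kron(B[:, r], C[:, r]) for r in range(B.shape[1])])

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 2))   # I_B = 4
C = rng.standard_normal((3, 2))   # I_C = 3
K = khatri_rao(B, C)              # (I_B * I_C) x R = 12 x 2

# Definition 2: row i_C + i_B * I_C of K is a Hadamard product of rows
row = B[1, :] * C[2, :]           # should equal K[1*3 + 2, :]
```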
Tensor Fibers
Mode-n fibers of a 3-way tensor X:
n = 0: X(:, j, k);  n = 1: X(i, :, k);  n = 2: X(i, j, :)
Unfolding Tensors
The n-th mode matricization of an N-way tensor X that is I_0 × I_1 × … × I_{N−1} is denoted X_(n) and is I_n × I_{≠n}, where I_{≠n} = ∏_{m≠n} I_m.
X_(i:j) denotes a matricization where {i, i+1, …, j} are the row modes, so its row dimension is ∏_{m ∈ {i,…,j}} I_m.
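Under one common convention (not necessarily the talk's: NumPy's row-major layout, with the remaining modes kept in order), the mode-n unfolding is just a transpose-to-front plus a reshape. The helper `unfold` is hypothetical.

```python
import numpy as np

def unfold(X, n):
    # Move mode n to the front, then flatten the remaining modes into columns.
    # Result is I_n x (product of the other I_m).
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

X = np.arange(24).reshape(2, 3, 4)   # I_0=2, I_1=3, I_2=4
X1 = unfold(X, 1)                    # 3 x 8
```

With this ordering, entry X[i, j, k] lands at X1[j, i·4 + k].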
Matricized Tensor Times Khatri-Rao Product
M = X_(n) (U_0 ⊙ … ⊙ U_{n−1} ⊙ U_{n+1} ⊙ … ⊙ U_{N−1})
Naïve algorithm:
1. Permute X to form X_(n)
2. Form K = (U_0 ⊙ … ⊙ U_{n−1} ⊙ U_{n+1} ⊙ … ⊙ U_{N−1})
3. Call DGEMM
1-Step and 2-Step MTTKRP:
1. Avoid permuting X
2. Efficiently form the KRP
   - 1-Step: K = (U_{N−1} ⊙ … ⊙ U_{n+1} ⊙ U_{n−1} ⊙ … ⊙ U_0)
   - 2-Step: K_L = (U_0 ⊙ … ⊙ U_{n−1}), K_R = (U_{n+1} ⊙ … ⊙ U_{N−1})
3. Utilize BLAS
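A minimal sketch of the naïve algorithm above (helper names are hypothetical; the "DGEMM" is NumPy's `@`): unfold, form the full KRP of all factors except mode n, then one matrix multiply.

```python
import numpy as np

def khatri_rao_list(mats):
    # KRP of a list of matrices, column by column via chained kron
    R = mats[0].shape[1]
    cols = []
    for r in range(R):
        v = mats[0][:, r]
        for M in mats[1:]:
            v = np.kron(v, M[:, r])
        cols.append(v)
    return np.column_stack(cols)

def mttkrp_naive(X, factors, n):
    # Step 1: permute/unfold; Step 2: form K; Step 3: GEMM
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    K = khatri_rao_list([factors[m] for m in range(len(factors)) if m != n])
    return Xn @ K

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 4, 5))
U, V, W = (rng.standard_normal((d, 2)) for d in (3, 4, 5))
M1 = mttkrp_naive(X, [U, V, W], 1)   # 4 x 2
```

The cost the talk attacks is visible here: the explicit permute/copy of X and the memory to materialize K.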
Computing the KRP
Consider K = B ⊙ C ⊙ D.
Each row of K is a Hadamard product of one row from each factor:
K(i_D + i_C · I_D + i_B · I_C · I_D, :) = B(i_B, :) * C(i_C, :) * D(i_D, :)
e.g., K(0, :) = B(0, :) * C(0, :) * D(0, :)
Timings for KRPs of naΓ―ve and reuse algorithms.
1-Step MTTKRP
Avoid permuting tensor entries; cast the computation as matrix multiplication.
Key observation: the n-th mode matricization of a tensor can be obtained by chunking the tensor into contiguous submatrices of equal size.
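The key observation can be demonstrated under a row-major storage assumption (one possible convention; the talk's layout may differ): the flat tensor already consists of contiguous I_n × I_{>n} submatrices, and placing them side by side reproduces the mode-n unfolding without moving any individual entry.

```python
import numpy as np

X = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # I_0=2, I_1=3, I_2=4; unfold mode n=1
left, rows, right = 2, 3, 4                  # modes before n, I_n, modes after n

# Reshaping (a no-copy view) exposes 'left' contiguous rows x right blocks.
blocks = X.reshape(left, rows, right)

# Concatenating the blocks column-wise gives the mode-1 unfolding --
# so a GEMM per block (or one batched GEMM) suffices; no permutation needed.
Xn_blocks = np.concatenate([blocks[b] for b in range(left)], axis=1)  # 3 x 8
```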
Parallel 1-Step MTTKRP
1. Form K_L
2. Form K_R(p, :)
3. Form K(p, :)
4. MatMul
5. Reduce
2-Step MTTKRP
First compute a Partial MTTKRP:
1. Compute K_L and K_R
2. T ← X_(0:n−1)^T K_L, where T is I_n × … × I_{N−1} × R
Second compute a series of ___?___ operations.
a. Tensor Times Vector (TTVs)
b. Tensor Times Matrix (TTMs)
c. Quasi-Tensor Times Matrix (q-TTMs)
2-Step MTTKRP: T
First compute a Partial MTTKRP: T = X_(0:n−1)^T K_L
[Figure: the partial MTTKRP as a single matrix multiplication]
2-Step MTTKRP: T
Second compute a series of TTVs.
[Figure: TTVs applied to blocks of T, producing M(:, r) column by column]
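Putting both steps together, here is a minimal sketch for a 4-way tensor with n = 1, under a row-major convention (the helper name `mttkrp_2step` and the specific blocking are illustrative assumptions, not the talk's implementation): step 1 is one GEMM against K_L = U_0; step 2 contracts the trailing modes of T with columns of the remaining factors, a series of TTVs.

```python
import numpy as np

def mttkrp_2step(X, U0, U2, U3):
    I, J, K, L = X.shape
    R = U0.shape[1]
    # Step 1 (partial MTTKRP, one GEMM):
    # T[j,k,l,r] = sum_i X[i,j,k,l] * U0[i,r]
    T = (X.reshape(I, -1).T @ U0).reshape(J, K, L, R)
    # Step 2 (series of TTVs): for each column r, contract mode l with
    # U3(:, r), then mode k with U2(:, r).
    M = np.empty((J, R))
    for r in range(R):
        Tkr = T[:, :, :, r] @ U3[:, r]   # J x K after the mode-l TTV
        M[:, r] = Tkr @ U2[:, r]         # length-J after the mode-k TTV
    return M

rng = np.random.default_rng(5)
X = rng.standard_normal((2, 3, 4, 5))
U0, U2, U3 = (rng.standard_normal((d, 2)) for d in (2, 4, 5))
M = mttkrp_2step(X, U0, U2, U3)   # 3 x 2
```

The GEMM in step 1 does the bulk of the flops; the TTVs touch only the much smaller intermediate T.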
Parallel 2-Step MTTKRP
Call parallel BLAS. WOW!!!
60×60×60×60×60
Per-iteration time of a CP decomposition via ALS. MATLAB used the Tensor Toolbox cp_als function, version 2.6. [1]
Findings
Two interesting networks:
• Positive affect
• Negative affect
Tobia M., Hayashi K., Ballard G., Gotlib I. Dynamic Functional Connectivity and Individual Differences in Emotions During Social Stress, to appear in Human Brain Mapping.
References
Tamara G. Kolda and Brett W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3 (September 2009), 455–500. https://doi.org/10.1137/07070111X
Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. 2017. Model-Driven Sparse CP Decomposition for Higher-Order Tensors. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 1048–10. https://doi.org/10.1109/IPDPS.2017.80
Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS '15). IEEE Computer Society, Washington, DC, USA, 61–70. https://doi.org/10.1109/IPDPS.2015.27
D.C. Van Essen, K. Ugurbil, E. Auerbach, D. Barch, T.E.J. Behrens, R. Bucholz, A. Chang, L. Chen, M. Corbetta, S.W. Curtiss, S. Della Penna, D. Feinberg, M.F. Glasser, N. Harel, A.C. Heath, L. Larson-Prior, D. Marcus, G. Michalareas, S. Moeller, R. Oostenveld, S.E. Petersen, F. Prior, B.L. Schlaggar, S.M. Smith, A.Z. Snyder, J. Xu, and E. Yacoub. 2012. The Human Connectome Project: a data acquisition perspective. Neuroimage 62, 4 (2012), 2222–2231. https://doi.org/10.1016/j.neuroimage.2012.02.018
Anh-Huy Phan, Petr Tichavsky, and Andrzej Cichocki. 2013. Fast Alternating LS Algorithms for High Order CANDECOMP/PARAFAC Tensor Factorizations. IEEE Transactions on Signal Processing 61, 19 (Oct 2013), 4834–4846. https://doi.org/10.1109/TSP.2013.2269903
End Thanks for listening