Sparse Computations and Multi-BSP

Albert-Jan Yzelman
Parallel Computing & Big Data, Huawei Technologies France
October 11, 2016
BSP

BSP machine = { sequential processor } + interconnect.

The machine is described entirely by (p, g, L):
- strobing synchronisation,
- homogeneous processing,
- uniform full-duplex network.
BSP

BSP algorithm:
- strobing barriers,
- full-overlap h-relations,
- bottlenecks: max_s { sent_s, recv_s } and the work balance.

L. G. Valiant, A bridging model for parallel computation, CACM, 1990.
BSP

BSP cost:

T_p = max_s w_s^(0) + L + max{ max_s w_s^(1) + L, g max_s h_s^(1) + L } + ...

Separation of computation vs. communication; separation of algorithm vs. hardware.

L. G. Valiant, A bridging model for parallel computation, CACM, 1990.
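As an illustration (not from the original slides), a minimal C++ sketch of how such a cost expression can be evaluated, assuming the per-superstep work w[k][s] and h-relations h[k][s] are given as plain arrays; the overlap flag mimics the full-overlap variant by taking the maximum of the compute and communication terms instead of their sum.

    #include <algorithm>
    #include <vector>

    // Evaluate a BSP cost expression over a sequence of supersteps.
    // w[k][s]: work of processor s in superstep k (flops);
    // h[k][s]: max{sent_s, recv_s} of processor s in superstep k (words).
    double bsp_cost(const std::vector<std::vector<double>>& w,
                    const std::vector<std::vector<double>>& h,
                    double g, double L, bool overlap) {
        double T = 0.0;
        for (std::size_t k = 0; k < w.size(); ++k) {
            const double w_max = *std::max_element(w[k].begin(), w[k].end());
            const double h_max = *std::max_element(h[k].begin(), h[k].end());
            // Classical form: w + g*h + L per superstep; full-overlap form: max{w, g*h} + L.
            T += (overlap ? std::max(w_max, g * h_max) : w_max + g * h_max) + L;
        }
        return T;
    }

For example, a single superstep with w = {4, 6}, h = {2, 1}, g = 1 and L = 3 costs 6 + 2 + 3 = 11 in the classical form and max{6, 2} + 3 = 9 with full overlap.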
Immortal algorithms

The BSP paradigm allows the design of immortal algorithms:
- given a problem to compute,
- given a BSP computer (p, g, l),
- find the BSP algorithm that attains provably minimal cost.

E.g., fast Fourier transforms, matrix–matrix multiplication.

Thinking in Sync: the Bulk-Synchronous Parallel approach to large-scale computing. Bisseling and Yzelman, ACM Hot Topic '16.
http://www.computingreviews.com/hottopic/hottopic_essay.cfm?htname=BSP
BSP sparse matrix–vector multiplication

Variables A_s, x_s, y_s are local versions of the global variables A, x, y, distributed according to π_A, π_x, π_y.

1: for j | ∃ a_ij ≠ 0 ∈ A_s and π_x(j) ≠ s do
2:     get x_{π_x(j), j}
3: sync   { execute fan-out }
4: y_s = A_s x_s   { local multiplication stage }
5: for i | ∃ a_ij ∈ A_s and π_y(i) ≠ s do
6:     send (i, y_{s,i}) to π_y(i)
7: sync   { execute fan-in }
8: for all (i, α) received do
9:     add α to y_{s,i}

Rob H. Bisseling, "Parallel Scientific Computation", Oxford University Press, 2004.
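As a concrete illustration of the three supersteps, a small sequential C++ simulation (an editorial sketch, not code from the talk); the triplet structure, the per-processor maps, and the distributions pi_x, pi_y are assumptions made for the example only.

    #include <cstdio>
    #include <map>
    #include <vector>

    // Illustrative triplet: nonzero a_ij owned by processor s = pi_A(i, j).
    struct Nonzero { int i, j, s; double a; };

    // Sequential simulation of the three BSP SpMV supersteps on p processors.
    // pi_x[j] owns x_j, pi_y[i] owns y_i. Returns the owners' view of y = A x.
    std::map<int, double> bsp_spmv(int p, const std::vector<Nonzero>& A,
                                   const std::vector<double>& x,
                                   const std::vector<int>& pi_x,
                                   const std::vector<int>& pi_y) {
        std::vector<std::map<int, double>> x_loc(p), y_loc(p);
        long h_fanout = 0, h_fanin = 0;

        // Superstep 1, fan-out: get x_{pi_x(j),j} whenever a_ij is in A_s.
        for (const Nonzero& nz : A)
            if (!x_loc[nz.s].count(nz.j)) {
                x_loc[nz.s][nz.j] = x[nz.j];
                if (pi_x[nz.j] != nz.s) ++h_fanout;   // only the remote get costs communication
            }

        // Superstep 2: local multiplication y_s = A_s x_s.
        for (const Nonzero& nz : A)
            y_loc[nz.s][nz.i] += nz.a * x_loc[nz.s][nz.j];

        // Superstep 3, fan-in: send (i, y_{s,i}) to pi_y(i) and add it there.
        std::map<int, double> y;
        for (int s = 0; s < p; ++s)
            for (const auto& [i, alpha] : y_loc[s]) {
                if (pi_y[i] != s) ++h_fanin;          // only the remote send costs communication
                y[i] += alpha;
            }

        std::printf("fan-out volume %ld, fan-in volume %ld\n", h_fanout, h_fanin);
        return y;
    }

    int main() {
        // Tiny example on p = 2: A = [[4, 1], [0, 3]], row/column i owned by processor i.
        std::vector<Nonzero> A = { {0, 0, 0, 4.0}, {0, 1, 0, 1.0}, {1, 1, 1, 3.0} };
        auto y = bsp_spmv(2, A, {1.0, 2.0}, {0, 1}, {0, 1});
        for (auto [i, v] : y) std::printf("y[%d] = %g\n", i, v);
    }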
BSP sparse matrix–vector multiplication

Suppose π_A assigns every nonzero a_ij ∈ A to processor π_A(i, j). If
1. π_y(i) ∈ { s | ∃ a_ij ∈ A, π_A(i, j) = s } and
2. π_x(j) ∈ { s | ∃ a_ij ∈ A, π_A(i, j) = s },
then
- fan-out communication scatters Σ_j (λ_j^col − 1) elements from x,
- fan-in communication gathers Σ_i (λ_i^row − 1) elements from y,
where λ_i^row = |{ s | ∃ a_ij ∈ A_s }| and λ_j^col = |{ s | ∃ a_ij ∈ A_s }|.

Minimising the λ − 1 metric minimises total communication volume.
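The λ − 1 metric is easy to compute from a nonzero-to-processor assignment. A short sketch (editorial, using an assumed triplet representation) that counts the fan-out and fan-in volumes under conditions 1 and 2 above:

    #include <set>
    #include <vector>

    struct Nonzero { int i, j, s; };   // nonzero a_ij assigned to processor s = pi_A(i, j)

    // Total communication volume of the BSP SpMV under the lambda-1 metric:
    // fan-out moves sum_j (lambda_col_j - 1) entries of x,
    // fan-in  moves sum_i (lambda_row_i - 1) entries of y.
    long communication_volume(int m, int n, const std::vector<Nonzero>& A) {
        std::vector<std::set<int>> row_owners(m), col_owners(n);
        for (const Nonzero& nz : A) {
            row_owners[nz.i].insert(nz.s);     // lambda_row_i = |{ s : a_ij in A_s }|
            col_owners[nz.j].insert(nz.s);     // lambda_col_j = |{ s : a_ij in A_s }|
        }
        long volume = 0;
        for (const auto& owners : col_owners)
            if (!owners.empty()) volume += owners.size() - 1;   // fan-out of x_j
        for (const auto& owners : row_owners)
            if (!owners.empty()) volume += owners.size() - 1;   // fan-in of y_i
        return volume;
    }

A partitioner minimising this quantity therefore minimises the total SpMV communication volume.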
BSP sparse matrix–vector multiplication

Partitioning combined with reordering illustrates clear separators:

(Figure: the partitioned and reordered matrix, with its row and column blocks labelled 1–4.)

Group nonzeroes a_ij for which π_A(i) = π_A(j), permute rows i with λ_i > 1 in between, apply recursive bipartitioning.
BSP sparse matrix–vector multiplication

When partitioning in both dimensions:
BSP sparse matrix–vector multiplication

Classical worst-case bounds (in flops):

Block:   2 nz(A)/p (1 + ε) + n/p (√p − 1)(2g + 1) + 2l.
Row 1D:  2 nz(A)/p (1 + ε) + g h_fan-out + l.
Col 1D:  2 nz(A)/p (1 + ε) + max_s recv_s^fan-in + g h_fan-in + l.
Full 2D: 2 nz(A)/p (1 + ε) + max_s recv_s^fan-in + g (h_fan-out + h_fan-in) + 2l.

Memory overhead (buffers):

Θ( Σ_i (λ_i^row − 1) + Σ_j (λ_j^col − 1) ) = O( Σ_{λ ∈ λ^row ∪ λ^col} p · 1_{λ > 1} ).

Depending on the higher-level algorithm:
- fan-in latency can be hidden behind other kernels,
- fan-out latency can be hidden as well.
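These bounds can be transcribed directly into code to compare distributions on a concrete matrix. The sketch below is illustrative; h_out, h_in and r_in stand for h_fan-out, h_fan-in and max_s recv_s^fan-in, which are data-dependent quantities that must be measured or bounded separately.

    #include <cmath>

    // Direct transcriptions of the four worst-case SpMV bounds above (in flops).
    // nz = nz(A); eps absorbs the load imbalance of the local multiplications.
    struct Bounds { double block, row1d, col1d, full2d; };

    Bounds spmv_bounds(double nz, double n, double p, double eps,
                       double g, double l, double h_out, double h_in, double r_in) {
        const double work = 2.0 * nz / p * (1.0 + eps);
        Bounds b;
        b.block  = work + n / p * (std::sqrt(p) - 1.0) * (2.0 * g + 1.0) + 2.0 * l;
        b.row1d  = work + g * h_out + l;
        b.col1d  = work + r_in + g * h_in + l;
        b.full2d = work + r_in + g * (h_out + h_in) + 2.0 * l;
        return b;
    }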
Multi-BSP

Multi-BSP computer = p (subcomputers or processors) + M bytes of local memory + an interconnect.

A total of 4L parameters: (p_0, g_0, l_0, M_0, ..., p_{L−1}, g_{L−1}, l_{L−1}, M_{L−1}).

Advantages: memory-aware, non-uniform!
Disadvantages: (likely) harder to prove optimality.

L. G. Valiant, A bridging model for multi-core computing, CACM, 2011.
Multi-BSP

An example with L = 3 quadlets (p, g, l, M):

C = (2, g_0, l_0, M_0) (4, g_1, l_1, M_1) (8, g_2, l_2, M_2)

Each quadlet runs its own BSP SPMD program.
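One way to represent such a machine in code (an illustrative sketch; the numeric g, l, M values below are placeholders, not measured parameters):

    #include <cstdio>
    #include <vector>

    // One Multi-BSP level: the quadlet (p, g, l, M).
    struct Level { int p; double g, l; long M; };

    // Total number of leaf processors: the product of all p_k.
    long leaves(const std::vector<Level>& machine) {
        long n = 1;
        for (const Level& lvl : machine) n *= lvl.p;
        return n;
    }

    int main() {
        // The L = 3 example from the slide: (2, g0, l0, M0) (4, g1, l1, M1) (8, g2, l2, M2).
        std::vector<Level> C = { {2, 1.0, 10.0,   1L << 30},
                                 {4, 2.0, 100.0,  1L << 24},
                                 {8, 4.0, 1000.0, 1L << 16} };
        std::printf("leaf processors: %ld\n", leaves(C));   // 2 * 4 * 8 = 64
    }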
Multi-BSP SpMV multiplication

SPMD-style Multi-BSP SpMV multiplication:
- define process 0 at level −1 as the Multi-BSP root,
- let process s at level k have parent t at level k − 1,
- define (A_{−1,0}, x_{−1,0}, y_{−1,0}) = (A, x, y), the original input.
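One possible indexing of the process tree implied by this setup (an editorial sketch under assumptions of my own; the slide does not fix a numbering):

    #include <vector>

    // Process s at level k has parent s / p[k] at level k - 1; process 0 at level -1
    // is the root that owns the original (A, x, y).
    struct ProcId { int level; long s; };

    ProcId parent(const ProcId& proc, const std::vector<int>& p) {
        return { proc.level - 1, proc.s / p[proc.level] };
    }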