

1. Parallel Numerical Algorithms
   Chapter 3 – Dense Linear Systems
   Section 3.1 – Vector and Matrix Products
   Michael T. Heath and Edgar Solomonik
   Department of Computer Science, University of Illinois at Urbana-Champaign
   CS 554 / CSE 512

2. Outline
   1. BLAS
   2. Inner Product
   3. Outer Product
   4. Matrix-Vector Product
   5. Matrix-Matrix Product

3. Basic Linear Algebra Subprograms
   - Basic Linear Algebra Subprograms (BLAS) are building blocks for many other matrix computations
   - BLAS encapsulate basic operations on vectors and matrices so they can be optimized for a particular computer architecture while the high-level routines that call them remain portable
   - BLAS offer good opportunities for optimizing utilization of the memory hierarchy
   - Generic BLAS are available from netlib, and many computer vendors provide custom versions optimized for their particular systems

4. Examples of BLAS

   Level   Work      Examples   Function
   1       O(n)      daxpy      Scalar × vector + vector
                     ddot       Inner product
                     dnrm2      Euclidean vector norm
   2       O(n^2)    dgemv      Matrix-vector product
                     dtrsv      Triangular solve
                     dger       Outer product
   3       O(n^3)    dgemm      Matrix-matrix product
                     dtrsm      Multiple triangular solves
                     dsyrk      Symmetric rank-k update

   Effective time per flop decreases with BLAS level: $\gamma_1 > \gamma_2 \gg \gamma_3$ (the effective sec/flop of BLAS 1, 2, and 3, respectively).
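To make the three levels concrete, here is a minimal sketch, assuming the standard CBLAS C interface (linked against any BLAS implementation, e.g. the Netlib reference BLAS or OpenBLAS), that calls one routine from each level in the table above:

```c
#include <cblas.h>   /* CBLAS interface; link with a BLAS, e.g. -lopenblas */
#include <stdio.h>

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    double A[16] = {0}, C[16] = {0};

    /* Level 1, O(n): y = 2*x + y (daxpy), then inner product (ddot) */
    cblas_daxpy(4, 2.0, x, 1, y, 1);
    double dot = cblas_ddot(4, x, 1, y, 1);

    /* Level 2, O(n^2): rank-1 update A = A + x*y^T (dger) */
    cblas_dger(CblasRowMajor, 4, 4, 1.0, x, 1, y, 1, A, 4);

    /* Level 3, O(n^3): C = A*A (dgemm with alpha = 1, beta = 0) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                4, 4, 4, 1.0, A, 4, A, 4, 0.0, C, 4);

    printf("x^T y = %g\n", dot);
    return 0;
}
```

The level 3 routines do the most work per memory access, which is why their effective time per flop $\gamma_3$ is the smallest.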

5. Inner Product
   Inner product of two n-vectors x and y is given by
   $$x^T y = \sum_{i=1}^{n} x_i y_i$$
   Computation of the inner product requires n multiplications and n - 1 additions:
   $$M_1 = \Theta(n), \quad Q_1 = \Theta(n), \quad T_1 = \Theta(\gamma n)$$
   Effectively as hard as a scalar reduction; can be done via binary or binomial tree summation
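As a point of reference for these cost counts, a hypothetical serial helper (not from the slides) that performs exactly n multiplications and n - 1 additions:

```c
/* Serial inner product matching the cost count above:
   n multiplications and n - 1 additions (requires n >= 1). */
double inner_product(const double *x, const double *y, int n)
{
    double z = x[0] * y[0];
    for (int i = 1; i < n; i++)
        z += x[i] * y[i];    /* one multiply and one add per term */
    return z;
}
```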

6. Parallel Algorithm
   Partition: For i = 1, ..., n, fine-grain task i stores $x_i$ and $y_i$ and computes their product $x_i y_i$
   Communicate: Sum reduction over the n fine-grain tasks
   [Figure: nine fine-grain tasks holding $x_1 y_1$ through $x_9 y_9$, combined by a sum reduction]

7. Fine-Grain Parallel Algorithm
   z_i = x_i * y_i                               { local scalar product }
   reduce z_i across all tasks i = 1, ..., n     { sum reduction }

8. Agglomeration and Mapping
   Agglomerate: Combine k components of both x and y to form each coarse-grain task, which computes the inner product of these subvectors; communication becomes a sum reduction over n/k coarse-grain tasks
   Map: Assign (n/k)/p coarse-grain tasks to each of p processors, for a total of n/p components of x and y per processor
   [Figure: three coarse-grain tasks computing $x_1 y_1 + x_2 y_2 + x_3 y_3$, $x_4 y_4 + x_5 y_5 + x_6 y_6$, and $x_7 y_7 + x_8 y_8 + x_9 y_9$]

9. Coarse-Grain Parallel Algorithm
   z_i = x_{[i]}^T y_{[i]}                            { local inner product }
   reduce z_i across all processors i = 1, ..., p     { sum reduction }
   ( $x_{[i]}$ – subvector of x assigned to processor i )
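A minimal MPI sketch of this coarse-grain algorithm, assuming each rank already holds its n/p-element subvectors; MPI_Allreduce performs the sum reduction, which MPI implementations typically realize with a tree or recursive-doubling schedule:

```c
#include <mpi.h>

/* Assumed data layout: each of p ranks owns subvectors x_local and
   y_local of length n_local = n/p. */
double parallel_inner_product(const double *x_local,
                              const double *y_local, int n_local)
{
    double z_local = 0.0, z = 0.0;
    for (int i = 0; i < n_local; i++)
        z_local += x_local[i] * y_local[i];      /* local inner product */

    /* Sum reduction across all p ranks; every rank gets the result. */
    MPI_Allreduce(&z_local, &z, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return z;
}
```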

10. Performance
   The parallel costs (L_p, W_p, F_p) for the inner product are given by:
   - Computational cost $F_p = \Theta(n/p)$ regardless of network
   - Latency and bandwidth costs depend on the network:
     1-D mesh: $L_p, W_p = \Theta(p)$
     2-D mesh: $L_p, W_p = \Theta(\sqrt{p})$
     hypercube: $L_p, W_p = \Theta(\log p)$
   For a hypercube or fully connected network, the time is
   $$T_p = \alpha L_p + \beta W_p + \gamma F_p = \Theta(\alpha \log p + \gamma n/p)$$
   Efficiency and scaling are the same as for a binary tree sum
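To make the scalability analyses on the next two slides easier to follow, the efficiency for the hypercube case can be written out explicitly (a spelled-out step, using $T_1 = \Theta(\gamma n)$ from the inner product slide):

```latex
E_p = \frac{T_1}{p\,T_p}
    = \Theta\!\left(\frac{\gamma n}{p\,(\alpha \log p + \gamma n/p)}\right)
    = \Theta\!\left(\frac{1}{1 + (\alpha/\gamma)\,p \log p / n}\right)
```

Efficiency is therefore maintained as long as n grows in proportion to p log p, the same isoefficiency behavior as a binary tree sum.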

11. Inner Product on 1-D Mesh
   For a 1-D mesh, the total time is $T_p = \Theta(\gamma n/p + \alpha p)$
   To determine strong scalability, we set the efficiency constant and solve for $p_s$:
   $$\mathrm{const} = E_{p_s} = \frac{T_1}{p_s T_{p_s}} = \Theta\!\left(\frac{\gamma n}{\gamma n + \alpha p_s^2}\right) = \Theta\!\left(\frac{1}{1 + (\alpha/\gamma)\, p_s^2/n}\right)$$
   which yields $p_s = \Theta\big(\sqrt{(\gamma/\alpha)\, n}\big)$
   The 1-D mesh is weakly scalable to $p_w = \Theta((\gamma/\alpha)\, n)$ processors:
   $$E_{p_w}(p_w n) = \Theta\!\left(\frac{1}{1 + (\alpha/\gamma)\, p_w^2/(p_w n)}\right) = \Theta\!\left(\frac{1}{1 + (\alpha/\gamma)\, p_w/n}\right)$$

12. Inner Product on 2-D Mesh
   For a 2-D mesh, the total time is $T_p = \Theta(\gamma n/p + \alpha \sqrt{p})$
   To determine strong scalability, we set the efficiency constant and solve for $p_s$:
   $$\mathrm{const} = E_{p_s} = \frac{T_1}{p_s T_{p_s}} = \Theta\!\left(\frac{\gamma n}{\gamma n + \alpha p_s^{3/2}}\right) = \Theta\!\left(\frac{1}{1 + (\alpha/\gamma)\, p_s^{3/2}/n}\right)$$
   which yields $p_s = \Theta((\gamma/\alpha)^{2/3}\, n^{2/3})$
   The 2-D mesh is weakly scalable to $p_w = \Theta((\gamma/\alpha)^2\, n^2)$ processors, since
   $$E_{p_w}(p_w n) = \Theta\!\left(\frac{1}{1 + (\alpha/\gamma)\, p_w^{3/2}/(p_w n)}\right) = \Theta\!\left(\frac{1}{1 + (\alpha/\gamma)\, \sqrt{p_w}/n}\right)$$

13. Outer Product
   Outer product of two n-vectors x and y is the n × n matrix $Z = x y^T$ whose (i, j) entry is $z_{ij} = x_i y_j$
   For example,
   $$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}^T = \begin{bmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{bmatrix}$$
   Computation of the outer product requires $n^2$ multiplications:
   $$M_1 = \Theta(n^2), \quad Q_1 = \Theta(n^2), \quad T_1 = \Theta(\gamma n^2)$$
   (In this case, we should treat $M_1$ as the output size, or define the problem as in the BLAS: $Z = Z_{\mathrm{input}} + x y^T$.)
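In serial code, the BLAS-style update $Z = Z_{\mathrm{input}} + x y^T$ (the dger form noted above) is a doubly nested loop; a minimal sketch:

```c
/* Serial outer-product update Z = Z + x*y^T; Z is n x n, stored
   row-major, and the update costs n^2 multiply-adds. */
void outer_product_update(double *Z, const double *x, const double *y, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Z[i*n + j] += x[i] * y[j];   /* z_ij += x_i * y_j */
}
```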

14. Parallel Algorithm
   Partition: For i, j = 1, ..., n, fine-grain task (i, j) computes and stores $z_{ij} = x_i y_j$, yielding a 2-D array of $n^2$ fine-grain tasks. Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either
   - for some j, task (i, j) stores $x_i$ and task (j, i) stores $y_i$, or
   - task (i, i) stores both $x_i$ and $y_i$, i = 1, ..., n
   Communicate:
   - For i = 1, ..., n, the task that stores $x_i$ broadcasts it to all other tasks in the i-th task row
   - For j = 1, ..., n, the task that stores $y_j$ broadcasts it to all other tasks in the j-th task column

15. Fine-Grain Tasks and Communication
   [Figure: 6 × 6 grid of fine-grain tasks $x_i y_j$; each $x_i$ is broadcast across task row i and each $y_j$ down task column j]

16. Fine-Grain Parallel Algorithm
   broadcast x_i to tasks (i, k), k = 1, ..., n     { horizontal broadcast }
   broadcast y_j to tasks (k, j), k = 1, ..., n     { vertical broadcast }
   z_ij = x_i * y_j                                 { local scalar product }
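An MPI sketch of this algorithm, under these assumptions: p = q² ranks form a q × q task grid, and the diagonal placement from the partitioning slide is used (task (i, i) initially holds $x_i$ and $y_i$). Row and column communicators are built with MPI_Comm_split:

```c
#include <mpi.h>
#include <math.h>

/* Grid coordinates are 0-based here, unlike the 1-based slides.
   Only the diagonal task (i,i) needs valid x_i and y_i on entry. */
void fine_grain_outer_product(double x_i, double y_j, double *z_ij,
                              MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)lround(sqrt((double)p));
    int i = rank / q, j = rank % q;          /* this task is (i, j) */

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, i, j, &row_comm);   /* all tasks in row i    */
    MPI_Comm_split(comm, j, i, &col_comm);   /* all tasks in column j */

    MPI_Bcast(&x_i, 1, MPI_DOUBLE, i, row_comm);  /* horizontal broadcast,
                                                     root is task (i,i) */
    MPI_Bcast(&y_j, 1, MPI_DOUBLE, j, col_comm);  /* vertical broadcast,
                                                     root is task (j,j) */

    *z_ij = x_i * y_j;                       /* local scalar product */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```

Within row_comm the ranks are keyed by column index j, so the owner of $x_i$, task (i, i), is rank i there; the column broadcast root works out the same way.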

17. Agglomeration
   Agglomerate: With an n × n array of fine-grain tasks, natural strategies are (see the block-indexing sketch after this list):
   - 2-D: Combine a k × k subarray of fine-grain tasks to form each coarse-grain task, yielding $(n/k)^2$ coarse-grain tasks
   - 1-D column: Combine the n fine-grain tasks in each column into a coarse-grain task, yielding n coarse-grain tasks
   - 1-D row: Combine the n fine-grain tasks in each row into a coarse-grain task, yielding n coarse-grain tasks
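For the 2-D scheme, a hypothetical helper (not from the slides) showing which k × k block of Z the coarse-grain task at grid position (i, j) owns, assuming q = n/k tasks per dimension and q divides n:

```c
/* 2-D agglomeration: task (i, j) on a q x q grid owns the k x k block
   of Z with k = n/q. Rows [*r0, *r1) and columns [*c0, *c1) are
   0-based, half-open ranges. */
void owned_block(int i, int j, int n, int q,
                 int *r0, int *r1, int *c0, int *c1)
{
    int k = n / q;
    *r0 = i * k;  *r1 = (i + 1) * k;
    *c0 = j * k;  *c1 = (j + 1) * k;
}
```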
