Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms
Jean-Noël Quintin, Khalid Hasanov, Alexey Lastovetsky
Heterogeneous Computing Laboratory
School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
http://hcl.ucd.ie
2013
Outline
◮ Problem Outline
  ◮ Motivation and Introduction
  ◮ Previous Work: SUMMA
  ◮ Our Work: HSUMMA
◮ Experiments
  ◮ Experiments on Grid5000
  ◮ Experiments on BlueGene
Motivation
◮ The majority of HPC algorithms for scientific applications were introduced between the 1970s and the 1990s.
◮ They were designed for, and tested on, up to hundreds (a few thousand at most) of processors.
◮ In June 1995, the number of cores in the top 10 supercomputers ranged from 42 to 3680 (see http://www.top500.org/).
◮ Nowadays, in June 2013, this number ranges from 147,456 to 3,120,000.
Motivation
The increasing scale of HPC platforms raises new research questions that need to be addressed:
◮ Scalability
◮ Communication cost
◮ Energy efficiency
◮ etc.
Introduction
We focus on the communication cost of scientific applications on large-scale distributed memory platforms.
◮ Example application: parallel matrix multiplication.
◮ Why matrix multiplication?
  ◮ Matrix multiplication is important in its own right as a computational kernel of many scientific applications.
  ◮ It is a popular representative of other scientific applications.
  ◮ If an optimization method works well for matrix multiplication, it is likely to work well for many other related scientific applications.
Introduction
◮ Example algorithm: SUMMA (Scalable Universal Matrix Multiplication Algorithm).
◮ Introduced by Robert A. van de Geijn and Jerrell Watts, University of Texas at Austin, 1995.
◮ Implemented in ScaLAPACK.
SUMMA

[Figure: two 6 × 6 processor grids (P00 … P55) illustrating step k of SUMMA: the pivot column of A is broadcast along processor rows and the pivot row of B along processor columns.]

◮ Number of steps: n/b (n × n matrices, b: block size, √P × √P processor grid, P = 36).
◮ The pivot column A^b_{•k} of (n/√P) × b blocks of matrix A is broadcast horizontally.
◮ The pivot row B^b_{k•} of b × (n/√P) blocks of matrix B is broadcast vertically.
◮ Then each (n/√P) × (n/√P) block c_ij of matrix C is updated: c_ij = c_ij + a_ik × b_kj.
◮ Size of data broadcast vertically and horizontally in each step: 2 × (n/√P) × b.
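As a concrete illustration of this step loop, below is a minimal SUMMA sketch in C with MPI and CBLAS. It is our simplified rendering, not the ScaLAPACK code: it uses the coarsest block size b = n/√P (one broadcast step per grid column, so whole tiles travel instead of the thinner (n/√P) × b strips described above), and the row/column communicators, the row-major tile layout, and the zero-initialised C are all our assumptions.

    /* Minimal SUMMA sketch (illustrative, not ScaLAPACK's implementation).
     * Each of the P processors owns one nloc x nloc row-major tile of A, B
     * and C, where nloc = n / sqrt(P); C is zero-initialised by the caller.
     * row_comm/col_comm group the processors of one grid row/column and are
     * assumed to be pre-built, e.g. with MPI_Comm_split. */
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>
    #include <cblas.h>

    void summa(int nloc, int p_sqrt,
               const double *A, const double *B, double *C,
               MPI_Comm row_comm, MPI_Comm col_comm)
    {
        int my_col, my_row;
        MPI_Comm_rank(row_comm, &my_col);   /* my column index in the grid */
        MPI_Comm_rank(col_comm, &my_row);   /* my row index in the grid    */

        size_t tile = (size_t)nloc * nloc;
        double *Abuf = malloc(tile * sizeof *Abuf);
        double *Bbuf = malloc(tile * sizeof *Bbuf);

        for (int k = 0; k < p_sqrt; k++) {
            /* Pivot column of A: grid column k broadcasts along the rows. */
            if (my_col == k) memcpy(Abuf, A, tile * sizeof *Abuf);
            MPI_Bcast(Abuf, nloc * nloc, MPI_DOUBLE, k, row_comm);

            /* Pivot row of B: grid row k broadcasts along the columns. */
            if (my_row == k) memcpy(Bbuf, B, tile * sizeof *Bbuf);
            MPI_Bcast(Bbuf, nloc * nloc, MPI_DOUBLE, k, col_comm);

            /* Local update: C += Abuf * Bbuf. */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        nloc, nloc, nloc, 1.0, Abuf, nloc, Bbuf, nloc,
                        1.0, C, nloc);
        }
        free(Abuf);
        free(Bbuf);
    }

In practice b is chosen smaller than n/√P, so that broadcasts of thin strips can be pipelined with the local dgemm updates.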
Our Contribution
◮ We introduce an application-level hierarchical optimization of SUMMA.
◮ Hierarchical SUMMA (HSUMMA) is a platform-independent optimization of SUMMA.
◮ We show, both theoretically and experimentally, that HSUMMA reduces the communication cost of SUMMA.
SUMMA vs HSUMMA: Arrangement of Processors

[Figure: on the left, the flat 6 × 6 processor grid (P00 … P55) used by SUMMA; on the right, the same 36 processors arranged by HSUMMA into a 3 × 3 grid of groups, each group being a 2 × 2 grid of processors.]
Horizontal Communications Between Groups in HSUMMA

[Figure: 6 × 6 processor grid partitioned into 3 × 3 groups; the pivot column of A travels horizontally between groups.]

◮ P: number of processors (P = 36)
◮ G: number of groups (G = 9)
◮ √P × √P: processor grid
◮ √G × √G: grid of processor groups
◮ M: block size between groups
◮ n/M: number of steps between groups
◮ Size of data broadcast horizontally in each step: (n/√P) × M

The pivot column A^M_{•k} of (n/√P) × M blocks of matrix A is broadcast horizontally between groups.
Horizontal Communications Inside Groups in HSUMMA

[Figure: the local pivot column strips A^b_{•k} travel horizontally inside each group.]

◮ √(P/G) × √(P/G): grid of processors inside each group
◮ b: block size inside one group
◮ M/b: number of steps inside one group
◮ n/M: number of steps between groups
◮ Size of data broadcast horizontally in each step: (n/√P) × b

Upon receipt of the pivot column data from the other groups, the local pivot column A^b_{•k} (b ≤ M) of (n/√P) × b blocks of matrix A is broadcast horizontally inside each group.
Vertical Communications Between Groups in HSUMMA

[Figure: 6 × 6 processor grid partitioned into 3 × 3 groups; the pivot row of B travels vertically between groups.]

◮ P: number of processors (P = 36)
◮ G: number of groups (G = 9)
◮ √P × √P: processor grid
◮ √G × √G: grid of processor groups
◮ M: block size between groups
◮ n/M: number of steps between groups
◮ Size of data broadcast vertically in each step: M × (n/√P)

The pivot row B^M_{k•} of M × (n/√P) blocks of matrix B is broadcast vertically between groups.
Vertical Communications Inside Groups in HSUMMA

[Figure: the local pivot row strips B^b_{k•} travel vertically inside each group.]

◮ √(P/G) × √(P/G): grid of processors inside each group
◮ b: block size inside one group
◮ M/b: number of steps inside one group
◮ n/M: number of steps between groups
◮ Size of data broadcast vertically in each step: b × (n/√P)

Upon receipt of the pivot row data from the other groups, the local pivot row B^b_{k•} (b ≤ M) of b × (n/√P) blocks of matrix B is broadcast vertically inside each group.
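The four broadcasts above share one structure: data first moves between groups through one representative per group, then inside each group. The sketch below shows one way such a two-level broadcast could be organised with MPI communicators. It is our simplified illustration, not the HSUMMA implementation: it forms groups from consecutive ranks, picks local rank 0 as each group's leader, and assumes the owning processor is itself a leader, whereas HSUMMA arranges the groups on a √G × √G grid.

    /* Two-level broadcast sketch in C with MPI, illustrating the
     * communication structure behind HSUMMA's pivot broadcasts. Our
     * simplifying assumptions: groups are formed from consecutive ranks,
     * each group's leader is its local rank 0, and the owning processor is
     * itself a leader (owner_rank % group_size == 0). */
    #include <mpi.h>

    void two_level_bcast(double *buf, int count, int owner_rank,
                         int group_size, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        /* One communicator per group of consecutive ranks. */
        MPI_Comm group_comm;
        MPI_Comm_split(comm, rank / group_size, rank, &group_comm);

        int local_rank;
        MPI_Comm_rank(group_comm, &local_rank);

        /* One communicator connecting the group leaders (local rank 0);
         * non-leaders pass MPI_UNDEFINED and get MPI_COMM_NULL back. */
        MPI_Comm leader_comm;
        MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED,
                       rank, &leader_comm);

        /* Step 1: broadcast between groups (block size M in HSUMMA). */
        if (leader_comm != MPI_COMM_NULL)
            MPI_Bcast(buf, count, MPI_DOUBLE,
                      owner_rank / group_size, leader_comm);

        /* Step 2: broadcast inside each group; HSUMMA relays each M-strip
         * in M/b pieces of block size b <= M. */
        MPI_Bcast(buf, count, MPI_DOUBLE, 0, group_comm);

        if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&group_comm);
    }

Having two block sizes, M between groups and b inside them, is what gives HSUMMA an extra tuning parameter compared to SUMMA's single block size b.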
Communication Model Used to Analyse SUMMA and HSUMMA

Time to send a message of size m between two processors:

    α + m·β    (1)

Here,
◮ α: latency
◮ β: reciprocal bandwidth
◮ m: message size
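As a worked example of how this model is applied (our derivation, not a result stated on the slides): assume a binomial-tree broadcast, so broadcasting a message of size m to p processors costs log₂(p)·(α + mβ). SUMMA then performs n/b steps, each with one horizontal and one vertical broadcast of size (n/√P)·b among √P processors, giving

    T_{\mathrm{SUMMA}} \approx \frac{n}{b} \cdot 2\log_2\!\sqrt{P}
      \left(\alpha + \frac{n\,b}{\sqrt{P}}\,\beta\right)
    = \frac{2n}{b}\log_2\!\sqrt{P}\;\alpha
      + \frac{2n^2}{\sqrt{P}}\log_2\!\sqrt{P}\;\beta .

Under this assumption the latency term falls as b grows while the bandwidth term does not depend on b at all; HSUMMA's two block sizes let this latency/bandwidth trade-off be tuned separately between groups and inside them.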