1/22 Euro PVM/MPI 2003, Venezia, Italia
Efficient Parallel Implementation of Transitive Closure of Digraphs
C. E. R. Alves, Universidade São Judas Tadeu
E. N. Cáceres, Universidade Federal de Mato Grosso do Sul
A. A. Castro Jr., Universidade Católica Dom Bosco
S. W. Song, Universidade de São Paulo
J. L. Szwarcfiter, Universidade Federal do Rio de Janeiro
2/22 The Transitive Closure Problem
• Used in many areas such as
  – Network Planning
  – Distributed Systems Design
• Used in problems such as
  – All Shortest Paths in a Directed Graph
  – Breadth-First Spanning Trees
• Directed graph $D(V, E)$ with $|V| = n$, $|E| = m$
• We present a parallel algorithm to compute its transitive closure using
  – $p$ processors
  – each with $O(n^2/p)$ local memory
3/22 Example
[Figure: a directed graph on vertices 1 to 6.]
4/22 Example
[Figure: the same directed graph with its transitive closure; green edges join i to j if j can be reached from i.]
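The definition on this slide (an edge from i to j whenever j can be reached from i) can be illustrated with a minimal sequential sketch. This is not the parallel algorithm of the talk, only a breadth-first reachability check from every vertex. The function name `transitive_closure_bfs` and the 4-vertex input are hypothetical and are not the graph drawn on the slide.

```python
from collections import deque


def transitive_closure_bfs(n, edges):
    """Return the set of pairs (i, j) such that j is reachable from i
    by a path with at least one edge. Vertices are numbered 1..n."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)

    closure = set()
    for source in range(1, n + 1):
        seen = set()
        queue = deque(adj[source])  # start from the out-neighbours of source
        while queue:
            v = queue.popleft()
            if v in seen:
                continue
            seen.add(v)
            queue.extend(adj[v])
        closure.update((source, v) for v in seen)
    return closure


# Hypothetical example: the path 1 -> 2 -> 3 -> 4.
example_closure = transitive_closure_bfs(4, [(1, 2), (2, 3), (3, 4)])
```

For the path above, the closure adds the pairs (1, 3), (1, 4) and (2, 4) to the three original edges.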
5/22 BSP/CGM Model
CGM (Coarse Grained Multicomputer) model: $p$ processors, each with its own local memory, communicating through a network.
The algorithm alternates between
• Computation round: each processor computes independently.
• Communication round: each processor sends/receives data to/from other processors.
Goals:
• Obtain a linear speed-up in $p$.
• Minimize the number of rounds.
6/22 The CGM Model
[Figure: processors $P_0, P_1, P_2, \ldots, P_{p-1}$ alternating between local computation (computation round) and global communication (communication round), separated by synchronization barriers.]
7/22 Previous Parallel Algorithms
1. PRAM:
   • Karp et al.: CREW: $O(\log^2 n)$ time with $O(M(n))$ processors, where $M(n)$ is the best known sequential bound for multiplying two $n \times n$ matrices over a ring.
   • JáJá: CRCW: $O(\log n)$ time with $O(n^3)$ processors.
2. Cáceres et al.: acyclic digraph with linear extension labeling: $O(\log p)$ rounds with $O(n^3/p)$ local time.
3. Dependency Graph Approach:
   • Pagourtzis et al.: $O(p)$ rounds with $O(n^3/p)$ local time.
8/22 Warshall’s Algorithm
Algorithm 1: Warshall’s Algorithm
Input: Adjacency matrix $M_{n \times n}$ of graph G
Output: Transitive closure of graph G
for k ← 1 until n do
    for i ← 1 until n do
        for j ← 1 until n do
            M[i, j] ← M[i, j] or (M[i, k] and M[k, j])
        end for
    end for
end for
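Algorithm 1 translates almost line by line into code. The following is a minimal sketch, assuming the adjacency matrix is a list of lists of 0/1 values and using 0-based indices instead of the 1-based indices of the slide. The function name `warshall` is chosen here for illustration.

```python
def warshall(matrix):
    """Return the transitive closure of a 0/1 adjacency matrix.

    The input is not modified; a copy is updated in place, exactly as
    in the triple loop of Algorithm 1 (with 0-based indices).
    """
    n = len(matrix)
    closure = [row[:] for row in matrix]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # i reaches j if it already did, or if it reaches k and k reaches j
                closure[i][j] = closure[i][j] or (closure[i][k] and closure[k][j])
    return closure


# Hypothetical example: the path 0 -> 1 -> 2.
path_closure = warshall([[0, 1, 0],
                         [0, 0, 1],
                         [0, 0, 0]])
```

The three nested loops over n values give the familiar $O(n^3)$ sequential cost that the parallel versions on the following slides divide among $p$ processors.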
9/22 Partitioning the Adjacency Matrix
[Figure: the adjacency matrix partitioned into strips numbered 1 to 4, with indices i, j, k and t marked.]
10/22 The Parallel Algorithm
Algorithm 2: Parallel Warshall
Input: Adjacency matrix $M$ stored in the $p$ processors: each processor $q$ ($1 \le q \le p$) stores submatrices $M[(q-1)\frac{n}{p}+1 \,..\, q\frac{n}{p}][1..n]$ and $M[1..n][(q-1)\frac{n}{p}+1 \,..\, q\frac{n}{p}]$.
Output: Transitive closure of graph $G$ represented by the transformed matrix $M$.
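The index ranges in the input description can be made concrete with a small helper. This is a sketch under the assumption that $p$ divides $n$ exactly; the slide does not say how a remainder would be handled. The names `block_range` and `owner` are hypothetical.

```python
def block_range(q, n, p):
    """1-based inclusive range (first, last) of the rows and columns
    stored by processor q, for 1 <= q <= p. Assumes p divides n."""
    if n % p != 0:
        raise ValueError("this sketch assumes p divides n")
    size = n // p
    return ((q - 1) * size + 1, q * size)


def owner(index, n, p):
    """Processor (1..p) whose block contains the 1-based row or column index."""
    size = n // p
    return (index - 1) // size + 1


# With n = 8 and p = 4, processor 2 stores rows 3..4 and columns 3..4.
rows_of_q2 = block_range(2, 8, 4)
```

Under this layout an entry M[i][j] is held by two processors: the one that owns row i and the one that owns column j (they coincide when i and j fall in the same block).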
11/22 Algorithm 3: Parallel Warshall
Each processor $q$ ($1 \le q \le p$) does the following.
repeat
    for k = (q − 1)n/p + 1 until qn/p do
        for i = 1 until n do
            for j = 1 until n do
                if M[i][k] = 1 and M[k][j] = 1 then
                    M[i][j] = 1 (if M[i][j] belongs to a processor different from q, then store it for subsequent transmission to the corresponding processor)
                end if
            end for
        end for
    end for
    Send stored data to the corresponding processors.
    Receive data that belong to processor q from other processors.
until no new matrix entry updates are done
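The round structure of the parallel algorithm can be tried out without MPI by simulating the $p$ processors one after another inside a single program. The sketch below is such a simulation and not the authors' C/LAM-MPI code. It assumes 0-based indices, that $p$ divides $n$, and that each simulated processor keeps a full-size local matrix in which only its own row strip and column strip are filled. All names (`parallel_warshall_simulated`, `outbox`, `owner`) are hypothetical.

```python
def parallel_warshall_simulated(matrix, p):
    """Simulate the computation/communication rounds of the parallel
    Warshall scheme. Returns (closure, rounds)."""
    n = len(matrix)
    if n % p != 0:
        raise ValueError("this sketch assumes p divides n")
    size = n // p

    def owner(index):
        # 0-based processor id that owns this row or column index
        return index // size

    # local[q] contains only the entries of q's row strip and column strip.
    local = [[[matrix[i][j] if q in (owner(i), owner(j)) else 0
               for j in range(n)] for i in range(n)] for q in range(p)]

    rounds = 0
    while True:
        rounds += 1
        changed = False
        outbox = [set() for _ in range(p)]  # entries to deliver to each processor

        # Computation round: every processor works on its own block of k values.
        for q in range(p):
            m = local[q]
            for k in range(q * size, (q + 1) * size):
                for i in range(n):
                    if not m[i][k]:
                        continue
                    for j in range(n):
                        if not m[k][j]:
                            continue
                        # M[i][j] is held by the owners of row i and column j.
                        for dest in {owner(i), owner(j)}:
                            if dest != q:
                                outbox[dest].add((i, j))
                            elif not m[i][j]:
                                m[i][j] = 1
                                changed = True

        # Communication round: deliver the stored entries.
        for q in range(p):
            for i, j in outbox[q]:
                if not local[q][i][j]:
                    local[q][i][j] = 1
                    changed = True

        if not changed:  # no new matrix entry updates anywhere
            break

    closure = [[local[owner(i)][i][j] for j in range(n)] for i in range(n)]
    return closure, rounds


# Hypothetical example: the path 0 -> 1 -> 2 -> 3 on two simulated processors.
sim_closure, sim_rounds = parallel_warshall_simulated(
    [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]], 2)
```

The simulation resends entries that the receiver may already have, which a real implementation would likely avoid. The round counter includes the final round in which nothing changes, so it is not directly comparable to the round counts reported in the talk.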
12/22 The Main Idea
• Partition the vertex set $V(D)$.
• For each part, construct the digraph formed by the edges of $D$ that have at least one endpoint in that part.
• Compute the transitive closure in each part.
• Send the computed transitive edges to the proper part.
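The first two steps can be sketched as follows. The slide does not fix how $V(D)$ is partitioned, so this sketch assumes contiguous blocks of vertices 1..n and that $p$ divides $n$. The function name `partition_edges` and the example input are hypothetical.

```python
def partition_edges(n, edges, p):
    """Split vertices 1..n into p contiguous parts and, for each part,
    collect the edges with at least one endpoint in that part."""
    if n % p != 0:
        raise ValueError("this sketch assumes p divides n")
    size = n // p
    parts = [set(range(q * size + 1, (q + 1) * size + 1)) for q in range(p)]
    local_edges = [[(u, v) for (u, v) in edges if u in part or v in part]
                   for part in parts]
    return parts, local_edges


# Hypothetical example: the path 1 -> 2 -> 3 -> 4 split over two parts.
example_parts, example_local_edges = partition_edges(4, [(1, 2), (2, 3), (3, 4)], 2)
```

In this example the edge (2, 3) crosses the two parts, so it appears in both local digraphs; edges inside a single part appear only once.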
13/22 Example
[Figure: a directed graph on vertices 1 to 8.]
14/22 Example
[Figure: the example digraph split between Processor 0 and Processor 1.]
15/22 Example
[Figure: the example digraph split between Processor 0 and Processor 1.]
16/22 Example
[Figure: the example digraph as held by Processor 0 and Processor 1, each panel showing vertices 1 to 8.]
17/22 Implementation
• 64-node Beowulf cluster: low-cost microcomputers with 256 MB RAM, 256 MB swap memory, Intel Pentium III 448.956 MHz CPU, 512 KB cache.
• 100 Mb Fast Ethernet switch.
• Code in standard ANSI C and LAM-MPI Version 6.5.6.
• Tests on randomly generated digraphs with 20% probability of an edge between two vertices.
• In all the tests, the number of communication rounds required was less than $\log p$.
18/22 Implementation Results
[Plot: running time in seconds versus number of processors (10 to 60) for 480x480 and 512x512 matrices.]
19/22 Implementation Results
[Plot: running time in seconds versus number of processors (10 to 60) for 960x960, 1024x1024 and 1920x1920 matrices.]
20/22 Implementation Results
[Plot: speedup versus number of processors (10 to 60) for 480x480 and 512x512 matrices.]
21/22 Implementation Results
[Plot: speedup versus number of processors (10 to 60) for 960x960, 1024x1024 and 1920x1920 matrices.]
22/22 Conclusion
A BSP/CGM algorithm for the Transitive Closure problem.
• Digraph with $n$ vertices and $m$ edges.
• The number of communication rounds measured: $O(\log p)$.
• Local computation time: $O(mn/p)$.