Mapping CSP Networks to MPI Clusters Using Channel Graphs and Dynamic Instrumentation Gabriella Azzopardi Kevin Vella Adrian Muscat University of Malta
Outline • Introduction • 1. CSP-Library • 2. Configuration Language • 3. CSP-based Concurrent Applications • 4. Automatic Mapping Algorithms • Results • Conclusion
Introduction • Distributing an application's processes across a cluster improves computational performance • After implementing an application, the developer must therefore search for the most efficient way to run it in parallel, which is time consuming • A number of algorithms have been studied to map applications onto the underlying architecture • This work seeks to achieve the same mapping effect for CSP-based concurrent applications and evaluates the performance of the resulting mappings • The initial aim is to provide the necessary tools to implement and map CSP-based applications so that the mapping process can be automated, and then to study different automatic techniques and compare them
Introduction High-level application mapping onto a cluster • The idea is to automatically map an application with any number of processes onto a cluster with any number of nodes in order to best utilize the available resources • This was achieved in 4 parts, which are explained in the following slides
1. CSP-Library A CSP-based message passing channel must first be implemented to allow communication between an application’s parallel processes. This was done using MPI and POSIX threads to implement the following: • Channels - Provide communication between pairs of processes. Three types of channels are defined: Internal, External and Timer. • Parallel - This brings together a number of processes such that they are executed concurrently. The processes start together and the Parallel terminates when all combined processes have terminated.
1. CSP-Library • Alternation - This combines a number of processes whereby only one of them is chosen for execution. It is a separate process which is provided with a list of channels and randomly selects one that is ready, receiving the data waiting on it. • Placed Parallel - Similar to Parallel, however the concurrent processes are executed on different nodes, assigned through a predefined mapping.
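The library itself is built on MPI and POSIX threads; as a rough illustration of the Channel, Parallel and Alternation primitives described above, here is a minimal Python model using threads and queues. All names and mechanisms here are illustrative assumptions, not the library's actual API:

```python
import threading
import queue
import random

class Channel:
    """Minimal model of an internal CSP channel: a point-to-point link
    between two processes (modelled here as threads). Capacity 1 only
    approximates CSP's synchronous rendezvous."""
    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def send(self, value):
        self._q.put(value)      # blocks while the buffer slot is full

    def receive(self):
        return self._q.get()    # blocks until a value is available

def parallel(*procs):
    """Run all processes concurrently; return only when every one
    has terminated (the Parallel construct)."""
    threads = [threading.Thread(target=p) for p in procs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def alternation(channels):
    """Randomly select one ready channel and receive from it
    (the Alternation construct)."""
    while True:
        ready = [c for c in channels if not c._q.empty()]
        if ready:
            return random.choice(ready).receive()
```

A Placed Parallel would additionally carry a process-to-node mapping; that distribution step (done via MPI in the real library) is not modelled here.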
2. Configuration Language A configuration language must then be established to provide a means for mapping such applications easily onto a cluster. This was developed using JSON to allow users to manually map processes onto nodes. The configuration is kept in a separate file so that a single application can have various mappings. Three sections are defined: • Application - This section lists all the channels used in the application and the pair of processes each channel connects. The processes in this section are identified using a unique ID, which is then referenced in the mapping section.
2. Configuration Language • Mapping - Each process is referenced by a unique ID, shared with the application section, and is assigned a rank upon which it will execute. This section groups mappings under a unique ID, in order to allow multiple mappings to be used in the same application. • Global - This must be the first section and is used to declare any variables to be used in the mapping and application sections that follow. Such variables can also be edited from the application at runtime, allowing the configuration to correspond dynamically to the application.
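The slides do not show the concrete JSON syntax, but a configuration file with the three sections described above (Global first, as required) might look like the following hypothetical sketch. All field names are illustrative assumptions; only the three section names come from the slides:

```json
{
  "global": {
    "N": 4
  },
  "application": {
    "channels": [
      { "name": "c0", "from": 0, "to": 1 },
      { "name": "c1", "from": 1, "to": 2 }
    ]
  },
  "mapping": {
    "linear": { "0": 0, "1": 0, "2": 1 },
    "scatter": { "0": 0, "1": 1, "2": 2 }
  }
}
```

Here the `mapping` section carries two named mappings (`linear`, `scatter`), showing how one application file can support several alternative placements.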
3. CSP-based Concurrent Applications A number of concurrent applications with various programming patterns were then developed using the CSP library and mapped using the configuration language. Applications are represented using a graph model in order to facilitate application partitioning for future mappings. • Sort Pump - Linear application which sorts a list of numbers • N-place Buffer - Linear application creating a buffer of N processes between the sending and receiving processes • Single Filter - Geometric application simulating image filtering using a single Gaussian filter, by dividing the image horizontally • Single 2D Filter - Farming application simulating image filtering using a single Gaussian filter, by dividing the image in both planes; the master process uses the Alternation function
3. CSP-based Concurrent Applications • Double Filter - Geometric application which simulates image filtering by applying the same Gaussian filter twice • Mergesort - Binary tree application which sorts a list of numbers
4. Automatic Mapping Algorithms The final step is to automate the mapping of such applications using various partitioning algorithms. The following mapping techniques were used: • Simple - Random, Linear from the graph, and Weighted Scatter according to execution times • Min-Max Greedy - Greedily assigns processes to partitions with the aim of achieving minimum cost and maximum gain • Breadth-First Search - Traverses the graph starting from a root; vertices are assigned a partition according to their distance from the root • K-way Bisection - Recursive Graph Bisection groups nearby vertices, and Kernighan-Lin iteratively swaps processes between partitions when the swap yields a gain • Simulated Annealing - Optimization algorithm which repeatedly searches for a better mapping than the current one by exploring neighbouring partitioning solutions • K-Means Clustering - Unsupervised learning algorithm which groups connected processes together after selecting initial centroids
Application Statistics The algorithms which use application information require an initial run of the application and the following data is recorded for each individual channel and each placed parallel process: • Channel total time - Total time spent waiting on the channel by the first process, until the second process arrives AND transfers the data • Channel communication time - Total time taken to actually transfer all the data across the channel • Channel usage - Total number of times the channel was used • Channel data size - Total amount of bytes transferred across the channel • Process total time - Total time taken by a process to execute
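The per-channel counters above can be gathered by wrapping each channel with timing and accounting logic. The following Python model illustrates the idea only; the names and mechanism are assumptions, not the library's instrumentation:

```python
import time
import queue

class InstrumentedChannel:
    """Toy channel that accumulates the statistics listed above:
    total time (waiting plus transfer), usage count, and bytes moved."""
    def __init__(self):
        self._q = queue.Queue(maxsize=1)
        self.total_time = 0.0   # time spent blocked on the channel
        self.usage = 0          # number of communications
        self.data_size = 0      # total bytes transferred

    def send(self, payload: bytes):
        start = time.perf_counter()
        self._q.put(payload)    # blocks until the slot is free
        self.total_time += time.perf_counter() - start
        self.usage += 1
        self.data_size += len(payload)

    def receive(self) -> bytes:
        return self._q.get()
```

In the real library the pure communication time would be measured separately from the waiting time (the channel communication time versus channel total time distinction above); this sketch folds them together for brevity.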
Results • The 6 applications were mapped and executed with the 9 mapping algorithms • 3 separate mappings using each algorithm were generated per application • Applications were run using 1, 2, 4, 6, 8 nodes and 1, 2, 4[, 6] cores on 2 different clusters • Each application instance was run 3 times, and execution time was recorded in each case • 2 versions of MPI were used: MPI Hydra (MPICH) and MVAPICH
Results Example: • MVAPICH results for the Sort Pump on one core • Results indicate that the Linear algorithm generated the largest speedup, whereas Weighted Scatter generated the least
Conclusion • This work provides the necessary tools for developing CSP-based concurrent applications • The framework developed will save developers a significant amount of time and effort when generating mappings for their CSP applications • Results indicate that using a mapping algorithm to map such applications can be beneficial • Large tree-depth graphs (e.g. Sort Pump) - Partitioning algorithms which divide an application without adding extra external channels performed better (e.g. Linear) • Short tree-depth graphs - Partitioning algorithms which prioritize equality of partitions proved to be more effective
Questions?
Channel Usage (diagram: Create/Destroy, Send/Receive, Timer Channel)
Channel Implementation (diagram: Internal Channel, External Channel)
Application Statistics • Application information is collected using two versions of the CSP library. The first version calculates all channel and process times by setting the TIME flag, whereas the second version calculates the total execution time without timing overheads • Derived data is then extracted and used by the graph partitioning algorithms, where Chan_comms_time is the total communication time of all channels used by the current process
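The slide's formula for the derived data is not reproduced here, but a plausible reconstruction (an assumption on our part) is that each vertex weight is a process's compute time, i.e. its total time minus Chan_comms_time, while edge weights come from the per-channel data sizes:

```python
def build_weighted_graph(proc_times, chan_stats):
    """Assemble a weighted channel graph for the partitioners.

    proc_times: dict process id -> total execution time
    chan_stats: dict (proc_a, proc_b) -> {"comm_time": s, "data_size": b}

    Vertex weight (assumed): total time minus Chan_comms_time, the summed
    communication time of the channels the process uses.
    Edge weight (assumed): bytes transferred over the channel.
    """
    edges = {pair: s["data_size"] for pair, s in chan_stats.items()}
    vertices = {}
    for p, total in proc_times.items():
        chan_comms_time = sum(s["comm_time"]
                              for pair, s in chan_stats.items()
                              if p in pair)
        vertices[p] = total - chan_comms_time
    return vertices, edges
```

Under this reading, the TIME-flag run supplies `chan_stats`, while the uninstrumented run supplies overhead-free totals for validating the mapping's speedup.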
Sort Pump Mappings (diagrams): • Sort Pump application (small scale) • Linear mapping • Weighted Scatter mapping