An Efficient Implementation of Parallelization in the Domain Decomposition of TELEMAC

Dzung Nguyen, T. Liepert, M. Reisenbüchler, K. Kaveh, M. D. Bui, P. Rutschmann
Chair of Hydraulic and Water Resources Engineering
Technical University of Munich
Arcisstr. 21, 80333 Munich
dzung.nguyen@tum.de

Abstract—In this study, we present an MPI-parallelized implementation of the domain decomposer for TELEMAC. The decomposer, called PARTEL, splits the computational domain into several partitions, which forms the basis of parallel simulations. In current TELEMAC releases (version 8), only the serial mode has been implemented in PARTEL, which gives room for speeding up the code with a parallel implementation together with a specific enhancement of the internal data structures used by the code. In this work, we fully parallelized the domain decomposition using MPI and utilized different implementations of the Hash table representing several data structures in PARTEL. This approach allows us to decompose a huge computational domain consisting of a hundred million elements into ten thousand partitions on an ordinary workstation. The benchmark also revealed that a speed-up by a factor of 4 is readily obtained compared to the original PARTEL before the enhancement.

I. INTRODUCTION

Numerical simulations have proven to be an efficient way to study the characteristics of river hydro- and morphodynamics processes. Especially for flood protection, numerical computations are indispensable. The VieWBay project, part of the strategy funded by the Bavarian State Ministry of the Environment and Consumer Protection called "Wasser-Zukunft-Bayern", aims to develop an integrative approach combining the fields of hydrology and hydromorphodynamics under the framework of HPC and applying it to the whole of Bavaria. The hydromorphodynamics model represents the river network, including the propagation of waves and the morphological evolution of the streams. The intended size of the river system that has to be modeled is approximately 10,000 kilometers. The demands on the simulation software resulting from the project are therefore high. A preliminary study [3] shows that TELEMAC fits the needs of the project best with regard to flexibility, scalability, and accuracy.

TELEMAC, however, has a limit in handling mesh sizes, as we first noticed in version v7p2r3. This version is capable of dealing with domain sizes of a few million elements and scales well by default [1,2,4]. However, the aim of VieWBay regarding the mesh size goes beyond that: the intended mesh size will range between 40 and 100 million elements. Deployed on CoolMUC-2, a Linux cluster at the Leibniz Supercomputing Center (LRZ) [5], TELEMAC v7p2r3 fails to run simulations with a considerably larger mesh, for instance a mesh of more than 5 million elements. Therefore, it was urgent for our development team to improve this version of TELEMAC. An in-depth analysis later revealed that memory exceedance was the reason for the failure of the software.

CoolMUC-2 imposes a limit on computer memory (RAM) that we had to overcome for TELEMAC. The cluster consists of 384 nodes with a 28-core Haswell architecture, all of which are interconnected by InfiniBand. Each node is equipped with 64 GB RAM shared by the cores of the same node. In practice, the actually available RAM is reduced by about 10 % due to runtime system requirements [5]. Under the assumption of evenly distributed memory usage, each core has about 2 GB RAM to work with. In case of a lack of memory, the number of activated cores might be reduced in order to raise the available RAM per core.
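For illustration, the per-core budget quoted above follows from a short back-of-the-envelope calculation based on the figures reported for CoolMUC-2 in [5]; the snippet below is not part of TELEMAC and only restates that arithmetic.

```python
# Rough estimate of the RAM budget per core on a CoolMUC-2 node.
node_ram_gb = 64.0        # RAM installed per node
cores_per_node = 28       # Haswell cores sharing that RAM
runtime_overhead = 0.10   # roughly 10 % reserved by the runtime system [5]

usable_ram_gb = node_ram_gb * (1.0 - runtime_overhead)
print(usable_ram_gb / cores_per_node)   # ~2.06 GB, i.e. about 2 GB per core
```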
By tracing the workflow of TELEMAC v7p2r3, we found that the domain decomposer, called PARTEL, was the main cause of the memory exceedance. By design, PARTEL is a standalone program that always runs ahead of the simulation code in parallel mode to split the computational domain into the desired number of partitions. Our performance analysis revealed that PARTEL exceeds the memory limit when dealing with a mesh of more than 5 million elements and thus fails to complete. This failure, therefore, causes the whole simulation to terminate.

A number of studies that attempted to solve the memory problem can be found in [1, 2, 3, 4]. The solution in [2] resolved only a part of the problem by optimizing the data structures. With that, TELEMAC was able to work with a large mesh of up to 25 million elements before running into memory and runtime limits. Among the solutions, [1] is the most complete parallel implementation of PARTEL. With it, TELEMAC worked successfully with an extremely large mesh (up to 200 million elements) in an acceptable time. It, however, comes with great complexity in implementation. We first mentioned the memory problem in [4] and came up with a temporary solution, which had not treated the problem properly until this paper. The most efficient solution is given by [3], whose implementation not only solved the memory problem effectively but also ran reasonably fast. The core of this improvement lies in utilizing a Hash table to replace several memory-critical variables in the old code. At this stage, an ordinary user who usually works with a moderately large mesh would be satisfied with the solution provided by [3]. In this work, we push the boundary of this improvement further, to the point that we gain a better speed-up while not losing the advantages of memory efficiency.
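To make the idea behind the Hash table improvement concrete, the sketch below contrasts a dense, globally indexed array with a hash map that stores only the entries a partition actually needs. The function and variable names are illustrative and do not correspond to the actual PARTEL variables.

```python
# Sketch: translating global node numbers to local ones for a single partition.
# 'partition_nodes' is a hypothetical list of the global IDs present in the partition.

def dense_mapping(n_global_nodes, partition_nodes):
    """Old-style dense array: memory grows with the size of the whole mesh."""
    table = [0] * n_global_nodes                        # one slot per global node
    for local_id, global_id in enumerate(partition_nodes, start=1):
        table[global_id - 1] = local_id
    return table

def hashed_mapping(partition_nodes):
    """Hash-table variant: memory grows only with the size of the partition."""
    return {global_id: local_id
            for local_id, global_id in enumerate(partition_nodes, start=1)}

# With a 100-million-node mesh, dense_mapping allocates 100 million entries per
# partition, while hashed_mapping stores just the few nodes listed below.
print(hashed_mapping([12, 57, 4711])[57])   # -> 2
```

In Python the built-in dictionary already plays the role of the Hash table; in the FORTRAN code of PARTEL an explicit Hash table implementation is needed, which is why several implementations are compared in this work.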
Our first work was based on TELEMAC v7p2r3. We had a parallel version of PARTEL that fulfilled the needs of the VieWBay project. This work, however, was rendered obsolete when [2] and our latest work presented in this paper came out.

Our latest work, therefore, focused on improving PARTEL in TELEMAC v8p0r0 by pushing the upper bound of domain decomposition in terms of performance. As the original PARTEL code only runs in serial mode, it first seemed obvious to speed up the code by making it parallel. Second, we applied different implementations of the Hash table used in the original PARTEL to find out the fastest one, thus providing users with several Hash table implementations that best suit their needs. By combining parallelization and the data structure enhancement through an efficient Hash table, each MPI process in our PARTEL code maintains a low memory consumption while achieving a significant gain in performance. Our PARTEL is in fact made embarrassingly parallel and therefore theoretically possesses optimal scaling characteristics.
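The following minimal sketch, written with mpi4py for brevity rather than in the FORTRAN of PARTEL, illustrates the embarrassingly parallel pattern described above: each MPI process handles a disjoint subset of partitions and never needs to exchange data with the others. The function and variable names are illustrative, not those of the real code.

```python
# Minimal sketch of the embarrassingly parallel pattern (not the actual PARTEL code).
from mpi4py import MPI

def write_partition(partition_id):
    # Placeholder for the real work: extract the sub-mesh, renumber its nodes,
    # and write the geometry/boundary files of this partition.
    pass

def decompose(n_partitions):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Round-robin assignment: rank r handles partitions r, r+size, r+2*size, ...
    for partition_id in range(rank, n_partitions, size):
        write_partition(partition_id)

    comm.Barrier()   # only to ensure all partition files exist before the solver starts

if __name__ == "__main__":
    decompose(n_partitions=1024)
```

Because the ranks share no data, the scheme can in theory scale linearly with the number of processes, which is what the term embarrassingly parallel refers to.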
II. METHOD

A. TELEMAC workflow

Our aim is to accomplish the following two tasks:

(1) to translate the serial code of the original PARTEL into an equivalent version that speeds up the process. Parallelizing PARTEL is the natural choice, as it reflects the structure of the original code. This requires that the parallelized PARTEL produce the same output as is expected from its serial counterpart. We call this output consistency.

(2) to apply an efficient implementation of the Hash table in the original PARTEL that further speeds up the process.

Parallelizing PARTEL in TELEMAC is challenging due to its convoluted structure. The lack of documentation makes the code even more difficult to comprehend when trying to identify which parts of the code have the potential to be made parallel. We start by finding the key data structures, their locations, and their meaning, as well as their influence on the rest of the code. After this analysis is done, we separate the code that can be made parallel and adjust for the impacts of the changes.

As an overview, a simulation in TELEMAC is done in three steps (Fig. 1): (1) decomposing the domain, which is carried out by PARTEL; (2) running the simulation code on each domain partition; (3) merging the results from the partitions. These steps are controlled by the TELEMAC wrappers, which are several standalone Python scripts controlling the workflow. The core of TELEMAC is, however, written in FORTRAN.

Fig. 1: The workflow of TELEMAC

B. Domain Decomposition in PARTEL

In parallel mode, TELEMAC applies a domain decomposition scheme with which the data associated with a problem, such as the representation of the geometry and the boundary conditions of the domain, are decomposed. Each parallel task then works on one partition of the domain data.

Domain partitions are the direct result of graph partitioning. Since a mesh representing a 2D domain is topologically a planar graph, it can be decomposed efficiently by a graph partitioning algorithm. For that reason, the graph partitioning code always runs first to create the partition data. PARTEL depends on third-party tools for graph partitioning such as METIS and SCOTCH [7,9]. While both tools are actively supported, METIS is selected by default. The tool is a set of serial programs for partitioning graphs, as in Fig. 2, partitioning finite element meshes, and producing fill-reducing orderings for sparse matrices.
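As a rough illustration of this partitioning step, the sketch below partitions the nodal graph of a toy mesh, assuming the pymetis Python bindings to METIS are available; PARTEL itself calls the METIS library from FORTRAN, and the example mesh and names here are made up.

```python
# Sketch of the graph-partitioning step using the pymetis bindings to METIS.
import pymetis

# Nodal graph of a tiny example mesh: adjacency[i] lists the neighbours of node i.
adjacency = [
    [1, 3],        # node 0
    [0, 2, 4],     # node 1
    [1, 5],        # node 2
    [0, 4],        # node 3
    [1, 3, 5],     # node 4
    [2, 4],        # node 5
]

n_parts = 2
edge_cuts, membership = pymetis.part_graph(n_parts, adjacency=adjacency)

# membership[i] is the partition assigned to node i; a result of this kind is
# what PARTEL uses to group nodes and elements into sub-domains.
print(edge_cuts, membership)
```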
It is worth mentioning that ParMETIS is an MPI-based parallel library that extends the functionality provided by METIS and includes routines that are especially suited for parallel computations and large-scale numerical simulations. ParMETIS is, however, not usable in the release versions of TELEMAC. The result of the graph partitioning is then used by the subsequent codes to create additional partition data such as boundary conditions and parallel information. Finally, all