On the Development and Optimization of Hybrid Parallel Codes for Integral Equation Formulations
7th European Conference on Antennas and Propagation (EuCAP 2013), Swedish Exhibition & Congress Centre, Gothenburg, Sweden, 08-12 April 2013
Alejandro Álvarez-Melcón, Fernando D. Quesada, Domingo Giménez, Carlos Pérez-Alcaraz, Tomás Ramírez, and José Ginés Picón
alejandro.alvarez@upct.es; domingo@um.es
Universidad Politécnica de Cartagena / Universidad de Murcia
ETSI Telecomunicación / Facultad de Informática
Dpto. Tecnologías de la Información y las Comunicaciones / Dpto. de Informática y Sistemas
Signal Theory and Communications
Álvarez/Giménez, et. al. (UPCT/UMU) EuCAP 2013/ 8-12 April 2013 EuCAP 2013 1 / 18
Outline
1. Introduction and motivation
2. Computation of Green’s functions on hybrid systems
3. Parallelization in CC-NUMA at MoM level of a VIE technique
4. Autotuning parallel codes
5. Conclusions
Introduction and motivation

Motivation of the work
1. High interest in the development of full-wave techniques based on Integral Equation formulations for the analysis of microwave components and antennas.
2. Need for efficient software tools that allow optimization of complex devices in real time.
3. As device complexity grows, computational time increases as the cube of the problem size.

Identification of bottlenecks
Two important elements in integral equation formulations:
1. Calculation of Green’s functions inside waveguides may be slow due to the low convergence rate of the series (images, modes).
2. In Volume Integral Equation formulations, the size of the MoM matrices increases as N^3.
Introduction and motivation

Objectives of the work
1. Increase efficiency using parallel computing.
2. Several hybrid-heterogeneous parallelism strategies are proposed in this context.

Strategies explored
1. At a low level, hybrid parallelism (MPI+OpenMP+CUDA) is applied to the computation of Green’s functions in rectangular waveguides.
2. At a higher level, two-level parallelism (OpenMP and multithreaded MKL routines) is combined on cc-NUMA systems to accelerate MoM solutions in the VIE formulation.
3. Possibilities for using autotuning strategies.
Computation of Green’s functions on hybrid systems

Hybrid parallelism
MPI+OpenMP, OpenMP+CUDA and MPI+OpenMP+CUDA routines are developed to accelerate the calculation of 2D waveguide Green’s functions.

For each MPI process P_k, 0 ≤ k < p:
    omp_set_num_threads(h + g)
    for i = k·m/p to (k + 1)·m/p − 1 do
        node = omp_get_thread_num()
        if node < h then
            Compute with OpenMP thread
        else
            Call to CUDA kernel
        end if
    end for

As seen, p MPI processes are started.
In addition, h + g threads run inside each process.
Threads 0 to h − 1 work on the CPU (OpenMP, OMP).
The remaining threads, from h to h + g − 1, work on the GPU by calling CUDA kernels.
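The work distribution in the pseudocode above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: m is taken to be the total number of independent work items (the slides do not define it), the bounds k·m/p are computed with integer division, and the helper names block_range and thread_role are hypothetical.

```python
from typing import Tuple


def block_range(k: int, m: int, p: int) -> Tuple[int, int]:
    """Half-open range [start, stop) of work items owned by MPI process k.

    Mirrors the loop bounds i = k*m/p .. (k+1)*m/p - 1 of the pseudocode.
    Integer division is an assumption; it keeps the blocks contiguous and
    covers all m items even when p does not divide m.
    """
    if p < 1 or not 0 <= k < p or m < 0:
        raise ValueError("need p >= 1, 0 <= k < p and m >= 0")
    return (k * m) // p, ((k + 1) * m) // p


def thread_role(thread_id: int, h: int, g: int) -> str:
    """Role of a thread inside one process: the first h threads compute on
    the CPU, the remaining g threads launch GPU kernels."""
    if h < 0 or g < 0 or not 0 <= thread_id < h + g:
        raise ValueError("need h, g >= 0 and 0 <= thread_id < h + g")
    return "cpu" if thread_id < h else "gpu"
```

For example, with m = 10 items and p = 3 processes, the three processes receive the ranges [0, 3), [3, 6) and [6, 10). How the iterations inside one block are shared among the h + g threads is not specified on the slide, so the sketch leaves that step out.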
Computation of Green’s functions on hybrid systems

Routines developed
p \ h+g    1+0    h+0        0+g         h+g
1          SEQ    OMP        CUDA        OMP+CUDA
p          MPI    MPI+OMP    MPI+CUDA    MPI+OMP+CUDA

Computational systems tested
Saturno is a NUMA system with 24 Intel Xeon cores at 1.87 GHz and 32 GB of shared memory, plus an NVIDIA Tesla C2050 with a total of 448 CUDA cores, 2.8 GB and 1.15 GHz.
Marte and Mercurio are AMD Phenom II X6 1075T (hexa-core) machines at 3 GHz, with 15 GB (Marte) and 8 GB (Mercurio), plus an NVIDIA GeForce GTX 590 with two devices, with 512 CUDA cores; the machines are connected in a homogeneous cluster.
Luna is an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB, plus an NVIDIA GeForce 9800 GT with a total of 112 CUDA cores.
All of them are connected in a heterogeneous cluster.
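The table of routines can be read as a mapping from the three parameters (p, h, g) to a routine variant. The following sketch encodes that reading; the function name routine_name is hypothetical, and treating any combination with h ≥ 1 and g ≥ 1 as belonging to the h+g column is an assumption about how the table should be interpreted.

```python
def routine_name(p: int, h: int, g: int) -> str:
    """Name of the routine variant for p MPI processes, h CPU threads and
    g GPU-driving threads per process, following the table of routines."""
    if p < 1 or h < 0 or g < 0 or h + g < 1:
        raise ValueError("need p >= 1, h >= 0, g >= 0 and h + g >= 1")
    parts = []
    if p > 1:
        parts.append("MPI")   # more than one process
    if h > 1 or (h >= 1 and g >= 1):
        parts.append("OMP")   # several CPU threads, or CPU threads next to GPU threads
    if g >= 1:
        parts.append("CUDA")  # at least one thread launches kernels
    return "+".join(parts) if parts else "SEQ"
```

With these rules, (p, h, g) = (1, 1, 0) gives SEQ and (2, 4, 3) gives MPI+OMP+CUDA, matching the corner entries of the table.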
Computation of Green’s functions on hybrid systems

Comparison between use of CPU versus use of GPU
Test of computational speed when CPUs or GPUs are used.
The CPU version uses a number of threads equal to the number of cores.
The plot is presented as a function of the problem size (#images, #points).
S = T(#threads=#cores) / T(#kernels=3).
S > 1 means the GPU is preferred over the CPU.
Computation of Green’s functions on hybrid systems

Comparison between GPU and optimum parameters
Selecting the optimum values of p, h and g produces lower execution times than blind GPU use.
The plot is presented as a function of the problem size (#images, #points).
S = T(#kernels=3) / T(lowest).
S > 1 means the GPU-only run is slower than the lowest-time configuration.
A speed-up of two is obtained for large problems using the optimum parameters.
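One simple way to obtain the lowest-time configuration and the ratio S is to measure a set of (p, h, g) combinations and keep the fastest. The sketch below assumes the timings are already available as a dictionary keyed by (p, h, g); the slides do not say how the optimum was searched, so the exhaustive minimum used here, and the names best_configuration and speedup_over_lowest, are assumptions for illustration.

```python
from typing import Dict, Tuple

Config = Tuple[int, int, int]  # (p, h, g)


def best_configuration(times: Dict[Config, float]) -> Tuple[Config, float]:
    """Return the (p, h, g) combination with the lowest measured time,
    together with that time."""
    if not times:
        raise ValueError("no measurements given")
    config = min(times, key=times.get)
    return config, times[config]


def speedup_over_lowest(times: Dict[Config, float], reference: Config) -> float:
    """S = T(reference) / T(lowest). With reference = (1, 0, 3) this is the
    ratio T(#kernels=3) / T(lowest) used in the comparison above."""
    _, lowest = best_configuration(times)
    return times[reference] / lowest
```

By construction S is at least 1, and S = 1 means that the GPU-only run is already the fastest of the measured configurations.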
Computation of Green’s functions on hybrid systems

Comparison homogeneous - heterogeneous cluster
Combining nodes with different computational speeds, different numbers of cores and different GPUs produces an additional reduction of the execution time.
Different values of p, h and g are used for different nodes.
The plot is presented as a function of the problem size (#images, #points).
S = T(#kernels=3*#nodes) / T(lowest).
Important reduction of the execution time with the heterogeneous cluster.
The execution time is closer to the lowest experimental time.