MapReduce Frameworks on Multiple PVAS for Heterogeneous Computing Systems
Mikiko Sato, Tokyo University of Agriculture and Technology
2 Background
A recent tendency for performance improvement is the increase in the number of CPU cores combined with accelerators (GPGPU, Intel Xeon Phi) that cooperate with the host.
[Figure: an application program spawns tasks on both CPUs. Many-core tasks run on the many-core CPU under a light-weight kernel and handle high-parallel computational processing, while multi-core tasks run on the multi-core CPU under Linux and handle I/O and low-parallel, high-latency processing.]
The multi-core and many-core CPUs provide differing computational performance, parallelism, and latency.
The important issue is how to improve application performance by using both types of CPU cooperatively.
3 MapReduce framework
Big data analytics has been identified as an exciting area for both academia and industry. The MapReduce framework is a popular programming framework for big data analytics and scientific computing [1].
MapReduce was originally designed for distributed computing and has been extended to various architectures (HPC systems [2], GPGPUs [3], many-core CPUs [4]).
MapReduce on a heterogeneous system with Xeon Phi:
• The hardware features of the Xeon Phi achieve high performance (512-bit VPUs, MIMD thread parallelism, coherent L2 cache, etc.).
• The host processor assists the data transfer for MapReduce processing.
[1] Welcome to Apache Hadoop (online), available from http://hadoop.apache.org.
[2] "K MapReduce: A scalable tool for data-processing and search/ensemble applications on large-scale supercomputers," M. Matsuda, et al., in CLUSTER, IEEE Computer Society, pp. 1-8, 2013.
[3] "Mars: A MapReduce framework on graphics processors," B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, in PACT, pp. 1-8, 2008.
[4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor," M. Lu, et al., in Big Data, IEEE International Conference on, pp. 125-130, Oct. 2013.
4 Previous MapReduce frameworks on Xeon Phi
MRPhi [4]: an optimized MapReduce framework for the Xeon Phi coprocessor.
• Uses the SIMD VPUs for the map phase, SIMD hash computation algorithms, MIMD hyper-threading, etc.
• pthreads are used for Master/Worker task control on the Xeon Phi.
• The important performance issues are both utilizing the advanced Xeon Phi features and controlling threads effectively.
MrPhi [5]: the expanded version of MRPhi [4].
• The MapReduce operation and the data are transferred separately from the host to the Xeon Phi.
• MPI communication is used for data transfer and synchronization between the host and the Xeon Phi.
• The communication overhead becomes one of the factors limiting MapReduce performance.
[4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor," M. Lu, et al., in Big Data, IEEE International Conference on, pp. 125-130, Oct. 2013.
[5] "MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors," M. Lu, et al., IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1-14, 2014.
5 Inter-task communications
The turn-around time (TAT) of a null function call in the Xeon Phi offloading scheme was measured as a reference point for this study.
[Figure: a Delegator task on the local CPU writes a request (8 to 128 bytes) into a buffer, the Delegatee task on the remote CPU polls the buffer and writes back an 8-byte result, and the TAT of this round trip is measured. Measured on a Xeon E5-2670 with MPSS 3-2.1.6720-13; the request size varies between 8 and 128 bytes while the response is fixed at 8 bytes.]
The communication overhead is large when sending small data between the host and the Xeon Phi.
→ It is important to reduce the communication cost between the host and the Xeon Phi as much as possible for MapReduce performance.
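As a rough illustration of how such a turn-around time can be measured, the following is a minimal sketch assuming the Intel compiler's offload pragma for the Xeon Phi is available; the function name, buffer sizes, and iteration count are illustrative and are not taken from the measurement code used in the slides.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITER 1000

/* hypothetical null function offloaded to the Xeon Phi */
__attribute__((target(mic))) void null_func(const char *req, int len, char *res)
{
    (void)req; (void)len;
    memset(res, 0, 8);   /* only the fixed 8-byte result is written back */
}

int main(void)
{
    char request[128] = {0}, result[8];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITER; i++) {
        /* request payload of up to 128 bytes, fixed 8-byte response */
        #pragma offload target(mic:0) in(request:length(128)) out(result:length(8))
        null_func(request, 128, result);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / 1e3;
    printf("average TAT: %.3f us\n", us / ITER);
    return 0;
}
```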
6 Issues & Goal
In order to obtain high performance on hybrid-architecture systems, it is important to:
• perform inter-task communication with less overhead
• execute processing on the suitable CPU, considering the differences in performance and characteristics between the CPUs
Goal: enable cooperation with little overhead between tasks for a MapReduce framework on a hybrid system.
To realize this program execution environment, "Multiple PVAS" (Multiple Partitioned Virtual Address Space) is provided as system software for task collaboration with less overhead on the hybrid-architecture system.
7 Task Model on PVAS [1]
The task model of M-PVAS is based on PVAS [1].
• The PVAS system assigns one partition to one PVAS task.
• PVAS tasks execute using their own PVAS partitions within the same PVAS address space.
→ PVAS tasks can communicate by reading/writing virtual addresses in the PVAS address space, without using a separate shared memory.
[Figure: a PVAS address space on the many-core CPU is divided into partitions (PVAS Task #1 ... PVAS Task #M); each partition holds the task's TEXT, DATA & BSS, HEAP, and STACK, with the kernel region on top, and the tasks of the PVAS application program are exported into these partitions.]
[1] Shimada, A., Gero, B., Hori, A. and Ishikawa, Y.: Proposing a new task model towards many-core architecture, in MES '13.
8 M-PVAS Task Model
M-PVAS maps a number of PVAS address spaces onto a single virtual address space, the "Multiple PVAS address space".
• PVAS tasks belonging to the same Multiple PVAS address space can access the other PVAS address spaces, even if they run on a different CPU.
→ M-PVAS tasks can communicate with another M-PVAS task simply by accessing the virtual address, which makes it convenient to develop parallel programs that collaborate across different CPUs.
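The following conceptual sketch illustrates the kind of communication this model enables: once two tasks share one virtual address space, one task can publish data with a plain store and the other can read it with a plain load. It assumes a PVAS-like runtime has already placed both tasks in the same address space; no real PVAS/M-PVAS API calls are shown, and all names are hypothetical.

```c
#include <stdint.h>

/* A message structure assumed to live at an address exported by one task
 * and visible to the other because both share the virtual address space. */
typedef struct {
    volatile int32_t ready;    /* set by the producer when payload is valid */
    int32_t          payload;  /* read directly by the consumer             */
} shared_msg_t;

shared_msg_t shared_region;    /* hypothetical exported region */

/* Producer task (e.g., a task on the multi-core host CPU) */
void produce(int32_t value)
{
    shared_region.payload = value;
    __sync_synchronize();          /* make the payload visible first */
    shared_region.ready = 1;       /* plain store, no IPC layer      */
}

/* Consumer task (e.g., a task on the many-core CPU) */
int32_t consume(void)
{
    while (!shared_region.ready)   /* busy-wait on the shared flag   */
        ;
    __sync_synchronize();
    return shared_region.payload;
}
```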
9 Basic Design of M-PVAS MapReduce
M-PVAS MapReduce was designed based on MRPhi [4] and uses the same MapReduce processing model:
• The host sends the MapReduce data to the Xeon Phi repeatedly.
• Workers execute the MapReduce operation, each accessing its own part of the data.
Only the inter-task communication and the task control parts are changed, in order to compare the performance gain of the original methods against the M-PVAS methods:
(MRPhi) MPI communication + pthread task control vs. (M-PVAS) shared M-PVAS address space + M-PVAS task control.
10 Master/Worker Task Control on M-PVAS
The Master task controls the Worker tasks:
① The Master task notifies the Worker tasks of the MapReduce control data (processing information (Map or Reduce), the number of Worker tasks, the MapReduce data address and size, the MapReduce result data address, etc.) ← the same as the pthread version.
②③ The Master and Worker tasks synchronize using busy-waiting flags and an atomic counter ← simple flag sensing is expected to give better performance.
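Below is a minimal sketch of one possible realization of this control scheme, using a shared round counter as the busy-wait flag and a GCC atomic built-in for the completion counter. The structure layout and field names are assumptions for illustration, not the actual M-PVAS implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_WORKERS 239

typedef struct {
    volatile int32_t round;        /* busy-wait flag: bumped by the Master per phase */
    volatile int32_t done_count;   /* atomic counter incremented by finished Workers */
    int              op;           /* MAP or REDUCE                                  */
    void            *data;         /* MapReduce data address                         */
    size_t           size;
    void            *result;       /* MapReduce result data address                  */
} mr_control_t;

static mr_control_t ctrl;          /* assumed to live in the shared M-PVAS space */

/* (1) The Master publishes the control data, (2) releases the Workers,
 * (3) then busy-waits until the atomic counter reaches NUM_WORKERS. */
void master_dispatch(int op, void *data, size_t size, void *result)
{
    ctrl.op = op; ctrl.data = data; ctrl.size = size; ctrl.result = result;
    ctrl.done_count = 0;
    __sync_synchronize();              /* control data must be visible first */
    ctrl.round++;                      /* release the Workers for this phase */
    while (ctrl.done_count < NUM_WORKERS)
        ;                              /* wait for all Workers to finish     */
}

/* Each Worker busy-waits on the round flag, processes its slice of the
 * data, and then atomically bumps the completion counter. */
void worker_loop(int worker_id)
{
    (void)worker_id;                   /* the id would select this Worker's data slice */
    int32_t seen = 0;
    for (;;) {
        while (ctrl.round == seen)     /* (2) busy-wait on the flag          */
            ;
        seen = ctrl.round;
        /* ... run Map or Reduce on this Worker's part of ctrl.data ... */
        __sync_fetch_and_add(&ctrl.done_count, 1);   /* (3) atomic counter   */
    }
}
```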
11 Data transfer for MapReduce processing
Non-blocking data transfer is employed by both the Sender task on the host system and the Master task on the many-core system:
• The Sender task gets the request from the Master task and transfers the data.
• Double buffering requires two buffers: one receives the next data chunk while the other holds the current chunk being processed.
• The Workers divide the receive-buffer data and each executes its Map processing.
With this control, computation and data transfer can be overlapped, which is expected to give better performance. A sketch of this scheme follows.
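The sketch below illustrates the double-buffering idea under stated assumptions: request_transfer(), wait_transfer(), and process_chunk() are placeholder stubs standing in for the actual transfer-request and Map routines, and the chunk count and size are arbitrary.

```c
#include <stddef.h>

#define NCHUNKS    64
#define CHUNK_SIZE (1 << 20)

static char buf[2][CHUNK_SIZE];    /* the two receive buffers */

/* Placeholder stubs for the actual transfer-request and Map routines. */
static void request_transfer(int chunk, char *dst, size_t size)
{ (void)chunk; (void)dst; (void)size; /* ask the Sender to copy chunk into dst */ }
static void wait_transfer(int chunk)
{ (void)chunk; /* block (or poll) until the chunk has arrived */ }
static void process_chunk(const char *src, size_t size)
{ (void)src; (void)size; /* Workers run the Map phase on this chunk */ }

void receive_and_map(void)
{
    int cur = 0;
    request_transfer(0, buf[cur], CHUNK_SIZE);       /* prefetch the first chunk */

    for (int i = 0; i < NCHUNKS; i++) {
        int next = cur ^ 1;
        wait_transfer(i);                            /* chunk i is now in buf[cur] */
        if (i + 1 < NCHUNKS)
            request_transfer(i + 1, buf[next], CHUNK_SIZE);  /* overlap: fetch i+1 */
        process_chunk(buf[cur], CHUNK_SIZE);         /* map chunk i while i+1 is in flight */
        cur = next;
    }
}
```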
12 Implementations of data transfer
M-PVAS: the Master writes the buffer address and size information in the Master's address space; the Sender checks them and simply copies the memory with the memcpy() function.
MRPhi: MRPhi uses the MPI_Irecv() and MPI_Wait() functions to receive data asynchronously.
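For illustration, the following sketch contrasts the two paths: the M-PVAS side is reconstructed from the description above (address and size published in the shared address space, then a plain memcpy()), while the MRPhi-style side uses the standard MPI_Irecv()/MPI_Wait() non-blocking receive pattern. The request structure and function names are assumptions, not code from either framework.

```c
#include <stddef.h>
#include <string.h>
#include <mpi.h>

/* --- M-PVAS path: the Sender polls a request the Master has written in
 *     the shared address space and then copies the data with memcpy(). */
typedef struct {
    void *volatile  dst;     /* receive-buffer address written by the Master */
    volatile size_t size;    /* transfer size written by the Master          */
    volatile int    ready;   /* Master sets this when dst/size are valid     */
} xfer_req_t;

void sender_mpvas(xfer_req_t *req, const void *src)
{
    while (!req->ready)               /* Sender polls the request              */
        ;
    memcpy(req->dst, src, req->size); /* plain memcpy through the shared space */
    req->ready = 0;
}

/* --- MRPhi-style path: non-blocking MPI receive on the Xeon Phi side. */
void receive_mpi(void *dst, int count, int src_rank)
{
    MPI_Request r;
    MPI_Irecv(dst, count, MPI_BYTE, src_rank, 0, MPI_COMM_WORLD, &r);
    /* ... MapReduce work on the previous chunk can proceed here ... */
    MPI_Wait(&r, MPI_STATUS_IGNORE);  /* block until the chunk has arrived     */
}
```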
13 Evaluation
Execution environment for M-PVAS MapReduce:
• Xeon Phi: Master task = 1, Worker tasks = 239
• Host (Xeon): Sender task = 1
Benchmark: Monte Carlo, which shows good performance on the Xeon Phi.
Many-core: Intel Xeon Phi 5110P (60 cores, 240 threads, 1.053 GHz), GDDR5 8 GB memory, Linux 2.6.38
Multi-core: Intel Xeon E5-2650 x2 (8 cores, 16 threads, 2.6 GHz), DDR3 64 GB memory, Linux 2.6.32 (CentOS 6.3)
Intel CCL: MPSS Version 3.4.3; MPI: IMPI Version 5.0.1.035
14 Summary
In this study, the task execution model "Multiple Partitioned Virtual Address Space (M-PVAS)" is applied to the MapReduce framework. The effect of the M-PVAS model is estimated with the MapReduce benchmark Monte Carlo. At the current state, M-PVAS MapReduce shows better performance than the original MapReduce framework.
• M-PVAS achieves around 1.8 ~ 2.0 times speedup.
• The main factor is the data transfer processing.
Future Work
• Investigate the factors behind the performance improvement more deeply.
• Experiment with different benchmarks.