  1. MapReduce Frameworks on Multiple PVAS for Heterogeneous Computing Systems. Mikiko Sato, Tokyo University of Agriculture and Technology

  2. Background
  • A recent tendency for performance improvement is an increase in the number of CPU cores together with accelerators (GPGPU, Intel Xeon Phi).
  [Figure: an application's tasks are split between a multi-core CPU running a multi-core OS (Linux) for I/O and low-parallel, high-latency processing, and a many-core CPU running a many-core OS (light-weight kernel) for highly parallel computational processing.]
  • The multi-core and many-core CPUs provide differing computational performance, parallelism, and latency.
  → The important issue is how to improve application performance using both types of CPU cooperatively.

  3. MapReduce framework
  • Big data analytics has been identified as an exciting area for both academia and industry.
  • MapReduce is a popular programming framework for big data analytics and scientific computing [1].
  • MapReduce was originally designed for distributed computing and has been extended to various architectures (HPC systems [2], GPGPUs [3], many-core CPUs [4]).
  • MapReduce on a heterogeneous system with Xeon Phi:
    - The hardware features of the Xeon Phi achieve high performance (512-bit VPUs, MIMD thread parallelism, coherent L2 cache, etc.).
    - The host processor assists the data transfer for MapReduce processing.
  [1] Welcome to Apache Hadoop (online), available from http://hadoop.apache.org.
  [2] "K MapReduce: A scalable tool for data-processing and search/ensemble applications on large-scale supercomputers", M. Matsuda et al., in CLUSTER, IEEE Computer Society, pp. 1-8, 2013.
  [3] "Mars: a MapReduce framework on graphics processors", B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, in PACT, pp. 1-8, 2008.
  [4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor", M. Lu et al., IEEE International Conference on Big Data, pp. 125-130, Oct. 2013.

  4. Previous MapReduce frameworks on Xeon Phi
  • MRPhi [4]
    - An optimized MapReduce framework for the Xeon Phi coprocessor.
    - Uses the SIMD VPUs for the map phase, SIMD hash computation algorithms, MIMD hyper-threading, etc.
    - pthreads are used for Master/Worker task control on the Xeon Phi.
    → The important issues for performance are both utilizing the advanced Xeon Phi features and effective thread control.
  • MrPhi [5]
    - The expanded version of MRPhi [4]. The MapReduce operation and data are transferred separately from the host to the Xeon Phi.
    - MPI communication is used for data transfer and synchronization between the host and the Xeon Phi.
    → The communication overhead becomes one of the factors limiting MapReduce performance.
  [4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor", M. Lu et al., IEEE International Conference on Big Data, pp. 125-130, Oct. 2013.
  [5] "MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors", M. Lu et al., IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1-14, 2014.

  5. Inter-task communications
  • Turn-around times (TAT) of a null function call in the Xeon Phi offloading scheme are measured as the reference point for our study (the round trip is sketched below).
  [Figure: a Delegator task on the local CPU writes a request (8 to 128 bytes) into a buffer; a Delegatee task on the remote CPU polls the buffer and writes back an 8-byte result; the turn-around time of this round trip is measured (Xeon E5-2670, MPSS 3-2.1.6720-13).]
  • The communication overhead is large when sending small data between the host and the Xeon Phi.
  → It is important to reduce the communication cost between the host and the Xeon Phi as much as possible for MapReduce performance.
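
As a rough picture of what is being measured above, the sketch below times one request/response round trip through a shared buffer with polling on both sides. This is a minimal illustration, not the measurement code behind the slide: the buffer layout, field names, and timing granularity are assumptions.

```c
/* Minimal sketch of the TAT idea: a Delegator writes a request into a shared
 * buffer, a Delegatee polls it and writes back an 8-byte result, and the
 * Delegator times the round trip.  All names and sizes are illustrative. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef struct {
    _Atomic int req_ready;   /* set by the Delegator when the request is written */
    _Atomic int res_ready;   /* set by the Delegatee when the result is written  */
    char        req[128];    /* request payload, 8 to 128 bytes                  */
    uint64_t    res;         /* fixed 8-byte response                            */
} comm_buf_t;

/* Delegator side: send one request of 'len' bytes and wait for the result;
 * returns the turn-around time in microseconds. */
static double round_trip_us(comm_buf_t *buf, const char *payload, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    memcpy(buf->req, payload, len);
    atomic_store_explicit(&buf->req_ready, 1, memory_order_release);
    while (!atomic_load_explicit(&buf->res_ready, memory_order_acquire))
        ;                                         /* busy-wait for the reply */
    atomic_store(&buf->res_ready, 0);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

The Delegatee side (not shown) would poll req_ready, fill in res, and then set res_ready.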

  6. Issues & Goal
  • In order to obtain high performance on hybrid-architecture systems, it is important to
    - perform inter-task communication with less overhead;
    - execute processing on the suitable CPU, considering the differences in performance and characteristics between the CPUs.
  • Goal
    - Enable cooperation with little overhead between tasks for a MapReduce framework on a hybrid system.
    - To realize this program execution environment, "Multiple PVAS" (Multiple Partitioned Virtual Address Space) is provided as system software for task collaboration with low overhead on the hybrid-architecture system.

  7. Task Model on PVAS [1]
  • The task model of M-PVAS is based on PVAS [1].
    - The PVAS system assigns one partition to one PVAS task.
    - PVAS tasks execute using their own PVAS partitions within the same PVAS address space.
  → PVAS tasks can communicate by reading/writing virtual addresses in the PVAS address space, without using separate shared memory.
  [Figure: a PVAS address space on a many-core CPU is divided into partitions, one per PVAS task (Task #1 ... Task #M); each partition holds the task's TXT, DATA & BSS, HEAP, and STACK regions (plus an export area), above the kernel.]
  [1] Shimada, A., Gerofi, B., Hori, A. and Ishikawa, Y.: Proposing a new task model towards many-core architecture, MES '13.

  8. M-PVAS Task Model
  • M-PVAS maps a number of PVAS address spaces onto a single virtual address space, the "Multiple PVAS Address Space".
  • PVAS tasks belonging to the same Multiple PVAS address space can access the other PVAS address spaces, even those on a different CPU.
  → M-PVAS tasks can communicate with one another simply by accessing virtual addresses (see the sketch below). This makes it convenient to develop parallel programs in which different CPUs collaborate.
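
Because every task sees the same virtual addresses, handing data to a task on another CPU can reduce to writing a small descriptor the peer already knows how to find. The sketch below only illustrates that idea under assumed names; the real PVAS/M-PVAS interface for locating a peer's partition is not shown here.

```c
/* Illustrative only: a work descriptor placed at an agreed location inside
 * the Multiple PVAS address space.  The structure layout is hypothetical;
 * a real program would obtain the peer's export region via the M-PVAS API. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    _Atomic int ready;   /* producer sets to 1 once the fields below are valid */
    void       *data;    /* pointer to a MapReduce chunk, valid for both tasks */
    size_t      size;    /* chunk size in bytes                                */
} work_desc_t;

/* Producer task (e.g. the Sender on the host). */
static void publish_work(work_desc_t *desc, void *chunk, size_t size)
{
    desc->data = chunk;
    desc->size = size;
    atomic_store_explicit(&desc->ready, 1, memory_order_release);
}

/* Consumer task (e.g. the Master on the many-core CPU). */
static void wait_for_work(work_desc_t *desc, void **chunk, size_t *size)
{
    while (!atomic_load_explicit(&desc->ready, memory_order_acquire))
        ;                                /* busy-wait, as the slides suggest */
    *chunk = desc->data;
    *size  = desc->size;
}
```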

  9. Basic Design of M-PVAS MapReduce
  • M-PVAS MapReduce was designed based on MRPhi [4], using the same MapReduce processing model:
    - the host sends the MapReduce data to the Xeon Phi repeatedly;
    - the Workers execute the MapReduce operation, each accessing its own part of the data.
  • Only the inter-task communication and the task control parts are changed, to compare the performance gain of the M-PVAS methods against the pthread and MPI interfaces:
    - (MRPhi) MPI communication and pthread task control vs. (M-PVAS) a shared M-PVAS address space and M-PVAS task control.

  10. Master/Worker Task Control on M-PVAS
  • The Master task controls the Worker tasks:
    - the Master task notifies the Worker tasks of the MapReduce control data (fig. ①), the same as with pthreads;
    - the Master and Worker tasks synchronize using busy-waiting flags and an atomic counter (fig. ②, ③); this simple flag sensing is expected to give better performance (see the sketch below).
  [Figure: the control data (①) consists of the processing information (Map or Reduce), the number of Worker tasks, the MapReduce data address and size, the MapReduce result data address, etc.]
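
A minimal sketch of this control scheme, assuming the control structure and function names below (they are not from the deck): the Master fills the control data, raises a start flag that the Workers busy-wait on, and then busy-waits itself on an atomic counter until every Worker has reported completion.

```c
/* Sketch of Master/Worker control with a busy-waiting flag (②) and an
 * atomic completion counter (③).  Layout and names are assumptions. */
#include <stdatomic.h>
#include <stddef.h>

enum { PHASE_MAP = 0, PHASE_REDUCE = 1 };

typedef struct {
    int         phase;        /* ① processing information: Map or Reduce */
    int         num_workers;  /* ① number of Worker tasks                */
    void       *data;         /* ① MapReduce data address                */
    size_t      size;         /* ① MapReduce data size                   */
    void       *result;       /* ① MapReduce result data address         */
    _Atomic int start;        /* ② busy-waiting start flag               */
    _Atomic int remaining;    /* ③ count of Workers still running        */
} control_t;

/* Master: start one phase and wait until all Workers are done. */
static void master_run_phase(control_t *c, int phase)
{
    c->phase = phase;
    atomic_store(&c->remaining, c->num_workers);
    atomic_store_explicit(&c->start, 1, memory_order_release);        /* ② go   */
    while (atomic_load_explicit(&c->remaining, memory_order_acquire) > 0)
        ;                                                             /* ③ wait */
    atomic_store(&c->start, 0);
}

/* Worker: wait for the flag, process its share, then signal completion. */
static void worker_run_once(control_t *c, int id, void (*do_chunk)(control_t *, int))
{
    while (!atomic_load_explicit(&c->start, memory_order_acquire))
        ;                                        /* ② busy-wait on the flag  */
    do_chunk(c, id);                             /* Map or Reduce over chunk */
    atomic_fetch_sub_explicit(&c->remaining, 1, memory_order_acq_rel);   /* ③ */
}
```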

  11. Data transfer for MapReduce processing
  • Non-blocking data transfer is employed by both the Sender task on the host system and the Master task on the many-core system.
    - The Sender task gets the request from the Master task and transfers the data.
    - Double buffering requires two buffers, with one used to receive the next data chunk while the other holds the current chunk being processed.
    - The Workers divide the receive-buffer data and each execute their Map processing.
  → With this control, computation and data transfer can be overlapped, which is expected to give better performance (see the sketch below).
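
The double-buffering scheme can be pictured as a loop that swaps two buffers each iteration, issuing the receive of chunk i+1 before processing chunk i. The sketch below is generic; request_chunk(), wait_chunk(), and process_chunk() are hypothetical placeholders for the framework's transfer and Map routines.

```c
/* Generic double-buffering sketch: overlap receiving the next chunk with
 * processing the current one.  The three extern functions are placeholders. */
#include <stddef.h>

#define CHUNK_BYTES (4 * 1024 * 1024)

extern int    request_chunk(void *dst, size_t cap);     /* start a non-blocking receive */
extern size_t wait_chunk(int handle);                   /* wait; returns bytes, 0 = end */
extern void   process_chunk(const void *src, size_t n); /* run Map over one chunk       */

void stream_chunks(void)
{
    static char buf[2][CHUNK_BYTES];
    int cur = 0;

    int h = request_chunk(buf[cur], CHUNK_BYTES);       /* prefetch the first chunk */
    for (;;) {
        size_t n = wait_chunk(h);                       /* chunk 'cur' is now ready */
        if (n == 0)
            break;                                      /* no more data             */
        int nxt = cur ^ 1;
        h = request_chunk(buf[nxt], CHUNK_BYTES);       /* start receiving the next */
        process_chunk(buf[cur], n);                     /* ...while processing this */
        cur = nxt;
    }
}
```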

  12. Implementations of Data transfer
  • M-PVAS
    - The Master writes the buffer address and size information in the Master address space; the Sender checks them and simply copies the data with the memcpy() function.
  • MRPhi
    - MRPhi uses the MPI_Irecv() and MPI_Wait() functions to receive data asynchronously (both approaches are sketched below).
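
The contrast between the two implementations can be sketched as below. The M-PVAS side assumes the Sender can poll a descriptor in the Master's address space directly (the descriptor layout is illustrative); the MRPhi side is the standard non-blocking MPI receive pattern named on the slide.

```c
/* The two transfer styles from the slide, sketched side by side. */
#include <mpi.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* M-PVAS style: the Sender reads the Master's descriptor and calls memcpy(). */
typedef struct {
    _Atomic int requested;  /* Master sets this when it wants the next chunk  */
    void       *dst;        /* destination buffer in the Master's partition   */
    size_t      size;       /* requested size in bytes                        */
} xfer_desc_t;

static void sender_mpvas(xfer_desc_t *d, const void *src)
{
    while (!atomic_load_explicit(&d->requested, memory_order_acquire))
        ;                                    /* poll the shared descriptor    */
    memcpy(d->dst, src, d->size);            /* direct copy, no message layer */
    atomic_store_explicit(&d->requested, 0, memory_order_release);
}

/* MRPhi style: non-blocking MPI receive of a chunk from the host rank. */
static void receiver_mrphi(void *dst, int count, int host_rank)
{
    MPI_Request req;
    MPI_Irecv(dst, count, MPI_BYTE, host_rank, /* tag */ 0, MPI_COMM_WORLD, &req);
    /* ...other work could overlap here... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* complete the asynchronous recv */
}
```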

  13. Evaluation
  • Execution environment for M-PVAS MapReduce
    - Xeon Phi: Master task = 1, Worker tasks = 239
    - Host (Xeon): Sender task = 1
  • Benchmark
    - Monte Carlo, which shows good performance on the Xeon Phi (an illustrative map/reduce version is sketched below).
  • Hardware and software
    - Many-core: CPU Intel Xeon Phi 5110P (60 cores, 240 threads, 1.053 GHz); Memory GDDR5 8 GB; OS Linux 2.6.38
    - Multi-core: CPU Intel Xeon E5-2650 x2 (8 cores, 16 threads, 2.6 GHz); Memory DDR3 64 GB; OS Linux 2.6.32 (CentOS 6.3)
    - Intel CCL: MPSS Version 3.4.3; MPI: IMPI Version 5.0.1.035
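
The deck does not show the benchmark itself; as a hedged illustration of how a Monte Carlo computation (here, estimating pi) fits the map/reduce pattern, each map step could count random points falling inside the unit circle and the reduce step sums the counts. The worker count matches the slide, but everything else below is an assumption.

```c
/* Illustrative Monte Carlo (pi estimation) written as map + reduce.
 * This is a guess at the benchmark's style, not the evaluated code. */
#include <stdio.h>
#include <stdlib.h>

/* Map: one worker counts how many of its random points fall in the circle. */
static long map_count_hits(unsigned seed, long samples)
{
    long hits = 0;
    for (long i = 0; i < samples; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return hits;
}

int main(void)
{
    const int  workers = 239;               /* Worker tasks, as on the slide */
    const long samples_per_worker = 1000000;
    long total_hits = 0;

    /* Reduce: sum the per-worker counts (done serially here for brevity). */
    for (int w = 0; w < workers; w++)
        total_hits += map_count_hits((unsigned)w + 1, samples_per_worker);

    double pi = 4.0 * (double)total_hits /
                ((double)workers * (double)samples_per_worker);
    printf("pi ~= %f\n", pi);
    return 0;
}
```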

  14. Summary
  • In this study, the task execution model "Multiple Partitioned Virtual Address Space" (M-PVAS) is applied to the MapReduce framework.
  • The effect of the M-PVAS model is estimated with the Monte Carlo MapReduce benchmark.
  • At the current state, M-PVAS MapReduce shows better performance than the original MapReduce framework.
    - M-PVAS achieves around a 1.8 to 2.0 times speedup.
    - The main factor is the data transfer processing.
  • Future Work
    - Investigate the factors behind the performance improvement more deeply.
    - Experiment with different benchmarks.
