Building the Next Generation of MapReduce Programming Models over MPI to Fill the Gaps between Data Analytics and Supercomputers
Michela Taufer, University of Delaware
Collaborators: Tao Gao, Boyu Zhang (University of Delaware); Pavan Balaji, Yanfei Guo (Argonne National Laboratory); BingQiang Wang, Yutong Lu (Guangzhou Supercomputer Center); Pietro Cicotti (San Diego Supercomputer Center); Yanjie Wei (Shenzhen Institute of Advanced Technologies)
MapReduce Programming Model
• The MapReduce runtime handles the parallel job execution, communication, and data movement
• Users provide the map and reduce functions (sketched in code below)
[Figure: WordCount example — map tasks emit <Hello,1> and <World,1> pairs, the shuffle groups pairs by key, and the reduce tasks produce <Hello,2> and <World,2>]
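To make the division of labor concrete, here is a minimal sketch of the two user-provided WordCount functions. The KVEmitter type and its emit() method are illustrative stand-ins for whatever emit interface a given runtime exposes, not the API of any specific framework.

```cpp
// Minimal sketch of the two user-provided WordCount callbacks.
// KVEmitter and its emit() method are illustrative stand-ins for a runtime's
// emit interface, not the API of any specific MapReduce framework.
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct KVEmitter {
    // A real runtime forwards emitted pairs to its shuffle; this stand-in
    // just collects them locally so the sketch is self-contained.
    std::vector<std::pair<std::string, long>> pairs;
    void emit(const std::string& key, long value) { pairs.push_back({key, value}); }
};

// Map: split one line of input into words and emit <word, 1> for each.
void wordcount_map(const std::string& line, KVEmitter& out) {
    std::istringstream words(line);
    std::string word;
    while (words >> word) out.emit(word, 1);
}

// Reduce: the shuffle groups values by key; sum the counts for one key.
void wordcount_reduce(const std::string& key, const std::vector<long>& values,
                      KVEmitter& out) {
    long sum = 0;
    for (long v : values) sum += v;
    out.emit(key, sum);
}

int main() {
    KVEmitter mapped, reduced;
    wordcount_map("Hello World Hello World", mapped);  // four <word, 1> pairs
    wordcount_reduce("Hello", {1, 1}, reduced);        // -> <Hello, 2>
    return 0;
}
```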
WordCount: A Concrete Example — Map
Input line → emitted <key, value> pairs:
• "When tweetle beetles fight," → (When, 1), (tweetle, 1), (beetles, 1), (fight, 1)
• "it's called a tweetle beetle battle." → (it's, 1), (called, 1), (a, 1), (tweetle, 1), (beetle, 1), (battle, 1)
• "And when they battle in a puddle," → (And, 1), (when, 1), (they, 1), (battle, 1), (in, 1), (a, 1), (puddle, 1)
• "it's a tweetle beetle puddle battle." → (it's, 1), (a, 1), (tweetle, 1), (beetle, 1), (puddle, 1), (battle, 1)
• "And when tweetle beetles battle with paddles in a puddle," → (And, 1), (when, 1), (tweetle, 1), (beetles, 1), (battle, 1), (with, 1), (paddles, 1), (in, 1), (a, 1), (puddle, 1)
• "They call it a tweetle beetle puddle paddle battle." → (They, 1), (call, 1), (it, 1), (a, 1), (tweetle, 1), (beetle, 1), (puddle, 1), (paddle, 1), (battle, 1)
WordCount: A Concrete Example — Reduce
Grouped <key, list<value>> pairs → reduced <key, value> pairs:
• (tweetle, 1) × 5 → (tweetle, 5)
• (battle, 1) × 5 → (battle, 5)
• (puddle, 1) × 4 → (puddle, 4)
• (beetle, 1) × 3 → (beetle, 3)
• (beetles, 1) × 2 → (beetles, 2)
• (when, 1) × 2 → (when, 2)
• (When, 1) × 1 → (When, 1)
• each remaining word appears once and reduces to a count of 1
Data Generation on HPC Systems
[Figure: HPC system usage and data-generation statistics; source: https://xdmod.ccr.buffalo.edu]
MapReduce over MPI
Is MapReduce an appealing way to handle big data processing on HPC systems?
Data Processing on HPC Systems
• Key differences between cloud computing systems and HPC systems prevent the naïve reuse of cloud methods on HPC
[Figure: HPC systems couple processors to a shared disk array through a fast interconnect and are programmed with MPI/OpenMP; cloud systems attach a local disk to each processor over Ethernet and are programmed with Hadoop/Spark]
A Fundamentally Correct MapReduce (MR) over MPI
• Supports the logical map-shuffle-reduce workflow in four phases
  § Map, aggregate, convert, and reduce [1] (a simplified end-to-end sketch follows below)
[Figure: each process P0…Pn runs map → aggregate → convert → reduce; map emits <key, value> pairs, aggregate shuffles them across processes, convert groups them into <key, list<value>>, and reduce produces the output; a barrier separates each phase]
[1] S. J. Plimpton and K. D. Devine. MapReduce in MPI for Large-Scale Graph Algorithms. Parallel Computing, 37(9):610–632, 2011.
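The sketch below walks through the same four phases end to end with plain MPI, so the phase boundaries in the figure map onto concrete calls. It is deliberately simplified (fixed-width records, a hard-coded input string, explicit MPI_Barrier calls to mark where the phase boundaries fall) and is not the MR-MPI API; it only illustrates the workflow.

```cpp
// Simplified end-to-end WordCount over MPI that makes the four phases and the
// barriers between them explicit. Fixed-width records, a hard-coded input
// string, and explicit MPI_Barrier calls keep the sketch short; this is an
// illustration of the workflow, not the MR-MPI API [1].
#include <mpi.h>

#include <cstdio>
#include <cstring>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct KV { char word[32]; long count; };   // fixed-width <key, value> record

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // ---- Map: each rank tokenizes its (here: identical) chunk of input ----
    std::vector<std::vector<KV>> outgoing(nprocs);
    std::istringstream text("Hello world Hello world Hi");
    std::string w;
    while (text >> w) {
        KV kv{};
        std::strncpy(kv.word, w.c_str(), sizeof(kv.word) - 1);
        kv.count = 1;
        size_t dest = std::hash<std::string>{}(w) % nprocs;  // partition by key
        outgoing[dest].push_back(kv);
    }
    MPI_Barrier(MPI_COMM_WORLD);              // end of the map phase

    // ---- Aggregate: all-to-all exchange so equal keys land on one rank ----
    std::vector<int> sendcounts(nprocs), recvcounts(nprocs);
    for (int p = 0; p < nprocs; ++p)
        sendcounts[p] = (int)(outgoing[p].size() * sizeof(KV));
    MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT,
                 MPI_COMM_WORLD);
    std::vector<int> sdispls(nprocs, 0), rdispls(nprocs, 0);
    for (int p = 1; p < nprocs; ++p) {
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];
        rdispls[p] = rdispls[p - 1] + recvcounts[p - 1];
    }
    std::vector<char> sendbuf(sdispls[nprocs - 1] + sendcounts[nprocs - 1]);
    std::vector<char> recvbuf(rdispls[nprocs - 1] + recvcounts[nprocs - 1]);
    for (int p = 0; p < nprocs; ++p)
        if (sendcounts[p] > 0)
            std::memcpy(sendbuf.data() + sdispls[p], outgoing[p].data(), sendcounts[p]);
    MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_BYTE,
                  recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_BYTE,
                  MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);              // end of the aggregate phase

    // ---- Convert: group the received values into <key, list<value>> ----
    std::map<std::string, std::vector<long>> grouped;
    for (size_t off = 0; off + sizeof(KV) <= recvbuf.size(); off += sizeof(KV)) {
        KV kv;
        std::memcpy(&kv, recvbuf.data() + off, sizeof(KV));
        grouped[kv.word].push_back(kv.count);
    }
    MPI_Barrier(MPI_COMM_WORLD);              // end of the convert phase

    // ---- Reduce: apply the user reduce (a sum) to each unique key ----
    for (const auto& kv : grouped) {
        long sum = 0;
        for (long v : kv.second) sum += v;
        std::printf("rank %d: <%s, %ld>\n", rank, kv.first.c_str(), sum);
    }

    MPI_Finalize();
    return 0;
}
```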
Extra Synchronizations
• Aggregate and convert must be invoked explicitly by the user
  § Cost: extra synchronizations
[Figure: the same four-phase workflow; each of the four barriers separating map, aggregate, convert, and reduce is a global synchronization point]
Extra Data Staging
• Aggregate and convert need to store the intermediate data
  § Cost: extra data staging
[Figure: the same workflow; the <key, value> and <key, list<value>> intermediate data must be staged in full between phases]
Extra Memory Usage and Poor Data Management
• Zooming in on the map and aggregate operations
[Figure: the four-phase workflow with the map/aggregate step of each process highlighted; the next slide zooms into these buffers]
Extra Memory Usage and Poor Data Management
• Additional memory buffers are allocated for metadata
  § Cost: extra memory use
• If the in-memory buffer is full → spill data to the disk
  § Cost: poor data management (see the sketch below)
[Figure: with static allocation, each process reserves fixed-size send and receive buffers plus a staging area for its map output]
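A sketch of the buffering cost just described, assuming a statically sized staging buffer that spills its contents to a file whenever it fills. The class name, capacity, and spill-file handling are hypothetical illustrations, not any framework's implementation.

```cpp
// Sketch of the cost just described: a statically sized staging buffer that
// spills to disk whenever it fills. The class name, capacity, and spill-file
// handling are hypothetical; records are assumed smaller than the capacity.
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

class StaticStagingBuffer {
public:
    StaticStagingBuffer(size_t capacity_bytes, const std::string& spill_path)
        : buf_(capacity_bytes), used_(0), spill_path_(spill_path) {}

    // Append one serialized <key, value> record; when the fixed-size buffer
    // cannot hold it, everything buffered so far is written to disk even if
    // the node still has plenty of free memory (the poor-data-management cost).
    void add(const void* record, size_t bytes) {
        if (used_ + bytes > buf_.size()) spill();
        std::memcpy(buf_.data() + used_, record, bytes);
        used_ += bytes;
    }

private:
    void spill() {
        std::FILE* f = std::fopen(spill_path_.c_str(), "ab");
        if (f) { std::fwrite(buf_.data(), 1, used_, f); std::fclose(f); }
        used_ = 0;
    }

    std::vector<char> buf_;
    size_t used_;
    std::string spill_path_;
};
```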
Tackling Shortcomings of a Correct MR Model
Shortcomings: extra synchronizations, extra data staging, extra memory use, and poor data management
A journey to design and implement Mimir, an efficient MR over MPI framework, addressing:
• Memory inefficiency
• Load balancing issues
• I/O variability
Impact: Out-of-memory Operations
• Existing MapReduce over MPI implementations still struggle with memory limits
  § MR-MPI can process only 4 GB of data in memory on a 128 GB node; larger inputs require out-of-memory (disk-based) processing
[Figure: single-node execution time of WordCount (Wikipedia) with MR-MPI on Comet (128 GB memory) [1]]
[1] T. Gao, Y. Guo, B. Zhang, P. Cicotti, Y. Lu, P. Balaji, and M. Taufer. Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems. In Proceedings of IPDPS, 2017.
Reduce Synchronization and Extra Data Staging
• Interleave operations: e.g., map interleaves with aggregate (see the sketch below)
[Figure: the aggregate step is folded into map on each process, removing one barrier and one round of intermediate <key, value> staging]
Improvements: 1. reduced synchronization; 2. reduced extra data staging
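Below is a sketch of the interleaving idea: the map function emits straight into per-destination send buffers, and a full buffer triggers an immediate collective exchange rather than waiting for a separate aggregate phase. The class, record format, and flush policy are illustrative assumptions, not Mimir's actual code; the coordination needed so that all ranks flush together is only noted in a comment.

```cpp
// Sketch of interleaving map with aggregate: map emits into per-destination
// send buffers, and a full buffer triggers an immediate collective exchange
// instead of staging all map output first. Illustrative only.
#include <mpi.h>

#include <cstring>
#include <functional>
#include <string>
#include <vector>

struct KV { char word[32]; long count; };

class InterleavedShuffle {
public:
    InterleavedShuffle(MPI_Comm comm, size_t pairs_per_dest)
        : comm_(comm), limit_(pairs_per_dest) {
        MPI_Comm_size(comm_, &nprocs_);
        outgoing_.resize(nprocs_);
    }

    // Called from inside the user's map function for every emitted pair.
    void emit(const std::string& key, long value) {
        KV kv{};
        std::strncpy(kv.word, key.c_str(), sizeof(kv.word) - 1);
        kv.count = value;
        int dest = (int)(std::hash<std::string>{}(key) % nprocs_);
        outgoing_[dest].push_back(kv);
        if (outgoing_[dest].size() >= limit_) flush();  // interleave: shuffle now
    }

    // Collective: every rank sends what it has buffered and receives its keys.
    // All ranks must call flush() the same number of times; a real runtime
    // coordinates this, which the sketch leaves out.
    void flush() {
        std::vector<int> sendcounts(nprocs_), recvcounts(nprocs_);
        for (int p = 0; p < nprocs_; ++p)
            sendcounts[p] = (int)(outgoing_[p].size() * sizeof(KV));
        MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm_);
        // ... pack, MPI_Alltoallv as in the earlier sketch, append the received
        // pairs to the local staging area, then clear the outgoing buffers ...
        for (auto& v : outgoing_) v.clear();
    }

private:
    MPI_Comm comm_;
    int nprocs_ = 0;
    size_t limit_;
    std::vector<std::vector<KV>> outgoing_;
};
```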
Optimizing Intermediate Data Management
• Use the send buffer directly as the output of map
  § Avoid extra buffer usage
• Use a KV/KMV container as the staging area, with dynamic allocation
  § Dynamically allocate one or multiple pages (see the sketch below)
[Figure: each process's map writes straight into its send buffer, and the KV container grows page by page]
Improvements: 3. avoid extra memory buffer usage; 4. manage intermediate data more efficiently
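The KV container can be sketched as a list of pages that grows on demand, which is the contrast with the statically reserved buffers shown earlier. The page size and layout here are assumptions for illustration, not Mimir's internal format.

```cpp
// Sketch of a dynamically growing, page-based KV container used as the
// staging area: pages are allocated only when needed, so memory use tracks
// the actual intermediate data instead of a fixed static reservation.
#include <cstring>
#include <memory>
#include <vector>

class KVContainer {
public:
    explicit KVContainer(size_t page_bytes = 64 << 20)  // e.g. 64 MB pages
        : page_bytes_(page_bytes) {}

    // Append one serialized <key, value> record, growing by whole pages.
    // Records are assumed to be smaller than one page.
    void add(const void* record, size_t bytes) {
        if (pages_.empty() || used_in_last_ + bytes > page_bytes_) {
            pages_.push_back(std::make_unique<char[]>(page_bytes_));
            used_in_last_ = 0;
        }
        std::memcpy(pages_.back().get() + used_in_last_, record, bytes);
        used_in_last_ += bytes;
    }

    size_t pages_allocated() const { return pages_.size(); }

private:
    size_t page_bytes_;
    size_t used_in_last_ = 0;
    std::vector<std::unique_ptr<char[]>> pages_;
};
```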
Mimir vs. MR-MPI: WordCount on Comet
• Single-node execution (24 processes, 128 GB memory)
  § Benchmark: WordCount with the Wikipedia dataset
  § Settings: MR-MPI (64 MB page and 512 MB page); Mimir (64 MB page)
• Mimir can handle a 4X larger dataset
[Figure: single-node execution time of Mimir vs. MR-MPI across input sizes [1]]
[1] T. Gao, Y. Guo, B. Zhang, P. Cicotti, Y. Lu, P. Balaji, and M. Taufer. Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems. In Proceedings of IPDPS, 2017.
Impact: Load Imbalance
• <key, value> pairs are NOT distributed evenly among processes
  § Imbalanced <key, value> pairs may cause poor resource usage
[Figures: total execution time of WordCount (Wikipedia) with Mimir on Tianhe-2 without load balancing; number of <key, value> pairs per process for WordCount (Wikipedia) on 768 processes]
Impact: Load Imbalance
• <key, value> pairs are NOT distributed evenly among processes
  § Imbalanced <key, value> pairs may cause poor resource usage
[Figures: balance ratio (max memory / min memory over all processes) for WordCount (Wikipedia) on Tianhe-2; number of <key, value> pairs per process for WordCount (Wikipedia) on 768 processes]
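The balance ratio plotted above (the maximum over the minimum across processes) is easy to compute at runtime; a minimal sketch with two MPI_Allreduce calls follows, assuming each process passes in its own peak memory use or <key, value> count.

```cpp
// Sketch of measuring the balance ratio from the plots above: each process
// reports its peak memory (or KV-pair count), and the ratio of the global
// maximum to the global minimum quantifies the skew (1.0 = perfectly even).
#include <mpi.h>

// local_usage: e.g. peak staging-area bytes or number of <key, value> pairs
double balance_ratio(double local_usage, MPI_Comm comm) {
    double max_usage = 0.0, min_usage = 0.0;
    MPI_Allreduce(&local_usage, &max_usage, 1, MPI_DOUBLE, MPI_MAX, comm);
    MPI_Allreduce(&local_usage, &min_usage, 1, MPI_DOUBLE, MPI_MIN, comm);
    return min_usage > 0.0 ? max_usage / min_usage : 0.0;
}
```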
Combining <key, value> Pairs
• Combiner operations:
  § Merge <key, value> pairs with the same key before the shuffle
  § Merge <key, value> pairs with the same key after the shuffle
• Application dependent:
  § WordCount → YES
  § Join → NO
[Figure: each process runs a combine step between map and the shuffle, and another combine step after the shuffle]
Combining <key, value> Pairs
• Merge <key, value> pairs with the same key before the shuffle
• Merge <key, value> pairs with the same key after the shuffle (see the sketch below)
[Figure: WordCount example — map tasks combine duplicate words locally (e.g., two occurrences of World become <World,2> and two of Hello become <Hello,2>) before the shuffle; the reduce tasks then produce <Hello,3>, <World,3>, and <Hi,1>]
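A local combiner can be sketched as a small hash table that pre-reduces pairs before the shuffle, shrinking both the traffic and the skew seen by the receivers. The function below is illustrative and assumes the reduce operation (summing counts) is associative and commutative, which is why the technique fits WordCount but not Join.

```cpp
// Sketch of a local combiner: pairs with the same key are merged in a small
// hash table before they are handed to the shuffle. The reduction (a sum)
// must be associative and commutative for this to be safe.
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, long>>
combine_counts(const std::vector<std::pair<std::string, long>>& mapped) {
    std::unordered_map<std::string, long> partial;
    for (const auto& kv : mapped) partial[kv.first] += kv.second;  // pre-reduce
    return {partial.begin(), partial.end()};
}

// Example: {<Hello,1>, <World,1>, <Hello,1>} -> {<Hello,2>, <World,1>}
```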
Combiner Results: WordCount on Tianhe-2
[Figures: number of KV pairs, total time (sec), memory usage (GB), and balance ratio for Mimir vs. Mimir + combiner on WordCount]
T. Gao, Y. Guo, B. Zhang, P. Cicotti, Y. Lu, P. Balaji, and M. Taufer. Skew Mitigation in MapReduce for Supercomputing Systems. In preparation, 2017.