Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore
Rong Chen, Haibo Chen, Binyu Zang
Parallel Processing Institute, Fudan University
Data-Parallel Applications
Data-parallel applications have emerged and grown rapidly over the past 10 years
• Google processed about 24 petabytes of data per day in 2008
• The movie "Avatar" takes over 1 petabyte of local storage for 3D rendering*
• …
* http://www.information-management.com/newsletters/avatar_data_processing-10016774-1.html
Data-Parallel Programming Model
MapReduce: a simple programming model for data-parallel applications, from Google
Two primitives:
• Map(input)
• Reduce(key, values)
The programmer provides only the functionality (the Map and Reduce functions); the MapReduce runtime handles parallelism, data distribution, fault tolerance, and load balance.
Example: WordCount
• Map(input): for each word in input, emit(word, 1)
• Reduce(key, values): int sum = 0; for each value in values, sum += value; emit(key, sum)
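To make the WordCount pseudocode above concrete, here is a minimal, self-contained C sketch in the same Map/Reduce style. The emit_intermediate() name and the trivial sequential grouping driver are illustrative assumptions, not the actual Phoenix API; a real runtime would run the Map and Reduce tasks in parallel.

```c
/* Minimal sequential sketch of WordCount in MapReduce style (illustrative;
 * not the actual Phoenix API). The "runtime" part groups the intermediate
 * (word, 1) pairs by key and hands each group to Reduce. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 4096

struct pair { char key[64]; long val; };
static struct pair inter[MAX_PAIRS];   /* intermediate buffer */
static int npairs;

static void emit_intermediate(const char *key, long val)
{
    if (npairs >= MAX_PAIRS) return;   /* toy bound for the example */
    snprintf(inter[npairs].key, sizeof(inter[npairs].key), "%s", key);
    inter[npairs++].val = val;
}

/* User-supplied Map: for each word in input, emit (word, 1). */
static void map(char *input)
{
    for (char *w = strtok(input, " \t\n"); w; w = strtok(NULL, " \t\n"))
        emit_intermediate(w, 1);
}

/* User-supplied Reduce: sum all values emitted for one key. */
static void reduce(const char *key, const long *values, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += values[i];
    printf("%s\t%ld\n", key, sum);     /* emit(key, sum) */
}

static int cmp(const void *a, const void *b)
{
    return strcmp(((const struct pair *)a)->key, ((const struct pair *)b)->key);
}

int main(void)
{
    char text[] = "the boy and the dog saw the boy";
    map(text);                                     /* Map phase */

    /* "Runtime": group pairs by key (sort), then call Reduce per group. */
    qsort(inter, npairs, sizeof(inter[0]), cmp);
    long values[MAX_PAIRS];
    for (int i = 0; i < npairs; ) {
        int n = 0, j = i;
        while (j < npairs && strcmp(inter[j].key, inter[i].key) == 0)
            values[n++] = inter[j++].val;
        reduce(inter[i].key, values, n);           /* Reduce phase */
        i = j;
    }
    return 0;
}
```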
Multicore
Multicore has recently become commercially prevalent
• Quad-core and eight-core chips are common today
• Tens to hundreds of cores on a single chip will appear in the near future
MapReduce on Multicore
Phoenix [HPCA'07, IISWC'09]: a MapReduce runtime for shared-memory systems
• Targets: CMPs and SMPs, NUMA
• Features
  - Parallelism: threads
  - Communication: shared address space
• Heavily optimized runtime
  - Runtime algorithms, e.g. locality-aware task distribution
  - Scalable data structures, e.g. hash table
  - OS interaction, e.g. memory allocator, thread pool
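The slide only names locality-aware task distribution; one plausible shape such a scheme can take (purely an assumption for illustration, not Phoenix's actual scheduler) is per-core task queues, where each worker prefers map tasks over chunks assigned to its own core and steals from other queues only when its own runs dry.

```c
/* Hedged sketch of per-core task queues with work stealing as one form of
 * locality-aware task distribution. Illustrative only. Build with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define NTASKS   16

struct queue {
    int tasks[NTASKS];
    int head, tail;
    pthread_mutex_t lock;
};

static struct queue q[NWORKERS];

static int pop(struct queue *self)          /* take a task from one queue */
{
    int t = -1;
    pthread_mutex_lock(&self->lock);
    if (self->head < self->tail)
        t = self->tasks[self->head++];
    pthread_mutex_unlock(&self->lock);
    return t;
}

static int steal(int me)                    /* fall back to other queues */
{
    for (int i = 0; i < NWORKERS; i++) {
        if (i == me) continue;
        int t = pop(&q[i]);
        if (t >= 0) return t;
    }
    return -1;
}

static void *worker(void *arg)
{
    int me = (int)(long)arg;
    for (;;) {
        int task = pop(&q[me]);             /* local chunk first: cache-warm  */
        if (task < 0) task = steal(me);     /* keep cores busy near the end   */
        if (task < 0) break;
        printf("worker %d maps chunk %d\n", me, task);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (int w = 0; w < NWORKERS; w++) {
        pthread_mutex_init(&q[w].lock, NULL);
        q[w].head = q[w].tail = 0;
        for (int t = w; t < NTASKS; t += NWORKERS)   /* chunks dealt round-robin */
            q[w].tasks[q[w].tail++] = t;
    }
    for (int w = 0; w < NWORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)(long)w);
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(tid[w], NULL);
    return 0;
}
```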
Implementation on Multicore
(Figure: execution flow of the MapReduce runtime on a multicore machine, from disk through main memory to output)
1. Start the worker threads and load the whole input file from disk into an in-memory input buffer
2. Split the input buffer into chunks
3. Map workers process the chunks and emit key/value pairs into an intermediate buffer (per-worker key and value arrays)
4. Reduce workers aggregate the intermediate pairs for each key into a final buffer
5. Merge the reduce outputs into the result buffer, write the output file, free the buffers, and end
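The intermediate buffer the figures hint at can be pictured as a grid of buckets indexed by (map worker, reduce partition): each map worker writes only its own row, so no locking is needed among mappers, and each reduce worker later scans one column. The layout and names below are assumptions for illustration, not Phoenix's actual data structures.

```c
/* Sketch of a per-(map worker, reduce partition) intermediate buffer. */
#include <stdio.h>

#define MAP_WORKERS    4
#define REDUCE_WORKERS 4
#define BUCKET_CAP     256

struct kv { const char *key; long val; };

struct bucket {
    struct kv pairs[BUCKET_CAP];
    int n;
};

/* intermediate[m][r]: pairs emitted by map worker m for reduce partition r */
static struct bucket intermediate[MAP_WORKERS][REDUCE_WORKERS];

static unsigned partition(const char *key)
{
    unsigned h = 5381;                    /* djb2-style hash picks the column */
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h % REDUCE_WORKERS;
}

static void emit_intermediate(int map_id, const char *key, long val)
{
    struct bucket *b = &intermediate[map_id][partition(key)];
    if (b->n < BUCKET_CAP) {
        b->pairs[b->n].key = key;
        b->pairs[b->n].val = val;
        b->n++;
    }
}

int main(void)
{
    /* Map workers 0 and 1 emit a few pairs; each writes only its own row. */
    emit_intermediate(0, "boy", 1);
    emit_intermediate(0, "but", 1);
    emit_intermediate(1, "boy", 1);

    /* Reduce worker r walks column r across all map workers' rows. */
    for (int r = 0; r < REDUCE_WORKERS; r++)
        for (int m = 0; m < MAP_WORKERS; m++)
            for (int i = 0; i < intermediate[m][r].n; i++)
                printf("reduce %d sees (%s, %ld) from map %d\n",
                       r, intermediate[m][r].pairs[i].key,
                       intermediate[m][r].pairs[i].val, m);
    return 0;
}
```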
Deficiencies of MapReduce on Multicore
High memory usage
• The whole input data is kept in main memory all the time
  e.g. WordCount with a 4GB input requires more than 4.3GB of memory on Phoenix (93% used by the input data)
Poor data locality
• All input data is processed at one time
  e.g. WordCount with a 4GB input has about a 25% L2 cache miss rate
Strict dependency barriers
• CPUs idle at the exchange of phases
Solution: Tiled-MapReduce
Contributions
• Tiled-MapReduce programming model
  - Tiling strategy
  - Fault tolerance (in paper)
• Three optimizations for the Tiled-MapReduce runtime
  - Input data buffer reuse
  - NUCA/NUMA-aware scheduler
  - Software pipeline
Outline
1. Tiled-MapReduce
2. Optimizations on TMR
3. Evaluation
4. Conclusion
Tiled-MapReduce
"Tiling strategy"
• Divide a large MapReduce job into a number of independent small sub-jobs
• Iteratively process one sub-job at a time
Requirement
• The Reduce function must be commutative and associative (illustrated by the sketch below)
• All 26 applications in the test suites of Phoenix and Hadoop meet this requirement
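A tiny C example of why Reduce must be commutative and associative: Tiled-MapReduce reduces each sub-job separately and then reduces the partial results again, so the final value must not depend on how or in what order the values were grouped. WordCount's summing Reduce satisfies this; the numbers below are made up for illustration.

```c
/* Tiled vs. flat reduction give the same answer when Reduce is
 * commutative and associative, as '+' is for WordCount. */
#include <stdio.h>

static long reduce_sum(const long *values, int n)   /* WordCount's Reduce */
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += values[i];
    return sum;
}

int main(void)
{
    /* Counts for the word "boy" emitted across three sub-jobs. */
    long iter1[] = { 1, 1, 1 };
    long iter2[] = { 1, 1 };
    long iter3[] = { 1, 1, 1, 1 };

    /* Tiled: reduce each sub-job, then reduce the partial results. */
    long partial[] = {
        reduce_sum(iter1, 3),
        reduce_sum(iter2, 2),
        reduce_sum(iter3, 4),
    };
    long tiled = reduce_sum(partial, 3);

    /* Flat: reduce all values of the key at once. */
    long all[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1 };
    long flat = reduce_sum(all, 9);

    printf("tiled=%ld flat=%ld\n", tiled, flat);   /* both print 9 */
    return 0;
}
```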
Tiled-MapReduce
Extensions to the MapReduce model
(Figure: phase flow — Start, a loop of Map and Combine, then Reduce, Merge, End)
1. Replace the Map phase with a loop of Map and Reduce phases
2. Process one sub-job in each iteration
3. Rename the Reduce phase within the loop to the Combine phase
4. Modify the Reduce phase to process the partial results of all iterations
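Putting the four extensions together, the control flow can be sketched as a simple driver loop. This is only a skeleton with stub phase bodies and illustrative names (e.g. NUM_ITERATIONS), not Ostrich's actual code.

```c
/* Skeleton of the extended model: a loop of Map + Combine over sub-jobs,
 * then one Reduce over the accumulated partial results, then Merge. */
#include <stdio.h>

#define NUM_ITERATIONS 8          /* assumed number of sub-jobs */

static void map_subjob(int iter)     { printf("map sub-job %d\n", iter); }
static void combine_subjob(int iter) { printf("combine sub-job %d\n", iter); }
static void reduce_partials(void)    { printf("reduce partial results\n"); }
static void merge_output(void)       { printf("merge final output\n"); }

int main(void)
{
    /* Extensions 1+2: loop of Map and Combine, one sub-job per iteration. */
    for (int iter = 0; iter < NUM_ITERATIONS; iter++) {
        map_subjob(iter);
        combine_subjob(iter);     /* Extension 3: in-loop Reduce is Combine */
    }
    reduce_partials();            /* Extension 4: Reduce over all partials  */
    merge_output();
    return 0;
}
```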
Prototype of Tiled-MapReduce
Ostrich: a prototype of the Tiled-MapReduce programming model
• Demonstrates the effectiveness of the TMR programming model
• Based on the Phoenix runtime
• Follows its data structures and algorithms
Ostrich Implementation
(Figure: execution flow of Ostrich, analogous to Phoenix but processing the input one iteration window at a time)
• Start the worker threads and load only a portion of the input from disk
• Map workers process one iteration window of the input (one sub-job) per iteration, emitting into the intermediate buffer
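One way to read the "iteration window" in the figure: the runtime loads and processes one window of the input per iteration and reuses the same buffer for the next window, so memory is bounded by the window size rather than the input size. The sketch below is an assumption-laden illustration (window size, fread-based loading, stubbed sub-job), not Ostrich's implementation; a real runtime would also align windows to record boundaries.

```c
/* Iteration-window sketch: one reusable buffer, one sub-job per window. */
#include <stdio.h>
#include <stdlib.h>

#define WINDOW_SIZE (64 * 1024 * 1024)   /* assumed sub-job size: 64 MB */

static void run_subjob(const char *buf, size_t len)
{
    /* Map + Combine over this window (stub). */
    printf("processed a window of %zu bytes\n", len);
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <input>\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }

    char *window = malloc(WINDOW_SIZE);   /* one buffer, reused every iteration */
    if (!window) { fclose(fp); return 1; }

    size_t len;
    while ((len = fread(window, 1, WINDOW_SIZE, fp)) > 0)
        run_subjob(window, len);          /* one sub-job per window */

    free(window);
    fclose(fp);
    return 0;
}
```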