
Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore

Rong Chen, Haibo Chen, Binyu Zang. Parallel Processing Institute, Fudan University.


  1. Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore. Rong Chen, Haibo Chen, Binyu Zang. Parallel Processing Institute, Fudan University.

  2. Data-Parallel Applications. Data-parallel applications have emerged and grown rapidly over the past 10 years • Google processed about 24 petabytes of data per day in 2008 • The movie “Avatar” takes over 1 petabyte of local storage for 3D rendering* • … * http://www.information-management.com/newsletters/avatar_data_processing-10016774-1.html

  3. Data-Parallel Programming Model. MapReduce: a simple programming model for data-parallel applications, from Google.

  4. Data-Parallel Programming Model. MapReduce: a simple programming model for data-parallel applications, from Google. The programmer provides the functionality; the runtime handles parallelism, data distribution, fault tolerance, and load balance.

  5. Data-Parallel Programming Model. MapReduce: a simple programming model for data-parallel applications, from Google. Two primitives written by the programmer supply the functionality: Map(input) and Reduce(key, values); the MapReduce runtime does the rest.

  6. Data-Parallel Programming Model. MapReduce: a simple programming model for data-parallel applications, from Google. Two primitives, shown here for Word Count: Map(input): for each word in input, emit(word, 1). Reduce(key, values): int sum = 0; for each value in values, sum += value; emit(key, sum).
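As a concrete companion to the Word Count pseudocode on slide 6, here is a minimal, self-contained sketch in plain C. The emit_intermediate helper, the fixed-size pair table, and the sequential grouping loop in main are illustrative stand-ins for the runtime, not the actual Phoenix API.

    #include <stdio.h>
    #include <string.h>

    #define MAX_PAIRS 4096

    /* Flat list of intermediate (key, value) pairs produced by Map. */
    static struct { char key[64]; int value; } pairs[MAX_PAIRS];
    static int npairs = 0;

    static void emit_intermediate(const char *key, int value)
    {
        if (npairs >= MAX_PAIRS)
            return;
        strncpy(pairs[npairs].key, key, sizeof(pairs[npairs].key) - 1);
        pairs[npairs].value = value;
        npairs++;
    }

    /* Map(input): for each word in input, emit(word, 1). */
    static void map(char *input)
    {
        for (char *w = strtok(input, " \t\n"); w; w = strtok(NULL, " \t\n"))
            emit_intermediate(w, 1);
    }

    /* Reduce(key, values): sum the values emitted for one key, emit(key, sum). */
    static void reduce(const char *key, const int *values, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += values[i];
        printf("%s: %d\n", key, sum);
    }

    int main(void)
    {
        char input[] = "the boy saw the dog and the cat";
        map(input);

        /* Sequential stand-in for the runtime: group pairs by key, then reduce. */
        int done[MAX_PAIRS] = {0};
        for (int i = 0; i < npairs; i++) {
            if (done[i]) continue;
            int values[MAX_PAIRS], n = 0;
            for (int j = i; j < npairs; j++)
                if (!done[j] && strcmp(pairs[j].key, pairs[i].key) == 0) {
                    values[n++] = pairs[j].value;
                    done[j] = 1;
                }
            reduce(pairs[i].key, values, n);
        }
        return 0;
    }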

  7. Multicore. Multicore has recently become commercially prevalent • Quad-core and eight-core chips are common • Tens or hundreds of cores on a single chip will appear in the near future

  8. MapReduce on Multicore. Phoenix [HPCA’07, IISWC’09]: a MapReduce runtime for shared-memory systems > CMPs and SMPs > NUMA

  9. MapReduce on Multicore. Phoenix [HPCA’07, IISWC’09]: a MapReduce runtime for shared-memory systems > CMPs and SMPs > NUMA. Features > Parallelism: threads > Communication: shared address space

  10. MapReduce on Multicore. Phoenix [HPCA’07, IISWC’09]: a MapReduce runtime for shared-memory systems > CMPs and SMPs > NUMA. Features > Parallelism: threads > Communication: shared address space. Heavily optimized runtime > Runtime algorithms, e.g. locality-aware task distribution > Scalable data structures, e.g. hash table > OS interaction, e.g. memory allocator, thread pool

  11. Implementation on Multicore. (Diagram: disk, processors, and main memory.)

  12. Implementation on Multicore. Worker threads are started and the input is loaded from disk into the input buffer in main memory.

  13. Implementation on Multicore. The input buffer is split into chunks for the worker threads, and an intermediate buffer is set up in main memory.

  14. Implementation on Multicore. Map workers (M) scan their input chunks and emit key/value-array pairs (e.g. the word “boy” with an array of 1s) into the intermediate buffer.

  15. Implementation on Multicore. Reduce workers (R) consume the intermediate buffer and write their results into the final buffer.

  16. Implementation on Multicore. Each Reduce worker folds a key's value array into a single value (e.g. “boy” reduces to 5) in the final buffer.

  17. Implementation on Multicore. The per-worker outputs are merged into the result buffer to produce the output.

  18. Implementation on Multicore. After the merge, the result is written to the output file, the buffers are freed, and the job ends.
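The walkthrough on slides 11-18 amounts to a strict sequence of phases over shared in-memory buffers. The skeleton below is only a sketch of that sequence under stated assumptions: every phase body is a placeholder stub, whereas the real Phoenix runtime executes Map and Reduce with a pool of worker threads. It does highlight two points the next slides return to: the barriers between phases and the input staying resident until the very end.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *input_buffer;   /* whole input, resident until the end of the job */
    static size_t input_len;

    static void load_input(const char *text)   /* disk -> input buffer */
    {
        input_len = strlen(text);
        input_buffer = malloc(input_len + 1);
        memcpy(input_buffer, text, input_len + 1);
        printf("load:   %zu bytes into the input buffer\n", input_len);
    }

    static void map_phase(void)
    {
        printf("map:    workers emit (key, value-array) pairs into the intermediate buffer\n");
    }

    static void reduce_phase(void)
    {
        printf("reduce: workers fold each key's values into the final buffer\n");
    }

    static void merge_phase(void)
    {
        printf("merge:  per-worker outputs merged into the result buffer\n");
    }

    int main(void)
    {
        load_input("example input text");
        map_phase();      /* barrier: Reduce cannot start until all Map tasks finish */
        reduce_phase();   /* barrier: Merge cannot start until all Reduce tasks finish */
        merge_phase();
        printf("write:  result buffer written to the output file\n");
        free(input_buffer);   /* the input stays in memory until here */
        return 0;
    }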

  19. Deficiency of MapReduce on Multicore

  20. Deficiency of MapReduce on Multicore. High memory usage • Keeps the whole input data in main memory all the time, e.g. WordCount with a 4GB input requires more than 4.3GB of memory on Phoenix (93% used by input data)

  21. Deficiency of MapReduce on Multicore. High memory usage • Keeps the whole input data in main memory all the time, e.g. WordCount with a 4GB input requires more than 4.3GB of memory on Phoenix (93% used by input data). Poor data locality • Processes all input data at one time, e.g. WordCount with a 4GB input has about a 25% L2 cache miss rate

  22. Deficiency of MapReduce on Multicore. High memory usage • Keeps the whole input data in main memory all the time, e.g. WordCount with a 4GB input requires more than 4.3GB of memory on Phoenix (93% used by input data). Poor data locality • Processes all input data at one time, e.g. WordCount with a 4GB input has about a 25% L2 cache miss rate. Strict dependency barriers • CPUs sit idle at the exchange of phases

  23. Deficiency of MapReduce on Multicore. High memory usage, poor data locality, and strict dependency barriers. Solution: Tiled-MapReduce.

  24. Contribution. Tiled-MapReduce programming model − Tiling strategy − Fault tolerance (in paper). Three optimizations for the Tiled-MapReduce runtime − Input Data Buffer Reuse − NUCA/NUMA-aware Scheduler − Software Pipeline

  25. Outline. 1. Tiled-MapReduce 2. Optimization on TMR 3. Evaluation 4. Conclusion

  26. Outline. 1. Tiled-MapReduce 2. Optimization on TMR 3. Evaluation 4. Conclusion

  27. Tiled-MapReduce. “Tiling Strategy” • Divide a large MapReduce job into a number of independent small sub-jobs • Iteratively process one sub-job at a time

  28. Tiled-MapReduce. “Tiling Strategy” • Divide a large MapReduce job into a number of independent small sub-jobs • Iteratively process one sub-job at a time. Requirement • The Reduce function must be commutative and associative • All 26 applications in the test suites of Phoenix and Hadoop meet the requirement
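A small, self-contained C example of why the commutativity/associativity requirement matters: a sum can be folded per sub-job first and the partial results reduced afterwards, in any grouping, without changing the answer. The numbers and the split into two sub-jobs are purely illustrative.

    #include <stdio.h>

    /* A Reduce function that is commutative and associative: summation. */
    static int reduce_sum(const int *values, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += values[i];
        return sum;
    }

    int main(void)
    {
        int values[] = {3, 1, 4, 1, 5, 9, 2, 6};

        /* One big Reduce over all values (classic MapReduce). */
        int all_at_once = reduce_sum(values, 8);

        /* Two sub-jobs combined first, then a final Reduce over the
         * partial results (Tiled-MapReduce style); the result is equal
         * because + is commutative and associative. */
        int partials[2] = { reduce_sum(values, 4), reduce_sum(values + 4, 4) };
        int tiled = reduce_sum(partials, 2);

        printf("%d == %d\n", all_at_once, tiled);
        return 0;
    }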

  29. Tiled-MapReduce. Extensions to the MapReduce model (original flow: Start, Map, Reduce, Merge, End)

  30. Tiled-MapReduce. Extensions to the MapReduce model: 1. Replace the Map phase with a loop of Map and Reduce phases

  31. Tiled-MapReduce. Extensions to the MapReduce model: 1. Replace the Map phase with a loop of Map and Reduce phases 2. Process one sub-job in each iteration

  32. Tiled-MapReduce. Extensions to the MapReduce model: 1. Replace the Map phase with a loop of Map and Reduce phases 2. Process one sub-job in each iteration 3. Rename the Reduce phase within the loop to the Combine phase

  33. Tiled-MapReduce. Extensions to the MapReduce model: 1. Replace the Map phase with a loop of Map and Reduce phases 2. Process one sub-job in each iteration 3. Rename the Reduce phase within the loop to the Combine phase 4. Modify the Reduce phase to process the partial results of all iterations
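The control flow built up on slides 29-33 can be summarized in a few lines of C. This is only a sketch: the four hooks and the fixed sub-job count are illustrative assumptions, not the Ostrich API.

    #include <stdio.h>

    #define NUM_SUBJOBS 4

    static void map_subjob(int i)     { printf("  map     sub-job %d\n", i); }
    static void combine_subjob(int i) { printf("  combine sub-job %d -> partial result %d\n", i, i); }
    static void reduce_partials(void) { printf("reduce: fold the partial results of all iterations\n"); }
    static void merge_results(void)   { printf("merge:  produce the final output\n"); }

    int main(void)
    {
        /* Loop of Map + Combine: one independent sub-job per iteration. */
        for (int i = 0; i < NUM_SUBJOBS; i++) {
            map_subjob(i);
            combine_subjob(i);   /* the renamed in-loop Reduce phase */
        }
        reduce_partials();       /* Reduce over the per-iteration partial results */
        merge_results();
        return 0;
    }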

  34. Prototype of Tiled-MapReduce. Ostrich: a prototype of the Tiled-MapReduce programming model • Demonstrates the effectiveness of the TMR programming model • Based on the Phoenix runtime • Follows its data structures and algorithms

  35. Ostrich Implementation. Worker threads are started, the input is loaded from disk into the input buffer, and an intermediate buffer is set up in main memory.

  36. Ostrich Implementation. Map workers process only the current iteration window over the input, i.e. one sub-job per iteration.
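A minimal sketch of the iteration window shown on slide 36: instead of keeping the whole input resident, a fixed-size window is walked over the file and one Map/Combine iteration runs per window. The window size, the process_subjob hook, and splitting at raw byte boundaries are assumptions for illustration; a real implementation would align windows to record boundaries.

    #include <stdio.h>
    #include <stdlib.h>

    #define WINDOW_SIZE (64 * 1024 * 1024)   /* one sub-job's worth of input */

    static void process_subjob(const char *window, size_t len)
    {
        /* Map + Combine over this window; only the partial result is kept. */
        printf("iteration over %zu bytes\n", len);
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <input-file>\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        char *window = malloc(WINDOW_SIZE);   /* reused for every iteration */
        if (!window) { perror("malloc"); fclose(f); return 1; }

        size_t n;
        while ((n = fread(window, 1, WINDOW_SIZE, f)) > 0)
            process_subjob(window, n);        /* only this window is resident */

        free(window);
        fclose(f);
        return 0;
    }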
