Riffle: Optimized Shuffle Service for Large-Scale Data Analytics (PowerPoint PPT Presentation)


  1. Riffle: Optimized Shuffle Service for Large-Scale Data Analytics
     Haoyu Zhang (Princeton University), Brian Cho (Facebook), Ergin Seyfe (Facebook), Avery Ching (Facebook), Michael J. Freedman (Princeton University)

  2. Batch analytics systems are widely used
     • Large-scale SQL queries
     • Custom batch jobs
     • Pre-/post-processing for ML
     10s of PB of new data are generated every day for batch processing; 100s of TB of data are added to be processed by a single job

  3. Batch analytics jobs: logical graph
     • Narrow dependencies: map, filter
     • Wide dependencies: join, groupBy

  4. Batch analytics jobs: DAG execution plan
     • Shuffle: all-to-all communication between stages (Stage 1 → Stage 2)
     • Intermediate data is >10x larger than available memory, with strong fault tolerance requirements → on-disk shuffle files

  5. The case for tiny tasks
     • Benefits of slicing jobs into small tasks:
       • Improve parallelism [Tinytasks HotOS 13] [Subsampling IC2E 14] [Monotask SOSP 17]
       • Improve load balancing [Sparrow SOSP 13]
       • Reduce straggler effect [Dolly NSDI 13] [SparkPerf NSDI 15]

  6. The case against tiny tasks
     "Although we were able to run the Spark job with such a high number of tasks, we found that there is significant performance degradation when the number of tasks is too high." [*]
     • Engineering experience often argues against running too many tasks
     • Medium scale → very large scale (10x larger than memory space)
     • Single-stage jobs → multi-stage jobs (> 50%)
     [*] Apache Spark @Scale: A 60 TB+ Production Use Case. https://tinyurl.com/yadx29gl

  7. Shuffle I/O grows quadratically with data
     [Charts: shuffle time (sec), I/O request count (x 10^6), and shuffle fetch size (KB) vs. number of tasks]
     • Large amount of fragmented I/O requests
     • Adversarial workload for hard drives!
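
     The quadratic blow-up is easy to see with a back-of-the-envelope model (my illustration, not code from the talk; the ~300 MB of output per map task is an assumed constant):

        # With M map tasks and R reduce tasks, every reducer fetches one
        # block from every map output, so a shuffle issues M * R fetch
        # requests. Scaling M and R linearly with the input makes request
        # count grow quadratically while the average block size shrinks
        # toward sizes that are hostile to hard drives.

        def shuffle_io(data_bytes, num_maps, num_reduces):
            requests = num_maps * num_reduces
            avg_block_bytes = data_bytes / requests
            return requests, avg_block_bytes

        for tasks in (1_000, 5_000, 10_000):    # scale M = R with the data
            data = tasks * 300 * 2**20          # assumption: ~300 MB per map output
            reqs, block = shuffle_io(data, tasks, tasks)
            print(f"{tasks:6d} tasks: {reqs/1e6:7.1f}M requests, "
                  f"avg block {block/1024:7.1f} KB")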

  8. Strawman: tune number of tasks in a job
     • Tasks spill intermediate data to disk if data splits exceed memory capacity
     • Larger tasks reduce shuffle I/O, but increase spill I/O
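
     To see why there is a sweet spot that moves with the input size, here is a toy cost model (my illustration, not from the talk; the 14 GB memory figure reuses the per-executor number from the experiment setup slide):

        # Fewer, bulkier map tasks mean fewer shuffle requests, but more
        # data spilled to disk once a task's share of the input exceeds
        # its memory. Retuning is needed whenever the input size changes.

        def io_tradeoff(data_bytes, num_maps, num_reduces, task_mem_bytes):
            shuffle_requests = num_maps * num_reduces
            per_task = data_bytes / num_maps
            spill_bytes = max(0.0, per_task - task_mem_bytes) * num_maps
            return shuffle_requests, spill_bytes

        DATA = 100 * 2**40    # assumed 100 TB job
        MEM = 14 * 2**30      # 14 GB per executor (testbed slide)
        for maps in (300, 1_000, 4_000, 10_000):
            reqs, spill = io_tradeoff(DATA, maps, maps, MEM)
            print(f"{maps:6d} maps: {reqs/1e6:6.1f}M requests, "
                  f"spill {spill/2**40:6.1f} TB")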

  9. Strawman: tune number of tasks in a job
     [Charts: shuffle and spill time (sec) vs. number of map tasks, at three job scales]
     • Need to retune when input data volume changes, for each individual job
     • Bulky tasks can be detrimental [Dolly NSDI 13] [SparkPerf NSDI 15] [Monotask SOSP 17]
       • Straggler problems, imbalanced workload, garbage collection overhead

  10. Large amount of small tasks → fragmented shuffle I/O. Fewer, bulky tasks → sequential shuffle I/O.

  11. Riffle: optimized shuffle service
      [Diagram: the driver's Job/Task Scheduler assigns tasks to executors on each worker node and receives task statuses; the driver's Riffle Merge Scheduler sends merge requests to the Riffle shuffle service on each worker, which merges shuffle files in the local file system and reports merge statuses]
      • Riffle shuffle service: a long-running instance on each physical node
      • Riffle scheduler: keeps track of shuffle files and issues merge requests

  12. Riffle: optimized shuffle service
      • When receiving a merge request, the worker-side merger:
        1. Combines small shuffle files into larger ones
        2. Keeps the original file layout
      • Reducers fetch fewer, large blocks instead of many, small blocks
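
      A minimal sketch of the layout-preserving merge, assuming each map output is a data file plus an index of per-partition segment lengths (this format and the function names are illustrative, not Riffle's actual on-disk layout):

         # Concatenate the inputs' segments partition by partition, so
         # partition r of the merged file is still one contiguous block
         # and a reducer issues one large sequential read for it.

         def merge_shuffle_files(inputs, out_path):
             """inputs: list of (path, lengths), where lengths[r] is the
             size in bytes of partition r's segment in that file."""
             num_partitions = len(inputs[0][1])
             files = [open(path, "rb") for path, _ in inputs]
             merged_lengths = []            # index entries for reducers
             try:
                 with open(out_path, "wb") as out:
                     for r in range(num_partitions):
                         total = 0
                         for f, (_, lengths) in zip(files, inputs):
                             out.write(f.read(lengths[r]))   # next segment, in partition order
                             total += lengths[r]
                         merged_lengths.append(total)
             finally:
                 for f in files:
                     f.close()
             return merged_lengths

      Because the merged file keeps the original layout, reducers need no new logic: partition r is still a single block, just N times larger.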

  13. Results with merge operations on synthetic workload
      [Charts: map/reduce stage time (sec), read block size (KB), and number of read requests, for no merge vs. 5-, 10-, 20-, and 40-way merge]
      • Riffle reduces the number of fetch requests by 10x
      • Reduce stage -393s, map stage +169s → job completes 35% faster

  14. Best-effort merge: mixing merged and unmerged files
      [Charts: the same metrics as slide 13, with best-effort merge at a 95% threshold]
      • Reduce stage -393s, map stage +52s → job completes 53% faster
      • Riffle finishes the job with only ~50% of cluster resources!

  15. Additional enhancements
      • Handling merge operation failures
      • Efficient memory management (buffered reads and writes in the mergers)
      • Balancing merge requests across the cluster
      [Diagram: k job drivers issue merge requests to per-node mergers, which use buffered read/write to combine blocks, e.g. Block 65-1 ... Block 65-m into a single Block 65]
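
      As a toy illustration of the balancing point, a dispatcher could send each merge request to the least-loaded merger; the policy and names here are assumptions, since the slide only states that requests are balanced:

         # Send each merge request to the merger with the fewest
         # outstanding merges ("least-loaded" is an assumed policy).

         class MergeDispatcher:
             def __init__(self, merger_ids):
                 self.load = {m: 0 for m in merger_ids}  # outstanding merges

             def dispatch(self):
                 merger = min(self.load, key=self.load.get)
                 self.load[merger] += 1                  # one more in flight
                 return merger

             def complete(self, merger):
                 self.load[merger] -= 1                  # merge finished

         d = MergeDispatcher(["merger-1", "merger-2", "merger-3"])
         print([d.dispatch() for _ in range(5)])         # spreads evenly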

  16. Experiment setup
      • Testbed: Spark on a 100-node cluster
        • 56 CPU cores, 256GB RAM, 10Gbps Ethernet links per node
        • Each node runs 14 executors, each with 4 cores, 14GB RAM
      • Workload: 4 representative production jobs at Facebook

        Job  Data      Map     Reduce  Block  Description
        1    167.6 GB    915      200  983 K  ad
        2    1.15 TB   7,040    1,438  120 K  measurement
        3    2.7 TB    8,064    2,500  147 K  measurement
        4    267 TB   36,145   20,011  360 K  ad

  17. Reduction in shuffle I/O requests
      [Chart: shuffle I/O requests (x 10^6) for Job1-Job4, comparing no merge against a 512KB block-size policy and 10-, 20-, 40-way merge]
      • Riffle reduces the number of I/O requests by 5-10x for medium / large scale jobs

  18. Savings in end-to-end job completion time
      [Charts: end-to-end time (days) and total task execution time (CPU days) for Job1-Job4, same configurations as slide 17]
      • Map stage time is almost unaffected (with best-effort merge)
      • Reduces job completion time by 20-40% for medium / large jobs

  19. Conclusion
      • Shuffle I/O becomes the scaling bottleneck for multi-stage jobs
      • Riffle efficiently schedules merge operations and mitigates merge stragglers
      • Riffle is deployed for Facebook's production jobs, processing PBs of data

  20. Thanks! Haoyu Zhang haoyuz@cs.princeton.edu http://www.haoyuzhang.org

  21. Riffle merge policies
      [Diagram: N map output files, each containing blocks 1..R, are merged into one file; merges are issued N files at a time, or once the accumulated average block size exceeds the merge threshold]
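
      In code form, the two policy flavors the diagram suggests might look like this sketch (function names and exact trigger semantics are my assumptions):

         # Merge a fixed number of files at a time (N-way merge), or
         # merge enough files that the merged average block size crosses
         # a threshold (e.g., 512 KB as in the evaluation legends).
         # Average block sizes add up under merging: block r of the
         # merged file is the concatenation of every input's block r.

         def files_to_merge(pending, n_way=None, block_threshold=None):
             """pending: list of (path, avg_block_bytes) of map outputs."""
             if n_way is not None and len(pending) >= n_way:
                 return pending[:n_way]
             if block_threshold is not None:
                 batch, merged_block = [], 0
                 for f in pending:
                     batch.append(f)
                     merged_block += f[1]
                     if merged_block >= block_threshold:
                         return batch
             return []      # not enough accumulated yet; keep waiting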

  22. Best-effort merge
      • Observation: slowdown in the map stage is mostly due to stragglers in the last merge operations
      [Timeline diagram: merger threads finish at different times; a few slow merges hold up the whole stage]
      • Best-effort merge: mix merged and unmerged shuffle files
      • When the number of finished merge requests exceeds a user-specified percentage threshold, stop waiting for more merge results
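
      A minimal sketch of the cutoff logic, with illustrative names (the slide only specifies the percentage threshold):

         # Once the finished fraction of merge requests crosses the
         # threshold, point reducers at merged files where available and
         # fall back to the original unmerged files elsewhere.

         def best_effort_outputs(merge_results, originals, threshold=0.95):
             """merge_results: file-group -> merged path, or None if that
             merge has not finished; originals: file-group -> unmerged paths."""
             done = sum(1 for v in merge_results.values() if v is not None)
             if done / len(merge_results) < threshold:
                 return None                  # keep waiting for more merges
             outputs = []
             for group, merged in merge_results.items():
                 if merged is not None:
                     outputs.append(merged)            # one big merged file
                 else:
                     outputs.extend(originals[group])  # unmerged fallback
             return outputs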
