
No Shard Left Behind: Straggler-free data processing in Cloud Dataflow - PowerPoint PPT Presentation



  1. No Shard Left Behind: Straggler-free data processing in Cloud Dataflow. Eugene Kirpichov, Senior Software Engineer

  2. [Gantt chart: Workers vs. Time] Google Cloud Platform 2

  3. [image-only slide]

  4. Plan. 01 Intro: setting the stage. 02 Stragglers: where they come from and how people fight them. 03 Dynamic rebalancing: (1) how it works, (2) why it is hard. 04 Autoscaling: why dynamic rebalancing really matters. 05 If you remember two things: philosophy of everything above.

  5. 01 Intro Setting the stage

  6. Google’s data processing timeline, 2002–2016: GFS, MapReduce, BigTable, Pregel, Dremel, Colossus, FlumeJava, Spanner, MillWheel, Dataflow, Apache Beam.

  7. WordCount:
     Pipeline p = Pipeline.create(options);
     p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
      .apply(FlatMapElements.via(line → Arrays.asList(line.split("[^a-zA-Z']+"))))
      .apply(Filter.byPredicate(word → !word.isEmpty()))
      .apply(Count.perElement())
      .apply(MapElements.via(count → count.getKey() + ": " + count.getValue()))
      .apply(TextIO.Write.to("gs://.../..."));
     p.run();

  8. ParDo: DoFn: A → [B]. GroupByKey (GBK): (K, V) → (K, [V]). MapReduce = ParDo + GroupByKey + ParDo.
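
  The identity on this slide can be sketched in plain Java with no Beam dependency. The names below (MiniMapReduce, parDo, groupByKey) are hypothetical; ParDo is modeled as a flatMap and GroupByKey as a grouping of key-value pairs:

  ```java
  import java.util.*;
  import java.util.function.Function;
  import java.util.stream.Collectors;

  // A minimal model of the slide's identity: MapReduce = ParDo + GroupByKey + ParDo.
  public class MiniMapReduce {
      // ParDo: apply a DoFn A -> [B] to every element, concatenating outputs.
      static <A, B> List<B> parDo(List<A> input, Function<A, List<B>> doFn) {
          return input.stream().flatMap(a -> doFn.apply(a).stream()).collect(Collectors.toList());
      }

      // GroupByKey: (K, V) pairs -> map K -> [V].
      static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
          return pairs.stream().collect(Collectors.groupingBy(
                  Map.Entry::getKey,
                  Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
      }

      // WordCount expressed as ParDo + GroupByKey + ParDo.
      static Map<String, Integer> wordCount(List<String> lines) {
          List<Map.Entry<String, Integer>> pairs = parDo(lines, line ->
                  Arrays.stream(line.split("[^a-zA-Z']+"))
                        .filter(w -> !w.isEmpty())
                        .<Map.Entry<String, Integer>>map(w -> new AbstractMap.SimpleEntry<>(w, 1))
                        .collect(Collectors.toList()));
          Map<String, Integer> counts = new HashMap<>();
          groupByKey(pairs).forEach((k, vs) ->
                  counts.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
          return counts;
      }
  }
  ```

  The second ParDo here is the per-key sum; in the real WordCount it is Count.perElement.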

  9. Running a ParDo. [diagram: shards 1…N, each processed by its own DoFn instance]

  10. Gantt charts. [chart: Workers on the vertical axis, Time on the horizontal; each bar is one shard, e.g. shard N]

  11. Large WordCount: read files, GroupByKey, write files. 400 workers, 20 minutes.

  12. 02 Stragglers Where they come from, and how people fight them

  13. Stragglers. [Gantt chart: Workers vs. Time]

  14. Amdahl’s law: it gets worse at scale. With N workers and serial fraction s, speedup = 1 / (s + (1 − s) / N). Higher scale ⇒ more bottlenecked by serial parts.
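
  A quick worked form of the law (the function is a direct transcription of the standard formula; the serial fraction 0.05 in the note below is an illustrative assumption, not a number from the talk):

  ```java
  // Amdahl's law: with serial fraction s, speedup on n workers is 1 / (s + (1 - s) / n).
  public class Amdahl {
      static double speedup(double serialFraction, int workers) {
          return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
      }
  }
  ```

  With s = 0.05, going from 24 workers to 400 lifts the speedup only from about 11.2 to about 19.1, against a hard ceiling of 1/s = 20: the serial parts dominate at scale.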

  15. Where do stragglers come from? Uneven partitioning: process a dictionary in parallel by first letter; ⅙ of words start with ‘t’ ⇒ < 6x speedup. Uneven complexity: join Foos / Bars in parallel by Foos; some Foos have ≫ more Bars than others. Uneven resources: bad machines, bad network, bugs. Noise: spuriously slow external RPCs, resource contention.

  16. What would you do? Uneven partitioning: oversplit, use data statistics ⇒ unreliable. Uneven complexity: hand-tune. Uneven resources / noise: backups, restarts ⇒ predictive, weak.

  17. These kinda work. But not really. Manual tuning = Sisyphean task: time-consuming, uninformed, obsoleted by data drift ⇒ almost always tuned wrong. Statistics are often missing or wrong, don’t exist for intermediate data, and size != complexity. Backups/restarts only address slow workers.

  18. Upfront heuristics don’t work: will predict wrong. Higher scale → more likely.

  19. High scale triggers worst-case behavior. Corollary: if you’re bottlenecked by worst-case behavior, you won’t scale.

  20. 03.1 Dynamic rebalancing How it works

  21. Detect and fight stragglers. [Gantt chart: Workers vs. Time]

  22. What is a straggler, really? A shard slower than perfectly-parallel execution: t_end > sum(t_end) / N. [Gantt chart: Workers vs. Time]

  23. Split stragglers, return residuals into the pool of work. Splitting (e.g. foo.txt) is cheap and atomic; workers keep running. [Gantt chart with “Now” and average-completion-time markers; estimated completions 170, 100, 130, 200 re-schedule to 100, 170, 170, 200]
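
  The idea can be simulated with a toy scheduler step. This is only an illustration, not Dataflow’s actual algorithm: the name splitWorst, the halving of remaining work, and the inputs in the note below are all assumptions:

  ```java
  // Toy dynamic-rebalancing step: split the worst straggler's remaining work in
  // half; the residual half returns to the pool, where an idle worker picks it up
  // while the original worker keeps running (the split itself is cheap and atomic).
  public class LiquidSharding {
      // finishTimes: estimated finish time of each running shard if nothing is split.
      static double[] splitWorst(double[] finishTimes, double now) {
          int worst = 0;
          for (int i = 1; i < finishTimes.length; i++)
              if (finishTimes[i] > finishTimes[worst]) worst = i;
          double remaining = finishTimes[worst] - now;
          double[] out = new double[finishTimes.length + 1];
          System.arraycopy(finishTimes, 0, out, 0, finishTimes.length);
          out[worst] = now + remaining / 2;              // primary keeps the first half
          out[finishTimes.length] = now + remaining / 2; // residual runs on an idle worker
          return out;
      }
  }
  ```

  For example, splitting estimated completions {100, 130, 200} at now = 140 yields {100, 130, 170, 170}: the tail of the schedule drops from 200 to 170, consistent with the 170s on the slide.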

  24. Rinse, repeat (“liquid sharding”). [Gantt chart: Workers vs. Time, with “Now” and average-completion-time markers]

  25. [Results table: 1-ParDo and ParDo/GBK/ParDo pipelines, skewed and uniform data, 24 and 400 workers; savings of 50% and 25%]

  26. Adaptive > Predictive. Get out of trouble > avoid trouble.

  27. 03.2 Dynamic rebalancing Why is it hard?

  28. And that’s it? What’s so hard? Semantics: what can be split? data consistency; not just files; APIs. Quality: wait-free; perfect granularity. Making predictions: non-uniform density; stuckness; “dark matter”. Being sure it works: testing consistency; debugging; measuring quality.

  29. What is splitting: foo.txt [100, 200), split at 170 → foo.txt [100, 170) + foo.txt [170, 200).

  30. What is splitting: Associativity. [A, B) + [B, C) = [A, C)

  31. What is splitting: Rounding up. [A, B) = records starting in [A, B). Random access ⇒ can split without scanning data!

  32. What is splitting: Rounding up. [A, B) = records starting in [A, B). Example: records apple, beet, fig, grape, kiwi, lime, pear, rose, squash, vanilla split into [a, h), [h, s), [s, $). Random access ⇒ can split without scanning data!
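
  A minimal sketch of the rounding-up rule (RangeSplit and owned are hypothetical names; "{", the character after "z", stands in for the slide’s "$" end-of-range marker):

  ```java
  import java.util.*;
  import java.util.stream.Collectors;

  // "Rounding up": the shard [A, B) owns exactly the records whose START position
  // lies in [A, B). Ownership depends only on each record's start, so a range can
  // be split at an arbitrary position without scanning the data first, and
  // adjacent ranges compose associatively: [A, B) + [B, C) = [A, C).
  public class RangeSplit {
      // Records (here: their start keys) owned by the sub-range [from, to).
      static List<String> owned(List<String> records, String from, String to) {
          return records.stream()
                  .filter(r -> r.compareTo(from) >= 0 && r.compareTo(to) < 0)
                  .collect(Collectors.toList());
      }
  }
  ```

  With the slide’s fruits, [a, h) owns apple..grape, [h, s) owns kiwi..rose, and [s, "{") owns squash and vanilla; the three parts together cover the original range exactly once.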

  33. What is splitting: Blocks. [A, B) = records in blocks starting in [A, B).

  34. What is splitting: Readers. A “Reader” over foo.txt [100, 200), split at 170, becomes foo.txt [100, 170) + foo.txt [170, 200). Re-reading consistency: continuing until EOF must equal re-reading the shard.

  35. Dynamic splitting: readers. [diagram: read vs. not-yet-read portions; ok vs. not-ok split points] X = last record read must be exact and increasing; e.g. can’t split an arbitrary SQL query.

  36. [A, B) = blocks of records starting in [A, B). [A, B) + [B, C) = [A, C). Random access ⇒ no scanning needed to split. Reading is repeatable, ordered by position, positions exact.

  37. Concurrency when splitting. [timeline: Read … Process, with “should I split?” arriving mid-record] Per-element processing in O(hours) is common; while we wait for an answer, 1000s of workers idle.

  38. Concurrency when splitting, continued. [timeline: “split!” / “ok.” exchanged while Read/Process continue] Split wait-free (but race-free), while processing/reading. See code: RangeTracker.
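
  A toy version of the claim/split protocol the slide points at. Method names echo Beam’s RangeTracker API, but this is a simplified reimplementation, and where the slide says wait-free, this sketch just uses a synchronized block to stay race-free:

  ```java
  // Toy range tracker: the reader claims each record's offset before emitting it,
  // and a concurrent split may only move the end of the range to a position
  // strictly beyond the last claimed offset. Neither side blocks the other for
  // long, so splits never wait for an in-flight O(hours) record to finish.
  public class ToyRangeTracker {
      private final long start;
      private long stop;             // exclusive end; may shrink due to splits
      private long lastClaimed = -1; // offset of the last record the reader claimed

      ToyRangeTracker(long start, long stop) { this.start = start; this.stop = stop; }

      // Reader thread: claim the next record. False => past the (possibly moved) end.
      synchronized boolean tryClaim(long offset) {
          if (offset >= stop) return false;
          lastClaimed = offset;
          return true;
      }

      // Service thread: shrink to [start, splitOffset); the caller re-pools the residual.
      synchronized boolean trySplit(long splitOffset) {
          if (splitOffset <= lastClaimed || splitOffset <= start || splitOffset >= stop)
              return false;
          stop = splitOffset;
          return true;
      }

      synchronized long stop() { return stop; }
  }
  ```

  The invariant doing the work: a split behind the reader is rejected, so every record is processed by exactly one of the primary and the residual.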

  39. Perfectly granular splitting. “Few records, heavy processing” is common ⇒ perfect parallelism required.

  40. Separation: ParDo { record → sleep(∞) } parallelized perfectly (requires wait-free + perfectly granular).

  41. Separation is a qualitative improvement. /path/to/foo*.txt → ParDo: expand glob (perfectly parallel over files) → ParDo: read records (perfectly parallel over records) ⇒ infinite scalability, no “shard per file”. See also: Splittable DoFn, http://s.apache.org/splittable-do-fn

  42. “Practical” solutions improve performance. “No compromise” solutions reduce dimension of the problem space.

  43. [image-only slide]

  44. Making predictions: easy, right? At position 130 in [100, 200): (130 − 100) / (200 − 100) = 0.3, i.e. ~30% complete. Split at 70%: 100 + 0.7 · (200 − 100) = 170. Same for keys: at ‘k’ in [a, z), ~50% complete; split at 70% ≈ ‘t’.
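
  The slide’s arithmetic, spelled out (Progress and its method names are hypothetical helpers for illustration):

  ```java
  // Linear progress estimation over a position range [start, stop):
  // fraction consumed at the current position, and the position that
  // corresponds to a requested split fraction.
  public class Progress {
      static double fractionConsumed(long start, long stop, long position) {
          return (double) (position - start) / (stop - start);
      }

      static long positionAtFraction(long start, long stop, double fraction) {
          return start + Math.round(fraction * (stop - start));
      }
  }
  ```

  So fractionConsumed(100, 200, 130) = 0.3 and positionAtFraction(100, 200, 0.7) = 170, exactly as on the slide; the next slide’s point is that this linear model is usually too good to be true.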

  45. Easy; usually too good to be true. [Six progress-vs-time plots: actual progress curves are rarely linear]

  46. Accurate predictions = wrong goal, infeasible. Wildly off ⇒ the system should still work. Optimize for emergent behavior (separation). Better goal: detect stuckness.

  47. Dark matter: heavy work that you don’t know exists, until you hit it. Goal: discover and distribute dark matter as quickly as possible. (Image credit: NASA)

  48. 04 Autoscaling Why dynamic rebalancing really matters

  49. A lot of work ⇒ a lot of workers. How much work will there be? Can’t predict: data size, complexity, etc. What should you do? Adaptive > Predictive: keep re-estimating total work; scale up/down. (Image credit: Wikipedia)
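
  A hypothetical illustration of “keep re-estimating total work”: neither the formula nor any constant below comes from Dataflow. The only point is that the target worker count is recomputed from the latest estimate each time, rather than predicted once upfront:

  ```java
  // Recompute the target worker count from the current estimate of remaining work.
  // estRemainingWorkSec: latest estimate of remaining work, in worker-seconds.
  // throughputPerWorker: worker-seconds of work retired per wall-clock second per worker.
  // horizonSec: how soon we would like the remaining work to finish.
  public class Autoscaler {
      static int targetWorkers(double estRemainingWorkSec, double throughputPerWorker,
                               double horizonSec, int maxWorkers) {
          int n = (int) Math.ceil(estRemainingWorkSec / (throughputPerWorker * horizonSec));
          return Math.max(1, Math.min(maxWorkers, n));
      }
  }
  ```

  Re-running this as estimates jump by orders of magnitude is what produces the 3 → 100 → 1000 worker swings on the next slides; dynamic rebalancing is what makes those extra workers actually useful.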

  50. Start off with 3 workers; things are looking okay (~10m in). Re-estimation ⇒ orders of magnitude more work (~3 days): need 100 workers! But 100 workers are useless without 100 pieces of work: 92 workers idle.

  51. Autoscaling + dynamic rebalancing: now scaling up is no big deal! Add workers; work distributes itself. Job smoothly scales 3 → 1000 workers. [Gantt chart: waves of splitting during upscaling & VM startup]

  52. 05 If you remember two things Philosophy of everything above

  53. If you remember two things. (1) Adaptive > Predictive: fighting stragglers > preventing stragglers; emergent behavior > local precision. (2) “No compromise” solutions matter: reducing dimension > incremental improvement; “corner cases” are clues that you’re still compromising. Examples: wait-free splitting, heavy records, separation, perfect granularity, reading-as-ParDo, rebalancing, autoscaling, reusability.

  54. Thank you. Q&A
