No Shard Left Behind: Straggler-free data processing in Cloud Dataflow
Eugene Kirpichov, Senior Software Engineer
(Figure: Gantt chart, workers vs. time)
Google Cloud Platform
Plan
01 Intro: Setting the stage
02 Stragglers: Where they come from and how people fight them
03 Dynamic rebalancing: 1. How it works. 2. Why is it hard.
04 Autoscaling: Why dynamic rebalancing really matters
05 If you remember two things: Philosophy of everything above
01 Intro Setting the stage
Google’s data processing timeline (2002–2016): GFS, MapReduce, BigTable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow, Apache Beam.
WordCount

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(word -> Arrays.asList(word.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://.../..."));
p.run();
ParDo: DoFn: A → [B]
GroupByKey (GBK): (K, V) → (K, [V])
MapReduce = ParDo + GroupByKey + ParDo
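This decomposition can be sketched with plain JDK collections (illustrative only; this is not the Beam API, and the class and method names are made up): the first ParDo emits (word, 1) pairs, GroupByKey collects the values per key, and the second ParDo sums each group.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch with plain JDK collections (not the Beam API):
// MapReduce = ParDo (emit (word, 1)) + GroupByKey + ParDo (sum).
class MiniMapReduce {
  static Map<String, Integer> wordCount(List<String> lines) {
    // ParDo: DoFn A -> [B], here line -> [(word, 1)]
    List<Map.Entry<String, Integer>> pairs = lines.stream()
        .flatMap(line -> Arrays.stream(line.split("[^a-zA-Z']+")))
        .filter(w -> !w.isEmpty())
        .map(w -> Map.entry(w, 1))
        .collect(Collectors.toList());
    // GroupByKey: (K, V) -> (K, [V])
    Map<String, List<Integer>> grouped = pairs.stream()
        .collect(Collectors.groupingBy(Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    // ParDo: (K, [V]) -> (K, sum)
    Map<String, Integer> counts = new HashMap<>();
    grouped.forEach((k, vs) ->
        counts.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(wordCount(List.of("the cat and the hat")));
  }
}
```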
Running a ParDo: the input is split into shards 1…N; each shard is processed by its own DoFn instance.
Gantt charts: workers on the vertical axis, time on the horizontal; each bar is one shard's execution.
Large WordCount: Read files, GroupByKey, Write files. 400 workers, 20 minutes.
02 Stragglers Where they come from, and how people fight them
Stragglers (Figure: Gantt chart; a few straggler shards keep running long after most workers have finished).
Amdahl’s law: it gets worse at scale. With N workers and serial fraction s, speedup ≤ 1 / (s + (1 − s)/N). Higher scale ⇒ more bottlenecked by serial parts.
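As a worked example (the standard Amdahl formula; the particular numbers are illustrative), even a 1% serial fraction caps a 400-worker job well below a 100x speedup:

```java
// Amdahl's law: with serial fraction s, speedup on N workers is
// bounded by 1 / (s + (1 - s) / N).
class Amdahl {
  static double speedup(double serialFraction, int workers) {
    return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
  }

  public static void main(String[] args) {
    // A 1% serial part dominates at scale: ~80x at N=400, ~98x at N=4000.
    System.out.printf("s=0.01, N=400  -> %.1fx%n", speedup(0.01, 400));
    System.out.printf("s=0.01, N=4000 -> %.1fx%n", speedup(0.01, 4000));
  }
}
```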
Where do stragglers come from?
- Uneven partitioning: process a dictionary in parallel by first letter; ⅙ of words start with ‘t’ ⇒ less than 6x speedup.
- Uneven complexity: join Foos / Bars in parallel by Foos; some Foos have far more Bars than others.
- Uneven resources: bad machines, bad network, resource contention.
- Noise: spuriously slow external RPCs, bugs.
What would you do?
- Uneven partitioning / complexity: oversplit, hand-tune, use data statistics. Predictive ⇒ unreliable.
- Uneven resources / noise: backups, restarts. ⇒ Weak.
These kinda work. But not really.
- Manual tuning is a Sisyphean task: time-consuming, uninformed, obsoleted by data drift ⇒ almost always tuned wrong.
- Statistics are often missing or wrong, don't exist for intermediate data, and size != complexity.
- Backups/restarts only address slow workers.
Upfront heuristics don’t work: they will predict wrong, and the higher the scale, the more likely.
High scale triggers worst-case behavior. Corollary: If you’re bottlenecked by worst-case behavior, you won’t scale.
03.1 Dynamic rebalancing How it works
Detect and fight stragglers (Figure: Gantt chart of workers over time).
What is a straggler, really? A shard that is slower than perfectly-parallel execution: t_end > Σ t_end / N.
Split stragglers, return residuals into the pool of work. A split (e.g. foo.txt at position 170) is cheap and atomic, and workers keep running while the residual is rescheduled. (Figure: estimated completion times 100, 130, 170, 200 before the split; 100, 170, 170, 200 after.)
Rinse, repeat (“liquid sharding”). (Figure: repeated splitting flattens the Gantt chart toward the average completion time.)
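A toy version of this decision logic, a sketch of my own rather than Dataflow's actual policy: a shard is a straggler if its estimated end time exceeds the average end time, and we split it so the kept part finishes at the average, returning the rest to the pool.

```java
// Toy straggler math (illustrative sketch, not Dataflow's policy).
class LiquidSharding {
  // A shard is a straggler if it finishes later than perfectly-
  // parallel execution would: tEnd > sum(tEnd) / N.
  static boolean isStraggler(double tEnd, double[] allEnds) {
    double avg = 0;
    for (double t : allEnds) avg += t;
    avg /= allEnds.length;
    return tEnd > avg;
  }

  // Fraction of the REMAINING work the current worker keeps so that
  // it finishes at time tAvg; the rest becomes the residual shard.
  static double keepFraction(double now, double tEnd, double tAvg) {
    return (tAvg - now) / (tEnd - now);
  }

  public static void main(String[] args) {
    double[] ends = {100, 130, 170, 200};          // avg = 150
    System.out.println(isStraggler(200, ends));    // true
    System.out.println(isStraggler(130, ends));    // false
    System.out.println(keepFraction(0, 200, 150)); // 0.75
  }
}
```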
(Figure: benchmark results for a single-ParDo pipeline and a ParDo/GBK/ParDo pipeline, on skewed and uniform data, at 24 and 400 workers; improvements on the order of 25–50%.)
Adaptive > Predictive. Get out of trouble > avoid trouble.
03.2 Dynamic rebalancing Why is it hard?
And that’s it? What’s so hard?
- Semantics: what can be split? Data consistency. Not just files. APIs.
- Quality: wait-free. Perfect granularity. “Dark matter”.
- Making predictions: non-uniform density. Stuckness. Measuring quality.
- Being sure it works: testing consistency. Debugging.
What is splitting? foo.txt [100, 200), split at 170 → foo.txt [100, 170) + foo.txt [170, 200).
What is splitting: Associativity. [A, B) + [B, C) = [A, C)
What is splitting: Rounding up
[A, B) = records starting in [A, B). Random access ⇒ can split without scanning data!
(Figure: the words apple, beet, fig, grape, kiwi, lime, pear, rose, squash, vanilla partitioned by first letter into [a, h), [h, s), [s, $).)
What is splitting: Blocks. [A, B) = records in blocks starting in [A, B)
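The rounding-up convention is easy to check in code (the record start offsets below are hypothetical; the point is that with this convention, [A, B) + [B, C) = [A, C) holds exactly, so a split needs no data scan):

```java
import java.util.*;

// "A range [A, B) owns the records that START in [A, B)." With this
// rounding-up convention ranges compose exactly: the records of
// [A, B) plus those of [B, C) are precisely the records of [A, C).
// Record start offsets below are hypothetical.
class RangeSemantics {
  static List<Long> recordsIn(long a, long b, long[] starts) {
    List<Long> out = new ArrayList<>();
    for (long s : starts) {
      if (a <= s && s < b) out.add(s);
    }
    return out;
  }

  public static void main(String[] args) {
    long[] starts = {100, 115, 140, 168, 171, 195};
    List<Long> joined = new ArrayList<>(recordsIn(100, 170, starts));
    joined.addAll(recordsIn(170, 200, starts));
    // Associativity: [100, 170) + [170, 200) = [100, 200)
    System.out.println(recordsIn(100, 200, starts).equals(joined)); // true
  }
}
```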
What is splitting: Readers. A “Reader” over foo.txt [100, 200), split at 170, becomes foo.txt [100, 170) plus residual foo.txt [170, 200). Re-reading consistency: continuing until EOF must yield the same records as re-reading the shard.
Dynamic splitting with readers: X = the last record read must be exact and increasing; splitting is ok only in the not-yet-read part. E.g. you can’t dynamically split an arbitrary SQL query.
Splitting semantics, summarized:
- [A, B) = blocks of records starting in [A, B)
- [A, B) + [B, C) = [A, C)
- Random access ⇒ no scanning needed to split
- Reading is repeatable, ordered by position, positions exact
Concurrency when splitting
Per-element processing in O(hours) is common! If the reader can only answer “should I split?” between elements, 1000s of workers idle while we wait.
Instead: split wait-free (but race-free), concurrently with processing/reading. See code: RangeTracker.
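A simplified sketch of the idea, loosely modeled on Beam's RangeTracker (the real API differs): the reader claims each record's position before processing it, and a concurrent splitter may shrink the range, but only in the not-yet-read part. Both operations are short critical sections, so splitting never waits on per-element processing.

```java
// Simplified sketch, loosely modeled on Apache Beam's RangeTracker
// (the real API differs). Reader and splitter synchronize only on
// two short critical sections; record processing happens outside.
class SimpleRangeTracker {
  private final long start;
  private long stop;                 // exclusive; shrinks on split
  private long lastRecordStart = -1; // exact, increasing

  SimpleRangeTracker(long start, long stop) {
    this.start = start;
    this.stop = stop;
  }

  /** Reader: claim the record at pos; false = record is past the range. */
  synchronized boolean tryReturnRecordAt(long pos) {
    if (pos < lastRecordStart) throw new IllegalStateException("non-increasing");
    if (pos >= stop) return false;   // record belongs to the residual
    lastRecordStart = pos;
    return true;
  }

  /** Splitter: shrink the range to [start, splitPos) if still possible. */
  synchronized boolean trySplitAtPosition(long splitPos) {
    if (splitPos <= lastRecordStart) return false; // already read past it
    if (splitPos <= start || splitPos >= stop) return false;
    stop = splitPos;                 // residual [splitPos, oldStop) is
    return true;                     // returned to the pool of work
  }

  synchronized long getStopPosition() { return stop; }

  public static void main(String[] args) {
    SimpleRangeTracker t = new SimpleRangeTracker(100, 200);
    t.tryReturnRecordAt(130);                      // reader is at 130
    System.out.println(t.trySplitAtPosition(170)); // true: 170 not read yet
    System.out.println(t.trySplitAtPosition(120)); // false: already read past
    System.out.println(t.tryReturnRecordAt(180));  // false: now in residual
  }
}
```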
Perfectly granular splitting. “Few records, heavy processing” is common ⇒ perfect parallelism required.
Separation: ParDo { record → sleep(∞) } parallelized perfectly (requires wait-free + perfectly granular).
Separation is a qualitative improvement
/path/to/foo*.txt → ParDo: expand glob (perfectly parallel over files) → ParDo: read records (perfectly parallel over records) ⇒ infinite scalability (no “shard per file”).
See also: Splittable DoFn http://s.apache.org/splittable-do-fn
“Practical” solutions improve performance. “No compromise” solutions reduce the dimension of the problem space.
Making predictions: easy, right?
Position 130 in [100, 200): ~30% complete (130 / [100, 200) = 0.3). Split at 70%: 0.7 · [100, 200) = 170.
Key ‘k’ in [a, z): ~50% complete (k / [a, z) ≈ 0.5). Split at 70%: 0.7 · [a, z) ≈ ‘t’.
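These two computations can be written down directly (byte-offset case only; lexicographic key interpolation works the same way in spirit). The sketch assumes records are uniformly dense across the range, which real data often violates:

```java
// Linear progress estimation over a byte range (illustrative; real
// sources also handle keys, blocks, and non-uniform record density).
class Progress {
  static double fractionConsumed(long start, long stop, long pos) {
    return (pos - start) / (double) (stop - start);
  }

  static long positionAtFraction(long start, long stop, double fraction) {
    // Round up: the record at the returned position belongs to the residual.
    return start + (long) Math.ceil(fraction * (stop - start));
  }

  public static void main(String[] args) {
    System.out.println(fractionConsumed(100, 200, 130));   // 0.3
    System.out.println(positionAtFraction(100, 200, 0.7)); // 170
  }
}
```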
Easy; usually too good to be true. (Figure: six progress-vs-time curves showing markedly non-linear progress.)
Accurate predictions = wrong goal, infeasible.
- Predictions may be wildly off ⇒ the system should still work.
- Optimize for emergent behavior (separation).
- Better goal: detect stuckness.
Dark matter: heavy work that you don’t know exists, until you hit it. Goal: discover and distribute dark matter as quickly as possible. (Image credit: NASA)
04 Autoscaling Why dynamic rebalancing really matters
A lot of work ⇒ a lot of workers. How much work will there be? Can’t predict: data size, complexity, etc. What should you do? Adaptive > Predictive: keep re-estimating total work; scale up/down. (Image credit: Wikipedia)
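The re-estimation loop can be sketched as a toy model (my own parameterization, not Dataflow's actual policy): extrapolate total work from the fraction complete, then pick enough workers to finish the remainder within a target time.

```java
// Toy autoscaling re-estimation (illustrative, not Dataflow's policy):
// total work ~= work done so far / fraction complete; choose enough
// workers to finish the remainder within a target time.
class Autoscaler {
  static double estimateTotalWork(double workDone, double fractionComplete) {
    return workDone / fractionComplete; // assumes fractionComplete > 0
  }

  static int desiredWorkers(double remainingWork, double perWorkerThroughput,
                            double targetSeconds, int maxWorkers) {
    int n = (int) Math.ceil(remainingWork / (perWorkerThroughput * targetSeconds));
    return Math.max(1, Math.min(n, maxWorkers));
  }

  public static void main(String[] args) {
    // Re-estimation reveals far more work than first thought...
    double total = estimateTotalWork(300, 0.003);
    System.out.println("estimated total: " + total);
    // ...so scale up to finish the rest within the target time.
    System.out.println("workers: " + desiredWorkers(total - 300, 1.0, 600, 1000));
  }
}
```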
Start off with 3 workers; things look okay (ETA 10m). Re-estimation ⇒ orders of magnitude more work (ETA 3 days): need 100 workers! But 100 workers are useless without 100 pieces of work: 92 workers idle.
Autoscaling + dynamic rebalancing: now scaling up is no big deal! Add workers; work distributes itself. A job smoothly scales 3 → 1000 workers. (Figure: waves of splitting during upscaling & VM startup.)
05 If you remember two things Philosophy of everything above
If you remember two things
1. Adaptive > Predictive. Fighting stragglers > preventing stragglers. Emergent behavior > local precision.
2. “No compromise” solutions matter. Reducing dimension > incremental improvement. “Corner cases” are clues that you’re still compromising.
(Examples from this talk: wait-free, heavy records, separation, perfectly granular, reading-as-ParDo, rebalancing, autoscaling, reusability.)
Thank you. Q&A