NEPTUNE: Scheduling Suspendable Tasks for Unified Stream/Batch Applications

Panagiotis Garefalakis, Imperial College London (pgaref@imperial.ac.uk)
Konstantinos Karanasos, Microsoft (kokarana@microsoft.com)
Peter Pietzuch, Imperial College London (prp@imperial.ac.uk)

SoCC, Santa Cruz, California, November 2019
Unified application example

[Figure: a unified stream/batch application. An inference (stream) job consumes real-time data and produces low-latency responses; a training (batch) job consumes historical data and produces a trained model. The two jobs iterate, with the trained model feeding the inference job.]
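To make the example concrete, here is a minimal sketch of such a unified application in Spark (the framework NEPTUNE extends), with a batch training-style job and a streaming inference-style job sharing one SparkSession and its executors. The rate source and the toy aggregation/scoring logic are illustrative stand-ins, not the workload from the talk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object UnifiedApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("unified-stream-batch")
      .master("local[4]")
      .getOrCreate()

    // Batch (training) job: runs over historical data; a toy aggregation
    // stands in for iterative model fitting.
    val historical = spark.range(0, 1000000)
      .withColumn("label", (col("id") % 2).cast("double"))
    historical.groupBy("label").count().show()

    // Stream (inference) job: a low-latency query over live data, sharing
    // the same SparkSession (and executors) as the batch job.
    val live = spark.readStream
      .format("rate")                 // built-in test source
      .option("rowsPerSecond", "100")
      .load()
    val scored = live.withColumn("score", col("value") % 2) // toy "inference"
    scored.writeStream.format("console").start().awaitTermination()
  }
}
```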
Evolution of analytics frameworks

[Figure: timeline. Batch and stream frameworks (around 2010), then frameworks with hybrid stream/batch support (around 2014), then unified stream/batch frameworks running hybrid applications (around 2018, e.g., Spark Structured Streaming).]
Stream/Batch application requirements

> Latency: execute the inference job with minimum delay
> Throughput: batch jobs should not be compromised
> Efficiency: achieve high cluster resource utilization

Challenge: schedule stream/batch jobs to satisfy their diverse requirements
Stream/Batch application scheduling

[Figure: the driver runs the application code, which submits jobs through the App Context to the DAG scheduler. Running example: the inference (stream) job has two stages, each with 2 tasks, of durations 2T and T respectively; the training (batch) job has two stages, with 3 tasks of duration 3T and 4 tasks of duration T.]
Stream/Batch application scheduling

> Static allocation: dedicate executors to each job

[Figure: Gantt chart of cores over time; each job runs only on its own dedicated executor, leaving idle slots on both.]

Resources cannot be shared across jobs, so resources are wasted
Stream/Batch application scheduling

> FIFO (shared executors): the first job runs to completion

[Figure: Gantt chart; stream tasks queue behind the long-running batch tasks.]

Long batch jobs increase stream job latency
Stream/Batch application scheduling

> FAIR (shared executors): weighted fair sharing of resources across jobs

[Figure: Gantt chart; stream and batch tasks interleave, but stream tasks still queue behind running batch tasks.]

Better packing, but queuing still yields non-optimal latency
Stream/Batch application scheduling

> KILL (shared executors): avoid queuing by preempting batch tasks

[Figure: Gantt chart; stream tasks start immediately, while killed batch tasks restart from scratch.]

Better latency at the expense of extra (repeated) work
Stream/Batch application scheduling

> NEPTUNE (shared executors): minimize both queuing and wasted work!

[Figure: Gantt chart; batch tasks are suspended when stream tasks arrive and resumed afterwards, so stream tasks run immediately and no batch work is lost.]
Challenges

> How to minimize queuing for latency-sensitive jobs and wasted work?
  Implement suspendable tasks

> How to natively support stream/batch applications?
  Provide a unified execution framework

> How to satisfy different stream/batch application requirements and high-level objectives?
  Introduce custom scheduling policies
NEPTUNE: an execution framework for stream/batch applications

> How to minimize queuing for latency-sensitive jobs and wasted work?
  Support suspendable tasks

> How to natively support stream/batch applications?
  Unified execution framework on top of Structured Streaming

> How to satisfy different stream/batch application requirements and high-level objectives?
  Introduce pluggable scheduling policies
Typical tasks

> Tasks apply a function to a partition of data
> They run as subroutines on the executor, to completion
> Preemption problem:
  > Killing loses all progress
  > Checkpointing has unpredictable preemption times

[Figure: a task (state, iterator, context, function, value) running on the executor's stack.]
Suspendable tasks

> Idea: use coroutines
  > Separate stacks store task state
  > Yield points hand control back to the executor
> Cooperative preemption:
  > Suspend and resume in milliseconds
  > Work-preserving
  > Transparent to the user

https://github.com/storm-enroute/coroutines
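A minimal sketch of the idea, assuming the org.coroutines API (coroutine, yieldval, call, resume, result) of the linked library; the summation task and the 100-record yield interval are illustrative stand-ins, not NEPTUNE's actual task code:

```scala
import org.coroutines._

// A suspendable "task": sums a partition, yielding control back to the
// caller every 100 records.
val suspendableSum = coroutine { (partition: Array[Int]) =>
  var sum = 0
  var i = 0
  while (i < partition.length) {
    sum += partition(i)              // the task's actual work
    i += 1
    if (i % 100 == 0) yieldval(i)    // yield point: executor regains control
  }
  sum                                // returned on completion
}

// The executor drives the task: each resume runs until the next yield
// point, and between resumes it may park the instance to run a
// latency-sensitive task. The coroutine keeps its own stack, so no
// progress is lost (work-preserving suspension).
val task = call(suspendableSum((1 to 1000).toArray))
while (task.resume) { /* a scheduler could suspend/relaunch here */ }
println(task.result)                 // 500500
```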
Execution framework

> Problem: the scheduler must not only assign tasks, but also suspend and resume them
> Idea: a centralized task scheduler with pluggable policies

[Figure: the DAG scheduler (with incrementalizer and optimizer) submits tasks of high- and low-priority jobs to the task scheduler; the scheduling policy uses app/job priorities and executor metrics to launch, suspend, and resume running and paused tasks on the executors.]
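As a rough illustration of the suspend-and-launch decision (all types and names here are hypothetical, not NEPTUNE's internals): when a high-priority task arrives at a full executor, a low-priority task is suspended at its next yield point and resumed once a slot frees up.

```scala
// Hypothetical sketch of suspend-and-launch bookkeeping in a task scheduler.
sealed trait Priority
case object High extends Priority
case object Low extends Priority

final case class Task(id: Long, priority: Priority)

final class ExecutorSlots(slots: Int) {
  private var running = Vector.empty[Task]
  private var paused = Vector.empty[Task]

  def submit(task: Task): Unit =
    if (running.size < slots) {
      running :+= task                       // free slot: launch immediately
    } else if (task.priority == High) {
      running.find(_.priority == Low).foreach { victim =>
        // Cooperative preemption: the victim suspends at its next yield
        // point and its progress is kept (work-preserving).
        running = running.filterNot(_.id == victim.id)
        paused :+= victim
        running :+= task
      }
    } // low-priority tasks would queue in a real scheduler

  def finished(task: Task): Unit = {
    running = running.filterNot(_.id == task.id)
    // Resume paused (batch) work as soon as a slot frees up.
    paused.headOption.foreach { p =>
      paused = paused.tail
      running :+= p
    }
  }
}
```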
Scheduling policies

> Idea: policies trigger task suspension and resumption
  > Guarantee that stream tasks bypass batch tasks
  > Satisfy higher-level objectives, e.g., balancing cluster load
  > Avoid starvation by suspending a task only up to a bounded number of times
> Load-balancing (LB): takes executors' memory conditions into account and equalizes the number of tasks per node (see the sketch below)
> Locality- and memory-aware (LMA): additionally respects task locality preferences
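A minimal sketch of what the LB executor choice might look like, with hypothetical types; NEPTUNE's real policies also track suspension counts and, for LMA, locality preferences:

```scala
// Hypothetical sketch of a load-balancing (LB) executor choice: filter by
// available memory, then equalize running tasks per node.
final case class ExecutorState(id: String, runningTasks: Int, freeMemoryMb: Long)

def chooseExecutorLB(executors: Seq[ExecutorState], taskMemMb: Long): Option[String] =
  executors
    .filter(_.freeMemoryMb >= taskMemMb)   // respect memory conditions
    .sortBy(_.runningTasks)                // fewest running tasks first
    .headOption
    .map(_.id)
```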
Implementation

> Built as an extension to Apache Spark 2.4.0 (https://github.com/lsds/Neptune)
> Ported all ResultTask and ShuffleMapTask functionality across programming interfaces to coroutines
> Extended Spark's DAG scheduler to allow job stages with different requirements (priorities)
> Added executor performance metrics to the heartbeat mechanism
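For intuition on how applications might tag jobs with priorities: SparkContext.setLocalProperty is a real, standard Spark hook (the built-in FAIR scheduler uses it for pool assignment), but the property key below is a hypothetical placeholder, not necessarily the one Neptune's extended DAG scheduler reads.

```scala
// Sketch: tagging jobs with priorities via Spark local properties.
// "job.priority" is a hypothetical key, used here only for illustration.
val sc = spark.sparkContext

sc.setLocalProperty("job.priority", "high")
// ... actions triggered here run as the latency-sensitive (stream) job ...

sc.setLocalProperty("job.priority", "low")
// ... actions triggered here run as the batch (training) job ...
```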
Azure deployment

> Cluster: 75 nodes, each with 4 cores and 32 GB of memory
> Workloads:
  > LDA: ML training/inference application uncovering hidden topics in a group of documents
  > Yahoo Streaming Benchmark (YSB): ad analytics on a stream of ad impressions
  > TPC-H: decision-support benchmark
Benefit of NEPTUNE in stream latency

> LDA: a training (batch) job uses all available resources, while a latency-sensitive inference (stream) job uses 15% of the resources

[Figure: streaming latency (s), showing the 5th percentile, median, and 99th percentile under static allocation, FIFO, FAIR, KILL, and NEPTUNE's LMA and LB policies; annotated improvements for NEPTUNE range from 13% to 61%.]

NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs
Impact of resource demands on performance

> YSB: the stream job's resource demands increase while the batch job uses all available resources

[Figure: streaming latency (s) and batch throughput (M events/s) as the share of cores used for streaming grows from 0% to 100%; batch throughput drops by only about 1.5%.]

NEPTUNE efficiently shares resources with low impact on throughput
Summary

NEPTUNE supports complex unified applications with diverse job requirements!

> Suspendable tasks using coroutines
> Pluggable scheduling policies
> Continuous unified analytics

https://github.com/lsds/Neptune

Thank you! Questions?
Panagiotis Garefalakis, pgaref@imperial.ac.uk