Coflow: Recent Advances and What's Next?
Mosharaf Chowdhury, University of Michigan
[Overview slide: research spanning rack-scale, datacenter-scale, and geo-distributed computing, covering coflow networking, resource allocation, DAG scheduling, cluster file systems, cluster caching, and proactive and fast analytics over the WAN; open-source work around Apache Spark, Apache YARN, and Alluxio; industry collaborations including Facebook and Microsoft.]
[Latency scales: rack-scale computing < 0.01 ms; datacenter-scale computing ~1 ms; geo-distributed computing > 100 ms.]
Big Data
The volume of data businesses want to make sense of is increasing
• Increasing variety of sources: web, mobile, wearables, vehicles, scientific, …
• Cheaper disks, SSDs, and memory
• Stalling processor speeds
Big Datacenters for Massive Parallelism
[Timeline, 2005–2015: MapReduce, Hadoop, Dryad, Hive, DryadLINQ, Pregel, Dremel, Spark, GraphLab, Storm, Spark-Streaming, BlinkDB, GraphX, …]
1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI'2012.
Distributed Data-Parallel Applications
Multi-stage dataflow
• Computation interleaved with communication
Computation stage (e.g., Map, Reduce)
• Distributed across many machines
• Tasks run in parallel
Communication stage (e.g., Shuffle)
• Between successive computation stages
• A communication stage cannot complete until all the data have been transferred
Communication is Crucial for Performance
Facebook jobs spend ~25% of runtime on average in intermediate communication. [1]
As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck.
1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Faster Communication
Traditional networking approach: the flow
• Transfers data from a source to a destination
• Independent unit of allocation, sharing, load balancing, and/or prioritization
Existing Solutions
[Timeline, 1980s–2015: GPS, WFQ, CSFQ, RED, ECN, XCP, RCP, DCTCP, D3, D2TCP, PDQ, FCP, DeTail, pFabric; the focus shifts from per-flow fairness to flow completion time.]
Independent flows cannot capture the collective communication behavior common in data-parallel applications.
Why Do They Fall Short?
[Figure: an example shuffle with senders s1, s2, s3 on the input links and receivers r1, r2 on the output links of a datacenter fabric.]
Why Do They Fall Short?
[Per-flow fair sharing on the example shuffle: flows on the links to r1 and r2 finish at times 3 and 5; shuffle completion time = 5; average flow completion time = 3.66.]
Solutions focusing on flow completion time cannot further decrease the shuffle completion time.
Improve Application-Level Performance [1]
Slow down faster flows to accelerate slower flows
[Per-flow fair sharing: shuffle completion time = 5, average flow completion time = 3.66. Data-proportional allocation: shuffle completion time = 4, average flow completion time = 4.]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
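A minimal sketch of the data-proportional idea, assuming a single contended output link: each flow of the shuffle gets bandwidth proportional to its remaining bytes, so all of the transfer's flows on that link finish together instead of small flows racing ahead without improving the shuffle's completion time. The flow names, sizes, and link capacity below are illustrative assumptions, not the values in the figure.

```python
def data_proportional_rates(remaining_bytes, link_capacity):
    """Allocate link bandwidth to each flow in proportion to its remaining
    bytes, so every flow sharing the link finishes at the same time."""
    total = sum(remaining_bytes.values())
    if total == 0:
        return {flow: 0.0 for flow in remaining_bytes}
    return {flow: link_capacity * size / total
            for flow, size in remaining_bytes.items()}

# Illustrative sizes (not from the slide): three shuffle flows share one
# output link of capacity 1 unit/time.
flows = {"s1->r1": 1.0, "s2->r1": 2.0, "s3->r1": 3.0}
rates = data_proportional_rates(flows, link_capacity=1.0)

for flow, rate in rates.items():
    # Every flow finishes at total_bytes / capacity = 6.0 time units,
    # which is the earliest the whole shuffle could finish on this link.
    print(flow, "rate =", rate, "finishes at", flows[flow] / rate)
```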
Coflow
Communication abstraction for data-parallel applications to express their performance goals:
1. Size of each flow;
2. Total number of flows;
3. Endpoints of individual flows;
4. Dependencies between coflows.
[Common coflow structures: broadcast, aggregation, shuffle, all-to-all; ranging from a single flow to many parallel flows.]
How to schedule coflows online …
#1 … for faster completion of coflows?
#2 … to meet more deadlines?
#3 … for fair allocation of the network?
Varys [1], Aalo [2] & HUG [3]
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
1. Efficient Coflow Scheduling with Varys, SIGCOMM'2014.
2. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM'2015.
3. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI'2016.
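To give a flavor of how the non-clairvoyant scheduler (Aalo) prioritizes coflows without knowing their sizes in advance, here is a simplified sketch of its discretized least-attained-service idea: coflows drop to lower-priority queues as the bytes they have already sent cross exponentially spaced thresholds. The thresholds and the FIFO tie-break used here are illustrative assumptions, not Aalo's exact parameters.

```python
# Exponentially spaced queue thresholds in bytes (illustrative values).
QUEUE_THRESHOLDS = [10 * 1024**2 * (10 ** k) for k in range(5)]  # 10MB, 100MB, ...

def queue_of(bytes_sent):
    """Coflows that have already sent more data fall into lower-priority queues."""
    for q, limit in enumerate(QUEUE_THRESHOLDS):
        if bytes_sent < limit:
            return q
    return len(QUEUE_THRESHOLDS)

def priority_order(coflows):
    """coflows: dict coflow_id -> bytes already sent (observed, not predicted).
    Returns coflow ids from highest to lowest priority: lower queue first,
    FIFO by arrival within a queue (arrival order approximated by dict order)."""
    arrival = {c: i for i, c in enumerate(coflows)}
    return sorted(coflows, key=lambda c: (queue_of(coflows[c]), arrival[c]))

observed = {"C1": 5 * 1024**2, "C2": 3 * 1024**3, "C3": 200 * 1024**2}
print(priority_order(observed))  # ['C1', 'C3', 'C2']: least-served coflow first
```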
Benefits of Inter-Coflow Scheduling
Two coflows on a two-link fabric: Coflow 1 has a 3-unit flow on link 1; Coflow 2 has a 2-unit flow on link 1 and a 6-unit flow on link 2.
Fair Sharing: Coflow 1 comp. time = 5; Coflow 2 comp. time = 6
Smallest-Flow First [1,2]: Coflow 1 comp. time = 5; Coflow 2 comp. time = 6
The Optimal: Coflow 1 comp. time = 3; Coflow 2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
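The arithmetic behind these numbers, under my reading of the figure (all links at 1 unit/time; Coflow 2's 6-unit flow on link 2 pins its completion at 6 regardless of what happens on link 1), can be reproduced with a small fluid-model sketch:

```python
# Flows on the contended link 1, at capacity 1 unit/time.
LINK1 = [("C1", 3.0), ("C2", 2.0)]
C2_LINK2_FINISH = 6.0   # Coflow 2's 6-unit flow on link 2 finishes at t=6.

def serial(order):
    """Finish times when link 1 serves flows one at a time in `order`."""
    t, finish = 0.0, {}
    for coflow, size in order:
        t += size
        finish[coflow] = t
    return finish

def fair_sharing(flows):
    """Fluid per-flow fair sharing on one link: equal rates, recomputed
    whenever a flow finishes."""
    remaining, finish, t = dict(flows), {}, 0.0
    while remaining:
        rate = 1.0 / len(remaining)
        nxt = min(remaining, key=remaining.get)
        dt = remaining[nxt] / rate
        t += dt
        for c in list(remaining):
            remaining[c] -= rate * dt
            if remaining[c] <= 1e-9:
                finish[c] = t
                del remaining[c]
    return finish

def report(name, link1_finish):
    cct1 = link1_finish["C1"]
    cct2 = max(link1_finish["C2"], C2_LINK2_FINISH)
    print(f"{name:22s} CCT(Coflow1) = {cct1:.0f}, CCT(Coflow2) = {cct2:.0f}")

report("Per-flow fair sharing", fair_sharing(LINK1))              # 5 and 6
report("Smallest-flow first", serial(sorted(LINK1, key=lambda f: f[1])))  # 5 and 6
report("Coflow-aware (C1 1st)", serial([("C1", 3.0), ("C2", 2.0)]))       # 3 and 6
```

Scheduling Coflow 1's flow first cuts its completion time from 5 to 3 without delaying Coflow 2 at all.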
Inter-Coflow Scheduling is NP-Hard
Concurrent open shop scheduling with coupled resources
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic (a sketch follows below)
• Must consider matching constraints between the input and output links of the datacenter fabric
[Figure: the same two-coflow example, with the 2-, 3-, and 6-unit flows mapped onto input and output links of the fabric.]
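As an example of such an ordering heuristic, here is a simplified sketch in the spirit of Varys' Smallest-Effective-Bottleneck-First (SEBF): order coflows by the completion time of their most heavily loaded port, then allocate bandwidth in that order. The demand format and uniform port capacity are assumptions for illustration; the real scheduler also handles rate enforcement, work conservation, and starvation avoidance.

```python
from collections import defaultdict

def bottleneck_time(coflow_demand, capacity=1.0):
    """coflow_demand: list of (src_port, dst_port, bytes).
    A coflow can finish no earlier than its most loaded port allows."""
    load = defaultdict(float)
    for src, dst, size in coflow_demand:
        load[("in", src)] += size
        load[("out", dst)] += size
    return max(load.values()) / capacity

def sebf_order(coflows):
    """coflows: dict coflow_id -> list of (src, dst, bytes).
    Smallest effective bottleneck first."""
    return sorted(coflows, key=lambda c: bottleneck_time(coflows[c]))

demands = {
    "shuffle-A": [(1, 1, 3.0)],                # bottleneck = 3 time units
    "shuffle-B": [(2, 1, 2.0), (3, 2, 6.0)],   # bottleneck = 6 time units
}
print(sebf_order(demands))  # ['shuffle-A', 'shuffle-B']
```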
Many Problems to Solve
System   Clairvoyant   Objective   Optimal
Varys    Yes           Min CCT     No
Aalo     No            Min CCT     No
HUG      No            Fair CCT    Yes
Coflow-Based Architecture
Centralized master-slave architecture
• Local daemons run on each machine; applications use a client library to communicate with the master (illustrative sketch below)
• Actual timing and rates are determined by the coflow scheduler
[Figure: computation tasks talk to local daemons through a network interface; the daemons coordinate with the coflow scheduler running at the master/coordinator.]
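A hypothetical view of how an application might use such a client library; the names (register/put/get/unregister) are loosely modeled on the Varys-style coflow API, but the signatures here are illustrative assumptions, not the exact published interface.

```python
class CoflowClient:
    """Thin stub standing in for the local daemon + master coordination."""

    def register(self, num_flows):
        """Tell the master a new coflow with num_flows flows is starting;
        would return a coflow id used to tag every flow of this transfer."""
        ...

    def put(self, coflow_id, data_id, payload):
        """Hand off a flow's data; actual timing and rate are decided by
        the coflow scheduler at the master."""
        ...

    def get(self, coflow_id, data_id):
        """Fetch a flow's data once the scheduler releases it."""
        ...

    def unregister(self, coflow_id):
        """Mark the coflow finished so the scheduler can move on."""
        ...

# A shuffle would use it roughly like this (illustrative only):
client = CoflowClient()
cid = client.register(num_flows=6)           # one shuffle = one coflow
client.put(cid, data_id="map0-part1", payload=b"...")
block = client.get(cid, data_id="map0-part1")
client.unregister(cid)
```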
Coflow API
Change the applications:
• At the very least, we need to know what a coflow is
• For clairvoyant versions, we need more information
• Changing the framework can enable ALL jobs to take advantage of coflows
DO NOT change the applications: [1]
• Infer coflows from network traffic patterns
• Design robust coflow schedulers that can tolerate misestimations
• Our current solution only works for coflows without dependencies; we need DAG support!
1. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM'2016.
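For the "do not change the applications" path, coflows have to be inferred from what the network observes. The sketch below is only a toy stand-in for that idea, grouping flows whose start times fall close together; CODA's actual identifier uses a richer distance metric over flow attributes and incremental clustering, and the scheduler must tolerate the resulting misclassifications.

```python
def group_flows_by_start_time(flows, window=0.1):
    """flows: list of (flow_id, start_time_seconds), unordered.
    Toy heuristic: flows that start within `window` seconds of the previous
    flow (in start-time order) are assumed to belong to the same coflow."""
    ordered = sorted(flows, key=lambda f: f[1])
    clusters, current, last_start = [], [], None
    for flow_id, start in ordered:
        if last_start is not None and start - last_start > window:
            clusters.append(current)
            current = []
        current.append(flow_id)
        last_start = start
    if current:
        clusters.append(current)
    return clusters

captured = [("f1", 0.00), ("f2", 0.03), ("f3", 0.05), ("f4", 1.20), ("f5", 1.22)]
print(group_flows_by_start_time(captured))  # [['f1', 'f2', 'f3'], ['f4', 'f5']]
```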
Performance Benefits of Using Coflows
[Bar chart, lower is better: average completion-time overhead normalized to Varys. Varys = 1.00; Aalo (FIFO-LM) = 1.10; per-flow fairness, per-flow prioritization [2,3], FIFO, and priority-based baselines [1,4] show overheads of 3.21, 5.53, 5.65, and 22.07.]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM'2014.
The Need for Coordination
Coordination is necessary to determine, in realtime:
• Coflow size (sum);
• Coflow rates (max);
• Partial order of coflows (ordering).
Can be a large source of overhead
• Matters little for large coflows in slow networks, but …
[Plot: average coordination time vs. number of emulated Aalo slaves: 8 ms at 100, 17 ms at 1,000, 115 ms at 10,000, 495 ms at 50,000, and 992 ms at 100,000 slaves.]
How to perform decentralized coflow scheduling?
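To make the coordination step concrete, here is a minimal sketch, assuming a simple report format: each local daemon periodically reports how many bytes every coflow has sent through it; the master sums these into per-coflow sizes, derives an ordering (here a least-attained-service-like rule, by bytes sent), and would then push priorities or rates back to the daemons. The real systems batch and pipeline this exchange to hide the latencies shown in the plot.

```python
from collections import defaultdict

def aggregate_reports(slave_reports):
    """slave_reports: list of dicts, one per local daemon, mapping
    coflow_id -> bytes observed locally since the last sync."""
    totals = defaultdict(int)
    for report in slave_reports:
        for coflow_id, sent in report.items():
            totals[coflow_id] += sent        # coflow "size so far" = sum
    return dict(totals)

def schedule(totals):
    """Return the partial order the master would push to every daemon:
    coflows that have sent the least so far get the highest priority."""
    return sorted(totals, key=totals.get)

reports = [
    {"C1": 10_000, "C2": 500_000},           # daemon on machine 1
    {"C1": 20_000, "C3": 1_000},             # daemon on machine 2
]
totals = aggregate_reports(reports)
print(totals)            # {'C1': 30000, 'C2': 500000, 'C3': 1000}
print(schedule(totals))  # ['C3', 'C1', 'C2']
```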
Coflow-Aware Load Balancing
Especially useful in asymmetric topologies
• For example, in the presence of switch or link failures
Provides an additional degree of freedom
• During path selection
• For dynamically determining load-balancing granularity
Further increases the need for coordination, at an even higher cost
Coflow-Aware Routing
Relevant in topologies w/o full bisection bandwidth
• When topologies have temporary in-network oversubscription
• In geo-distributed analytics
Scheduling-only solutions do not work well
• Calls for joint routing-and-scheduling solutions (see the path-selection sketch below)
• Must take network utilization into account
• Must avoid frequent path changes
Further increases the need for coordination
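One way to picture the routing side of a joint solution: when placing a coflow's flow, pick, among the candidate paths, the one whose most utilized link stays lowest after adding the flow's expected load, and keep that choice sticky to avoid frequent path changes. This is a generic least-congested-path sketch under assumed inputs, not a specific published algorithm.

```python
def pick_path(candidate_paths, link_util, flow_load):
    """candidate_paths: list of paths, each a list of link ids.
    link_util: dict link_id -> current utilization in [0, 1].
    flow_load: expected extra utilization this flow adds to each link it uses.
    Choose the path that minimizes the resulting bottleneck utilization."""
    def bottleneck_after(path):
        return max(link_util.get(link, 0.0) + flow_load for link in path)
    return min(candidate_paths, key=bottleneck_after)

paths = [["a", "core1", "b"], ["a", "core2", "b"]]
util = {"a": 0.2, "core1": 0.9, "core2": 0.4, "b": 0.1}
print(pick_path(paths, util, flow_load=0.1))  # ['a', 'core2', 'b']
```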
Coflows in Circuit-Switched Networks
Circuit switching is relevant again due to the rise of optical networks
• Provides very high bandwidth
• Expensive to set up new circuits
Co-scheduling applications and coflows
• Schedule tasks so that we can reuse already-established circuits
• Perform in-network aggregation using existing circuits instead of waiting for new circuits to be created
Extension to Multiple Resources [1]
A DAG of coflows is very similar to a job's DAG of stages
• The same principles apply, but with new challenges
Consider both fungible (bandwidth) and non-fungible (cores) resources
• Across the entire DAG
1. Altruistic Scheduling in Multi-Resource Clusters, OSDI'2016.
Coflow
Communication abstraction for data-parallel applications to express their performance goals.
Key open challenges
1. Better theoretical understanding
2. Efficient solutions to deal with decentralization, topologies, multi-resource settings, estimations over DAGs, circuit switching, etc.
More information
1. Papers: http://www.mosharaf.com/publications/
2. Software/simulator/workloads: https://github.com/coflow