Optimizing Shuffle in Wide-Area Data Analytics
Shuhao Liu*, Hao Wang, Baochun Li
Department of Electrical & Computer Engineering, University of Toronto
What is:
- Wide-Area Data Analytics?
- Shuffle?
Wide-Area Data Analytics
Large volumes of data are generated, stored and processed across geographically distributed DCs.
[Figure: world map with datacenters in Ireland, Oregon, N. Virginia, N. California and Singapore, each holding part of the input (data 1, data 2, data 3).]
Existing work focuses on Task Placement
Rethink the root cause of inter-DC traffic: Shuffle
Fetch-based Shuffle
All-to-all communication at the beginning of reduce tasks.
[Figure: Mappers 1-3 connected all-to-all to Reducers 1-3.]
Problems with Fetch
‣ Under-utilizes the inter-datacenter bandwidth
  ‣ Starts late: at the beginning of reduce
  ‣ Starts concurrently: flows share bandwidth
‣ Need for refetch
  ‣ Possible reduce task failure
Push-based Shuffle
‣ Bandwidth Utilization
[Figure: timelines for workers A and B. (a) Fetch-based: shuffle write happens at the end of Stage N and shuffle read at the start of Stage N+1, leaving inter-DC links idle during map. (b) Push-based: data transfer overlaps with the map tasks, so Stage N+1 starts and finishes earlier.]
Push-based Shuffle
‣ Failure Recovery
[Figure: timelines for workers A and B when a reduce task fails and its shuffle input must be refetched. (a) Fetch-based shuffle. (b) Push-based shuffle, where recovery completes earlier because the original data transfer overlapped with the map tasks.]
Where to Push?
‣ Optional: use existing task placement algorithms
  ‣ Know the reducer placement beforehand
  ‣ Require prior knowledge
    ‣ e.g., predictable jobs, inter-DC available bandwidth
‣ Our solution: Push/Aggregate
Aggregating Shuffle Input
‣ Send shuffle input to a subset of datacenters holding a large portion of the shuffle input
  ‣ Reduces inter-datacenter traffic in future shuffles
  ‣ Likely to reduce inter-datacenter traffic in the current shuffle
[Figure: Datacenter A is the receiver of the shuffle input; Reducers 1-3 run in Datacenter A and Reducers 4-5 in Datacenter B, with inter-DC transfers between the two.]
For any partition of shuffle input, the expected inter-datacenter traffic in the next shuffle is proportional to the number of non-colocated reducers.
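One way to make this claim concrete (the notation below is mine, not from the slides): suppose a block of shuffle input of size S stored in one datacenter is partitioned roughly evenly across R reducers, of which k are placed in other datacenters. Then the expected traffic leaving that datacenter is

```latex
% Sketch under stated assumptions: a block of shuffle input of size S is
% partitioned roughly evenly among R reducers, k of which are not colocated
% with the block. The expected traffic leaving the block's datacenter is:
\mathbb{E}[\mathrm{traffic}]
  = \sum_{r\ \text{not colocated}} \frac{S}{R}
  = \frac{k}{R}\, S \;\propto\; k
```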
Aggregating Shuffle Input
‣ Send shuffle input to a subset of datacenters with a large portion of shuffle input
  ‣ Reducers are likely to be placed close to the shuffle input
  ‣ More aggregated data -> less inter-datacenter traffic with reasonable task placement
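As a rough illustration of how "a subset of datacenters with a large portion of shuffle input" might be selected, here is a minimal sketch. The greedy rule, the coverage threshold, and the function and datacenter names are assumptions for illustration; the slides do not specify the selection algorithm.

```scala
// Hypothetical helper: pick aggregator datacenters that already hold most of
// the shuffle input, so the bulk of the data never has to cross datacenters.
def chooseAggregators(bytesPerDC: Map[String, Long],
                      coverage: Double = 0.8): Set[String] = {
  val total = bytesPerDC.values.sum.toDouble
  // Consider datacenters from largest to smallest local share.
  val sortedDesc = bytesPerDC.toSeq.sortBy { case (_, bytes) => -bytes }
  // Running totals after including each datacenter.
  val cumulative =
    sortedDesc.scanLeft(0L) { case (acc, (_, bytes)) => acc + bytes }.tail
  // Keep adding datacenters until the chosen set covers `coverage` of the input.
  val needed = cumulative.indexWhere(_ >= coverage * total) + 1
  sortedDesc.take(needed).map { case (dc, _) => dc }.toSet
}

// Example: N. Virginia and Frankfurt together hold 90% of the input, so they
// become the aggregators and Singapore's data is pushed to them.
// chooseAggregators(Map("virginia" -> 70L, "frankfurt" -> 20L, "singapore" -> 10L))
// => Set("virginia", "frankfurt")
```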
Implementation in Spark
‣ Requirements:
  ‣ Push before writing to disk
  ‣ Destined to the aggregator datacenters
‣ transferTo() as an RDD transformation
  ‣ Allow implicit or explicit usage
Implementation in Spark
(a) Without transferTo:  InputRDD.map(…).reduce(…)…
(b) With transferTo:     InputRDD.map(…).transferTo([A]).reduce(…)…
[Figure: execution DAGs over workers A1, A2 (datacenter A) and B1 (datacenter B). In (a), map outputs from all three workers feed the reduce tasks directly. In (b), transferTo pushes B1's map output into datacenter A, so the reduce tasks read shuffle input only from workers inside A.]
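A sketch of the explicit usage shown above, written out as a small driver program. transferTo() is the transformation added by this work; its exact signature (here a Seq[String] of datacenter names) and the paths and datacenter names are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExplicitTransferExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ScalaWordCount"))

    val counts = sc.textFile("hdfs:///input")   // shuffle input spread over datacenters
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .transferTo(Seq("datacenter-A"))          // push map output to the aggregator DC
      .reduceByKey(_ + _)                       // reduce now runs mostly inside datacenter A

    counts.saveAsTextFile("hdfs:///output")
    sc.stop()
  }
}
```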
Implementation in Spark
[Screenshot: Spark 1.6.1 web UI, details for Job 0 of ScalaWordCount. (a) Original DAG: a stage with sequenceFile, map, flatMap and map, followed by a reduceByKey stage. (b) The same job with the embedded transferTo transformation inserted before reduceByKey.]
Implementation in Spark
‣ transferTo() implicit insertion, processed by the DAGScheduler
Origin code:
  val InRDD = In1 + In2
  InRDD.filter(…).groupByKey(…).collect()
Produced code:
  val InRDD = In1 + In2
  InRDD.filter(…).transferTo(…).groupByKey(…).collect()
[Figure: execution DAGs over DC1 and DC2. In the produced plan, transferTo is inserted after filter on each branch, so the shuffle input is pushed before groupByKey and collect.]
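To make the implicit path concrete, here is a rough sketch contrasting what the user writes with what the DAGScheduler effectively produces, following the slide's example. Reading "In1 + In2" as a union of two RDDs, and the names isInteresting and aggregatorDCs, are my own placeholders; only transferTo and the overall rewrite come from the slide.

```scala
// What the user writes (origin code); "In1 + In2" is read here as a union:
val inRDD = in1.union(in2)
val result = inRDD
  .filter(record => isInteresting(record))   // placeholder predicate
  .groupByKey()
  .collect()

// What effectively runs after the DAGScheduler's rewrite (produced code):
// transferTo is inserted in front of the shuffle dependency automatically,
// so the user program itself does not change.
val rewritten = inRDD
  .filter(record => isInteresting(record))
  .transferTo(aggregatorDCs)                 // aggregator datacenters chosen by the system
  .groupByKey()
  .collect()
```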
Evaluation
‣ Amazon EC2, m3.large instances
‣ 26 nodes in 6 different locations: N. California (6), N. Virginia (4), Frankfurt (4), Singapore (4), São Paulo (4), Sydney (4)
Performance
[Figure: performance comparison; the lower, the better.]
Take-Away Messages
‣ A push-based shuffle mechanism is beneficial in wide-area data analytics
‣ Aggregating shuffle input to a subset of datacenters is likely to help when you have no prior knowledge
‣ Implemented in Apache Spark as a data transformation
‣ Performance: reduced shuffle time and its variance
Thanks! Q&A