Argo: Architecture-Aware Graph Partitioning
Angen Zheng, Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange
Department of Computer Science, University of Pittsburgh
http://db.cs.pitt.edu/group/
http://www.prognosticlab.org/
Big Graphs Are Everywhere [SIGMOD'16 Tutorial]
A Balanced Partitioning = Even Load Distribution
Minimal Edge-Cut = Minimal Data Comm
[Figure: a graph split across three machines N1, N2, N3]
Assumption: the network is the bottleneck.
The End of Slow Networks: Network is now as fast as DRAM [C. Binnig, VLDB'15]
✓ Dual-socket Xeon E5v2 server with DDR3-1600
✓ InfiniBand: 1.7GB/s~37.5GB/s (2 FDR 4x NICs per socket)
✓ DDR3: 6.25GB/s~16.6GB/s
The End of Slow Networks: Does edge-cut still matter?
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
The End of Slow Networks: Does edge-cut still matter?
Graph Partitioners: METIS and LDG
Graph Workloads: BFS, SSSP, and PageRank
Graph Dataset: Orkut (|V|=3M, |E|=234M)
Number of Partitions: 16 (one partition per core)
The End of Slow Networks: Does edge-cut still matter?
SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280
(m: # of machines used; s: # of sockets used per machine; c: # of cores used per socket; up to 9x gap across configurations)
✓ Denser configurations had longer execution time.
  ○ Contention on the memory subsystems impacted performance.
  ○ The network may not always be the bottleneck.
The End of Slow Networks: Does edge-cut still matter?
SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280
SSSP LLC Misses (in Millions)
m:s:c   METIS    LDG
1:2:8   10,292   44,117
2:2:4   10,626   44,689
4:2:2   2,541    1,061
8:2:1   96       187
(up to 9x gap in execution time and 235x in LLC misses across configurations)
✓ Denser configurations had longer execution time.
  ○ Contention on the memory subsystems impacted performance.
  ○ The network may not always be the bottleneck.
  ○ The distribution of edge-cut matters.
✓ METIS had lower execution time and fewer LLC misses than LDG.
  ○ Edge-cut matters: higher edge-cut --> higher comm --> higher contention.
The End of Slow Networks: Does edge-cut still matter?
Yes! Both edge-cut and its distribution matter!
✓ Intra-node and inter-node data communication have different performance impacts on the memory subsystems of modern multicore machines.
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
Intra-Node Data Comm: Shared Memory
[Figure: the sending core loads the message from its send buffer and writes it into the shared buffer; the receiving core then loads it from the shared buffer and writes it into its receive buffer]
→ Extra memory copy through the shared buffer.
Intra-Node Data Comm: Shared Memory
The send, shared, and receive buffers all get cached along the way, causing:
✓ Cache pollution
✓ LLC and memory bandwidth contention
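The extra copy in shared-memory message passing can be sketched in a few lines; the function and buffer names here are illustrative, not Argo's implementation. The payload crosses memory twice, once into the shared buffer and once out of it, touching the cache hierarchy each time.

```python
# Minimal sketch (hypothetical names) of shared-memory message passing
# between two cores on the same node.

def send(payload: bytes, shared_buf: bytearray) -> int:
    """Sender: load the payload and write it into the shared buffer."""
    shared_buf[:len(payload)] = payload   # copy #1: send buffer -> shared buffer
    return len(payload)

def receive(shared_buf: bytearray, n: int) -> bytes:
    """Receiver: load from the shared buffer into a private receive buffer."""
    return bytes(shared_buf[:n])          # copy #2: shared buffer -> receive buffer

shared = bytearray(64)
n = send(b"edge-cut update", shared)
assert receive(shared, n) == b"edge-cut update"
```

Both copies pull the payload through the caches, which is the source of the cache pollution and bandwidth contention described above.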
Excess intra-node data communication may hurt performance.
Inter-Node Data Comm: RDMA Read/Write
[Figure: the sending core on Node #1 writes to its send buffer, and the InfiniBand HCAs transfer the data directly into the receive buffer on Node #2]
✓ No extra memory copy and no cache pollution.
Offloading excess intra-node data comm across nodes may achieve better performance.
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
Argo: Graph Partitioning Model
[Figure: vertices arrive as a stream, and the partitioner assigns each one to a partition on the fly]
Streaming Graph Partitioning Model [I. Stanton, KDD'12]
Argo: Architecture-Aware Vertex Placement
Place each vertex, v, in the partition, Pi, that maximizes the weighted edge-cut score, penalizing the placement by the load of Pi.
✓ Weighted by the relative network comm cost, Argo will avoid edge-cut across nodes (inter-node data comm).
  ○ Great for cases where the network is the bottleneck.
Argo: Architecture-Aware Vertex Placement
Degree of Contention (𝞵 ∈ [0, 1]):
✓ 𝞵 = 0 (network is the bottleneck): use the original network comm costs, where inter-node comm is maximally expensive.
✓ 𝞵 = 1 (memory is the bottleneck): use the refined network comm costs, which also penalize intra-node comm.
✓ Weighted by the refined relative network comm cost, Argo will also avoid edge-cut across cores of the same node (intra-node data comm).
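A rough sketch of this placement rule follows; the affinity values, the LDG-style load penalty, and all names are assumptions based on the slides' description, not the authors' implementation. With 𝞵 = 0 a neighbor on the same node is as good as one in the same partition (only inter-node cuts are avoided); with 𝞵 = 1 same-node neighbors in other partitions count for nothing (intra-node cuts are avoided too).

```python
def affinity(pi, pu, cores_per_node, mu):
    """Hypothetical relative gain of an edge whose endpoints end up in
    partitions pi and pu: 1 if same partition, 1 - mu if different
    partitions on the same node, 0 if different nodes."""
    if pi == pu:
        return 1.0
    if pi // cores_per_node == pu // cores_per_node:
        return 1.0 - mu
    return 0.0

def place(v, neighbors, assignment, loads, capacity, cores_per_node, mu):
    """Greedy streaming placement: maximize architecture-weighted
    neighbor affinity, penalized by the partition's current load."""
    best, best_score = 0, -1.0
    for pi in range(len(loads)):
        gain = sum(affinity(pi, assignment[u], cores_per_node, mu)
                   for u in neighbors if u in assignment)
        score = gain * (1.0 - loads[pi] / capacity)  # LDG-style balance penalty
        if score > best_score:
            best, best_score = pi, score
    assignment[v] = best
    loads[best] += 1
    return best
```

For example, with four partitions mapped two per node and equal loads, a new vertex whose only placed neighbor sits in partition 2 lands in partition 2; if partition 2 is full and 𝞵 = 0, its same-node sibling partition becomes the next best choice.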
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
Evaluation: Workloads & Datasets
Three Classic Graph Workloads
o Breadth First Search (BFS)
o Single Source Shortest Path (SSSP)
o PageRank
Three Real-World Large Graphs
Dataset      |V|     |E|
Orkut        3M      234M
Friendster   124M    3.6B
Twitter      52M     3.9B
Evaluation: Platform
Cluster Configuration
# of Nodes: 32
Network Topology: FDR InfiniBand (Single Switch)
Network Bandwidth: 56Gbps
Compute Node Configuration
# of Sockets: 2 (Intel Haswell, 10 cores/socket)
L3 Cache: 25MB
Evaluation: Partitioners
o METIS: the most well-known multi-level partitioner.
o LDG: the most well-known streaming partitioner.
o ARGO-H: assumes the network is the bottleneck; weights edge-cut by the original network comm costs.
o ARGO: assumes memory is the bottleneck; weights edge-cut by the refined network comm costs.
Evaluation: SSSP Exec. Time on Orkut Dataset
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Chart: SSSP execution time vs. message grouping size; relative slowdowns of 1x~5x across partitioners]
(Message grouping: multiple msgs sent by a single SSSP process to the same destination are grouped into one msg.)
✓ ARGO had the lowest SSSP execution time.
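Message grouping as described above can be sketched as follows (class, parameter, and transport names are hypothetical, not Argo's API): messages headed to the same destination are buffered and flushed as one combined message once the group reaches the configured size.

```python
from collections import defaultdict

class GroupingSender:
    """Buffers messages per destination and sends them in groups,
    trading a little latency for fewer, larger messages."""

    def __init__(self, group_size, transport):
        self.group_size = group_size
        self.transport = transport            # callable(dest, list_of_msgs)
        self.buffers = defaultdict(list)

    def send(self, dest, msg):
        self.buffers[dest].append(msg)
        if len(self.buffers[dest]) >= self.group_size:
            self.flush(dest)

    def flush(self, dest):
        if self.buffers[dest]:
            self.transport(dest, self.buffers[dest])
            self.buffers[dest] = []

    def flush_all(self):
        """Drain partially filled groups, e.g. at the end of a superstep."""
        for dest in list(self.buffers):
            self.flush(dest)
```

With a grouping size of 2, five messages to one destination go out as three transfers instead of five.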
Evaluation: SSSP LLC Misses on Orkut Dataset
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Chart: SSSP LLC misses vs. message grouping size; relative gaps of 1x~50x across partitioners]
✓ ARGO had the lowest LLC misses.
Evaluation: SSSP Comm Vol. on Orkut Dataset
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Chart: communication volume breakdown; intra-socket share: METIS 69%, LDG 49%, ARGO-H 70%]
✓ ARGO had the lowest intra-node communication volume.
✓ The distribution of the edge-cut also matters.
Evaluation: SSSP Exec. Time vs Graph Size
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80 Partitions: four 20-core machines
★ Message Grouping Size: 512
✓ ARGO had the lowest SSSP execution time.
✓ Up to 6x improvement against ARGO-H.
✓ The improvement grew as the graph size increased.
Evaluation: SSSP Exec. Time vs # of Partitions
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
★ Message Grouping Size: 512
✓ ARGO always outperformed LDG and ARGO-H.
✓ Up to 11x improvement against ARGO-H.
Evaluation: SSSP Exec. Time vs # of Partitions
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
★ Message Grouping Size: 512
✓ Hours of CPU time saved (e.g., 13h at 160 partitions, 6h at 180).
Evaluation: Partitioning Overhead
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
[Charts: partitioning time as a percentage of the CPU time saved (SSSP execution), vs. # of partitions]
✓ ARGO is indeed slower than LDG.
✓ The overhead was negligible in comparison to the CPU time saved, since graph analytics usually have much longer execution times.
Conclusions
Findings
o The network is not always the bottleneck.
o Contention on the memory subsystems may significantly hurt performance due to excess intra-node data comm.
o Both edge-cut and its distribution matter.
ARGO
o Avoids contention by offloading excess intra-node data comm across nodes.
o Achieves up to 11x improvement on real-world workloads.
o Scales well in terms of both graph size and number of partitions.
Thanks!
Acknowledgments: Peyman Givi, Patrick Pisciuneri
Funding: NSF CBET-1609120, NSF CBET-1250171, BigData'16 Student Travel Award