
Green-Marl: A DSL for Easy and Efficient Graph Analysis

Sungpack Hong*, Hassan Chafi*+, Eric Sedlar+, and Kunle Olukotun*
*Pervasive Parallelism Lab, Stanford University; +Oracle Labs


  1. Green-Marl: A DSL for Easy and Efficient Graph Analysis
     Sungpack Hong*, Hassan Chafi*+, Eric Sedlar+, and Kunle Olukotun*
     *Pervasive Parallelism Lab, Stanford University; +Oracle Labs

  2. Graph Analysis
     - Classic graphs; new applications
       - Artificial intelligence, computational biology, ...
       - SNS apps: LinkedIn, Facebook, ...
     - Graph analysis: a process of drawing out further information from the given graph data-set
     - Example: movie database
       - "What would be the avg. hop-distance between any two (Australian) actors?"
       - "Is he a central figure in the movie network? How much?"
       - "Do these actors work together more frequently than others?"
     [Figure: movie network with Sam Worthington, James Cameron, Linda Hamilton, Kevin Bacon, Sigourney Weaver, Jack Black, Ben Stiller, Owen Wilson]

  3. More formally,
     - Graph data-set
       - Graph G = (V, E): relationship (E) between data entities (V)
       - Property P: any extra data associated with each vertex or edge of graph G
       - Your data-set = (G, Π) = (G, P1, P2, ...)
     - Graph analysis on (G, Π)
       - Compute a scalar value, e.g. avg-distance, conductance, eigen-value, ...
       - Compute a (new) property, e.g. (max) flow, betweenness centrality, page-rank, ...
       - Identify a specific subset of G, e.g. minimum spanning tree, connected component, community structure detection, ...
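To make the (G, Π) formulation concrete, here is a minimal Python sketch, not taken from the slides. It assumes an adjacency-list dictionary for G and one dictionary per property; the `country` property and the four-node graph are hypothetical. It computes a scalar value, the average hop-distance between selected vertices, in the spirit of the movie-database question on slide 2.

```python
from collections import deque

# Graph G = (V, E) as an adjacency list (undirected path: 0 - 1 - 2 - 3).
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# Property P1: one value per vertex (hypothetical "country" labels).
country = {0: "AU", 1: "US", 2: "AU", 3: "AU"}


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def avg_hop_distance(graph, selected):
    """Average hop-distance over ordered pairs of distinct, connected selected vertices."""
    total, pairs = 0, 0
    for s in selected:
        dist = hop_distances(graph, s)
        for t in selected:
            if t != s and t in dist:
                total += dist[t]
                pairs += 1
    return total / pairs if pairs else 0.0


# "Avg. hop-distance between any two (Australian) actors", on the toy data.
australian = [v for v in graph if country[v] == "AU"]
result = avg_hop_distance(graph, australian)
```

The analysis reads the graph and one property and returns a single number, which corresponds to the "compute a scalar value" category above.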

  4. The Performance Issue
     - Traditional single-core machines showed limited performance for graph analysis problems
       - A lot of random memory accesses + data does not fit in cache
       - Performance is bound to memory latency
       - Conventional hardware (e.g. floating point units) does not help much
     - Use parallelism to accelerate graph analysis
       - Plenty of data-parallelism in large graph instances
       - Performance now depends on memory bandwidth, not latency
       - Exploit modern parallel computers: multi-core CPU, GPU, Cray XMT, cluster, ...

  5. New Issue: Implementation Overhead
     - It is challenging to implement a graph algorithm
       - correctly
       - + and efficiently
       - + while applying parallelism
       - + differently for each execution environment

  6. Our approach: DSL
     - We design a domain specific language (DSL) for graph analysis
     - The user writes his/her algorithm concisely with our DSL
     - The compiler translates it into the target language (e.g. parallel C++ or CUDA)
     [Figure: an intuitive description of a graph algorithm (DSL constructs such as Foreach, BFS, Edgeset; e.g. "Foreach (t: G.Nodes) t.sigma += ...") goes through the DSL compiler and becomes an efficient (parallel) implementation of the given algorithm (e.g. "for (i = 0; i < G.numNodes(); i++) { __fetch_and_add(G.nodes[i], ...) }"). The compiler relies on (1) inherent data-parallelism, (2) good implementation templates, and (3) high-level optimization.]

  7. Example: Betweenness Centrality
     - Betweenness Centrality (BC)
       - A measure that tells how 'central' a node is in the graph
       - Used in social network analysis
     - Definition
       - How many shortest paths between any two nodes go through this node
     [Figure: example graph with low-BC and high-BC nodes, labeled "Kevin Bacon" and "Ayush K. Kehdekar". Image source: Wikipedia]
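The informal definition above is usually written as follows. This is the standard formulation associated with [Brandes 2001] rather than a formula shown on the slide: \sigma_{st} denotes the number of shortest paths from s to t, and \sigma_{st}(v) the number of those paths that pass through v.

```latex
\[
  BC(v) \;=\; \sum_{\substack{s, t \in V \\ s \neq v \neq t}}
              \frac{\sigma_{st}(v)}{\sigma_{st}}
\]
```

Each pair (s, t) contributes the fraction of its shortest paths that go through v, so a pair with several equally short routes counts only partially toward any single intermediate node.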

  8. Example: Betweenness Centrality [Brandes 2001]
     - Init BC for every node and begin outer-loop (s)
     - In BFS order from s: compute sigma from parents
     - In reverse BFS order: compute delta from children
     - Accumulate delta into BC
     - Looks complex: queues, lists, stack, ...
     - Is this parallelizable?
     [Figure: BFS tree rooted at s with nodes v and w, traversed in BFS order and in reverse BFS order]

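  9. Example: Betweenness Centrality [Brandes 2001]
     - In BFS order from s: compute sigma from parents
     - In reverse BFS order: compute delta from children
     [Figure: BFS tree rooted at s with nodes v and w, traversed in BFS order and in reverse BFS order]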

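  10. Example: Betweenness Centrality [Brandes 2001]
     - In BFS order from s: compute sigma from parents
     - In reverse BFS order: compute delta from children
     - Annotations on the algorithm: parallel iteration, parallel assignment, parallel BFS, reduction
     [Figure: BFS tree rooted at s with nodes v and w, traversed in BFS order and in reverse BFS order]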
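The steps on slides 8 to 10 can be sketched sequentially in Python. This is an illustrative rendering of the Brandes-style procedure, not the Green-Marl program from the talk and not the SNAP code; it assumes an unweighted graph stored as a dictionary of out-neighbor lists. Where the slides say "compute delta from children", this sketch equivalently pushes each node's contribution to its BFS parents.

```python
from collections import deque


def betweenness_centrality(graph):
    """graph: dict mapping every node to a list of its out-neighbors."""
    bc = {v: 0.0 for v in graph}              # init BC for every node
    for s in graph:                           # outer loop over sources s
        sigma = {v: 0 for v in graph}         # number of shortest s-v paths
        dist = {v: -1 for v in graph}
        parents = {v: [] for v in graph}
        sigma[s], dist[s] = 1, 0
        order = []                            # nodes in BFS order
        queue = deque([s])
        while queue:                          # forward pass, BFS order
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:               # first visit of w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:    # v is a BFS parent of w
                    sigma[w] += sigma[v]      # sigma from parents
                    parents[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):             # backward pass, reverse BFS order
            for v in parents[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]             # accumulate delta into BC
    return bc


# Undirected path 0 - 1 - 2: only the middle node lies between other nodes.
path = {0: [1], 1: [0, 2], 2: [1]}
path_bc = betweenness_centrality(path)
```

Because the outer loop visits every source, an undirected pair is counted once per direction, so the middle node of the path gets a score of 2.0. The per-level loops and the `+=` updates are the places the slide labels as parallel iteration and reduction; this sketch runs them sequentially.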

  11. DSL Approach: Benefits
     - Three benefits
       - Productivity
       - Portability
       - Performance

  12. Productivity Benefits
     - A common limiting resource in software development: your brain power (i.e. how long can you [...]?)
     - A C++ implementation of BC from SNAP (a parallel graph library from GT): ≈ 400 lines of code (with OpenMP)
     - vs. Green-Marl* LOC: 24
     *Green-Marl (그린 말) means "Depicted Language" in Korean

  13. Productivity Benefits

      Algorithm     Original LOC  Green-Marl LOC  Source      Original implementation
      BC            ~400          24              SNAP        C++, OpenMP
      Vertex Cover  71            21              SNAP        C++, OpenMP
      Conductance   42            10              SNAP        C++, OpenMP
      Page Rank     75            15              http:// ..  C++, single thread
      SCC           65            15              http:// ..  Java, single thread

     - It is more than LOC
       - Focusing on the algorithm, not its implementation
       - More intuitive, less error-prone
       - Rapidly explore many different algorithms

  14. Portability Benefits (on-going work)
     - Multiple compiler targets, selected by command line argument
     - SMP back-end
     - Cluster back-end (*)
       - For large instances
       - We generate codes that work on Pregel API [Malewicz et al. SIGMOD 2010]
     - GPU back-end (*)
       - For small instances
       - We know some tricks [Hong et al. PPOPP 2011]
     [Figure: DSL description goes into the DSL compiler, which emits (parallelized) C++, CUDA for GPU, or codes for cluster; each target comes with its own LIB (& RT)]

  15. Performance Benefits
     - Optimized data structure & code template
     - Back-end specific optimization
     - Use high-level semantic information
     [Figure: compiler pipeline: Parsing & Checking, Arch. Independent Opt, Arch. Dependent Opt, Code Generation. The generated code depends on the target arch. (SMP? GPU? Distributed?) and is built on a threading lib (e.g. OpenMP) and a graph data structure.]

  16. Arch-Indep-Opt: Loop Fusion
     - Two loops that iterate over the same "set" of nodes (elems are unique) are merged into one loop
     - C++ compiler cannot merge loops (independence not guaranteed)
     - Optimization enabled by high-level (semantic) information
     [Code example: before and after loop fusion; the code itself did not survive extraction]
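Since the slide's own code is lost, the following Python sketch is only a generic illustration of loop fusion over a node set; the property names `a` and `b` and the loop bodies are hypothetical. Both loops range over the same set of unique nodes, and each iteration touches only its own node's properties, which is the kind of high-level fact that makes merging the loops safe.

```python
def unfused(nodes, a, b):
    """Two separate passes over the same node set."""
    a, b = dict(a), dict(b)          # work on copies of the properties
    for n in nodes:
        a[n] = a[n] + 1
    for n in nodes:
        b[n] = a[n] * 2
    return a, b


def fused(nodes, a, b):
    """One pass: safe because iteration n reads and writes only node n."""
    a, b = dict(a), dict(b)
    for n in nodes:
        a[n] = a[n] + 1
        b[n] = a[n] * 2
    return a, b


nodes = {0, 1, 2}
a0 = {0: 1, 1: 2, 2: 3}
b0 = {0: 0, 1: 0, 2: 0}
fused_result = fused(nodes, a0, b0)
```

If the second loop read a neighbor's `a` value instead of its own, fusing would change the result, which is why a general-purpose compiler cannot merge such loops without knowing that the accesses are independent.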

  17. Arch-Indep-Opt: Flipping Edges
     - Graph-specific optimization
     - Form 1: counting the number of incoming neighbors whose B value is positive
     - Form 2: adding 1 for all outgoing neighbors, if my B value is positive
     - (Why?) Reverse edges may not be available or expensive to compute
     - Optimization using domain-specific property
     [Code example and figure: edges between s and t in both forms; the code itself did not survive extraction]
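The two forms described on this slide can be compared in a short Python sketch. It is an illustration under assumptions: the graph is a dictionary of out-neighbor lists, `b` stands in for the slide's B property, and `count` is a hypothetical name for the property being incremented. The incoming-neighbor form needs a reverse-edge index, while the outgoing-neighbor form uses only the forward edges.

```python
def push_count(out_nbrs, b):
    """Outgoing form: each node s with b[s] > 0 adds 1 to its out-neighbors."""
    count = {v: 0 for v in out_nbrs}
    for s in out_nbrs:
        if b[s] > 0:
            for t in out_nbrs[s]:
                count[t] += 1
    return count


def reverse_edges(out_nbrs):
    """Build the in-neighbor lists; this is the extra cost the slide refers to."""
    in_nbrs = {v: [] for v in out_nbrs}
    for s, targets in out_nbrs.items():
        for t in targets:
            in_nbrs[t].append(s)
    return in_nbrs


def pull_count(in_nbrs, b):
    """Incoming form: each node t counts its in-neighbors with positive b."""
    return {t: sum(1 for s in in_nbrs[t] if b[s] > 0) for t in in_nbrs}


out_nbrs = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
b = {0: 5, 1: -1, 2: 3, 3: 7}
pushed = push_count(out_nbrs, b)
pulled = pull_count(reverse_edges(out_nbrs), b)
```

Both forms give the same counts, so a compiler that knows the iteration is over graph edges can pick whichever direction the stored graph supports cheaply. In a parallel version of the outgoing form, the `count[t] += 1` update would need to be atomic, since several sources can share a target.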
