Arabesque.io
A system for distributed graph mining
Carlos Teixeira, Alexandre Fonseca, Marco Serafini, Georgos Siganos, Mohammed Zaki, Ashraf Aboulnaga
Graphs are ubiquitous
Graph Mining - Concepts
• Label: distinguishable property of a vertex (e.g. color).
• Pattern: “meta” subgraph that captures subgraph structure and labelling.
• Embedding: instance of a pattern, made of actual vertices and edges.
[Figure: input graph, a pattern, and its embeddings]
Graph Mining: Cliques
• Property: fully connected subgraphs
Graph Mining: Motifs
[Figure: motifs of size 3 and motifs of size 4]
Graph Mining: FSM
• Frequent subgraph mining in a single large graph.
• Find subgraphs that have a minimum embedding count.
[Figure: example input graph with 14 numbered vertices]
Applications
• Web:
  • Community detection, link spam detection
• Semantic data:
  • Attributed patterns in RDF
• Biology:
  • Characterize protein-protein or gene interaction
Challenges
• Exponential number of embeddings
Size of embedding    # unique embeddings
1                    4K
2                    22K
3                    335K
4                    7.8M
5                    117M
6                    1.7B
[Figure: number of unique embeddings (log scale) against size of embedding]
Challenges
• No standard way to solve these problems.
• No easy way to distribute the processing.
• Way too complicated for programmers (many … isms):
  • Detect and identify repeated subgraphs: automorphisms
  • Aggregate to pattern: isomorphism
• Above all, not all problems are tractable: no cluster grows exponentially.
State of the Art: Custom Algorithms
                       Easy to Code    Efficient Implementation    Transparent Distribution
Custom Algorithms      ✗               ✓                           ✗
State of the Art: Think Like a Vertex
                       Easy to Code    Efficient Implementation    Transparent Distribution
Custom Algorithms      ✗               ✓                           ✗
Think Like a Vertex    ✗               ✗                           ✓
Arabesque
• New execution model & system
  • Think Like an Embedding
  • Purpose-built for distributed graph mining
  • Hadoop-based
• Contributions:
  • Simple & generic API
  • High performance
  • Distributed & scalable by design
Arabesque
                       Easy to Code    Efficient Implementation    Transparent Distribution
Custom Algorithms      ✗               ✓                           ✗
Think Like a Vertex    ✗               ✗                           ✓
Arabesque              ✓               ✓                           ✓
API Example: Clique finding
State of the art (Mace, centralized): 4,621 LOC
    boolean filter(Embedding e) {
        return isClique(e);
    }

    void process(Embedding e) {
        output(e);
    }

    boolean shouldExpand(Embedding embedding) {
        return embedding.getNumVertices() < maxsize;
    }

    // The newest vertex must be connected to every vertex already in the embedding.
    boolean isClique(Embedding e) {
        return e.getNumEdgesAddedWithExpansion() == e.getNumVertices() - 1;
    }
API Example: Motif Counting
State of the art (GTrieScanner, centralized): 3,145 LOC
    boolean filter(Embedding e) {
        return true;
    }

    void process(Embedding embedding) {
        output(embedding);
        map(AGG_MOTIFS, embedding.getPattern(), reusableLongWritableUnit);
    }

    boolean shouldExpand(Embedding embedding) {
        return embedding.getNumVertices() < maxsize;
    }
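The `map(AGG_MOTIFS, pattern, 1)` call in the motif example amounts to counting embeddings per pattern. A minimal sketch of that idea, outside Arabesque: the toy graph, the class name `MotifCountSketch`, and the use of the edge count as a pattern key are all assumptions made for illustration (the edge count identifies the pattern only for unlabeled subgraphs of size 3).

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class MotifCountSketch {
    // Hypothetical toy graph: triangle 1-2-3 plus vertex 4 attached to vertex 3.
    static final Map<Integer, Set<Integer>> GRAPH = Map.of(
            1, Set.of(2, 3),
            2, Set.of(1, 3),
            3, Set.of(1, 2, 4),
            4, Set.of(3));

    // 1 if the edge a-b exists, 0 otherwise.
    static int edge(int a, int b) {
        return GRAPH.get(a).contains(b) ? 1 : 0;
    }

    // Counts connected 3-vertex subgraphs per pattern.
    // With 3 unlabeled vertices, 2 edges means an open chain and 3 edges a triangle.
    static Map<String, Long> countSize3Motifs() {
        Map<String, Long> counts = new TreeMap<>();
        int n = GRAPH.size();
        for (int a = 1; a <= n; a++) {
            for (int b = a + 1; b <= n; b++) {
                for (int c = b + 1; c <= n; c++) {
                    int edges = edge(a, b) + edge(a, c) + edge(b, c);
                    if (edges < 2) {
                        continue; // fewer than 2 edges: the 3 vertices are not connected
                    }
                    String pattern = (edges == 3) ? "triangle" : "chain";
                    counts.merge(pattern, 1L, Long::sum); // plays the role of map(key, 1)
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countSize3Motifs());
    }
}
```

On this toy graph the aggregation ends with one triangle and two chains. A real system cannot enumerate vertex triples like this; the point is only the "emit a pattern key, sum the values" shape of the aggregation.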
API Example: FSM
• Ours was the first distributed implementation
• 280 lines of Java code
  • … of which 212 compute the frequency metric
• Baseline (GRAMI): 5,443 lines of Java code.
Arabesque: An Efficient System
• As efficient as centralized state of the art
Application - Graph        Centralized Baseline    Arabesque (1 thread)
Motifs - MiCo (MS=3)       50s                     37s
Cliques - MiCo (MS=4)      281s                    385s    77s
FSM - CiteSeer (S=300)     4.8s                    5s
Arabesque: A Scalable System
• Scalable to thousands of workers
• Hours/days → Minutes
• First distributed implementation
Application - Graph    Centralized Baseline    Arabesque (640 cores)
Motifs - MiCo          2 hours 24 minutes      25 seconds
Cliques - MiCo         4 hours 8 minutes       1 minute 10 seconds
FSM - Patents          > 1 day                 1 minute 28 seconds
How: Arabesque Optimizations
• Avoid redundant work
  • Efficient canonicality checking
• Subgraph compression
  • Overapproximating Directed Acyclic Graphs (ODAGs)
• Efficient aggregation
  • 2-level pattern aggregation
Outline
• Graph mining exploration & Arabesque fundamentals
• System Architecture & Optimizations
• Evaluation of System
• How to Run & Code
Graph mining exploration & Arabesque fundamentals
Graph Mining - Exploration
• Iterative expansion
• Subgraph size n → subgraph size n + 1
• Connect to neighbours, one vertex at a time.
[Figure: input graph and the subgraphs explored at depth 1 and depth 2]
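One expansion step can be sketched as follows. This is an illustration only: the toy graph and the class name `ExpansionSketch` are assumptions, and subgraphs are kept as plain vertex sets so that a hash set removes duplicates, which is not how a distributed system would do it.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ExpansionSketch {
    // Hypothetical toy graph: triangle 1-2-3 plus vertex 4 attached to vertex 3.
    static final Map<Integer, Set<Integer>> GRAPH = Map.of(
            1, Set.of(2, 3),
            2, Set.of(1, 3),
            3, Set.of(1, 2, 4),
            4, Set.of(3));

    // Depth 1: every vertex on its own.
    static Set<Set<Integer>> singleVertices() {
        Set<Set<Integer>> result = new HashSet<>();
        for (int v : GRAPH.keySet()) {
            result.add(Set.of(v));
        }
        return result;
    }

    // Size n -> size n + 1: add one neighbouring vertex to each subgraph.
    static Set<Set<Integer>> expand(Set<Set<Integer>> current) {
        Set<Set<Integer>> next = new HashSet<>();
        for (Set<Integer> subgraph : current) {
            for (int v : subgraph) {
                for (int neighbour : GRAPH.get(v)) {
                    if (!subgraph.contains(neighbour)) {
                        Set<Integer> bigger = new TreeSet<>(subgraph);
                        bigger.add(neighbour);
                        next.add(bigger); // the set collapses duplicates reached via different paths
                    }
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        Set<Set<Integer>> level = singleVertices();
        for (int depth = 1; depth <= 3; depth++) {
            System.out.println("Depth " + depth + ": " + level.size() + " subgraphs");
            level = expand(level);
        }
    }
}
```

On the toy graph this gives 4 subgraphs at depth 1, 4 at depth 2 (the edges) and 3 at depth 3.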
Graph Mining - Exploration
[Figure: input graph and the subgraphs explored at depth 3]
Arabesque: Fundamentals
• Embeddings as 1st class citizens:
  • Think Like an Embedding model
• Arabesque responsibilities: graph exploration, aggregation (isomorphism), load balancing, automorphism detection
• User responsibilities: filter, process
Model - Think Like an Embedding
Each exploration step takes a set of input embeddings and produces a set of output embeddings:
1. Start from a set of initial embeddings.
2. Candidates: expand by 1 vertex/edge.
3. Filter uninteresting candidates.
4. Produce outputs.
• Filter and Process are user-defined functions.
• Filter = true: the candidate is processed and saved. Filter = false: the candidate is discarded.
[Figure: exploration steps 1 to 3, each mapping input embeddings to output embeddings]
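The four steps can be put together in a small single-machine loop. This is a sketch under stated assumptions, not Arabesque's engine: the `Computation` interface, `explore`, `findCliques` and the toy graph are hypothetical names, embeddings are plain vertex lists, and a per-step hash set stands in for duplicate detection.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ThinkLikeAnEmbeddingSketch {
    // Hypothetical toy graph: triangle 1-2-3 plus vertex 4 attached to vertex 3.
    static final Map<Integer, Set<Integer>> TOY_GRAPH = Map.of(
            1, Set.of(2, 3),
            2, Set.of(1, 3),
            3, Set.of(1, 2, 4),
            4, Set.of(3));

    /** Stand-in for the user-defined functions named on the slides. */
    interface Computation {
        boolean filter(List<Integer> embedding);
        void process(List<Integer> embedding);
        boolean shouldExpand(List<Integer> embedding);
    }

    static void explore(Map<Integer, Set<Integer>> graph, Computation app) {
        // Step 1: initial embeddings (here: every single vertex, taken as given).
        List<List<Integer>> frontier = new ArrayList<>();
        for (int v : new TreeSet<>(graph.keySet())) {
            frontier.add(List.of(v));
        }
        while (!frontier.isEmpty()) {
            List<List<Integer>> next = new ArrayList<>();
            Set<Set<Integer>> seen = new HashSet<>(); // duplicate detection for this step
            for (List<Integer> e : frontier) {
                for (int v : e) {
                    for (int nb : new TreeSet<>(graph.get(v))) {
                        if (e.contains(nb)) continue;
                        // Step 2: candidate = embedding expanded by one vertex.
                        List<Integer> candidate = new ArrayList<>(e);
                        candidate.add(nb);
                        if (!seen.add(new TreeSet<>(candidate))) continue;
                        // Step 3: discard uninteresting candidates.
                        if (!app.filter(candidate)) continue;
                        // Step 4: produce outputs and keep the embedding for the next step.
                        app.process(candidate);
                        if (app.shouldExpand(candidate)) next.add(candidate);
                    }
                }
            }
            frontier = next;
        }
    }

    // Clique finding expressed with the three functions.
    static List<List<Integer>> findCliques(Map<Integer, Set<Integer>> graph, int maxSize) {
        List<List<Integer>> found = new ArrayList<>();
        explore(graph, new Computation() {
            public boolean filter(List<Integer> e) {
                int last = e.get(e.size() - 1);
                return graph.get(last).containsAll(e.subList(0, e.size() - 1));
            }
            public void process(List<Integer> e) { found.add(e); }
            public boolean shouldExpand(List<Integer> e) { return e.size() < maxSize; }
        });
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findCliques(TOY_GRAPH, 3));
    }
}
```

With `maxSize` set to 3, the toy graph yields its four edges and the triangle (1, 2, 3).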
API Example: Clique finding
    boolean filter(Embedding e) {
        return isClique(e);
    }

    void process(Embedding e) {
        output(e);
    }

    boolean shouldExpand(Embedding embedding) {
        return embedding.getNumVertices() < maxsize;
    }

    // The newest vertex must be connected to every vertex already in the embedding.
    boolean isClique(Embedding e) {
        return e.getNumEdgesAddedWithExpansion() == e.getNumVertices() - 1;
    }
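The `isClique` test only looks at the last expansion. That is enough because the embedding it was expanded from had already passed the filter, so it was a clique of n - 1 vertices, and the new embedding is a clique exactly when the new vertex adds n - 1 edges. A self-contained sketch of the check: `edgesAddedWithExpansion` is a hypothetical stand-in for `getNumEdgesAddedWithExpansion()`, and the toy graph is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CliqueCheckSketch {
    // Hypothetical toy graph: triangle 1-2-3 plus vertex 4 attached to vertex 3.
    static final Map<Integer, Set<Integer>> GRAPH = Map.of(
            1, Set.of(2, 3),
            2, Set.of(1, 3),
            3, Set.of(1, 2, 4),
            4, Set.of(3));

    // Number of edges between the last vertex and all earlier vertices of the embedding.
    static int edgesAddedWithExpansion(List<Integer> embedding) {
        int last = embedding.get(embedding.size() - 1);
        int count = 0;
        for (int i = 0; i < embedding.size() - 1; i++) {
            if (GRAPH.get(last).contains(embedding.get(i))) {
                count++;
            }
        }
        return count;
    }

    // Same test as on the slide. It is only valid when the embedding without
    // its last vertex is already known to be a clique.
    static boolean isClique(List<Integer> embedding) {
        return edgesAddedWithExpansion(embedding) == embedding.size() - 1;
    }

    public static void main(String[] args) {
        System.out.println(isClique(List.of(1, 2, 3))); // vertex 3 is adjacent to 1 and 2
        System.out.println(isClique(List.of(1, 3, 4))); // vertex 4 is not adjacent to 1
    }
}
```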
Guarantee: Completeness
• For each e, if filter(e) == true then process(e) is executed.
• Requirement: anti-monotonicity. If filter(e) is false, filter must also be false for every expansion of e.
• Filter = true: keep expanding.
• Filter = false: we can prune and be sure that we won’t ignore desired embeddings.
[Figure: an embedding that passes the filter and its expansions, one passing and one failing]
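Anti-monotonicity can be checked by brute force on a small graph: whenever the filter rejects a vertex set, it must also reject that set extended by any one vertex. The sketch below is an illustration with hypothetical names (`AntiMonotonicitySketch`, `isAntiMonotonic`) and an assumed toy graph. It confirms the property for the clique filter and shows a filter that violates it.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

final class AntiMonotonicitySketch {
    // Hypothetical toy graph with vertices 1..4: triangle 1-2-3 plus vertex 4 attached to 3.
    static final Map<Integer, Set<Integer>> GRAPH = Map.of(
            1, Set.of(2, 3),
            2, Set.of(1, 3),
            3, Set.of(1, 2, 4),
            4, Set.of(3));

    // True if every pair of distinct vertices in the set is connected.
    static boolean isClique(Set<Integer> vertices) {
        for (int a : vertices) {
            for (int b : vertices) {
                if (a != b && !GRAPH.get(a).contains(b)) {
                    return false;
                }
            }
        }
        return true;
    }

    // Decodes a bitmask into a vertex set (bit v-1 set means vertex v is included).
    static Set<Integer> toSet(int mask) {
        Set<Integer> s = new TreeSet<>();
        for (int v = 1; v <= GRAPH.size(); v++) {
            if ((mask & (1 << (v - 1))) != 0) {
                s.add(v);
            }
        }
        return s;
    }

    // Checks: filter(S) == false implies filter(S + v) == false for every vertex v.
    // One-vertex extensions are enough; larger extensions follow by induction.
    static boolean isAntiMonotonic(Predicate<Set<Integer>> filter) {
        int n = GRAPH.size();
        for (int mask = 1; mask < (1 << n); mask++) {
            Set<Integer> small = toSet(mask);
            if (filter.test(small)) {
                continue;
            }
            for (int v = 1; v <= n; v++) {
                if (small.contains(v)) {
                    continue;
                }
                Set<Integer> bigger = new TreeSet<>(small);
                bigger.add(v);
                if (filter.test(bigger)) {
                    return false; // a rejected set has an accepted extension
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAntiMonotonic(AntiMonotonicitySketch::isClique));
        System.out.println(isAntiMonotonic(s -> s.size() == 3));
    }
}
```

The clique filter passes the check. A filter such as "exactly 3 vertices" does not, because it rejects every 2-vertex set whose extensions it would accept, so pruning on it would lose results.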
Aggregation during expansion
• Filter might need aggregated values
  • E.g.: Frequent subgraph mining
  • Frequency calculation → look at all candidates
• Aggregation in parallel with exploration step
  • Embeddings filtered as soon as aggregated values are ready.
Aggregation during expansion
• Filter function may depend on aggregated data
  • E.g.: Frequent subgraph mining
  • Frequency requires looking at all candidates
• Each exploration step reads the aggregated key-value pairs from the previous step and emits, through map(k, v), the aggregated key-value pairs for the next step:
  1. Initial embeddings
  1-1. Filter using aggr. values
  1-2. Process using aggr. values
  2. Candidates: expand by 1 vertex/edge
  4. Produce outputs and aggr. values
• Filter and Process are user-defined functions. Filtered-out embeddings are discarded, the others are saved.
[Figure: exploration steps 1 and 2 with the aggregation (Agg) stage between them]
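The dependency between aggregation and filtering can be sketched in two phases: one step aggregates a count per pattern, and the next step filters embeddings using those counts. The example is an assumption-laden illustration: embeddings are single labelled edges, the pattern is the sorted label pair, the support is a plain embedding count, and all names (`AggregationFilterSketch`, `aggregate`, `filterWithAggregates`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AggregationFilterSketch {
    // Hypothetical vertex labels and edge embeddings.
    static final Map<Integer, String> LABEL = Map.of(1, "A", 2, "B", 3, "A", 4, "B", 5, "C");
    static final int[][] EDGE_EMBEDDINGS = {{1, 2}, {3, 4}, {3, 2}, {4, 5}};

    // Pattern of an edge embedding: its two vertex labels in sorted order.
    static String pattern(int[] edge) {
        String a = LABEL.get(edge[0]);
        String b = LABEL.get(edge[1]);
        return a.compareTo(b) <= 0 ? a + "-" + b : b + "-" + a;
    }

    // Phase 1 (one exploration step): emit (pattern, 1) per embedding and sum per key.
    static Map<String, Integer> aggregate(int[][] embeddings) {
        Map<String, Integer> counts = new HashMap<>();
        for (int[] e : embeddings) {
            counts.merge(pattern(e), 1, Integer::sum);
        }
        return counts;
    }

    // Phase 2 (next step): keep only embeddings whose pattern reached the minimum count.
    static List<int[]> filterWithAggregates(int[][] embeddings,
                                            Map<String, Integer> counts,
                                            int minSupport) {
        List<int[]> kept = new ArrayList<>();
        for (int[] e : embeddings) {
            if (counts.getOrDefault(pattern(e), 0) >= minSupport) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = aggregate(EDGE_EMBEDDINGS);
        List<int[]> kept = filterWithAggregates(EDGE_EMBEDDINGS, counts, 2);
        System.out.println(counts + " -> kept " + kept.size() + " embeddings");
    }
}
```

With a minimum count of 2, the three A-B edges survive and the single B-C edge is dropped. The filter cannot run until the counts for all candidates are known, which is why the aggregated values come from the previous step.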
Arabesque API
• Main app-defined functions:
    boolean filter(E embedding);
    void process(E embedding);
    boolean shouldExpand(E newEmbedding);          // Terminate early if max depth defined
    boolean aggregationFilter(E embedding);        // Ignore embedding
    boolean aggregationFilter(Pattern pattern);    // Ignore pattern (e.g. not frequent)
    void aggregationProcess(E embedding);
    void handleNoExpansions(E embedding);
• Performance improvements:
    void filter(E existingEmbedding, IntCollection extensionPoints);    // Prune extensions
    boolean filter(E existingEmbedding, int newWord);                   // Canonicality check
• Functions provided by Arabesque:
    void output(String outputString);
    void map(String name, K key, V value);
    AggregationStorage<K, V> readAggregation(String name);
System Architecture & Optimizations
Arabesque Architecture
• Each step takes input embeddings of size n and produces output embeddings of size n + 1, which become the input of the next step.
• Embeddings are divided into splits that are spread over the workers:
  Worker 1: splits 1, 4, 7
  Worker 2: splits 2, 5, 8
  Worker 3: splits 3, 6, 9
[Figure: previous step and next step, with the splits handled by each worker]
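The worker and split numbers on the architecture slide are consistent with a round-robin assignment. A minimal sketch of that mapping, assuming 1-based split and worker numbers as in the figure (the class and method names are hypothetical, and round-robin is inferred from the figure rather than stated on the slide):

```java
final class SplitAssignmentSketch {
    // Round-robin: split 1 -> worker 1, split 2 -> worker 2, ..., wrapping around.
    static int workerFor(int split, int numWorkers) {
        return (split - 1) % numWorkers + 1;
    }

    public static void main(String[] args) {
        for (int split = 1; split <= 9; split++) {
            System.out.println("split " + split + " -> worker " + workerFor(split, 3));
        }
    }
}
```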
Avoiding redundant work
• Problem: Automorphic embeddings
  • Automorphisms == subgraph equivalences
  • Redundant work
[Figure: Worker 1 holds embedding (1, 2, 3) and Worker 2 holds (3, 2, 1), which is the same subgraph]
Avoiding redundant work
• Solution: Decentralized embedding canonicality
  • No coordination
  • Efficient
[Figure: for the same subgraph, isCanonical(e) → true on Worker 1 for (1, 2, 3) and isCanonical(e) → false on Worker 2 for (3, 2, 1)]
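A canonicality check lets each worker decide locally, without coordination, whether the vertex order it holds is the one designated to survive. The sketch below uses one possible rule for vertex-by-vertex expansion: the first vertex must have the smallest id, and a new vertex must not be smaller than any vertex placed after its first neighbour in the embedding. This rule, the triangle graph, and the names `isCanonicalExtension` and `isCanonical` are assumptions for illustration, not a transcription of Arabesque's code.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CanonicalitySketch {
    // Hypothetical input graph: a single triangle 1-2-3.
    static final Map<Integer, Set<Integer>> GRAPH = Map.of(
            1, Set.of(2, 3),
            2, Set.of(1, 3),
            3, Set.of(1, 2));

    // Incremental check: may vertex v be appended to the canonical embedding e?
    static boolean isCanonicalExtension(List<Integer> e, int v) {
        if (e.get(0) > v) {
            return false; // the first vertex must have the smallest id
        }
        boolean foundNeighbour = false;
        for (int u : e) {
            if (!foundNeighbour && GRAPH.get(u).contains(v)) {
                foundNeighbour = true; // first vertex of e that is adjacent to v
            } else if (foundNeighbour && u > v) {
                return false; // v should have been added before u
            }
        }
        return true;
    }

    // A whole sequence is canonical if every prefix extension passes the check.
    static boolean isCanonical(List<Integer> sequence) {
        for (int i = 1; i < sequence.size(); i++) {
            if (!isCanonicalExtension(sequence.subList(0, i), sequence.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isCanonical(List.of(1, 2, 3)));
        System.out.println(isCanonical(List.of(3, 2, 1)));
    }
}
```

Of the six vertex orders of the triangle, only (1, 2, 3) passes, so the worker holding (3, 2, 1) drops it. Each check reads only the embedding and the graph, which is what makes the decision decentralized.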
Efficient Pattern Aggregation
• Goal: Aggregate isomorphic patterns to a single key
  • Find canonical pattern
  • No known polynomial solution
[Figure: three equivalent patterns, each needing an expensive graph canonization (3x) to reach the canonical pattern]
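To see why canonization is expensive, consider a brute-force version: try every renumbering of the pattern's vertices and keep the smallest adjacency encoding as the key. This is a sketch for small unlabeled patterns only; the class name, the encoding, and the method names are assumptions, and the cost grows with the factorial of the pattern size.

```java
final class PatternCanonizationSketch {
    // Canonical key of an unlabeled pattern with vertices 0..k-1:
    // the lexicographically smallest upper-triangle adjacency string over all renumberings.
    static String canonicalKey(int k, int[][] edges) {
        boolean[][] adj = new boolean[k][k];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        int[] perm = new int[k];
        for (int i = 0; i < k; i++) {
            perm[i] = i;
        }
        String[] best = {null};
        permute(perm, 0, adj, best);
        return best[0];
    }

    // Enumerates all k! permutations by swapping, tracking the smallest encoding.
    static void permute(int[] perm, int start, boolean[][] adj, String[] best) {
        if (start == perm.length) {
            String key = encode(perm, adj);
            if (best[0] == null || key.compareTo(best[0]) < 0) {
                best[0] = key;
            }
            return;
        }
        for (int i = start; i < perm.length; i++) {
            swap(perm, start, i);
            permute(perm, start + 1, adj, best);
            swap(perm, start, i);
        }
    }

    // Reads the upper triangle of the adjacency matrix in the order given by perm.
    static String encode(int[] perm, boolean[][] adj) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < perm.length; i++) {
            for (int j = i + 1; j < perm.length; j++) {
                sb.append(adj[perm[i]][perm[j]] ? '1' : '0');
            }
        }
        return sb.toString();
    }

    static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        // The same chain numbered in two different ways, and a triangle.
        System.out.println(canonicalKey(3, new int[][] {{0, 1}, {1, 2}}));
        System.out.println(canonicalKey(3, new int[][] {{0, 2}, {2, 1}}));
        System.out.println(canonicalKey(3, new int[][] {{0, 1}, {1, 2}, {0, 2}}));
    }
}
```

The two chains get the same key and the triangle a different one, so equivalent patterns aggregate under one key. Doing this once per embedding would dominate the run time, which motivates aggregating in two levels so that the expensive step runs far less often.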