SCALABLE DISTRIBUTED SUBGRAPH ENUMERATION • AUTHORS: LONGBIN LAI, LU QIN, XUEMIN LIN, YING ZHANG, LIJUN CHANG
OUTLINE • PROBLEM DEFINITION • ALGORITHM FRAMEWORK • TWINTWIG JOIN - VLDB15’ • SEED • EXPERIMENTS • CONCLUSION
PROBLEM
PROBLEM DEFINITION: SUBGRAPH ENUMERATION • Given a data graph G and a pattern graph P, subgraph enumeration aims to find all subgraphs (matches) g ⊆ G that are isomorphic to P. • [Figure: pattern graph P on nodes v_1 to v_4 and data graph G on nodes u_1 to u_6] • Example matches, as mappings (v_1, v_2, v_3, v_4) → (u_1, u_2, u_5, u_3), (u_4, u_2, u_3, u_5) and (u_6, u_3, u_2, u_5)
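The definition above can be illustrated with a minimal brute-force sketch. The function name `enumerate_matches` and the toy graphs are assumptions made for illustration; they are not the graphs from the slide figure, and this is not the distributed algorithm the talk describes.

```python
from itertools import permutations

def enumerate_matches(pattern_edges, data_edges):
    """Return every injective mapping pattern node -> data node that
    sends each pattern edge onto a data edge (non-induced matching)."""
    p_nodes = sorted({v for e in pattern_edges for v in e})
    d_nodes = sorted({u for e in data_edges for u in e})
    d_adj = {frozenset(e) for e in data_edges}
    matches = []
    for image in permutations(d_nodes, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(frozenset((f[a], f[b])) in d_adj for a, b in pattern_edges):
            matches.append(f)
    return matches

# Hypothetical toy input: a triangle pattern and a small data graph.
triangle = [("v1", "v2"), ("v2", "v3"), ("v1", "v3")]
data = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"), ("u3", "u4")]
matches = enumerate_matches(triangle, data)
subgraphs = {frozenset(m.values()) for m in matches}
```

The toy data graph contains a single triangle, so all mappings found cover the same three data nodes; they differ only by the automorphisms of the pattern.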
FRAMEWORK
PATTERN DECOMPOSITION • P = p_0 ∪ p_1 ∪ p_2, where p_0, p_1 and p_2 are the join units • [Figure: pattern graph P (nodes v_1 to v_4), data graph G (nodes u_1 to u_6), and the three join units p_0, p_1, p_2 obtained by decomposing P]
WHAT CAN BE JOIN UNITS • Graph storage: Φ(G) = { G_u | u ∈ V(G) } • Stored as (u; G_u) for each data node u • G_u: local graph of u, s.t. • (1) G_u is connected • (2) u ∈ V(G_u) • (3) ⋃_{u ∈ V(G)} E(G_u) = E(G)
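The three storage conditions can be checked mechanically on a small graph. The sketch below is illustrative only: the helper names and the toy data graph are assumptions, and the local graph used (a node plus its incident edges) is just one storage that happens to satisfy the conditions.

```python
from collections import deque

def simple_local_graph(edges, u):
    """Assumed local graph of u: u, its neighbours, and the edges incident to u."""
    incident = {frozenset(e) for e in edges if u in e}
    nodes = {u} | {x for e in incident for x in e}
    return nodes, incident

def is_connected(nodes, edges):
    """Breadth-first search over one local graph."""
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        x = queue.popleft()
        for e in edges:
            if x in e:
                for y in e - {x}:
                    if y not in seen:
                        seen.add(y)
                        queue.append(y)
    return seen == nodes

def is_valid_storage(edges, storage):
    """Check conditions (1)-(3) for storage = {u: (nodes, edges)}."""
    all_edges = {frozenset(e) for e in edges}
    covered = set().union(*(g_edges for _, g_edges in storage.values()))
    return (
        all(is_connected(n, e) for n, e in storage.values())  # (1) connected
        and all(u in n for u, (n, _) in storage.items())       # (2) u in V(G_u)
        and covered == all_edges                               # (3) edges covered
    )

# Hypothetical toy data graph.
data = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"), ("u3", "u4")]
data_nodes = sorted({x for e in data for x in e})
storage = {u: simple_local_graph(data, u) for u in data_nodes}
edgeless = {u: ({u}, set()) for u in data_nodes}  # violates condition (3)
```

Here `storage` passes all three checks, while `edgeless` fails because the union of its local edge sets does not recover E(G).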
WHAT CAN BE JOIN UNITS • A structure p can be a join unit iff R_G(p) = ⋃_{u ∈ V(G)} R_{G_u}(p) • R_G(p) stands for the matches of p in G
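As a rough illustration of the join-unit condition, the sketch below compares R_G(p) with the union of the R_{G_u}(p) on one concrete graph. All names and the toy graph are assumptions; the local graphs hold only the edges incident to each node, and a check on a single graph is evidence, not a proof of the general condition.

```python
from itertools import permutations

def matches(pattern_edges, data_edges):
    """Brute-force matches of a pattern, as tuples over the sorted pattern nodes."""
    p_nodes = sorted({v for e in pattern_edges for v in e})
    d_nodes = sorted({u for e in data_edges for u in e})
    d_adj = {frozenset(e) for e in data_edges}
    found = set()
    for image in permutations(d_nodes, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(frozenset((f[a], f[b])) in d_adj for a, b in pattern_edges):
            found.add(tuple(image))
    return found

def is_join_unit(pattern_edges, data_edges, local_graphs):
    """True if the matches in G equal the union of matches in the local graphs."""
    in_locals = set()
    for local_edges in local_graphs.values():
        in_locals |= matches(pattern_edges, local_edges)
    return matches(pattern_edges, data_edges) == in_locals

# Hypothetical toy data graph; each local graph keeps only incident edges.
data = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"), ("u3", "u4")]
data_nodes = sorted({x for e in data for x in e})
local_graphs = {u: [e for e in data if u in e] for u in data_nodes}

two_star = [("v1", "v2"), ("v1", "v3")]
triangle = [("v1", "v2"), ("v2", "v3"), ("v1", "v3")]
```

On this input the two-edge star satisfies the condition, while the triangle does not, because no single local graph of this kind contains all three triangle edges.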
JOIN PLAN (TREE) • Decomposing: P = p_0 ∪ p_1 ∪ p_2 ∪ p_3 • Solving: R(P) = R(p_0) ⋈ R(p_1) ⋈ R(p_2) ⋈ R(p_3) • Join tree: R(P′_1) = R(p_0) ⋈ R(p_1), R(P′_2) = R(P′_1) ⋈ R(p_2), R(P) = R(P′_2) ⋈ R(p_3) • The matches of each join unit can be computed online, independently in each local graph
JOIN PLAN (TREE) • Decomposing: P = p_0 ∪ p_1 ∪ p_2 ∪ p_3 • Solving: R(P) = R(p_0) ⋈ R(p_1) ⋈ R(p_2) ⋈ R(p_3) • Left-deep tree: R(P′_1) = R(p_0) ⋈ R(p_1), R(P′_2) = R(P′_1) ⋈ R(p_2), R(P) = R(P′_2) ⋈ R(p_3) • Bushy tree: R(P′_1) = R(p_0) ⋈ R(p_1), R(P′_2) = R(p_2) ⋈ R(p_3), R(P) = R(P′_1) ⋈ R(P′_2)
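The two tree shapes can be compared on a toy input. In this sketch the `join` helper, the square pattern, and the use of single edges as join units are all assumptions for illustration; it only shows that both shapes produce the same R(P) while differing in their intermediate results.

```python
def join(left, right):
    """Natural join of partial matches (dicts pattern node -> data node)
    on their shared pattern nodes, keeping only injective results."""
    out = []
    for a in left:
        for b in right:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                merged = {**a, **b}
                if len(set(merged.values())) == len(merged):
                    out.append(merged)
    return out

def edge_matches(pattern_edge, data_edges):
    """Matches of a single pattern edge: every data edge in both directions."""
    x, y = pattern_edge
    return [m for a, b in data_edges for m in ({x: a, y: b}, {x: b, y: a})]

# Hypothetical toy input: a square pattern over a square data graph.
data = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u1")]
p0, p1, p2, p3 = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v1")]
r0, r1, r2, r3 = (edge_matches(p, data) for p in (p0, p1, p2, p3))

left_deep = join(join(join(r0, r1), r2), r3)   # ((p0 join p1) join p2) join p3
bushy = join(join(r0, r1), join(r2, r3))       # (p0 join p1) join (p2 join p3)

def as_set(result):
    return {tuple(sorted(m.items())) for m in result}
```

Both plans yield the same set of matches; which one is cheaper depends on the sizes of the intermediate results, which is what the cost-based plan selection later in the talk addresses.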
DESCRIBE THE ALGORITHMS • Graph storage mechanism • Determines the join units, and thereby the pattern decomposition • Join structure • Left-deep tree vs. bushy tree
TWINTWIG JOIN - VLDB15’
TWINTWIG JOIN (VLDB 2015): SIMPLE GRAPH STORAGE • The simple graph storage, where each local graph G_u has V(G_u) = {u} ∪ N(u) and E(G_u) = { (u, u′) | u′ ∈ N(u) } • [Figure: data graph G (nodes u_1 to u_6) with the local graphs G_{u_1} and G_{u_2}] • Star as the join unit • A node with degree 1,000,000 will generate 10^18 3-stars
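The blow-up quoted above can be reproduced with a short calculation. The function names are hypothetical, and reading the slide's 10^18 as the number of ordered leaf choices is an assumption; the unordered count is smaller by a factor of 3! but of a similar order of magnitude.

```python
from math import comb, perm

def ordered_stars(degree, leaves):
    """Stars centred on one node when the leaves are matched in order."""
    return perm(degree, leaves)

def unordered_stars(degree, leaves):
    """Stars centred on one node when leaf order is ignored."""
    return comb(degree, leaves)

degree = 1_000_000
ordered = ordered_stars(degree, 3)      # 999997000002000000, about 10^18
unordered = unordered_stars(degree, 3)  # 166666166667000000, about 1.7 * 10^17
```

Either way, a single high-degree node dominates the intermediate result size, which motivates restricting the size of the join unit.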
TWINTWIG JOIN: SIMPLE GRAPH STORAGE • Using twintwigs as the join units • Instance optimality: given any join plan involving general stars, we can solve it using twintwigs with at most the same (often much lower) cost
TWINTWIG JOIN: LEFT-DEEP JOIN PLAN • An optimal left-deep join plan with minimum estimated cost • [Figure: left-deep join tree over the pattern on v_1 to v_4; p_0 and p_1 are joined first, and the result is then joined with p_2]
TWINTWIG JOIN: DRAWBACKS • The simple storage mechanism only supports using stars as join units, which produces too many intermediate results • Twintwig: confined to at most two edges • A node with degree 1,000,000 still has 10^12 two-edge twintwigs • Too many execution rounds • A clique of 6 nodes (15 edges): seven rounds of TwinTwigJoin
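The round count quoted for the 6-clique can be sanity-checked with simple arithmetic. This is a sketch under stated assumptions: each twintwig covers at most two pattern edges, the join units partition the pattern edges, and each join in a left-deep plan takes one round. The function names are hypothetical.

```python
from math import ceil, comb

def min_twintwigs(num_pattern_edges):
    """Fewest join units when each unit covers at most two edges."""
    return ceil(num_pattern_edges / 2)

def left_deep_rounds(num_join_units):
    """A left-deep plan over k units performs k - 1 joins, one per round."""
    return num_join_units - 1

clique_edges = comb(6, 2)             # 15 edges in a 6-clique
units = min_twintwigs(clique_edges)   # 8 twintwigs
rounds = left_deep_rounds(units)      # 7 rounds
```

Under these assumptions the result agrees with the seven rounds stated on the slide.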
TWINTWIG JOIN: DRAWBACKS • Left-deep join: may result in sub-optimal join plans • [Figure: a six-node pattern on v_1 to v_6 decomposed into join units p_0 to p_3, with a left-deep join tree shown next to a bushy join tree over R(p_0), R(p_1), R(p_2) and R(p_3)] • The optimal solution is a bushy join
SEED (VLDB 2017): MOTIVATIONS • SEED: Subgraph EnumEration in Distributed context • SCP (Star-Clique-Preserved) graph storage: use star and clique as the join units • We can avoid using a star if a clique is an alternative • Shorter execution: the 6-clique can now be processed in one single round, instead of 7 rounds in TwinTwigJoin • Bushy join plan: optimality guarantee • Much better performance
SEED
SEED: SCP GRAPH STORAGE • The SCP graph storage, where each local graph G⁺_u has V(G⁺_u) = V(G_u) = {u} ∪ N(u) and E(G⁺_u) = E(G_u) ∪ { (u′, u″) | (u′, u″) ∈ E(G) ∧ u′, u″ ∈ N(u) } • E(G_u): NEIGHBOUR EDGES • { (u′, u″) | (u′, u″) ∈ E(G) ∧ u′, u″ ∈ N(u) }: TRIANGLE EDGES • [Figure: data graph G (nodes u_1 to u_6) with the local graphs G⁺_{u_1} and G⁺_{u_2}, neighbour edges and triangle edges marked]
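The definition of G⁺_u translates directly into a few set operations. The sketch below is a single-machine illustration with a hypothetical function name and toy graph; it does not show how the storage is built or partitioned in the distributed setting.

```python
def scp_local_graph(edges, u):
    """Build G+_u: nodes {u} | N(u), neighbour edges, and triangle edges."""
    edge_set = {frozenset(e) for e in edges}
    neighbours = {x for e in edge_set if u in e for x in e if x != u}
    nodes = {u} | neighbours
    neighbour_edges = {e for e in edge_set if u in e}         # E(G_u)
    triangle_edges = {e for e in edge_set if e <= neighbours}  # both ends in N(u)
    return nodes, neighbour_edges, triangle_edges

# Hypothetical toy data graph: a triangle u1-u2-u3 plus the edge u3-u4.
data = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"), ("u3", "u4")]
nodes_u1, nbr_u1, tri_u1 = scp_local_graph(data, "u1")
nodes_u4, nbr_u4, tri_u4 = scp_local_graph(data, "u4")
```

For `u1` the local graph gains the triangle edge between `u2` and `u3`, so the whole triangle is visible locally; for `u4` there are no triangle edges and the local graph is the same as in the simple storage.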
SEED: SCP GRAPH STORAGE • We show that SCP graph storage supports using both star and clique as the join units • A more compact version bounds the size of each local graph
SEED: OPTIMAL BUSHY JOIN PLAN • Notations • E_P: the join plan to solve P • C(E_P): the cost of the join plan • C(P): estimated number of matches of P in G • We aim at finding a join plan E_P for P s.t. C(E_P) is minimised
SEED: OPTIMAL BUSHY JOIN PLAN • A dynamic programming transform function • e.g. the plan E_{P′} for a sub-pattern P′ with left part P′_l and right part P′_r consists of (1) E_{P′_l}, (2) E_{P′_r}, and (3) the join R(P′) = R(P′_l) ⋈ R(P′_r) • C(E_{P′}) = min over P′_l ⊂ P′ ∧ P′_r = P′ \ P′_l of { C(E_{P′_l}) + C(P′_l) + C(E_{P′_r}) + C(P′_r) }
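The recurrence can be sketched as a memoised search over edge subsets. Several pieces here are assumptions rather than the authors' method: only stars are treated as join units, a join unit is given plan cost 0, both sides of a split must be connected, and `estimate` is a toy n^|V| · p^|E| estimate standing in for C(P). All function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

def nodes_of(edges):
    return {v for e in edges for v in e}

def is_star(edges):
    """True if all edges share a common node (a single edge counts)."""
    return bool(set.intersection(*(set(e) for e in edges)))

def is_connected(edges):
    edges = list(edges)
    seen, remaining = set(edges[0]), edges[1:]
    changed = True
    while changed:
        changed = False
        for e in remaining[:]:
            if seen & e:
                seen |= e
                remaining.remove(e)
                changed = True
    return not remaining

def estimate(edges, n=1000, p=0.01):
    """Toy stand-in for C(P): n^|V| * p^|E| (an assumption)."""
    return n ** len(nodes_of(edges)) * p ** len(edges)

def best_plan(pattern_edges):
    """Return (cost, plan) for a connected pattern given as a list of edges."""
    pattern = frozenset(frozenset(e) for e in pattern_edges)

    @lru_cache(maxsize=None)
    def solve(sub):
        if is_star(sub):  # assumed base case: a join unit costs nothing to plan
            return 0.0, ("unit", tuple(sorted(tuple(sorted(e)) for e in sub)))
        best = None
        ordered = sorted(sub, key=sorted)
        for k in range(1, len(ordered)):
            for left in combinations(ordered, k):
                l = frozenset(left)
                r = sub - l
                if not (is_connected(l) and is_connected(r)):
                    continue
                (cl, pl), (cr, pr) = solve(l), solve(r)
                cost = cl + cr + estimate(l) + estimate(r)
                if best is None or cost < best[0]:
                    best = (cost, ("join", pl, pr))
        return best

    return solve(pattern)

square = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v1")]
cost, plan = best_plan(square)
```

For the square pattern, the cheapest plan under this toy estimate joins two two-edge stars directly, which is a bushy shape; splitting off one edge at a time would force a more expensive three-edge path as an intermediate result.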
EXPERIMENTS
EXPERIMENTS: SETUP • Queries q_1 to q_7 • [Figure: seven query patterns on nodes v_1 to v_6, each annotated with ordering constraints on its nodes, e.g. v_1 < v_2 < v_3 < v_4 for q_3 and v_2 < v_5 for q_4] • Algorithms • SEED+O (the most optimised SEED) • TT (the most optimised TwinTwigJoin, VLDB 2015) • PSgL (Shao et al., SIGMOD 2014)
EXPERIMENTS: SETUP • Cluster: Amazon EC2, 1 master node and 10 slave nodes
Node    Instance    vCPU  Memory  Disk
master  m3.xlarge   4     15GB    2 x 40GB SSD
slave   c3.4xlarge  16    30GB    2 x 160GB SSD
• Hadoop 2.6.2 • JVM heap space: mapper 1524MB, reducer 2848MB • 6 mappers and 6 reducers on each machine
EXPERIMENTS: RESULTS • [Figure: two bar charts of running time (s), axis ticks 10^1 to 10^4 plus INF, comparing SEED+O, TT and PSgL on datasets yt and lj; query labels not recoverable]
EXPERIMENTS: RESULTS • [Figure: two bar charts of running time (s), axis ticks 10^1 to 10^4 plus INF, comparing SEED+O, TT and PSgL on datasets yt and lj for queries q_3 and q_4]
EXPERIMENTS: RESULTS • [Figure: two bar charts of running time (s), axis ticks 10^1 to 10^4 plus INF, comparing SEED+O, TT and PSgL on datasets yt and lj for queries q_5 and q_6]
EXPERIMENTS: RESULTS • [Figure: bar chart of running time (s), axis ticks 10^1 to 10^4 plus INF, comparing SEED+O, TT and PSgL on datasets yt and lj; query label not recoverable]
CONCLUSION • A general decompose-and-join framework to solve subgraph enumeration • TwinTwigJoin = simple graph storage (twintwigs as the join units) + optimal left-deep join • SEED = SCP graph storage (star and clique as the join units) + optimal bushy join
Q & A THANK YOU!