SPARQL Graph Pattern Processing with Apache Spark


  1. SPARQL Graph Pattern Processing with Apache Spark — Hubert Naacke (Université Pierre et Marie Curie, Paris 6), Olivier Curé (Université Paris-Est Marne-la-Vallée), Bernd Amann — GRADES 2017

  2. Context
  • Big RDF data
  • Linked Open Data impulse: ever-growing RDF content
  • Large datasets: billions of <subject, prop, object> triples, e.g. DBpedia
  • Query RDF data in SPARQL
  • The building block is the Basic Graph Pattern (BGP) query
  • Examples: a snowflake pattern over offers, includes, Retail0 (variables ?u, ?x) from the WatDiv benchmark; a chain pattern of triple patterns t1, t2, t3 (?x advisor ?y . ?y teacherOf ?z . ?z type Course) from the LUBM benchmark

  3. Cluster computing platforms
  • Cluster computing platforms provide:
  • main-memory data management
  • distributed and parallel data access and processing
  • fault tolerance and high availability
  ➭ Leverage an existing platform, e.g. Apache Spark
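Throughout the following slides it helps to have a concrete picture of the data layer. Below is a minimal Spark/Scala sketch of loading <subject, prop, object> triples into a DataFrame; the file path, line format, and parsing are illustrative assumptions, not the authors' actual loader.

```scala
import org.apache.spark.sql.SparkSession

object LoadTriples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sparql-on-spark").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one triple per line, whitespace-separated terms.
    val triples = spark.read.textFile("hdfs:///data/triples.nt")
      .map { line =>
        val Array(s, p, o) = line.trim.stripSuffix(" .").split("\\s+", 3)
        (s, p, o)
      }
      .toDF("subject", "property", "object")

    triples.show(5)
  }
}
```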

  4. SPARQL on Spark: architecture
  • Input: a SPARQL graph pattern query
  • Query evaluation layers: SPARQL RDD, SPARQL DF, SPARQL SQL, and our Hybrid RDD / Hybrid DF solutions
  • Spark data layers: Resilient Distributed Datasets (RDD, no compression), DataFrames (DF, with data compression), GraphX
  • Underneath: cluster resource management and a distributed file system storing the RDF triples
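To make the layers concrete, here is a small sketch of the same triple-pattern selection expressed at the DataFrame (DF) and RDD layers. The dataset and column names carry over from the loading sketch above and are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD

// DataFrame layer: the selection for a pattern like "?y teacherOf ?z",
// expressed relationally so Spark's optimizer can see it.
def selectDF(triples: DataFrame, property: String): DataFrame =
  triples.filter(triples("property") === property)

// RDD layer: the same selection as a plain transformation.
def selectRDD(tripleRDD: RDD[(String, String, String)],
              property: String): RDD[(String, String, String)] =
  tripleRDD.filter { case (_, p, _) => p == property }
```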

  5. SPARQL query evaluation: challenges
  • Requirements:
  • low memory usage: no data replication, no indexing
  • fast data preparation: simple hash-based <Subject> partitioning
  • Challenges:
  • efficiently evaluate parallel and distributed join plans with Spark
  ➭ favor local computation
  ➭ reduce data transfers
  • benefit from several join algorithms: local partitioned join (no transfer), distributed partitioned join, broadcast join

  6. Solution
  • Local subquery evaluation: merge multiple triple selections (a shared scan)
  • Distributed query evaluation: a cost model for partitioned and broadcast joins
  • Generate hybrid join plans via dynamic programming
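The shared-scan idea can be sketched as a single pass that routes each triple to every triple pattern it matches, instead of one scan per pattern. The TriplePattern encoding and function names below are illustrative, not the paper's implementation.

```scala
import org.apache.spark.rdd.RDD

// None = variable position, Some(c) = constant to match.
case class TriplePattern(s: Option[String], p: Option[String], o: Option[String])

// One pass over the triples; each matching triple is tagged with the
// index of the pattern it satisfies, so all selections share the scan.
def sharedScan(triples: RDD[(String, String, String)],
               patterns: Seq[TriplePattern])
    : RDD[(Int, (String, String, String))] =
  triples.flatMap { case t @ (s, p, o) =>
    patterns.zipWithIndex.collect {
      case (pat, i)
          if pat.s.forall(_ == s) && pat.p.forall(_ == p) && pat.o.forall(_ == o) =>
        (i, t)
    }
  }
```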

  7. Hybrid plan: example and cost model
  • Legend: P = partitioned join (distributes data), B = broadcast join
  • Query Q9: SELECT * WHERE { ?x advisor ?y . ?y teacherOf ?z . ?z type Course }
  • Triple patterns of Q9: t1 = (?x advisor ?y), t2 = (?y teacherOf ?z), t3 = (?z type Course)
  • Three candidate plans over subject-partitioned data:
  • Q9_1 (SPARQL RDD plan): (t2 P⋈z t3) P⋈y t1, with cost(Q9_1) = C_t1 + C_t2 + C_{t2⋈t3}
  • Q9_2 (SPARQL DF plan): (t1 B⋈y t2) B⋈z t3, with cost(Q9_2) = m · (C_t2 + C_t3)
  • Q9_3 (SPARQL Hybrid plan): (t1 P⋈y t2) B⋈z t3, with cost(Q9_3) = C_t1 + m · C_t3
  • where C_pattern = transferCost(pattern), θ_comm is the unit transfer cost, and m = #computeNodes − 1
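The cost model lends itself to a direct transcription. The sketch below encodes the transfer costs of partitioned and broadcast joins and evaluates the three Q9 plans on hypothetical pattern sizes; all names and the sample sizes are assumptions.

```scala
// A sketch of the slide's cost model. `size` stands for the estimated
// transfer size of a pattern or intermediate result; thetaComm is the
// unit transfer cost and m = #computeNodes - 1.
case class CostModel(thetaComm: Double, computeNodes: Int) {
  val m: Int = computeNodes - 1

  // C_pattern = thetaComm * size(pattern)
  def transferCost(size: Long): Double = thetaComm * size

  // Partitioned join: each operand not already partitioned on the join
  // key is repartitioned; pass only those operands here.
  def pJoinCost(repartitionedSizes: Seq[Long]): Double =
    repartitionedSizes.map(transferCost).sum

  // Broadcast join: the small side is shipped to the m other nodes.
  def brJoinCost(broadcastSize: Long): Double = m * transferCost(broadcastSize)
}

// Example: costs of the three Q9 plans, with hypothetical sizes.
object Q9Costs extends App {
  val cm = CostModel(thetaComm = 1.0, computeNodes = 17)
  val (t1, t2, t3, t2t3) = (1000L, 800L, 500L, 300L)
  val q9_1 = cm.pJoinCost(Seq(t2)) + cm.pJoinCost(Seq(t1, t2t3))
  val q9_2 = cm.brJoinCost(t2) + cm.brJoinCost(t3)
  val q9_3 = cm.pJoinCost(Seq(t1)) + cm.brJoinCost(t3)
  println(s"cost(Q9_1)=$q9_1 cost(Q9_2)=$q9_2 cost(Q9_3)=$q9_3")
}
```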

  8. Performance comparison with S2RDF
  • S2RDF (VLDB 2016): same dataset (1B triples) and queries
  • Various query patterns: Star, snowFlake, Complex
  • One dataset with <Subject> partitioning (ours) vs. one dataset per property with <Property> and <Subject> partitioning (S2RDF)
  • Hybrid DF accelerates DF up to 2.4 times, and accelerates S2RDF up to 2.2 times

  9. Take-home message
  • Existing cluster computing platforms are mature enough to process SPARQL queries at large scale
  • To accelerate query plans: provide several distributed join algorithms, and allow mixing them within a single plan
  • More info at the poster session… Thank you. Questions?

  10. Existing solutions
  • S2RDF (VLDB 2016): Spark-based; long data preparation time; uses a single join algorithm
  • CliqueSquare (ICDE 2015): Hadoop platform; data replicated 3 times (by subject, property, and object)
  • AdPart (VLDBJ 2016): native distributed layer; "semi-join based" join algorithm
  • Distributed RDFox (ISWC 2016): native distributed layer; data shuffling

  11. Conclusion
  • First detailed analysis of SPARQL processing on Spark
  • Cost model aware of data transfers
  • Efficient query plan generation (optimality not studied; future work)
  • Extensive experiments at large scale
  • Future work: incorporate other recent join algorithms to handle data skew, e.g. the hypercube n-way join, which targets load balancing

  12. Thank you. Questions?

  13. Extra slides

  14. Hybrid plan: cost model
  • Plan costs, with C_pattern = θ_comm · size(pattern), θ_comm the unit transfer cost, and m = #computeNodes − 1:
  • cost(Q9_1) = C_t1 + C_t2 + C_{t2⋈t3} (SPARQL RDD plan)
  • cost(Q9_2) = m · (C_t2 + C_t3) (SPARQL SQL plan)
  • cost(Q9_3) = C_t1 + m · C_t3 (SPARQL Hybrid plan)

  15. Data distribution (1/2): hash-based partitioning
  • Dataset of (subject, prop, object) triples, e.g. s1 p1 o1, s2 p1 o2, s3 p1 o2, s1 p2 o3, s2 p3 o4, …
  • Requirement: hash-based partitioning on subject, spreading the triples over Part 1 … Part N
  • This partitioning is straightforward, a simple map-reduce task, with no preparation overhead
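A minimal sketch of this partitioning step with Spark's built-in HashPartitioner, keying each triple by its subject; the types and names are illustrative.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Keying the RDD by subject sends all triples sharing a subject to the
// same partition, so subject-subject joins need no further shuffle.
def partitionBySubject(triples: RDD[(String, String, String)],
                       numPartitions: Int): RDD[(String, (String, String))] =
  triples
    .map { case (s, p, o) => (s, (p, o)) } // key by subject
    .partitionBy(new HashPartitioner(numPartitions))
```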

  16. Data distribution (2/2): over a cluster
  • Compute nodes 1 … N, each holding a piece of the data in memory, with CPU and memory to run an operation and hold its result
  • Communication between nodes is expensive

  17. Parallel and distributed data processing workflow (1/2)
  • A partitioned dataset (Part 1 … Part N) is spread over compute nodes 1 … N
  • A local (MAP) operation, e.g. a selection, runs on each partition independently and yields a partitioned result (Result 1 … Result N)
  • Examples of local MAP operations: selection, projection, join on subject

  18. Parallel and distributed data processing workflow (2/2)
  • A global (REDUCE) operation requires data transfers between partitions before producing Results 1 … n
  • Examples of global REDUCE operations: join, sort, distinct
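The MAP/REDUCE distinction maps directly onto Spark's narrow and wide transformations. The sketch below contrasts a local selection (no transfer) with a key-based join (a shuffle); all dataset names are assumptions.

```scala
import org.apache.spark.rdd.RDD

// Local MAP: select triples with a given property; runs per partition,
// no data transfer.
def localSelect(bySubject: RDD[(String, (String, String))],
                prop: String): RDD[(String, (String, String))] =
  bySubject.filter { case (_, (p, _)) => p == prop }

// Global REDUCE: join two pattern results on a shared variable;
// Spark repartitions both sides on the join key (a shuffle).
def globalJoin(left: RDD[(String, String)],   // (joinKey, binding)
               right: RDD[(String, String)]): RDD[(String, (String, String))] =
  left.join(right)
```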

  19. Join processing w.r.t. the query pattern
  • Sample data: person triples (P1 lab L1, P1 name Ali, P2 lab L3, P2 age 20, P2 name Bob, P3 lab L2, P3 name Clo, P4 lab L1, …) and lab triples (L1 at Poitiers, L1 since 2000, L2 at Aix, L2 at Toulon, L2 partner L1, L3 at Paris, L3 staff 200, …)
  • Star query (find the laboratory and name of persons): ?P lab ?L . ?P name ?N — all triple patterns share the subject ?P, so with subject partitioning there is no transfer
  • Chain query (find the lab and its city for persons): ?P lab ?L . ?L at ?V — requires a transfer at lab or at
  • Snowflake and complex queries combine these shapes, adding e.g. name, staff, partner, and age patterns

  20. Join algorithms
  • Partitioned join (Pjoin): distributes the data
  • Broadcast join (Brjoin): broadcasts one side to all nodes
  • Hybrid join (our contribution): distributes and/or broadcasts, based on a cost model
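A broadcast join can be sketched at the RDD layer as follows: the small side is broadcast once to every node and joined by local lookup, so the large side never moves. The signature and names are illustrative, not the paper's code.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def broadcastJoin(sc: SparkContext,
                  large: RDD[(String, String)],      // (joinKey, binding)
                  small: Seq[(String, String)]): RDD[(String, (String, String))] = {
  val smallMap = sc.broadcast(small.toMap)           // shipped once per node
  large.flatMap { case (k, v) =>
    smallMap.value.get(k).map(w => (k, (v, w)))      // local lookup, no shuffle
  }
}
```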

  21. Cost of join (1/2): partitioned join
  • Example: subject-partitioned lab triples (P1 lab L1, P2 lab L3, P3 lab L2, P4 lab L1) joined with location triples on the lab L
  • Both operands are hash-repartitioned on L, then joined locally per L-value (join on L1, join on L2, join on L3)
  • Data transfers = the sum of the repartitioned datasets
  • The result (P1 lab L1 at Poitiers, P3 lab L2 at Aix, P3 lab L2 at Toulon, P2 lab L3 at Paris, P4 lab L1 at Poitiers) is partitioned on L

  22. Cost of join (2/2): broadcast join
  • Example: the smaller location dataset (L1 at Poitiers, L2 at Toulon, L2 at Aix, L3 at Paris) is broadcast to every partition of the larger target dataset of lab triples, which stays in place
  • Data transfers = size of the small dataset × number of compute nodes
  • The result (P1 lab L1 at Poitiers, P2 lab L3 at Paris, P3 lab L2 at Aix, P3 lab L2 at Toulon, P4 lab L1 at Poitiers) preserves the target's partitioning

  23. Proposed solution: hybrid join plan
  • Cost model for Pjoin and BrJoin, aware of data partitioning, the number of compute nodes, and the size of intermediate results
  • Handle plans over star patterns: a star is a local Pjoin ➭ get a linear join plan of stars, often with successive BrJoins between selective stars
  • Build the plan at runtime, using the actual size of intermediate results

  24. Build a hybrid join plan (see the sketch after this list)
  1) Compute all stars S_1, S_2, … where S_i = Pjoin(t1, t2, …)
  2) Join two stars S_i and S_j such that cost(S_i ⋈ S_j) is minimal ➭ this fixes S_i, S_j, and a join algorithm; let Temp = S_i ⋈ S_j
  3) Continue with a third star S_k such that cost(Temp ⋈ S_k) is minimal, and so on…
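A sketch of this greedy construction, assuming a simplified cost function and a hypothetical size estimator for intermediate results; the paper's actual cost model also accounts for existing co-partitioning.

```scala
sealed trait Plan { def estSize: Long }
case class Star(id: Int, estSize: Long) extends Plan
case class Join(left: Plan, right: Plan, algo: String, estSize: Long) extends Plan

// Simplified costs: Pjoin repartitions both sides; Brjoin ships the
// right side to the m other nodes.
def joinCost(l: Plan, r: Plan, algo: String, m: Int): Double = algo match {
  case "Pjoin"  => (l.estSize + r.estSize).toDouble
  case "Brjoin" => m.toDouble * r.estSize
}

def buildHybridPlan(stars: List[Star], m: Int): Plan = {
  require(stars.size >= 2, "need at least two stars")
  // Hypothetical size estimate for a join result.
  def estJoin(l: Plan, r: Plan): Long = math.min(l.estSize, r.estSize)

  var remaining = stars.toSet[Plan]

  // Step 2: pick the cheapest pair of stars and a join algorithm.
  val (s1, s2, a0, _) =
    (for {
      l <- remaining; r <- remaining if l != r
      a <- Seq("Pjoin", "Brjoin")
    } yield (l, r, a, joinCost(l, r, a, m))).minBy(_._4)
  var current: Plan = Join(s1, s2, a0, estJoin(s1, s2))
  remaining = remaining - s1 - s2

  // Step 3: greedily extend with whichever star is cheapest to join next.
  while (remaining.nonEmpty) {
    val (next, a, _) =
      (for { s <- remaining; alg <- Seq("Pjoin", "Brjoin") }
        yield (s, alg, joinCost(current, s, alg, m))).minBy(_._3)
    current = Join(current, next, a, estJoin(current, next))
    remaining -= next
  }
  current
}
```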

  25. SPARQL on Spark: qualitative comparison
  • Methods are compared on data co-partitioning, join plan, merged selection, query optimizer, and data compression
  • Spark interface: SPARQL RDD (Pjoin only); SPARQL DF v1.5 (Pjoin, BrJoin; poor optimizer); SPARQL SQL v1.5 (Pjoin, BrJoin; optimizer may produce cross products)
  • Our solutions: Hybrid RDD and Hybrid DF (Pjoin + BrJoin with a cost-based optimizer)

  26. Experimental validation: setup
  • Datasets:
  • DrugBank — 500K triples — real dataset
  • LUBM — 1.3B triples — synthetic data, Lehigh Univ.
  • WatDiv — 1.1B triples — synthetic data, Waterloo Univ.
  • Cluster: 17 compute nodes; per node: 12 cores × 2 hyperthreads, 64 GB memory; 1 Gb/s interconnect
  • Spark: 16 worker nodes; aggregated resources: 300 cores, 800 GB memory
  • Solution: implementation written in Scala, see the companion website

  27. Experiments: performance gain
  • Response time for the snowflake query Q8 from LUBM
  • Two dataset sizes: medium (100M triples) and large (1B triples)
  • Without compression: 4.7 times faster; with compressed data: 3 times faster
  ➭ the gain is higher for larger datasets

  28. Thanks for your attention. Questions?

  29. Extra slides
