SPARQL Graph Pattern Processing with Apache Spark
Hubert Naacke (Université P. et M. Curie, Paris 6), Olivier Curé (Université Paris Est Marne-la-Vallée), Bernd Amann
GRADES 2017
Context
• Big RDF data
• Linked Open Data impulse: ever-growing RDF content
• Large datasets: billions of <subject, property, object> triples, e.g. DBpedia
• Query RDF data in SPARQL
• The building block is a Basic Graph Pattern (BGP) query
[Figure: example BGPs — a snowflake pattern from the WatDiv benchmark, and a chain pattern (t1, t2, t3: ?x advisor ?y, ?y teacherOf ?z, ?z type Course) from the LUBM benchmark]
Cluster computing platforms
• Cluster computing platforms provide
• main-memory data management
• distributed and parallel data access and processing
• fault tolerance, high availability
➭ Leverage an existing platform, e.g. Apache Spark
SPARQL on Spark: Architecture
[Figure: layered architecture — our hybrid SPARQL solutions (SPARQL RDD, SPARQL DF, SPARQL SQL, Hybrid RDD, Hybrid DF) sit on top of Spark's GraphX, DataFrame (DF, with data compression) and Resilient Distributed Datasets (RDD, no compression) layers, above cluster resource management and a distributed file system storing the RDF triples]
SPARQL query evaluation: challenges
• Requirements
• Low memory usage: no data replication, no indexing
• Fast data preparation: simple hash-based <Subject> partitioning
• Challenges
• Efficiently evaluate parallel and distributed join plans with Spark
➭ Favor local computation
➭ Reduce data transfers
• Benefit from several join algorithms:
• Local partitioned join: no transfer
• Distributed partitioned join
• Broadcast join
Solution
• Local subquery evaluation
• Merge multiple triple selections (a.k.a. shared scan; see the sketch below)
• Distributed query evaluation
• Cost model for partitioned and broadcast joins
• Generate hybrid join plans via dynamic programming
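To make the shared scan concrete, here is a minimal Spark sketch in Scala: one pass over the triples evaluates all three triple patterns of the Q9 example at once, tagging each match with the pattern it satisfies. The `Triple` case class, the predicate names, and the pattern ids are illustrative assumptions, not the paper's actual code.

```scala
import org.apache.spark.rdd.RDD

case class Triple(s: String, p: String, o: String)

// Shared scan: a single pass over each partition evaluates every triple
// selection of the query instead of scanning the data once per pattern.
// Pattern ids: 1 = (?x advisor ?y), 2 = (?y teacherOf ?z), 3 = (?z type Course).
def sharedScan(triples: RDD[Triple]): RDD[(Int, Triple)] =
  triples.flatMap { t =>
    val hits = scala.collection.mutable.ListBuffer.empty[(Int, Triple)]
    if (t.p == "advisor") hits += ((1, t))
    if (t.p == "teacherOf") hits += ((2, t))
    if (t.p == "type" && t.o == "Course") hits += ((3, t))
    hits
  }
```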
Hybrid plan: example and cost model

SELECT * WHERE {
  ?x advisor ?y .     # t1
  ?y teacherOf ?z .   # t2
  ?z type Course }    # t3

[Figure: three plan trees for Q9, combining partitioned (P) and broadcast (B) joins over the triple patterns t1, t2, t3]

Plan costs:
• cost(Q9_1) = C_t1 + C_t2 + C_{t2 ⨝ t3} (SPARQL RDD plan: partitioned joins only)
• cost(Q9_2) = m * (C_t2 + C_t3) (SPARQL DF plan: broadcast joins only)
• cost(Q9_3) = C_t1 + m * C_t3 (SPARQL Hybrid plan: mixes both)

with: C_pattern = θ_comm * size(pattern), where θ_comm is the unit transfer cost and m = #computeNodes − 1
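The cost formulas translate directly into code. The following Scala sketch computes the three plan costs; the pattern sizes, θ_comm value, and node count are made-up placeholders (in the system they come from statistics gathered at runtime).

```scala
// Cost model: C_pattern = thetaComm * size(pattern), m = #computeNodes - 1.
case class Stats(size: Map[String, Long], thetaComm: Double, computeNodes: Int)

def planCosts(st: Stats): Map[String, Double] = {
  def c(p: String): Double = st.thetaComm * st.size(p)
  val m = st.computeNodes - 1
  Map(
    "Q9_1 (RDD, partitioned joins)" -> (c("t1") + c("t2") + c("t2⨝t3")),
    "Q9_2 (DF, broadcast joins)"    -> (m * (c("t2") + c("t3"))),
    "Q9_3 (hybrid)"                 -> (c("t1") + m * c("t3"))
  )
}

// Placeholder statistics, for illustration only.
val st = Stats(
  size = Map("t1" -> 100000L, "t2" -> 20000L, "t3" -> 500L, "t2⨝t3" -> 1500L),
  thetaComm = 1.0, computeNodes = 17)
planCosts(st).foreach(println)
```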
Performance comparison with S2RDF
• S2RDF (VLDB 2016)
• Same dataset (1B triples) and same queries
• Various query patterns: Star (S), snowFlake (F), Complex (C)
[Figure: per-query response times for star, snowflake, and complex queries]
• Our approach: one dataset, <Subject> partitioning
• S2RDF: one dataset per property, <Property> and <Subject> partitioning
➭ Hybrid accelerates DF up to 2.4 times and S2RDF up to 2.2 times
Take-home message
• Existing cluster computing platforms are mature enough to process SPARQL queries at large scale.
• To accelerate query plans:
• Provide several distributed join algorithms
• Allow mixing several join algorithms within one plan
More info at the poster session…
Thank you. Questions?
Existing solutions
• S2RDF (VLDB 2016)
• Spark
• Long data preparation time
• Uses a single join algorithm
• CliqueSquare (ICDE 2015)
• Hadoop platform
• Data replicated 3 times: partitioned by subject, property, and object
• AdPart (VLDBJ 2016)
• Native distributed layer
• Semi-join-based join algorithm
• Distributed RDFox (ISWC 2016)
• Native distributed layer
• Data shuffling
Conclusion
• First detailed analysis of SPARQL processing on Spark
• Cost model aware of data transfers
• Efficient query plan generation
• Optimality not studied (future work)
• Extensive experiments at large scale
• Future work: incorporate other recent join algorithms
• Handle data skew
• Hypercube n-way join: targets load balancing
Thank you. Questions?
Extra slides
Hybrid plan: cost model (details)

[Figure: the three plan trees Q9_1, Q9_2, Q9_3, as on the example slide]

Plan costs:
• cost(Q9_1) = C_t1 + C_t2 + C_{t2 ⨝ t3} (SPARQL RDD plan)
• cost(Q9_2) = m * (C_t2 + C_t3) (SPARQL SQL plan)
• cost(Q9_3) = C_t1 + m * C_t3 (SPARQL Hybrid plan)

with: C_pattern = θ_comm * size(pattern), where θ_comm is the unit transfer cost and m = #computeNodes − 1
Data distribution (1/2): hash-based partitioning
• Requirement: hash-based partitioning on subject
• Partitioning is
• straightforward
• a simple map-reduce task
• without preparation overhead
[Figure: a (subject, property, object) dataset — e.g. (s1 p1 o1), (s2 p1 o2), (s3 p1 o2), (s1 p2 o3), (s2 p3 o4) — hashed on the subject into partitions Part 1 … Part N]
(a Spark partitioning sketch follows)
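A minimal Spark (Scala) sketch of the subject hash-partitioning; the input path, the parsing, and the partition count are illustrative assumptions.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdf-subject-partitioning"))

// Parse whitespace-separated triples and key each one by its subject.
val bySubject = sc.textFile("hdfs:///rdf/triples.nt")
  .map(_.split("\\s+", 3))
  .map(f => (f(0), (f(1), f(2)))) // (subject, (property, object))

// Hash-partition on the subject: all triples sharing a subject land on
// the same compute node, so star joins on the subject stay local.
val partitioned = bySubject.partitionBy(new HashPartitioner(16)).persist()
```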
Data distribution (2/2): over a cluster
[Figure: compute nodes 1 … N, each holding a piece of the data; per-node resources: CPU and memory; an operation reads data from memory and produces a result in memory]
➭ Communication between nodes is expensive
Parallel and distributed data processing workflow (1/2)
[Figure: each compute node 1 … N applies a local (MAP) operation, e.g. a selection, to its partition Part 1 … Part N, producing a partitioned result Result 1 … Result N]
• Examples of local MAP operations: selection, projection, join on subject
Parallel and distributed data processing workflow (2/2)
[Figure: a global (REDUCE) operation requires data transfers between the partitions Part 1 … Part n before producing Result 1 … Result n]
• Examples of global REDUCE operations: join, sort, distinct
(a sketch contrasting local and global operations follows)
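Continuing the partitioning sketch shown earlier (reusing the assumed `partitioned` RDD), the MAP/REDUCE distinction looks roughly as follows; the specific operations are illustrative.

```scala
// Local (MAP) operation: a selection runs independently on each
// partition, with no network traffic.
val advisors = partitioned.filter { case (_, (p, _)) => p == "advisor" }

// Global (REDUCE) operation: re-keying on the object and sorting forces
// a shuffle, i.e. data transfers between compute nodes.
val byObject = partitioned
  .map { case (s, (p, o)) => (o, (s, p)) }
  .sortByKey()
```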
Join processing w.r.t. the query pattern
Sample subject-partitioned data: person triples (P1 lab L1, P1 name Ali, P2 lab L3, P2 age 20, P2 name Bob, P3 lab L2, P3 name Clo, P4 lab L1, …) and lab triples (L1 at Poitiers, L1 since 2000, L2 at Aix, L2 at Toulon, L2 partner L1, L3 at Paris, L3 staff 200, …)
• Star query — find the laboratory and name of persons: ?P lab ?L . ?P name ?N
➭ both patterns share the subject ?P: no transfer
• Chain query — find the lab and its city for persons: ?P lab ?L . ?L at ?V
➭ the join variable ?L is a subject on only one side: transfer the lab or at triples
• Snowflake and complex queries combine star and chain joins (e.g. adding ?L staff ?s or ?P age ?a)
Join algorithms
• Partitioned join (Pjoin): distribute the data
• Broadcast join (BrJoin): broadcast to all nodes
• Hybrid join (our contribution): distribute and/or broadcast, based on a cost model
(a minimal encoding of this choice follows)
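A hypothetical Scala encoding of the planner's choice, just for orientation; the names are ours, not the paper's.

```scala
// The planner picks one of two physical algorithms per join; "hybrid"
// is not a third algorithm but a plan that mixes both, guided by the
// cost model.
sealed trait JoinAlgo
case object Pjoin extends JoinAlgo  // repartition both inputs on the join key
case object BrJoin extends JoinAlgo // broadcast the small input to every node
```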
Cost of join (1/2): partitioned join
[Figure: both sides — persons (P1 lab L1, P2 lab L3, P3 lab L2, P4 lab L1) and cities (C2 loc L1, C3 loc L1, C4 loc L2, C1 loc L3) — are re-hashed on the join attribute L, so equal keys meet on the same node and are joined locally; the result contains e.g. P1 lab L1 at Poitiers, P3 lab L2 at Aix, P3 lab L2 at Toulon, P2 lab L3 at Paris, P4 lab L1 at Poitiers]
• Data transfers = sum of the repartitioned datasets
• The result is partitioned on the join attribute L
(a Spark sketch follows)
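A minimal sketch of the partitioned join in Spark's RDD API, using the slide's lab/city example; the input names and shapes are assumptions.

```scala
import org.apache.spark.rdd.RDD

// Inputs keyed on the join attribute L. Spark's join() hash-partitions
// both sides on the key, so the transfer cost is roughly the sum of
// the two repartitioned datasets.
def partitionedJoin(labs: RDD[(String, String)],   // (L, person)
                    cities: RDD[(String, String)]) // (L, city)
    : RDD[(String, (String, String))] =            // (L, (person, city))
  labs.join(cities) // the result stays partitioned on L
```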
Cost of join (2/2): broadcast join
[Figure: the smaller dataset (L1 at Poitiers, L2 at Aix, L2 at Toulon, L3 at Paris) is broadcast to every node holding a partition of the larger target dataset (P1 lab L1, P2 lab L3, P3 lab L2, P4 lab L1); each node joins locally]
• Data transfers = size of the small dataset × number of compute nodes
• The result preserves the target's partitioning
(a Spark sketch follows)
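And the broadcast variant, under the same assumed inputs: the small side is shipped once to each node, so the transfer cost is its size times the number of compute nodes, and no shuffle of the large side is needed.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Broadcast the small (L -> city) map to every node; each partition of
// the large side joins locally, and the result keeps the target's
// partitioning.
def broadcastJoin(sc: SparkContext,
                  labs: RDD[(String, String)],   // (L, person), large side
                  cities: RDD[(String, String)]) // (L, city), small side
    : RDD[(String, String, String)] = {          // (person, L, city)
  val small = sc.broadcast(cities.collectAsMap())
  labs.flatMap { case (l, person) =>
    small.value.get(l).map(city => (person, l, city))
  }
}
```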
Proposed solution: hybrid join plan
• Cost model for Pjoin and BrJoin
• aware of data partitioning and the number of compute nodes
• uses the size of intermediate results
• Handle plans as joins of star patterns
• a star = a local Pjoin
➭ Get a linear join plan of stars, often with successive BrJoins between selective stars
• Build the plan at runtime, to get the size of intermediate results
Build the hybrid join plan
1) Compute all stars S1, S2, …, where Si = Pjoin(t1, t2, …)
2) Join two stars Si and Sj such that cost(Si ⨝ Sj) is minimal; this picks Si, Sj and a join algorithm; let Temp = Si ⨝ Sj
3) Continue with a third star Sk such that cost(Temp ⨝ Sk) is minimal, and so on
(a greedy sketch follows)
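A compact Scala sketch of steps 1–3: a greedy, cost-minimal ordering of the stars. The `Star` type and the cost function are placeholders; in the real system the cost model also picks the join algorithm at each step and re-estimates intermediate-result sizes at runtime.

```scala
// A star is the Pjoin of the triple patterns sharing a subject variable.
case class Star(id: Int, patterns: Seq[String])

// Greedily build a linear join order over the stars: start with the
// pair of minimal join cost, then repeatedly append the star that is
// cheapest to join with the plan built so far.
def buildPlan(stars: List[Star], cost: (Set[Star], Star) => Double): List[Star] = {
  require(stars.size >= 2)
  val ((s1, s2), _) =
    (for (a <- stars; b <- stars if a.id < b.id)
       yield ((a, b), cost(Set(a), b))).minBy(_._2)
  var plan = List(s1, s2)
  var rest = stars.filterNot(s => s == s1 || s == s2)
  while (rest.nonEmpty) {
    // the cost model would also fix Pjoin vs BrJoin for this step
    val next = rest.minBy(s => cost(plan.toSet, s))
    plan :+= next
    rest = rest.filterNot(_ == next)
  }
  plan
}
```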
SPARQL on Spark: qualitative comparison

Method                            | Join plan      | Optimizer
SPARQL RDD (Spark interface)      | Pjoin          |
SPARQL DF v1.5 (Spark interface)  | Pjoin, BrJoin1 | poor
SPARQL SQL v1.5 (Spark interface) | Pjoin, BrJoin1 | cross products
Hybrid RDD (our solution)         | Pjoin, BrJoin  | cost-based
Hybrid DF (our solution)          | Pjoin, BrJoin  | cost-based

[The slide also marks per-method support for data co-partitioning, merged selection, and compression]
Experimental validation: setup
• Datasets
  Dataset  | Nb of triples | Description
  DrugBank | 500K          | real dataset
  LUBM     | 1.3B          | synthetic data, Lehigh Univ.
  WatDiv   | 1.1B          | synthetic data, Waterloo Univ.
• Cluster
• 17 compute nodes
• resources per node: 12 cores × 2 hyperthreads, 64 GB memory
• 1 Gb/s interconnect
• Spark
• 16 worker nodes
• aggregated resources: 300 cores, 800 GB memory
• Solution
• implementation written in Scala; see the companion website
Experiments: performance gain
• Response time for the snowflake query Q8 from LUBM
• 2 dataset sizes: medium (100M triples), large (1B triples)
[Figure: response time vs. dataset size]
➭ Higher gains on larger datasets: 4.7 times faster without compression, 3 times faster with compressed data
Thanks for your attention. Questions?