
SPARQLing Pig: Processing Linked Data with Pig Latin



  1. SPARQLing Pig: Processing Linked Data with Pig Latin. Stefan Hagedorn, Katja Hose, Kai-Uwe Sattler. BTW 2015.

  2. Motivation

     Pig
     - data flow language, tuple oriented
     - compiled to MapReduce
     - process large datasets
     - access data in HDFS

         a = LOAD 'hdfs:///data.csv' AS (id, x: int, y: int);
         b = FILTER a BY x > 10 AND y < 50;
         c = FOREACH b GENERATE id;
         STORE c INTO 'hdfs:///data2.csv';

     Linked Data / RDF
     - connect information / datasets
     - triple format: (subject, predicate, object)
     - query language: SPARQL
     - federated query processing

         <event1> <type> <Concert> .
         <event1> <start> "2012-08-17T18:00" .
         <event1> <geo:long> "-1.135E2"^^<xsd:double> .
         <event1> <geo:lat> "5.353E1"^^<xsd:double> .
         <event1> <artist> <Metallica> .
         <Metallica> <type> <Band> .
         <Metallica> <name> "Metallica" .
         <Metallica> <founded> "1981"^^<xsd:integer> .

         SELECT ?x ?y WHERE {
             ?event <geo:long> ?x .
             ?event <geo:lat> ?y .
             ?event <artist> ?artist .
             ?artist <name> "Metallica" .
         }

  3. Motivation
     - Big Data (tuples) and Linked Data (triples): combined execution model
     - existing solutions: SPARQL to Pig, e.g. Alexander Schätzle et al., PigSPARQL: Übersetzung von SPARQL nach PigLatin [PigSPARQL: Translating SPARQL to Pig Latin], BTW 2011

  4. Motivation

     Problems
     - no BGPs (basic graph patterns) in Pig
     - self joins to reconstruct entities
     - load dataset twice (or more)
     - COGROUP: combination of MapReduce jobs

     Contribution
     - Pig Latin language extension
     - data model
     - add SPARQL-like features
     - not only one dataset + access remote data
     - efficient processing of Linked Data in Pig
     - results as foundation for cost-based Pig compiler/rewriter
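To make the self-join problem concrete, here is a minimal Python sketch (not from the slides) that evaluates a two-pattern star query over flat triples. The function name `star_self_join` and the sample data are illustrative assumptions. The point is only that each triple pattern needs its own scan of the same relation before the scans can be joined on the subject.

```python
# Flat (subject, predicate, object) triples, as a plain tuple relation.
triples = [
    ("<event1>", "<geo:lat>", "5.353E1"),
    ("<event1>", "<geo:long>", "-1.135E2"),
    ("<event1>", "<type>", "<Concert>"),
    ("<Metallica>", "<type>", "<Band>"),
]


def star_self_join(triples, pred_a, pred_b):
    """Return (subject, obj_a, obj_b) for subjects that have both predicates.

    Hypothetical helper: it mirrors a relational self join, scanning the
    triple list once per triple pattern and joining the scans on subject.
    """
    left = [(s, o) for s, p, o in triples if p == pred_a]   # first scan
    right = [(s, o) for s, p, o in triples if p == pred_b]  # second scan
    return [(s1, o1, o2) for s1, o1 in left for s2, o2 in right if s1 == s2]


result = star_self_join(triples, "<geo:lat>", "<geo:long>")
```

With N triple patterns the flat representation needs N such scans, which is the cost the grouped data model is meant to avoid.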

  5. Outline
     1. Data Model
     2. Pig Extensions: conversion, load/access, BGP support
     3. Extended Pig Rewriting
     4. Evaluation

  6. Data Model

     RDF
     - very flexible model
     - represent arbitrary structures and graphs
     - requires self joins in Pig
     - fixed schema not flexible enough: $(s, p_1, \ldots, p_n)$

     Our approach: for each subject, a bag of predicate-object pairs

         { subject: bytearray, stmts: { (predicate: bytearray, object: bytearray) } }

     Plain triples:

         <event1> <artist> <Metallica> .
         <event1> <start> "2012-08-17T18:00" .
         <event1> <geo:long> "-1.135E2" .
         <event1> <geo:lat> "5.353E1" .
         <event1> <type> <Concert> .
         <Metallica> <type> <Band> .
         <Metallica> <name> "Metallica" .
         <Metallica> <founded> "1981" .

     Grouped by subject:

         <event1>,    { (<artist>,<Metallica>),
                        (<start>,"2012-08-17T18:00"),
                        (<geo:long>,"-1.135E2"),
                        (<geo:lat>,"5.353E1"),
                        (<type>,<Concert>) },
         <Metallica>, { (<type>,<Band>),
                        (<name>,"Metallica"),
                        (<founded>,"1981") }

     H. Kim et al., From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra, PVLDB 2011

  7. Data Model

     Our approach: for each predicate, a bag of subject-object pairs

         { predicate: bytearray, stmts: { (subject: bytearray, object: bytearray) } }

     The same triples as on the previous slide, grouped by predicate:

         <type>,     { (<Metallica>, <Band>),
                       (<event1>, <Concert>) },
         <artist>,   { (<event1>, <Metallica>) },
         <start>,    { (<event1>, "2012-08-17T18:00") },
         <geo:long>, { (<event1>, "-1.135E2") },
         <geo:lat>,  { (<event1>, "5.353E1") },
         <name>,     { (<Metallica>, "Metallica") },
         <founded>,  { (<Metallica>, "1981") }

  8. Pig Extensions: TUPLIFY
     - avoid self-joins
     - convert plain triples to triple-bag format
     - using GROUP BY on any component
     - explicitly or implicitly (rewriting rules)

     Extended syntax:

         triple_groups = TUPLIFY triples BY subject;

     Corresponding native Pig statement:

         triple_groups = FOREACH (GROUP triples BY subject)
                         GENERATE group AS subject, triples.(predicate, object) AS stmts;
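The grouping that TUPLIFY performs can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not the authors' implementation. The function `tuplify`, the `COMPONENTS` table, and the sample data are assumptions made for this example.

```python
from collections import defaultdict

# Position of each triple component; the names follow the slide's schema.
COMPONENTS = {"subject": 0, "predicate": 1, "object": 2}


def tuplify(triples, by="subject"):
    """Group flat triples by one component.

    Returns a dict mapping each value of the grouping component to the list
    of remaining two-element tuples (the 'stmts' bag on the slides).
    """
    key_idx = COMPONENTS[by]
    groups = defaultdict(list)
    for triple in triples:
        # Keep the two components that are not the grouping key, in order.
        rest = tuple(v for i, v in enumerate(triple) if i != key_idx)
        groups[triple[key_idx]].append(rest)
    return dict(groups)


triples = [
    ("<event1>", "<artist>", "<Metallica>"),
    ("<event1>", "<type>", "<Concert>"),
    ("<Metallica>", "<type>", "<Band>"),
]

by_subject = tuplify(triples, by="subject")
by_predicate = tuplify(triples, by="predicate")
```

Grouping by `"subject"` yields one entry per entity, while grouping by `"predicate"` yields the alternative layout with subject-object pairs.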

  9. Pig Extensions: LOAD (local)
     - loading plain N3 files is supported natively by Pig
     - we use a UDF for tokenizing text lines to triples
     - RDFFileLoad macro:

         DEFINE RDFFileLoad(file) RETURNS T {
             lines = LOAD '$file' AS (txt: chararray);
             $T = FOREACH lines GENERATE FLATTEN(pig.RDFize(txt)) AS (subject, predicate, object);
         };
         triples = RDFFileLoad('hdfs:///rdf-data.nt');

     - load TUPLIFIED dataset using BinStorage:

         rdf_tuples = LOAD 'rdf-data.dat' USING BinStorage()
                      AS (subject: bytearray, stmts: bag{t:(predicate: bytearray, object: bytearray)});
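What a line-tokenizing UDF such as `pig.RDFize` has to do can be sketched in Python. The slides do not show the UDF's code, so the function `rdfize` and its regular expression below are assumptions: a simplified tokenizer for single N-Triples-style lines that does not cover the full grammar.

```python
import re

# Simplified pattern for one triple per line (an assumption, not the
# complete N-Triples grammar): IRIs in angle brackets, blank nodes, and
# quoted literals with an optional datatype or language tag.
TRIPLE_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'   # subject: IRI or blank node
    r'(<[^>]*>)\s+'             # predicate: IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'  # object
    r'\s*\.\s*$'                # terminating dot
)


def rdfize(line):
    """Split one text line into (subject, predicate, object), or None."""
    match = TRIPLE_RE.match(line)
    return match.groups() if match else None


typed = rdfize('<event1> <geo:lat> "5.353E1"^^<xsd:double> .')
plain = rdfize('<Metallica> <name> "Metallica" .')
skipped = rdfize('# a comment line')
```

Lines that do not look like a triple return `None` here. How the actual UDF treats malformed input is not stated on the slides.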

  10. Pig Extensions: LOAD (remote)
      - run SPARQL query on endpoints
      - filter remote data beforehand
      - depends on user query

          raw = LOAD 'http://endpoint.org:8080/sparql'
                USING SPARQLLoader('SELECT * WHERE { ?s ?p ?o }')
                AS (subject, predicate, object);

      Materialize (remote) data
      - share across queries
      - could be used for frequent intermediate results

          ("http://endpoint.org:8080/sparql", "SELECT * WHERE {?s ?p ?o}") --> "hdfs:///rdf-data.nt"

          raw = RDFFileLoad('hdfs:///rdf-data.nt');

  11. Pig Extensions: BGP Support

          result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };

      - extended FILTER operator
      - hide internal details of BGP processing
      - implementation depends on input schema
      - implemented as language extension
      - internal operators stay unchanged
      - rewriting step in Pig parser: transformation to native Pig code
      - Pig compiler for optimization

  12. Rewriting

      Example: FILTER (on grouping component)

          -- in: (s, {(p, o)}), out: (s, {(p, o)})
          out = FILTER in BY { 'value' ?p ?o };
          ==>
          out = FILTER in BY s == 'value';

      Example: FILTER (non-grouping component)

          -- in: (s, {(p, o)}), out: (s, {(p, o)})
          out = FILTER in BY { ?s ?p 'value' };
          ==>
          -- tmp: (s, {(p, o)}, cnt), out: (s, {(p, o)}, cnt)
          tmp = FOREACH in {
              t = FILTER stmts BY o == 'value';
              GENERATE *, COUNT(t) AS cnt;
          };
          out = FILTER tmp BY cnt > 0;
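The non-grouping rewrite above can be traced with a small Python model of the subject-grouped relation. This is a sketch of the semantics only. The function `filter_by_object` and the sample data are assumptions for this example, and literals are stored with their quotation marks to tell them apart from IRIs.

```python
# Subject-grouped relation: (subject, stmts), where stmts is a list of
# (predicate, object) pairs, following the slide's data model.
groups = [
    ("<event1>", [("<artist>", "<Metallica>"), ("<type>", "<Concert>")]),
    ("<Metallica>", [("<type>", "<Band>"), ("<name>", '"Metallica"')]),
]


def filter_by_object(groups, value):
    """Keep groups that contain at least one statement with the given object.

    Mirrors the rewritten script: a nested FILTER over stmts, COUNT of the
    matches, then an outer FILTER on cnt > 0. The whole group is kept and
    the count is appended as an extra field.
    """
    out = []
    for subject, stmts in groups:
        t = [(p, o) for p, o in stmts if o == value]  # nested FILTER
        cnt = len(t)                                  # COUNT(t)
        if cnt > 0:                                   # outer FILTER
            out.append((subject, stmts, cnt))
    return out


matches = filter_by_object(groups, '"Metallica"')
```

Note that the literal `"Metallica"` matches only the `<Metallica>` group. The `<event1>` group refers to the IRI `<Metallica>`, which is a different value.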

  13. Rewriting

      Example: STAR JOIN (on grouping component)

          -- in: (s, {(p, o)}), out: (s, {(p, o)})
          out = FILTER in BY { TP_1 . ... TP_N . };
          ==>
          -- tmp: (s, {(p, o)}, cnt1, ..., cntN), out: (s, {(p, o)}, cnt1, ..., cntN)
          tmp = FOREACH in {
              t1 = FILTER stmts BY p == 'p_1';
              ...
              tN = FILTER stmts BY p == 'p_N';
              GENERATE *, COUNT(t1) AS cnt1, ..., COUNT(tN) AS cntN;
          };
          out = FILTER tmp BY cnt1 > 0 AND ... AND cntN > 0;

      Concrete instance:

          result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
          ==>
          tmp = FOREACH triples {
              t1 = FILTER stmts BY predicate == '<geo:lat>';
              t2 = FILTER stmts BY predicate == '<geo:long>';
              GENERATE *, COUNT(t1) AS cnt1, COUNT(t2) AS cnt2;
          };
          result = FILTER tmp BY cnt1 > 0 AND cnt2 > 0;
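The star-join rewrite can be modelled the same way in Python: one count per triple pattern, and a group survives only if every count is positive. This is a sketch under assumptions. The function `star_filter` and the sample data are made up for the example, and only patterns with a constant predicate are handled.

```python
# Subject-grouped relation: (subject, stmts) with (predicate, object) pairs.
groups = [
    ("<event1>", [("<geo:lat>", "5.353E1"), ("<geo:long>", "-1.135E2")]),
    ("<event2>", [("<geo:lat>", "4.8E1")]),
    ("<Metallica>", [("<type>", "<Band>")]),
]


def star_filter(groups, predicates):
    """Keep groups whose statements cover every predicate of the star pattern.

    For each group, counts[i] plays the role of COUNT(t_i) in the rewritten
    script. The result tuples carry the counts as additional fields.
    """
    out = []
    for subject, stmts in groups:
        counts = [sum(1 for p, _ in stmts if p == wanted) for wanted in predicates]
        if all(c > 0 for c in counts):  # cnt1 > 0 AND ... AND cntN > 0
            out.append((subject, stmts, *counts))
    return out


result = star_filter(groups, ["<geo:lat>", "<geo:long>"])
```

Each group is visited once regardless of the number of patterns, in contrast to the one scan per pattern that a self-join plan on flat triples requires.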

  14. Evaluation: Self-Join
      - scripts manually rewritten
      - Dataset: 8 GB, 54 million statements
      - Hadoop cluster: 8 nodes, Pig 0.12

          triples = RDFLoad('hdfs:///eventful.nt');
          result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };

  15. Evaluation: FILTER (non-grouping component)

          triples = RDFLoad('hdfs:///eventful.nt');
          result = FILTER triples BY { ?s ?p "Metallica" };

  16. Conclusion
      - native Pig data model not suitable for RDF data: combination of self-joins and filter needed
      - support for BGP in Pig Latin
      - join with remote data
      - rewriter produces native Pig code: use Pig optimizer
      - allows easier and faster linked data processing in Pig
      - foundation for cost-based optimizer
      - materialized (intermediate) results
