Compressed RDF: Practical Uses & Hands-on
Antonio Fariña, Javier D. Fernández and Miguel A. Martínez-Prieto
3rd KEYSTONE Training School: Keyword search in Big Linked Data
23 August 2017
General agenda
Session I (09:00 - 10:30): "Basics of Compression for Big Linked Data Management"
  Big (Linked) Semantic Data Compression: motivation & challenges
  Compact Data Structures
Session II (13:30 - 15:00): "RDF Compression"
  RDF Compression. HDT
  RDF Dictionaries
  RDF Triples
Session III (15:30 - 17:00): "Compressed RDF: Practical Uses & Hands-on"
  Practical Uses (LOD-a-lot, RDF Archiving, etc.)
  Hands on
Agenda of this session
Practical uses
  LOD-a-lot: Web-scale queries in your pocket
  RDF archiving
  Linked Data markets (Linked Closed Data)
Hands on
  HDT-it
  Command line tools
  HDT and Fuseki
  HDT and Linked Data Fragments
  HDT and C++/Java
  HDT and Jena
Use case 1 LOD-a-lot
Still… what about Web-scale queries?
E.g. retrieve all entities in LOD with the label "Axel Polleres":
  select distinct ?x { ?x rdfs:label "Axel Polleres" }
Options:
  Crawl and index LOD locally (-no-)
  Follow-your-nose (where should I start?)
  Federated querying (as good as the endpoints you query)
  Use LOD Laundromat as a "good approximation" (still querying 650K datasets)
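The query on this slide is a single triple pattern with a fixed predicate and object. As a rough sketch only (the triples and example.org IRIs below are made up, and a plain Python set stands in for whatever index actually serves the data), resolving such a pattern amounts to a wildcard lookup:

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Toy data: hypothetical IRIs, not taken from LOD.
TRIPLES: Set[Triple] = {
    ("http://example.org/a", RDFS_LABEL, '"Axel Polleres"'),
    ("http://example.org/b", RDFS_LABEL, '"Axel Polleres"'),
    ("http://example.org/c", RDFS_LABEL, '"Someone Else"'),
    ("http://example.org/a", "http://example.org/knows", "http://example.org/c"),
}


def match(triples: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield the triples that match the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def subjects_with_label(triples: Set[Triple], label: str) -> list:
    """Distinct subjects ?x such that ?x rdfs:label "label"."""
    return sorted({s for s, _, _ in match(triples, p=RDFS_LABEL, o=f'"{label}"')})


print(subjects_with_label(TRIPLES, "Axel Polleres"))
```

A linear scan like this is only workable on toy data; the point of the options listed on the slide is that at Web scale the lookup has to be served by an index.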
LOD Laundromat
[Diagram: Linked Open Data feeds LOD Laundromat, which exposes a SPARQL endpoint (metadata) and Dataset 1 to Dataset 650K as N-Triples (zip)]
But what about Web-scale queries? LOD-a-lot (flashback)
The real motivation consume
The real motivation: "Oh man, I'm hungry and I don't even know if I will like whatever you are cooking" (consume)
Image: http://www.kunsan.af.mil/News/Article/413995/serving-the-masses/
But what about Web-scale queries? But one could be really hungry: LOD-a-lot
Image: https://hwy55burgers.wordpress.com/tag/food-challenge/
LOD-a-lot
[Diagram: Linked Open Data feeds LOD Laundromat, which exposes a SPARQL endpoint (metadata) and Dataset 1 to Dataset 650K as N-Triples (zip); these are combined into LOD-a-lot, 28B triples]
Kudos: Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias
LOD-a-lot (some numbers)
Disk size:
  HDT: 304 GB
  HDT-FoQ (additional indexes): 133 GB
  305 €
Memory footprint (to query):
  15.7 GB of RAM (3% of the size)
  144 seconds loading time
  Test machine: 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds.
(LOD-a-lot creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM.)
LOD-a-lot
http://purl.org/HDT/lod-a-lot
https://datahub.io/dataset/lod-a-lot
LOD-a-lot (some use cases)
  Query resolution at Web scale
  Evaluation and Benchmarking: no excuse
  RDF metrics and analytics (subjects, predicates, objects)
ACKs LOD-a-lot
Use case 2 Archiving
So far so good... But RDF is evolving
[Figure: update rate (second, minute, hour, day, week, month, year) against number of sources (10^0 to 10^6), placing Virtual/Augmented Reality, Internet of Things, Dyldo, LOD-a-lot, DBpedia and BTC, with the annotation "versions?". Source: Andreas Harth, Stream Reasoning in Mixed Reality Applications, Stream Reasoning Workshop 2015]
Linked Data Archives: The missing link in the RDF evolution
Most Semantic Web/Linked Data tools are focused on this "static view" but do not consider versioning/evolution.
Sindice, SWSE, Swoogle, LOD Cache, LOD-Laundromat… so far, no versions!
Preservation matters
Web archives: Common Crawl, Internet Memory, Internet Archive, …
…in the last few years: RDF evolution at Scale, one of the fundamental problems in the Web of Data
Research projects:
  Managing the Evolution and Preservation of the Data Web (FP7)
  Preserving Linked Data (FP7)
Archives
Tools: v-RDFCSA
Benchmarking: BEnchmark of RDF ARchives
RDF Archiving. Archiving policies
Each policy is accessed through a retrieval mediator.
a) Independent Copies/Snapshots (IC)
  V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
  V2: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .
  V3: ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .
b) Change-based approach (CB)
  V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
  V1,2: deleted ex:S2 ex:study ex:C1 . added ex:S3 ex:study ex:C1 .
  V2,3: deleted ex:C1 ex:hasProfessor ex:P1 . added ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 .
c) Timestamp-based approach (TB)
  ex:C1 ex:hasProfessor ex:P1 [V1,V2].
  ex:C1 ex:hasProfessor ex:P2 [V3].
  ex:C1 ex:hasProfessor ex:S2 [V3].
  ex:S1 ex:study ex:C1 [V1,V2,V3].
  ex:S2 ex:study ex:C1 [V1].
  ex:S3 ex:study ex:C1 [V2,V3].
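To make the three policies concrete, here is a minimal Python sketch that stores the slide's example under IC, derives CB and TB from it, and checks that each policy can materialize every version. The function names (build_cb, build_tb, mat_cb, mat_tb) and the in-memory layout are illustrative assumptions, not the layout used by any of the benchmarked systems.

```python
# IC: one full snapshot per version (the example from the slide).
IC = {
    1: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S2", "ex:study", "ex:C1")},
    2: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
    3: {("ex:C1", "ex:hasProfessor", "ex:P2"),
        ("ex:C1", "ex:hasProfessor", "ex:S2"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
}


def build_cb(ic):
    """CB: keep the first snapshot plus (version, added, deleted) deltas."""
    versions = sorted(ic)
    base = set(ic[versions[0]])
    deltas = [(cur, ic[cur] - ic[prev], ic[prev] - ic[cur])
              for prev, cur in zip(versions, versions[1:])]
    return base, deltas


def build_tb(ic):
    """TB: annotate each triple with the set of versions it appears in."""
    tb = {}
    for version, triples in ic.items():
        for t in triples:
            tb.setdefault(t, set()).add(version)
    return tb


def mat_cb(base, deltas, version):
    """Materialize a version from CB by replaying deltas in order."""
    result = set(base)
    for v, added, deleted in deltas:
        if v > version:
            break
        result = (result - deleted) | added
    return result


def mat_tb(tb, version):
    """Materialize a version from TB by filtering on the annotation."""
    return {t for t, versions in tb.items() if version in versions}


BASE, DELTAS = build_cb(IC)
TB = build_tb(IC)
for v in IC:
    assert mat_cb(BASE, DELTAS, v) == IC[v] == mat_tb(TB, v)
```

The sketch also hints at the trade-off: IC answers a version lookup directly, CB has to replay deltas, and TB has to filter annotations.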
BEAR https://aic.ai.wu.ac.at/qadlod/bear.html
BEAR: Benchmarking the Efficiency of RDF Archiving
Queries and systems
We implemented and evaluated archiving systems on Jena-TDB and HDT, based on the IC, CB and TB policies.
They serve as an initial baseline to compare archiving systems.
More info: https://aic.ai.wu.ac.at/qadlod/bear.html
Benchmarking: Define the queries
Instantiation of archive queries in AnQL [1]
Query types: Mat(Q,V1), Diff(Q,V1,V2), Ver(Q), join(Q1,vi,Q2,vj), Change(Q)
Mat(Q,V1): version materialization
  SELECT * WHERE { Q :[v1] }
[1] Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general framework for representing, reasoning and querying with annotated Semantic Web data. Journal of Web Semantics (JWS), 12:72-95, March 2012.
Benchmarking: Define the queries
Instantiation of archive queries in AnQL
Diff(Q,V1,V2): delta materialization
  SELECT * WHERE {
    { { {Q :[v1]} MINUS {Q :[v2]} } BIND (v1 AS ?V) }
    UNION
    { { {Q :[v2]} MINUS {Q :[v1]} } BIND (v2 AS ?V) }
  }
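The Diff query can be read as two set differences, each tagged with the version its results come from. A minimal sketch, assuming Q is a single triple pattern with None as a wildcard and the archive is a dict of snapshots (both are simplifications; evaluate and diff are hypothetical helper names):

```python
# Snapshots of the running example from the archiving-policies slide.
ARCHIVE = {
    1: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S2", "ex:study", "ex:C1")},
    2: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
    3: {("ex:C1", "ex:hasProfessor", "ex:P2"),
        ("ex:C1", "ex:hasProfessor", "ex:S2"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
}


def evaluate(pattern, triples):
    """Triples matching an (s, p, o) pattern; None is a wildcard."""
    return {t for t in triples
            if all(q is None or q == x for q, x in zip(pattern, t))}


def diff(pattern, v1, v2, archive):
    """Results present in only one of the two versions, tagged with that version."""
    r1 = evaluate(pattern, archive[v1])
    r2 = evaluate(pattern, archive[v2])
    return {(t, v1) for t in r1 - r2} | {(t, v2) for t in r2 - r1}


for triple, version in sorted(diff((None, "ex:study", "ex:C1"), 1, 2, ARCHIVE)):
    print(version, triple)
```

The version tag plays the role of the BIND clauses: it records on which side of the comparison each result was found.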
Benchmarking: Define the queries
Instantiation of archive queries in AnQL
Ver(Q): results of Q annotated with the version
  SELECT * WHERE { Q :?V }
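A version query leaves the version as a variable, so every result carries the versions in which it holds. The sketch below makes the same simplifying assumptions as before (Q is one triple pattern, the archive is a dict of snapshots, and evaluate and ver are hypothetical names):

```python
# Snapshots of the running example from the archiving-policies slide.
ARCHIVE = {
    1: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S2", "ex:study", "ex:C1")},
    2: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
    3: {("ex:C1", "ex:hasProfessor", "ex:P2"),
        ("ex:C1", "ex:hasProfessor", "ex:S2"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
}


def evaluate(pattern, triples):
    """Triples matching an (s, p, o) pattern; None is a wildcard."""
    return {t for t in triples
            if all(q is None or q == x for q, x in zip(pattern, t))}


def ver(pattern, archive):
    """Map each result of the pattern to the sorted list of versions where it holds."""
    annotated = {}
    for version in sorted(archive):
        for t in evaluate(pattern, archive[version]):
            annotated.setdefault(t, []).append(version)
    return annotated


for triple, versions in sorted(ver((None, "ex:study", "ex:C1"), ARCHIVE).items()):
    print(triple, versions)
```

Over a timestamp-based store the same answer could be read off the annotations directly; iterating over snapshots here just keeps the sketch short.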
Benchmarking: Define the queries
Instantiation of archive queries in AnQL
join(Q1,v1,Q2,v2)
  SELECT * WHERE { {Q1 :[v1]} {Q2 :[v2]} }
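A cross-version join evaluates each pattern in its own version and then joins the solutions on shared variables. The following sketch assumes each query is a single triple pattern whose variables start with "?"; solve and join are hypothetical helpers, and the nested-loop join is chosen for brevity, not efficiency.

```python
# Snapshots of versions 1 and 3 from the archiving-policies slide.
ARCHIVE = {
    1: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S2", "ex:study", "ex:C1")},
    3: {("ex:C1", "ex:hasProfessor", "ex:P2"),
        ("ex:C1", "ex:hasProfessor", "ex:S2"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
}


def solve(pattern, triples):
    """Variable bindings for one triple pattern; terms starting with '?' are variables."""
    solutions = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            solutions.append(binding)
    return solutions


def join(q1, v1, q2, v2, archive):
    """Join the solutions of q1 at version v1 with those of q2 at version v2."""
    joined = []
    for b1 in solve(q1, archive[v1]):
        for b2 in solve(q2, archive[v2]):
            if all(b1[k] == b2[k] for k in b1.keys() & b2.keys()):
                joined.append({**b1, **b2})
    return joined


# Who studied a course in V1, paired with that course's professors in V3.
rows = join(("?s", "ex:study", "?c"), 1, ("?c", "ex:hasProfessor", "?p"), 3, ARCHIVE)
for row in sorted(rows, key=lambda r: (r["?s"], r["?p"])):
    print(row)
```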
Benchmarking: Define the queries
Instantiation of archive queries in AnQL
Change(Q): returns consecutive versions in which the Diff of a query is not null
  SELECT ?V1 ?V2 WHERE {
    { {Q :?V1} MINUS {Q :?V2} }
    UNION
    { {Q :?V2} MINUS {Q :?V1} }
    FILTER( abs(?V1-?V2) = 1 )
  }
An open question remains: what is the right query syntax for archive queries?
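Read procedurally, Change(Q) walks over consecutive version pairs and keeps those where the result of Q differs. A minimal sketch under the same assumptions as the earlier ones (single triple pattern, dict of snapshots, hypothetical helper names); unlike the AnQL formulation, it reports each pair once, in ascending order:

```python
# Snapshots of the running example from the archiving-policies slide.
ARCHIVE = {
    1: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S2", "ex:study", "ex:C1")},
    2: {("ex:C1", "ex:hasProfessor", "ex:P1"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
    3: {("ex:C1", "ex:hasProfessor", "ex:P2"),
        ("ex:C1", "ex:hasProfessor", "ex:S2"),
        ("ex:S1", "ex:study", "ex:C1"),
        ("ex:S3", "ex:study", "ex:C1")},
}


def evaluate(pattern, triples):
    """Triples matching an (s, p, o) pattern; None is a wildcard."""
    return {t for t in triples
            if all(q is None or q == x for q, x in zip(pattern, t))}


def change(pattern, archive):
    """Pairs of consecutive versions between which the pattern's results differ."""
    versions = sorted(archive)
    return [(a, b) for a, b in zip(versions, versions[1:])
            if evaluate(pattern, archive[a]) != evaluate(pattern, archive[b])]


print(change((None, "ex:hasProfessor", None), ARCHIVE))
print(change((None, "ex:study", "ex:C1"), ARCHIVE))
```

In the running example the professors change only between V2 and V3, while the set of students changes only between V1 and V2.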
Time-based access. Queries: Materialize (s,?,? ; version)
Time-based access. Queries: diff(?,?,o ; version0 ; version t)