A Main Memory Index Structure to Query Linked Data
Olaf Hartig (http://olafhartig.de/foaf.rdf#olaf, @olafhartig)
Frank Huber
Database and Information Systems Research Group, Humboldt-Universität zu Berlin
The Issue
[Figure: hit rate, number of query results, and query execution time (in seconds) for the series "no reuse" and "given order", over the queries ContactInfoPhillipe (Query No. 36), UnsetPropsPhillipe (Query No. 37), 2ndDegree1Phillipe (Query No. 38), 2ndDegree2Phillipe (Query No. 39), and IncomingPhillipe (Query No. 40)]
Descriptor objects in the query-local dataset after query execution: 172, 533
[Diagram: the query-local dataset is the logical representation of Linked Data from the Web; its physical representation is the open question]
What data structure do we use to physically represent the query-local dataset?
Outline
1. Requirements + Existing Work
2. Data Structures
3. Evaluation
Requirements
● (Consecutively) build and use ad hoc collections of many small sets of RDF triples
● Four main operations:
  ● Find … matching triples for a triple pattern in all descriptor objects
  ● Add, Remove, Replace … descriptor objects
● Support of concurrent access (i.e., isolation)
● Non-relevant properties:
  ● Querying descriptor objects individually is not necessary
  ● No need to write data back to the Web
  ● ACID properties not required for complete queries
Existing Work
● Disk-based storage solutions for RDF data
  ● Unsuitable due to very costly I/O operations
● Main-memory-based data structures in the literature
  ● Focus on a large, single set of RDF triples
  ● Optimized for complete graph pattern queries or path queries
● Main-memory-based data structures in RDF frameworks
  ● Focus on Jena, ARQ and NG4J
  ● Inefficient (see evaluation)
Hash-Based Index for RDF Data*
[Diagram: the logical representation (RDF triples) is mapped to a physical representation consisting of a dictionary (Dict) and the hash tables S, P, O, SP, PO, SO; example lookup: Find (?acq, knows, http://bob.name)]
● Dictionary:
  ● Two-way mapping between RDF terms and numerical identifiers
● 6 hash tables:
  ● Each hash table contains all ID-encoded triples t_id = (id_s, id_p, id_o)
  ● Efficient support for all types of triple patterns
*Similar to Harth and Decker, 2005
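The slide above can be illustrated with a minimal Python sketch, assuming plain strings as RDF terms and `None` as the wildcard of a triple pattern. The class and method names (`Dictionary`, `HashIndex`, `encode`, `lookup`, `decode`, `find`) are hypothetical and are not taken from the slides or from SQUIN; the choice of which table to probe is also an assumption.

```python
from collections import defaultdict
from itertools import chain


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self):
        self._id_of = {}    # term -> identifier
        self._term_of = []  # identifier -> term

    def encode(self, term):
        # Assign the next free identifier to a term seen for the first time.
        if term not in self._id_of:
            self._id_of[term] = len(self._term_of)
            self._term_of.append(term)
        return self._id_of[term]

    def lookup(self, term):
        return self._id_of.get(term)  # None if the term is unknown

    def decode(self, ident):
        return self._term_of[ident]


class HashIndex:
    # Key positions of the six hash tables (0 = subject, 1 = predicate, 2 = object).
    KEYS = {"S": (0,), "P": (1,), "O": (2,), "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}

    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.tables = {pos: defaultdict(set) for pos in self.KEYS.values()}

    def add(self, s, p, o):
        t = tuple(self.dictionary.encode(x) for x in (s, p, o))
        # Every table receives every ID-encoded triple, under a different key.
        for pos, table in self.tables.items():
            table[tuple(t[i] for i in pos)].add(t)

    def find(self, s=None, p=None, o=None):
        ids = []
        for term in (s, p, o):
            ident = None if term is None else self.dictionary.lookup(term)
            if term is not None and ident is None:
                return  # a bound term was never indexed, so nothing matches
            ids.append(ident)
        bound = tuple(i for i, v in enumerate(ids) if v is not None)
        if bound:
            pos = bound[:2]  # a fully bound pattern probes SP and filters on O
            candidates = tuple(self.tables[pos].get(tuple(ids[i] for i in pos), ()))
        else:
            candidates = tuple(chain.from_iterable(self.tables[(0,)].values()))
        for t in candidates:
            if all(ids[i] is None or ids[i] == t[i] for i in range(3)):
                yield tuple(self.dictionary.decode(x) for x in t)
```

With this sketch, the example lookup from the diagram would read `index.find(p="knows", o="http://bob.name")`, which probes the PO table with a single hash lookup.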
Individual Indexing
[Diagram: the query-local dataset (logical representation) is physically represented by a dictionary (Dict) and one set of hash tables S, P, O, SP, PO, SO per descriptor object; example lookup: Find (?acq, knows, http://bob.name)]
● Idea: Index each descriptor object separately
● Implementation of the four operations:
  ● Add, Remove, and Replace are straightforward
  ● Find requires iterating over all indexes
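A minimal Python sketch of the individual indexing idea follows. To keep it short, each per-object index is reduced to a frozen set of triples with a linear scan, which stands in for the six-table hash index; terms are plain strings and `None` is the wildcard. The class name `IndividualIndexing`, the descriptor object IDs, and the de-duplication of triples in `find` are assumptions for illustration.

```python
class IndividualIndexing:
    """One separate index per descriptor object."""

    def __init__(self):
        self._indexes = {}  # descriptor object ID -> index of that object's triples

    def add(self, obj_id, triples):
        # Building a fresh index touches no other descriptor object.
        self._indexes[obj_id] = frozenset(tuple(t) for t in triples)

    def remove(self, obj_id):
        # Dropping a descriptor object means dropping its index.
        self._indexes.pop(obj_id, None)

    def replace(self, obj_id, triples):
        # Swap in a new index for the same descriptor object.
        self._indexes[obj_id] = frozenset(tuple(t) for t in triples)

    def find(self, s=None, p=None, o=None):
        pattern = (s, p, o)
        seen = set()
        # Find has to visit every index, so its cost grows with the
        # number of descriptor objects in the query-local dataset.
        for index in list(self._indexes.values()):
            for t in index:
                if t in seen:
                    continue
                if all(q is None or q == x for q, x in zip(pattern, t)):
                    seen.add(t)
                    yield t
```

The sketch makes the trade-off on the slide visible: the three update operations are single dictionary operations, while `find` loops over all indexes.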
Combined Indexing
[Diagram: the query-local dataset (logical representation) is physically represented by a single index: a dictionary (Dict) and the hash tables S, P, O, SP, PO, SO]
● Idea: Use a single index for all descriptor objects
● src: maps each triple to a set of descriptor object IDs
  ● t_id = (id_s, id_p, id_o) + src(t_id) = {IDs of the descriptor objects that contain t_id}
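The role of src can be sketched in Python as follows. The sketch assumes plain strings as terms, `None` as wildcard, and a linear scan in place of the six hash tables; the reverse map `_triples_of` is an added helper that is not on the slide, and all names are hypothetical.

```python
from collections import defaultdict


class CombinedIndexing:
    """A single index for all descriptor objects, plus the src mapping."""

    def __init__(self):
        self._src = defaultdict(set)  # triple -> IDs of descriptor objects containing it
        self._triples_of = {}         # assumed helper: descriptor object ID -> its triples

    def add(self, obj_id, triples):
        if obj_id in self._triples_of:
            raise ValueError("descriptor object already present; use replace()")
        triples = frozenset(tuple(t) for t in triples)
        self._triples_of[obj_id] = triples
        for t in triples:
            self._src[t].add(obj_id)  # a shared triple is stored once, with several sources

    def remove(self, obj_id):
        for t in self._triples_of.pop(obj_id, ()):
            owners = self._src[t]
            owners.discard(obj_id)
            if not owners:
                del self._src[t]  # the last descriptor object containing t is gone

    def replace(self, obj_id, triples):
        self.remove(obj_id)
        self.add(obj_id, triples)

    def sources(self, triple):
        # src(t): the descriptor objects that currently contain the triple.
        return frozenset(self._src.get(tuple(triple), ()))

    def find(self, s=None, p=None, o=None):
        pattern = (s, p, o)
        # One pass over one index, regardless of the number of descriptor objects.
        for t in list(self._src):
            if all(q is None or q == x for q, x in zip(pattern, t)):
                yield t
```

Compared with individual indexing, `find` no longer iterates over per-object indexes, while `remove` and `replace` now have to maintain src for every affected triple.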
Quad Indexing
[Diagram: the query-local dataset (logical representation) is physically represented by a single quad index: a dictionary (Dict) and the hash tables S, P, O, SP, PO, SO]
● Idea: Use a single quad index for all descriptor objects
● quad = ID-encoded triple + descriptor object ID
  ● q = ((id_s, id_p, id_o), descriptor object ID)
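A possible reading of the quad index, sketched in Python: the six hash tables keep their triple-based keys, but their entries are quads, so a triple that occurs in two descriptor objects is stored as two quads. ID encoding via the dictionary is omitted for brevity; the helper map `_quads_of`, the de-duplication in `find`, and all names are assumptions rather than the authors' implementation.

```python
from collections import defaultdict

# Key positions of the six hash tables (0 = subject, 1 = predicate, 2 = object).
KEYS = {"S": (0,), "P": (1,), "O": (2,), "SP": (0, 1), "PO": (1, 2), "SO": (0, 2)}


class QuadIndexing:
    """A single index whose entries are quads: (triple, descriptor object ID)."""

    def __init__(self):
        self._tables = {pos: defaultdict(set) for pos in KEYS.values()}
        self._quads_of = defaultdict(set)  # assumed helper: object ID -> its quads

    def add(self, obj_id, triples):
        for t in triples:
            quad = (tuple(t), obj_id)
            self._quads_of[obj_id].add(quad)
            for pos, table in self._tables.items():
                table[tuple(quad[0][i] for i in pos)].add(quad)

    def remove(self, obj_id):
        for quad in self._quads_of.pop(obj_id, ()):
            for pos, table in self._tables.items():
                key = tuple(quad[0][i] for i in pos)
                table[key].discard(quad)
                if not table[key]:
                    del table[key]  # drop empty buckets

    def replace(self, obj_id, triples):
        self.remove(obj_id)
        self.add(obj_id, triples)

    def find(self, s=None, p=None, o=None):
        pattern = (s, p, o)
        bound = tuple(i for i, v in enumerate(pattern) if v is not None)
        if bound:
            pos = bound[:2]  # a fully bound pattern probes SP and filters on O
            quads = tuple(self._tables[pos].get(tuple(pattern[i] for i in pos), ()))
        else:
            quads = tuple(q for qs in self._quads_of.values() for q in qs)
        seen = set()
        for t, _obj_id in quads:
            # The same triple may appear in several quads; report it once.
            if t not in seen and all(v is None or v == x for v, x in zip(pattern, t)):
                seen.add(t)
                yield t
```

In this sketch no separate src mapping has to be maintained on updates, at the price of storing shared triples once per descriptor object.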
Experiment Setup
Does the choice of data structure affect the overall execution time of link-traversal-based query executions?
● Simulation of the Web of Data
  ● Linked Data server publishes the BSBM dataset (scaling factor: 50)
  ● Adjusted BSBM queries link to the simulation server
● Experiment:
  ● Sequence of 200 query mixes
  ● Reuse of the query-local dataset for the whole sequence
  ● IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
  ● NamedGraphSetImpl (NG4J/Jena), engine: Semantic Web Client Library
Execution Time
[Figure: two plots over the sequence of query mixes (0 to 200): execution time in seconds, and overall number of descriptor objects in the queried dataset. Series: NG4J (SWClLib); IndIR, m=4; CombIR, m=12; CombQuadIR, m=12]
Summary
● Three hash-index-based data structures:
  ● Individual indexing
  ● Combined indexing
  ● Quad indexing
● Findings:
  ● A single index improves query performance significantly
  ● Smaller load times with quads
● Also applicable to other use cases of ad hoc storage of Linked Data that is
  ● Consecutively retrieved from remote sources
  ● Used for immediate local processing