A Main Memory Index Structure to Query Linked Data, by Olaf Hartig

  1. A Main Memory Index Structure to Query Linked Data
     Olaf Hartig (http://olafhartig.de/foaf.rdf#olaf, @olafhartig)
     Frank Huber
     Database and Information Systems Research Group, Humboldt-Universität zu Berlin

  2. The Issue
     [Figure: plots for five queries: ContactInfoPhillipe (Query No. 36), UnsetPropsPhillipe (Query No. 37), 2ndDegree1Phillipe (Query No. 38), 2ndDegree2Phillipe (Query No. 39), IncomingPhillipe (Query No. 40). Series: "no reuse" and "given order". Axes: hit rate, number of query results, query execution time (in seconds).]
     Olaf Hartig - A Main Memory Index Structure to Query Linked Data

  3. The Issue
     [Figure: plots for five queries: ContactInfoPhillipe (Query No. 36), UnsetPropsPhillipe (Query No. 37), 2ndDegree1Phillipe (Query No. 38), 2ndDegree2Phillipe (Query No. 39), IncomingPhillipe (Query No. 40). Series: "no reuse" and "given order". Axes: hit rate, number of query results, query execution time (in seconds).]
     Descriptor objects in the query-local dataset after query execution: 172 and 533

  4. [Diagram labels: query-local dataset; Logical representation of Linked Data from the Web; Physical representation of Linked Data from the Web: ?]
     What data structure do we use to physically represent the query-local dataset?

  5. Outline
     1. Requirements + Existing Work
     2. Data Structures
     3. Evaluation

  6. Requirements
     ● (Consecutively) build and use ad hoc collections of many small sets of RDF triples
     ● Four main operations:
       ● Find … matching triples for a triple pattern in all descriptor objects
       ● Add, Remove, Replace … descriptor objects
     ● Support of concurrent access (i.e. isolation)
     ● Non-relevant properties:
       ● Querying descriptor objects individually is not necessary
       ● No need to write data back to the Web
       ● ACID properties not required for complete queries

  8. Existing Work
     ● Disk based storage solutions for RDF data
       ● Unsuitable due to very costly I/O operations
     ● Main memory based data structures in the literature
       ● Focus on a large, single set of RDF triples
       ● Optimized for complete graph pattern queries or path queries
     ● Main memory based data structures in RDF frameworks
       ● Focus on Jena, ARQ and NG4J
       ● Inefficient (see evaluation)

  9. Outline
     1. Requirements + Existing Work
     2. Data Structures
     3. Evaluation

  10. Hash-Based Index for RDF Data
      [Diagram: logical representation; physical representation with Dict and hash tables S, P, O, SP, PO, SO]
      ● Dictionary:
        ● Two-way mapping between RDF terms and numerical identifiers
      ● 6 hash tables:
        ● Each hash table contains all ID-encoded triples
        ● Efficient support for all types of triple patterns
      *Similar to Harth and Decker, 2005

  11. Hash-Based Index for RDF Data
      [Diagram: logical representation; physical representation with Dict and hash tables S, P, O, SP, PO, SO; ID-encoded triple $t_{id} = (id_s, id_p, id_o)$]
      ● Dictionary:
        ● Two-way mapping between RDF terms and numerical identifiers
      ● 6 hash tables:
        ● Each hash table contains all ID-encoded triples
        ● Efficient support for all types of triple patterns
      *Similar to Harth and Decker, 2005

  12. Hash-Based Index for RDF Data
      [Diagram: logical representation; physical representation with Dict and hash tables S, P, O, SP, PO, SO; ID-encoded triple $t_{id} = (id_s, id_p, id_o)$]
      Example: Find (?acq, knows, http://bob.name)
      ● Dictionary:
        ● Two-way mapping between RDF terms and numerical identifiers
      ● 6 hash tables:
        ● Each hash table contains all ID-encoded triples
        ● Efficient support for all types of triple patterns
      *Similar to Harth and Decker, 2005
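
The dictionary and the six hash tables described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class names `Dictionary` and `HashIndex`, the `TABLE_KEYS` constant, the use of `None` for variables, and the example terms other than `knows` and `http://bob.name` are all assumptions made for the sketch.

```python
from collections import defaultdict

# Bound positions covered by each hash table: S, P, O, SP, PO, SO.
# (Hypothetical layout; positions are 0 = subject, 1 = predicate, 2 = object.)
TABLE_KEYS = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]


class Dictionary:
    """Two-way mapping between RDF terms and numerical identifiers."""

    def __init__(self):
        self._ids = {}    # term -> ID
        self._terms = []  # ID -> term

    def encode(self, term):
        """Return the ID of a term, assigning a fresh one on first use."""
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, term_id):
        return self._terms[term_id]


class HashIndex:
    """Six hash tables, each of which holds every ID-encoded triple."""

    def __init__(self):
        self.tables = {key: defaultdict(set) for key in TABLE_KEYS}

    def add(self, triple):
        for key, table in self.tables.items():
            table[tuple(triple[i] for i in key)].add(triple)

    def find(self, pattern):
        """Return all triples matching a pattern; None marks a variable."""
        bound = tuple(i for i, x in enumerate(pattern) if x is not None)
        if not bound:  # (?s, ?p, ?o): every triple is stored in the S table
            return set().union(*self.tables[(0,)].values())
        # Fully bound patterns probe the SP table and filter on the object.
        key = bound if len(bound) < 3 else (0, 1)
        candidates = self.tables[key].get(tuple(pattern[i] for i in key), set())
        return {t for t in candidates if all(t[i] == pattern[i] for i in bound)}


d = Dictionary()
index = HashIndex()
index.add(tuple(d.encode(t) for t in ("http://alice.name", "knows", "http://bob.name")))
index.add(tuple(d.encode(t) for t in ("http://alice.name", "name", '"Alice"')))

# Find (?acq, knows, http://bob.name): a single lookup in the PO table.
pattern = (None, d.encode("knows"), d.encode("http://bob.name"))
matches = index.find(pattern)
acquaintances = [d.decode(s) for s, _, _ in matches]
```

In this sketch, patterns with one or two bound positions are answered by a single hash lookup, which is one way to read "efficient support for all types of triple patterns"; the cost is that every triple is stored six times.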

  13. Individual Indexing
      [Diagram: query-local dataset (logical representation); physical representation with Dict and one set of hash tables (S, P, O, SP, PO, SO) per descriptor object]
      ● Idea: Index each descriptor object separately
      ● Implementation of the four operations:
        ● Add, Remove, and Replace are straightforward
        ● Find requires iterating over all indexes

  14. Individual Indexing
      [Diagram: query-local dataset (logical representation); physical representation with Dict and one set of hash tables (S, P, O, SP, PO, SO) per descriptor object]
      Example: Find (?acq, knows, http://bob.name)
      ● Idea: Index each descriptor object separately
      ● Implementation of the four operations:
        ● Add, Remove, and Replace are straightforward
        ● Find requires iterating over all indexes
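
A rough sketch of individual indexing, under stated assumptions: the names `IndividualIndexing`, `TinyIndex`, and `matches` are hypothetical, triples are already ID-encoded as integer tuples, and `TinyIndex` is a plain set standing in for the six-table hash index to keep the example short. Concurrent access is not modelled.

```python
def matches(triple, pattern):
    """True if the triple fits the pattern; None marks a variable."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


class TinyIndex:
    """Stand-in for one six-table hash index (assumption: a plain set)."""

    def __init__(self, triples):
        self.triples = set(triples)

    def find(self, pattern):
        return {t for t in self.triples if matches(t, pattern)}


class IndividualIndexing:
    """One separate index per descriptor object."""

    def __init__(self):
        self.indexes = {}  # descriptor object ID -> index of its triples

    def add(self, obj_id, triples):
        self.indexes[obj_id] = TinyIndex(triples)

    def remove(self, obj_id):
        self.indexes.pop(obj_id, None)

    def replace(self, obj_id, triples):
        # Swap in a freshly built index for the same descriptor object.
        self.indexes[obj_id] = TinyIndex(triples)

    def find(self, pattern):
        results = set()
        for index in self.indexes.values():  # every index must be probed
            results |= index.find(pattern)
        return results


store = IndividualIndexing()
store.add("d1", [(1, 2, 3), (4, 2, 3)])
store.add("d2", [(1, 2, 3), (5, 6, 7)])
found = store.find((None, 2, 3))
```

Add, Remove, and Replace touch a single entry of `indexes`, while the cost of `find` grows with the number of descriptor objects, since each per-object index is probed in turn.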

  15. Combined Indexing
      [Diagram: query-local dataset (logical representation); physical representation with Dict and a single set of hash tables S, P, O, SP, PO, SO]
      ● Idea: Use a single index for all descriptor objects
      ● src – maps each triple to a set of descriptor object IDs

  16. Combined Indexing
      [Diagram: query-local dataset (logical representation); physical representation with Dict and a single set of hash tables S, P, O, SP, PO, SO; ID-encoded triple $t_{id} = (id_s, id_p, id_o)$]
      ● Idea: Use a single index for all descriptor objects
      ● src – maps each triple to a set of descriptor object IDs

  17. Combined Indexing
      [Diagram: query-local dataset (logical representation); physical representation with Dict and a single set of hash tables S, P, O, SP, PO, SO; ID-encoded triple $t_{id} = (id_s, id_p, id_o)$ together with $src(t_{id}) = \{\,\cdot\,, \cdot\,\}$, a set of two descriptor object IDs]
      ● Idea: Use a single index for all descriptor objects
      ● src – maps each triple to a set of descriptor object IDs
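
Combined indexing might look roughly like this. The sketch assumes ID-encoded integer triples, uses a plain set as a stand-in for the single six-table hash index, and the names `CombinedIndexing` and `matches` are hypothetical. How Remove locates the triples of a descriptor object is not stated on the slides; scanning `src` is an assumption of this sketch.

```python
from collections import defaultdict


def matches(triple, pattern):
    """True if the triple fits the pattern; None marks a variable."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


class CombinedIndexing:
    """A single index for all descriptor objects, plus the src mapping."""

    def __init__(self):
        self.index = set()           # stand-in for the single hash index
        self.src = defaultdict(set)  # triple -> IDs of descriptor objects containing it

    def add(self, obj_id, triples):
        for t in triples:
            self.index.add(t)        # a shared triple is stored only once
            self.src[t].add(obj_id)

    def remove(self, obj_id):
        # Assumption: scan src to find the triples of this descriptor object.
        for t in [t for t, owners in self.src.items() if obj_id in owners]:
            self.src[t].discard(obj_id)
            if not self.src[t]:      # no descriptor object contains t any more
                del self.src[t]
                self.index.discard(t)

    def replace(self, obj_id, triples):
        self.remove(obj_id)
        self.add(obj_id, triples)

    def find(self, pattern):
        # One probe of the single index, regardless of the number of objects.
        return {t for t in self.index if matches(t, pattern)}


store = CombinedIndexing()
store.add("d1", [(1, 2, 3), (4, 2, 3)])
store.add("d2", [(1, 2, 3), (5, 6, 7)])
sources = set(store.src[(1, 2, 3)])
found = store.find((None, 2, 3))
```

Here the triple `(1, 2, 3)` occurs in both descriptor objects but is indexed once; `src` records both owners, so removing one descriptor object leaves the triple in the index.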

  18. Quad Indexing
      [Diagram: query-local dataset (logical representation); physical representation with Dict and a single set of hash tables S, P, O, SP, PO, SO]
      ● Idea: Use a single quad index for all descriptor objects
      ● quad = ID-encoded triple + descriptor object ID
      $q = ((id_s, id_p, id_o), \cdot)$, where the second component is the descriptor object ID
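
A comparable sketch for quad indexing, again with hypothetical names (`QuadIndexing`, `matches`), ID-encoded integer triples, and a plain set standing in for the single hash-based quad index:

```python
def matches(triple, pattern):
    """True if the triple fits the pattern; None marks a variable."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


class QuadIndexing:
    """A single index whose entries are quads: (triple, descriptor object ID)."""

    def __init__(self):
        self.quads = set()  # stand-in for the single hash-based quad index

    def add(self, obj_id, triples):
        # No separate src mapping: the owner travels with each entry.
        self.quads.update((tuple(t), obj_id) for t in triples)

    def remove(self, obj_id):
        self.quads = {q for q in self.quads if q[1] != obj_id}

    def replace(self, obj_id, triples):
        self.remove(obj_id)
        self.add(obj_id, triples)

    def find(self, pattern):
        # The same triple may occur in several quads; report it once.
        return {t for t, _ in self.quads if matches(t, pattern)}


store = QuadIndexing()
store.add("d1", [(1, 2, 3), (4, 2, 3)])
store.add("d2", [(1, 2, 3), (5, 6, 7)])
found = store.find((None, 2, 3))
quad_count = len(store.quads)
```

Compared with the `src` mapping of combined indexing, a triple shared by two descriptor objects is stored as two quads here, and adding a descriptor object only inserts quads without updating a second structure.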

  19. Outline
      1. Requirements + Existing Work
      2. Data Structures
      3. Evaluation

  20. Experiment Setup
      Does this affect the overall execution time for link traversal based query executions?

  21. Experiment Setup
      Does this affect the overall execution time for link traversal based query executions?
      ● Simulation of the Web of Data
        ● Linked Data server publishes BSBM dataset (scal. factor: 50)
        ● Adjusted BSBM queries link to the simulation server
      ● Experiment:
        ● Sequence of 200 query mixes
        ● Reuse of the query-local dataset for the whole sequence
        ● IndIR, CombIR, and QuadIR (as presented), engine: SQUIN
        ● NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client

  22. Execution Time
      [Figure: two plots over the query mix (0 to 200). Axes: execution time in seconds; overall number of descr. objects in the queried dataset. Series: NG4J (SWClLib), IndIR (m=4), CombIR (m=12), CombQuadIR (m=12).]

  24. Summary
      ● Three hash index based data structures:
        ● Individual indexing
        ● Combined indexing
        ● Quad indexing
      ● Findings:
        ● A single index improves query performance significantly
        ● Smaller load times with quads
      ● Also for other use cases of ad hoc storing of Linked Data
        ● Consecutively retrieved from remote sources
        ● Used for immediate local processing

  25. Backup Slides
