Compressed Indexes for Fast Search of Semantic Data
Raffaele Perego (ISTI-CNR, Pisa, Italy), Giulio Ermanno Pibiri (ISTI-CNR, Pisa, Italy), Rossano Venturini (The University of Pisa, Pisa, Italy)
The 10th Italian Information Retrieval Workshop (IIR 2019), 17/09/2019

Resource Description Framework (RDF)
“RDF is a standard model for data interchange on the Web.” Source: https://www.w3.org/RDF
Statements are encoded with triples: Subject (S) - Predicate (P) - Object (O).
Example: “Bob Smith knows John Doe.”
<http://example.name#BobSmith12> <http://xmlns.com/foaf/0.1/knows> <http://example.name#JohnDoe34>
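As a minimal sketch of the triple model, the example statement can be held as a three-field record. The names `Triple` and `parse_statement` are hypothetical and not taken from the talk, and the parser assumes three whitespace-separated, angle-bracketed URIs; it is not a full N-Triples parser (literals containing spaces would break it).

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def parse_statement(line: str) -> Triple:
    """Split a line of three <...>-delimited URIs into a Triple.

    Assumption: terms are URIs without embedded whitespace.
    """
    terms = [term.strip("<>") for term in line.split()]
    if len(terms) != 3:
        raise ValueError(f"expected 3 terms, got {len(terms)}")
    return Triple(*terms)


statement = parse_statement(
    "<http://example.name#BobSmith12> "
    "<http://xmlns.com/foaf/0.1/knows> "
    "<http://example.name#JohnDoe34>"
)
```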

The problem
Huge datasets: billions of triples. Storage space is an issue: compression is mandatory.
How to support triple selection patterns (with wildcards) efficiently?
Examples: <Bob Smith> <knows> <???>; <???> <???> <John Doe>; <Bob Smith> <???> <Sara Parker>
• 0 wildcards: SPO
• 1 wildcard: SP?, S?O, ?PO
• 2 wildcards: S??, ?P?, ??O
• 3 wildcards: ???
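To make the semantics of a selection pattern concrete, here is a naive baseline that scans every triple, with `None` standing for a wildcard. This is only an illustration of what a pattern returns, not the indexing method of the talk; the function name `select` and the sample triples (beyond “Bob Smith knows John Doe”) are made up for the example.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def select(triples: List[Triple], pattern: Pattern) -> List[Triple]:
    """Return the triples matching the pattern; None matches any value.

    Linear scan: O(number of triples) per query, which is what an index avoids.
    """
    return [
        triple
        for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


# Illustrative data; only the first statement comes from the slides.
triples = [
    ("Bob Smith", "knows", "John Doe"),
    ("Bob Smith", "knows", "Sara Parker"),
    ("Sara Parker", "knows", "John Doe"),
]

sp_wild = select(triples, ("Bob Smith", "knows", None))   # pattern SP?
wild_o = select(triples, (None, None, "John Doe"))        # pattern ??O
s_wild_o = select(triples, ("Bob Smith", None, "Sara Parker"))  # pattern S?O
```

A scan like this is unusable on billions of triples, which is the motivation for the index structures on the following slides.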

State-of-the-art solutions
• Materialize all possible S-P-O permutations (6 separate indexes).
• Do not use sophisticated compression techniques.
• Expensive additional indexes to support retrieval.
Too costly in terms of space.

The Permuted Trie Index: preliminaries
Map URI strings to integers to reduce space requirements: we deal with datasets of integer triples.
Selection patterns, grouped by the permutation that resolves them:
• S-P-O order: SPO, SP?, S??, ???
• P-O-S order: ?PO, ?P?
• O-S-P order: S?O, ??O
Store an integer trie data structure for each permutation.
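The two preliminaries can be sketched as follows: strings are replaced by integer IDs, and each pattern is routed to the permutation in which its bound components form a prefix. The single shared dictionary with IDs assigned in sorted order is an assumption made for this sketch (the slides do not say how IDs are assigned), and all function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def build_dictionary(terms: Iterable[str]) -> Dict[str, int]:
    """Assign consecutive integer IDs to the distinct strings, in sorted order.

    Assumption: one dictionary shared by subjects, predicates and objects.
    """
    return {term: i for i, term in enumerate(sorted(set(terms)))}


# Pattern-to-permutation table as listed on the slide.
PERMUTATION_FOR_PATTERN = {
    "SPO": "SPO", "SP?": "SPO", "S??": "SPO", "???": "SPO",
    "?PO": "POS", "?P?": "POS",
    "S?O": "OSP", "??O": "OSP",
}


def permute(triple: Tuple[int, int, int], order: str) -> Tuple[int, ...]:
    """Reorder an (s, p, o) triple according to an order such as 'POS'."""
    by_name = dict(zip("SPO", triple))
    return tuple(by_name[component] for component in order)


def bound_is_prefix(pattern: str, order: str) -> bool:
    """True if the non-wildcard components of the pattern lead the given order."""
    bound = {component for component in pattern if component != "?"}
    return set(order[: len(bound)]) == bound


triple = ("Bob Smith", "knows", "John Doe")
ids = build_dictionary(triple)
encoded = tuple(ids[term] for term in triple)
```

Because the bound components always lead the chosen order, every pattern reduces to a prefix lookup in the corresponding trie.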

The Permuted Trie Index: organisation
[Figure: example trie with labels 0 1 2 3 4; the pattern (1, 2, ?) matches the triples (1, 2, 0) and (1, 2, 1).]
Allows effective compression:
• Common prefixes are encoded once.
• Two integer sequences per level (nodes and pointers).
Fast retrieval:
• Symmetrically support all selection patterns with 1 and 2 wildcards.
• Cache-friendly memory layout.
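A rough sketch of the "nodes and pointers" layout, assuming plain Python lists in place of the compressed integer sequences used by the actual index. The class name `Trie3`, the method `select_prefix2`, and the sample triples are hypothetical; only the query (1, 2, ?) and its two matches come from the slide.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


class Trie3:
    """Three-level trie over integer triples, stored as flat arrays per level.

    nodes[l] holds the node labels of level l; pointers[l][i] and
    pointers[l][i + 1] delimit the children of node i inside nodes[l + 1].
    """

    def __init__(self, triples: Iterable[Tuple[int, int, int]]) -> None:
        self.nodes: List[List[int]] = [[], [], []]
        self.pointers: List[List[int]] = [[], []]
        prev = None
        for t in sorted(set(triples)):
            # First level where t differs from the previous triple:
            # levels above it share a prefix and are not stored again.
            diff = 0 if prev is None else next(i for i in range(3) if t[i] != prev[i])
            for level in range(diff, 3):
                self.nodes[level].append(t[level])
                if level < 2:
                    # Children of the new node start at the current end of the next level.
                    self.pointers[level].append(len(self.nodes[level + 1]))
            prev = t
        for level in range(2):
            self.pointers[level].append(len(self.nodes[level + 1]))  # end sentinel

    def _find(self, level: int, lo: int, hi: int, value: int) -> int:
        """Binary search for value among the siblings nodes[level][lo:hi]."""
        i = bisect_left(self.nodes[level], value, lo, hi)
        return i if i < hi and self.nodes[level][i] == value else -1

    def select_prefix2(self, a: int, b: int) -> List[Tuple[int, int, int]]:
        """Resolve the pattern (a, b, ?)."""
        i = self._find(0, 0, len(self.nodes[0]), a)
        if i < 0:
            return []
        j = self._find(1, self.pointers[0][i], self.pointers[0][i + 1], b)
        if j < 0:
            return []
        lo, hi = self.pointers[1][j], self.pointers[1][j + 1]
        return [(a, b, c) for c in self.nodes[2][lo:hi]]


trie = Trie3([(0, 0, 1), (1, 2, 0), (1, 2, 1), (1, 3, 0), (2, 0, 0)])
matches = trie.select_prefix2(1, 2)
```

The matches of a prefix query sit in one contiguous slice of the last level, which is what makes the layout cache-friendly.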

The Permuted Trie Index: refinements
1. Cross Compression
2. Permutation Elimination

Cross Compression
Fact: the same triple appears three times, but in different permutations.
We can represent the subjects in trie 1 by using the subjects in trie 2: represent S_j as its position p.
[Figure: the subjects S_1 … S_j … S_n appear both under the pair (P, O_i) in one trie and under O_i in the other; S_j is stored as its position p among the children of O_i.]
Why? [Table: number of children in DBpedia.]
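The position trick can be sketched as below, under the assumption (read off the figure residue, not stated in the slide text) that the subjects under a (P, O_i) pair are a subset of the sorted subjects under O_i in the other trie. The function names and the sample IDs are invented for illustration.

```python
from bisect import bisect_left
from typing import List


def cross_compress(subjects_of_po: List[int], subjects_of_o: List[int]) -> List[int]:
    """Replace each subject under (P, O_i) by its position among the children of O_i.

    Assumes subjects_of_o is sorted and contains every subject in subjects_of_po.
    """
    positions = []
    for subject in subjects_of_po:
        p = bisect_left(subjects_of_o, subject)
        if p == len(subjects_of_o) or subjects_of_o[p] != subject:
            raise ValueError(f"subject {subject} is not a child of the object")
        positions.append(p)
    return positions


def cross_decompress(positions: List[int], subjects_of_o: List[int]) -> List[int]:
    """Recover the original subject IDs from their positions."""
    return [subjects_of_o[p] for p in positions]


# Made-up IDs: the children of O_i in one trie, and of (P, O_i) in the other.
subjects_of_o = [13, 208, 4051, 90210, 777000]
subjects_of_po = [208, 90210]

positions = cross_compress(subjects_of_po, subjects_of_o)
```

Positions are bounded by the number of children of O_i rather than by the number of distinct subjects, so they are smaller integers whenever nodes have few children, which is what the DBpedia statistics on the slide are meant to show. The price is one extra access to the other trie when decoding.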

Permutation Elimination
Fact: predicates are few, thus S?O returns only a few matches.
We can pattern match S?O on the SPO trie, instead of the OSP trie.
Given an (s, o) pair: for each child p_i of s, check if o is a child of p_i. If so, then (s, p_i, o) is a match.
Fewer than 6 checks are needed on average! [Table: number of children in DBpedia.]
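The S?O procedure described above can be sketched as follows. For brevity the SPO trie is modelled as nested dictionaries (subject to predicate to sorted object list) rather than the flat node and pointer arrays; that representation, the function names, and the sample data are assumptions of this sketch.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

# subject -> predicate -> sorted list of objects (stand-in for the SPO trie)
SpoTrie = Dict[int, Dict[int, List[int]]]


def contains(sorted_values: List[int], x: int) -> bool:
    """Binary search membership test on a sorted list."""
    i = bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x


def select_s_o(spo: SpoTrie, s: int, o: int) -> List[Tuple[int, int, int]]:
    """Resolve S?O on the SPO trie: one membership check per predicate child of s."""
    return [
        (s, p, o)
        for p, objects in spo.get(s, {}).items()
        if contains(objects, o)
    ]


# Made-up integer triples.
spo = {
    0: {0: [1, 2], 1: [2], 3: [0, 2]},
    1: {0: [0]},
}

found = select_s_o(spo, 0, 2)
```

The cost is one binary search per predicate attached to s, so it stays small as long as subjects have few distinct predicates.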

Permutation Elimination
• SPO trie: SPO, SP?, S??, S?O, ???
plus one of:
• OPS trie (object-based retrieval): ?PO, ??O, ?P?
• POS trie (predicate-based retrieval): ?PO, ??O, ?P?
We can eliminate a permutation, thus saving 1/3 of the space of the index.

Experiments: setting
• Datasets: [table not recovered]
• Machine: i7-7700 CPU (@3.6 GHz), 64 GB of RAM DDR3 (@2.133 GHz), Linux 4.4.0, 64 bits
• Compiler: gcc 7.2.0 (with all optimizations)

Experiments: C++ code
C++ code at https://github.com/jermp/rdf_indexes

Experiments: our solutions
Overall, 2Tp offers the best space/time trade-off.

Experiments: overall comparison
Our selected trade-off configuration substantially outperforms the tested competitors in both space and time.

Conclusions
The triple indexing problem with pattern matching can be solved efficiently in both time and space.
Our solution, the permuted trie index, achieves a substantial performance improvement over the best previous solutions.
• Cross-compression
• Permutation-elimination
C++ code available at https://github.com/jermp/rdf_indexes
Paper available at https://arxiv.org/abs/1904.07619

Thanks for your attention, time, patience! Any questions?
