Dictionary Compression: Reducing Symbolic Redundancy in RDF
Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto
3rd KEYSTONE Training School: Keyword search in Big Linked Data
23rd August 2017
Image: Alcázar (Segovia, Spain)
Agenda Introduction What is Dictionary Compression? Compressed String Dictionaries Some Experimental Numbers RDF Dictionaries Foundations RDF Dictionary-based Compression Dictionaries in Practice Conclusions PAGE 2 images: zurb.com
Dictionary Compression Introduction • What is Dictionary Compression? • Compressed String Dictionaries
What is Dictionary Compression?
“Dictionary compression is a simple but effective technique which replaces the occurrences of terms by identifiers which are more compact to encode and easier and more efficient to handle.”
PAGE 4 DICTIONARY COMPRESSION
Dictionary Compression
Dictionary compression is a simple but effective technique that replaces the occurrences of (long, variable-length) terms with (short) identifiers, which are more compact to encode and easier and more efficient to handle.
Implementing this class of compression requires an efficient data structure (the dictionary) that provides at least two basic mapping operations:
• locate(t) returns i if the term t is the i-th element in the dictionary.
• extract(i) returns the i-th term (t) in the dictionary.
The dictionary organizes all the distinct terms (the vocabulary) in the dataset.
Dictionary compression has traditionally been applied for natural language processing purposes (e.g. information retrieval).
Dictionary Compression
Text:
…
la tarara sí
la tarara no
la tarara niña
que la he visto yo
…
Dictionary (data structure):
ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo
Dictionary Compression
Text (each term replaced by its ID):
…
2 7 6
2 7 4
2 7 3
5 2 1 8 9
…
Dictionary (data structure):
ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo
Dictionary Compression
The original text takes 59 bytes: 59 chars × 1 byte/char.
… la tarara sí la tarara no la tarara niña que la he visto yo …
Dictionary Compression
The original text takes 59 bytes: 59 chars × 1 byte/char.
The dictionary-compressed text takes 7 bytes: 14 IDs × ceil(log2(9)) = 4 bits/ID, i.e. 56 bits, plus the cost of serializing the data structure.
…
2 7 6
2 7 4
2 7 3
5 2 1 8 9
…
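The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative, the 1 byte/char figure is modelled here as a Latin-1 encoding (an assumption), and the cost of serializing the dictionary itself is left out, as on the slide.

```python
import math
import unicodedata

# Running example from the slides, normalized so accented letters are single characters.
text = unicodedata.normalize(
    "NFC", "la tarara sí la tarara no la tarara niña que la he visto yo"
)
words = text.split()

# Vocabulary: distinct terms in sorted order, with IDs starting at 1.
vocabulary = sorted(set(words))
term_to_id = {term: i for i, term in enumerate(vocabulary, start=1)}

# Dictionary-compressed text: one ID per term occurrence.
ids = [term_to_id[w] for w in words]

# Assumption: 1 byte per character, modelled as Latin-1.
original_bytes = len(text.encode("latin-1"))

# Fixed-width IDs: enough bits to tell the vocabulary entries apart.
bits_per_id = math.ceil(math.log2(len(vocabulary)))
compressed_bytes = math.ceil(len(ids) * bits_per_id / 8)

print(ids)
print(original_bytes, bits_per_id, compressed_bytes)
```

The saving only holds once the dictionary is shared across a text much longer than this one, which is why the size of the dictionary itself becomes the concern in the following slides.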
Dictionary Compression
Dictionary compression is used for optimizing applications in:
• Natural Language Processing (e.g. Information Retrieval or Machine Translation).
• Web Graph Management.
• Triplestores (e.g. RDF-3X) and other semantic tools (e.g. HDT).
• NoSQL databases.
• Bioinformatics search engines.
• Internet Routing.
• Geographic Information Systems.
• …
Data Structures
Dictionaries have traditionally been implemented using well-known data structures:
• Hash tables or tries for resolving locate queries.
• Arrays for resolving extract queries.
These solutions are efficient, but they require too much memory to be used in practical scenarios.
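A minimal sketch of this traditional layout, assuming a Python dict as the hash table and a Python list as the array. The class name, the 1-based IDs and the use of 0 for "not found" are choices made for this example, not something the slides prescribe.

```python
class StringDictionary:
    """Uncompressed dictionary: hash table for locate, array for extract."""

    def __init__(self, terms):
        # Array: position i-1 holds the term with ID i (terms kept in sorted order).
        self._terms = sorted(set(terms))
        # Hash table: term -> ID.
        self._ids = {term: i for i, term in enumerate(self._terms, start=1)}

    def locate(self, term):
        """Return the ID of `term`, or 0 if it is not in the dictionary."""
        return self._ids.get(term, 0)

    def extract(self, i):
        """Return the i-th term (IDs start at 1)."""
        if not 1 <= i <= len(self._terms):
            raise IndexError(f"ID out of range: {i}")
        return self._terms[i - 1]


dictionary = StringDictionary(
    "la tarara sí la tarara no la tarara niña que la he visto yo".split()
)
print(dictionary.locate("tarara"), dictionary.extract(2))
```

Both operations take expected constant time, but every term is stored in full (twice, in effect: once as a hash key and once in the array), which is the memory cost the slide refers to.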
The Problem …
Datasets are increasingly bigger and more varied:
• Vocabularies are also larger and comprise more heterogeneous terms.
• The dictionary size is a bottleneck for applications running under main-memory restrictions.
The resulting dictionary data structure is very large and does not scale for efficient in-memory management:
• Dictionary management is becoming a scalability issue in itself, and it must be optimized for Big Data scenarios.
Preconditions:
• Dictionaries are static (they are rebuilt from scratch when the vocabulary changes).
• Dictionaries are cached in main memory.
Compressed String Dictionaries
“Compressed String Dictionaries are a particular class of compact data structure which is optimized for dealing with string vocabularies from different domains.”
The Solutions …
Innovative compressed string dictionaries are proposed for managing big vocabularies in main memory:
• Traditional dictionaries are revisited to optimize their memory footprint.
• Existing compact data structures are tuned to perform as dictionaries.
• New compact data structures have been designed as compressed string dictionaries.
All these techniques ensure efficient in-memory query resolution:
• locate and extract are resolved at the microsecond level.
New interesting queries are also supported by these techniques:
• Prefix-based queries retrieve the IDs / terms matching a given prefix.
• Substring-based queries retrieve the IDs / terms matching a given substring.
Queries
ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo
locate(“tarara”) = 7
extract(2) = “la”
locatePrefix(“n”) = {3,4}
extractPrefix(“n”) = {“niña”,“no”}
locateSubstring(“a”) = {2,3,7}
extractSubstring(“a”) = {“la”,“niña”,“tarara”}
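The prefix and substring queries of this slide can be sketched over a plain sorted list. The function names below mirror the slide's operations but are otherwise hypothetical, and the linear scan used for substrings is only a stand-in: it is not how the self-indexes discussed later resolve these queries.

```python
from bisect import bisect_left

# Sorted vocabulary of the running example; the ID of a term is its position + 1.
TERMS = ["he", "la", "niña", "no", "que", "sí", "tarara", "visto", "yo"]


def locate_prefix(prefix):
    """IDs of all terms starting with `prefix` (a contiguous range in sorted order)."""
    lo = bisect_left(TERMS, prefix)
    hi = lo
    while hi < len(TERMS) and TERMS[hi].startswith(prefix):
        hi += 1
    return list(range(lo + 1, hi + 1))


def extract_prefix(prefix):
    """Terms starting with `prefix`."""
    return [TERMS[i - 1] for i in locate_prefix(prefix)]


def locate_substring(pattern):
    """IDs of all terms containing `pattern` (naive linear scan)."""
    return [i for i, term in enumerate(TERMS, start=1) if pattern in term]


def extract_substring(pattern):
    """Terms containing `pattern`."""
    return [TERMS[i - 1] for i in locate_substring(pattern)]


print(locate_prefix("n"), extract_prefix("n"))
print(locate_substring("a"), extract_substring("a"))
```

Prefix matches form a contiguous ID range because the vocabulary is sorted, so a binary search finds where the range starts. Substring matches have no such locality, which is why they need a different kind of index.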
Techniques for Compressing Dictionaries
• Compressed Hash: The hash table is simulated using bitmaps. Strings are stored in compressed form (Huffman/Re-Pair). locate / extract operations are implemented using rank / select.
• Differential Front-Coding Compression: Front-Coding exploits the fact that consecutive strings (in the sorted vocabulary) are likely to share a common prefix. Plain Front-Coding dictionaries use byte-oriented compression. Compressed Front-Coding dictionaries combine Hu-Tucker and Huffman/Re-Pair compression. Primitive and prefix-based operations are implemented using binary search and efficient sequential decoding.
• Self-Indexes: The FM-Index is adapted to perform as a dictionary, and the XBW introduces a self-indexed trie. All operations are implemented by exploiting the features of the BWT.
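As an illustration of the Front-Coding idea, here is a simplified bucketed sketch in Python. It is not the byte-oriented PFC layout nor the Hu-Tucker/Re-Pair variant named on the slide: the class name, the default bucket size and the (shared length, suffix) tuples are assumptions made to keep the example short.

```python
from bisect import bisect_right


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class FrontCodedDictionary:
    def __init__(self, terms, bucket_size=4):
        terms = sorted(set(terms))
        self.size = len(terms)
        self.bucket_size = bucket_size
        self.headers = []  # first term of each bucket, stored in full
        self.buckets = []  # other terms as (shared prefix length, remaining suffix)
        for start in range(0, len(terms), bucket_size):
            chunk = terms[start:start + bucket_size]
            self.headers.append(chunk[0])
            coded = []
            for prev, cur in zip(chunk, chunk[1:]):
                k = common_prefix_len(prev, cur)
                coded.append((k, cur[k:]))
            self.buckets.append(coded)

    def _decode(self, b):
        """Sequentially decode the terms of bucket b."""
        term = self.headers[b]
        yield term
        for k, suffix in self.buckets[b]:
            term = term[:k] + suffix
            yield term

    def extract(self, i):
        """Return the i-th term (IDs start at 1)."""
        if not 1 <= i <= self.size:
            raise IndexError(f"ID out of range: {i}")
        b, offset = divmod(i - 1, self.bucket_size)
        for pos, term in enumerate(self._decode(b)):
            if pos == offset:
                return term

    def locate(self, term):
        """Binary search on bucket headers, then sequential decoding; 0 if absent."""
        b = bisect_right(self.headers, term) - 1
        if b < 0:
            return 0
        for pos, candidate in enumerate(self._decode(b)):
            if candidate == term:
                return b * self.bucket_size + pos + 1
        return 0


uris = [
    "http://example.org/property/age",
    "http://example.org/property/location",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/bart-simpson",
]
fc = FrontCodedDictionary(uris, bucket_size=2)
print(fc.headers)
print(fc.buckets)
```

The bucket size sets the space/time tradeoff in this sketch: larger buckets store fewer full strings but require more sequential decoding per query.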
More Details …
Some Experimental Numbers
“Compressed String Dictionaries answer queries at the level of microseconds, while compressing vocabularies up to 20 times.”
Experimental Setup
We analyze compression effectiveness and retrieval speed:
• locate, extract.
• Prefix-based operations (URIs).
• Substring-based operations (Literals).
In practice, extract is the most important query:
• It is used many times as results are retrieved from the compressed dataset.
26,948,638 URIs from Uniprot:
• Average length: 51.04 chars per URI.
• Highly repetitive.
27,592,013 Literals from DBpedia:
• Average length: 60.45 chars per Literal.
Locate / Extract Performance (URIs)
PFC (plain Front-Coding) is the fastest choice for locate/extract …
• locate ≈ 1.6 μs/string.
• extract ≈ 0.3-0.6 μs/ID.
… but requires more space:
• ≈ 9-19% of the original space.
HTFC (compressed Front-Coding) reports the most balanced space/time tradeoffs:
• locate ≈ 2.2-3 μs/string.
• extract ≈ 0.7-1.6 μs/ID.
• ≈ 5-13% of the original space.
Locate / Extract Performance (Literals)
HTFC reports the best compression ratios, but its performance is less competitive:
• locate ≈ 2-2.5 μs/string.
• extract > 2.5 μs/ID.
• ≈ 12% of the original space.
HashDAC-rp (compressed Hashing) reports the best tradeoffs:
• locate ≈ 1.5 μs/string.
• extract ≈ 1 μs/ID.
• ≈ 15% of the original space.
Domain Entity Retrieval (URIs)
PFC is the best choice for prefix-based operations, although it uses more space than the other approaches.
Full-Text Search (Literals)
Self-index based dictionaries are the only ones providing full-text search:
• FMI is the fastest solution (≈ 1 μs/result) when it uses more space than the original vocabulary.
• XBW is the better choice for this scenario: ≈ 5-6 μs/result, using ≈ 40% of the original space.
Dictionary Compression RDF Dictionaries • Foundations • RDF Dictionary-based Compression • Dictionaries in Practice
Foundations
“RDF Dictionaries are a core component of any compression or indexing approach designed for semantic datasets.”
Basics
An RDF dictionary comprises all the distinct terms used in the dataset:
• Terms are drawn from 3 disjoint vocabularies: URIs, Literals, and blank nodes.
URIs are medium-size strings which share long prefixes:
• http://example.org/property/age
• http://example.org/property/location
• http://example.org/person/abe-simpson
• http://example.org/person/bart-simpson
Literals tend to be large-size strings (with no predictable features), or numbers, or dates…:
• “742 Evergreen Terrace”
• “Bart Simpson”
• “Homer Simpson”
• 10
Blank node serialization is not standardized:
• “Auto-incremental” strings are usually used → similar features to URIs.
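The contrast between URIs and literals can be checked on the slide's own examples. The sketch below measures how many leading characters each term shares with its predecessor in sorted order; the helper name is hypothetical, and four URIs and three literals are far too few to say anything about real datasets.

```python
from os.path import commonprefix


def shared_prefix_lengths(terms):
    """Shared prefix length between each pair of consecutive terms in sorted order."""
    ordered = sorted(terms)
    return [len(commonprefix([a, b])) for a, b in zip(ordered, ordered[1:])]


uris = [
    "http://example.org/property/age",
    "http://example.org/property/location",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/bart-simpson",
]
literals = ["742 Evergreen Terrace", "Bart Simpson", "Homer Simpson"]

uri_lengths = shared_prefix_lengths(uris)
literal_lengths = shared_prefix_lengths(literals)
print(uri_lengths, literal_lengths)
```

On these examples, every sorted URI shares at least the namespace with its neighbour, while the literals share nothing, which suggests why prefix-oriented techniques suit URIs better than literals.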