Dictionary Compression: Reducing Symbolic Redundancy in RDF


  1. Dictionary Compression: Reducing Symbolic Redundancy in RDF
     Antonio Fariña, Javier D. Fernández and Miguel A. Martínez-Prieto
     3rd KEYSTONE Training School: Keyword Search in Big Linked Data, 23rd August 2017
     Image: Alcázar (Segovia, Spain)

  2. Agenda
     • Introduction
     • What is Dictionary Compression?
     • Compressed String Dictionaries
     • Some Experimental Numbers
     • RDF Dictionaries
     • Foundations
     • RDF Dictionary-based Compression
     • Dictionaries in Practice
     • Conclusions
     Images: zurb.com

  3. Introduction
     • What is Dictionary Compression?
     • Compressed String Dictionaries

  4. What is Dictionary Compression?
     “Dictionary compression is a simple but effective technique which replaces the occurrences of terms by identifiers which are more compact to encode and easier and more efficient to handle.”

  5. Dictionary Compression
     • Dictionary compression is a simple but effective technique which replaces the occurrences of (long, variable-length) terms by (short) identifiers which are more compact to encode and easier and more efficient to handle.
     • Implementing this class of compression requires an efficient data structure configuration (dictionary) which provides, at least, two basic mapping operations:
       - locate(t) returns i if the term t is the i-th element in the dictionary.
       - extract(i) returns the i-th term (t) in the dictionary.
     • The dictionary organizes all different terms (vocabulary) in the dataset.
     • Dictionary compression has traditionally been applied for natural language processing purposes (e.g. information retrieval).

  6. Dictionary Compression
     Data structure (dictionary):
       ID  String
       1   he
       2   la
       3   niña
       4   no
       5   que
       6   sí
       7   tarara
       8   visto
       9   yo
     Original text:
       …
       la tarara sí
       la tarara no
       la tarara niña
       que la he visto yo
       …

  7. Dictionary Compression
     Data structure (dictionary):
       ID  String
       1   he
       2   la
       3   niña
       4   no
       5   que
       6   sí
       7   tarara
       8   visto
       9   yo
     Text with each term replaced by its ID:
       …
       2 7 6
       2 7 4
       2 7 3
       5 2 1 8 9
       …

  8. Dictionary Compression
     • The original text takes 59 bytes (59 chars × 1 byte/char):
       …
       la tarara sí
       la tarara no
       la tarara niña
       que la he visto yo
       …

  9. Dictionary Compression
     • The original text takes 59 bytes (59 chars × 1 byte/char).
     • The dictionary-compressed text takes 7 bytes (14 IDs × ⌈log2(9)⌉ bits/ID = 14 × 4 bits = 56 bits), plus the cost of serializing the data structure:
       …
       2 7 6
       2 7 4
       2 7 3
       5 2 1 8 9
       …
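
The arithmetic on slides 8 and 9 can be checked with a short Python sketch. It is only an illustration, not the authors' implementation: the variable names are made up here, Latin-1 is assumed so that every character takes one byte, and each ID is stored minus one so that nine distinct IDs fit in ⌈log2(9)⌉ = 4 bits.

```python
import math

# Example text from the slides. Latin-1 is an assumption made here so that
# every character (including "í" and "ñ") occupies exactly one byte.
text = "la tarara sí la tarara no la tarara niña que la he visto yo"
original_bytes = len(text.encode("latin-1"))

# Build the dictionary: distinct terms in sorted order, IDs starting at 1.
words = text.split()
vocabulary = sorted(set(words))
term_to_id = {term: i for i, term in enumerate(vocabulary, start=1)}

# Replace every term occurrence by its ID.
ids = [term_to_id[word] for word in words]

# Fixed-width encoding: ceil(log2(n)) bits are enough for n distinct IDs.
bits_per_id = math.ceil(math.log2(len(vocabulary)))

# Pack the IDs back to back into one integer, then serialize it.
# ID - 1 is stored so that the values range over 0..n-1.
packed_value = 0
for term_id in ids:
    packed_value = (packed_value << bits_per_id) | (term_id - 1)
packed = packed_value.to_bytes(math.ceil(len(ids) * bits_per_id / 8), "big")

print(original_bytes, len(ids), bits_per_id, len(packed))
```

The 7 bytes cover only the ID sequence; the dictionary itself still has to be stored, which is the cost the rest of the deck tries to reduce.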

  10. Dictionary Compression
      Dictionary compression is used for optimizing applications of…
      • Natural Language Processing (e.g. Information Retrieval or Machine Translation).
      • Web Graph Management.
      • Triplestores (e.g. RDF3X) and other semantic tools (e.g. HDT).
      • NoSQL databases.
      • Bioinformatics search engines.
      • Internet Routing.
      • Geographic Information Systems.
      • …

  11. Data Structures
      • Dictionaries have traditionally been implemented using well-known data structures:
        - Hash tables or tries for resolving locate queries.
        - Arrays for resolving extract queries.
      • These solutions are efficient, but they require large amounts of memory in practical scenarios.
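
A minimal sketch of this traditional configuration, assuming a hash table for locate and an array for extract. The class name `PlainDictionary` and the convention of returning 0 for an unknown term are assumptions made for illustration.

```python
class PlainDictionary:
    """Uncompressed dictionary: a hash table for locate, an array for extract."""

    def __init__(self, terms):
        # Array of distinct terms in sorted order; position i holds ID i + 1.
        self._terms = sorted(set(terms))
        # Hash table mapping each term back to its 1-based ID.
        self._ids = {term: i for i, term in enumerate(self._terms, start=1)}

    def locate(self, term):
        """Return the ID of `term`, or 0 if it is not in the dictionary."""
        return self._ids.get(term, 0)

    def extract(self, term_id):
        """Return the term stored under `term_id` (1-based)."""
        if not 1 <= term_id <= len(self._terms):
            raise IndexError(f"no term with ID {term_id}")
        return self._terms[term_id - 1]


dictionary = PlainDictionary(
    "la tarara sí la tarara no la tarara niña que la he visto yo".split()
)
print(dictionary.locate("tarara"), dictionary.extract(2))
```

Both operations take constant expected time, but every term is kept in full and the hash table adds its own overhead on top of the array, which is the memory cost the slide refers to.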

  12. The Problem …
      • Data sets are increasingly bigger and more varied:
        - Vocabularies are also larger and comprise more heterogeneous terms.
        - The dictionary size is a bottleneck for applications running under main-memory restrictions.
      • The resulting dictionary data structure is very large and does not scale for efficient in-memory management:
        - Dictionary management is becoming a scalability issue in itself and must be optimized for Big Data scenarios.
      • Preconditions:
        - Dictionaries are static (they are rebuilt from scratch when the vocabulary changes).
        - Dictionaries are cached in main memory.

  13. Compressed String Dictionaries
      “Compressed String Dictionaries are a particular class of compact data structure which is optimized for dealing with string vocabularies from different domains.”

  14. The Solutions …
      • Innovative compressed string dictionaries are proposed for managing big vocabularies in main memory:
        - Traditional dictionaries are revisited to optimize their memory footprint.
        - Existing compact data structures are tuned to perform as dictionaries.
        - New compact data structures have been designed as compressed string dictionaries.
      • All these techniques ensure efficient in-memory query resolution:
        - locate and extract are resolved at microsecond level.
      • New interesting queries are also supported by these techniques:
        - Prefix-based queries retrieve IDs / terms matching a given prefix.
        - Substring-based queries retrieve IDs / terms matching a given substring.

  15. Queries
      Dictionary:
        ID  String
        1   he
        2   la
        3   niña
        4   no
        5   que
        6   sí
        7   tarara
        8   visto
        9   yo
      Queries:
        locate(“tarara”) = 7
        extract(2) = “la”
        locatePrefix(“n”) = {3,4}
        extractPrefix(“n”) = {“niña”,“no”}
        locateSubstring(“a”) = {2,3,7}
        extractSubstring(“a”) = {“la”,“niña”,“tarara”}
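
The prefix and substring queries of slide 15 can be reproduced over a plain sorted list. This is only a sketch of the query semantics: the function names are hypothetical, and the linear scan used for substrings is a stand-in, not the self-index machinery that the compressed dictionaries rely on.

```python
from bisect import bisect_left

# Sorted vocabulary from the slides; the term at position i has ID i + 1.
VOCABULARY = ["he", "la", "niña", "no", "que", "sí", "tarara", "visto", "yo"]


def locate_prefix(prefix):
    """IDs of all terms starting with `prefix`.

    In a sorted vocabulary the matches are contiguous, so a binary search
    finds the first candidate and a short scan collects the rest.
    """
    position = bisect_left(VOCABULARY, prefix)
    ids = []
    while position < len(VOCABULARY) and VOCABULARY[position].startswith(prefix):
        ids.append(position + 1)
        position += 1
    return ids


def locate_substring(substring):
    """IDs of all terms containing `substring` (naive linear scan)."""
    return [i for i, term in enumerate(VOCABULARY, start=1) if substring in term]


def extract_all(ids):
    """Terms for a list of IDs, in the same order."""
    return [VOCABULARY[term_id - 1] for term_id in ids]


print(locate_prefix("n"), extract_all(locate_prefix("n")))
print(locate_substring("a"), extract_all(locate_substring("a")))
```

Prefix queries benefit directly from the sorted order, whereas substring matches can appear anywhere in the vocabulary, which is why the deck resolves them with self-indexes instead of a scan.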

  16. Techniques for Compressing Dictionaries
      • Compressed Hash:
        - The hash table is simulated using bitmaps.
        - Strings are stored in compressed form (Huffman/Re-Pair).
        - locate / extract operations are implemented using rank / select.
      • Differential Front-Coding Compression:
        - Front-Coding exploits the fact that consecutive strings (in the vocabulary) are likely to share a common prefix.
        - Plain Front-Coding dictionaries use byte-oriented compression.
        - Compressed Front-Coding dictionaries combine Hu-Tucker and Huffman/Re-Pair compression.
        - Primitive and prefix-based operations are implemented using binary search and efficient sequential decoding.
      • Self-Indexes:
        - The FM-Index is adapted to perform as a dictionary, and the XBW introduces a self-indexed trie.
        - All operations are implemented by exploiting the BWT features.
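
The Front-Coding idea can be sketched in a few lines of Python. This is a toy version under stated assumptions: the class name, the `bucket_size` parameter and the use of Python tuples are illustrative choices, whereas the Plain Front-Coding dictionaries mentioned above use a byte-oriented encoding that is not reproduced here. Terms are grouped into fixed-size buckets; the first term of each bucket is kept in full and every other term is stored as the length of the prefix shared with its predecessor plus the remaining suffix.

```python
from bisect import bisect_right
from os.path import commonprefix


class FrontCodedDictionary:
    """Toy Front-Coding dictionary; IDs are 1-based positions in sorted order."""

    def __init__(self, terms, bucket_size=4):
        self.bucket_size = bucket_size
        self.size = 0
        self.headers = []  # first term of each bucket, stored in full
        self.buckets = []  # remaining terms as (shared_prefix_length, suffix)
        previous = None
        for position, term in enumerate(sorted(set(terms))):
            if position % bucket_size == 0:
                self.headers.append(term)
                self.buckets.append([])
            else:
                shared = len(commonprefix([previous, term]))
                self.buckets[-1].append((shared, term[shared:]))
            previous = term
            self.size += 1

    def _decode(self, bucket):
        """Sequentially rebuild the terms of one bucket."""
        term = self.headers[bucket]
        yield term
        for shared, suffix in self.buckets[bucket]:
            term = term[:shared] + suffix
            yield term

    def extract(self, term_id):
        """Jump to the bucket holding `term_id`, then decode up to it."""
        if not 1 <= term_id <= self.size:
            raise IndexError(f"no term with ID {term_id}")
        bucket, offset = divmod(term_id - 1, self.bucket_size)
        for i, term in enumerate(self._decode(bucket)):
            if i == offset:
                return term

    def locate(self, term):
        """Binary search on bucket headers, then sequential decoding.

        Returns 0 when the term is absent (a convention assumed here).
        """
        bucket = bisect_right(self.headers, term) - 1
        if bucket < 0:
            return 0
        for offset, candidate in enumerate(self._decode(bucket)):
            if candidate == term:
                return bucket * self.bucket_size + offset + 1
        return 0


uris = FrontCodedDictionary([
    "http://example.org/property/age",
    "http://example.org/property/location",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/bart-simpson",
])
print(uris.buckets[0])
```

The bucket size trades space for time: larger buckets store fewer full headers but require longer sequential decoding per query.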

  17. More Details …

  18. Some Experimental Numbers
      “Compressed String Dictionaries answer queries at the level of microseconds, while compressing vocabularies up to 20 times.”

  19. Experimental Setup
      • We analyze compression effectiveness and retrieval speed:
        - locate, extract.
        - Prefix-based operations (URIs).
        - Substring-based operations (Literals).
      • In practice, extract is the most important query:
        - It is used many times as results are retrieved from the compressed dataset.
      • 26,948,638 URIs from Uniprot:
        - Average length: 51.04 chars per URI.
        - Highly repetitive.
      • 27,592,013 Literals from DBpedia:
        - Average length: 60.45 chars per Literal.

  20. Locate / Extract Performance (URIs)
      • PFC (Plain Front-Coding) is the fastest choice for locate/extract…
        - locate ≈ 1.6 μs/string.
        - extract ≈ 0.3-0.6 μs/ID.
      • …but it requires more space:
        - ≈ 9-19% of the original space.
      • HTFC (compressed Front-Coding) reports the most balanced space/time tradeoffs:
        - locate ≈ 2.2-3 μs/string.
        - extract ≈ 0.7-1.6 μs/ID.
        - ≈ 5-13% of the original space.

  21. Locate / Extract Performance (Literals)
      • HTFC reports the best compression ratios, but its performance is less competitive:
        - locate ≈ 2-2.5 μs/string.
        - extract > 2.5 μs/ID.
        - ≈ 12% of the original space.
      • HashDAC-rp (compressed Hashing) reports the best tradeoffs:
        - locate ≈ 1.5 μs/string.
        - extract ≈ 1 μs/ID.
        - ≈ 15% of the original space.

  22. Domain Entity Retrieval (URIs)
      • PFC is the best choice for prefix-based operations, although it uses more space than the other approaches.

  23. Full-Text Search (Literals)
      • Self-index based dictionaries are the only ones providing full-text search:
        - FMI (FM-Index) is the fastest solution (≈ 1 μs/result) when it uses more space than the original vocabulary.
        - XBW is the better choice for this scenario: ≈ 5-6 μs/result, using ≈ 40% of the original space.

  24. RDF Dictionaries
      • Foundations
      • RDF Dictionary-based Compression
      • Dictionaries in Practice

  25. Foundations
      “RDF Dictionaries are a core component of any compression or indexing approach designed for semantic datasets.”

  26. Basics
      • An RDF dictionary comprises all different terms used in the dataset:
        - Terms are drawn from 3 disjoint vocabularies: URIs, Literals, and blank nodes.
      • URIs are medium-size strings which share long prefixes:
          http://example.org/property/age
          http://example.org/property/location
          http://example.org/person/abe-simpson
          http://example.org/person/bart-simpson
      • Literals tend to be large strings (with no predictable features), or numbers, or dates…:
          “742 Evergreen Terrace”
          “Bart Simpson”
          “Homer Simpson”
          10
      • Blank node serialization is not standardized:
        - “Auto-incremental” strings are usually used → similar features to URIs.
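
The contrast between URIs and Literals can be made concrete by measuring how many leading characters consecutive sorted terms share. The helper name `shared_prefix_lengths` is made up for this sketch, and only the example terms from the slide are used.

```python
from os.path import commonprefix


def shared_prefix_lengths(terms):
    """Length of the prefix shared by each pair of consecutive sorted terms."""
    ordered = sorted(terms)
    return [len(commonprefix([a, b])) for a, b in zip(ordered, ordered[1:])]


uri_terms = [
    "http://example.org/property/age",
    "http://example.org/property/location",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/bart-simpson",
]
literal_terms = ["742 Evergreen Terrace", "Bart Simpson", "Homer Simpson"]

print(shared_prefix_lengths(uri_terms))
print(shared_prefix_lengths(literal_terms))
```

In this tiny sample every pair of neighbouring URIs shares at least 20 characters, while the literals share none, which suggests why prefix-oriented techniques suit URIs better than Literals.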
