Dictionary Compression: Reducing Symbolic Redundancy in RDF

Image: Alcázar (Segovia, Spain)
Dictionary Compression: Reducing Symbolic Redundancy in RDF
Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto
3rd KEYSTONE Training School: Keyword search in Big Linked Data
23rd August 2017

Agenda
• Introduction
  • What is Dictionary Compression?
  • Compressed String Dictionaries
  • Some Experimental Numbers
• RDF Dictionaries
  • Foundations
  • RDF Dictionary-based Compression
  • Dictionaries in Practice
• Conclusions

Introduction
• What is Dictionary Compression?
• Compressed String Dictionaries

What is Dictionary Compression?
“Dictionary compression is a simple but effective technique which replaces the occurrences of terms by identifiers which are more compact to encode and easier and more efficient to handle.”

Dictionary Compression
• Dictionary compression is a simple but effective technique which replaces the occurrences of (long, variable-length) terms by (short) identifiers which are more compact to encode and easier and more efficient to handle.
• Implementing this class of compression requires an efficient data structure (the dictionary) which provides, at least, two basic mapping operations:
  • locate(t) returns i if the term t is the i-th element in the dictionary.
  • extract(i) returns the i-th term (t) in the dictionary.
• The dictionary organizes all the different terms (the vocabulary) in the dataset.
• Dictionary compression has traditionally been applied for natural language processing purposes (e.g. information retrieval).
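The two mapping operations can be sketched as follows. This is only an illustration, assuming an uncompressed layout (a hash table for locate, an array for extract) and 1-based IDs assigned in sorted order; the class name `StringDictionary` is hypothetical and does not come from the slides.

```python
class StringDictionary:
    """Illustrative, uncompressed dictionary (hypothetical name)."""

    def __init__(self, terms):
        # The vocabulary holds each different term once, in sorted order.
        self._terms = sorted(set(terms))
        # Term -> 1-based ID, so locate() is a single hash lookup.
        self._ids = {t: i for i, t in enumerate(self._terms, start=1)}

    def locate(self, term):
        """Return i if term is the i-th element, or None if it is absent."""
        return self._ids.get(term)

    def extract(self, i):
        """Return the i-th term (IDs are 1-based)."""
        if not 1 <= i <= len(self._terms):
            raise IndexError(f"ID {i} is outside 1..{len(self._terms)}")
        return self._terms[i - 1]


text = "la tarara sí la tarara no la tarara niña que la he visto yo"
dictionary = StringDictionary(text.split())
print(dictionary.locate("tarara"))  # ID of the term "tarara"
print(dictionary.extract(2))        # term stored under ID 2
```

Both operations are inverse mappings of each other: extract(locate(t)) gives back t for any term in the vocabulary.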

Dictionary Compression
Dictionary (data structure):
ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo
Text:
…
la tarara sí
la tarara no
la tarara niña
que la he visto yo
…

Dictionary Compression
Dictionary (data structure):
ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo
Text (encoded as IDs):
…
2 7 6
2 7 4
2 7 3
5 2 1 8 9
…

Dictionary Compression
• The original text takes 59 bytes (59 chars × 1 byte/char):
…
la tarara sí
la tarara no
la tarara niña
que la he visto yo
…

Dictionary Compression
• The original text takes 59 bytes (59 chars × 1 byte/char).
• The dictionary-compressed text takes 7 bytes (14 IDs × ⌈log2(9)⌉ = 14 × 4 bits = 56 bits), plus the cost of serializing the data structure:
…
2 7 6
2 7 4
2 7 3
5 2 1 8 9
…
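The arithmetic above can be reproduced with a short sketch. It assumes a 1 byte/char encoding for the original text (as the slide does), fixed-width IDs of ⌈log2(9)⌉ = 4 bits, and it ignores the cost of serializing the dictionary itself; the variable names are illustrative.

```python
import math

text = "la tarara sí la tarara no la tarara niña que la he visto yo"
words = text.split()

# Sorted vocabulary with 1-based IDs, as in the example table.
vocabulary = sorted(set(words))
ids = {term: i for i, term in enumerate(vocabulary, start=1)}
encoded = [ids[w] for w in words]

# Original size, assuming 1 byte per character (spaces included).
original_bytes = len(text)

# Fixed-width IDs: enough bits to tell the 9 terms apart.
bits_per_id = math.ceil(math.log2(len(vocabulary)))
compressed_bits = len(encoded) * bits_per_id
compressed_bytes = math.ceil(compressed_bits / 8)

print(encoded)
print(original_bytes, "bytes ->", compressed_bytes, "bytes (+ dictionary)")
```

The saving only pays off once the text is long enough to amortize the dictionary, which is why the dictionary's own footprint matters.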

Dictionary Compression
Dictionary compression is used for optimizing applications of:
• Natural Language Processing (e.g. Information Retrieval or Machine Translation).
• Web Graph Management.
• Triplestores (e.g. RDF3X) and other semantic tools (e.g. HDT).
• NoSQL databases.
• Bioinformatics search engines.
• Internet Routing.
• Geographic Information Systems.
• …

Data Structures
• Dictionaries have traditionally been implemented using well-known data structures:
  • Hash tables or tries for resolving locate queries.
  • Arrays for resolving extract queries.
• These solutions are efficient, but require too much memory to be used in practical scenarios.

The Problem …
• Data sets are increasingly bigger and more varied:
  • Vocabularies are also larger and comprise more heterogeneous terms.
  • The dictionary size is a bottleneck for applications running under main-memory restrictions.
• The resulting dictionary data structure is very large and does not scale for efficient in-memory management:
  • Dictionary management is becoming a scalability issue by itself and must be optimized for Big Data scenarios.
• Preconditions:
  • Dictionaries are static (they are rebuilt from scratch when the vocabulary changes).
  • Dictionaries are cached in main memory.

Compressed String Dictionaries
“Compressed String Dictionaries are a particular class of compact data structure which is optimized for dealing with string vocabularies from different domains.”

The Solutions …
• Innovative compressed string dictionaries are proposed for managing big vocabularies in main memory:
  • Traditional dictionaries are revisited to optimize their memory footprint.
  • Existing compact data structures are tuned to perform as dictionaries.
  • New compact data structures have been designed as compressed string dictionaries.
• All these techniques ensure efficient in-memory query resolution:
  • locate and extract are resolved at the microsecond level.
• New interesting queries are also supported by these techniques:
  • Prefix-based queries retrieve IDs / terms matching a given prefix.
  • Substring-based queries retrieve IDs / terms matching a given substring.

Queries
ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo
• locate(“tarara”) = 7
• extract(2) = “la”
• locatePrefix(“n”) = {3,4}
• extractPrefix(“n”) = {“niña”,“no”}
• locateSubstring(“a”) = {2,3,7}
• extractSubstring(“a”) = {“la”,“niña”,“tarara”}
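The prefix and substring queries on this example vocabulary can be sketched over a plain sorted list. This is an uncompressed illustration, not how the compressed dictionaries resolve them: it assumes the prefix query uses binary search (matching terms are contiguous in sorted order) and the substring query uses a linear scan. The function names are illustrative snake_case versions of the slide's operations.

```python
from bisect import bisect_left

# Sorted vocabulary from the example; the ID of a term is its index + 1.
TERMS = ["he", "la", "niña", "no", "que", "sí", "tarara", "visto", "yo"]


def locate_prefix(prefix):
    """IDs of all terms starting with prefix (a contiguous range)."""
    ids = []
    for pos in range(bisect_left(TERMS, prefix), len(TERMS)):
        if not TERMS[pos].startswith(prefix):
            break  # past the end of the matching range
        ids.append(pos + 1)
    return ids


def extract_prefix(prefix):
    """Terms starting with prefix."""
    return [TERMS[i - 1] for i in locate_prefix(prefix)]


def locate_substring(substring):
    """IDs of all terms containing substring (linear scan in this sketch)."""
    return [i for i, t in enumerate(TERMS, start=1) if substring in t]


def extract_substring(substring):
    """Terms containing substring."""
    return [TERMS[i - 1] for i in locate_substring(substring)]


print(locate_prefix("n"), extract_prefix("n"))
print(locate_substring("a"), extract_substring("a"))
```

Prefix queries come almost for free from the sorted order, whereas substring queries need either a scan like this one or an index over the text of the terms.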

Techniques for Compressing Dictionaries
• Compressed Hash:
  • The hash table is simulated using bitmaps.
  • Strings are stored in compressed form (Huffman/Re-Pair).
  • locate / extract operations are implemented using rank / select.
• Differential Front-Coding Compression:
  • Front-Coding exploits the fact that consecutive strings (in the vocabulary) are likely to share a common prefix.
  • Plain Front-Coding dictionaries use byte-oriented compression.
  • Compressed Front-Coding dictionaries combine Hu-Tucker and Huffman/Re-Pair compression.
  • Primitive and prefix-based operations are implemented using binary search and efficient sequential decoding.
• Self-Indexes:
  • The FM-Index is adapted to perform as a dictionary, and the XBW introduces a self-indexed trie.
  • All operations are implemented exploiting the BWT features.
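The Front-Coding idea can be sketched as follows. This is a simplified illustration, not the byte-oriented layout of the actual dictionaries: it assumes the sorted vocabulary is split into fixed-size buckets, the first string of each bucket is stored in full, and every other string is stored as a pair (length of the prefix shared with the previous string, remaining suffix). The bucket size, the function names and the sample URIs are assumptions made for the example.

```python
from bisect import bisect_right


def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_terms, bucket_size):
    """Split into buckets: (full header string, [(shared_len, suffix), ...])."""
    buckets = []
    for start in range(0, len(sorted_terms), bucket_size):
        bucket = sorted_terms[start:start + bucket_size]
        entries, prev = [], bucket[0]
        for term in bucket[1:]:
            k = common_prefix_len(prev, term)
            entries.append((k, term[k:]))
            prev = term
        buckets.append((bucket[0], entries))
    return buckets


def extract(buckets, i, bucket_size):
    """Return the i-th term (1-based) by decoding its bucket sequentially."""
    term, entries = buckets[(i - 1) // bucket_size]
    for k, suffix in entries[:(i - 1) % bucket_size]:
        term = term[:k] + suffix
    return term


def locate(buckets, target, bucket_size):
    """Binary search on bucket headers, then sequential decoding."""
    b = bisect_right([header for header, _ in buckets], target) - 1
    if b < 0:
        return None
    term, entries = buckets[b]
    if term == target:
        return b * bucket_size + 1
    for offset, (k, suffix) in enumerate(entries, start=2):
        term = term[:k] + suffix
        if term == target:
            return b * bucket_size + offset
    return None


uris = sorted([
    "http://example.org/property/age",
    "http://example.org/property/location",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/bart-simpson",
])
buckets = front_encode(uris, bucket_size=2)
for header, entries in buckets:
    print(header, entries)
```

Larger buckets store fewer full strings and so compress better, but each locate or extract then decodes more entries sequentially; that is the space/time tradeoff this family of techniques tunes.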

More Details …

Some Experimental Numbers
“Compressed String Dictionaries answer queries at the level of microseconds, while compressing vocabularies up to 20 times.”

Experimental Setup
• We analyze compression effectiveness and retrieval speed:
  • locate, extract.
  • Prefix-based operations (URIs).
  • Substring-based operations (Literals).
• In practice, extract is the most important query:
  • It is used many times as results are retrieved from the compressed dataset.
• 26,948,638 URIs from Uniprot:
  • Average length: 51.04 chars per URI.
  • Highly repetitive.
• 27,592,013 Literals from DBpedia:
  • Average length: 60.45 chars per Literal.

Locate / Extract Performance (URIs)
• PFC (Plain Front-Coding) is the fastest choice for locate/extract:
  • locate ≈ 1.6 μs/string.
  • extract ≈ 0.3-0.6 μs/ID.
• … but it requires more space:
  • ≈ 9-19% of the original space.
• HTFC (compressed Front-Coding) reports the most balanced space/time tradeoffs:
  • locate ≈ 2.2-3 μs/string.
  • extract ≈ 0.7-1.6 μs/ID.
  • ≈ 5-13% of the original space.

Locate / Extract Performance (Literals)
• HTFC reports the best compression ratios, but its performance is less competitive:
  • locate ≈ 2-2.5 μs/string.
  • extract > 2.5 μs/ID.
  • ≈ 12% of the original space.
• HashDAC-rp (compressed Hashing) reports the best tradeoffs:
  • locate ≈ 1.5 μs/string.
  • extract ≈ 1 μs/ID.
  • ≈ 15% of the original space.

Domain Entity Retrieval (URIs)
• PFC is the best choice for prefix-based operations, although it uses more space than the other approaches.

Full-Text Search (Literals)
• Self-index based dictionaries are the only ones providing full-text search:
  • FMI is the fastest solution (≈ 1 μs/result) when it uses more space than the original vocabulary.
  • XBW is the better choice for this scenario:
    • ≈ 5-6 μs/result.
    • ≈ 40% of the original space.

RDF Dictionaries
• Foundations
• RDF Dictionary-based Compression
• Dictionaries in Practice

Foundations
“RDF Dictionaries are a core component of any compression or indexing approach designed for semantic datasets.”

Basics
• An RDF dictionary comprises all different terms used in the dataset:
  • Terms are drawn from 3 disjoint vocabularies: URIs, Literals, and blank nodes.
• URIs are medium-size strings which share long prefixes:
  http://example.org/property/age
  http://example.org/property/location
  http://example.org/person/abe-simpson
  http://example.org/person/bart-simpson
• Literals tend to be large-size strings (with no predictable features), or numbers, or dates…:
  “742 Evergreen Terrace”
  “Bart Simpson”
  “Homer Simpson”
  10
• Blank node serialization is not standardized:
  • “Auto-incremental” strings are usually used → similar features to URIs.
