Optimally Leveraging Density and Locality for Exploratory Browsing and Sampling
Albert Kim 1*, Liqi Xu 2*, Tarique Siddiqui 1, Silu Huang 2, Samuel Madden 1, Aditya Parameswaran 2
1 MIT, 2 University of Illinois (UIUC)
Motivation
- The subset of voters who reside in Paris and voted for a specific candidate
- Some of the genes that get positively induced after a clinical trial
- Example sessions on a given website on an iPhone X
Summarization, Browsing
Motivation
“Although big data demands aggregations, analysts wanted to see individual records to spotcheck their results, and to get a sense of what sat in a bucket.” [1]
Any-k problem: how to quickly return a small subset of records that satisfy arbitrary user-specified predicates?
[1] Moritz et al. Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data.
Existing Approach: Bitmap Index
- Effective for traditional OLAP-style workloads
- One bitmap per attribute value
- Indexes at the record level
Bitmaps for the any-k problem:
- Inefficient for the any-k problem
- High storage cost
Airline dataset and its bitmap indices:
…  Origin  Mon  |  Mon = 1  Mon = 2  Mon = 3
…  ORD     1    |  1        0        0
…  ORD     2    |  0        1        0
…  CMI     2    |  0        1        0
…  CMI     3    |  0        0        1
…  ORD     1    |  1        0        0
…  ORD     2    |  0        1        0
…  CMI     1    |  1        0        0
…  ORD     1    |  1        0        0
Our Approach: Density Map Index
- Indexes at the block level
- Reads and writes happen in units of a sector (e.g., 4 KB)
- Consumes less memory
- Stores the frequency of set bits per block
Airline dataset and its bitmap indices:
…  Origin  Mon  |  Mon = 1  Mon = 2  Mon = 3
…  ORD     1    |  1        0        0
…  ORD     2    |  0        1        0
…  CMI     2    |  0        1        0
…  CMI     3    |  0        0        1
…  ORD     1    |  1        0        0
…  ORD     2    |  0        1        0
…  CMI     1    |  1        0        0
…  ORD     1    |  1        0        0
Density maps (# of tuples per block: 2):
Block  |  Mon = 1  Mon = 2  Mon = 3
1      |  0.5      0.5      0.0
2      |  0.0      0.5      0.5
3      |  0.5      0.5      0.0
4      |  1.0      0.0      0.0
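The density map on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `build_density_map` and the dictionary layout are assumptions made for illustration, not the authors' implementation. It groups the Month column into blocks of two tuples and stores, per block, the fraction of tuples holding each value.

```python
from typing import Dict, List


def build_density_map(values: List[int], block_size: int) -> List[Dict[int, float]]:
    """For each block, return the fraction of its records holding each distinct value."""
    distinct = sorted(set(values))
    density_map = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # One entry per attribute value, mirroring one bitmap per value.
        density_map.append({v: block.count(v) / len(block) for v in distinct})
    return density_map


# Month column of the eight-row airline example shown on the slide.
month = [1, 2, 2, 3, 1, 2, 1, 1]
density = build_density_map(month, block_size=2)
for block_id, row in enumerate(density, start=1):
    print(block_id, row)
```

Each block needs one number per attribute value instead of one bit per record per value, which is where the memory saving comes from.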
Our Approach: Density-Optimal
Observation #1 [Density: denser is better]
SELECT ANY-K(*) FROM T WHERE Month = 1 AND Origin = “ORD”
[Figure: sorted density maps for Origin = “ORD” and for Month = 1, and the resulting map for Month = 1 AND Origin = “ORD”]
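The "denser is better" idea can be sketched as a greedy selection over per-block densities. The code below is a simplified illustration, not the paper's algorithm: the function name `density_optimal` is hypothetical, and the density of the conjunction is estimated as the product of the two per-predicate densities, which assumes the predicates are independent. The Origin densities are computed by hand from the eight-row airline example.

```python
from typing import List


def density_optimal(month_density: List[float], origin_density: List[float],
                    block_size: int, k: int) -> List[int]:
    """Pick the densest blocks until their expected number of matches reaches k."""
    # Assumption: predicates are independent, so the AND density is the product.
    combined = [m * o for m, o in zip(month_density, origin_density)]
    # Visit blocks from densest to sparsest (ties keep their on-disk order).
    order = sorted(range(len(combined)), key=lambda b: combined[b], reverse=True)
    chosen: List[int] = []
    expected = 0.0
    for b in order:
        if expected >= k or combined[b] == 0.0:
            break
        chosen.append(b)
        expected += combined[b] * block_size
    return chosen


month_eq_1 = [0.5, 0.0, 0.5, 1.0]   # Month = 1 densities, four blocks of two tuples
origin_ord = [1.0, 0.0, 1.0, 0.5]   # Origin = "ORD" densities for the same blocks
plan = density_optimal(month_eq_1, origin_ord, block_size=2, k=2)
print(plan)  # zero-based block ids
```

On this toy data the chosen blocks are not adjacent on disk, which is the weakness the locality observation addresses.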
Our Approach: Locality-Optimal
Observation #2 [Locality: closer is better]
SELECT ANY-K(*) FROM T WHERE Month = 1 AND Origin = “ORD”
[Figure: density maps for Origin = “ORD” and for Month = 1, and the resulting map for Month = 1 AND Origin = “ORD”]
Density-Optimal vs. Locality-Optimal?
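One way to read "closer is better" is to look for the shortest run of consecutive blocks that is expected to contain k matching records. The sliding-window sketch below follows that reading; the function name `locality_optimal` and the use of expected counts are assumptions for illustration rather than the algorithm as published.

```python
from typing import List, Optional, Tuple


def locality_optimal(density: List[float], block_size: int,
                     k: int) -> Optional[Tuple[int, int]]:
    """Return (first, last) of the shortest run of consecutive blocks expected to hold k matches."""
    best: Optional[Tuple[int, int]] = None
    expected = 0.0
    left = 0
    for right, d in enumerate(density):
        expected += d * block_size
        # Drop blocks from the left while the window still covers k expected matches.
        while left < right and expected - density[left] * block_size >= k:
            expected -= density[left] * block_size
            left += 1
        if expected >= k and (best is None or right - left < best[1] - best[0]):
            best = (left, right)
    return best


# Estimated densities for Month = 1 AND Origin = "ORD" on the eight-row example.
combined = [0.5, 0.0, 0.5, 0.5]
run = locality_optimal(combined, block_size=2, k=2)
print(run)  # zero-based (first, last) block ids
```

For the same query and k, this picks two adjacent blocks at the end of the file, whereas a purely density-driven choice would pick two blocks separated by a gap.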
Our Approach: I/O Optimal
- Leverages both density and locality
- Uses dynamic programming
- High computation cost
Hybrid
- Runs both Density-Optimal and Locality-Optimal
- Chooses the set of blocks with the smaller estimated I/O cost
[Figure: I/O cost model on HDDs; axes: blocks, # of samples]
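The Hybrid rule amounts to costing two candidate block sets and keeping the cheaper one. The sketch below illustrates only that selection step. The cost function is a placeholder, not the HDD cost model from the talk: it charges a hypothetical fixed `seek_cost` for every jump to a non-adjacent block plus a hypothetical `read_cost` per block, and both constants are made up.

```python
from typing import List


def estimated_io_cost(blocks: List[int], seek_cost: float = 10.0,
                      read_cost: float = 1.0) -> float:
    """Hypothetical HDD cost: one seek per non-adjacent jump plus a fixed cost per block read."""
    ordered = sorted(blocks)
    seeks = 1 if ordered else 0  # initial seek to the first block
    for prev, cur in zip(ordered, ordered[1:]):
        if cur != prev + 1:
            seeks += 1
    return seeks * seek_cost + len(ordered) * read_cost


def hybrid(density_plan: List[int], locality_plan: List[int]) -> List[int]:
    """Keep whichever candidate block set has the smaller estimated I/O cost."""
    return min(density_plan, locality_plan, key=estimated_io_cost)


# Candidate plans for the toy query: scattered dense blocks vs. one contiguous run.
choice = hybrid([0, 2], [2, 3])
print(choice, estimated_io_cost(choice))
```

With a cost model in which seeks dominate, the contiguous plan wins here; a different model, or a storage medium with cheap random reads, could reverse the choice.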
Experimental Setting
- Airline dataset
  - 123 million rows and 11 attributes, with a total size of 11 GB
- Baselines:
  - Bitmap-Scan
  - Lossy-Bitmap
  - EWAH
- Queries
Experimental Results
- Hybrid: 4x faster
- I/O: 90% of the runtime
[Figure: Query runtimes for the airline workload on an HDD, split into CPU and I/O]
Experimental Results
- Uncompressed bitmaps: 47x more memory
- EWAH: 3x more memory
- Lossy: slower query performance due to many false positives
[Figure: Memory consumption of index structures]
More in the paper!
✓ Density Maps
✓ ANY-K algorithms
- Aggregation estimation
- Grouping + join
- More experimental results
[Figure: Needletail architecture]
Technical Report: http://data-people.cs.illinois.edu/needletail.pdf