LSH-Based Probabilistic Pruning of Inverted Indices for Sets and Ranked Lists
Koninika Pal and Sebastian Michel
pal@cs.uni-kl.de, smichel@cs.uni-kl.de
TU Kaiserslautern, Germany
K. Pal - WebDB 2017
Introduction
• Top-k rankings, preference lists
• Some applications:
– finding similar queries by their results,
– mining relations between entities,
– recommender systems, e.g. business promotion, etc.
• Similarity search over ranked lists or sets of preferences
Inverted Index
• An inverted index handles set similarity efficiently.
1 → ⟨τ1⟩, ⟨τ2⟩        τ1 = [2, 5, 4, 3, 1]
2 → ⟨τ1⟩, ⟨τ2⟩        τ2 = [1, 4, 7, 5, 2]
3 → ⟨τ1⟩              τ3 = [0, 8, 7, 5, 6]
...
• Filter: look up the inverted index for each element of the query and collect candidates.
• Validate: calculate the distance between each candidate and the query.
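The filter-and-validate scheme can be sketched as follows (a minimal Python sketch, assuming Jaccard distance as the dissimilarity measure; all names are illustrative, not the authors' implementation):

```python
from collections import defaultdict

def build_index(sets):
    """Map each element to the list of set ids containing it (the posting list)."""
    index = defaultdict(list)
    for sid, s in sets.items():
        for x in s:
            index[x].append(sid)
    return index

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def query(index, sets, q, theta):
    # Filter: every set sharing at least one element with q is a candidate.
    candidates = {sid for x in q for sid in index.get(x, [])}
    # Validate: keep only candidates within the distance threshold.
    return {sid for sid in candidates if jaccard_distance(sets[sid], q) <= theta}
```

With the three example sets from the slide and the query [2, 5, 4, 3, 1], θ = 0.5 keeps τ1 and τ2 but filters out τ3.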
Motivation
• Higher similarity → more overlapping elements.
• Using multiple elements as a key → more precision.
Simple index                 Pairwise index
1 → ⟨τ1⟩, ⟨τ2⟩               (1, 2) → ⟨τ1⟩, ⟨τ2⟩
2 → ⟨τ1⟩, ⟨τ2⟩               (1, 3) → ⟨τ1⟩
3 → ⟨τ1⟩                     (2, 3) → ⟨τ1⟩
...                          ...
• But the number of keys grows combinatorially:
– increased size of the index structure,
– increased lookups at query time.
– Example: a query of size 10 yields 10 single-element keys, but 45 pairwise keys to access.
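The key blow-up for the pairwise index is just C(k, 2); a tiny sketch (illustrative names):

```python
from itertools import combinations

def pairwise_keys(s):
    """All element pairs of a set in canonical (sorted) order, usable as index keys."""
    return list(combinations(sorted(s), 2))

q = [2, 5, 4, 3, 1, 6, 7, 8, 9, 10]
keys = pairwise_keys(q)   # C(10, 2) = 45 keys instead of 10 single-element keys
```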
Motivation
• Higher similarity → more overlapping elements.
• Using multiple elements as a key → more precision.
Simple index                      Pairwise index
5 → ⟨τ1⟩, ⟨τ2⟩, ⟨τ3⟩              (5, 6) → ⟨τ1⟩
6 → ⟨τ3⟩                          (7, 5) → ⟨τ2⟩, ⟨τ3⟩
7 → ⟨τ2⟩, ⟨τ3⟩                    ...
...
• Increased size of the index structure, increased lookups at query time.
• How do we prune the index?
• How do we measure the effect of pruning on similarity search?
• Key idea: connecting the index structure with locality-sensitive hashing (LSH).
Problem Description
• Collection T of sets τi of size k, e.g. τi = [2, 5, 4, 3]
• Input at query time:
– a query q of size k,
– a distance threshold θ.
• Set similarity: compute R = { τi | τi ∈ T and d(τi, q) ≤ θ }
– d(τi, q) = dissimilarity measure between τi and q,
– R = result set when using the complete index structure.
• Index pruning factor φ.
• A query on the pruned index returns result set Rp, with Rp ⊆ R.
• Additional input to similarity search:
– recall threshold ϱ.
• Objective: maximize φ subject to |Rp| / |R| ≥ ϱ.
Content
• Motivation & Problem
• Pruning of Inverted Index
• Query Processing on Pruned Index
• Experimental Results
• Conclusions
Pruning of Index Structure
Three randomized strategies, each governed by the pruning factor φ:
• Randomly delete a factor φ of the elements from each set, then build the index.
• Randomly delete a factor φ of the keys, each with its complete index entry.
• Randomly delete a factor φ of the elements in each posting list.
Pruning of Index Structure
• Similarity with document search:
– Horizontal pruning: stop-word removal.
– Vertical pruning: term-based index pruning (scoring models: tf-idf, KL-divergence, etc.).
– Diagonal pruning: document-centric pruning.
• Contrast with common document retrieval:
– Queries and documents have the same size.
– No score-based document search method is used.
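Assuming the index is a plain mapping from key to posting list, the three pruning strategies might be sketched as follows (illustrative only; function names and the rounding of φ·n are assumptions, not the authors' implementation):

```python
import random

def horizontal_prune(index, phi, rng=random.Random(0)):
    """Drop a fraction phi of the keys together with their entire posting lists."""
    keys = list(index)
    keep = rng.sample(keys, k=round(len(keys) * (1 - phi)))
    return {k: index[k] for k in keep}

def vertical_prune(index, phi, rng=random.Random(0)):
    """Drop a fraction phi of the elements inside every posting list."""
    return {k: rng.sample(v, k=round(len(v) * (1 - phi))) for k, v in index.items()}

def diagonal_prune(sets, phi, rng=random.Random(0)):
    """Drop a fraction phi of the elements of every set, then rebuild the index."""
    pruned = {sid: rng.sample(list(s), k=round(len(s) * (1 - phi)))
              for sid, s in sets.items()}
    index = {}
    for sid, s in pruned.items():
        for x in s:
            index.setdefault(x, []).append(sid)
    return index
```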
Content
• Motivation & Problem
• Pruning of Inverted Index
• Query Processing on Pruned Index
• Experimental Results
• Conclusions
Connecting Index with LSH Family
Hash tables (LSH index)                   Inverted index
hash_key1 → objects mapping to key1       key1 → posting list
hash_key2 → objects mapping to key2       key2 → posting list
......                                    ......
One-to-one mapping.
• Example: project sets onto single elements: h_x(τi) = x if x ∈ τi
τ2 = [1, 4, 7, 5, 2]      h_7(τ2) = 7
τ3 = [0, 8, 7, 5, 6]      h_7: 7 → ⟨τ2⟩, ⟨τ3⟩
• Multiple hash functions can be used conjunctively in LSH:
h_7, h_5: (7, 5) → ⟨τ2⟩, ⟨τ3⟩
Properties of LSH
• Why LSH?
– Similar objects have a higher probability of colliding in the same bucket.
– The number of index entries (l) to look up must be tuned to reach recall ϱ:
ϱ = 1 − (1 − P1^m)^l
• What do we need?
– The collision probability P1 of the hash function.
– The number m of hash functions used conjunctively at query time.
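Given P1 and m, the number of lookups l follows by solving the recall formula above for l (a small sketch; taking the ceiling guarantees the recall threshold is met):

```python
import math

def required_lookups(p1, m, recall):
    """Smallest integer l with 1 - (1 - p1**m)**l >= recall."""
    return math.ceil(math.log(1 - recall) / math.log(1 - p1 ** m))
```

For example, with P1 = 0.5, m = 2 and ϱ = 0.99, 17 lookups suffice; a higher collision probability P1 = 2/3 brings this down to 8.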
Query Processing on Pruned Index
• Pruning the index → dropping objects from the LSH index.
• Missing collisions during query processing:
– the object and the query are not similar, or
– the object was dropped due to pruning.
• Access more entries (l) than the plain LSH method requires.
• How many extra index lookups are required?
Ad-hoc Query Processing
• Continue index lookups until l successful accesses, with l from
ϱ = 1 − (1 − P1^m)^l
• Max. lookups → look up all keys from the query.
• Expected number of lookups: E[l] = (1/f) · l,
where f is the modifying factor in the collision probability.
Probabilistic Query Processing
• Find the modified collision probability.
• Find the required modified number of accesses l_Y:
ϱ = 1 − (1 − (f_Y · P1)^m)^(l_Y)
• Modifying factor f_Y = function(φ), Y ∈ {h, v, d}.
• Modifying factor of horizontal pruning, f_h:
– φ = fraction of the index pruned → a fraction φ of the keys is removed,
– f_h = 1 − φ.
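Under the recall formula with modified collision probability (f_Y · P1)^m and f_h = 1 − φ, l_Y can be computed directly (a sketch; the per-case P1 value below, the Jaccard collision probability for distance threshold 0.1, is taken from the case study and is an assumption of this example):

```python
import math

def modified_lookups(p1, m, recall, f):
    """Accesses l_Y needed so that 1 - (1 - (f * p1)**m)**l_Y >= recall."""
    return math.log(1 - recall) / math.log(1 - (f * p1) ** m)

# Horizontal pruning with phi = 0.8 on a pairwise index (m = 2),
# recall threshold 99%, Jaccard similarity s = 0.9: P1 = 2s / (1 + s).
p1 = 2 * 0.9 / 1.9
l_horizontal = modified_lookups(p1, 2, 0.99, f=1 - 0.8)   # roughly 126 lookups
```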
Optimizing the Pruning Factor
• Max. lookups → look up all C(k, t) keys of the query.
• The number of lookups l_Y is bounded by C(k, t).
• Number of accesses l_Y: ϱ = 1 − (1 − (f_Y · P1)^m)^(l_Y)
• Modifying factor f_Y = function(φ).
• φ* = argmax_φ { C(k, t) − l_Y ≥ 0 }
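The optimization can be sketched as a scan over candidate pruning factors, keeping the largest φ whose required l_Y still fits into the C(k, t) query keys (assumes the modified recall formula with f = 1 − φ, as for horizontal pruning; the grid of φ values is an assumption):

```python
import math

def lookups(p1, m, recall, f):
    """Required accesses l_Y under modified collision probability (f * p1)**m."""
    return math.log(1 - recall) / math.log(1 - (f * p1) ** m)

def optimal_phi(p1, m, recall, k, t, grid=None):
    """Largest pruning factor phi whose l_Y still fits into the C(k, t) query keys."""
    budget = math.comb(k, t)
    grid = grid or [i / 10 for i in range(1, 10)]
    feasible = [phi for phi in grid if lookups(p1, m, recall, 1 - phi) <= budget]
    return max(feasible) if feasible else None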
Case Studies
• Case 1: Jaccard distance over sets
– use the pairwise index,
– relate the LSH index [1] to the pairwise index:
ϱ = 1 − (1 − P1^m)^l, with P1 = 2θ / (1 + θ), m = 2
• Case 2: Kendall's tau distance over rankings
[1] Koninika Pal and Sebastian Michel. Efficient Similarity Search across Top-k Lists under the Kendall's Tau Distance. In SSDBM 2016.
Content
• Motivation & Problem
• Pruning of Inverted Index
• Query Processing on Pruned Index
• Experimental Results
• Conclusions
Experimental Setup
• Datasets:
– LiveJ: 100,000 profiles from LiveJournal, truncated to set size 20.
– Yago: 25,000 top-20 rankings, Wikipedia-based.
• 5 consecutive experimental runs over 1,000 queries.
• Recall threshold ϱ = 99%.
• Baseline approach: the plain LSH method on the non-pruned index structures.
• Full scan and prefix filtering [2] on the simple index retrieve more than 5 times as many candidates as the baseline approach.
[2] Jiannan Wang, Guoliang Li, and Jianhua Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD 2012.
Theoretically Established Parameters for LiveJ

     not pruned | Horizontal pruning  | Vertical pruning    | Diagonal pruning
θ    l          | φ*   l_h   E[l_h]   | φ*   l_v   E[l_v]   | φ*   l_d   E[l_d]
0.1  2          | 0.8  125   10       | 0.8  125   10       | 0.5   95   8.6
0.3  4          | 0.8  167   20       | 0.8  167   20       | 0.5  126   17.4
0.5  8          | 0.7  112   26.6     | 0.7  112   26.6     | 0.4   87   23.5

φ*: optimal pruning factor, Y ∈ {h, v, d}.
l_Y: number of scans for probabilistic query processing.
E[l_Y]: expected number of scans until l successful scans.
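The horizontal-pruning column can be reproduced, up to rounding, from the formulas above (assuming the Jaccard collision probability P1 = 2(1 − θ)/(2 − θ) for distance threshold θ, f_h = 1 − φ, m = 2 and ϱ = 0.99; these assumptions are this example's, not stated verbatim on the slide):

```python
import math

def p1(theta):
    """Assumed Jaccard collision probability for distance threshold theta."""
    return 2 * (1 - theta) / (2 - theta)

def l_y(theta, phi, recall=0.99, m=2):
    """Scans for probabilistic query processing under horizontal pruning."""
    f = 1 - phi
    return math.log(1 - recall) / math.log(1 - (f * p1(theta)) ** m)

def expected_scans(theta, phi, recall=0.99, m=2):
    """Expected scans until l successful scans: E[l] = (1/f) * l."""
    l = round(math.log(1 - recall) / math.log(1 - p1(theta) ** m))  # unpruned l
    return l / (1 - phi)
```

Plugging in the table's (θ, φ*) pairs gives l_h ≈ 126, 167, 113 and E[l_h] = 10, 20, 26.7, matching the tabulated values up to rounding.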
Experimental Results for Probabilistic Query Processing on LiveJ

θ    Pruning method  Time (ms)  #candidates  recall  #successful scans  l_Y  Baseline #candidates
0.1  Horizontal      11.17      10031.3      100      24.6              125  5105.3
0.3  Horizontal      11.54      13257.0      100      33.9              167  7360.4
0.5  Horizontal      13.39      14452.2      100      33.6              112  9059.5
0.1  Vertical        14.0       11252.9      100     125                125  5105.3
0.3  Vertical        9.8        12208.7      100     167                167  7360.4
0.5  Vertical        11.0       14001.9      100     112                112  9059.5
0.1  Diagonal        10.38      10378.3      99.5     79.69              95  5105.3
0.3  Diagonal        11.06      11512.7      100     104.58             126  7360.4
0.5  Diagonal        11.32      13003.1      99.7     76.84              87  9059.5