Efficient Parallel Partition based Algorithms for Similarity Search

  1. Motivation Our Approach Experiment Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang, Dong Deng, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition on EDBT/ICDT 2013 Dong Deng Parallel PassJoin

  2. Outline. (1) Motivation: Problem Definition, Application. (2) Our Approach: Pass Join Algorithm, Additional Filters, Parallel. (3) Experiment: Evaluating Pruning Techniques, Evaluating Parallelism, Evaluating Scalability.

  3. Problem Definition. String Similarity Joins: Given a set of strings $S$, the task is to find all pairs of $\tau$-similar strings from $S$, that is, pairs whose edit distance is at most $\tau$. A program must output all matches with both string identifiers and distance $\tau$. (Track II)

  4. An Example.

     Table: A string dataset

     ID    Strings              Length
     s1    vankatesh            9
     s2    avataresha           10
     s3    kaushic chaduri      15
     s4    kaushik chakrab      15
     s5    kaushuk chadhui      15
     s6    caushik chakrabar    17

     Consider the string dataset in Table 1. Suppose $\tau = 3$. $\langle s_4, s_6 \rangle$ is a similar pair, as $ED(s_4, s_6) \le \tau$.

  5. Application. Data cleaning; information extraction; comparison of biological sequences; ...

  6. Basic Idea. Lemma: Given a string $r$ with $\tau + 1$ segments and a string $s$, if $s$ is similar to $r$ within threshold $\tau$, $s$ must contain a segment of $r$. Example: $\tau = 1$, $r$ = "EDBT" has two segments, "ED" and "BT". $s$ = "ICDT" cannot be similar to $r$, as $s$ contains neither of the two segments.
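
The lemma follows from a pigeonhole argument: $\tau$ edit operations can touch at most $\tau$ of the $\tau + 1$ segments, so at least one segment survives unchanged. A minimal sketch of the resulting filter, with a hypothetical function name and the segments passed in by hand:

```python
def passes_segment_filter(segments, s):
    """Return True if s contains at least one segment of r as a substring.

    A False result proves that s cannot be within the threshold of r;
    a True result only makes (r, s) a candidate that still needs verification.
    """
    return any(seg in s for seg in segments)


# The slide's example: tau = 1, r = "EDBT" split into tau + 1 = 2 segments.
segments_of_r = ["ED", "BT"]
print(passes_segment_filter(segments_of_r, "ICDT"))  # pruned
print(passes_segment_filter(segments_of_r, "EDIT"))  # candidate
```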

  7. Even Partition Scheme. Definition: In the even partition scheme, each segment has almost the same length ($\lfloor |s|/(\tau + 1) \rfloor$ or $\lceil |s|/(\tau + 1) \rceil$). Example: $\tau = 3$, we partition $s_1$ = "vankatesh" into four segments "va", "nk", "at", "esh".
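
A short sketch of the even partition scheme follows. Placing the longer segments at the end is an assumption chosen to reproduce the slide's example; the function name and the 0-based start positions are illustrative choices.

```python
def even_partition(s, tau):
    """Split s into tau + 1 segments of length floor or ceil of |s| / (tau + 1).

    Returns a list of (start, segment) pairs with 0-based start positions.
    The `extra` longer segments are placed last (an assumption that matches
    the "vankatesh" example).
    """
    n_seg = tau + 1
    base, extra = divmod(len(s), n_seg)
    segments, pos = [], 0
    for k in range(n_seg):
        length = base + (1 if k >= n_seg - extra else 0)
        segments.append((pos, s[pos:pos + length]))
        pos += length
    return segments


print(even_partition("vankatesh", 3))
```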

  8. Substring Selection: Basic Methods. Enumeration: enumerate all substrings for each segment. Length-based: for each segment, select only substrings with the same length as the segment. Shift-based: for a segment with start position $p_i$, select substrings with start position in $[p_i - \tau, p_i + \tau]$.
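
To make the difference between the three basic methods concrete, the sketch below counts how many substrings of a probe string each one selects for a single segment. The function names, the 0-based positions, and the clipping of windows to valid start positions are assumptions made for illustration.

```python
def enumeration_count(s):
    """Number of non-empty substrings of s (all start/end combinations)."""
    n = len(s)
    return n * (n + 1) // 2


def length_based_starts(s, seg_len):
    """Start positions of all substrings of s whose length equals seg_len."""
    return list(range(0, len(s) - seg_len + 1))


def shift_based_starts(s, seg_len, p, tau):
    """Start positions in [p - tau, p + tau], clipped to positions valid in s."""
    lo = max(0, p - tau)
    hi = min(len(s) - seg_len, p + tau)
    return list(range(lo, hi + 1))


# Probe s2 = "avataresha" with the first segment "va" of s1 (start 0, length 2).
s = "avataresha"
print(enumeration_count(s))
print(len(length_based_starts(s, 2)))
print(shift_based_starts(s, 2, p=0, tau=3))
```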

  9. Substring Selection: Position-aware Substring Selection. Theorem (Position-aware Substring Selection): For a segment with start position $p_i$, select substrings with start position in $[p_i - \lfloor (\tau - \Delta)/2 \rfloor,\ p_i + \lfloor (\tau + \Delta)/2 \rfloor]$, where $\Delta = |s| - |r|$.

  11. Substring Selection: Position-aware Substring Selection. Example: $\tau = 3$, $\Delta = 1$, $[p_i - \lfloor (\tau - \Delta)/2 \rfloor,\ p_i + \lfloor (\tau + \Delta)/2 \rfloor] = [p_i - 1, p_i + 2]$.
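
The window in this example can be computed directly from the theorem. The helper below is a small sketch with a made-up name; it assumes $|\Delta| \le \tau$, since otherwise the two strings cannot be within the threshold at all.

```python
def position_aware_window(p, tau, delta):
    """Inclusive range of start positions for a segment starting at p.

    delta is |s| - |r|. The bounds follow the position-aware theorem:
    [p - floor((tau - delta) / 2), p + floor((tau + delta) / 2)].
    """
    if abs(delta) > tau:
        raise ValueError("length difference exceeds tau; the pair cannot match")
    return p - (tau - delta) // 2, p + (tau + delta) // 2


# The slide's example: tau = 3, delta = 1 gives [p - 1, p + 2].
print(position_aware_window(10, tau=3, delta=1))
```

The window holds $\tau + 1$ start positions, compared with $2\tau + 1$ for the shift-based window $[p_i - \tau, p_i + \tau]$.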

  12. Substring Selection: Multi-match-aware Substring Selection. Observation: There must be another matching between $r_r$ and $s_r$. Theorem (Multi-match-aware Substring Selection): For the $i$-th segment with start position $p_i$, select substrings within $[p_i - i,\ p_i + i] \cap [p_i + \Delta - (\tau + 1 - i),\ p_i + \Delta + (\tau + 1 - i)]$.
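
The intersection in the theorem reduces to a maximum of the lower bounds and a minimum of the upper bounds. The sketch below copies the bounds exactly as printed on the slide; the function name is hypothetical. If the left-hand window is meant to mirror the $i - 1$ threshold that the extension-based verification slide uses for the left parts, replace `i` with `i - 1` in the first interval.

```python
def multi_match_window(p, i, tau, delta):
    """Inclusive start-position range for the i-th segment (1-based i).

    Intersects [p - i, p + i] with
    [p + delta - (tau + 1 - i), p + delta + (tau + 1 - i)],
    using the bounds as printed on the slide. Returns None when the
    intersection is empty.
    """
    lo = max(p - i, p + delta - (tau + 1 - i))
    hi = min(p + i, p + delta + (tau + 1 - i))
    return (lo, hi) if lo <= hi else None


print(multi_match_window(p=10, i=2, tau=3, delta=1))
```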

  14. Substring Selection: Multi-match-aware Substring Selection. Example (figure not preserved in this transcript).

  15. Substring Selection: Theoretical Results. (1) The number of substrings selected by the multi-match-aware method is minimum. (2) For strings longer than $2(\tau + 1)$, our selection method is the only way to select the minimum number of substrings.

  16. Substring Selection: Experimental Results. Figure: Numbers of selected substrings (y-axis: # of selected substrings; x-axis: threshold $\tau$) for the Length, Shift, Position, and Multi-Match methods on (a) Author Name (Avg Len = 15, $\tau$ from 1 to 4), (b) Query Log (Avg Len = 45, $\tau$ from 4 to 8), and (c) Author+Title (Avg Len = 105, $\tau$ from 5 to 10).

  17. Substring Selection: Experimental Results. Figure: Elapsed time for generating substrings (y-axis: Selection Time (s); x-axis: threshold $\tau$) for the Length, Shift, Position, and Multi-Match methods on (a) Author Name (Avg Len = 15, $\tau$ from 1 to 4), (b) Query Log (Avg Len = 45, $\tau$ from 4 to 8), and (c) Author+Title (Avg Len = 105, $\tau$ from 5 to 10).

  18. Verification: Length-aware Verification. Inspired by the position-aware substring selection. Saves at least half of the computation compared with the traditional dynamic-programming method. Saves even more using improved early termination.
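
One way to read "saves at least half" is that each row of the dynamic-programming matrix needs only $\tau + 1$ cells instead of $2\tau + 1$. The sketch below is an assumption about how such a verifier could look, not the talk's implementation: it reuses the position-aware bounds as the band around the diagonal and stops as soon as every cell in a row exceeds $\tau$. The name `banded_edit_distance` is made up.

```python
def banded_edit_distance(r, s, tau):
    """Return ED(r, s) if it is at most tau, otherwise tau + 1.

    Only cells (i, j) with i - lo <= j <= i + hi are computed, where
    lo = floor((tau - delta) / 2), hi = floor((tau + delta) / 2) and
    delta = |s| - |r| (band borrowed from the position-aware window).
    """
    delta = len(s) - len(r)
    if abs(delta) > tau:
        return tau + 1
    lo, hi = (tau - delta) // 2, (tau + delta) // 2
    inf = tau + 1  # stands for "already above the threshold"
    prev = [j if j <= hi else inf for j in range(len(s) + 1)]
    for i in range(1, len(r) + 1):
        cur = [inf] * (len(s) + 1)
        start, end = max(0, i - lo), min(len(s), i + hi)
        for j in range(start, end + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of r
            else:
                cost = 0 if r[i - 1] == s[j - 1] else 1
                cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        if min(cur[start:end + 1]) > tau:
            return tau + 1  # early termination: no cell can recover
        prev = cur
    return min(prev[len(s)], tau + 1)


print(banded_edit_distance("kaushik chakrab", "caushik chakrabar", 3))
```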

  21. Verification: Extension-based Verification. Inspired by the multi-match-aware substring selection. Uses tighter thresholds to verify the candidate pairs. Verify whether $ED(r_r, s_r) \le \tau + 1 - i$ and $ED(r_l, s_l) \le i - 1$.
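
The two tighter checks can be sketched as follows, taking $r_l$, $r_r$ (and $s_l$, $s_r$) to be the parts of each string to the left and right of the matching segment. The function names, the 0-based positions, and the use of a plain (unthresholded) edit distance are assumptions for illustration; segment positions in the usage lines were worked out by hand from an even partition of $s_4$ with $\tau = 3$.

```python
def edit_distance(a, b):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def extension_verify(r, s, i, r_start, s_start, seg_len, tau):
    """Check a candidate found through the i-th segment of r (1-based i).

    The segment r[r_start:r_start + seg_len] equals s[s_start:s_start + seg_len].
    The left parts must be within i - 1 and the right parts within tau + 1 - i.
    """
    r_left, r_right = r[:r_start], r[r_start + seg_len:]
    s_left, s_right = s[:s_start], s[s_start + seg_len:]
    if edit_distance(r_left, s_left) > i - 1:
        return False  # left side already fails; skip the right side
    return edit_distance(r_right, s_right) <= tau + 1 - i


r, s = "kaushik chakrab", "caushik chakrabar"
# Segment 2 of r is "shik" (start 3); it also occurs at start 3 in s.
print(extension_verify(r, s, i=2, r_start=3, s_start=3, seg_len=4, tau=3))
# Segment 4 of r is "krab" (start 11); it also occurs at start 11 in s.
print(extension_verify(r, s, i=4, r_start=11, s_start=11, seg_len=4, tau=3))
```

A failed check for one matching segment does not by itself mean the pair is dissimilar; under this reading the pair is reported when some matching segment passes both checks, as segment 2 does here while segment 4 does not.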

  24. Verification: Experimental Results. Figure: Elapsed time for verification (y-axis: Elapsed Time (s); x-axis: threshold $\tau$) for the $2\tau + 1$, $\tau + 1$, Extension, and SharePrefix methods on (a) Author Name (Avg Len 15, $\tau$ from 1 to 4), (b) Query Log (Avg Len 45, $\tau$ from 4 to 8), and (c) Author+Title (Avg Len 105, $\tau$ from 5 to 10).

  25. Additional Filters: Effective Indexing Strategy. Partition longer strings into segments. Select substrings from shorter strings. Longer segments decrease the possibility of matching, and thus decrease the number of candidates.
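
The indexing strategy can be sketched as a small sequential self-join: segments of every string go into an inverted index, and each string probes only the index entries of strings that are at least as long as itself. Everything below is an illustrative assumption rather than the talk's implementation: the names, the index layout, the use of the simple shift-based window, and plain edit distance for verification.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def segment_layout(length, tau):
    """(start, seg_len) for each of the tau + 1 even segments, longer ones last."""
    base, extra = divmod(length, tau + 1)
    layout, pos = [], 0
    for k in range(tau + 1):
        seg_len = base + (1 if k >= tau + 1 - extra else 0)
        layout.append((pos, seg_len))
        pos += seg_len
    return layout


def similarity_join(strings, tau):
    """Return the set of id pairs whose strings are within edit distance tau."""
    # index[string length][segment number][segment text] -> ids
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, text in strings.items():
        for k, (p, n) in enumerate(segment_layout(len(text), tau)):
            index[len(text)][k][text[p:p + n]].append(sid)

    results = set()
    for sid, text in strings.items():
        # Probe only strings of equal or greater length: the longer string
        # is the partitioned one, the shorter one supplies substrings.
        for length in range(len(text), len(text) + tau + 1):
            if length not in index:
                continue
            for k, (p, n) in enumerate(segment_layout(length, tau)):
                lo, hi = max(0, p - tau), min(len(text) - n, p + tau)
                for start in range(lo, hi + 1):
                    for rid in index[length][k].get(text[start:start + n], ()):
                        if rid != sid and edit_distance(strings[rid], text) <= tau:
                            results.add(tuple(sorted((rid, sid))))
    return results


strings = {
    "s1": "vankatesh", "s2": "avataresha", "s3": "kaushic chaduri",
    "s4": "kaushik chakrab", "s5": "kaushuk chadhui", "s6": "caushik chakrabar",
}
print(similarity_join(strings, tau=3))
```

This sketch may verify the same pair more than once when several segments match; a real implementation would deduplicate candidates before verification.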
