VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Authors: Chen Li; Bin Wang and Xiaochun Yang (Northeastern University, China)


  1. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
      - Chen Li; Bin Wang and Xiaochun Yang (Northeastern University, China)

  2. Approximate selection queries
      - Example: the misspelled query "Schwarrzenger" against a table of names (Keanu Reeves, Samuel Jackson, Schwarzenegger, …)
      - Query errors: limited knowledge about data; typos; limited input device (cell phone) input
      - Data errors: typos; web data; OCR
      - Applications: spellchecking; query relaxation; …

  3. Record linkage
      - Example: table R (infromix, …, microsoft, …) matched against table S (informix, …, mcrosoft, …)
      - Similarity functions: edit distance; Jaccard; cosine; …
      - Applications: record linkage; …
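Edit distance is the similarity function used in the rest of the deck (written ED or ed). As a reminder of what it measures, here is a minimal sketch of the standard Levenshtein dynamic program, applied to the record pairs from slide 3. The function name `edit_distance` is chosen for this illustration and is not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,         # delete ca
                           cur[j - 1] + 1,      # insert cb
                           prev[j - 1] + cost)) # match or substitute
        prev = cur
    return prev[-1]


for r, s in [("infromix", "informix"), ("microsoft", "mcrosoft")]:
    print(r, s, edit_distance(r, s))
```

Plain Levenshtein distance has no transposition operation, so the swapped letters in "infromix" cost two operations.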

  4. "q-grams" of strings
      - Example: the 2-grams of "universal" are un, ni, iv, ve, er, rs, sa, al
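The q-grams of a string are its overlapping substrings of length q. A minimal sketch follows; the helper name `qgrams` is an assumption made for this example.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, which is the first term of the lower bound on slide 28.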

  5. q-gram inverted lists
      - Strings (id: string): 0: rich; 1: stick; 2: stich; 3: stuck; 4: static
      - 2-gram inverted lists (gram → ids): at → 4; ch → 0, 2; ck → 1, 3; ic → 0, 1, 2, 4; ri → 0; st → 1, 2, 3, 4; ta → 4; ti → 1, 2, 4; tu → 3; uc → 3
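The index on slide 5 maps each gram to the ids of the strings containing it. A small sketch that rebuilds it from the five example strings is given below; `build_index` is a hypothetical name, and each id is stored at most once per gram, which is a simplification chosen here.

```python
from collections import defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]  # ids 0..4, as on slide 5


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the sorted list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):   # set(): record each id once per gram
            index[g].append(sid)      # ids arrive in increasing order
    return dict(index)


index = build_index(STRINGS, 2)
for gram in sorted(index):
    print(gram, index[gram])
```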

  6. Searching using inverted lists
      - Query: "shtick", ED(shtick, ?) ≤ 1
      - 2-grams of the query: sh, ht, ti, ic, ck; of these, ti, ic, ck appear in the index
      - # of common grams >= 3
      - Figure: the 2-gram inverted lists for rich, stick, stich, stuck, static
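The search on slide 6 can be sketched as a count filter followed by verification: merge the inverted lists of the query's grams, keep ids that occur at least as often as the lower bound, then compute the real edit distance for the survivors. The function names and the fallback for a non-positive threshold are assumptions of this sketch, not details from the slides.

```python
from collections import Counter, defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(strings, query, k, q):
    # Build the q-gram inverted index (each id once per gram).
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    # Lower bound on common grams for strings within edit distance k.
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        # The filter cannot prune anything, so every string is a candidate.
        candidates = list(range(len(strings)))
    else:
        counts = Counter(sid for g in qgrams(query, q) for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    # Verification step: compute the actual edit distance.
    answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]
    return threshold, candidates, answers


print(search(STRINGS, "shtick", k=1, q=2))  # (3, [1], [1])
```

Only "stick" (id 1) shares three grams with the query, so a single edit-distance computation is needed.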

  7. 2-grams → 3-grams?
      - Query: "shtick", ED(shtick, ?) ≤ 1
      - 3-grams of the query: sht, hti, tic, ick; of these, tic, ick appear in the index
      - # of common grams >= 1
      - 3-gram inverted lists (gram → ids): ati → 4; ich → 0, 2; ick → 1; ric → 0; sta → 4; sti → 1, 2; stu → 3; tat → 4; tic → 1, 2, 4; tuc → 3; uck → 3
      - Shorter inverted lists
      - More false positives
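The trade-off on slide 7 can be checked numerically on the five example strings. The sketch below reports, for a given q, the average inverted-list length, the count threshold, and the candidate ids that pass the filter. The helper `compare` is hypothetical and assumes the threshold is at least 1.

```python
from collections import Counter, defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def compare(strings, query, k, q):
    """Return (average list length, count threshold, candidate ids) for one q."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    avg_list_len = sum(len(ids) for ids in index.values()) / len(index)
    threshold = (len(query) - q + 1) - k * q      # assumed to be >= 1 here
    counts = Counter(sid for g in qgrams(query, q) for sid in index.get(g, []))
    candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    return avg_list_len, threshold, candidates


for q in (2, 3):
    print(q, compare(STRINGS, "shtick", k=1, q=q))
```

With q = 2 the lists average 2.0 ids and one candidate survives; with q = 3 the lists are shorter (15 postings over 11 lists) but three candidates survive, two of which are false positives.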

  8. Outline
      - Motivation
      - VGRAM
          - Main idea
          - Decomposing strings to grams
          - Choosing good grams
          - Effect of edit operations on grams
          - Adopting VGRAM in existing algorithms
      - Experiments

  9. Motivation
      - Small index size (memory)
      - Small running time
          - Merge matched inverted lists
          - Calculate ED(query, candidate)

  10. Observation 1: the dilemma of choosing "q"
      - Increasing "q" causes:
          - Longer grams, hence shorter lists
          - A smaller # of common grams between similar strings
      - Figure: the 2-gram inverted lists for rich, stick, stich, stuck, static

  11. Observation 2: skewed distributions of gram frequencies
      - DBLP: 276,699 article titles
      - Popular 5-grams: ation (>114K times), tions, ystem, catio

  12. VGRAM: Main idea
      - Grams with variable lengths (between q_min and q_max)
          - zebra: ze(123)
          - corrasion: co(5213), cor(859), corr(171)
      - Advantages
          - Reducing index size ☺
          - Reducing running time ☺
          - Adoptable by many algorithms ☺

  13. Challenges
      - Generating variable-length grams?
      - Constructing a high-quality gram dictionary?
      - Relationship between string similarity and gram-set similarity?
      - Adopting VGRAM in existing algorithms?

  14. Challenge 1: String → variable-length grams?
      - Fixed-length 2-grams (figure: the string "universal")
      - Variable-length grams (figure: "universal" decomposed with the [2,4]-gram dictionary {ni, ivr, sal, uni, vers})
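The slide shows only the figure, so the decomposition rule in the sketch below is an assumption: at each position take the longest dictionary gram that starts there, fall back to the q_min-gram if none matches, and skip any gram that lies entirely inside one already emitted. The name `decompose` and the 0-based positions are choices made for this example.

```python
def decompose(s, dictionary, qmin, qmax):
    """Greedy variable-length decomposition (assumed rule, see lead-in).

    Returns a list of (start position, gram) pairs.
    """
    grams = []
    for p in range(len(s) - qmin + 1):
        gram = None
        # Longest dictionary gram starting at p, if any.
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        if gram is None:
            gram = s[p:p + qmin]          # fall back to the shortest gram
        # Skip grams fully covered by an earlier, longer gram.
        covered = any(p0 <= p and p + len(gram) <= p0 + len(g0) for p0, g0 in grams)
        if not covered:
            grams.append((p, gram))
    return grams


D = {"ni", "ivr", "sal", "uni", "vers"}
print(decompose("universal", D, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under this rule "universal" yields four grams instead of the eight fixed-length 2-grams.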

  15. Representing gram dictionary as a trie
      - Figure: the [2,4]-gram dictionary {ni, ivr, sal, uni, vers} stored as a trie, next to the fixed-length 2-grams and variable-length grams of "universal"
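A trie lets the longest dictionary gram starting at a given position be found in a single downward walk. The sketch below uses nested dictionaries; the marker `END` and the function names are hypothetical and not taken from the slides.

```python
END = "$end"  # hypothetical marker key; never equals a single character


def build_trie(grams):
    """Store each gram as a root-to-node path, marking where a gram ends."""
    root = {}
    for g in grams:
        node = root
        for ch in g:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, s, start):
    """Longest dictionary gram that is a prefix of s[start:], or None."""
    node, best = trie, None
    for i in range(start, len(s)):
        node = node.get(s[i])
        if node is None:
            break
        if END in node:
            best = s[start:i + 1]
    return best


trie = build_trie(["ni", "ivr", "sal", "uni", "vers"])
print(longest_match(trie, "universal", 0))  # uni
print(longest_match(trie, "universal", 2))  # None ("ive" leaves the "ivr" path)
```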

  16. Challenge 2: Constructing gram dictionary
      - Selecting grams: pruning the trie using a frequency threshold T (e.g., 2)

  17. Challenge 2: Constructing gram dictionary
      - Selecting grams: pruning the trie using a frequency threshold T (e.g., 2)
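The slides do not spell out the pruning procedure, so the following is a deliberately simplified stand-in rather than the authors' method. It assumes one rule only: a gram is extended to longer grams only while its frequency exceeds T. Pruning policies such as LargeFirst (named on the experiment slides) are not modeled, and `select_grams` is a hypothetical name.

```python
from collections import Counter


def gram_frequencies(strings, qmin, qmax):
    """Count every substring of length qmin..qmax across the collection."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq


def select_grams(strings, qmin, qmax, T):
    """Keep a gram only if all its shorter prefixes (length >= qmin) occur more than T times.

    Assumed simplification: infrequent grams are kept but never extended.
    """
    freq = gram_frequencies(strings, qmin, qmax)
    return {g for g in freq
            if all(freq[g[:n]] > T for n in range(qmin, len(g)))}


STRINGS = ["rich", "stick", "stich", "stuck", "static"]
selected = select_grams(STRINGS, qmin=2, qmax=3, T=2)
print(sorted(g for g in selected if len(g) == 3))
# ['ich', 'ick', 'sta', 'sti', 'stu', 'tic']
```

Only the frequent 2-grams ic, st, and ti are extended here; a rare gram such as ri keeps its short form, which mirrors the intuition of slide 12 that common prefixes are lengthened until they become selective.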

  18. Final gram dictionary
      - Figure: the pruned trie, with the final grams marked

  19. Outline
      - Motivation
      - VGRAM
          - Main idea
          - Decomposing strings to grams
          - Choosing good grams
          - Effect of edit operations on grams
          - Adopting VGRAM in existing algorithms
      - Experiments

  20. Challenge 3: Edit operations' effect on grams
      - Fixed length q (figure: the string "universal")
      - k operations could affect k * q grams
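To see where the k * q figure comes from, the sketch below lists the fixed-length grams that contain a given character position; a deletion or substitution at that position can change only those grams, and there are never more than q of them. The helper `grams_touching` is hypothetical.

```python
def grams_touching(s, q, i):
    """q-grams of s, with start offsets, that contain the character at index i."""
    first = max(0, i - q + 1)          # earliest gram start that still covers i
    last = min(i, len(s) - q)          # latest gram start that fits in s
    return [(p, s[p:p + q]) for p in range(first, last + 1)]


s = "universal"
print(grams_touching(s, 2, 3))   # [(2, 'iv'), (3, 've')]
print(max(len(grams_touching(s, 3, i)) for i in range(len(s))))  # 3
```

Since each operation touches at most q grams, k operations touch at most k * q.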

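  21. Deletion affects variable-length grams
      - Figure: a deletion at position i; the region between positions i - q_max + 1 and i + q_max - 1 is labeled "Affected", and the regions on either side are labeled "Not affected"

  22. Grams affected by a deletion
      - Figure: the same window around a deletion at position i (from i - q_max + 1 to i + q_max - 1), now labeled "Affected?"
      - Example: a deletion in "universal" with [2,4]-grams

  23. Grams affected by a deletion (cont)
      - Figure: the same window around a deletion at position i, labeled "Affected?"
      - Figure: trie of grams; trie of reversed grams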

  24. # of grams affected by each operation
      - Figure: the string _u_n_i_v_e_r_s_a_l_, where each character is a deletion/substitution position and each underscore is an insertion position
      - Counts shown, left to right: 0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0

  25. Max # of grams affected by k operations
      - Figure: the same per-position counts as on slide 24
      - Vector of s = <2,4>: with 2 edit operations, at most 4 grams can be affected
      - Called the NAG vector (# of affected grams)
      - Precomputed

  26. Summary of VGRAM index

  27. Challenge 4: adopting VGRAM
      - Easily adoptable by many algorithms
      - Basic interfaces:
          - String s → grams
          - Strings s1, s2 such that ed(s1, s2) <= k → min # of their common grams

  28. Lower bound on # of common grams
      - Fixed length (q): if ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
      - Variable lengths: lower bound = # of grams of s1 - NAG(s1, k)
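Both bounds are simple arithmetic once the inputs are known. The sketch below evaluates them; the function names are hypothetical, the NAG vector is passed in as a precomputed list whose entry k - 1 holds the bound for k operations (an indexing choice of this example), and the gram count of 4 for "universal" assumes the decomposition uni, iv, vers, sal.

```python
def fixed_lower_bound(length, q, k):
    """(|s1| - q + 1) - k * q for fixed-length q-grams."""
    return (length - q + 1) - k * q


def vgram_lower_bound(num_grams, nag, k):
    """# of grams of s1 minus NAG(s1, k); nag[k - 1] is the bound for k operations."""
    return num_grams - nag[k - 1]


# Fixed-length bounds for the query "shtick" with k = 1 (slides 6 and 7).
print(fixed_lower_bound(len("shtick"), q=2, k=1))   # 3
print(fixed_lower_bound(len("shtick"), q=3, k=1))   # 1

# Variable-length bound for "universal" with NAG vector <2,4> (slide 25),
# assuming it decomposes into 4 grams.
print(vgram_lower_bound(4, [2, 4], k=1))            # 2
print(vgram_lower_bound(4, [2, 4], k=2))            # 0
```

A bound of zero or less means the count filter cannot prune anything for that k.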

  29. Example: algorithm using inverted lists
      - Query: "shtick", ED(shtick, ?) ≤ 1
      - 2-grams of the query: sh, ht, ti, ic, ck; lower bound = 3
      - 2-4 grams of the query: sh, ht, tick; lower bound = 1
      - Figure: the 2-gram inverted lists (ck, ic, ti, …) and the 2-4-gram inverted lists (ck, ic, ich, tic, tick, …) over rich, stick, stich, stuck, static (ids 0 to 4)

  30. Outline
      - Motivation
      - VGRAM
          - Main idea
          - Decomposing strings to grams
          - Choosing good grams
          - Effect of edit operations on grams
          - Adopting VGRAM in existing algorithms
      - Experiments

  31. Data sets
      - Data set 1: Texas Real Estate Commission. 151K person names, average length = 33.
      - Data set 2: English dictionary from the Aspell spellchecker for Cygwin. 149,165 words, average length = 8.
      - Data set 3: DBLP Bibliography. 277K titles, average length = 62.
      - Environment: VC++, Dell GX620 PC with an Intel Pentium 3.40 GHz dual-core CPU, 2 GB memory, Windows XP.

  32. VGRAM overhead (index size)
      - Figure. Dataset 3: DBLP titles, [5,7]-gram, T=500, LargeFirst pruning policy

  33. VGRAM overhead (construction time)
      - Figure. Dataset 3: DBLP titles, [5,7]-gram, T=500, LargeFirst pruning policy

  34. Benefits over fixed-length grams (index)
      - Figure. Dataset 1: 150K person names, k=1, MergeCount algorithm, T=1000, LargeFirst pruning policy

  35. Benefits over fixed-length grams (running time)
      - Figure. Dataset 1: 150K person names, k=1, MergeCount algorithm, T=1000, LargeFirst pruning policy

  36. Enhance approximate join algorithms
      - ProbeCount
      - ProbeCluster
      - PartEnum

  37. Improving algorithm ProbeCount
      - Figure. Dataset 1: 50K person names, k=3, [4,6]-gram, T=200, LargeFirst pruning policy

  38. Improving algorithm ProbeCluster
      - Figure. Dataset 1: [5,7]-gram, T=1000, LargeFirst pruning policy

  39. Improving algorithm PartEnum
      - Figure. Dataset 1: [4,6]-gram, T=1000, LargeFirst pruning policy

  40. Conclusions
      - VGRAM: using grams of
          - variable length
          - high quality
      - Adoptable in existing algorithms
          - Reduce index size
          - Reduce running time

  41. Related work
      - Approximate string matching
          - q-grams, q-samples
          - Inside DBMS
          - Substring matching
          - Set similarity join
      - Variable-length gram applications
          - Speech recognition, information retrieval, artificial intelligence
          - Substring selectivity estimation
      - Improving space and time efficiency
          - n-Gram/2L

  42. Questions or Comments? Thank you
