VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
Chen Li, Bin Wang, and Xiaochun Yang
Northeastern University, China
Approximate selection queries

Example: find names similar to "Schwarzenegger" (e.g., mistyped as "Schwarrzenger") in a collection containing Keanu Reeves, Samuel Jackson, ...

Sources of errors:
- Query errors: limited knowledge about the data, typos, limited input devices (e.g., cell phones)
- Data errors: typos, Web data, OCR

Applications: spellchecking, query relaxation, ...
Record linkage

Example: join tables R and S on similar strings (infromix ~ informix, mcrosoft ~ microsoft, ...)

- Similarity functions: edit distance, Jaccard, cosine, ...
- Applications: record linkage, ...
“ q-grams ” of strings u n i v e r s a l 2-grams 4
q-gram inverted lists

Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static

Inverted lists of their 2-grams:
  at -> 4
  ch -> 0, 2
  ck -> 1, 3
  ic -> 0, 1, 2, 4
  ri -> 0
  st -> 1, 2, 3, 4
  ta -> 4
  ti -> 1, 2, 4
  tu -> 3
  uc -> 3
Searching using inverted lists

Query: "shtick", ED(shtick, ?) <= 1
- 2-grams of the query: sh, ht, ti, ic, ck
- A string within edit distance 1 must share at least 3 grams with the query (count filter)
- Merge the inverted lists of the query's grams (ti, ic, ck, ...) and keep the string ids that appear at least 3 times
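The merge-and-count search above can be sketched as follows; a simplified implementation assuming distinct grams per string (the names `build_index` and `candidates` are illustrative):

```python
from collections import defaultdict

def build_index(strings, q):
    """Map each q-gram to the list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # use the set of grams so each string appears once per list
        for g in set(s[i:i + q] for i in range(len(s) - q + 1)):
            index[g].append(sid)
    return index

def candidates(index, query, q, k):
    """Count filter: a string within edit distance k of the query must
    share at least len(query) - q + 1 - k*q grams with it."""
    counts = defaultdict(int)
    for i in range(len(query) - q + 1):
        for sid in index.get(query[i:i + q], []):
            counts[sid] += 1
    lower = len(query) - q + 1 - k * q
    return [sid for sid, c in counts.items() if c >= lower]
```

On the slide's example (query "shtick", k = 1, q = 2) the lower bound is 5 - 2 = 3, and only "stick" (id 1) survives the filter; each candidate would then be verified with a real edit-distance computation.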
2-grams -> 3-grams?

Query: "shtick", ED(shtick, ?) <= 1
- 3-grams of the query: sht, hti, tic, ick
- Now a similar string need only share >= 1 gram with the query

Tradeoff:
- Shorter inverted lists (good)
- More false positives (bad)
Outline
- Motivation
- VGRAM
  - Main idea
  - Decomposing strings into grams
  - Choosing good grams
  - Effect of edit operations on grams
  - Adopting VGRAM in existing algorithms
- Experiments
Motivation
- Small index size (memory)
- Small running time
  - Merging matched inverted lists
  - Computing ED(query, candidate)
Observation 1: dilemma of choosing q

Increasing q causes:
- Longer grams
- Shorter inverted lists (good)
- Fewer common grams between similar strings, weakening the count filter (bad)
Observation 2: skewed distribution of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K occurrences), tions, ystem, catio
VGRAM: Main idea
- Use grams of variable lengths (between qmin and qmax)
  - zebra -> ze (123)
  - corrasion -> co (5213), cor (859), corr (171)
- Advantages:
  - Reduces index size
  - Reduces running time
  - Adoptable by many existing algorithms
Challenges
1. How to decompose a string into variable-length grams?
2. How to construct a high-quality gram dictionary?
3. What is the relationship between string similarity and gram-set similarity?
4. How to adopt VGRAM in existing algorithms?
Challenge 1: decomposing a string into variable-length grams
- Fixed-length 2-grams: u n i v e r s a l
- Variable-length grams: decompose "universal" using a [2,4]-gram dictionary, e.g., {ni, ivr, sal, uni, vers}
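A simplified longest-match sketch of the decomposition: at each position, pick the longest dictionary gram starting there, falling back to the plain qmin-gram when none matches. (This is an assumption-laden approximation; the paper's actual VGEN algorithm additionally prunes grams subsumed by longer ones.)

```python
def decompose(s, dictionary, qmin, qmax):
    """Greedy longest-match decomposition of s into variable-length grams.

    dictionary: a set of grams with lengths in [qmin, qmax].
    Sketch only -- the real VGEN algorithm also skips subsumed grams.
    """
    grams = []
    i = 0
    while i <= len(s) - qmin:
        g = s[i:i + qmin]  # default: the shortest possible gram
        # try longer dictionary grams first
        for q in range(min(qmax, len(s) - i), qmin - 1, -1):
            if s[i:i + q] in dictionary:
                g = s[i:i + q]
                break
        grams.append(g)
        i += 1  # positional: advance one character; grams may overlap
    return grams
```

With a hypothetical dictionary {ni, sal, uni, vers}, "universal" decomposes into grams including uni, vers, and sal, matching the flavor of the slide's example.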
Representing the gram dictionary as a trie
- The [2,4]-gram dictionary (e.g., {ni, ivr, sal, uni, vers}) is stored as a trie, so the longest matching gram starting at each position of a string such as "universal" can be found efficiently
Challenge 2: constructing the gram dictionary (selecting grams)
- Prune the trie of candidate grams using a frequency threshold T (e.g., T = 2): a gram is extended to longer grams only while it remains frequent in the collection
Final gram dictionary: the grams remaining after pruning the trie
Outline
- Motivation
- VGRAM
  - Main idea
  - Decomposing strings into grams
  - Choosing good grams
  - Effect of edit operations on grams
  - Adopting VGRAM in existing algorithms
- Experiments
Challenge 3: edit operations' effect on grams

Fixed length q (e.g., 2-grams of "universal"): k edit operations can affect at most k * q grams
Deletion's effect on variable-length grams

A deletion at position i can only affect grams overlapping the window [i - qmax + 1, i + qmax - 1]; grams entirely before or after this window are not affected.
Grams affected by a deletion

For a deletion at position i of "universal" ([2,4]-grams): grams outside the window [i - qmax + 1, i + qmax - 1] are unaffected; grams inside the window may or may not be affected.
Grams affected by a deletion (cont.)

To decide which grams inside the window [i - qmax + 1, i + qmax - 1] are actually affected, use two structures: the trie of grams and the trie of reversed grams.
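The window test itself is a simple interval-overlap check; a sketch (the precise per-gram decision inside the window needs the tries and is not shown here):

```python
def possibly_affected(p, l, i, qmax):
    """Can a gram covering positions [p, p + l - 1] be affected by an
    edit operation at position i?  Only if it overlaps the window
    [i - qmax + 1, i + qmax - 1]; grams fully outside are safe."""
    return not (p + l - 1 < i - qmax + 1 or p > i + qmax - 1)
```

For "universal" with qmax = 4 and a deletion at position 8 (the final "l"), the gram "un" at position 0 lies outside the window and is certainly unaffected, while "al" at position 7 overlaps it and must be examined further.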
# of grams affected by each operation

Precompute, for each character position of "universal" (deletion/substitution) and each gap _u_n_i_v_e_r_s_a_l_ (insertion), the number of grams a single operation there can affect:

  0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
Max # of grams affected by k operations

- Take the k largest per-operation counts: for "universal", the vector of s is <2, 4>, i.e., 1 edit operation affects at most 2 grams, and 2 operations affect at most 4
- This is called the NAG vector (# of affected grams), and it is precomputed for each string
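Given the per-operation counts from the previous slide, the NAG vector is just a prefix sum over the largest entries; a minimal sketch (the name `nag_vector` is illustrative):

```python
def nag_vector(effects, kmax):
    """Precompute NAG(s, k) for k = 1..kmax.

    effects: per-position counts of grams a single edit operation
    can affect.  The worst k operations hit the k positions with the
    largest counts, so NAG(s, k) is the sum of the k largest entries.
    """
    worst = sorted(effects, reverse=True)
    nag, total = [], 0
    for k in range(1, kmax + 1):
        total += worst[k - 1] if k - 1 < len(worst) else 0
        nag.append(total)
    return nag
```

With per-operation counts whose two largest entries are both 2, this reproduces the slide's vector <2, 4>.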
Summary of the VGRAM index: the gram dictionary (trie), the trie of reversed grams, and the precomputed NAG vector of each string
Challenge 4: adopting VGRAM

Easily adoptable by many algorithms via two basic interfaces:
- String s -> its set of grams
- Strings s1, s2 with ed(s1, s2) <= k -> a lower bound on the number of their common grams
Lower bound on # of common grams

- Fixed length q (e.g., "universal"): if ed(s1, s2) <= k, then s1 and s2 share at least (|s1| - q + 1) - k * q grams
- Variable lengths: lower bound = (# of grams of s1) - NAG(s1, k)
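Both lower bounds are one-line formulas; a sketch (function names are illustrative):

```python
def fixed_lower_bound(s1, q, k):
    """Count-filter bound for fixed-length q-grams: strings within edit
    distance k of s1 share at least (|s1| - q + 1) - k*q grams with it."""
    return (len(s1) - q + 1) - k * q

def vgram_lower_bound(num_grams, nag, k):
    """VGRAM bound: nag[k-1] = NAG(s1, k), the max number of grams of s1
    that k edit operations can affect."""
    return num_grams - nag[k - 1]
```

For the running example, fixed 2-grams of "shtick" with k = 1 give a bound of 3; with a NAG vector of <2, 4> and 5 grams, 2 edit operations give a bound of 1.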
Example: algorithm using inverted lists

Query: "shtick", ED(shtick, ?) <= 1, over strings 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
- Fixed 2-grams (sh, ht, ti, ic, ck): lower bound = 3; merge the inverted lists of ti, ic, ck, ...
- Variable [2,4]-grams (sh, ht, tick): lower bound = 1, and the inverted lists of longer grams such as tick are much shorter
Outline
- Motivation
- VGRAM
  - Main idea
  - Decomposing strings into grams
  - Choosing good grams
  - Effect of edit operations on grams
  - Adopting VGRAM in existing algorithms
- Experiments
Data sets
- Data set 1: Texas Real Estate Commission. 151K person names, average length = 33.
- Data set 2: English dictionary from the Aspell spellchecker for Cygwin. 149,165 words, average length = 8.
- Data set 3: DBLP bibliography. 277K titles, average length = 62.

Environment: VC++, Dell GX620 PC with an Intel Pentium 3.40GHz Dual Core CPU, 2GB memory, Windows XP.
VGRAM overhead (index size)
Dataset 3: DBLP titles, [5,7]-grams, T = 500, LargeFirst pruning policy
VGRAM overhead (construction time)
Dataset 3: DBLP titles, [5,7]-grams, T = 500, LargeFirst pruning policy
Benefits over fixed-length grams (index size)
Dataset 1: 150K person names, k = 1, MergeCount algorithm, T = 1000, LargeFirst pruning policy
Benefits over fixed-length grams (running time)
Dataset 1: 150K person names, k = 1, MergeCount algorithm, T = 1000, LargeFirst pruning policy
Enhancing approximate join algorithms
- ProbeCount
- ProbeCluster
- PartEnum
Improving algorithm ProbeCount
Dataset 1: 50K person names, k = 3, [4,6]-grams, T = 200, LargeFirst pruning policy
Improving algorithm ProbeCluster
Dataset 1: [5,7]-grams, T = 1000, LargeFirst pruning policy
Improving algorithm PartEnum
Dataset 1: [4,6]-grams, T = 1000, LargeFirst pruning policy
Conclusions
- VGRAM: grams of variable length and high quality
- Adoptable in existing algorithms
- Reduces index size
- Reduces running time
Related work
- Approximate string matching: q-grams, q-samples
- Inside a DBMS: substring matching, set-similarity joins
- Applications of variable-length grams: speech recognition, information retrieval, artificial intelligence; substring selectivity estimation
- Improving space and time efficiency: n-Gram/2L
Questions or Comments? Thank you