VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
Chen Li, Bin Wang, and Xiaochun Yang
Northeastern University, China
Approximate selection queries

Example: find names similar to "Schwarzenegger" (e.g., mistyped as "Schwarrzenger") in a collection containing Keanu Reeves, Samuel Jackson, ...

Sources of errors:
- Query errors: limited knowledge about the data, typos, limited input devices (e.g., cell phones)
- Data errors: typos, Web data, OCR

Applications: spellchecking, query relaxation, ...
Record linkage

Example: join tables R and S on similar strings (infromix ~ informix, mcrosoft ~ microsoft, ...)

- Similarity functions: edit distance, Jaccard, cosine, ...
- Applications: record linkage, ...
“ q-grams ” of strings u n i v e r s a l 2-grams 4
q-gram inverted lists

Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static

Inverted lists of their 2-grams:
  at -> 4
  ch -> 0, 2
  ck -> 1, 3
  ic -> 0, 1, 2, 4
  ri -> 0
  st -> 1, 2, 3, 4
  ta -> 4
  ti -> 1, 2, 4
  tu -> 3
  uc -> 3
Searching using inverted lists

Query: "shtick", ED(shtick, ?) <= 1
- 2-grams of the query: sh, ht, ti, ic, ck
- A string within edit distance 1 must share at least 3 grams with the query (count filter)
- Merge the inverted lists of the query's grams (ti, ic, ck, ...) and keep the string ids that appear at least 3 times
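The merge-and-count search above can be sketched as follows; a simplified implementation assuming distinct grams per string (the names `build_index` and `candidates` are illustrative):

```python
from collections import defaultdict

def build_index(strings, q):
    """Map each q-gram to the list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # use the set of grams so each string appears once per list
        for g in set(s[i:i + q] for i in range(len(s) - q + 1)):
            index[g].append(sid)
    return index

def candidates(index, query, q, k):
    """Count filter: a string within edit distance k of the query must
    share at least len(query) - q + 1 - k*q grams with it."""
    counts = defaultdict(int)
    for i in range(len(query) - q + 1):
        for sid in index.get(query[i:i + q], []):
            counts[sid] += 1
    lower = len(query) - q + 1 - k * q
    return [sid for sid, c in counts.items() if c >= lower]
```

On the slide's example (query "shtick", k = 1, q = 2) the lower bound is 5 - 2 = 3, and only "stick" (id 1) survives the filter; each candidate would then be verified with a real edit-distance computation.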
2-grams -> 3-grams?

Query: "shtick", ED(shtick, ?) <= 1
- 3-grams of the query: sht, hti, tic, ick
- Now a similar string need only share >= 1 gram with the query

Tradeoff:
- Shorter inverted lists (good)
- More false positives (bad)
Outline
- Motivation
- VGRAM
  - Main idea
  - Decomposing strings into grams
  - Choosing good grams
  - Effect of edit operations on grams
  - Adopting VGRAM in existing algorithms
- Experiments
Motivation
- Small index size (memory)
- Small running time
  - Merging matched inverted lists
  - Computing ED(query, candidate)
Observation 1: dilemma of choosing q

Increasing q causes:
- Longer grams
- Shorter inverted lists (good)
- Fewer common grams between similar strings, weakening the count filter (bad)
Observation 2: skewed distribution of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K occurrences), tions, ystem, catio
VGRAM: Main idea
- Use grams of variable lengths (between qmin and qmax)
  - zebra -> ze (123)
  - corrasion -> co (5213), cor (859), corr (171)
- Advantages:
  - Reduces index size
  - Reduces running time
  - Adoptable by many existing algorithms
Challenges
1. How to decompose a string into variable-length grams?
2. How to construct a high-quality gram dictionary?
3. What is the relationship between string similarity and gram-set similarity?
4. How to adopt VGRAM in existing algorithms?
Challenge 1: decomposing a string into variable-length grams
- Fixed-length 2-grams: u n i v e r s a l
- Variable-length grams: decompose "universal" using a [2,4]-gram dictionary, e.g., {ni, ivr, sal, uni, vers}
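A simplified longest-match sketch of the decomposition: at each position, pick the longest dictionary gram starting there, falling back to the plain qmin-gram when none matches. (This is an assumption-laden approximation; the paper's actual VGEN algorithm additionally prunes grams subsumed by longer ones.)

```python
def decompose(s, dictionary, qmin, qmax):
    """Greedy longest-match decomposition of s into variable-length grams.

    dictionary: a set of grams with lengths in [qmin, qmax].
    Sketch only -- the real VGEN algorithm also skips subsumed grams.
    """
    grams = []
    i = 0
    while i <= len(s) - qmin:
        g = s[i:i + qmin]  # default: the shortest possible gram
        # try longer dictionary grams first
        for q in range(min(qmax, len(s) - i), qmin - 1, -1):
            if s[i:i + q] in dictionary:
                g = s[i:i + q]
                break
        grams.append(g)
        i += 1  # positional: advance one character; grams may overlap
    return grams
```

With a hypothetical dictionary {ni, sal, uni, vers}, "universal" decomposes into grams including uni, vers, and sal, matching the flavor of the slide's example.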
Representing the gram dictionary as a trie
- The [2,4]-gram dictionary (e.g., {ni, ivr, sal, uni, vers}) is stored as a trie, so the longest matching gram starting at each position of a string such as "universal" can be found efficiently
Challenge 2: constructing the gram dictionary (selecting grams)
- Prune the trie of candidate grams using a frequency threshold T (e.g., T = 2): a gram is extended to longer grams only while it remains frequent in the collection
Final gram dictionary: the grams remaining after pruning the trie
Outline
- Motivation
- VGRAM
  - Main idea
  - Decomposing strings into grams
  - Choosing good grams
  - Effect of edit operations on grams
  - Adopting VGRAM in existing algorithms
- Experiments
Challenge 3: edit operations' effect on grams

Fixed length q (e.g., 2-grams of "universal"): k edit operations can affect at most k * q grams
Deletion's effect on variable-length grams

A deletion at position i can only affect grams overlapping the window [i - qmax + 1, i + qmax - 1]; grams entirely before or after this window are not affected.
Grams affected by a deletion

For a deletion at position i of "universal" ([2,4]-grams): grams outside the window [i - qmax + 1, i + qmax - 1] are unaffected; grams inside the window may or may not be affected.
Grams affected by a deletion (cont.)

To decide which grams inside the window [i - qmax + 1, i + qmax - 1] are actually affected, use two structures: the trie of grams and the trie of reversed grams.
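The window test itself is a simple interval-overlap check; a sketch (the precise per-gram decision inside the window needs the tries and is not shown here):

```python
def possibly_affected(p, l, i, qmax):
    """Can a gram covering positions [p, p + l - 1] be affected by an
    edit operation at position i?  Only if it overlaps the window
    [i - qmax + 1, i + qmax - 1]; grams fully outside are safe."""
    return not (p + l - 1 < i - qmax + 1 or p > i + qmax - 1)
```

For "universal" with qmax = 4 and a deletion at position 8 (the final "l"), the gram "un" at position 0 lies outside the window and is certainly unaffected, while "al" at position 7 overlaps it and must be examined further.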
# of grams affected by each operation

Precompute, for each character position of "universal" (deletion/substitution) and each gap _u_n_i_v_e_r_s_a_l_ (insertion), the number of grams a single operation there can affect:

  0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
Max # of grams affected by k operations

- Take the k largest per-operation counts: for "universal", the vector of s is <2, 4>, i.e., 1 edit operation affects at most 2 grams, and 2 operations affect at most 4
- This is called the NAG vector (# of affected grams), and it is precomputed for each string
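Given the per-operation counts from the previous slide, the NAG vector is just a prefix sum over the largest entries; a minimal sketch (the name `nag_vector` is illustrative):

```python
def nag_vector(effects, kmax):
    """Precompute NAG(s, k) for k = 1..kmax.

    effects: per-position counts of grams a single edit operation
    can affect.  The worst k operations hit the k positions with the
    largest counts, so NAG(s, k) is the sum of the k largest entries.
    """
    worst = sorted(effects, reverse=True)
    nag, total = [], 0
    for k in range(1, kmax + 1):
        total += worst[k - 1] if k - 1 < len(worst) else 0
        nag.append(total)
    return nag
```

With per-operation counts whose two largest entries are both 2, this reproduces the slide's vector <2, 4>.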
Summary of the VGRAM index: the gram dictionary (trie), the trie of reversed grams, and the precomputed NAG vector of each string
Challenge 4: adopting VGRAM

Easily adoptable by many algorithms via two basic interfaces:
- String s -> its set of grams
- Strings s1, s2 with ed(s1, s2) <= k -> a lower bound on the number of their common grams
Lower bound on # of common grams

- Fixed length q (e.g., "universal"): if ed(s1, s2) <= k, then s1 and s2 share at least (|s1| - q + 1) - k * q grams
- Variable lengths: lower bound = (# of grams of s1) - NAG(s1, k)
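Both lower bounds are one-line formulas; a sketch (function names are illustrative):

```python
def fixed_lower_bound(s1, q, k):
    """Count-filter bound for fixed-length q-grams: strings within edit
    distance k of s1 share at least (|s1| - q + 1) - k*q grams with it."""
    return (len(s1) - q + 1) - k * q

def vgram_lower_bound(num_grams, nag, k):
    """VGRAM bound: nag[k-1] = NAG(s1, k), the max number of grams of s1
    that k edit operations can affect."""
    return num_grams - nag[k - 1]
```

For the running example, fixed 2-grams of "shtick" with k = 1 give a bound of 3; with a NAG vector of <2, 4> and 5 grams, 2 edit operations give a bound of 1.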
Example: algorithm using inverted lists

Query: "shtick", ED(shtick, ?) <= 1, over strings 0: rich, 1: stick, 2: stich, 3: stuck, 4: static
- Fixed 2-grams (sh, ht, ti, ic, ck): lower bound = 3; merge the inverted lists of ti, ic, ck, ...
- Variable [2,4]-grams (sh, ht, tick): lower bound = 1, and the inverted lists of longer grams such as tick are much shorter
Outline
- Motivation
- VGRAM
  - Main idea
  - Decomposing strings into grams
  - Choosing good grams
  - Effect of edit operations on grams
  - Adopting VGRAM in existing algorithms
- Experiments
Data sets
- Data set 1: Texas Real Estate Commission. 151K person names, average length = 33.
- Data set 2: English dictionary from the Aspell spellchecker for Cygwin. 149,165 words, average length = 8.
- Data set 3: DBLP bibliography. 277K titles, average length = 62.

Environment: VC++, Dell GX620 PC with an Intel Pentium 3.40GHz Dual Core CPU, 2GB memory, Windows XP.
VGRAM overhead (index size)
Dataset 3: DBLP titles, [5,7]-grams, T = 500, LargeFirst pruning policy
VGRAM overhead (construction time)
Dataset 3: DBLP titles, [5,7]-grams, T = 500, LargeFirst pruning policy
Benefits over fixed-length grams (index size)
Dataset 1: 150K person names, k = 1, MergeCount algorithm, T = 1000, LargeFirst pruning policy
Benefits over fixed-length grams (running time)
Dataset 1: 150K person names, k = 1, MergeCount algorithm, T = 1000, LargeFirst pruning policy
Enhancing approximate join algorithms
- ProbeCount
- ProbeCluster
- PartEnum
Improving algorithm ProbeCount
Dataset 1: 50K person names, k = 3, [4,6]-grams, T = 200, LargeFirst pruning policy
Improving algorithm ProbeCluster
Dataset 1: [5,7]-grams, T = 1000, LargeFirst pruning policy
Improving algorithm PartEnum
Dataset 1: [4,6]-grams, T = 1000, LargeFirst pruning policy
Conclusions
- VGRAM: grams of variable length and high quality
- Adoptable in existing algorithms
- Reduces index size
- Reduces running time
Related work
- Approximate string matching: q-grams, q-samples
- Inside a DBMS: substring matching, set-similarity joins
- Applications of variable-length grams: speech recognition, information retrieval, artificial intelligence; substring selectivity estimation
- Improving space and time efficiency: n-Gram/2L
Questions or Comments? Thank you