Introduction Simple Matching Fuzzy Matching Use Case Conclusion Fuzzy Matching In PostgreSQL A Story From The Trenches Charles Clavadetscher Swiss PostgreSQL Users Group Swiss PGDay 2016, Rapperswil, 24.06.2016 Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 1/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Outline Introduction 1 Simple Matching 2 Fuzzy Matching 3 Use Case 4 5 Conclusion Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 2/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Charles Clavadetscher Senior DB Engineer at KOF ETH Zurich KOF is the Center of Economic Research of the ETHZ the Swiss Institute of Technology in Zurich, Switzerland Independent economic research on business cycle tendencies for almost all sectors Maintenance of all databases at KOF: PostgreSQL, Oracle, MySQL and MSSQL Server. Focus on migrating to PostgreSQL Support in business process re-engineering Co-founder and treasurer of the SwissPUG, the Swiss PostgreSQL Users Group Member of the OK of the Swiss PGDay Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 3/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Principles of searches 1/2 Searching in a collection of texts written in a natural language is a huge challenge. Semantics and lexicology Meaning (FTS 1 ) Synonyms (FTS) Lexemes: Naked word base meanings Morphology Phonetic (soundex, metaphone, double metaphone) Morphems: Naked word base forms (FTS) Language independent Words (=, LIKE, ILIKE, regexp) Letters (levenshtein) N-gram subsets (trigrams) 1. FTS: Full Text Search Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 4/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Principles of searches 2/2 Methods can be classified based on many criteria. Here is short summary that includes the most important differentiators for our use case: Language dependency : Switzerland is a country with 4 official languages. Exact/Fuzzy : Use exact when possible and fuzzy otherwise. Indexability : Tradeoff between matching probability and execution speed. -----------+------------------------------------+------------------------------------- Language | Exact | Fuzzy dependent | not indexable | indexable | not indexable | indexable ===========|================|===================|==================|================== NO | LIKE | = (equality) | Levenshtein | Trigrams | ILIKE | LIKE (Trigrams) | | | Regexp | ILIKE (Trigrams) | | | | Regexp (Trigrams) | | -----------+----------------+-------------------+------------------+------------------ YES | | | Soundex | Full Text Search | | | Metaphone | | | | Double Metaphone | -----------+----------------+-------------------+------------------+------------------ Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 5/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Outline Introduction 1 Simple Matching 2 Fuzzy Matching 3 Use Case 4 Conclusion 5 Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 6/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Classics = : Exact match. This can easily be indexed using mostly btree. db=# SELECT * FROM people WHERE people_id = 1765; LIKE : Partially exact match. The part that is given must match exactly, all the rest is irrelevant. In combination with trigrams (and some other operator classes if the initial part is fixed) you can use indexes. SIMILAR TO : Equivalent to LIKE. ILIKE : The same as LIKE but case insensitive. db=# SELECT * FROM people WHERE firstname ILIKE ’Johann%’; db=# SELECT * FROM people WHERE firstname ILIKE ’%johann%’; Regexp : With regular expressions you can search for a specific pattern. PostgreSQL has operators for regexp matching case sensitive and insensitive. As with LIKE/ILIKE you can build an index on the base field using trigrams. db=# SELECT * FROM people WHERE lastname ~ ’^[CB][a-z]*(ts|tsch)er$’; db=# SELECT * FROM companies WHERE name ~ ’.*[0-9]+.*’; Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 7/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Outline Introduction 1 Simple Matching 2 Fuzzy Matching 3 Use Case 4 Conclusion 5 Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 8/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Definitions and Extensions Definitions Fuzzy Search : A process that locates entries that are likely to be relevant to a one or more search parameters even if the argument does not exactly correspond to the specific information available. Distance/Difference : A metric indicating how far are two strings apart. Similarity : A metric indicating how similar are two strings. Normalization : A technique to compute the relative strength of a metric (usually a number between 0 and 1) Extensions fuzzystrmatch : Soundex, Metaphone, Double Metaphone, Levenshtein. pg_trgm : Trigrams. Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 9/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Phonetic Methods Soundex db=# select soundex(’halloween’), soundex(’halouin’), db-# soundex(’les mots’), soundex(’lemó’); soundex | soundex | soundex | soundex ---------+---------+---------+--------- H450 | H450 | L253 | L50 Metaphone db=# select metaphone(’halloween’,5), metaphone(’halouin’,5), db-# metaphone(’les mots’,5), metaphone(’lemó’,5); metaphone | metaphone | metaphone | metaphone -----------+-----------+-----------+----------- HLWN | HLN | LSMTS | LM Double metaphone db=# select dmetaphone(’halloween’), dmetaphone(’halouin’), db-# dmetaphone(’les mots’), dmetaphone(’lemó’); dmetaphone | dmetaphone | dmetaphone | dmetaphone ------------+------------+------------+------------ HLN | HLN | LSMT | LM Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 10/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Levenshtein Distance The minumum number of operations (delete, insert, substitute) required to transform a string into another. Case sensitive. db=# select levenshtein(’Zürich’,’zuerich’); levenshtein ------------- 3 Normalized distance: d ( a , b ) d norm ( a , b ) = 1 − max ( length ( a ) , length ( b )) db=# SELECT 1 - (levenshtein(’Zürich’,’zuerich’)::REAL / db-# greatest(length(’Zürich’),length(’zuerich’))::REAL) AS levenshtein_norm; levenshtein_norm ------------------- 0.571428567171097 Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 11/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Full Text search - 1/8 Full Text Search is a huge chapter in PostgreSQL and would deserve a dedicated session. In this context we can only give an overview of the main characteristics of the technique. Lexeme : A basic lexical unit of a language consisting of one word or several words, the elements of which do not separately convey the meaning of the whole. tsvector : The PostgreSQL equivalent of a collection of lexeme. to_tsvector : Function to convert a text into a collection of lexemes. to_tsquery : A collection of lexemes combined with boolean operators ( &, |, ! ). db=> SELECT to_tsvector(’english’,’There are many nice stories about creating lexemes’); to_tsvector ------------------------------------------------- ’creat’:7 ’lexem’:8 ’mani’:3 ’nice’:4 ’stori’:5 db=> SELECT to_tsquery(’english’,’People & talk & stories & lexemes’); to_tsquery -------------------------------------- ’peopl’ & ’talk’ & ’stori’ & ’lexem’ Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 12/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Full Text search - 2/8 The match operator @@ db=> SELECT to_tsquery(’english’,’People & talk & stories & lexemes’) @@ db-> to_tsvector(’english’,’There are many nice stories about creating lexemes’); ?column? ---------- f db=> SELECT to_tsquery(’english’,’(People & talk) | stories | lexemes’) @@ db-> to_tsvector(’english’,’There are many nice stories about creating lexemes’); ?column? ---------- t Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 13/38
Introduction Simple Matching Fuzzy Matching Use Case Conclusion Full Text search - 3/8 From the PostgreSQL documentation: Text search parsers break documents into tokens and classify each token (for example, as words or numbers). Text search dictionaries convert tokens to normalized form and reject stop words. Text search templates provide the functions underlying dictionaries. (A dictionary simply specifies a template and a set of parameters for the template.) Text search configurations select a parser and a set of dictionaries to use to normalize the tokens produced by the parser. Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 14/38
Recommend
More recommend