stop stupid fuzzy searches
Table of contents 01 Fuzzy search 02 Smart Query Rewriting 03 Conclusion 04 Surprise
01 Fuzzy search
Why we need it / Distribution of spelling errors
[Chart: Frequency and Value/Search (0% to 100%) by error class: edit distance 0, edit distance 1, edit distance 2, edit distance >2, singular & plural, decomposition]
Why we need it / Distribution of spelling errors by device type
[Chart: share of errors (0% to 100%) on Desktop vs. Mobile by error type: insert, delete, replace, transpose, singular & plural, decomposition]
Causes of spelling errors
Query intent: spannbettlaken (61%)
Error type / query / % / result size:
  format (4%)
    -spannbettlaken          1%    0
    spann-bettlaken          3%   83
  phonetic (22%)
    spannbettlacken         13%   56
    spanbettlaken            9%   50
  typo (8%)
    spannbettllaken          7%   47
    spammbettlaken           1%    0
  decomposition (5%)
    Spann bettlaken          4%   43
    Bettlaken zum spannen    1%    0
…42 additional spellings
How it works

EditDistance 1 (generates 835 candidates):

    GET catalog/products/_search
    {
      "query": {
        "fuzzy": {
          "title": {
            "value": "spannbettlacken",
            "fuzziness": 1
          }
        }
      }
    }

EditDistance 2 (generates ~650k candidates):

    GET catalog/products/_search
    {
      "query": {
        "fuzzy": {
          "title": {
            "value": "spannbettlacken",
            "fuzziness": 2
          }
        }
      }
    }
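The 835 figure for edit distance 1 can be reproduced with a little arithmetic. The sketch below is an illustration only, assuming a plain 26-letter alphabet and counting duplicates; the function name `edits1` is hypothetical, and this is not a description of how Elasticsearch evaluates a fuzzy query internally. For a word of length n it yields n deletes, n-1 transposes, 26n replaces and 26(n+1) inserts, i.e. 54n + 25 candidates.

```python
import string

# Assumption: 26 lowercase ASCII letters, no umlauts or digits.
ALPHABET = string.ascii_lowercase


def edits1(word):
    """Return every string one edit away from word, duplicates included."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return deletes + transposes + replaces + inserts


word = "spannbettlacken"
print(len(word), len(edits1(word)))  # 15 835
```

Applying the same expansion to each of those candidates again is what pushes edit distance 2 into the hundreds of thousands.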
Resulting in / high recall but low precision
[Chart: Precision (PREC) on the y-axis vs. Recall (TPR) on the x-axis, both from 0 to 1]
Resulting in / low search throughput
~0.1 seconds for spelling a short word
[Chart: Searches per Second (0 to 35000) vs. number of Query Terms (1 to 6) for: or - term, and - term, or - fuzzy 2, and - fuzzy 2]
Observations
+
  Searches for all possible candidates inside a given edit distance
  Natively implemented in Elasticsearch and Lucene
-
  Increased CPU usage and query response time
  Inconsistent and not always relevant results
  Skewed search analytics
02 Smart Query rewriting MAKE FUZZY SEARCH AS FAST, EASY AND RELEVANT AS EXACT SEARCH
Our Solution / smart query rewrites
[Diagram: Cluster similar Queries (spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken) -> Test & Select MasterQuery (spannbettlaken) -> MasterQuery -> Search Engine]
Our Solution / smart query rewrites
Cluster similar Queries
Example cluster: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken
Based on deep learning & crafted algorithms, we clean and cluster queries with the same meaning.
We use the concept of controlled precision reduction, matching in this order:
  Exact Match
  Fingerprint
  Lemmatization & Phonemes
  Fuzzy Match
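To make the idea of controlled precision reduction concrete, here is a minimal sketch of a matching cascade that tries the strictest comparison first and only then falls back to looser ones. Everything in it is an assumption for illustration: the key functions, the crude phonetic rules (`sch` to `s`, `ck` to `k`, collapsing doubled letters) and the fuzzy threshold of 1 are hypothetical stand-ins, lemmatization is left out, and none of it reflects the deep learning models mentioned on the slide.

```python
import re


def fingerprint_key(query):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", query.lower())


def phonetic_key(query):
    """Very crude phonetic normalization (hypothetical rules, not a real phonetic algorithm)."""
    key = fingerprint_key(query).replace("sch", "s").replace("ck", "k")
    return re.sub(r"(.)\1+", r"\1", key)  # collapse repeated letters


def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Strictest level first; each later level trades precision for recall.
LEVELS = [("exact", lambda q: q), ("fingerprint", fingerprint_key), ("phonetic", phonetic_key)]


def match_level(query, reference):
    """Return the first level at which query matches reference, or None."""
    for name, key in LEVELS:
        if key(query) == key(reference):
            return name
    if levenshtein(phonetic_key(query), phonetic_key(reference)) <= 1:
        return "fuzzy"
    return None


for q in ["spannbettlaken", "spann-bettlaken", "spannbettlacken",
          "schpanbettlaken", "spammbettlaken", "bettdecke"]:
    print(q, "->", match_level(q, "spannbettlaken"))
```

The point of the ordering is that a query is only ever matched at the loosest level it actually needs, so fuzzy matching is the last resort rather than the default.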
Our Solution / smart query rewrites
Test & Select MasterQuery
Candidates: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken
Selected MasterQuery: spannbettlaken
Based on tracking KPIs, deep learning and global parameter optimization, we test & select the query that best balances the search result interaction probability against the economic outcome.
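The slide does not say how the balance between interaction probability and economic outcome is computed. One simple possibility, shown here purely as a hedged sketch, is a weighted sum of the two per candidate query. The `QueryStats` fields, the `weight` parameter, the scoring formula and all numbers below are made up for illustration and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    """Hypothetical tracked KPIs for one query variant."""
    query: str
    searches: int      # must be > 0
    interactions: int  # searches followed by a click or add-to-cart
    revenue: float


def select_master_query(stats, weight=0.5):
    """Pick the variant with the best weighted mix of interaction rate and value per search."""
    best_value = max(s.revenue / s.searches for s in stats)

    def score(s):
        interaction_prob = s.interactions / s.searches
        value = s.revenue / s.searches
        relative_value = value / best_value if best_value else 0.0
        return weight * interaction_prob + (1 - weight) * relative_value

    return max(stats, key=score).query


# Toy numbers, invented for this example only.
cluster = [
    QueryStats("spannbettlaken", searches=1000, interactions=420, revenue=900.0),
    QueryStats("spann-bettlaken", searches=50, interactions=20, revenue=30.0),
    QueryStats("spannbettlacken", searches=200, interactions=60, revenue=100.0),
]
print(select_master_query(cluster))  # spannbettlaken
```

All other queries in the cluster would then be rewritten to the selected MasterQuery before they reach the search engine.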
CXP search|hub / Query Intelligence Platform
[Diagram: Frontend Search -> search|hub -> Search Engine Endpoint (Solr, Elasticsearch, FACT-Finder, Fredhopper, Celebros, Algolia, ACS)]
search|hub components: High performance Data|hub, Caching & Logging, Semantic Query Parsing, Site Search Analytics, Guided Selling, Personalization, Smart|Query, Query Segmentation, Query Scoping, …
03 Conclusion
Impact: top-10 ecom player A
Uses an already highly optimized, state-of-the-art eCommerce search solution
[Chart: w/o smart|query vs. w smart|query by month, Jan to Dec; y-axis 90% to 140%]
Impact: top-50 ecom player B
Uses an optimized Solr implementation
[Chart: w/o smart|query vs. w smart|query by month, Apr to Mar; y-axis 90% to 140%]
Resulting in / High recall & high precision
[Chart: Precision (PREC) and Recall (TPR) vs. number of Queries, 0 to 5.000.000]
Resulting in / insane query performance
~0.00005 seconds for spelling a short word – 80 ops/ms
[Chart: Searches per Second (0 to 35000) for search|hub & Elastic vs. number of Query Terms (1 to 6) for: or - term, and - term, or - fuzzy 2, and - fuzzy 2]
Observations
+
  more relevant results
  consistent results
  reduced manual effort for curated search results
  save CPU usage
  improved query response time
  consistent site search analytics
-
  additional complexity
04 Surprise
CXP smart|query - PreDictLib: fast & accurate spell correction at scale
search|hub - PreDictLib: fast & accurate spell correction at scale
Quick Highlights:
  extremely fast & constant index access
  truly language independent edit distance
  ability to add records to the index at runtime without performance decrease
  based on one of the most efficient spell correction implementations out there, SymSpell by Wolf Garbe
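The core idea behind SymSpell, symmetric delete spelling correction, can be sketched briefly: the index stores every variant of a dictionary term obtained by deleting up to N characters, and a lookup applies the same deletions to the query and intersects the two. The sketch below is a simplified illustration, not the SymSpell or PreDict code; the class name `DeleteIndex` and its methods are hypothetical, and candidates are verified with plain Levenshtein distance rather than the Damerau-Levenshtein variant.

```python
from collections import defaultdict


def deletes(word, max_distance):
    """Return word plus every string reachable by deleting up to max_distance characters."""
    result = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        result |= frontier
    return result


def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class DeleteIndex:
    """Hypothetical minimal index: delete-variant -> dictionary terms."""

    def __init__(self, max_distance=2):
        self.max_distance = max_distance
        self.index = defaultdict(set)

    def add(self, term):
        # Terms can be added at any time; lookups are plain hash accesses.
        for key in deletes(term, self.max_distance):
            self.index[key].add(term)

    def lookup(self, query):
        candidates = set()
        for key in deletes(query, self.max_distance):
            candidates |= self.index.get(key, set())
        # Shared delete variants can also match terms that are too far away, so verify.
        scored = [(levenshtein(query, term), term) for term in candidates]
        return sorted((d, t) for d, t in scored if d <= self.max_distance)


index = DeleteIndex(max_distance=2)
for term in ["spannbettlaken", "bettlaken", "bettdecke"]:
    index.add(term)
print(index.lookup("spannbettlacken"))  # [(1, 'spannbettlaken')]
index.add("kissen")                     # added at runtime
print(index.lookup("kisen"))            # [(1, 'kissen')]
```

Because only deletions are generated, the number of lookups per query depends on the query length and the maximum distance, not on the alphabet, which is what makes the approach language independent.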
SymSpell / some Benchmarks
Throughput vs Accuracy
[Chart: Accuracy and Searches/sec (0% to 100%) for Lucene WordCorrect, ElasticSearch WordCorrect, No.2 eCommerce Search, No.1 in eCommerce Search, SymSpell]
search|hub - PreDictLib: fast & accurate spell correction at scale
  modified the edit distance to a weighted edit distance
  replaced the Damerau-Levenshtein distance with a weighted Damerau-Levenshtein distance that takes keyboard distance into account
  re-rank the candidate list by applying additional similarity algorithms
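As a rough illustration of what a keyboard-aware weighted distance could look like, the sketch below implements the restricted Damerau-Levenshtein (optimal string alignment) distance with a cheaper substitution cost for keys that are neighbours on a keyboard. The QWERTZ layout table, the 0.5 neighbour cost and the function names are assumptions chosen for this example; the slide does not state the actual weights or layout used in PreDict.

```python
# Assumption: simplified German QWERTZ layout, letters only, rows treated as an aligned grid.
ROWS = ["qwertzuiop", "asdfghjkl", "yxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def substitution_cost(a, b):
    """0 for equal characters, 0.5 for neighbouring keys (hypothetical weight), else 1."""
    if a == b:
        return 0.0
    if a in POS and b in POS:
        (r1, c1), (r2, c2) = POS[a], POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5
    return 1.0


def weighted_osa_distance(s, t):
    """Optimal string alignment distance with keyboard-weighted substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)
    for j in range(cols):
        d[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                       # deletion
                d[i][j - 1] + 1.0,                                       # insertion
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),  # substitution
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)            # transposition
    return d[-1][-1]


print(weighted_osa_distance("spammbettlaken", "spannbettlaken"))  # 1.0
print(weighted_osa_distance("spaqqbettlaken", "spannbettlaken"))  # 2.0
```

With such weights, a slip onto the neighbouring key (m for n) ranks closer to the intended word than an unrelated substitution at the same plain edit distance, which is the kind of signal a re-ranking step can use.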
search|hub - PreDict (CE) & PreDict (EE) / some Benchmarks
Throughput vs. Accuracy
[Chart: Accuracy and Searches/sec (0% to 100%) for Lucene WordCorrect, ElasticSearch WordCorrect, No.2 eCommerce Search, No.1 in eCommerce Search, SymSpell, CXP PreDict (CE), CXP Searchhub]
What you'll get
CXP SmartQuery - PreDictLib (CE): fast & accurate spell correction at scale
  the Lib as Java source
  accuracy and benchmark tests
  real-life test data
https://github.com/searchhub/preDict
Questions