High Performance Solr
Shalin Shekhar Mangar

Performance constraints
• CPU
• Memory
• Disk
• Network

Tuning (CPU) Queries
• Phrase query
• Boolean query (AND)
• Boolean query (OR)
• Wildcard
• Fuzzy
• Soundex
• …roughly in order of increasing cost
• Query performance inversely proportional to matches (doc frequency)

Tuning (CPU) Queries
• Reduce frequent-term queries
  – Remove stopwords
  – Try CommonGramsFilter
  – Index pruning (advanced)
• Some function queries match ALL documents - terribly inefficient

Tuning (CPU) Queries
• Make efficient use of caches
  – Watch those eviction counts
  – Beware of NOW in date range queries. Use NOW/DAY or NOW/HOUR
  – No need to cache every filter
    • Use fq={!cache=false}year:[2005 TO *]
    • Specify cost for non-cached filters for efficiency
      – fq={!geofilt sfield=location pt=22,-127 d=50 cache=false cost=50}
    • Use PostFilters for very expensive filters (cache=false, cost >= 100)
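
The filter advice above can be sketched as a small parameter builder. This is an illustration only: the `timestamp` field name and the seven-day window are assumptions, while the two non-cached filters are copied from the slide.

```python
from urllib.parse import urlencode


def build_filter_params(user_query):
    """Return Solr request parameters as a list of (name, value) pairs."""
    return [
        ("q", user_query),
        # Rounded to the day, so the filter string stays identical all day
        # and its filterCache entry can be reused. A bare NOW changes every
        # millisecond and would never hit the cache.
        # "timestamp" is a hypothetical field name.
        ("fq", "timestamp:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]"),
        # Not worth a filterCache entry, so caching is switched off.
        ("fq", "{!cache=false}year:[2005 TO *]"),
        # Non-cached filter with an explicit cost so that cheaper
        # non-cached filters are evaluated before it.
        ("fq", "{!geofilt sfield=location pt=22,-127 d=50 cache=false cost=50}"),
    ]


params = build_filter_params("solr")
print(urlencode(params))
```

A list of pairs is used instead of a dict because `fq` is repeated, and each `fq` is cached (or not) independently.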

Tuning (CPU) Queries
• Warm those caches
  – Auto-warming
  – Warming queries
    • firstSearcher
    • newSearcher
• Merged Segment Warmer

Tuning (CPU) Queries
• Stop using primitive number/date fields if you are performing range queries
  – facet.query (sometimes) or facet.range are also range queries
• Use Trie* Fields
• When performing range queries on a string field (rare use case), use frange to trade off memory for speed
  – It will un-invert the field
  – No additional cost is paid if the field is already being used for sorting or other function queries
  – fq={!frange l=martin u=rowling}author_last_name instead of fq=author_last_name:[martin TO rowling]
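
The two filter forms from the last bullet can be generated side by side. This is a minimal sketch; the helper names are made up for illustration, and only the resulting strings come from the slide.

```python
def range_filter(field, lower, upper):
    """Standard range query on the indexed terms of the field."""
    return f"{field}:[{lower} TO {upper}]"


def frange_filter(field, lower, upper):
    """Function range query: works on the un-inverted field values,
    trading memory for speed."""
    return f"{{!frange l={lower} u={upper}}}{field}"


print(range_filter("author_last_name", "martin", "rowling"))
print(frange_filter("author_last_name", "martin", "rowling"))
```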

Tuning (CPU) Queries
• Faceting methods
  – facet.method=enum - great for fewer unique values
    • facet.enum.cache.minDf - use filter cache or iterate through DocsEnum
  – facet.method=fc
  – facet.method=fcs (per-segment)
• facet.sort=index faster than facet.sort=count but useless in typical cases
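
As a rough sketch, the faceting parameters named above could be assembled like this. The `category` field and the threshold of 25 are placeholders, not recommendations.

```python
FACET_METHODS = {"enum", "fc", "fcs"}


def facet_params(field, method="fc", min_df=None, sort="count"):
    """Build field-faceting parameters for a single field."""
    if method not in FACET_METHODS:
        raise ValueError(f"unknown facet.method: {method}")
    params = {
        "facet": "true",
        "facet.field": field,
        "facet.method": method,
        "facet.sort": sort,
    }
    if min_df is not None:
        # With facet.method=enum, terms whose document frequency is below
        # this threshold skip the filter cache and are iterated directly.
        params["facet.enum.cache.minDf"] = min_df
    return params


print(facet_params("category", method="enum", min_df=25))
```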

Tuning (CPU) Queries
• Terms query parser
  – Large number of terms OR’ed together
  – ACLs
• ReRankQueryParser
  – Like a PostFilter but for queries!
  – Run expensive queries at the very last
  – Solr 4.9+ only (soon to be released)
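
Both parsers are driven by local-params syntax, which the sketch below assembles. The field name `acl_group`, the example queries, and the default of re-ranking the top 1000 documents with weight 3 are assumptions chosen for illustration.

```python
def terms_filter(field, values):
    """Filter matching any of many exact terms, e.g. the ACL groups of a user."""
    return "{!terms f=%s}%s" % (field, ",".join(values))


def rerank_params(main_query, expensive_query, top_n=1000, weight=3):
    """Run the cheap query first, then re-score only the top_n hits
    with the expensive query."""
    return {
        "q": main_query,
        "rq": "{!rerank reRankQuery=$rqq reRankDocs=%d reRankWeight=%s}" % (top_n, weight),
        # Referenced from rq through parameter dereferencing ($rqq).
        "rqq": expensive_query,
    }


print(terms_filter("acl_group", ["admins", "staff", "editors"]))
print(rerank_params("laptop", "title:ultrabook"))
```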

Tuning (CPU) Queries
• Divide and conquer
  – Shard’em out
  – Use multiple CPUs
  – Sometimes multiple cores are the answer even for small indexes, especially for high update rates

Tuning Memory Usage
• Use DocValues for sorting/faceting/grouping
• There are several docValuesFormat settings: {‘default’, ‘memory’, ‘direct’} with different trade-offs.
  – default - Helps avoid OOM but uses disk and OS page cache
  – memory - compressed in-memory format
  – direct - no-compression, in-memory format

Tuning Memory Usage
• Use _version_ as a doc-values field
• Reduce the stack size for threads (-Xss), especially if you run a lot of cores
• termIndexInterval - Choose how often terms are loaded into the term dictionary. Default is 128.

Tuning Memory Usage
• Garbage Collection pauses kill search performance
• GC pauses expire ZK sessions in SolrCloud leading to many problems
• Large heap sizes are almost never the answer
• Leave a lot of memory for the OS page cache
• http://wiki.apache.org/solr/ShawnHeisey

Tuning Disk Usage
• Atomic updates are costlier
  – Lookup from transaction log
  – Lookup from Index (all stored fields)
  – Combine
  – Index
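
The lookup, combine, index sequence can be pictured with a toy model. This is not Solr's implementation, just an illustration of why the whole stored document has to be read back and re-indexed even when a single field changes. The modifiers `set`, `inc` and `add` are the usual atomic-update operations; the document contents are invented.

```python
def apply_atomic_update(stored_doc, update):
    """Combine step: merge an atomic-update request into a full copy of
    the document that was looked up from the tlog or the index."""
    merged = dict(stored_doc)  # every stored field is carried over
    for field, change in update.items():
        if field == "id":
            continue
        if not isinstance(change, dict):
            merged[field] = change  # plain value: treated as a replacement
            continue
        for op, value in change.items():
            if op == "set":
                merged[field] = value
            elif op == "inc":
                merged[field] = merged.get(field, 0) + value
            elif op == "add":
                current = merged.get(field, [])
                if not isinstance(current, list):
                    current = [current]
                merged[field] = current + [value]
            else:
                raise ValueError(f"unsupported modifier: {op}")
    return merged  # the complete document is what gets indexed again


stored = {"id": "1", "title": "old", "views": 3, "tags": ["x"]}
update = {"id": "1", "title": {"set": "new"}, "views": {"inc": 2}, "tags": {"add": "y"}}
print(apply_atomic_update(stored, update))
```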

Tuning Disk Usage
• Experiment with merge policies
  – TieredMergePolicy is great but LogByteSizeMergePolicy can be better if multiple indexes are sharing a single disk
• Increase buffer size - ramBufferSizeMB
• maxIndexingThreads

Tuning Disk Usage
• Always hard commit once in a while
  – Best to use autoCommit and maxDocs
  – Trims transaction logs
  – Solution for slow startup times
• Use autoSoftCommit for new searchers
• commitWithin is a great way to commit frequently

Tuning Network
• Batch writes together as much as possible
• Use CloudSolrServer in SolrCloud always
  – Routes updates intelligently to correct leader
• ConcurrentUpdateSolrServer (previously known as StreamingUpdateSolrServer) for indexing in non-Cloud mode
  – Don’t use it for querying!
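
Batching is independent of the client library. A minimal sketch, assuming documents arrive as an iterable of dicts and each batch becomes the JSON body of one update request (the batch size of 2 is only for the demo):

```python
import json
from itertools import islice


def chunked(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


docs = ({"id": str(i)} for i in range(5))
for batch in chunked(docs, 2):
    body = json.dumps(batch)  # one request body instead of one per document
    print(len(batch), body)
```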

Tuning Network
• Share HttpClient instance for all SolrJ clients or just re-use the same client object
• Disable retries on HttpClient

Tuning Network
• Distributed Search is optimised if you ask for fl=id,score only
  – Avoid numShard*rows stored field lookups
  – Saves numShard network calls
  – Use distrib.singlePass parameter to force this optimisation
  – Use /get for lookup by id
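
One way to read the bullets above is as two requests: an id-only search, then a lookup by id through `/get`. The sketch below only builds the query strings; the function names and example ids are hypothetical, and `single_pass` simply adds the `distrib.singlePass` parameter mentioned on the slide.

```python
from urllib.parse import urlencode


def id_only_search(query, rows=10, single_pass=False):
    """Query string for a distributed search that returns only id and score."""
    params = {"q": query, "fl": "id,score", "rows": rows}
    if single_pass:
        params["distrib.singlePass"] = "true"
    return urlencode(params)


def realtime_get(ids):
    """Query string for the /get handler, fetching documents by id."""
    return urlencode({"ids": ",".join(ids)})


print("/select?" + id_only_search("solr", single_pass=True))
print("/get?" + realtime_get(["doc1", "doc7"]))
```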

Tuning Network
• Consider setting up a caching proxy such as squid or varnish in front of your Solr cluster
  – Solr can emit the right cache headers if configured in solrconfig.xml
  – Last-Modified and ETag headers are generated based on the properties of the index such as last searcher open time
  – You can even force new ETag headers by changing the ETag seed value
  – <httpCaching never304="true"><cacheControl>max-age=30, public</cacheControl></httpCaching>
  – The above config will set responses to be cached for 30s by your caching proxy unless the index is modified.

Avoid wastage
• Don’t store what you don’t need back
  – Use stored=false
• Don’t index what you don’t search
  – Use indexed=false
• Don’t retrieve what you don’t need back
  – Don’t use fl=* unless necessary
  – Don’t use rows=10 when all you need is numFound; use rows=0
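
For the numFound case, a count-only request asks for zero rows and reads the total from the response. The sample body below is a hand-written stand-in shaped like a Solr JSON response, not real output, and the query is a placeholder.

```python
import json


def count_params(query):
    """Parameters for a request that only needs the number of matches."""
    return {"q": query, "rows": 0, "wt": "json"}


def num_found(response_body):
    """Extract the total hit count from a JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


# Hypothetical response body for illustration.
sample = '{"responseHeader": {"status": 0}, "response": {"numFound": 1234, "start": 0, "docs": []}}'
print(count_params("category:books"), num_found(sample))
```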

Reduce indexed info
• omitNorms=true - Use if you don’t need index-time boosts or length normalization
• omitTermFreqAndPositions=true - Use if you don’t need term frequencies and positions
  – No fuzzy query, no phrase queries
  – Can do simple exists check, can do simple AND/OR searches on terms
  – No scoring difference whether the term exists once or a thousand times

DocValue tricks & gotchas
• DocValue field should be stored=false, indexed=false
• It can still be retrieved using fl=field(my_dv_field)
• If you store DocValue field, it uses extra space as a stored field also.
  – In future, update-able doc value fields will be supported by Solr but they’ll work only if stored=false, indexed=false
• DocValues save disk space also (all values, next to each other lead to very efficient compression)

Distributed Deep paging
• Bulk exporting documents from Solr will bring it to its knees
• Enter deep paging and the cursorMark parameter
  – Specify cursorMark=* on the first request
  – Use the returned ‘nextCursorMark’ value as the cursorMark parameter of the next request
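
The cursor loop can be sketched without tying it to a particular HTTP client. Here `fetch_page` is an assumed callable that sends the parameters to Solr and returns the parsed JSON response; `fake_fetch` is a purely in-memory stand-in so the loop can be run as-is. Solr requires the sort to include the uniqueKey field (assumed to be `id` here) and the loop ends when the returned cursor equals the one that was sent.

```python
def export_all(fetch_page, query="*:*", sort="id asc", rows=500):
    """Yield every matching document by following nextCursorMark."""
    cursor = "*"  # first request
    while True:
        page = fetch_page({"q": query, "sort": sort, "rows": rows, "cursorMark": cursor})
        for doc in page["response"]["docs"]:
            yield doc
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor: nothing left to fetch
            return
        cursor = next_cursor


# In-memory stand-in for a Solr endpoint (hypothetical, for demonstration only).
ALL_DOCS = [{"id": str(i)} for i in range(5)]


def fake_fetch(params):
    offset = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
    docs = ALL_DOCS[offset:offset + params["rows"]]
    return {"response": {"docs": docs}, "nextCursorMark": str(offset + len(docs))}


print([d["id"] for d in export_all(fake_fetch, rows=2)])
```

Real cursor marks are opaque strings; the integer offsets used by `fake_fetch` exist only to keep the stand-in short.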

Distributed deep paging

Thank you
shalin@apache.org
twitter.com/shalinmangar
