
A Comparison of Open Source Search Engines



1. Open-Source Search Engines and Lucene/Solr
UCSB 290N 2013. Tao Yang. Slides are based on Y. Seeley, S. Das, C. Hostetter.

Open Source Search Engines
• Why?
   Low cost: no licensing fees
   Source code available for customization
   Good for modest or even large data sizes
• Challenges:
   Performance, scalability
   Maintenance

Open Source Search Engines: Examples
• Lucene
   A full-text search library with core indexing and search services
   Competitive in engine performance, relevancy, and code maintenance
• Solr
   Based on the Lucene Java search library, with XML/HTTP APIs
   Adds caching, replication, and a web administration interface
• Lemur/Indri
   C++ search engine from U. Mass/CMU

2. A Comparison of Open Source Search Engines
• Middleton/Baeza-Yates 2010 (Modern Information Retrieval, textbook)
• July 2009, Vik's blog (http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/)

Lucene
• Developed initially by Doug Cutting
   Java-based. Created in 1999, donated to Apache in 2001
• Features
   No crawler, no document parsing, no "PageRank"
• Powered by Lucene
   IBM OmniFind Y! Edition, Technorati
   Wikipedia, Internet Archive, LinkedIn, monster.com
• Add documents to an index via IndexWriter
   A document is a collection of fields
   Flexible text analysis – tokenizers, filters
• Search for documents via IndexSearcher
   Hits = search(Query, Filter, Sort, topN)
• Ranking based on tf * idf similarity with normalization
(see the indexing/search sketch after this slide)
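A minimal sketch of the IndexWriter/IndexSearcher flow outlined above, assuming Lucene 3.6-era APIs (the version referenced elsewhere in these slides); the index path and the "name" field are illustrative only, and later Lucene versions changed several of these classes:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneSketch {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/demo-index")); // illustrative path
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // Indexing: a document is a collection of fields
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_36, analyzer));
        Document doc = new Document();
        doc.add(new Field("name", "Little Red Riding Hood",
                          Field.Store.YES, Field.Index.ANALYZED)); // stored and indexed
        writer.addDocument(doc);
        writer.close();

        // Searching: tf*idf-ranked hits via IndexSearcher
        IndexReader reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query q = new QueryParser(Version.LUCENE_36, "name", analyzer).parse("riding hood");
        TopDocs hits = searcher.search(q, 10); // top-10 results
        for (ScoreDoc sd : hits.scoreDocs) {
          System.out.println(searcher.doc(sd.doc).get("name") + "  score=" + sd.score);
        }
        searcher.close();
        reader.close();
      }
    }

Other search(...) overloads accept a Filter and a Sort, matching the Hits = search(Query, Filter, Sort, topN) form on the slide.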

3. Lucene's input content for indexing
• Logical structure
   Documents are a collection of fields
     – Stored – stored verbatim for retrieval with results
     – Indexed – tokenized and made searchable
   Indexed terms are stored in an inverted index
• Physical structure of the inverted index
   Multiple documents are stored in segments
• IndexWriter is the interface object for the entire index

Example of Inverted Indexing
• Documents: 0 – Little Red Riding Hood; 1 – Robin Hood; 2 – Little Women
• Inverted index (term → document ids):
   aardvark →
   hood → 0, 1
   little → 0, 2
   red → 0
   riding → 0
   robin → 1
   women → 2

Faceted Search/Browsing Example

Indexing Flow
• Input: "LexCorp BFG-9000"
• WhitespaceTokenizer → LexCorp | BFG-9000
• WordDelimiterFilter (catenateWords=1) → Lex | Corp | BFG | 9000 | LexCorp
• LowercaseFilter → lex | corp | bfg | 9000 | lexcorp
(a token-stream sketch follows)
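A small sketch of the analysis chain in the flow above, assuming Lucene 3.6-era classes. WordDelimiterFilter, which splits "BFG-9000" and "LexCorp" into word parts, ships with Solr's analysis package and is omitted here, so the hyphenated token passes through intact:

    import java.io.StringReader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class IndexingFlowDemo {
      public static void main(String[] args) throws Exception {
        // WhitespaceTokenizer -> LowerCaseFilter, as in the flow above
        // (WordDelimiterFilter omitted; it lives in Solr's analysis package).
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36,
                                                 new StringReader("LexCorp BFG-9000"));
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term.toString()); // prints: lexcorp, bfg-9000
        }
        ts.end();
        ts.close();
      }
    }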

4. Lucene Index Files: Field infos file (.fnm)
Format: FieldsCount, <FieldName, FieldBits>
   FieldsCount: the number of fields in the index
   FieldName: the name of the field, as a string
   FieldBits: a byte and an int, where the lowest bit of the byte shows whether the field is indexed, and the int is the id of the term
Example: 1, <content, 0x01>
(http://lucene.apache.org/core/3_6_2/fileformats.html)

Analyzers
• Analyzers specify how the text in a field is to be indexed. Options in Lucene:
   – WhitespaceAnalyzer: divides text at whitespace
   – SimpleAnalyzer: divides text at non-letters; converts to lower case
   – StopAnalyzer: SimpleAnalyzer, plus removes stop words
   – StandardAnalyzer: good for most European languages; removes stop words; converts to lower case
   – Create your own Analyzers (see the sketch after this slide)

Lucene Index Files: Term Dictionary file (.tis)
Format: TermCount, TermInfos
   TermInfos: <Term, DocFreq>
   Term: <PrefixLength, Suffix, FieldNum>
• This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.
   TermCount: the number of terms in the documents
   Term: term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and this term is "boy", the PrefixLength is two and the suffix is "y".
   FieldNumber: the term's field, whose name is stored in the .fnm file
• Document frequency can be obtained from this file.
Example: 4, <football,1> <penn,3> <layers,2> <state,1>

Lucene Index Files: Term Info index (.tii)
Format: IndexTermCount, IndexInterval, TermIndices
   TermIndices: <TermInfo, IndexDelta>
• This file contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file.
   IndexDelta: determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
Example: 4, <<0,football,1>,2> <<0,penn,1>,1> <<1,layers,1>,1> <<0,state,1>,2>
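The "create your own Analyzers" bullet above amounts to chaining a Tokenizer with TokenFilters. A minimal sketch, assuming Lucene 3.6-era APIs; the class name MyAnalyzer is hypothetical, and the chain shown (whitespace tokenization, lower-casing, stop-word removal) is just one plausible combination:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // A hand-built analysis chain: split on whitespace, lower-case,
    // then drop English stop words (class and chain are illustrative).
    public class MyAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        ts = new StopFilter(Version.LUCENE_36, ts, StandardAnalyzer.STOP_WORDS_SET);
        return ts;
      }
    }

An Analyzer like this is passed to IndexWriterConfig at index time and to QueryParser at query time, so both sides see the same token stream.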

5. Lucene Index Files: Frequency file (.frq)
Format: <TermFreqs>
   TermFreqs: <TermFreq>
   TermFreq: DocDelta, Freq?
• TermFreqs are ordered by term (the term is implicit, from the .tis file).
• TermFreq entries are ordered by increasing document number.
   DocDelta determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as the next Int.
• For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of Ints: 15, 8, 3 (see the decoding sketch after this slide).
• Term frequency can be obtained from this file.

Lucene Index Files: Position file (.prx)
Format: <TermPositions>
   TermPositions: <Positions>
   Positions: <PositionDelta>
• TermPositions are ordered by term (the term is implicit, from the .tis file).
• Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).
   PositionDelta: the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
• For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of Ints: 4, 5, 4.

Query Syntax and Examples
• Terms with fields and phrases
   title:right AND text:go
   title:right AND go (go appears in the default field "text")
   title:"the right way" AND go
• Boolean operators: AND, "+", OR, NOT and "-"
   "Linux OS" AND system
   Linux OR system, Linux system
   +Linux system
   +Linux -system
• Grouping
   title:(+linux +"operating system")
• Range
   date:[05072007 TO 05232007] (inclusive)
   author:{king TO mason} (exclusive)
• Ranking weight boosting ^
   title:"Bell" author:"Hemmingway"^3.0
   Default boost value is 1. May be <1 (e.g. 0.2)
• Proximity
   "quick fox"~4
• Wildcard
   pla?e (plate or place or plane)
   practic* (practice or practical or practically)
• Fuzzy (edit distance as similarity)
   planting~0.75 (granting or planning)
   roam~ (default is 0.5)
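A tiny sketch of how the DocDelta encoding above unpacks, using the slide's example sequence 15, 8, 3; the class name is hypothetical, and real Lucene reads these values as VInts from the segment files rather than from an in-memory array:

    // Decode a .frq-style postings sequence: each entry is DocDelta and,
    // when DocDelta is even, an extra int holding the frequency.
    public class FrqDecodeDemo {
      public static void main(String[] args) {
        int[] ints = {15, 8, 3};  // example sequence from the slide
        int doc = 0;
        for (int i = 0; i < ints.length; ) {
          int docDelta = ints[i++];
          doc += docDelta >> 1;                 // DocDelta/2 is the gap to the previous doc
          int freq = (docDelta & 1) == 1 ? 1    // odd DocDelta encodes freq == 1
                                         : ints[i++];  // even: freq is the next int
          System.out.println("doc=" + doc + " freq=" + freq);
        }
      }
    }

Decoding yields (doc 7, freq 1) and (doc 11, freq 3), matching the worked example above.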

6. Searching
• Concurrent search query handling
   Multiple searchers at once
   Thread safe
• Additions or deletions to the index are not reflected in already open searchers
   They must be closed and reopened (see the reopen sketch after this slide)
• Use commit or optimize on the IndexWriter

Searching: Example
• Indexing "LexCorp BFG-9000":
   WhitespaceTokenizer → LexCorp | BFG-9000
   WordDelimiterFilter (catenateWords=1) → Lex | Corp | BFG | 9000 | LexCorp
   LowercaseFilter → lex | corp | bfg | 9000 | lexcorp
• Query "Lex corp bfg9000":
   WhitespaceTokenizer → Lex | corp | bfg9000
   WordDelimiterFilter (catenateWords=0) → Lex | corp | bfg | 9000
   LowercaseFilter → lex | corp | bfg | 9000
• A match!

Query Processing
• The field info and the term info index are held in memory and consulted in constant time; the term dictionary, frequency file, and position file are then read with random file access.

Scoring Function
• Similarity is specified in schema.xml
• score(Q,D) = coord(Q,D) · queryNorm(Q) · ∑_{t in Q} ( tf(t in D) · idf(t)² · t.getBoost() · norm(D) )
• Term-based factors
   tf(t in D): term frequency of term t in document D
     – default: raw term frequency
   idf(t): inverse document frequency of term t in the entire corpus
     – default: ln(NumDocs / (docFreq + 1)) + 1
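A minimal sketch of the commit-then-reopen pattern described above, assuming Lucene 3.6-era APIs (IndexReader.openIfChanged was added in 3.5); the helper name refresh is hypothetical:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;

    // After the writer commits, an already-open searcher still sees the old
    // point-in-time view; obtain a fresh reader/searcher to see the changes.
    public class ReopenSketch {
      public static IndexSearcher refresh(IndexWriter writer, IndexReader reader,
                                          IndexSearcher searcher) throws Exception {
        writer.commit();                                            // flush pending adds/deletes
        IndexReader newReader = IndexReader.openIfChanged(reader);  // null if nothing changed
        if (newReader == null) {
          return searcher;                 // index unchanged; keep serving with the old searcher
        }
        searcher.close();
        reader.close();
        return new IndexSearcher(newReader);
      }
    }

Closing the old searcher and reader only after a changed reader is obtained keeps in-flight queries on the old point-in-time view, which is what the thread-safety bullet above relies on.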
