3 text and document databases
play

3. Text and document databases Normal databases: formatted records; - PowerPoint PPT Presentation

3. Text and document databases Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML). Application areas: - Office automation, document archives - Digital libraries - Electronic dictionaries


  1. 3. Text and document databases � Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML). � Application areas: - Office automation, document archives - Digital libraries - Electronic dictionaries /encyclopedias - Electronic newspapers - Source program libraries - Automated law and patent offices � What is a ‘document’? E.g. a book, chapter, paragraph, article, letter, web page, source program, etc. � General problem setting: Searching documents by contents ; often called also associative search. � Usually based on keywords or terms occurring in documents. � Search terms may be combined with Boolean connectives (AND, OR, NOT) MMDB-3 J. Teuhola 2012 39

  2. Non-indexed methods for string matching � Sequential full-text scanning; slow but some advantages: - No extra disk space required - Updates are fast (no index maintenance) - Partial-match retrieval is rather simple (using wildcard characters) - Approximate matching is also possible (using a threshold for edit distance ) � Popular efficient algorithm: Boyer-Moore technique - Peculiar feature: searching is faster for longer search strings. - Based on preprocessing of the search string - Performance is sublinear in practice - Disk accesses cannot be reduced, except for extremely long search strings. � Several other string matching algorithms exist (skipped). MMDB-3 J. Teuhola 2012 40

  3. Inverted indexing � Traditional way of improving search speed. � What means ‘inverted’? A document is a list of words, but the index gives for each word the list of documents where the word appears. Example documents: Inverted index: BE � {D 1 , D 2 , D 3 } D 1 : ”TO BE OR NOT TO BE” DO � D 2 : ”TO BE IS TO DO” {D 2 , D 3 } IS � D 3 : ”DO BE DO BE DO” {D 2 } NOT � {D 1 } OR � {D 1 } TO � {D 1 , D 2 } MMDB-3 J. Teuhola 2012 41

  4. Inverted indexing (cont.) � The set of words is called a lexicon . Some principles for it: - Case folding : Convert uppercase letters to lowercase - Stemming : Remove suffixes; index only the root forms of terms. - Do not include stopwords , like “the”, “is”, “as”, “that”, etc. which occur very often but do not bear semantic relevance. � The pointers to term occurrences may appear in different granularities : - Coarse-grained index identifies document groups where the term appears. - Moderate-grained index identifies the relevant documents - Fine-grained index contains sentence, word, or even byte numbers for term occurrences. MMDB-3 J. Teuhola 2012 42

  5. Inverted indexing (cont.) � Coarse-grained index: - Small index size - Small maintenance penalty - Lot of plain text scanning - False drops for multi-term queries (terms do not co-occur). � Fine-grained index: - Large index size - High maintenance penalty - Supports proximity queries (terms occurring together) MMDB-3 J. Teuhola 2012 43

  6. Inverted indexing (cont.) Ways to save storage space: � Front compression : Index in alphabetic order; the prefix common with the previous term is expressed compactly. � Tail ( suffix ) compression : Store terms to the point where they can be uniquely distinguished from other terms. � In fine-grained index: Instead of full pointers, store intervals of successive occurrences of a term. Compound queries: � AND: Retrieve pointer lists and compute their intersection . � OR: Retrieve pointer lists and form their union . � NOT: This is usually combined with AND, so that we can apply set difference to the pointer lists of terms. MMDB-3 J. Teuhola 2012 44

  7. Data structures for inverted indexes (a) B + -tree , with index terms as keys and pointer lists as leaves. (b) Hash organization , preferably a dynamic version, e.g. - Linear hashing - Extendible hashing (c) Trie -structure : each node represents one character, and the term is found by following a path from the root to a leaf. Problem: must be mapped to external storage In each case, the variable-length pointer lists can be either locally in the structure, or (preferably) detached. MMDB-3 J. Teuhola 2012 45

  8. Algorithm for building an inverted index 1. For each document, gather the index terms, combined with pointers to the actual locations. The result is a big sequential file (called S). 2. Sort S by using e.g. external mergesort . 3. Combine the successive entries representing the same index term. 4. Build the index (B + -tree, hash table, ...) on the index terms and let the leaf entries refer to the detached pointer lists, stored as variable-length records. Assessment of inverted indexes: � A very effective access method for retrieval. � A very popular technique in practice for static document sets (in spite of the storage penalty). � Presumably the main retrieval tool in web search engines . MMDB-3 J. Teuhola 2012 46

  9. Bitmap indexing � Another representation for the inverted index � Suitable for coarse and moderate-grained indexing � A bitmap is a matrix with a row for each index term , and a column for each document. Element < i , j > is 1, if term i occurs in document j , otherwise 0. d1 d2 d3 BE 1 1 1 Example documents: DO 0 1 1 d1. ”TO BE OR NOT TO BE” IS 0 1 0 d2. ”TO BE IS TO DO” NOT 1 0 0 d3. ”DO BE DO BE DO” OR 1 0 0 TO 1 1 0 [Note: Normally these index terms would be stopwords .] MMDB-3 J. Teuhola 2012 47

  10. Bitmap indexing (cont.) � Especially efficient for Boolean queries: AND, OR, NOT can be implemented directly in hardware (e.g. 64 bits in parallel). � Problem: High storage consumption (#terms × #documents); the matrix is usually sparse . � Possible combined structure: Use an inverted index for the less frequent terms, and a bitmap for the more frequent terms. � One option: compression . E.g. run-length coding : Replace sequences of zeroes by their count (which gets close to the normal inverted index). Also the encoding of integers has to be decided. Example: Bitmap row = 001010000010001100000100... Run-length code = <2, 1, 5, 3, 0, 5, ...> MMDB-3 J. Teuhola 2012 48

  11. Hierarchical compression of bit strings � Divide the string into equal-sized blocks, � Apply disjunction (OR) to the bits within each block, creating a higher-level bit: A block of zeroes generates a 0-bit, others a 1-bit. � The process is repeated on higher levels, recursively. � Advantage: Single bits are easily accessible, by studying one path in the tree. 1 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 � Compression: zero blocks need not be stored, at all. Most of the leaf blocks are usually zero blocks (sparse bitmap). MMDB-3 J. Teuhola 2012 49

  12. Signature indexing � A probabilistic technique � Can be generalized to any objects characterized by a variable-size set of index terms or descriptors (also called features ). � Signature is a bit-array, generated by hashing the index terms to indexes of the array. It is usually at least some hundreds of bits long. � Signatures are collected to a separate signature file , which is smaller than the whole space required by the documents. � The signature file acts as a filtering mechanism , to reduce the amount of actual data to be searched. � The structure enables partial-match retrieval (subset of terms match) � Queries can also be considered a kind of documents (collection of keywords), and be converted into signature form. The query signature Q is compared against document signatures D i . If 1-bits of Q are included in 1-bits of D i , then D i is a candidate result. � Signatures are approximate descriptions of documents. False drops must be eliminated by checking the actual match of all candidates. MMDB-3 J. Teuhola 2012 50

  13. Signature indexing (cont.) Advantages of signature files: � Low storage consumption, compared to (fine-grained) inverted indices. Typical value is 10-20% of the primary database size � More convenient than multidimensional indexes (to be studied later) Signature generation methods: � Word signature method: Each index term is hashed into a sparse bit pattern, and the patterns are concatenated to form the document signature. This method usually results in higher false drop probability, but preserves sequencing information of terms in documents. � Superimposed coding : Each index term produces a bit pattern of full signature size. The patterns are OR’ed to form the document signature. This is here the default method. How to minimize false drops? � The proportions of 0’s and 1’s should be equal and uniformly distributed. ( Weight = number of 1-bits in the signature) MMDB-3 J. Teuhola 2012 51

  14. Signature indexing: Example Document index terms: D 1 : database, object, programming, schema D 2 : algorithm, computer, programming D 3 : algorithm, data structure, programming Hashing: hash(“algorithm”) = 3 hash(“computer”) = 1 hash(“database”) = 7 hash(“data structure”) = 5 hash(“object”) = 6 hash(“programming”) = 4 hash(“schema”) = 1 Signatures: D 1 = 10010110, D 2 = 10110000, D 3 = 00111000 Query: Documents about “algorithm” and “programming”? Query signature: Q = 00110000, matches with D 2 and D 3 . MMDB-3 J. Teuhola 2012 52

Recommend


More recommend