Keyword-based Queries Single words - PDF document

Information Retrieval
Yannis Tzitzikas, University of Crete
CS-463, Spring 2005

• Keyword-based Queries
  – Single-word Queries
  – Context Queries
    • Phrasal Queries
    • Proximity Queries
  – Boolean Queries
  – Natural Language Queries
• Pattern Matching
  – Simple
  – Allowing errors (Levenshtein distance, LCS: longest common subsequence)
  – Regular expressions
• Structural Queries (will be covered in a subsequent lecture)
• Query Protocols

Single-Word Queries

Context Queries
• Ability to search for words in a given context, that is, near other words.
• Types of context queries:
  – Phrasal Queries
  – Proximity Queries

Phrasal Queries
• Retrieve documents containing a specific phrase (an ordered list of contiguous words):
  – “information theory”
  – “to be or not to be”
• May allow intervening stop words and/or stemming.
  – “buy camera” matches:
    – “buy a camera”,
    – “buy  a camera” (two spaces),
    – “buying the cameras”, etc.

[Figure: inverted index. The index file lists the index terms (computer, database, science, system) with their document frequencies df; each term points to a postings list of (D_j, tf_j) entries.]

Phrasal Retrieval with Inverted Indices
• The inverted index must also store the positions of each keyword in each document.
• Retrieve the documents and positions for each individual word, intersect the document sets, and finally check that the keyword positions are ordered and contiguous.
• It is best to start the contiguity check with the least common word in the phrase.
• See “Indexing and Searching”.
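The phrasal retrieval steps above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the names `build_positional_index` and `phrase_search` are made up for this example, tokenization is a plain whitespace split, and "least common" is taken to mean the word occurring in the fewest documents.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [token positions]} (hypothetical helper)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return the ids of documents containing the words of `phrase` contiguously and in order."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()

    # Step 1: intersect the document sets of the individual words.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])

    # Step 2: start the contiguity check from the least common word
    # (assumption: fewest documents in its postings list).
    anchor = min(range(len(words)), key=lambda i: len(index[words[i]]))

    hits = set()
    for doc_id in candidates:
        position_sets = [set(index[w][doc_id]) for w in words]
        for p in index[words[anchor]][doc_id]:
            start = p - anchor  # where the phrase would have to begin
            if all(start + i in position_sets[i] for i in range(len(words))):
                hits.add(doc_id)
                break
    return hits
```

With `docs = {1: "information theory is fun", 2: "theory of information"}`, a call such as `phrase_search(build_positional_index(docs), "information theory")` selects only document 1, because document 2 contains both words but not contiguously in that order.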

Proximity Queries
• A list of words with specific maximal distance constraints between terms.
• Example:
  – “dogs” and “race” within 4 words
  – will match “…dogs will begin the race…”
• May also perform stemming and/or not count stop words.
• The order may or may not be important.

Proximity Retrieval with Inverted Index
• Use an approach similar to phrasal search to find documents in which all keywords occur in a context that satisfies the proximity constraints.
• During the binary search for positions of the remaining keywords, find the position of keyword k_i closest to p (a position of the keyword the check starts from) and verify that it is within the maximum allowed distance.
• See “Indexing and Searching”.
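The binary-search step above can be sketched as follows for a single document. This is an illustration under stated assumptions: `closest_position` and `within_proximity` are hypothetical names, distance is measured as the difference between token positions, word order is ignored, and every keyword is only required to lie within the bound of the starting keyword's position p.

```python
from bisect import bisect_left


def closest_position(sorted_positions, p):
    """Return the entry of a non-empty sorted list that is nearest to p (binary search)."""
    i = bisect_left(sorted_positions, p)
    # Only the neighbours on either side of the insertion point can be closest.
    neighbours = sorted_positions[max(i - 1, 0):i + 1]
    return min(neighbours, key=lambda q: abs(q - p))


def within_proximity(doc_positions, keywords, max_distance):
    """doc_positions maps each keyword to its sorted positions in one document."""
    if any(not doc_positions.get(k) for k in keywords):
        return False
    # Start from the keyword with the fewest occurrences in this document.
    anchor = min(keywords, key=lambda k: len(doc_positions[k]))
    others = [k for k in keywords if k != anchor]
    for p in doc_positions[anchor]:
        if all(abs(closest_position(doc_positions[k], p) - p) <= max_distance
               for k in others):
            return True
    return False
```

For the text "the dogs will begin the race today", "dogs" is at position 1 and "race" at position 5, so the pair satisfies a bound of 4 but not a bound of 3.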

Boolean Queries
• Keywords combined with Boolean operators:
  – OR: (e1 OR e2)
  – AND: (e1 AND e2)
  – BUT: (e1 BUT e2), which satisfies e1 but not e2
• Negation is only allowed through BUT, so that the inverted index can still be used efficiently: the result is obtained by filtering another efficiently retrievable set.
• Naïve users have trouble with Boolean logic.

Evaluation of Boolean queries:
  – Primitive keyword: Retrieve the containing documents using the inverted index.
  – OR: Recursively retrieve e1 and e2 and take the union of the results.
  – AND: Recursively retrieve e1 and e2 and take the intersection of the results.
  – BUT: Recursively retrieve e1 and e2 and take the set difference of the results.

“Natural Language” Queries
• Full-text queries as arbitrary strings.
• Typically just treated as a bag of words for a vector-space model.
• Typically processed using standard vector-space retrieval methods.
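Returning to Boolean queries, the recursive evaluation rules listed above can be sketched as follows. The query representation (a keyword string, or a tuple `(operator, e1, e2)`) and the function name `evaluate` are assumptions made for this example, and the inverted index is reduced to a dictionary from keyword to a set of document ids.

```python
def evaluate(query, index):
    """Evaluate a Boolean query against a {keyword: set of doc ids} index.

    `query` is either a keyword (str) or a tuple (op, e1, e2) with op in
    {"OR", "AND", "BUT"}; this encoding is hypothetical.
    """
    if isinstance(query, str):
        # Primitive keyword: look up its postings in the inverted index.
        return set(index.get(query, ()))
    op, e1, e2 = query
    left, right = evaluate(e1, index), evaluate(e2, index)
    if op == "OR":
        return left | right   # union
    if op == "AND":
        return left & right   # intersection
    if op == "BUT":
        return left - right   # set difference: e1 but not e2
    raise ValueError(f"unknown operator: {op!r}")
```

Because BUT is a set difference on an already retrieved set, the evaluator never has to enumerate "all documents not containing a keyword", which is the efficiency argument made on the slide.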

Pattern Matching
• Allow queries that match strings rather than word tokens.
• Efficient retrieval requires more sophisticated data structures and algorithms than inverted indices.

Some types of simple patterns:
• Prefixes: Pattern that matches the start of a word.
  – “anti” matches “antiquity”, “antibody”, etc.
• Suffixes: Pattern that matches the end of a word.
  – “ix” matches “fix”, “matrix”, etc.
• Substrings: Pattern that matches an arbitrary contiguous sequence of characters within a word.
  – “rapt” matches “enrapture”, “velociraptor”, etc.
• Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them.
  – “tin” to “tix” matches “tip”, “tire”, “title”, etc.

More Complex Patterns: Allowing Errors
• What if the query or the document contains typos or misspellings?
• Judge the similarity of words (or arbitrary strings) using:
  – Edit distance (Levenshtein distance)
  – Longest Common Subsequence (LCS)
• Allow proximity search with a bound on string similarity.
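The four simple pattern types listed above (prefix, suffix, substring, range) can be illustrated over a small sorted vocabulary. This sketch only shows what each pattern matches: the function names are made up, the prefix, suffix and substring checks are plain linear scans rather than the more sophisticated data structures the slide refers to, and the range bounds are assumed to be inclusive.

```python
from bisect import bisect_left, bisect_right

# Toy vocabulary built from the slide's examples; it must stay sorted.
VOCAB = sorted(["antibody", "antiquity", "enrapture", "fix", "matrix",
                "tip", "tire", "title", "velociraptor"])


def prefix_matches(vocab, prefix):
    return [w for w in vocab if w.startswith(prefix)]


def suffix_matches(vocab, suffix):
    return [w for w in vocab if w.endswith(suffix)]


def substring_matches(vocab, fragment):
    return [w for w in vocab if fragment in w]


def range_matches(vocab, low, high):
    """Words w with low <= w <= high, located by binary search on the sorted list."""
    return vocab[bisect_left(vocab, low):bisect_right(vocab, high)]
```

For example, `range_matches(VOCAB, "tin", "tix")` selects the words between the two bounds in lexicographic order, which for this vocabulary are "tip", "tire" and "title".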

Edit (Levenshtein) Distance
• Minimum number of character deletions, additions, or replacements needed to make two strings equal.
  – “misspell” to “mispell” is distance 1
  – “misspell” to “mistell” is distance 2
  – “misspell” to “misspelling” is distance 3
• Can be computed efficiently using dynamic programming:
  – O(mn) time, where m and n are the lengths of the two strings being compared.

Longest Common Subsequence (LCS)
• Length of the longest subsequence of characters shared by two strings.
• A subsequence of a string is obtained by deleting zero or more characters.
• Examples:
  – “misspell” to “mispell” is 7
  – “misspelled” to “misinterpretted” is 7: “mis…p…e…ed”
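Both measures above can be computed with the standard O(mn) dynamic-programming tables. The sketch below keeps only two rows of each table at a time; the function names `edit_distance` and `lcs_length` are chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of deletions, additions or replacements."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)           # distance from a[:i] to ""
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # delete a[i-1]
                          curr[j - 1] + 1,      # add b[j-1]
                          prev[j - 1] + cost)   # replace, or keep if equal
        prev = curr
    return prev[len(b)]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Run on the slide's examples, `edit_distance("misspell", "mistell")` gives 2 (delete one "s", replace "p" with "t") and `lcs_length("misspelled", "misinterpretted")` gives 7.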

More Complex Patterns: Regular Expressions
• A language for composing complex patterns from simpler ones.
  – An individual character is a regex.
  – Union: If e1 and e2 are regexes, then (e1 | e2) is a regex that matches whatever either e1 or e2 matches.
  – Concatenation: If e1 and e2 are regexes, then e1 e2 is a regex that matches a string consisting of a substring that matches e1 immediately followed by a substring that matches e2.
  – Repetition (Kleene closure): If e1 is a regex, then e1* is a regex that matches a sequence of zero or more strings that match e1.

Regular Expression Examples
• (u|e)nabl(e|ing) matches
  – unable
  – unabling
  – enable
  – enabling
• (un|en)*able matches
  – able
  – unable
  – unenable
  – enununenable
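The two example patterns above can be checked with Python's standard `re` module, which supports union, concatenation and Kleene closure with the same syntax. The helper name `matches` is made up for this sketch; `re.fullmatch` is used so that the whole word has to match the pattern.

```python
import re


def matches(pattern, word):
    """True if the entire word matches the regular expression."""
    return re.fullmatch(pattern, word) is not None


# Union and concatenation: one of "u"/"e", then "nabl", then "e" or "ing".
UNION_PATTERN = r"(u|e)nabl(e|ing)"

# Kleene closure: zero or more repetitions of "un" or "en", then "able".
CLOSURE_PATTERN = r"(un|en)*able"
```

For instance, `matches(CLOSURE_PATTERN, "enununenable")` holds because the word decomposes as en + un + un + en + able, while `matches(UNION_PATTERN, "disable")` does not hold.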

Enhanced Regexes (Perl)
• Special terms for common sets of characters, such as alphabetic, numeric, or a general “wildcard”.
• Special repetition operator (+) for 1 or more occurrences.
• Special optional operator (?) for 0 or 1 occurrences.
• Special repetition operator for a specific range of occurrence counts: {min,max}.
  – A{1,5}  One to five A’s
  – A{5,}   Five or more A’s
  – A{5}    Exactly five A’s

Perl Regexes
• Character classes:
  – \w (word char)   Any alphanumeric (negation: \W)
  – \d (digit char)  Any digit (negation: \D)
  – \s (space char)  Any whitespace (negation: \S)
  – .  (wildcard)    Anything
• Anchor points:
  – \b (boundary)  Word boundary
  – ^  Beginning of string
  – $  End of string
• Examples:
  – U.S. phone number with optional area code:
    /\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b/
  – Email address:
    /\b\S+@\S+(\.com|\.edu|\.gov|\.org|\.net)\b/
Note: Packages are available to support Perl regexes in Java.
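The operators above can be tried out in Python, whose `re` module accepts the same syntax for the constructs used here ({min,max}, \d, \s, \S, \b). This is a sketch, not Perl code: the helper `first_match`, the sample texts and the address `someone@example.edu` are made up for illustration, and the Perl `/.../` delimiters are dropped.

```python
import re

# The two patterns from the slide, written as Python raw strings.
PHONE = re.compile(r"\b(\(\d{3}\)\s?)?\d{3}-\d{4}\b")
EMAIL = re.compile(r"\b\S+@\S+(\.com|\.edu|\.gov|\.org|\.net)\b")


def first_match(pattern, text):
    """Return the text of the first match of a compiled pattern, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Counted repetition: one to five, five or more, exactly five A's.
one_to_five = re.fullmatch(r"A{1,5}", "AAA") is not None
five_or_more = re.fullmatch(r"A{5,}", "AAAAAAA") is not None
exactly_five = re.fullmatch(r"A{5}", "AAAAA") is not None
```

Here `first_match(PHONE, "call 555-1234 today")` yields "555-1234" and `first_match(EMAIL, "write to someone@example.edu today")` yields the whole address. One thing this experiment makes visible: a word boundary cannot occur directly before "(", so on "(555) 123-4567" the phone pattern as written matches only "123-4567".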
