Query Languages

R. Baeza-Yates and B. Ribeiro-Neto
Modern Information Retrieval, Chapter 4

Jon Atle Gulla
NTNU/IDI/IS, TDT4215 Query Languages

Query Languages

• How do you specify your information needs?
• Query types:
  – Keyword-based querying
  – Pattern matching
  – Structural queries
Keyword-based querying: Single-word queries

  skiing
  TDT4215 exam NTNU Trondheim
  skiing norway snowboarding

• Result is a set of documents containing at least one of the words of the query
• Documents ranked according to relevance
• Web extensions:
    skiing telemark
    skiing +telemark
    skiing -telemark

Keyword-based querying: Context queries

Words appearing near each other may signal a higher relevance than words far apart.

• Phrases:
  – a sequence of single-word queries
      “new york times”
      “to be or not to be”
      “olympic games” london
• Proximity:
  – a sequence of words is given together with a maximum allowed distance between them
      ntnu trondheim
        “…the university in trondheim is ntnu…”
        “…ntnu is situated in trondheim…”
      eggen rbk
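The two context-query types above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine implements them: the helper names `tokenize`, `contains_phrase`, and `within_distance` are hypothetical, and distance is assumed to be counted in word positions.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively and in order in `text`."""
    tokens, words = tokenize(text), tokenize(phrase)
    n = len(words)
    return n > 0 and any(tokens[i:i + n] == words
                         for i in range(len(tokens) - n + 1))

def within_distance(text, word_a, word_b, max_distance):
    """True if some occurrences of the two words are at most
    `max_distance` word positions apart (in either order)."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)

doc1 = "the university in trondheim is ntnu"
doc2 = "ntnu is situated in trondheim"
print(within_distance(doc1, "ntnu", "trondheim", 2))  # True: 2 positions apart
print(within_distance(doc2, "ntnu", "trondheim", 2))  # False: 4 positions apart
print(contains_phrase("Read the New York Times today", "new york times"))  # True
```

With a maximum distance of 2, only the first sample text matches the proximity query; raising the limit to 4 would admit the second one as well.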
Phrasing or not phrasing

Query: new york times

• How to deal with queries that have potential phrases?
  – How to recognize a potential phrase?
  – How to interpret potential phrases?
• Candidate interpretations:
    “new york times”
    “new york” times
    new york times
• Interpretation affects ranking!

Proximity search

Query: new york times

Documents:
  – “It's 3 pm in New York, what time is it in the rest of the world?”
  – “For your reading pleasure, we present historic issues from the New York Times.”
  – “City of York Council - list of new library opening times and addresses.”
  – “Three webcam views of Times Square, New York.”

• Which document is the most relevant one?
• How do we achieve this ranking?
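One possible answer to the ranking question is to score each document by the shortest span of text that contains all query words. The sketch below applies that idea to the four sample documents. It is an assumption for illustration, not the method prescribed by the course or the textbook, and the function names `smallest_window` and `rank` are hypothetical.

```python
import re
from itertools import product

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def smallest_window(text, query):
    """Length in tokens of the shortest span containing every query word,
    or None if some query word does not occur at all."""
    tokens = tokenize(text)
    positions = [[i for i, t in enumerate(tokens) if t == w]
                 for w in tokenize(query)]
    if any(not p for p in positions):
        return None
    # Brute force over all position combinations; fine for short texts.
    return min(max(c) - min(c) + 1 for c in product(*positions))

def rank(documents, query):
    """Documents containing all query words, tightest window first."""
    scored = [(smallest_window(d, query), d) for d in documents]
    return sorted((s for s in scored if s[0] is not None), key=lambda s: s[0])

docs = [
    "It's 3 pm in New York, what time is it in the rest of the world?",
    "For your reading pleasure, we present historic issues from the New York Times.",
    "City of York Council - list of new library opening times and addresses.",
    "Three webcam views of Times Square, New York.",
]
for window, doc in rank(docs, "new york times"):
    print(window, doc)
```

Under this scoring the New York Times document comes first (window of 3 tokens), the Times Square document second (4), and the City of York document last (8); the first document is dropped because it contains "time" but not "times". Word order is ignored here, so a phrase-aware ranker would separate the first two results more clearly.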
Keyword-based querying: Boolean queries

• Boolean operators:
  – OR (e1 OR e2)
  – AND (e1 AND e2)
  – BUT (e1 BUT e2): e1 but NOT e2
• No ranking of documents provided
• “Fuzzy boolean”: meaning of AND and OR relaxed

Natural language:
• Query is an enumeration of words and context queries
• All documents matching a portion of the user query are retrieved
• Higher ranking is assigned to those documents matching more parts of the query
• Query and documents viewed as vectors

Pattern matching

• A pattern is a set of syntactic features that must occur in a text segment, ranging from simple (e.g. words) to complex (e.g. regular expressions)
• Typical patterns:
  – words
  – prefixes: ‘comput’ -> ‘computer’, ‘computation’, ‘computing’
  – suffixes: ‘ters’ -> ‘computers’, ‘testers’, ‘printers’
  – substrings: ‘tal’ -> ‘coastal’, ‘talk’, ‘metallic’
  – ranges: between ‘held’ and ‘hold’ -> ‘hoax’, ‘hissing’
  – allowing errors
  – regular expressions
  – extended patterns
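The simpler pattern types listed above can be tried out against a small sorted vocabulary. The following is a minimal sketch, assuming the vocabulary consists only of the example words from the slide; the function names are hypothetical, and a range is taken to include both of its end points.

```python
from bisect import bisect_left, bisect_right

# Sorted vocabulary built from the example words on the slide.
VOCAB = sorted([
    "computer", "computation", "computing", "computers", "testers",
    "printers", "coastal", "talk", "metallic", "hoax", "hissing",
    "held", "hold",
])

def prefix_match(prefix):
    """Words that start with `prefix`."""
    return [w for w in VOCAB if w.startswith(prefix)]

def suffix_match(suffix):
    """Words that end with `suffix`."""
    return [w for w in VOCAB if w.endswith(suffix)]

def substring_match(fragment):
    """Words that contain `fragment` anywhere."""
    return [w for w in VOCAB if fragment in w]

def range_match(low, high):
    """Words lexicographically between `low` and `high`, inclusive."""
    return VOCAB[bisect_left(VOCAB, low):bisect_right(VOCAB, high)]

print(prefix_match("comput"))       # ['computation', 'computer', 'computers', 'computing']
print(suffix_match("ters"))         # ['computers', 'printers', 'testers']
print(substring_match("tal"))       # ['coastal', 'metallic', 'talk']
print(range_match("held", "hold"))  # ['held', 'hissing', 'hoax', 'hold']
```

Keeping the vocabulary sorted lets the range query use binary search; prefix queries could be answered the same way, while suffix and substring queries need a linear scan or a dedicated index.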
Structural queries

• Allowing the user to query documents based on their structure (not on their content)
• Mixing content and structure in a query allows us to pose more expressive queries
• Three main structures:
  – form-like fixed structures
  – hypertext structures
  – hierarchical structures

Fixed structure

• Document has a fixed set of fields, much like a filled form
• Intended for document collections with fixed structures
• Example:
  – Mail archive as a set of mails
  – Each mail has a standard set of fields:
    • sender
    • receiver
    • subject
    • date
    • body
  – User can search for mails sent to a given person with “football” in the subject field
• Leads to the relational model
  – Extend SQL to full text retrieval -> SFQL
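The mail-archive example maps directly onto a relational table. The sketch below uses Python's built-in `sqlite3` module and plain SQL with `LIKE`; it is not SFQL, and the table name, addresses, and sample rows are made up for illustration.

```python
import sqlite3

# In-memory database with one table whose columns are the mail fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mail (sender TEXT, receiver TEXT, subject TEXT, date TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO mail VALUES (?, ?, ?, ?, ?)",
    [
        ("anna@example.org", "ole@example.org", "Football practice on Friday",
         "2024-03-01", "See you at the pitch."),
        ("kari@example.org", "ole@example.org", "Exam schedule",
         "2024-03-02", "The exam is in May."),
        ("anna@example.org", "kari@example.org", "football tickets",
         "2024-03-03", "I have two spare tickets."),
    ],
)

def mails_to(receiver, subject_word):
    """Mails sent to `receiver` whose subject field contains `subject_word`."""
    cursor = conn.execute(
        "SELECT sender, subject FROM mail "
        "WHERE receiver = ? AND subject LIKE ? ORDER BY date",
        (receiver, f"%{subject_word}%"),
    )
    return cursor.fetchall()

print(mails_to("ole@example.org", "football"))
# [('anna@example.org', 'Football practice on Friday')]
```

The structural part of the query (which field to look in, who the receiver is) is handled by ordinary SQL conditions, while the content part is reduced to a substring test; a full-text extension would replace that test with ranked text retrieval.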
Hypertext

• Hypertext is a directed graph where the nodes hold some text and the links represent connections between nodes
• Search by following hyperlinks
• “give me documents that link to X”

Hierarchical structure

• Hierarchical structure is an intermediate structuring model that lies between fixed structure and hypertext structure
• Sample of hierarchical models:
  – PAT expressions: structure is marked in the text as tags (e.g. HTML)
  – Overlapped lists: hierarchical, partly overlapping regions of text defined
  – Lists of references: querying path expressions in text
  – Proximal nodes: many fixed hierarchical structures of text defined
  – Tree matching: document and query give a tree structure
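Returning to the hypertext model, the query “give me documents that link to X” amounts to looking up the incoming edges of a node in the link graph. A minimal sketch, assuming the graph is stored as a dictionary from each document to the documents it links to; the document names and the function `documents_linking_to` are hypothetical.

```python
# Directed link graph: each document maps to the documents it links to.
LINKS = {
    "A": ["B", "X"],
    "B": ["X"],
    "C": ["A"],
    "X": [],
}

def documents_linking_to(graph, target):
    """All documents that have an outgoing link to `target`, sorted by name."""
    return sorted(source for source, outgoing in graph.items()
                  if target in outgoing)

print(documents_linking_to(LINKS, "X"))  # ['A', 'B']
print(documents_linking_to(LINKS, "A"))  # ['C']
```

Scanning every node is acceptable for a toy graph; a larger collection would normally keep a precomputed reverse index of incoming links.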
Conclusions

• Query types:
  – Keyword-based queries:
    • Single-word queries
    • Context queries
    • Boolean queries
    • Natural language
  – Pattern matching
  – Structural queries:
    • Fixed structure
    • Hypertext
    • Hierarchical structure