Simulation of Within-Session Query Variations using a Text Segmentation Approach
Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones
CNGL, School of Computing, Dublin City University, Ireland

Outline
• Introduction to query reformulation
• Automatic generation of query reformulations
• Characteristics of the reformulations in terms of retrieval results
• Evaluation
• Conclusions and future work

Query Reformulation Types in IR
• Specialization: more particular information need than in the previous query
– Example: “Mahatma Gandhi” → “Mahatma Gandhi non-violence movement”
• Generalization: more general information need than in the previous query
– Example: “Mahatma Gandhi assassination” → “Mahatma Gandhi life and works”
• Query drift: move toward a related but different information need
– Example: “Mahatma Gandhi assassination” → “Gandhi film”

Motivation
• Hypothesis: automatic query reformulations can be used to simulate user query sessions
• Objective: simulate query sessions in large quantities
– Less time-consuming
– Less expensive
– No privacy issues
– Independent from real data
• Simulated query sessions can help in
– Session IR tasks: the goal is to improve IR effectiveness over an entire query session for a user
– Collaborative IR tasks: the goal is to improve IR effectiveness for a new user by utilizing user responses to related queries

Specialization
• Initial query: “wildlife poaching”
– Very general search; no restrictions on particular animal species or locations
• After reading two documents, the user now knows that poaching is frequent for “African lions” and “Indian tigers”
• Adding these words makes the query more specific

Generalization
• Initial query: “osteoporosis”
– Specific information request (uses a technical term); the user may not be sure what it actually means
• Retrieved document: about bone diseases in general, with a dedicated section on osteoporosis
• After reading the document, the user knows that “osteoporosis” is a type of “bone disease”
• Substituting “osteoporosis” with the words “bone” and “disease” now means the user is interested in “bone diseases” in general instead of one particular bone disease

Text Segmentation
• Documents are composed of a series of densely discussed subtopics
• Text segmentation draws boundaries between topic shifts
• Example subtopics (from M. Hearst, CL, 1997): “The moon's chemical composition”; “Introduction – The search for life in space”; “How the moon helped life evolve on earth”

Term Distribution
• Perform text segmentation to get blocks of coherent text passages
• Terms densely distributed in a sub-topic are useful for specific reformulations
• Terms uniformly distributed throughout a document are useful for general reformulations
• Illustration with two topics: Term 1 is dominant in topic 1, Term 2 is dominant in topic 2, Term 3 is a general term
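The distinction between densely and uniformly distributed terms can be illustrated with a minimal sketch. It assumes segments are already available as lists of tokens; the function names, the toy segments, and the cut-offs ("occurs in exactly one segment" for specific, "occurs in every segment" for general) are hypothetical choices for illustration, not taken from the slides.

```python
from collections import Counter

def segment_frequency(segments):
    """Return sf(t): the number of segments in which each term occurs."""
    sf = Counter()
    for seg in segments:
        sf.update(set(seg))  # count each term at most once per segment
    return sf

def classify_terms(segments, general_ratio=1.0):
    """Split terms into 'specific' and 'general' by their spread over segments.

    Assumption: a term seen in exactly one segment counts as specific, and a
    term seen in at least `general_ratio` of all segments counts as general.
    """
    sf = segment_frequency(segments)
    n = len(segments)
    specific = sorted(t for t, c in sf.items() if c == 1)
    general = sorted(t for t, c in sf.items() if c / n >= general_ratio)
    return specific, general

# Toy document with three segments (hypothetical data).
segments = [
    ["moon", "rock", "mineral", "moon"],
    ["moon", "life", "earth", "tide"],
    ["moon", "space", "life", "search"],
]
specific, general = classify_terms(segments)
print("specific:", specific)
print("general:", general)
```

Here "moon" occurs in every segment and is treated as general, while "life" occurs in two of three segments and falls into neither class under these cut-offs.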

Algorithm for Automatic Query Reformulation
• Use top ranked documents from an initial retrieval step as external information for reformulations
• Categorize terms into two classes, specific and general, by computing their distribution across the segments of the top ranked documents
• Generate candidate query reformulations
– Add the most specific terms from the most/least similar segments of documents to the original query to get a more specific/drifting query
– Substitute original query terms with more general terms obtained from the pseudo-relevant set of documents
• Rank by score and select the best N variants
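The candidate-generation step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes term scores have already been computed and are passed in as plain dictionaries, it replaces the whole query with general terms instead of substituting individual terms, and the function names and example scores are hypothetical.

```python
def rank_terms(term_scores):
    """Order terms by descending score, breaking ties alphabetically."""
    return sorted(term_scores, key=lambda t: (-term_scores[t], t))

def specialize(query_terms, term_scores, n_add=3):
    """Append up to n_add top-scoring terms that are not already in the query."""
    added = [t for t in rank_terms(term_scores) if t not in query_terms][:n_add]
    return list(query_terms) + added

def generalize(term_scores, n_terms=2):
    """Build a more general query from the n_terms top-scoring general terms.

    Assumption: the original query terms are replaced wholesale.
    """
    return rank_terms(term_scores)[:n_terms]

# Hypothetical scores for terms found in the top ranked documents.
specific_scores = {"lions": 2.1, "tigers": 1.8, "poaching": 1.5, "ivory": 0.9, "park": 0.4}
general_scores = {"bone": 3.0, "disease": 2.5, "treatment": 1.0}

print(specialize(["wildlife", "poaching"], specific_scores))
print(generalize(general_scores))
```

Feeding `specialize` with scores from the least similar segments instead of the most similar ones would give a drifting query under the same scheme.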

Term Scores
• Specialization/drift score: combine
– term frequency in segment,
– inverse segment frequency, and
– idf
• Generalization score: combine
– term frequency in document,
– segment frequency, and
– idf
• Combination in mixture model (see paper for details)

Result Set Characteristics
• Specialization:
– Smaller set of relevant documents (queries are typically longer)
– Top ranked documents for the original query become more general with respect to the specific reformulated query but are still relevant (overlap in top ranked documents)
• Generalization:
– Larger set of relevant documents (queries are typically shorter)
– Low overlap and high shift of top ranked documents retrieved in response to the original query

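Evaluation Measures
• Two measures:
– Overlap of retrieved documents at cut-off N = 10, 20, 50 and 500: O(N)
– Net perturbation of the top m documents: p(m)

$$p(m) = \frac{1}{m} \sum_{k=1}^{m} \left( \mathrm{new\_rank}(d_k) - k \right)$$

where d_k is the document at rank k for the original query and new_rank(d_k) is its rank for the reformulated query
• Expected observations:
– High overlap and low perturbation for specialized queries
– Low overlap and high perturbation for general queries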
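Both measures can be sketched in a few lines. The code below is an illustrative reading of the slide, under stated assumptions: overlap is reported as a percentage of the cut-off, the perturbation uses the signed rank difference as written on the slide, and a document missing from the reformulated ranking is assigned the rank one past the end of that list (`missing_rank`, a hypothetical convention the slides do not specify).

```python
def overlap_at(original, reformulated, n):
    """Percentage of the top-n original documents also in the reformulated top-n."""
    shared = set(original[:n]) & set(reformulated[:n])
    return 100.0 * len(shared) / n

def net_perturbation(original, reformulated, m, missing_rank=None):
    """Average of new_rank(d_k) - k over the top-m original documents."""
    if missing_rank is None:
        # Assumption: unseen documents rank just below the reformulated list.
        missing_rank = len(reformulated) + 1
    new_rank = {doc: r for r, doc in enumerate(reformulated, start=1)}
    total = 0
    for k, doc in enumerate(original[:m], start=1):
        total += new_rank.get(doc, missing_rank) - k
    return total / m

# Hypothetical rankings for an original query and one reformulation.
original = ["d1", "d2", "d3", "d4", "d5"]
reformulated = ["d2", "d1", "d7", "d3", "d9", "d4"]

print(overlap_at(original, reformulated, 5))
print(net_perturbation(original, reformulated, 5))
```

In this toy case three of the five original top documents survive in the new top five, and the original documents move down by one rank on average.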

Experiments
• TREC disk 4+5 documents
• TREC-8 topics:
– Topic titles as initial queries for specific and drift reformulations
– Topic descriptions as initial queries for general reformulations
• Top 5 documents retrieved by LM (lambda=0.4)
• C99 algorithm for text segmentation
• Added at most 3 specific terms for specific/drift reformulations
• Retained at most 2 terms for the general reformulations
• Generated query variants
• Judged query variants manually (by two assessors)

Results
Manual assessment: Assessor-1 and Assessor-2. Result set measures: O(10) to O(500) and p(5).

Type      Assessor-1  Assessor-2  O(10)  O(20)  O(50)  O(500)  p(5)
Specific  39 (78%)    26 (52%)    39.0   38.1   42.7   44.7    367.9
General   39 (78%)    43 (86%)    22.4   22.5   24.5   32.2    2208.6
Drift     34 (68%)    35 (70%)    12.0   10.2   8.6    5.9     3853.3

• Highest inter-assessor agreement for drift, since a drift in information need is not subject to personal judgement
• Lowest inter-assessor agreement for specific reformulations, since the semantic specificity of added words can depend on personal judgement
• For specific and general reformulations, the overlap percentage increases with the cut-off rank, which indicates that we get more “seen” documents further down the ranked list
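As a quick arithmetic check, averaging the two assessors' percentages per reformulation type gives the accuracy figures of 65%, 82% and 69% quoted in the conclusions. The snippet only restates numbers from the results; the variable names are illustrative.

```python
# Percentage of reformulations each assessor judged correct (from the results).
judged_pct = {
    "Specific": (78, 52),
    "General": (78, 86),
    "Drift": (68, 70),
}

# Mean of the two assessors' percentages per reformulation type.
average_accuracy = {
    kind: sum(pcts) / len(pcts) for kind, pcts in judged_pct.items()
}

for kind, acc in average_accuracy.items():
    print(f"{kind}: {acc:.0f}%")
```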

Sample Output of Specific Reformulation
Specific reformulations, grouped by assessor judgement (initial query, then added terms):
– Assessor 1 agrees, Assessor 2 agrees: “behavioural genetics” + chromosomes, DNA, genome
– Assessor 1 disagrees, Assessor 2 agrees: “cosmic events” + magnitude, proton, ion
– Assessor 1 agrees, Assessor 2 disagrees: N/A
– Assessor 1 disagrees, Assessor 2 disagrees: “salvaging, shipwreck, treasure” + found, aircraft, Rotterdam
• Specific reformulations involve adding new words which ought to be semantically related to the original keywords, and the degree of semantic closeness is often subject to personal judgement
• One of the assessors does not agree that adding the words “magnitude”, “proton” and “ion” makes the initial query “cosmic events” more specific

An Irish Perspective on Query Reformulation … Wildlife poaching → Elephants → Tigers → Beer ? Images from Flickr

… no, just a typo! Wildlife poaching → Elephants → Tigers → Beer → Bear Images from Flickr

Conclusions and Further Work
• Our proposed method can be used to produce query reformulations with an average accuracy of 65%, 82% and 69% for the specialization, generalization and drift reformulations, respectively
• We introduced metrics, namely the average percentage overlap at a fixed number of documents and the average net perturbation, to quantify the retrieval result set changes
• Further work: investigate the relation between relevant documents for the original query and relevant documents for the query reformulations

Any General or Specific Queries?

Specialization term scores
• tf(t, s): term frequency of term t in a segment s
• |S|/sf(t): how dominant term t is in segment s compared to other segments of the same document
• idf(t): how rare term t is in the collection

$$\phi(t, s) = a \cdot \mathrm{tf}(t, s) \cdot \frac{|S|}{\mathrm{sf}(t)} + (1 - a) \cdot \log(\mathrm{idf}(t))$$

• Add n_s terms with top φ(t, s) scores for a more specific query
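The specialization score can be computed directly from its three ingredients. The sketch below assumes a natural logarithm and a mixing weight a = 0.5; neither value is given on the slide, and the function and parameter names are hypothetical.

```python
import math

def specialization_score(tf_ts, sf_t, num_segments, idf_t, a=0.5):
    """phi(t, s) = a * tf(t, s) * |S| / sf(t) + (1 - a) * log(idf(t)).

    tf_ts: frequency of term t in segment s
    sf_t: number of segments of the document that contain t
    num_segments: |S|, the number of segments in the document
    idf_t: inverse document frequency of t (must be positive)
    a: mixing weight (assumed value; not specified on the slide)
    """
    return a * tf_ts * (num_segments / sf_t) + (1 - a) * math.log(idf_t)

# A term concentrated in one of four segments versus one spread over all four.
concentrated = specialization_score(tf_ts=3, sf_t=1, num_segments=4, idf_t=10.0)
spread_out = specialization_score(tf_ts=3, sf_t=4, num_segments=4, idf_t=10.0)
print(concentrated, spread_out)
```

With equal term frequency and idf, the term confined to a single segment scores higher, which is the behaviour the inverse segment frequency factor is meant to produce.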

Generalization term scores
• tf(t, d): term frequency of term t in a document d (instead of frequency in individual segments)
• sf(t)/|S|: segment frequency (instead of inverse segment frequency)
• idf(t): inverse document frequency

$$\psi(t) = a \cdot \mathrm{tf}(t, d) \cdot \frac{\mathrm{sf}(t)}{|S|} + (1 - a) \cdot \log(\mathrm{idf}(t))$$

• Select n_g terms with top ψ(t) scores for a more general query
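The generalization score mirrors the specialization score with the segment ratio inverted. As before, this is a sketch under assumptions: natural logarithm, mixing weight a = 0.5, and hypothetical function and parameter names.

```python
import math

def generalization_score(tf_td, sf_t, num_segments, idf_t, a=0.5):
    """psi(t) = a * tf(t, d) * sf(t) / |S| + (1 - a) * log(idf(t)).

    tf_td: frequency of term t in document d
    sf_t: number of segments of the document that contain t
    num_segments: |S|, the number of segments in the document
    idf_t: inverse document frequency of t (must be positive)
    a: mixing weight (assumed value; not specified on the slide)
    """
    return a * tf_td * (sf_t / num_segments) + (1 - a) * math.log(idf_t)

# A term present in all four segments versus one present in a single segment.
uniform = generalization_score(tf_td=8, sf_t=4, num_segments=4, idf_t=2.0)
localized = generalization_score(tf_td=8, sf_t=1, num_segments=4, idf_t=2.0)
print(uniform, localized)
```

Here the uniformly distributed term receives the higher score, so it would be preferred when selecting terms for a more general query.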
