
Document Clustering for Mediated Information Access: The WebCluster Project



  1. Document Clustering for Mediated Information Access: The WebCluster Project
     Gheorghe Muresan, School of Communication, Information and Library Sciences (SCILS), Rutgers University
     The original WebCluster project was conducted at the Robert Gordon University, Aberdeen, UK. It was supervised by Prof. David J. Harper and sponsored by Ubilab, Zurich. Current work is being conducted in collaboration with Ph.D. student Hyuk-Jin Lee and Prof. Nicholas J. Belkin.
     Exploratory Search Interfaces: Categorization, Clustering and Beyond. Workshop at HCIL 2005, University of Maryland, June 2, 2004

  2. WebCluster - Motivation
     [Diagram labels: Information need (within some subject domain), Query, Search engine, WWW_SearchEngine, Domain]
     - Gulfs:
       - information need ↔ query
       - structured subject domain ↔ unstructured target collection (WWW)

  3. Interaction in the library
     [Diagram labels: Information need, Information Need Formulation]
     1. Select library
     2. Consult catalog
     3. Browse shelves
     4. Use inter-library scheme

  4. Can we simulate the library interaction?
     [Diagram labels: Information need, Information Need Formulation, Structured source collections, Results]
     1. Select source collection
     2. Explore source collection with ClusterBook
     3. Search WWW

  5. The mediated access interaction
     [Diagram labels: Information need, Specialised source, WebCluster, Topical documents, Query, Web search engine, Target collection (WWW)]

  6. Interaction model vs. prototype
     - Structuring the source collection
       - Document clustering
       - Supervised classification
       - Manual (intellectual) classification
     - Exploring the structured source collection
       - Metaphor: library, book, encyclopaedia
       - Visualization tool: folder metaphor, hyperbolic tree, themescape, cone trees, thematic maps
       - Search strategies supported: best match or cluster-based searching, browsing

  7. Model vs. prototype
     - Interaction model
       - Explicit (the user marks relevant documents) vs. implicit (cues on relevance are derived from user behavior/actions)
       - Transparent (the user is aware) vs. opaque (the user is happy to see the effect of ‘magic’)
       - Automatic vs. manual/intellectual generation of the mediated query
     - Query model
       - Language models (generative, Kullback-Leibler)
       - Probabilistic models
       - Rocchio or other relevance feedback (RF) formulae
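
One of the query models listed above, Rocchio relevance feedback, can be sketched in a few lines. The slides do not give the formula or parameter values used in WebCluster, so this is the textbook form under stated assumptions: documents and queries are `{term: weight}` dictionaries, the defaults `alpha=1.0`, `beta=0.75`, `gamma=0.15` are conventional textbook values rather than values from the project, and negative weights are dropped. The function name `rocchio` is illustrative.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Textbook Rocchio update; returns a new {term: weight} query.

    query       -- {term: weight} for the initial query
    relevant    -- list of {term: weight} documents marked relevant
    nonrelevant -- list of {term: weight} documents marked non-relevant
    alpha, beta, gamma -- assumed textbook defaults, not WebCluster settings
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the centroid of relevant docs, subtract the centroid of non-relevant ones.
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Common simplification: keep only terms with positive weight.
    return {t: w for t, w in new_query.items() if w > 0}


if __name__ == "__main__":
    q = {"coffee": 1.0}
    rel = [{"coffee": 1.0, "quota": 2.0}]
    nonrel = [{"debt": 1.0}]
    print(rocchio(q, rel, nonrel))
```

In the explicit scenario, `relevant` would hold the documents the user marked; in a cluster-based variant it could hold the documents of the selected cluster, which is an assumption about usage rather than something the slides specify.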

  8. ClusterBook - Source collection

  9. ClusterBook - Target collection

  10. Informal experiments - Objectives
     - Test the users’ reaction to the mediated access concept
     - Test the user satisfaction regarding the functionality of the system, and the relevance of the documents retrieved
     - Formative usability testing: some volunteers were not only experienced searchers, but also had experience in evaluating IR systems
     - Comparison of user-generated queries vs. system-generated queries
     - Note: these experiments were run at different stages of the development

  11. Informal experiments - Experimental procedure
     - Subjects received an introduction to the system
     - Task assigned: “You are a trainee in a newspaper. You support the journalists by providing information for the topic of their articles.”
     - Sample topics:
       - The history of the Brazilian debt crisis
       - How are the quotas for growing coffee set and controlled on a world-wide basis?
     - Source collection: a sub-collection of Reuters (newspaper articles)
     - Steps followed by users (explicit scenario):
       - Formulate a query and record it
       - Browse the source collection, select the ‘best’ cluster, edit the query generated by the system, submit it to the search engine
       - Submit the initial, self-generated query to the same search engine
       - Compare the results of the two searches

  12. Informal experiments - Results
     - Users found the mediation useful for unfamiliar topics
     - The system nearly always proposed new, good query terms
     - Users were not always good at recognizing ‘good’ query terms
     - The system also proposed bad query terms (not specific to the topic)
       ⇒ the opaque scenario is not viable unless the query formulation is improved
     - The two-step process was questioned when:
       - the query formulation was considered easy, for a familiar topic
       - the documents of the source collection were considered sufficient to cover the information need
     - Clustering methods: complete link and group average were OK; single link was bad
     - Overall, the system is usable

  13. Consequences of informal experiments
     - Formal experiments are needed to verify the main assumptions:
       - The Cluster Hypothesis holds for a specialized collection
       - Good clusters can be found with the search strategies provided
       - Mediated queries can improve retrieval effectiveness
     - The effect on retrieval performance of various parameters should be compared:
       - Weighting schemes
       - Clustering methods
       - Search strategies

  14. Critical issue: The label generation
     - Document representatives
       - searching
     - Cluster representatives
       - browsing
       - searching
       - mediation
     - Collection representatives
       - collection selection
     [Figure: example hierarchy of labels starting from “Wind Energy”, with labels including Power Generation, Propulsion, Fixed Plants, Portable Generators, Coastal Wind Farms and Inland Wind Farms; the lowest-level labels are only partly recoverable (fragments mention Pacific Rim, design of wind farms, desert wind farms, and wind generators for yachts)]
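
The slides do not say how labels or cluster representatives are computed. As one plausible illustration only, terms can be ranked by their contribution to the Kullback-Leibler divergence between a cluster language model and the collection language model, which favours terms that are frequent in the cluster but comparatively rare elsewhere. The function names, the maximum-likelihood estimates without smoothing, and the toy documents below are all assumptions made for this sketch.

```python
import math
from collections import Counter


def kl_term_scores(cluster_docs, collection_docs):
    """Score each cluster term by p(t|cluster) * log(p(t|cluster) / p(t|collection)).

    Both arguments are lists of tokenized documents. The cluster documents are
    assumed to be part of the collection, so every cluster term has a non-zero
    collection probability and no smoothing is needed in this sketch.
    """
    cluster = Counter(t for doc in cluster_docs for t in doc)
    collection = Counter(t for doc in collection_docs for t in doc)
    cluster_total = sum(cluster.values())
    collection_total = sum(collection.values())
    scores = {}
    for term, count in cluster.items():
        p = count / cluster_total
        q = collection[term] / collection_total
        scores[term] = p * math.log(p / q)
    return scores


def top_terms(cluster_docs, collection_docs, k=3):
    """Return the k highest-scoring terms as candidate label words."""
    scores = kl_term_scores(cluster_docs, collection_docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Made-up toy collection, not data from the project.
    collection = [
        ["wind", "farm", "coast"],
        ["wind", "farm", "desert"],
        ["wind", "yacht", "generator"],
        ["coffee", "quota", "price"],
    ]
    print(top_terms(collection[:2], collection, k=2))
```

A term such as "wind" that is spread across the whole toy collection scores lower than "farm", which is concentrated in the cluster; that is the behaviour a label generator would want.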

  15. Mediation experiment - simulations
     - Objectives:
       - Test the potential of mediation to increase retrieval effectiveness
       - Test the effect on performance of a variety of parameters
     [Diagram labels: Source collection, Search engine, Target collection; three query sources: Simple query generator (baseline), Cluster-based mediation (realistic mediation), Topic-based mediator (upper bound)]

  16. Experimental setup
     - Interactive track of TREC-8
       - Offers relevance judgments for complex topics, with a multitude of aspects
       - Offers the experimental design for the user experiment
       - Six topics with 12 to 56 aspects each
       - Target collection: FT 1991-4, with 210,158 articles
       - Source collection built based on relevance judgments: half of the relevant documents, their nearest neighbors, plus the documents judged non-relevant

  17. Results - the cluster hypothesis
     - Aspectual cluster hypothesis confirmed by an extended version of the van Rijsbergen and Sparck Jones separation test
       - Similarity between pairs of docs covering the same aspect is higher than between pairs of docs covering the same topic, which is higher than between pairs of docs in the collection
     - Consequence confirmed: clustering groups documents in pockets of relevance
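
The ordering claimed above (same aspect > same topic > whole collection) can be checked with a simple comparison of mean pairwise similarities. This is a simplified sketch, not the extended separation test itself, whose details the slides do not give. It assumes each document carries a single `topic` and a single `aspect` label (real TREC documents may cover several aspects), uses cosine similarity on `{term: weight}` vectors, and runs on made-up toy data.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pair_similarity(docs, keys=()):
    """Mean cosine over all document pairs that agree on every label in `keys`.

    keys=() uses every pair in the collection.
    """
    sims = [
        cosine(x["vec"], y["vec"])
        for x, y in combinations(docs, 2)
        if all(x[k] == y[k] for k in keys)
    ]
    return sum(sims) / len(sims) if sims else float("nan")


if __name__ == "__main__":
    # Hypothetical toy documents; labels and vectors are invented for illustration.
    docs = [
        {"topic": "wind", "aspect": "coastal", "vec": {"wind": 1, "farm": 1, "coast": 1}},
        {"topic": "wind", "aspect": "coastal", "vec": {"wind": 1, "coast": 1, "sea": 1}},
        {"topic": "wind", "aspect": "yachts", "vec": {"wind": 1, "yacht": 1, "generator": 1}},
        {"topic": "coffee", "aspect": "quotas", "vec": {"coffee": 1, "quota": 1}},
    ]
    print("same aspect:", mean_pair_similarity(docs, ("topic", "aspect")))
    print("same topic: ", mean_pair_similarity(docs, ("topic",)))
    print("collection: ", mean_pair_similarity(docs))
```

A fuller test would compare the distributions of similarities rather than only their means.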

  18. Results - retrieval effectiveness
     - Tf-Idf > KL > RelFreq as weighting schemes for document representation
     - Adding disambiguation terms to the query increases recall, but decreases precision
     - Nearest-neighbor mediation (“more like this”) highly significantly improves both recall and precision, even if just one exemplary document is offered for each topic aspect
     - Cosine and Dice perform similarly
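
To make the comparison of similarity measures concrete, here is a minimal sketch of tf-idf weighting with cosine and Dice similarity. The exact formula variants used in the experiments are not given in the slides; the sketch assumes raw term frequency times `log(N / df)` for tf-idf and the usual weighted-vector form of the Dice coefficient, `2 * (a . b) / (|a|^2 + |b|^2)`. Function names and the toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: tf * log(N / df)} vector per tokenized document."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in tokenized_docs
    ]


def _dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cosine(a, b):
    """Dot product normalized by the product of the vector lengths."""
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / denom if denom else 0.0


def dice(a, b):
    """Weighted Dice coefficient: twice the dot product over the sum of squared lengths."""
    denom = _dot(a, a) + _dot(b, b)
    return 2.0 * _dot(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up toy documents, not data from the experiments.
    docs = [["coffee", "quota", "coffee"], ["coffee", "price"], ["debt", "crisis"]]
    vectors = tfidf_vectors(docs)
    print(cosine(vectors[0], vectors[1]), dice(vectors[0], vectors[1]))
```

The two measures differ only in their normalization, and Dice can never exceed cosine for the same pair of vectors, which is consistent with the two giving similar rankings.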

  19. Mediation results
     - Upper-bound experiment (all relevant docs known in source)
       - Both recall and precision increase with query length
       - Query term weights strongly affect performance
       - No evidence that uniformity of term frequency affects performance
     - Clustered source mediation
       - Best cluster mediation increases precision (P), decreases recall (R)
       - “Fuse and search”: strong increase in R and P
       - “Search and fuse”: good R, terrible P!

  20. User experiment - effectiveness of mediated information retrieval for Web searches
     Design: query formulation (between subjects) by result presentation (within subjects)
     - Linear (list) presentation, unaided query: baseline
     - Linear (list) presentation, mediated query: source-based mediation
     - Structured (cluster) presentation, unaided query: on-the-fly clustering
     - Structured (cluster) presentation, mediated query: source & target-based mediation

  21. User experiment - no mediation

  22. User experiment - mediated access

  23. User experiment - mediated access

  24. Contributions of WebCluster
     - Proposes and explores system-based mediated access to very large heterogeneous document collections
     - Explores the use of clustering for capturing the topical, semantic structure of a problem domain (as represented by a specialized collection)
     - Explores the use of language models for building cluster and document representatives
     - Offers a framework for building structured portals on the WWW
     - Offers a framework for building collaborative environments
