
Web Dynamics: Analysis and Implications from Search Perspective



  1. The 6th Russian Summer School in IR (RuSSIR'2012), August 6-10, 2012

     Web Dynamics: Analysis and Implications from Search Perspective

     Lecturers
     • Dr. Ismail Sengör Altingövde
       – Senior researcher
       – L3S Research Center
       – Hannover, Germany
     • Dr. Nattiya Kanhabua
       – Postdoctoral researcher
       – L3S Research Center
       – Hannover, Germany

     L3S Research Center
     • Computer science and interdisciplinary research on all aspects related to the Web

     People
     • 80 researchers: 53% post-docs & PhD students, 38% scientific research assistants & int'l interns
     • 15 countries from US and Latin America, Asia, Europe and Africa
     • Equal opportunity: 27% females

  2. L3S: Some Projects
     • Cubrik: Human-enhanced time-aware multimedia search
     • ARCOMEM: From collect-all archives to community memories
     • Terence: Adaptive learning for poor comprehenders
     • M-Eco: The medical ecosystem, detecting disease outbreaks
     • All project inventory: www.l3s.de

     Wanted!

     Outline
     • Day 1: Introduction
       – Evolution of the Web
       – Overview of research topics
       – Content and query analysis
     • Day 2: Evolution of Web Search Results
       – Short-term impacts
       – Longitudinal analysis
     • Day 3: Indexing the Past
       – Indexing and searching versioned documents
     • Day 4: Retrieval and Ranking
       – Searching the past
       – Searching the future

  3. Evolution of the Web [Dumais, SIAM-SDM 2012; WebDyn 2010]
     • The Web is changing over time in many aspects:
       – Size: web pages are added/deleted all the time
       – Content: web pages are edited/modified
       – Query: users' information needs change, and entity relationships change over time
       – Usage: users' behaviors change over time

     Size dynamics [Ke et al., CN 2006; Risvik et al., CN 2002]
     • Challenge
       – Crawling, indexing, and caching

     Web and index sizes (timeline, 1995 to 2012)
     • 2000: First billion-URL index. The world's largest! ≈5000 PCs in clusters!
     • 2004: Index grows to 4.2 billion pages

  4. Web and index sizes (timeline, 1995 to 2012, cont.)
     • 2000: First billion-URL index. The world's largest! ≈5000 PCs in clusters!
     • 2004: Index grows to 4.2 billion pages
     • 2008: Google counts 1 trillion unique URLs
     • See also: http://www.worldwidewebsize.com/

     Content dynamics
     • Challenge
       – Document representation and retrieval
     • Example: WayBack Machine, a web archive search tool by the Internet Archive

  5. Content dynamics
     • Challenge
       – Document representation and retrieval

     Query dynamics
     • Challenge
       – Time-sensitive queries
       – Query understanding and processing
     • Example query: Halloween (Google Insights for Search: http://www.google.com/insights/search/)

     Usage dynamics
     • Challenge
       – Browsing and search behavior
       – User preference (e.g., "likes")

     Example application
     • Searching documents created/edited over time
       – E.g., web archives, news archives, blogs, or emails

  6. Example application
     • Searching documents created/edited over time
       – E.g., web archives, news archives, blogs, or emails
       – Such collections are called "temporal document collections"
     • Example query: retrieve documents about Pope Benedict XVI written before 2005
     • Term-based IR approaches may give unsatisfactory results

     Wayback Machine (retrieved on 15 January 2011)
     • A web archive search tool by the Internet Archive
       – Query by a URL, e.g., http://www.ntnu.no
     • Limitation: no keyword query
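The Pope Benedict XVI example on this page can be made concrete with a small sketch. It assumes a toy in-memory collection (the two documents and the `search` helper are hypothetical, written only for illustration) and relies on one outside fact: before his election in 2005, Pope Benedict XVI was known as Joseph Ratzinger. A purely term-based match combined with a "before 2005" filter then misses the older document, which is one way such approaches can give unsatisfactory results.

```python
from datetime import date

# Hypothetical toy collection: (publication date, text). Not real articles.
DOCS = [
    (date(2003, 5, 1), "Cardinal Joseph Ratzinger speaks in Rome"),
    (date(2006, 9, 1), "Pope Benedict XVI speaks in Rome"),
]

def search(query, before):
    """Return texts published before `before` that contain every query term."""
    terms = query.lower().split()
    return [
        text
        for published, text in DOCS
        if published < before and all(term in text.lower() for term in terms)
    ]

# The current name finds nothing before 2005 ...
print(search("Pope Benedict XVI", date(2005, 1, 1)))  # → []
# ... while the name in use at that time does.
print(search("Joseph Ratzinger", date(2005, 1, 1)))
```

The second call returns the 2003 document, which suggests that the query would need to be rewritten with the terminology of the target time period.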

  7. Wayback Machine (retrieved on 15 January 2011)
     • A web archive search tool by the Internet Archive
       – Query by a URL, e.g., http://www.ntnu.no
     • Limitations: no keyword query, no relevance ranking

     Google News Archive Search
     • A news archive search tool by Google
       – Query by keywords
       – Rank results by relevance or date
     • Limitation: does not consider terminology changes over time

     Research topics
     • Content Analysis
       – Determining timestamps of documents
       – Temporal information extraction
     • Query Analysis
       – Determining time of queries
       – Named entity evolution
       – Query performance prediction
     • Evolution of Search Results
       – Short-term impacts on result caches
       – Longitudinal analysis of search results

  8. Research topics (cont'd)
     • Indexing
       – Indexing and query processing techniques for versioned document collections
     • Retrieval and Ranking
       – Searching the past
       – Searching the future

     Content Analysis
     (1) Determining timestamps of documents
     (2) Temporal information extraction

     Motivation
     • Incorporating the time dimension into search can increase retrieval effectiveness
       – Only if temporal information is available
     • Research question
       – How to determine the temporal information of documents?

     Two time aspects [Kanhabua, SIGIR Forum 2012]
     1. Publication or modified time
        • Problem: determining timestamps of documents
        • Method: rule-based technique, or temporal language models
     2. Content or event time
        • Problem: temporal information extraction
        • Method: natural language processing, or time and event recognition algorithms
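As a rough illustration of what a rule-based technique for finding dates in text might look like, the sketch below extracts explicit date expressions with regular expressions. The three patterns (ISO dates, "day Month year", and bare four-digit years) and the `extract_dates` helper are assumptions made for this example, not the rules used by the lecturers; real time and event recognition is considerably richer.

```python
import re

MONTHS = ("January February March April May June July August "
          "September October November December").split()

# Hypothetical rules: ISO dates, "D Month YYYY", and bare four-digit years.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
DAY_MONTH_YEAR = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def extract_dates(text):
    """Return (year, month, day) tuples; month and day are None for bare years."""
    found = []
    consumed = []  # spans already explained by a more specific rule
    for m in ISO_DATE.finditer(text):
        found.append((int(m.group(1)), int(m.group(2)), int(m.group(3))))
        consumed.append(m.span())
    for m in DAY_MONTH_YEAR.finditer(text):
        month = MONTHS.index(m.group(2)) + 1
        found.append((int(m.group(3)), month, int(m.group(1))))
        consumed.append(m.span())
    for m in YEAR.finditer(text):
        # Skip years that are part of a full date matched above.
        if not any(start <= m.start() < end for start, end in consumed):
            found.append((int(m.group(1)), None, None))
    return found

print(extract_dates("Retrieved on 15 January 2011; the index grew in 2004."))
# → [(2011, 1, 15), (2004, None, None)]
```

Dates found this way describe what a document talks about; whether they also indicate when it was published is a separate question that the rules alone do not answer.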

  9. Two time aspects
     1. Publication or modified time
        • Problem: determining timestamps of documents
        • Method: rule-based technique, or temporal language models
     2. Content or event time
        • Problem: temporal information extraction
        • Method: natural language processing, or time and event recognition algorithms
     (Figure labels: content time, publication time)

     Determining time of documents
     Problem statements
     • Difficult to find a trustworthy time for web documents
       – Time gap between crawling and indexing
       – Decentralization and relocation of web documents
       – No standard metadata for time/date
     • "For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?"
     • Illustration:
       – "I found a bible-like document. But I have no idea when it was created?"
       – "Let me see… This document is probably written in 850 A.C. with 95% confidence."

  10. Current approaches
      1. Content-based
      2. Link-based
      3. Hybrid

      Content-based approach: Temporal Language Models [de Jong et al., AHC 2005]
      • Based on the statistical usage of words over time
      • Compare each word of a non-timestamped document with a reference corpus
      • Tentative timestamp: the time partition whose word usage overlaps most with the document

      Temporal Language Models (reference corpus)
      Partition   Word         Freq
      1999        tsunami      1
      1999        Japan        1
      1999        tidal wave   1
      2004        tsunami      1
      2004        Thailand     1
      2004        earthquake   1

      A non-timestamped document: tsunami, Thailand
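The comparison described above can be sketched in a few lines using the reference corpus from the slide. This is a simplified stand-in rather than the exact scoring of de Jong et al.: it assumes a unigram model per time partition with add-one smoothing and picks the partition with the highest log-likelihood. The function names and the smoothing parameter `alpha` are illustrative assumptions.

```python
import math
from collections import Counter

# Reference corpus from the slide: (time partition, word, frequency).
REFERENCE = [
    ("1999", "tsunami", 1),
    ("1999", "Japan", 1),
    ("1999", "tidal wave", 1),
    ("2004", "tsunami", 1),
    ("2004", "Thailand", 1),
    ("2004", "earthquake", 1),
]

def build_models(rows):
    """Group word frequencies by time partition."""
    models = {}
    for partition, word, freq in rows:
        models.setdefault(partition, Counter())[word] += freq
    return models

def log_likelihood(doc_words, model, vocab_size, alpha=1.0):
    """Smoothed unigram log-likelihood of the document under one partition."""
    total = sum(model.values())
    return sum(
        math.log((model[w] + alpha) / (total + alpha * vocab_size))
        for w in doc_words
    )

def estimate_partition(doc_words, models):
    """Return the best-scoring partition and all partition scores."""
    vocab = {w for m in models.values() for w in m} | set(doc_words)
    scores = {
        p: log_likelihood(doc_words, m, len(vocab)) for p, m in models.items()
    }
    return max(scores, key=scores.get), scores

models = build_models(REFERENCE)
best, scores = estimate_partition(["tsunami", "Thailand"], models)
print(best)  # → 2004
```

Both partitions contain "tsunami", so the decision rests on "Thailand", which appears only in the 2004 partition. The smoothing keeps the 1999 score finite instead of ruling that partition out on a single unseen word.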
