The Evolution of Web Content and Search Engines

Ricardo Baeza-Yates, Yahoo! Research, DCC/UChile, UPF/Spain
Álvaro Pereira Jr, DCC/UFMG/Brazil
Nivio Ziviani, DCC/UFMG/Brazil

Objectives

- To state the following hypothesis:
  - When pages have sources (content originated from other pages), in a portion of pages there was a query that related the sources and made possible the creation of the new page
  - Part of the web content is biased by the ranking function of search engines
- To study how new content is generated in the web
  - How old content is used to compose new pages
  - Definition of genealogical trees for the web

WEBKDD’06, August 20, 2006, Philadelphia, USA

Web Collections and Query Logs

Collection   Crawling date   Number of documents   Text size (Gbytes)
2002         Jul 2002        892,000               2.3
2003         Aug 2003        2.86 million          9.4
2004         Jan 2004        2.80 million          11.8
2005         Feb 2005        2.88 million          11.3

[Figure: timeline of query logs (Log 2002, Log 2003, Log 2004) and crawl dates (Jul/02, Aug/03, Jan/04, Feb/05)]

Algorithm

- Objective: to find, in the new collection, documents that were created using content from old documents returned by the same query
  - For that we simulate a user performing a query in the search engine (TodoCL) in the past
  - We used a set of the most frequent queries of each query log
  - We had access to the query processor of the search engine
- Algorithm divided into two steps
  - First step: finding new documents that have content from the old documents
  - Second step: filtering the documents
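As a rough illustration of how "the most frequent queries of each query log" might be selected, here is a minimal sketch. The log format (one raw query per line), the normalization, the function name and the cutoff `k` are all assumptions for this example; the slides do not describe how the TodoCL logs were actually processed.

```python
from collections import Counter

def most_frequent_queries(log_lines, k=100):
    """Return the k most frequent queries in a log (hypothetical format: one query per line)."""
    # Assumed normalization: lowercase and collapse whitespace so trivial variants count together.
    counts = Counter(
        " ".join(line.lower().split()) for line in log_lines if line.strip()
    )
    return [query for query, _ in counts.most_common(k)]
```

Each selected query would then be submitted to the query processor over the old collection to obtain the old documents used in the two steps.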

Algorithm – Step 1: Finding Candidates
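The diagram for this step did not survive, so the following is only a sketch of one way the candidate search could work, assuming that content reuse is detected through identical paragraphs (the unit used on the result plots). The blank-line paragraph splitting, the normalization, the SHA-1 fingerprints and all function names are assumptions, not the authors' implementation.

```python
import hashlib
from collections import defaultdict

def paragraphs(text):
    """Split text into normalized, non-empty paragraphs (assumed to be blank-line separated)."""
    parts = (" ".join(p.split()).lower() for p in text.split("\n\n"))
    return [p for p in parts if p]

def fingerprint(paragraph):
    """Stable identifier for a paragraph, so identical paragraphs compare equal."""
    return hashlib.sha1(paragraph.encode("utf-8")).hexdigest()

def find_candidates(old_results, new_collection):
    """Find new documents sharing identical paragraphs with old query results.

    old_results: {old_url: text} for the old documents returned by ONE query.
    new_collection: {new_url: text} for the new collection.
    Returns {new_url: {old_url: set of shared paragraph fingerprints}}.
    """
    # Index every paragraph of the old result set by its fingerprint.
    index = defaultdict(set)
    for old_url, text in old_results.items():
        for p in paragraphs(text):
            index[fingerprint(p)].add(old_url)

    candidates = {}
    for new_url, text in new_collection.items():
        shared = defaultdict(set)
        for p in paragraphs(text):
            fp = fingerprint(p)
            for old_url in index.get(fp, ()):
                shared[old_url].add(fp)
        if shared:
            candidates[new_url] = dict(shared)
    return candidates
```

Running this once per query keeps the "returned by the same query" constraint: only old documents from a single result set are ever matched against a new document.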

Algorithm – Step 2: Filtering

- Number of paragraphs in both old and new documents
- New document composed by two old documents returned by the same query
- At least two distinct paragraphs from each old document
- The new document URL cannot exist in the old collection
- Duplicates are not allowed for both old and new documents
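The filtering conditions above could be sketched as follows. This is one interpretation under stated assumptions: candidates arrive as a mapping from each new URL to the old URLs it shares paragraphs with, "distinct" is read as paragraphs not already credited to another old document, "two old documents" is read as at least two, and duplicates are detected with a per-document content signature. The function name, the `signature` mapping and the `min_identical` default are hypothetical.

```python
def filter_candidates(candidates, old_urls, signature, min_identical=5):
    """Keep only new documents that satisfy the Step 2 conditions (as interpreted here).

    candidates: {new_url: {old_url: set of shared paragraph ids}}
    old_urls:   set of every URL present in the old collection
    signature:  {url: content hash} for old and new documents (assumed duplicate test)
    min_identical: assumed threshold on the number of identical paragraphs
    """
    kept = {}
    seen_new = set()
    for new_url in sorted(candidates):
        if new_url in old_urls:
            continue  # the new document URL cannot exist in the old collection
        if signature[new_url] in seen_new:
            continue  # duplicate of a new document that was already accepted

        parents, claimed, seen_old = {}, set(), set()
        for old_url in sorted(candidates[new_url]):
            if signature[old_url] in seen_old:
                continue  # duplicate old document
            seen_old.add(signature[old_url])
            distinct = candidates[new_url][old_url] - claimed
            if len(distinct) >= 2:  # at least two distinct paragraphs per old document
                parents[old_url] = distinct
                claimed |= distinct

        # Composed from at least two old documents, with enough identical paragraphs overall.
        if len(parents) >= 2 and len(claimed) >= min_identical:
            kept[new_url] = parents
            seen_new.add(signature[new_url])
    return kept
```

Because old documents are visited in sorted order, a paragraph shared by several old documents is credited to the first one; a real implementation might resolve such ties differently.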

Experiments Summary

[Figure: experiments summary; query logs 2002, 2003 and 2004 over the timeline Jul/02, Aug/03, Jan/04, Feb/05, with the real query log marked]

An Experimental Result

- Different query logs on old collection 2003 and new collection 2004

[Figure: new documents found versus minimal number of identical paragraphs, with one curve each for log 2003 on col. 2003, log 2004 on col. 2003, and log 2002 on col. 2003]

Chilean Web Genealogical Tree

- Main components of the tree considering collection 2002 as the old collection
- Sample of 120,000 documents

Collection pairs           2002-2003   2002-2004   2002-2005
Number of parents          5,900       4,900       4,300
Number of children         13,500      8,900       9,700
Number of survived pages   13,900      10,700      6,800

Conclusions

- We have presented evidence that a portion of the web is biased by the ranking function of search engines
- A significant portion of the Web has evolved from old content
- The number of copies from previously copied web pages (or content) is indeed greater than the number of copies from other pages
- Do search engines contribute to this situation?

Thank You!

Bimonthly Logs

[Figure: timeline dividing the 2002 and 2004 query logs into five bimonthly logs each (1 to 5), against the crawl dates Jul/02, Aug/03, Jan/04, Feb/05]

Bimonthly Logs on the Same Collection

[Figure: two panels, "Bimonthly logs 2002" and "Bimonthly logs 2004"; each plots new documents found versus minimal number of identical paragraphs, with one curve per bimonthly log (1 to 5)]

Bimonthly Logs in Different Collections

[Figure: two panels, "Bimonthly logs 4 and 5 used for collection 2002" and "Bimonthly logs 4 and 5 used for collection 2004"; each plots new documents found versus minimal number of identical paragraphs. Curves: log 2002 on col. 2002 and log 2004 on col. 2002 in the first panel; log 2004 on col. 2004 and log 2002 on col. 2004 in the second]

Chilean Web Genealogical Tree (2/2)

- Main component of the tree considering collection 2003 as the old collection
- Sample of 120,000 documents

Collection pairs           2003-2004   2003-2005
Number of parents          5,300       5,000
Number of children         33,200      29,100
Number of survived pages   19,300      10,500
