Web Mining - PowerPoint PPT Presentation

Web Mining
Based on several presentations found on the web: Shapiro, Ullman, Terziyan, Pedersen ...

What is Web Mining?
• Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996, CACM 39(11))
• Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content and usage data. (Bing Liu 2007, Web Data Mining, Springer)

What is Web Mining?
• Motivation / Opportunity
  • The WWW is a huge, widely distributed, global information service centre and, therefore, constitutes a rich source for data mining
  • Intelligent Web Search
  • Personalization, Recommendation Engines
  • Web-commerce applications
  • Building the Semantic Web
  • Web page classification and categorization
  • News classification and clustering
  • Information / trend monitoring
  • Analysis of online communities
  • Web and mail spam filtering

Abundance and authority crisis
• Liberal and informal culture of content generation and dissemination
• Redundancy and non-standard form and content
• Millions of qualifying pages for most broad queries
  • Example: java or kayaking
• No authoritative information about the reliability of a site
• Little support for adapting to the background of specific users
• Pages added continuously and average page changes in a few weeks

Different from “classical” Data Mining?
• The web is not a relation
  • Textual information + linkage structure
• Usage data is huge and growing rapidly
  • Google’s usage logs are bigger than their web crawl
  • Data generated per day is comparable to largest conventional data warehouses

Size of the Web
• Number of pages
  • 11.5 billion indexable pages (http://www.cs.uiowa.edu/~asignori/web-size/ www2005)
  • Technically, infinite
    • Because of dynamically generated content
  • Lots of duplication (30-40%)
  • Best estimate of “unique” static HTML pages comes from search engine claims
    • Yahoo = claimed 19.2 billion in Aug 2005
• Number of unique web sites
  • Netcraft survey says 98 million sites

October 2006 Web Server Survey
http://news.netcraft.com/archives/web_server_survey.html

from http://www.worldwidewebsize.com/

Another way to estimate the web size
• The number of web servers was estimated by sampling and testing random IP address numbers and determining the fraction of such tests that successfully located a web server
• The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment
Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.

Web Information Retrieval
• According to most predictions, the majority of human information will be available on the Web in ten??? years
• Effective information retrieval can aid in
  • Research: Find all papers about web mining
  • Health/Medicine: What could be the reason for symptoms of “yellow eyes”, high fever and frequent vomiting
  • Travel: Find information on the tropical island of St. Lucia
  • Business: Find companies that manufacture digital signal processors
  • Entertainment: Find all movies starring Marilyn Monroe during the years 1960 and 1970
  • Arts: Find all short stories written by Jhumpa Lahiri

Why is Web Information Retrieval Difficult?
• The Abundance Problem (99% of information of no interest to 99% of people)
  • Hundreds of irrelevant documents returned in response to a search query
• Limited Coverage of the Web (Internet sources hidden behind search interfaces)
  • Largest crawlers cover less than 18% of Web pages
• The Web is extremely dynamic
  • Lots of pages added, removed and changed every day
• Very high dimensionality (thousands of dimensions)
• Limited query interface based on keyword-oriented search
• Limited customization to individual users
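The sampling estimate described under "Another way to estimate the web size" comes down to two multiplications, sketched below. This is only an illustration, not the procedure from the cited paper: the function names are hypothetical, the probe counts and page counts are made up, and treating the full IPv4 address space as the sampling frame is an assumption.

```python
IPV4_ADDRESS_SPACE = 2 ** 32  # assumption: every IPv4 address is equally likely to be probed


def estimate_servers(hits, probes, address_space=IPV4_ADDRESS_SPACE):
    """Scale the fraction of probes that found a web server up to the whole address space."""
    return hits / probes * address_space


def estimate_pages(server_estimate, pages_per_sampled_server):
    """Multiply the server estimate by the mean page count of the crawled sample."""
    mean_pages = sum(pages_per_sampled_server) / len(pages_per_sampled_server)
    return server_estimate * mean_pages


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
servers = estimate_servers(hits=1, probes=1024)
pages = estimate_pages(servers, [100, 300])
print(f"estimated servers: {servers:,.0f}, estimated pages: {pages:,.0f}")
```

The quality of the result depends on both samples: a server that hosts many sites behind one IP address, or a crawl that misses dynamically generated pages, would bias the estimate.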

Search Landscape (Sept 2009)
http://marketshare.hitslink.com/search-engine-market-share.aspx?qprid=4

Search Engine Web Coverage Overlap (2005)
• 4 searches were defined that returned 141 web pages.
http://www.searchengineshowdown.com/stats/overlap.shtml

Web search basics
[Figure: search engine architecture with the labels User, Search, Web crawler, Indexer, The Web, Indexes, Ad indexes, next to a screenshot of web results and sponsored links for the query “miele”]

Web Crawling Basics
• Start with a “seed set” of to-visit urls
[Figure: crawl loop with the labels to visit urls, get next url, get page, Web, extract urls, visited urls, web pages]
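The crawl loop from the Web Crawling Basics slide (seed set, get next url, get page, extract urls, visited urls) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production crawler: the `crawl` function, the `LinkExtractor` class and the three-page `TOY_WEB` dictionary are hypothetical, and the pages are served from memory rather than fetched over HTTP.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    to_visit = deque(seed_urls)  # frontier, initialised with the seed set
    visited = set()              # URLs that have already been requested
    pages = {}                   # url -> HTML, the input for an indexer
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()      # get next url
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)             # get page
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)          # extract urls
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in visited:
                to_visit.append(link)
    return pages


# Hypothetical in-memory web, used instead of real HTTP requests.
TOY_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": '<a href="/a">A</a> <a href="/missing">x</a>',
}

crawled = crawl(["http://example.org/"], TOY_WEB.get)
print(sorted(crawled))
```

Passing `fetch` as a parameter keeps the loop independent of how pages are retrieved, so the same sketch could be pointed at a real HTTP client.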

Crawling Issues
• Load on web servers
  • E.g., no more than 1 request to the same server every 10 seconds
• Insufficient resources to crawl entire web
  • Visit “important” pages first (pagerank, inlinks …)
• How to keep crawled pages “fresh”?
  • How often do web pages change? What do we mean by freshness?
• Detecting replicated content e.g., mirrors
  • Use document comparison techniques (java manuals)
• Can’t crawl the web from one machine
  • Parallelizing the crawl

Web Advertising
• Banner ads (1995-2001)
  • Initial form of web advertising
  • Popular websites charged X$ for every 1000 “impressions” of ad
  • Modeled similar to TV, magazine ads
  • Low clickthrough rates
    • low ROI for advertisers
• Introduced by Overture around 2000
  • Advertisers “bid” on search keywords
  • When someone searches for that keyword, the highest bidder’s ad is shown
  • Advertiser is charged only if the ad is clicked on

Web Mining Taxonomy
• Web Mining
  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining

Web Advertising
• Search advertising is the revenue model
  • Multi-billion-dollar industry
  • Advertisers pay for clicks on their ads
• Interesting problems
  • What ads to show for a search?
    • Maximise revenue, each advertiser has a limited budget
  • If I’m an advertiser, which search terms should I bid on and how much to bid?
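The keyword-bidding scheme on the Web Advertising slides (highest bidder's ad is shown, the advertiser pays only on a click, each advertiser has a limited budget) can be illustrated with a small sketch. The `Advertiser` class, the function names and all amounts are hypothetical, and charging the advertiser its own bid per click is an assumption; the slides do not say how the price of a click is set.

```python
from dataclasses import dataclass


@dataclass
class Advertiser:
    name: str
    budget_cents: int  # remaining budget
    bids: dict         # keyword -> bid per click, in cents


def pick_ad(advertisers, keyword):
    """Return the highest bidder on keyword that can still afford one click, or None."""
    eligible = [
        a for a in advertisers
        if keyword in a.bids and a.budget_cents >= a.bids[keyword]
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda a: a.bids[keyword])


def record_click(advertiser, keyword):
    """Charge the advertiser for a click; showing the ad without a click costs nothing."""
    advertiser.budget_cents -= advertiser.bids[keyword]


# Hypothetical advertisers and bids.
ads = [
    Advertiser("A", budget_cents=100, bids={"vacuum": 60}),
    Advertiser("B", budget_cents=500, bids={"vacuum": 40}),
]

first = pick_ad(ads, "vacuum")    # A has the higher bid
record_click(first, "vacuum")     # A's budget drops to 40 cents
second = pick_ad(ads, "vacuum")   # A can no longer cover its bid, so B is chosen
print(first.name, second.name)
```

Always showing the highest eligible bid is a greedy choice and does not necessarily maximise revenue over a whole day of searches, which is one reason the slide lists ad selection under a budget as an interesting problem.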
