Web Mining

Based on several presentations found on the web: Shapiro, Ullman, Terziyan, Pedersen ...

What is Web Mining?
- Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996, CACM 39(11))
- Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content and usage data (Bing Liu 2007, Web Data Mining, Springer)

What is Web Mining? Motivation / Opportunity
- The WWW is a huge, widely distributed, global information service centre and therefore constitutes a rich source for data mining
  - Intelligent Web Search
  - Personalization, Recommendation Engines
  - Web-commerce applications
  - Building the Semantic Web
  - Web page classification and categorization
  - News classification and clustering
  - Information / trend monitoring
  - Analysis of online communities
  - Web and mail spam filtering

Abundance and authority crisis
- Liberal and informal culture of content generation and dissemination
- Redundancy and non-standard form and content
- Millions of qualifying pages for most broad queries
  - Example: java or kayaking
- No authoritative information about the reliability of a site
- Little support for adapting to the background of specific users
- Pages are added continuously, and the average page changes within a few weeks

Different from “classical” Data Mining?
- The web is not a relation
  - Textual information + linkage structure
- Usage data is huge and growing rapidly
  - Google’s usage logs are bigger than their web crawl
  - Data generated per day is comparable to largest conventional data warehouses

Size of the Web
- Number of pages
  - 11.5 billion indexable pages (http://www.cs.uiowa.edu/~asignori/web-size/, www2005)
  - Technically, infinite
    - Because of dynamically generated content
    - Lots of duplication (30-40%)
  - Best estimate of “unique” static HTML pages comes from search engine claims
    - Yahoo = claimed 19.2 billion in Aug 2005
- Number of unique web sites
  - Netcraft survey says 98 million sites
    http://news.netcraft.com/archives/web_server_survey.html

[Figure: October 2006 Web Server Survey]

[Figure: web size estimate, from http://www.worldwidewebsize.com/]

Another way to estimate the web size
- The number of web servers was estimated by sampling and testing random IP address numbers and determining the fraction of such tests that successfully located a web server
- The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment
- Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.

Web Information Retrieval
- According to most predictions, the majority of human information will be available on the Web in ten??? years
- Effective information retrieval can aid in
  - Research: Find all papers about web mining
  - Health/Medicine: What could be the reason for symptoms of “yellow eyes”, high fever and frequent vomiting
  - Travel: Find information on the tropical island of St. Lucia
  - Business: Find companies that manufacture digital signal processors
  - Entertainment: Find all movies starring Marilyn Monroe during the years 1960 and 1970
  - Arts: Find all short stories written by Jhumpa Lahiri

Why is Web Information Retrieval Difficult?
- The Abundance Problem (99% of information of no interest to 99% of people)
  - Hundreds of irrelevant documents returned in response to a search query
- Limited Coverage of the Web (Internet sources hidden behind search interfaces)
  - Largest crawlers cover less than 18% of Web pages
- The Web is extremely dynamic
  - Lots of pages added, removed and changed every day
- Very high dimensionality (thousands of dimensions)
- Limited query interface based on keyword-oriented search
- Limited customization to individual users
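
The two-step sampling estimate of web size described earlier (Lawrence and Giles, 1999) comes down to simple arithmetic. The sketch below is only an illustration of that arithmetic: the function name, the probe results, the address-space size and the page counts are all made-up assumptions, not figures from the paper.

```python
def estimate_web_size(ip_probe_hits, ip_space_size, pages_per_sampled_server):
    """Estimate (number of web servers, number of pages) from two samples.

    ip_probe_hits: list of booleans, one per randomly probed IP address,
        True if a web server answered (hypothetical probe results).
    ip_space_size: number of addresses the probes were drawn from.
    pages_per_sampled_server: page counts obtained by crawling a sample
        of the servers found in the first step.
    """
    hits = sum(ip_probe_hits)
    # Step 1: scale the observed hit fraction up to the whole address space.
    estimated_servers = hits * ip_space_size / len(ip_probe_hits)
    # Step 2: multiply by the average number of pages per crawled server.
    mean_pages = sum(pages_per_sampled_server) / len(pages_per_sampled_server)
    return estimated_servers, estimated_servers * mean_pages


# Toy numbers, chosen only to make the arithmetic easy to follow.
probes = [True] * 2 + [False] * 998   # 2 of 1000 probed addresses had a server
servers, pages = estimate_web_size(probes, 1_000_000, [100, 300])
print(servers, pages)
```

The quality of such an estimate depends on both samples being representative; a few very large servers in the crawl sample can dominate the average page count.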

Search Landscape
- [Figures; labels in the original: 2005, Sept 2009]
- http://marketshare.hitslink.com/search-engine-market-share.aspx?qprid=4

Search Engine Web Coverage Overlap
- 4 searches were defined that returned 141 web pages.
- http://www.searchengineshowdown.com/stats/overlap.shtml

Web search basics
- [Figure: search results page for the query “miele”, with Sponsored Links shown next to the Web Results (results 1 - 10 of about 7,310,000), and the search architecture: User, Search, Web crawler, Indexer, The Web, Indexes, Ad indexes]

Web Crawling Basics
- Start with a “seed set” of to-visit urls
- [Figure: crawl loop: to visit urls, get next url, get page (from the Web), extract urls, visited urls, web pages]
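
The crawl loop in the diagram above (seed set, to-visit urls, get next url, get page, extract urls, visited urls) can be sketched as follows. This is a minimal illustration, not a production crawler: `crawl`, `LinkExtractor`, `fetch_page` and `TOY_WEB` are hypothetical names, and the "web" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags ("extract urls")."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch_page, max_pages=100):
    """Breadth-first crawl; fetch_page(url) returns HTML text or None."""
    to_visit = deque(seed_urls)      # "to visit urls", starts as the seed set
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages" handed on to the indexer
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # "get next url"
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)       # "get page"
        if html is None:
            continue                 # page could not be fetched
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)    # resolve relative links
            if link not in visited:
                to_visit.append(link)
    return pages


# Toy in-memory web (assumption: stands in for real HTTP fetches).
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}
crawled = crawl(["http://example.test/"], TOY_WEB.get)
print(list(crawled))
```

A real crawler would replace `TOY_WEB.get` with an HTTP fetch and would also need rate limiting, duplicate detection and parallelism.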

Crawling Issues
- Load on web servers
  - E.g., no more than 1 request to the same server every 10 seconds
- Insufficient resources to crawl entire web
  - Visit “important” pages first (pagerank, inlinks …)
- How to keep crawled pages “fresh”?
  - How often do web pages change? What do we mean by freshness?
- Detecting replicated content, e.g., mirrors
  - Use document comparison techniques (java manuals)
- Can’t crawl the web from one machine
  - Parallelizing the crawl

Web Advertising
- Banner ads (1995-2001)
  - Initial form of web advertising
  - Popular websites charged X$ for every 1000 “impressions” of ad
  - Modeled similar to TV, magazine ads
  - Low clickthrough rates
    - low ROI for advertisers
- Search keyword bidding, introduced by Overture around 2000
  - Advertisers “bid” on search keywords
  - When someone searches for that keyword, the highest bidder’s ad is shown
  - Advertiser is charged only if the ad is clicked on

Web Mining Taxonomy
- Web Mining
  - Web Content Mining
  - Web Structure Mining
  - Web Usage Mining

Web Advertising
- Search advertising is the revenue model
  - Multi-billion-dollar industry
  - Advertisers pay for clicks on their ads
- Interesting problems
  - What ads to show for a search?
    - Maximise revenue, each advertiser has a limited budget
  - If I’m an advertiser, which search terms should I bid on and how much to bid?
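
To make the "what ads to show for a search?" problem concrete, here is a minimal sketch of the keyword-bidding rule from the slides: the highest bidder's ad is shown, and the advertiser pays only on a click. Two things are assumptions added for illustration: the advertiser is charged exactly its own bid, and an advertiser whose remaining budget cannot cover its bid is skipped. This greedy rule is not claimed to maximise revenue. All names and amounts (`Advertiser`, `select_ad`, `record_click`, the budgets in cents) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Advertiser:
    name: str
    budget_cents: int                          # remaining budget
    bids: dict = field(default_factory=dict)   # keyword -> bid in cents


def select_ad(keyword, advertisers):
    """Return the highest bidder on `keyword` who can still afford its bid."""
    eligible = [
        a for a in advertisers
        if keyword in a.bids and a.budget_cents >= a.bids[keyword]
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda a: a.bids[keyword])


def record_click(advertiser, keyword):
    """Charge the advertiser only when its ad is clicked (assumed: its own bid)."""
    advertiser.budget_cents -= advertiser.bids[keyword]


advertisers = [
    Advertiser("A", budget_cents=100, bids={"miele": 60}),
    Advertiser("B", budget_cents=500, bids={"miele": 50}),
]

first = select_ad("miele", advertisers).name    # A bids more than B
record_click(advertisers[0], "miele")           # A now has 40 cents left
second = select_ad("miele", advertisers).name   # A can no longer afford 60
print(first, second)
```

Once A's budget runs low, the lower bidder B wins the slot. That is why the order in which queries arrive matters when each advertiser has a limited budget.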