Web Mining - PowerPoint PPT Presentation

What is Web Mining? Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.

  1. Web Mining
     Based on several presentations found on the web: Shapiro, Ullman, Terziyan, Pedersen ...

     What is Web Mining?
     • Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996, CACM 39(11))
     • Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content and usage data. (Bing Liu 2007, Web Data Mining, Springer)

     What is Web Mining?
     • Motivation / Opportunity
       • The WWW is a huge, widely distributed, global information service centre and therefore constitutes a rich source for data mining
     • Intelligent Web Search
     • Personalization, Recommendation Engines
     • Web-commerce applications
     • Building the Semantic Web
     • Web page classification and categorization
     • News classification and clustering
     • Information / trend monitoring
     • Analysis of online communities
     • Web and mail spam filtering

     Abundance and authority crisis
     • Liberal and informal culture of content generation and dissemination
     • Redundancy and non-standard form and content
     • Millions of qualifying pages for most broad queries
       • Example: java or kayaking
     • No authoritative information about the reliability of a site
     • Little support for adapting to the background of specific users
     • Pages are added continuously and the average page changes in a few weeks

  2. Different from "classical" Data Mining?
     • The web is not a relation
       • Textual information + linkage structure
     • Usage data is huge and growing rapidly
       • Google's usage logs are bigger than their web crawl
       • Data generated per day is comparable to largest conventional data warehouses

     Size of the Web
     • Number of pages
       • 11.5 billion indexable pages (http://www.cs.uiowa.edu/~asignori/web-size/ www2005)
       • Technically, infinite
         • Because of dynamically generated content
         • Lots of duplication (30-40%)
       • Best estimate of "unique" static HTML pages comes from search engine claims
         • Yahoo = claimed 19.2 billion in Aug 2005
     • Number of unique web sites
       • Netcraft survey says 98 million sites http://news.netcraft.com/archives/web_server_survey.html

     October 2006 Web Server Survey
     [Figure: October 2006 Web Server Survey]

  3. [Figure from http://www.worldwidewebsize.com/]

     Another way to estimate the web size
     • The number of web servers was estimated by sampling and testing random IP address numbers and determining the fraction of such tests that successfully located a web server
     • The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment
     Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.

     Web Information Retrieval
     • According to most predictions, the majority of human information will be available on the Web in ten??? years
     • Effective information retrieval can aid in
       • Research: Find all papers about web mining
       • Health/Medicine: What could be the reason for symptoms of "yellow eyes", high fever and frequent vomiting
       • Travel: Find information on the tropical island of St. Lucia
       • Business: Find companies that manufacture digital signal processors
       • Entertainment: Find all movies starring Marilyn Monroe during the years 1960 and 1970
       • Arts: Find all short stories written by Jhumpa Lahiri

     Why is Web Information Retrieval Difficult?
     • The Abundance Problem (99% of information of no interest to 99% of people)
       • Hundreds of irrelevant documents returned in response to a search query
     • Limited Coverage of the Web (Internet sources hidden behind search interfaces)
       • Largest crawlers cover less than 18% of Web pages
     • The Web is extremely dynamic
       • Lots of pages added, removed and changed every day
     • Very high dimensionality (thousands of dimensions)
     • Limited query interface based on keyword-oriented search
     • Limited customization to individual users
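The two-step estimate attributed to Lawrence and Giles above (fraction of random IP addresses that answer as web servers, then average pages per server from a sample crawl) can be sketched as follows. This is a minimal illustration, not the authors' procedure: `probe` and `crawl_page_count` are hypothetical callables supplied by the caller, and scaling by the full IPv4 address space is a simplifying assumption that ignores reserved ranges.

```python
import random

IPV4_ADDRESS_SPACE = 2 ** 32  # assumption: every IPv4 address is a candidate


def estimate_web_size(probe, crawl_page_count, num_probes, crawl_sample_size, rng):
    """Return (estimated_servers, estimated_pages).

    probe(ip) -> bool           hypothetical test: True if a web server answers at ip
    crawl_page_count(ip) -> int hypothetical crawl: pages found on that server
    rng                         a random.Random instance used to draw addresses
    """
    # Step 1: test random IP addresses and keep the ones that host a web server.
    tested = (rng.randrange(IPV4_ADDRESS_SPACE) for _ in range(num_probes))
    hits = [ip for ip in tested if probe(ip)]
    hit_fraction = len(hits) / num_probes
    estimated_servers = hit_fraction * IPV4_ADDRESS_SPACE

    # Step 2: crawl a sample of the servers identified in step 1.
    sample = hits[:crawl_sample_size]
    if not sample:
        return estimated_servers, 0.0
    avg_pages = sum(crawl_page_count(ip) for ip in sample) / len(sample)

    # Total pages = number of servers x average pages per server.
    return estimated_servers, estimated_servers * avg_pages
```

For a dry run, pass simulated callables, for example `estimate_web_size(lambda ip: ip % 1000 == 0, lambda ip: 300, 100_000, 50, random.Random(0))`; a real measurement would replace them with network probes and a crawler.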

  4. Search Landscape
     [Figure: search engine market share charts labelled 2005 and Sept 2009]
     http://marketshare.hitslink.com/search-engine-market-share.aspx?qprid=4

     Search Engine Web Coverage Overlap
     • 4 searches were defined that returned 141 web pages.
     http://www.searchengineshowdown.com/stats/overlap.shtml

     Web search basics
     [Figure: web search architecture with the components User, Search, Web crawler, Indexer, The Web, Indexes and Ad indexes, shown next to an example results page for the query "miele" (sponsored links plus web results 1 - 10 of about 7,310,000)]

     Web Crawling Basics
     • Start with a "seed set" of to-visit urls
     [Figure: crawl loop with the steps "get next url", "get page" and "extract urls", connecting the to-visit urls, the Web, the visited urls and the stored web pages]
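The crawl loop on the Web Crawling Basics slide (seed set, get next url, get page, extract urls, visited urls) can be sketched as below. This is a simplified sketch, not a production crawler: `fetch` is a hypothetical caller-supplied function that returns a page's HTML or `None`, the breadth-first order and the `max_pages` cap are assumptions, and politeness delays and robots.txt handling are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Crawl breadth-first from seed_urls; return {url: html} of fetched pages."""
    to_visit = deque(seed_urls)   # "to visit urls", initialised with the seed set
    visited = set()               # "visited urls"
    pages = {}                    # "web pages" handed on to the indexer
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()              # get next url
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)                     # get page
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:             # extract urls
            absolute = urljoin(url, href)     # resolve relative links
            if absolute not in visited:
                to_visit.append(absolute)
    return pages
```

Passing `fetch` in as a parameter keeps the loop testable against an in-memory dictionary of pages; a real crawler would wrap an HTTP client there.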

  5. Crawling Issues
     • Load on web servers
       • E.g., no more than 1 request to the same server every 10 seconds
     • Insufficient resources to crawl entire web
       • Visit "important" pages first (pagerank, inlinks …)
     • How to keep crawled pages "fresh"?
       • How often do web pages change? What do we mean by freshness?
     • Detecting replicated content e.g., mirrors
       • Use document comparison techniques (java manuals)
     • Can't crawl the web from one machine
       • Parallelizing the crawl

     Web Advertising
     • Banner ads (1995-2001)
       • Initial form of web advertising
       • Popular websites charged X$ for every 1000 "impressions" of ad
       • Modeled similar to TV, magazine ads
       • Low clickthrough rates
         • low ROI for advertisers
     • Search keyword advertising, introduced by Overture around 2000
       • Advertisers "bid" on search keywords
       • When someone searches for that keyword, the highest bidder's ad is shown
       • Advertiser is charged only if the ad is clicked on

     Web Advertising
     • Search advertising is the revenue model
       • Multi-billion-dollar industry
       • Advertisers pay for clicks on their ads
     • Interesting problems
       • What ads to show for a search?
         • Maximise revenue, each advertiser has a limited budget
       • If I'm an advertiser, which search terms should I bid on and how much to bid?

     Web Mining Taxonomy
     • Web Mining
       • Web Content Mining
       • Web Structure Mining
       • Web Usage Mining
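The keyword-bidding scheme on the Web Advertising slides (the highest bidder's ad is shown, the advertiser pays only on a click, each advertiser has a limited budget) can be illustrated with a greedy sketch. The `Advertiser` class and both functions are hypothetical names, and charging the advertiser's own bid per click is an assumption; the slides do not say how the price is set.

```python
from dataclasses import dataclass


@dataclass
class Advertiser:
    name: str
    bids: dict      # keyword -> bid per click
    budget: float   # remaining budget


def select_ad(advertisers, keyword):
    """Return the highest bidder on keyword that can still afford a click, or None."""
    eligible = [a for a in advertisers
                if keyword in a.bids and a.budget >= a.bids[keyword]]
    if not eligible:
        return None
    return max(eligible, key=lambda a: a.bids[keyword])


def record_click(advertiser, keyword):
    """Charge the advertiser only when its ad is clicked (price = own bid, an assumption)."""
    advertiser.budget -= advertiser.bids[keyword]
```

Always showing the current highest bidder is the simplest rule; it does not necessarily maximise total revenue once budgets run out, which is the open problem the slide raises.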

