Web Mining

Based on several presentations found on the web: Shapiro, Ullman, Terziyan, Pedersen ...

What is Web Mining?

• Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996, CACM 39(11))
• Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content and usage data. (Bing LIU 2007, Web Data Mining, Springer)

What is Web Mining?

• Motivation / Opportunity
  • The WWW is a huge, widely distributed, global information service centre and therefore constitutes a rich source for data mining
• Intelligent Web Search
• Personalization, Recommendation Engines
• Web-commerce applications
• Building the Semantic Web
• Web page classification and categorization
• News classification and clustering
• Information / trend monitoring
• Analysis of online communities
• Web and mail spam filtering

Different from “classical” Data Mining?

• The web is not a relation
  • Textual information + linkage structure
• Usage data is huge and growing rapidly
  • Google’s usage logs are bigger than their web crawl
  • Data generated per day is comparable to largest conventional data warehouses
Size of the Web

• Number of pages
  • 11.5 billion indexable pages (http://www.cs.uiowa.edu/~asignori/web-size/, WWW 2005)
  • Technically, infinite
    • Because of dynamically generated content
    • Lots of duplication (30-40%)
  • Best estimate of “unique” static HTML pages comes from search engine claims
    • Yahoo = claimed 19.2 billion in Aug 2005
• Number of unique web sites
  • Netcraft survey says 98 million sites

October 2006 Web Server Survey

[Figure: Netcraft web server survey chart, October 2006]
http://news.netcraft.com/archives/web_server_survey.html

Abundance and authority crisis

• Liberal and informal culture of content generation and dissemination
• Redundancy and non-standard form and content
• Millions of qualifying pages for most broad queries
  • Example: java or kayaking
• No authoritative information about the reliability of a site
• Little support for adapting to the background of specific users
• Pages added continuously and average page changes in a few weeks

One way to estimate the web size

• The number of web servers was estimated by sampling and testing random IP address numbers and determining the fraction of such tests that successfully located a web server
• The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment

Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.
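The two-step estimate above reduces to simple arithmetic: scale the fraction of probed IP addresses that answered as web servers up to the whole address space, then multiply by the average page count observed on a crawled sample of those servers. The following is a minimal sketch of that arithmetic only. The function name, its parameters and the numbers in the usage lines are illustrative assumptions, not figures from Lawrence and Giles.

```python
def estimate_web_size(ips_tested, servers_found, address_space, pages_per_sampled_server):
    """Sketch of a sampling-based web size estimate (hypothetical helper).

    ips_tested: number of random IP addresses probed
    servers_found: how many of those probes located a web server
    address_space: total number of IP addresses the sample was drawn from
    pages_per_sampled_server: page counts from crawling a sample of the found servers
    """
    if ips_tested <= 0 or not pages_per_sampled_server:
        raise ValueError("need at least one probe and one crawled server")
    # Step 1: scale the observed hit rate up to the full address space.
    estimated_servers = servers_found * address_space / ips_tested
    # Step 2: average pages per server, from the crawled sample.
    avg_pages = sum(pages_per_sampled_server) / len(pages_per_sampled_server)
    return estimated_servers, avg_pages, estimated_servers * avg_pages


# Made-up toy numbers, chosen only to make the arithmetic easy to follow.
servers, avg_pages, total_pages = estimate_web_size(
    ips_tested=1000,
    servers_found=2,
    address_space=1_000_000,
    pages_per_sampled_server=[100, 300],
)
print(servers, avg_pages, total_pages)
```

The estimate is only as good as the two samples: servers hosting several sites on one IP address, or pages that cannot be reached by following links, are not accounted for in this sketch.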
Web Information Retrieval

• According to most predictions, the majority of human information will be available on the Web in ten (?) years
• Effective information retrieval can aid in
  • Research: Find all papers about web mining
  • Health/Medicine: What could be the reason for symptoms of “yellow eyes”, high fever and frequent vomiting?
  • Travel: Find information on the tropical island of St. Lucia
  • Business: Find companies that manufacture digital signal processors
  • Entertainment: Find all movies starring Marilyn Monroe between the years 1960 and 1970
  • Arts: Find all short stories written by Jhumpa Lahiri

Why is Web Information Retrieval Difficult?

• The Abundance Problem (99% of information of no interest to 99% of people)
  • Hundreds of irrelevant documents returned in response to a search query
• Limited Coverage of the Web (Internet sources hidden behind search interfaces)
  • Largest crawlers cover less than 18% of Web pages
• The Web is extremely dynamic
  • Lots of pages added, removed and changed every day
• Very high dimensionality (thousands of dimensions)
• Limited query interface based on keyword-oriented search
• Limited customization to individual users

Search Landscape 2005

• Four major “Mainframes”
  • Google, Yahoo, MSN, and ASK
• >450M searches daily
  • 60% international
  • Thousands of machines
  • $8+B in Paid Search Revenues
• Large indices
  • Billions of documents
  • Terabytes of data
• Excellent relevance
  • For some tasks

Search Engine Web Coverage Overlap

[Figure: overlap of search engine coverage. 4 searches were defined that returned 141 web pages.]
http://www.searchengineshowdown.com/stats/overlap.shtml
Web search basics

[Figure: a user submits a query (example: “miele”) to a search engine and gets back a results page with sponsored links and web results (“1 - 10 of about 7,310,000”). Behind the search engine, a web crawler fetches pages from the Web and an indexer builds the indexes and ad indexes.]

Web Crawling Basics

• Start with a “seed set” of to-visit urls

[Figure: crawl loop. to-visit urls → get next url → get page (from the Web) → extract urls → back to to-visit urls; visited urls and web pages are stored along the way.]
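The crawl loop in the diagram above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production crawler. The `crawl` function, the `fetch` callback and the `TOY_WEB` dictionary are hypothetical names introduced here, and the "web" is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags ("extract urls")."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from a seed set.

    fetch(url) is an assumed callback returning HTML text, or None on failure.
    """
    to_visit = deque(seed_urls)   # "to-visit urls"
    visited = set()               # "visited urls"
    pages = {}                    # "web pages" handed on to the indexer
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()  # "get next url"
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)         # "get page"
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in visited:
                to_visit.append(absolute)
    return pages


# Toy in-memory "web" (hypothetical URLs) standing in for real HTTP requests.
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}
pages = crawl(["http://example.test/"], TOY_WEB.get)
print(sorted(pages))
```

A FIFO queue gives breadth-first order. Swapping it for a priority queue is one way to visit some pages before others.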
Crawling Issues

• Load on web servers
  • E.g., no more than 1 request to the same server every 10 seconds
• Insufficient resources to crawl entire web
  • Visit “important” pages first (pagerank, inlinks …)
• How to keep crawled pages “fresh”?
  • How often do web pages change? What do we mean by freshness?
• Detecting replicated content, e.g., mirrors
  • Use document comparison techniques (java manuals)
• Can’t crawl the web from one machine
  • Parallelizing the crawl

Web Advertising

• Banner ads (1995-2001)
  • Initial form of web advertising
  • Popular websites charged X$ for every 1000 “impressions” of ad
  • Modeled similar to TV, magazine ads
  • Low clickthrough rates
    • Low ROI for advertisers
• Search advertising (introduced by Overture around 2000)
  • Advertisers “bid” on search keywords
  • When someone searches for that keyword, the highest bidder’s ad is shown
  • Advertiser is charged only if the ad is clicked on
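The keyword-bidding scheme above can be illustrated with a small sketch: look up the bids for the searched keyword, show the highest bidder's ad, and charge only when the ad is clicked. The function names, advertiser names and bid amounts are hypothetical. Charging the advertiser its own bid per click is an assumption made for simplicity, and real systems may price clicks differently.

```python
def select_ad(bids, keyword):
    """Return (advertiser, bid) for the highest bidder on keyword, or None."""
    offers = bids.get(keyword, {})
    if not offers:
        return None
    advertiser = max(offers, key=offers.get)
    return advertiser, offers[advertiser]


def record_impression(spend, advertiser, price, clicked):
    """Charge the advertiser only if the shown ad was clicked.

    Assumption: the price per click equals the advertiser's own bid.
    """
    if clicked:
        spend[advertiser] = spend.get(advertiser, 0.0) + price
    return spend


# Hypothetical bids in dollars per click, keyed by search keyword.
bids = {"vacuum": {"shop_a": 0.40, "shop_b": 0.55}}

winner = select_ad(bids, "vacuum")
spend = {}
record_impression(spend, winner[0], winner[1], clicked=False)  # shown, not clicked
record_impression(spend, winner[0], winner[1], clicked=True)   # shown and clicked
print(winner, spend)
```

Unlike the banner model, the impression that is not clicked costs the advertiser nothing here.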
Web Mining Taxonomy

[Figure: Web Mining branches into Web Content Mining, Web Structure Mining and Web Usage Mining]

Web Advertising

• Search advertising is the revenue model
  • Multi-billion-dollar industry
  • Advertisers pay for clicks on their ads
• Interesting problems
  • What ads to show for a search?
    • Maximise revenue, each advertiser has a limited budget
  • If I’m an advertiser, which search terms should I bid on and how much to bid?

Web Mining Taxonomy

• Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion
• Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality which is available in the form of hyperlinks
• Web usage mining: focuses on techniques to study the user behaviour when navigating the web (also known as Web log mining and clickstream analysis)

Web Content Mining

Examines the content of web pages as well as results of web searching.