Web Mining Overview

Web Mining Overview, Dr Ahmed Rafea - PowerPoint PPT Presentation



  1. Web Mining Overview
     Dr Ahmed Rafea

  2. Web Mining Outline
     Goal: Examine the use of data mining on the World Wide Web
     - Web Data
     - Web Content Mining
     - Web Structure Mining
     - Web Usage Mining
     - Common Web Mining Techniques
     - Research Directions

  3. Web Data
     - Web pages
     - Page structures
     - Usage data
     - Supplemental data
       - Profiles
       - Registration information
       - Cookies

  4. Web Mining Taxonomy
     Modified from [zai01]

  5. Web Content Mining (1)
     - The lack of structure that permeates the information sources on the World Wide Web makes automated discovery of Web-based information difficult.
     - In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, and to extend data mining techniques to provide a higher level of organization for semi-structured data available on the Web.

  6. Web Content Mining (2)
     - Techniques for Web content mining can be classified into:
       - Agent-based approach
         - Intelligent search agents, which use domain characteristics
         - Information filtering/categorization, which uses information retrieval techniques
         - Personalized Web agents, which use user preferences
       - Database approach
         - Multilevel databases, which extract metadata from lower-level data and organize it in a structured collection
         - Web query systems, which use SQL-like queries to extract Web document structure, and content queries using IR techniques

  7. Web Structure Mining (1)
     - Mine structure (links, graph) of the Web
     - Techniques
       - PageRank
       - CLEVER
     - Create a model of the Web organization.
     - May be combined with content mining to more effectively retrieve important pages.

  8. Web Structure Mining (2)
     - PageRank
       - Used by Google
       - Prioritizes pages returned from a search by looking at Web structure.
       - The importance of a page is calculated based on the number of pages which point to it (backlinks).
       - Weighting is used to give more importance to backlinks coming from important pages.
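
The backlink weighting on the PageRank slide can be sketched as a short power iteration. This is a minimal illustration, not Google's implementation: the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page `toy_web` graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank per page; `links` maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to,
                # so a backlink from an important page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: C has the most backlinks, D has none.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph page A ends up ranked above B even though each has a single backlink, because A's backlink comes from the highly ranked page C.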

  9. Web Structure Mining (3)
     - CLEVER identifies authoritative and hub pages.
       - Authoritative pages:
         - Highly important pages.
         - Best source for requested information.
       - Hub pages:
         - Contain links to highly important pages.
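
The hub and authority idea can be sketched with the mutually reinforcing update used by HITS-style algorithms, on which CLEVER builds. This is a simplified illustration, not the CLEVER system itself: the page names, the iteration count, and the normalization are assumptions for the example.

```python
import math


def hubs_and_authorities(links, iterations=20):
    """Return (hub, authority) scores; `links` maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages that link to a page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages a page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two link collections pointing at three documents.
toy_links = {
    "portal1": ["docA", "docB"],
    "portal2": ["docA", "docB", "docC"],
}
hub_scores, auth_scores = hubs_and_authorities(toy_links)
```

Here the two portals come out as hubs and the documents as authorities, with `docA` and `docB` scoring above `docC` because both hubs point to them.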

  10. Web Usage Mining (1)
      - Web usage mining is the automatic discovery of user access patterns from Web servers.
      - Organizations collect large volumes of data in their daily operations, generated automatically by Web servers and collected in server access logs.
      - Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via CGI scripts.
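
As a rough sketch of how access and referrer data might be read before mining, the snippet below parses one log line. It assumes the server writes the widely used Combined Log Format; the slides do not name a log format, and the sample line, host address, and `parse_log_line` helper are invented for illustration.

```python
import re

# Assumed layout (Combined Log Format):
# host ident user [time] "method path protocol" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


# Invented sample line, not taken from a real server.
sample_line = (
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /company/product1 HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "ExampleBrowser/1.0"'
)
record = parse_log_line(sample_line)
```

The `path` field gives the page reference and the `referrer` field gives the referring page, which are the two pieces of information the slide singles out.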

  11. Web Usage Mining (2)
      - Tools for Web usage mining can be classified into:
        - Pattern discovery tools, which use techniques from AI, data mining, and information retrieval to mine for knowledge from collected data
        - Pattern analysis tools, which are needed to understand, visualize, and interpret these patterns

  12. Common Web Mining Techniques
      - The common techniques for Web mining are:
        - clustering,
        - classification,
        - association rules,
        - path analysis, and
        - sequential patterns.

  13. Clustering
      - Clustering analysis allows one to group together clients or data items that have similar characteristics.
      - Clustering of client information or data items in Web transaction logs can facilitate the development and execution of future marketing strategies, both online and offline, such as:
        - automated return mail to clients falling within a certain cluster, or
        - dynamically changing a particular site for a client, on a return visit, based on past classification of that client.
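
One way to group clients with similar behaviour is k-means over per-client visit counts. The slides do not prescribe a clustering algorithm, so this is only a sketch: the choice of k-means, the feature vectors (visits to product pages, visits to support pages), and the deterministic initialization from the first k points are all assumptions.

```python
def kmeans(points, k, iterations=10):
    """Return (assignment, centroids) for a list of equal-length numeric tuples."""
    # Assumption: take the first k points as starting centroids to keep the run repeatable.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the nearest centroid (squared Euclidean distance).
        for i, point in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
            )
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical clients: (visits to product pages, visits to support pages).
clients = ["c1", "c2", "c3", "c4"]
visit_counts = [(9, 1), (1, 8), (8, 0), (0, 9)]
assignment, centroids = kmeans(visit_counts, k=2)
clusters = dict(zip(clients, assignment))
```

With this data `c1` and `c3` land in one cluster (mostly product pages) and `c2` and `c4` in the other (mostly support pages), which is the kind of grouping a site could use to tailor content on a return visit.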

  14. Classification
      - Discovering classification rules allows one to develop a profile of items belonging to a particular group according to their common attributes.
      - This profile can then be used to classify new data items that are added to the database.
      - For example, classification on WWW access logs may lead to the discovery of relationships such as the following:
        - clients from state or government agencies who visit the site tend to be interested in the page /company/product1

  15. Association Rules
      - Rules that govern "databases of transactions where each transaction consists of a set of items."
      - This technique is used to predict the correlation of items "where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items."
      - For example, prediction of the percentage of clients accessing a particular URL who will place online orders for a certain product
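
The "certain degree of confidence" in an association rule is usually computed from support counts. The sketch below shows that calculation for a single rule; the session data and URL names are invented, and a real miner would also search for the candidate rules rather than test one by hand.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical sessions: the set of URLs each client accessed.
sessions = [
    {"/products/x", "/order"},
    {"/products/x"},
    {"/products/x", "/order", "/help"},
    {"/help"},
]
rule_support = support(sessions, {"/products/x", "/order"})
rule_confidence = confidence(sessions, {"/products/x"}, {"/order"})
```

For these four sessions the rule "/products/x implies /order" has support 0.5 and confidence 2/3: two of the three clients who accessed the product page went on to place an order.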

  16. Path Analysis
      - A technique that involves the generation of some form of graph that "represents relation[s] defined on Web pages."
      - This can be the physical layout of a Web site, in which the Web pages are nodes and the hypertext links between these pages are directed edges.
      - Most such graphs are used to determine frequent traversal patterns or large reference sequences from the physical layout, such as the most frequently visited paths in a Web site.
      - For example, what paths do users travel before they go to a particular URL?
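
The question "what paths do users travel before they go to a particular URL?" can be answered by counting the page sequences that immediately precede that URL in each session. This is a minimal sketch under assumptions: sessions are already reconstructed as ordered page lists, and the fixed path length of 2 and the URLs are invented for the example.

```python
from collections import Counter


def paths_before(sessions, target, length=2):
    """Count the `length`-page sequences visited immediately before `target`."""
    counts = Counter()
    for pages in sessions:
        for i, page in enumerate(pages):
            # Only count visits that have a full path of `length` pages before them.
            if page == target and i >= length:
                counts[tuple(pages[i - length:i])] += 1
    return counts


# Hypothetical sessions: the ordered list of pages each user visited.
sessions = [
    ["/", "/products", "/order"],
    ["/", "/products", "/order"],
    ["/search", "/products", "/order"],
    ["/", "/help"],
]
order_paths = paths_before(sessions, "/order")
most_common_path = order_paths.most_common(1)
```

In this data the most frequent route into `/order` is the home page followed by `/products`, taken in two of the three sessions that reach the order page.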

  17. Sequential Patterns
      - Applied to Web access server transaction logs.
      - The purpose is to discover sequential patterns that indicate user visit patterns over a certain period.
      - For example, "30% of clients who visited /company/products/, had done a search in Yahoo within the past week on keyword W"
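
A statement like the one in the example can be checked by measuring, among clients who performed a target action, the share who performed an earlier action within a time window before it. The sketch below does only that check for one given pattern; the event tuples, the `search:W` label, and the client names are invented, and discovering such patterns automatically needs a dedicated sequential pattern miner.

```python
from datetime import datetime, timedelta


def fraction_preceded(events, target, earlier, window=timedelta(days=7)):
    """Share of clients with `target` who did `earlier` within `window` before it.

    `events` is a list of (client, timestamp, action) tuples.
    """
    history = {}
    for client, timestamp, action in events:
        history.setdefault(client, []).append((timestamp, action))
    visited = 0
    matched = 0
    for actions in history.values():
        target_times = [ts for ts, a in actions if a == target]
        if not target_times:
            continue  # this client never performed the target action
        visited += 1
        earlier_times = [ts for ts, a in actions if a == earlier]
        if any(
            timedelta(0) <= t - e <= window
            for t in target_times
            for e in earlier_times
        ):
            matched += 1
    return matched / visited if visited else 0.0


# Hypothetical event log.
events = [
    ("c1", datetime(2000, 1, 1), "search:W"),
    ("c1", datetime(2000, 1, 3), "/company/products/"),
    ("c2", datetime(2000, 1, 1), "search:W"),
    ("c2", datetime(2000, 1, 20), "/company/products/"),
    ("c3", datetime(2000, 1, 5), "/company/products/"),
    ("c4", datetime(2000, 1, 2), "search:W"),
]
share = fraction_preceded(events, "/company/products/", "search:W")
```

Three clients visit the products page here, and only `c1` searched within the preceding week, so the share is one third.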
