web usage mining

  1. Web Usage Mining
     From Bing Liu, “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data”, Springer. Chapter written by Bamshad Mobasher.
     Many slides are from a tutorial given by B. Berendt, B. Mobasher, M. Spiliopoulou.
     Data e Web Mining - S. Orlando

  2. Introduction
     - Web usage mining: automatic discovery of patterns in clickstreams and associated data, collected or generated as a result of user interactions with one or more Web sites.
     - Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
     - The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common interests.

  3. Introduction
     - Data in Web usage mining:
       - Web server logs
       - Site contents
       - Data about the visitors, gathered from external channels
       - Further application data
     - Not all of these data are always available.
     - When they are, they must be integrated.
     - A large part of Web usage mining is about processing usage/clickstream data.
       - After that, various data mining algorithms can be applied.

  4. Web server logs

  5. Terminology and levels of abstraction

  6. Web usage mining (simplified view)

  7. Web usage mining process

  8. Data preparation

  9. Data cleaning, fusion
     - Data cleaning
       - remove irrelevant references and fields in server logs
       - remove references due to spider/robot navigation
       - remove erroneous references
       - add missing references due to caching (done after sessionization)
     - Data fusion/integration
       - synchronize data from multiple server logs
       - integrate e-commerce and application server data
       - integrate meta-data (e.g., content labels)
       - integrate demographic/registration data
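The cleaning rules listed above could be sketched roughly as follows. This is only an illustration: the record fields (`ip`, `agent`, `url`, `status`), the suffix list, and the robot markers are assumptions, not something the slides prescribe, and real robot detection usually needs more than a substring test.

```python
# Hypothetical log records; field names and filter lists are assumptions.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_relevant(entry):
    """Return True if a log entry should survive the cleaning step."""
    url = entry["url"].lower()
    agent = entry["agent"].lower()
    if url.endswith(IRRELEVANT_SUFFIXES):
        return False  # embedded objects and style files
    if url == "/robots.txt" or any(marker in agent for marker in ROBOT_MARKERS):
        return False  # spider/robot navigation
    if entry["status"] >= 400:
        return False  # erroneous references
    return True


def clean(entries):
    """Keep only the entries that pass every cleaning rule."""
    return [entry for entry in entries if is_relevant(entry)]


log = [
    {"ip": "1.2.3.4", "agent": "Mozilla/5.0", "url": "/index.html", "status": 200},
    {"ip": "1.2.3.4", "agent": "Mozilla/5.0", "url": "/logo.gif", "status": 200},
    {"ip": "5.6.7.8", "agent": "ExampleBot/1.0", "url": "/index.html", "status": 200},
    {"ip": "1.2.3.4", "agent": "Mozilla/5.0", "url": "/missing.html", "status": 404},
]
cleaned = clean(log)
```

In this toy log only the first request survives; the image, the robot request, and the 404 are dropped.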

  10. Data transformation
      - Data transformation
        - user identification
        - sessionization
        - pageview identification: a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
        - episode identification
      - Data reduction
        - sampling and dimensionality reduction (ignoring certain pageviews/items)
      - Identifying user transactions
        - i.e., sets or sequences of pageviews, possibly with associated weights

  11. Identify sessions (Sessionization)
      - The quality of the patterns discovered in KDD depends on the quality of the data on which mining is applied.
      - In Web usage analysis, these data are the sessions of the site visitors: the activities performed by a user from the moment she enters the site until the moment she leaves it.
      - Reliable usage data are difficult to obtain due to
        - proxy servers and anonymizers,
        - dynamic IP addresses,
        - missing references due to caching, and
        - the inability of servers to distinguish among different visits.

  12. Sessionization strategies
      - Session reconstruction = correct mapping of activities to different individuals + correct separation of activities belonging to different visits of the same individual

  13. User identification

  14. Session uncertainty: evaluate Real vs. Reconstructed sessions

  15. User identification: an example
      - Combination of IP address and Agent fields in Web logs
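A minimal sketch of this idea, assuming log entries are dictionaries with hypothetical `ip` and `agent` keys: requests that share both values are attributed to the same user. As the earlier slides note, proxies and dynamic IP addresses can make this approximation wrong in either direction.

```python
from collections import defaultdict


def identify_users(entries):
    """Group log entries by the (IP address, user agent) pair."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry)
    return dict(users)


# Toy log: one IP address, two different browsers.
log = [
    {"ip": "1.2.3.4", "agent": "Firefox", "url": "/A"},
    {"ip": "1.2.3.4", "agent": "Chrome", "url": "/B"},
    {"ip": "1.2.3.4", "agent": "Firefox", "url": "/C"},
]
users = identify_users(log)
```

Here the single IP address is split into two presumed users because the agent strings differ.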

  16. Sessionization heuristics
      - Also called structure-oriented: use either the static structure of the site, or the implicit linkage structure inferred from the referrer fields
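One simplified way to use the referrer field for sessionization is sketched below. It is an assumption-laden variant, not necessarily the heuristic shown on the original slide: a request joins the most recent session that already contains its referrer, and otherwise opens a new session. The `(url, referrer)` tuple layout is hypothetical.

```python
def sessionize_by_referrer(requests):
    """Assign each (url, referrer) request to a session via its referrer."""
    sessions = []
    for url, referrer in requests:
        # Search the most recently opened sessions first.
        for session in reversed(sessions):
            if referrer in session:
                session.append(url)
                break
        else:
            # No session contains the referrer: start a new one.
            sessions.append([url])
    return sessions


requests = [("A", None), ("B", "A"), ("X", None), ("C", "B"), ("Y", "X")]
sessions = sessionize_by_referrer(requests)
```

With these requests the interleaved visits separate into the sessions `["A", "B", "C"]` and `["X", "Y"]`.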

  17. Sessionization example: time-oriented heuristic
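A time-oriented heuristic can be sketched as follows, under the assumption that a new session starts whenever the gap between two consecutive requests of the same user exceeds a threshold. The 30-minute default and the `(timestamp, url)` layout are illustrative choices, not values taken from the slides.

```python
def sessionize(requests, max_gap=30 * 60):
    """Split one user's (timestamp_in_seconds, url) requests into sessions."""
    sessions = []
    for timestamp, url in sorted(requests):
        # Compare with the timestamp of the last request in the open session.
        if sessions and timestamp - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((timestamp, url))
        else:
            sessions.append([(timestamp, url)])
    return sessions


requests = [(0, "/A"), (600, "/B"), (900, "/C"), (4000, "/A"), (4100, "/D")]
sessions = sessionize(requests)
```

The gap of 3100 seconds between the third and fourth request exceeds the threshold, so the requests split into two sessions.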

  18. Pageview identification
      - Depends on the intra-page structure of sites
      - Identify the collection of Web files/objects/resources representing a specific “user event” corresponding to a click-through (e.g., viewing a product page, adding a product to a shopping cart)
      - In some cases it may be useful to consider pageviews at a higher level of aggregation
        - e.g., they may correspond to many user events related to the same concept category, like the purchase of a product on an e-commerce site

  19. Path completion
      - Client- or proxy-side caching can often result in missing access references to those pages or objects that have been cached.
      - For instance,
        - if a user goes back to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side, and therefore no request is made to the server.
        - As a result, the second reference to A is not recorded in the server logs.

  20. Path completion
      - Path completion: how to infer missing user references due to caching.
      - Effective path completion requires extensive knowledge of the link structure within the site.
      - Referrer information in server logs can also be used to disambiguate the inferred paths.
      - The problem gets much more complicated in frame-based sites.

  21. Missing references due to caching
      - Reconstruction by using knowledge about the site structure
        - also inferred from the referrer fields
      - Many paths are possible
        - usually the selected path is the one requiring the fewest “back” references
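The reconstruction idea can be illustrated with a small sketch that relies only on referrer fields. It assumes each request is a hypothetical `(page, referrer)` pair and that a referrer differing from the current page means the user pressed "back" until reaching that referrer. Backing up to the most recent occurrence of the referrer is what keeps the number of inferred "back" steps small.

```python
def complete_path(requests):
    """Insert inferred 'back' references into a list of (page, referrer) pairs."""
    completed = []  # full inferred path, including cached revisits
    trail = []      # pages on the current navigation trail
    for page, referrer in requests:
        if trail and referrer != trail[-1] and referrer in trail:
            # Step back along the trail until the referrer is on top.
            while trail[-1] != referrer:
                trail.pop()
                completed.append(trail[-1])
        completed.append(page)
        trail.append(page)
    return completed


# The log shows A, B, C, D, but D was reached from A.
requests = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
path = complete_path(requests)
```

For this log the inferred path is A, B, C, B, A, D: the revisits of B and A were served from the cache and never reached the server.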

  22. Data modeling for Web Usage Mining
      - Data preprocessing produces
        - a set of pageviews: $P = \{p_1, \ldots, p_n\}$
        - a set of user transactions: $T = \{t_1, \ldots, t_m\}$, where each transaction $t_i$ contains a subset of $P$
        - Each transaction
          $t = \langle (p_1^t, w(p_1^t)), (p_2^t, w(p_2^t)), \ldots, (p_l^t, w(p_l^t)) \rangle$
          is an $l$-length ordered sequence of pageviews, where each $w$ corresponds to a weight, e.g., the significance of the pageview
        - In collaborative filtering these weights correspond to explicit user ratings
        - In transactions collected on the Web, the weight can be the duration of the page visit in the session

  23. Data modeling for Web Usage Mining (cont.)
      - In many mining tasks, the sequential ordering of pageviews within a transaction is not important (e.g., clustering, association rule extraction)
      - In this case a transaction can be represented as an $n$-length vector
        $t = (w_1^t, w_2^t, \ldots, w_n^t)$
        where the weight is 0 if the corresponding page is not present in $t$, and otherwise corresponds to the significance of the page in $t$

      $m \times n$ user-pageviews matrix (or transaction matrix):

                page A  page B  page C  page D  page E
      user 0        15       4       1       0       0
      user 1         2       0      25       0       0
      user 2       200       1       0       0       3
      user 3        56       0       0       4       4
      user 4         0       0      23      50       0
      user 5         0       0       5       3       0
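As a rough illustration of the vector representation, the sketch below turns an ordered transaction of `(pageview, weight)` pairs into an $n$-length weight vector. Summing the weights of repeated pageviews is an assumption made here for simplicity; the slides do not say how repeats are handled.

```python
def to_vector(transaction, pageviews):
    """Map a sequence of (pageview, weight) pairs to an n-length weight vector."""
    weights = {}
    for page, weight in transaction:
        # Assumption: repeated visits to the same pageview are summed.
        weights[page] = weights.get(page, 0) + weight
    # Pageviews that do not occur in the transaction get weight 0.
    return [weights.get(page, 0) for page in pageviews]


pageviews = ["A", "B", "C", "D", "E"]
transaction = [("C", 1), ("A", 15), ("B", 4)]  # visit order is discarded
vector = to_vector(transaction, pageviews)
```

The resulting vector matches the row of user 0 in the matrix above, regardless of the order in which the pages were visited.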

  24. Data modeling for Web Usage Mining (cont.)
      - Given a user-pageview matrix, a number of unsupervised mining techniques can be exploited

      $m \times n$ user-pageviews matrix (or transaction matrix):

                page A  page B  page C  page D  page E
      user 0        15       4       1       0       0
      user 1         2       0      25       0       0
      user 2       200       1       0       0       3
      user 3        56       0       0       4       4
      user 4         0       0      23      50       0
      user 5         0       0       5       3       0

      - Clustering of transactions/sessions, to determine important visitor segments
      - Clustering of pageviews (items), expressed in terms of user judgments, to discover important relationships between pageviews (items)
      - Sequential (timestamps must be maintained) and non-sequential association rules, to discover important relationships between pageviews (items)

  25. Data modeling for Web Usage Mining (cont.)
      - Automatic integration of content information
        - textual features from the Web contents represent the underlying semantics of the pages
        - the aim is to transform a user-pageviews matrix into a content-enhanced transaction matrix

      $n \times r$ pageviews-terms matrix:

                 food   news    car  house  party    sky
      page A        0      1      1      0      0      0
      page B        1      0      0      1      0      0
      page C        1      1      0      0      0      0
      page D        0      0      1      0      0      1
      page E        0      0      0      1      1      0

  26. Data modeling for Web Usage Mining (cont.)

      $P$ = $n \times r$ pageviews-terms matrix:

                 food   news    car  house  party    sky
      page A        0      1      1      0      0      0
      page B        1      0      0      1      0      0
      page C        1      1      0      0      0      0
      page D        0      0      1      0      0      1
      page E        0      0      0      1      1      0

      $U$ = $m \times n$ user-pageviews matrix (or transaction matrix):

                page A  page B  page C  page D  page E
      user 0         1       1       0       0       0
      user 1         0       0       1       0       0
      user 2         1       0       0       0       1
      user 3         1       0       0       1       1
      user 4         0       0       1       1       0
      user 5         0       0       1       0       0

      $U \times P$ = $m \times r$ content-enhanced transaction matrix:

                 food   news    car  house  party    sky
      user 0        1      1      1      1      0      0
      user 1        1      1      0      0      0      0
      user 2        0      1      1      1      1      0
      user 3        0      1      2      1      1      1
      user 4        1      1      1      0      0      1
      user 5        1      1      0      0      0      0
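The product on this slide can be checked with a few lines of plain Python. The helper name `matmul` is only illustrative; the matrices are the ones shown on the slide.

```python
def matmul(u, p):
    """Multiply an m x n matrix by an n x r matrix, both given as lists of rows."""
    rows, inner, cols = len(u), len(p), len(p[0])
    return [
        [sum(u[i][k] * p[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


# P: pageviews (A..E) x terms (food, news, car, house, party, sky)
P = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

# U: users (0..5) x pageviews (A..E)
U = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

content_enhanced = matmul(U, P)
for user, row in enumerate(content_enhanced):
    print(f"user {user}", row)
```

Each row of the result sums the term vectors of the pages that the user visited, which is why user 3, who viewed pages A, D, and E, gets a weight of 2 for "car".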
