Web Usage Mining
From Bing Liu, “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data”, Springer. Chapter written by Bamshad Mobasher.
Many slides are from a tutorial given by B. Berendt, B. Mobasher, M. Spiliopoulou.
Data e Web Mining - S. Orlando

Introduction
Web usage mining: the automatic discovery of patterns in clickstreams and associated data, collected or generated as a result of user interactions with one or more Web sites.
Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
The discovered patterns are usually represented as
– collections of pages, objects, or resources that are frequently accessed by groups of users with common interests.

Introduction
Data in Web usage mining:
– Web server logs
– Site contents
– Data about the visitors, gathered from external channels
– Further application data
Not all of these data are always available. When they are, they must be integrated.
A large part of Web usage mining is about processing usage/clickstream data.
– After that, various data mining algorithms can be applied.

Web server logs

Terminology and levels of abstraction

Web usage mining (simplified view)

Web usage mining process

Data preparation

Data cleaning, fusion
Data cleaning
– remove irrelevant references and fields in server logs
– remove references due to spider/robot navigation
– remove erroneous references
– add missing references due to caching (done after sessionization)
Data fusion/integration
– synchronize data from multiple server logs
– integrate e-commerce and application server data
– integrate meta-data (e.g., content labels)
– integrate demographic / registration data

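The first three cleaning steps can be sketched as a simple filter over parsed log records. This is a minimal illustration, not a complete cleaner: the record fields (`url`, `agent`, `status`), the suffix list, and the robot markers are all assumptions made for the example, and a real site would tune them.

```python
from urllib.parse import urlparse

# Hypothetical lists, chosen only for illustration.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_relevant(record):
    """Return True if a parsed log record should be kept for usage mining."""
    path = urlparse(record["url"]).path.lower()
    agent = record["agent"].lower()
    if path.endswith(IRRELEVANT_SUFFIXES):
        return False  # embedded object, not a page the user asked for
    if path == "/robots.txt" or any(m in agent for m in ROBOT_MARKERS):
        return False  # spider/robot navigation
    if record["status"] >= 400:
        return False  # erroneous reference
    return True


def clean(records):
    """Keep only the records that pass all cleaning checks."""
    return [r for r in records if is_relevant(r)]


# Made-up records in an assumed, already-parsed form.
sample_records = [
    {"url": "/index.html", "agent": "Mozilla/5.0", "status": 200},
    {"url": "/img/logo.png", "agent": "Mozilla/5.0", "status": 200},
    {"url": "/robots.txt", "agent": "ExampleBot/1.0", "status": 200},
    {"url": "/products/42", "agent": "Mozilla/5.0", "status": 200},
    {"url": "/missing", "agent": "Mozilla/5.0", "status": 404},
    {"url": "/products/42", "agent": "ExampleSpider/2.0", "status": 200},
]

kept = clean(sample_records)
print([r["url"] for r in kept])
```

Matching robots by substrings of the agent field is only a heuristic; robots that do not identify themselves need behavioral detection, which is outside this sketch.
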
Data transformation
Data transformation
– user identification
– sessionization
– pageview identification
  • a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
– episode identification
Data reduction
– sampling and dimensionality reduction (ignoring certain pageviews / items)
Identifying user transactions
– i.e., sets or sequences of pageviews, possibly with associated weights

Identify sessions (sessionization)
The quality of the patterns discovered in KDD depends on the quality of the data on which mining is applied.
In Web usage analysis, these data are the sessions of the site visitors: the activities performed by a user from the moment she enters the site until the moment she leaves it.
Reliable usage data are difficult to obtain due to
– proxy servers and anonymizers,
– dynamic IP addresses,
– missing references due to caching, and
– the inability of servers to distinguish among different visits.

Sessionization strategies
Session reconstruction =
– correct mapping of activities to different individuals
– + correct separation of activities belonging to different visits of the same individual

User identification

Session uncertainty: evaluate real vs. reconstructed sessions

User identification: an example
Combination of IP address and Agent fields in Web logs

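Combining the IP address and Agent fields can be sketched as a grouping step. The record layout below (`time`, `ip`, `agent`, `url`) is an assumption made for illustration, and the addresses and agent strings are made up.

```python
from collections import defaultdict


def identify_users(records):
    """Group requests by (IP address, agent), treating each pair as one user.

    Returns a dict mapping (ip, agent) to that user's URLs in time order.
    """
    users = defaultdict(list)
    for r in sorted(records, key=lambda r: r["time"]):
        users[(r["ip"], r["agent"])].append(r["url"])
    return dict(users)


# Made-up log records: same IP with two different agents, plus a second IP.
log_records = [
    {"time": 1, "ip": "192.0.2.10", "agent": "AgentX", "url": "A"},
    {"time": 2, "ip": "192.0.2.10", "agent": "AgentY", "url": "B"},
    {"time": 3, "ip": "192.0.2.10", "agent": "AgentX", "url": "C"},
    {"time": 4, "ip": "192.0.2.20", "agent": "AgentX", "url": "D"},
]

users = identify_users(log_records)
for key, urls in users.items():
    print(key, urls)
```

The pair is only a proxy for a person: two visitors behind the same proxy with the same browser collapse into one user, and one visitor using two browsers is split in two.
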
Sessionization heuristics
Structure-oriented heuristics use either the static structure of the site or the implicit linkage structure inferred from the referrer fields.

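One way to use the linkage structure implied by the referrer fields is sketched below. It is a simplified assumption, not the exact rule from the slides: a request joins the most recent session that already contains its referrer, and otherwise opens a new session. The `(page, referrer)` input format is hypothetical.

```python
def sessionize_by_referrer(requests):
    """Split one user's requests into sessions using referrer links.

    requests: list of (page, referrer) pairs in time order;
    referrer is None when the log has no referrer for the request.
    """
    sessions = []
    for page, referrer in requests:
        # Search from the most recent session backwards.
        for session in reversed(sessions):
            if referrer in session:
                session.append(page)
                break
        else:
            # No session contains the referrer: start a new one.
            sessions.append([page])
    return sessions


referrer_requests = [
    ("A", None),
    ("B", "A"),
    ("X", None),
    ("C", "B"),
    ("Y", "X"),
]
referrer_sessions = sessionize_by_referrer(referrer_requests)
print(referrer_sessions)
```
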
Sessionization example: time-oriented heuristic

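A minimal sketch of one time-oriented variant follows: a new session starts whenever the gap between two consecutive requests of the same user exceeds a threshold. The 30-minute default and the `(timestamp, page)` input format are assumptions for illustration; other variants bound the total session duration instead.

```python
def sessionize_by_time(requests, max_gap=30 * 60):
    """Split one user's requests into sessions by inactivity gap.

    requests: list of (timestamp_in_seconds, page) pairs.
    max_gap: assumed threshold; a longer pause opens a new session.
    """
    sessions = []
    current = []
    last_time = None
    for timestamp, page in sorted(requests):
        if last_time is not None and timestamp - last_time > max_gap:
            sessions.append(current)
            current = []
        current.append(page)
        last_time = timestamp
    if current:
        sessions.append(current)
    return sessions


timed_requests = [(0, "A"), (600, "B"), (900, "C"), (4000, "D"), (4300, "E")]
timed_sessions = sessionize_by_time(timed_requests)
print(timed_sessions)
```

Here the pause between the requests at 900 s and 4000 s is 3100 s, which exceeds the 1800 s threshold, so the requests fall into two sessions.
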
Pageview identification
– Depends on the intra-page structure of sites
– Identify the collection of Web files/objects/resources representing a specific “user event” corresponding to a click-through (e.g. viewing a product page, adding a product to a shopping cart)
– In some cases it may be useful to consider pageviews at a higher level of aggregation
  • e.g. they may correspond to many user events related to the same concept category, like the purchase of a product on an e-commerce site

Path completion
Client- or proxy-side caching can often result in missing access references to those pages or objects that have been cached. For instance,
– if a user goes back to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side, and therefore no request is made to the server.
– This results in the second reference to A not being recorded in the server logs.

Path completion
Path completion:
– how to infer missing user references due to caching.
Effective path completion requires extensive knowledge of the link structure within the site.
Referrer information in server logs can also be used to disambiguate the inferred paths.
The problem gets much more complicated in frame-based sites.

Missing references due to caching
Reconstruction by using the knowledge about the site structure
– also inferred from the referrer fields
Many paths are possible
– usually the selected path is the one requiring the fewest “back” references

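The idea can be sketched with a navigation stack, under the assumption that the user returns to earlier pages only through the browser's “back” button and that every request carries its referrer. When the referrer of a request is not the page currently on top of the stack, the sketch walks back until it reaches the referrer and records each cached page seen on the way. The `(page, referrer)` input format is hypothetical.

```python
def complete_path(requests):
    """Infer cached "back" views missing from one session's server log.

    requests: list of (page, referrer) pairs in time order.
    Returns the completed sequence of page views.
    """
    path = []   # completed path, including inferred views
    stack = []  # pages reachable through the "back" button
    for page, referrer in requests:
        if referrer in stack and stack[-1] != referrer:
            # Walk back to the referrer; each step is a cached view
            # that never reached the server.
            while stack[-1] != referrer:
                stack.pop()
                path.append(stack[-1])
        stack.append(page)
        path.append(page)
    return path


logged = [("A", None), ("B", "A"), ("C", "B"), ("D", "C"), ("E", "B")]
completed = complete_path(logged)
print(completed)
```

In this made-up log, E was requested from B although the last logged page is D, so two “back” views (C, then B) are inferred before E.
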
Data modeling for Web Usage Mining
Data preprocessing produces
– a set of pageviews: $P = \{p_1, \ldots, p_n\}$
– a set of user transactions: $T = \{t_1, \ldots, t_m\}$, where each transaction $t_i$ contains a subset of $P$
– Each transaction
  $t = \langle (p_1^t, w(p_1^t)), (p_2^t, w(p_2^t)), \ldots, (p_l^t, w(p_l^t)) \rangle$
  is an $l$-length ordered sequence of pageviews, where each $w$ corresponds to a weight, e.g. the significance of the pageview
– In collaborative filtering, these weights correspond to explicit user ratings
– In transactions collected from the Web, they can correspond to the duration of the page visit in the session

Data modeling for Web Usage Mining (cont.)
In many mining tasks, the sequential ordering of the pageviews in a transaction is not important (e.g. clustering, association rule extraction).
In this case a transaction can be represented as an $n$-length vector:
  $t = (w_1^t, w_2^t, \ldots, w_n^t)$
where the weight is 0 if the corresponding page is not present in $t$; otherwise it corresponds to the significance of the page in $t$.

m × n user-pageviews matrix (or transaction matrix):

          page A  page B  page C  page D  page E
user 0        15       4       1       0       0
user 1         2       0      25       0       0
user 2       200       1       0       0       3
user 3        56       0       0       4       4
user 4         0       0      23      50       0
user 5         0       0       5       3       0

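The conversion from an ordered transaction to an n-length vector can be sketched as follows. The page names match the matrix on this slide; summing the weights of a page that is visited more than once in the same transaction is an assumption of this sketch.

```python
PAGES = ["A", "B", "C", "D", "E"]  # the n pageviews, in a fixed column order


def to_vector(transaction, pages=PAGES):
    """Turn a sequence of (pageview, weight) pairs into an n-length vector.

    Pages absent from the transaction get weight 0; ordering is discarded.
    """
    index = {page: i for i, page in enumerate(pages)}
    vector = [0] * len(pages)
    for page, weight in transaction:
        vector[index[page]] += weight  # assumed: repeated visits add up
    return vector


# The transaction behind the "user 0" row of the matrix above.
user0 = [("A", 15), ("B", 4), ("C", 1)]
user0_vector = to_vector(user0)
print(user0_vector)
```

Stacking one such vector per transaction gives the m × n user-pageviews matrix.
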
Data modeling for Web Usage Mining (cont.)
Given a user-pageview matrix, a number of unsupervised mining techniques can be exploited.

m × n user-pageviews matrix (or transaction matrix):

          page A  page B  page C  page D  page E
user 0        15       4       1       0       0
user 1         2       0      25       0       0
user 2       200       1       0       0       3
user 3        56       0       0       4       4
user 4         0       0      23      50       0
user 5         0       0       5       3       0

– Clustering of transactions/sessions, to determine important visitor segments
– Clustering of pageviews (items) expressed in terms of user judgments, to discover important relationships between pageviews (items)
– Sequential (timestamps must be maintained) and non-sequential association rules, to discover important relationships between pageviews (items)

Data modeling for Web Usage Mining (cont.)
Automatic integration of content information
– textual features from the Web contents represent the underlying semantics of the pages
– the aim is to transform a user-pageviews matrix into a content-enhanced transaction matrix

n × r pageviews-terms matrix:

          food  news  car  house  party  sky
page A       0     1    1      0      0    0
page B       1     0    0      1      0    0
page C       1     1    0      0      0    0
page D       0     0    1      0      0    1
page E       0     0    0      1      1    0

Data modeling for Web Usage Mining (cont.)

P = n × r pageviews-terms matrix:

          food  news  car  house  party  sky
page A       0     1    1      0      0    0
page B       1     0    0      1      0    0
page C       1     1    0      0      0    0
page D       0     0    1      0      0    1
page E       0     0    0      1      1    0

U = m × n user-pageviews matrix (or transaction matrix):

          page A  page B  page C  page D  page E
user 0         1       1       0       0       0
user 1         0       0       1       0       0
user 2         1       0       0       0       1
user 3         1       0       0       1       1
user 4         0       0       1       1       0
user 5         0       0       1       0       0

U × P = m × r content-enhanced transaction matrix:

          food  news  car  house  party  sky
user 0       1     1    1      1      0    0
user 1       1     1    0      0      0    0
user 2       0     1    1      1      1    0
user 3       0     1    2      1      1    1
user 4       1     1    1      0      0    1
user 5       1     1    0      0      0    0

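The product U × P from this slide can be reproduced with a few lines of plain Python. The matrices are copied from the slide; the helper name `matmul` is just an illustrative choice.

```python
TERMS = ["food", "news", "car", "house", "party", "sky"]

# P: n x r pageviews-terms matrix (rows: pages A..E).
P = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

# U: m x n user-pageviews matrix (rows: users 0..5).
U = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
]


def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in left]


content_enhanced = matmul(U, P)  # m x r content-enhanced transaction matrix
for user_id, row in enumerate(content_enhanced):
    print(f"user {user_id}", dict(zip(TERMS, row)))
```

Each row of the result sums the term vectors of the pages a user visited; for example, user 3 visited pages A, D, and E, so the weight of "car" (present in both A and D) is 2.
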