WindMine: Fast and Effective Mining of Web-click Sequences
Yasushi Sakurai (NTT), Lei Li (Carnegie Mellon Univ.), Yasuko Matsubara (Kyoto Univ.), Christos Faloutsos (Carnegie Mellon Univ.)
SDM 2011, Y. Sakurai et al.

Introduction
Web-click sequence applications (web masters and web-site owners):
- Capacity planning
- Intrusion detection
- Advertisement design
Goal:
- Find meaningful patterns in web-click data (e.g., the lunch-break trend, huge spikes, anomalies)
- Find periodicity (daily, weekly, etc.)
- Determine suitable window sizes automatically

Introduction: examples
[Figure: original web-click sequence, access count from a business news site]

Problem definition
- Web-click sequences of m URLs: (X_1, …, X_m)
- Web-click sequence X of duration n: X = (x_1, …, x_t, …, x_n)
Local component analysis: given m sequences of duration n, (X_1, …, X_m),
- Find patterns, the main components of the sequences
- Find the ‘best window’ size w for the analysis
Final challenge: a scalable algorithm for local component analysis

Background
Independent component analysis (ICA)
- PCA vs. ICA
[Figure: principal components PC 1 and PC 2 compared with independent components IC 1 and IC 2]

Why not ‘PCA’?
Example of component analysis
[Figure: source signals (Source) and their mixture (Mix)]

Why not ‘PCA’? (continued)
Example of component analysis
[Figure: components recovered by PCA and by ICA]
ICA recognizes the components successfully and separately.

Main idea (1): Multi-scale local component analysis
- Divide a sequence into subsequences of length w
- Compute the local components from the window matrix
[Figure: for w = 2, the original sequence X = (a, b, c, d, e, f, g, h) over time is cut into the rows (a b), (c d), (e f), (g h) of the window matrix X̂, from which the local components B are computed]
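
The windowing step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `window_matrix` is hypothetical, the windows are taken as non-overlapping, and a trailing remainder shorter than w is dropped, since the slide does not say how a remainder is handled. The component extraction itself (ICA) is not shown.

```python
def window_matrix(x, w):
    """Split sequence x into non-overlapping subsequences of length w.

    Each subsequence becomes one row of the window matrix.
    Assumption: values that do not fill a complete window are dropped.
    """
    if w <= 0:
        raise ValueError("window size w must be positive")
    m = len(x) // w  # number of complete subsequences
    return [list(x[i * w:(i + 1) * w]) for i in range(m)]


# The slide's example: eight values, window size w = 2.
x = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(window_matrix(x, 2))

# "Multi-scale" means repeating the split for several candidate sizes.
for w in (2, 4):
    print(w, window_matrix(x, w))
```

Running the same split for several values of w gives one window matrix per scale, which is what the later window-size selection compares.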

Main idea (2): Best window size selection
Q: How can we estimate a ‘good window size’ automatically when we have multiple sequences?
Proposed criterion: CEM (Component Entropy Maximization)
- Estimate the optimal window size w for the sequence set
- Compute the entropy of the weight values of the mixing matrix A
- ‘Popular’ (widely used) components show high CEM scores

Main idea (2): CEM criterion
- CEM score of the j-th component for the window size w, where k is the number of components and M is the number of subsequences:
\[ C_{w,j} = -\frac{1}{w} \sum_{i} p_{i,j} \log p_{i,j} \qquad (i = 1, \dots, M;\; j = 1, \dots, k) \]
- Probability for the j-th component (size of the j-th component's contribution to each subsequence):
\[ p_{i,j} = \frac{a'_{i,j}}{\sum_{i} a'_{i,j}} \]
- Normalized weight values for each subsequence:
\[ a'_{i,j} = \frac{a_{i,j}^{2}}{\sum_{j} a_{i,j}^{2}} \]
- Mixing matrix: \( A_w = [a_{i,j}] \)
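
As a rough illustration of the criterion on this slide, the sketch below computes a CEM score per component from a mixing matrix given as a list of M rows with k weights each. The function name `cem_scores` is hypothetical, the natural logarithm is an assumption (the slide does not state the base), and the 1/w factor follows the slide as read here. Terms with zero probability are treated as contributing zero entropy.

```python
import math


def cem_scores(A, w):
    """Return one CEM score per component (column) of mixing matrix A.

    Steps, following the slide:
      1. normalize squared weights within each subsequence (row),
      2. turn each column into a probability distribution over rows,
      3. take the entropy of that distribution, scaled by 1/w.
    Assumptions: natural log; zero rows or columns give zero contribution.
    """
    M, k = len(A), len(A[0])
    normalized = []
    for row in A:
        total = sum(a * a for a in row)
        normalized.append([a * a / total if total > 0 else 0.0 for a in row])

    scores = []
    for j in range(k):
        col_sum = sum(normalized[i][j] for i in range(M))
        entropy = 0.0
        if col_sum > 0:
            for i in range(M):
                p = normalized[i][j] / col_sum
                if p > 0:
                    entropy -= p * math.log(p)
        scores.append(entropy / w)
    return scores


# Component 0 appears in every subsequence; component 1 in only one.
A = [[1.0, 0.0],
     [1.0, 0.0],
     [1.0, 1.0]]
print(cem_scores(A, w=2))
```

In this toy matrix the first component is spread across all three subsequences and so receives the higher score, which matches the remark that widely used components show high CEM scores.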

WindMine-part: Efficient solution
Q: How do we efficiently extract the best local components from large sequence sets?
Hierarchical partitioning approach: WindMine-part
- Partition the original window matrix into sub-matrices
- Extract local components from each of the sub-matrices
- Reuse the local components for the component analysis at the next higher level
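
WindMine-part
[Figure: the original sequence X is arranged into a window matrix; at Level 1 the matrix is partitioned into sub-matrices and ICA extracts local components from each; at Level 2 the local components are partitioned again and ICA is repeated, and so on]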
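
The partition-and-reuse control flow described for WindMine-part might look roughly like the sketch below. It is not the authors' implementation: the names `windmine_part_sketch`, `group_size`, and `extract` are hypothetical, and the demo uses a column-wise mean as a stand-in for ICA purely so the example runs without extra libraries. Only the level-by-level structure is the point.

```python
def windmine_part_sketch(rows, group_size, extract):
    """Reduce a window matrix level by level.

    rows:       window-matrix rows (one subsequence per row)
    group_size: rows per sub-matrix (hypothetical parameter, >= 2)
    extract:    callable mapping a sub-matrix to a list of component rows;
                the slides use ICA here, any reducer works for this sketch
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level = rows
    while len(level) > group_size:
        next_level = []
        # Partition the current level into sub-matrices and extract from each.
        for start in range(0, len(level), group_size):
            next_level.extend(extract(level[start:start + group_size]))
        if len(next_level) >= len(level):
            raise ValueError("extract must return fewer rows than it receives")
        level = next_level  # reuse the components at the next level
    return extract(level)


def mean_row(sub_matrix):
    """Stand-in for ICA (assumption): one 'component' = column-wise mean."""
    n = len(sub_matrix)
    return [[sum(col) / n for col in zip(*sub_matrix)]]


rows = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(windmine_part_sketch(rows, group_size=2, extract=mean_row))
```

Because each sub-matrix is processed independently within a level, the inner loop is the natural place to parallelize, which fits the conclusion's description of the method as parallelizable.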

Experimental results
Experiments with real datasets:
- Ondemand TV, WebClick, Automobile, Temperature, Sunspots
Evaluation:
- Accuracy of pattern discovery
- Accuracy of the best window size
- Computation time

Pattern discovery: Ondemand TV
[Figure: original sequence (access count of users) and the extracted components: weekly pattern, daily pattern, anomaly spikes]
PCA: failed

Pattern discovery: WebClick, Q & A site
- Increase from morning to night, reaching a peak
- Low activity during sleeping time
- Dip at dinner time
- Weekly pattern

Pattern discovery: WebClick, job-seeking site
- Workers arrive at their office
- Large spike during the lunch break
- Job seeking during a short break
- High activity on weekdays (daily access decreases as the weekend approaches)

Pattern discovery: WebClick, other websites
- High activity 8am-11pm on weekdays (business purposes)
- Educational site for kids (they visit here after school, 3pm)
- Website for baby nursery (the main users will be their parents, rather than babies!)

Pattern discovery: WebClick, other websites
- Access count increases after meal times
- The users visit three times a day (early morning, noon, early evening)
- The users rarely visit here late in the evening (which is indeed good for their health!)
- Access count is still high in the night, 0am-1am (healthy diet should include an earlier bed time!)

Pattern discovery: Generalization of WindMine

Choice of best window size
CEM score for various window sizes

Computation time
Wall clock time vs. number of subsequences
- Up to 70 times faster

Computation time
Wall clock time vs. duration

Conclusions
Scalable pattern extraction and anomaly detection in large web-click sequences
1. A scalable, parallelizable method for breaking sequences into a few fundamental ingredients
2. Scales linearly with the sequence duration and near-linearly with the number of sequences