Effective Change Detection Using Sampling
Junghoo "John" Cho, Alexandros Ntoulas
UCLA

Problem
[Figure: a local database and a remote database, with arrows labeled Polling, Update, and Query]
• Application
  • Web search engines/crawlers
  • Web archive
  • Data warehouse
  • ...
Junghoo "John" Cho (UCLA Computer Science)

Existing Approach
• Round robin
  • Download pages in a round-robin manner
• Change-frequency based [CLW98, CGM00, EMT01]
  • Estimate the change frequency
  • Adjust download frequency
  • Proven to be optimal

Our Approach
• Sampling-based
  • Sample k pages from each source
  • Download more pages from the source with more changed samples

Comparison
• Frequency based
  • Proven to be optimal
  • Change history required
  • Difficult to estimate change frequency
• Sampling based
  • Can be worse than the frequency-based policy
  • No history/frequency estimation required
  • Experimental comparison later

Questions
• Are we assuming correlation?
• How to use sampling results?
  • Proportional vs. Greedy
• How many samples?
  • Dynamic sample size adjustment?
• What if we have very limited resources?

Is Correlation Necessary?
[Figure: random sampling from two sources; 4/5 of the samples changed in one, 1/5 in the other]
• Correlation is not necessary; only random sampling is required
• More discussion later

Download Model (1)
• Fixed download cycle
  • Say, once a month
• Fixed download resources in each cycle
  • Say, 100,000 page downloads every month
• Goal
  • Download as many changes as we can
  • ChangeRatio = (no. of changed & downloaded pages) / (no. of downloaded pages)

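The ChangeRatio metric can be computed directly once we know which downloaded pages had actually changed. A minimal sketch, assuming pages are identified by hashable IDs held in Python sets (the function name and the set representation are illustrative, not taken from the slides):

```python
def change_ratio(downloaded, changed):
    """Fraction of downloaded pages that had actually changed.

    `downloaded` and `changed` are sets of page IDs (hypothetical
    representation chosen for this sketch).
    """
    if not downloaded:
        return 0.0  # assumption: define the ratio as 0 when nothing was downloaded
    return len(downloaded & changed) / len(downloaded)


downloaded = {"a1", "a2", "a3", "b1"}
changed = {"a1", "a2", "b7"}  # b7 changed but was not downloaded
print(change_ratio(downloaded, changed))  # 0.5
```

Changed pages that were never downloaded (b7 here) do not count toward the ratio; the metric only rewards downloads that found a change.
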
Download Model (2)
• Two-stage sampling policy
  • Sampling stage
  • Download stage
• Sampling requires page downloads

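The sampling stage can be sketched as follows. This is only an illustration under stated assumptions: `sites` maps a site name to its list of page IDs, and `has_changed` is a hypothetical callback that stands in for downloading a page and comparing it with the local copy. Every sampled page costs one download, so the function also reports how much of the budget the sampling consumed.

```python
import random


def sampling_stage(sites, has_changed, k, rng=None):
    """Sample up to k pages per site and count how many had changed.

    Returns (results, downloads_used), where results maps
    site -> (changed_samples, sample_size).
    """
    rng = rng or random.Random()
    results = {}
    downloads_used = 0
    for site, pages in sites.items():
        sample = rng.sample(pages, min(k, len(pages)))  # sample without replacement
        changed = sum(1 for page in sample if has_changed(page))
        results[site] = (changed, len(sample))
        downloads_used += len(sample)  # each sample is a real page download
    return results, downloads_used


sites = {
    "A": [f"a{i}" for i in range(20)],
    "B": [f"b{i}" for i in range(20)],
}
# Toy change oracle (assumption): every page of site A changed, none of B.
results, used = sampling_stage(sites, lambda page: page.startswith("a"), k=5)
print(results, used)
```

Whatever budget is left after `downloads_used` is what the download stage can spend.
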
How to Use Sampling Results?
• Sites A and B, each with 20 pages
• 20 total downloads, 5 samples from each site
• 10 page downloads remaining
[Figure: sites A and B; 4/5 of A's samples changed, 1/5 of B's samples changed]

Proportional Policy
• Download pages in proportion to the detected changes
  • 8 pages from A, 2 pages from B
[Figure: sites A and B; 4/5 of A's samples changed, 1/5 of B's samples changed]

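A minimal sketch of the proportional split for the A/B example. The function name, the dictionary input, and the rounding rule (largest remainder) are assumptions made for this illustration; the slide does not say how fractional shares should be rounded.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to changed samples.

    changed_samples maps site -> number of sampled pages found changed.
    """
    total = sum(changed_samples.values())
    if total == 0:
        # Assumption: with no detected changes, allocate nothing here.
        return {site: 0 for site in changed_samples}
    exact = {site: budget * c / total for site, c in changed_samples.items()}
    alloc = {site: int(share) for site, share in exact.items()}
    leftover = budget - sum(alloc.values())
    # Hand out the remaining downloads to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for site in by_fraction[:leftover]:
        alloc[site] += 1
    return alloc


# 4 of 5 samples changed in A, 1 of 5 in B, 10 downloads remaining.
print(proportional_allocation({"A": 4, "B": 1}, 10))  # {'A': 8, 'B': 2}
```
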
Greedy Policy
• Download pages from the sites with the most changes
  • 10 pages from A
[Figure: sites A and B; 4/5 of A's samples changed, 1/5 of B's samples changed]

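The greedy choice for the same example can be sketched like this. It assumes every site was sampled with the same sample size, so ranking by changed-sample count equals ranking by change fraction; `remaining_pages` (the pages of each site not yet downloaded during sampling) is a name introduced for this sketch.

```python
def greedy_allocation(changed_samples, remaining_pages, budget):
    """Spend the budget on sites in decreasing order of detected changes."""
    alloc = {site: 0 for site in changed_samples}
    ranked = sorted(changed_samples, key=changed_samples.get, reverse=True)
    for site in ranked:
        if budget == 0:
            break
        take = min(remaining_pages[site], budget)  # cannot exceed what the site has left
        alloc[site] = take
        budget -= take
    return alloc


# 20 pages per site, 5 already sampled from each, 10 downloads remaining.
print(greedy_allocation({"A": 4, "B": 1}, {"A": 15, "B": 15}, 10))  # {'A': 10, 'B': 0}
```

If the top-ranked site runs out of pages, the leftover budget simply moves on to the next site in the ranking.
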
Optimality of Greedy
• Theorem
  • Greedy is optimal if we make download decisions purely based on sampling results
  • The optimality is probabilistic: it holds for the expected values

How Many Samples?
• Too few samples
  • Inaccurate change estimates
• Too many samples
  • "Waste" of resources for sampling
• How to determine the optimal sample size?

Optimal Sample Size
• Factors to consider
  • Total number of pages that we maintain
  • Number of pages that we can download in the current cycle
  • Number of pages in each Web site
  • Change distribution
    • Scenario 1: A: 90/100, B: 10/100
    • Scenario 2: A: 60/100, B: 40/100

Change Fraction Distribution
[Figure: f(ρ), the fraction of sites, plotted against ρ, with ρ_t marked on the ρ axis]
• ρ_i: fraction of changed pages in site i
• f(ρ): distribution of ρ values

Optimal Sample Size
$$\sqrt{\frac{N r \, f(\rho_t)}{6(\rho_r - \rho)}}$$
• N: no. of pages in a site
• r: (no. of pages to download) / (no. of pages we maintain)
• Analysis is complex
• $\sqrt{Nr}$ is a good rule of thumb

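A quick numeric check of the rule of thumb, read here as the square root of N times r (an assumption: that reading reproduces the value of about 20 quoted with the sample-size experiment). The inputs are the figures from the Experiments slide: about 1,400 pages per site, and 100,000 downloads per cycle out of 353,000 maintained pages. The function name is illustrative.

```python
import math


def rule_of_thumb_sample_size(pages_per_site, download_budget, pages_maintained):
    """sqrt(N * r), with r = download budget / pages maintained."""
    r = download_budget / pages_maintained
    return math.sqrt(pages_per_site * r)


size = rule_of_thumb_sample_size(1400, 100_000, 353_000)
print(round(size))  # 20
```
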
Dynamic Sample Size?
• Do we need the same sample size for every site?
• A: ρ = 0, B: ρ = 0.45, C: ρ = 0.55, D: ρ = 1

Adaptive Sampling
• If the estimated ρ is high/low enough, make an early decision
• What does "high enough" mean?
  • Confidence interval above threshold
[Figure: confidence intervals around three estimates ρ_i on the ρ axis, shown relative to the threshold ρ_t]

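One way to make the early decision concrete is to stop sampling a site as soon as a confidence interval for its ρ lies entirely above or below the threshold. The sketch below uses a Wilson score interval with z = 1.96; that interval, the `max_samples` cap, and the fallback to the point estimate are all assumptions of this illustration, not details given on the slide.

```python
import math


def wilson_interval(changed, sampled, z=1.96):
    """Wilson score interval for a proportion (changed out of sampled)."""
    if sampled == 0:
        return 0.0, 1.0
    p = changed / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def adaptive_decision(sample_results, threshold, max_samples, z=1.96):
    """Consume sample outcomes (True = page changed) until the decision is clear.

    Returns (decision, samples_used), where decision is "download" or "skip".
    """
    changed = 0
    sampled = 0
    for is_changed in sample_results:
        if sampled >= max_samples:
            break
        sampled += 1
        changed += bool(is_changed)
        low, high = wilson_interval(changed, sampled, z)
        if low > threshold:   # interval entirely above the threshold
            return "download", sampled
        if high < threshold:  # interval entirely below the threshold
            return "skip", sampled
    # Undecided after all samples: fall back to the point estimate (assumption).
    estimate = changed / sampled if sampled else 0.0
    return ("download" if estimate > threshold else "skip"), sampled


print(adaptive_decision([True] * 20, threshold=0.5, max_samples=20))  # ('download', 4)
```

A site whose samples all agree is settled after a handful of pages, while a site whose ρ is close to the threshold uses the full sample budget.
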
In the Paper
• More details on
  • Optimal sample size
  • Adaptive policy
  • The cases where resources are too limited for sampling

Experiments
• 353,000 pages from 252 sites
  • Mostly popular sites
    • Yahoo, CNN, Microsoft, …
  • ~1,400 pages from each site
    • Followed the links in a breadth-first manner
• Monthly change history for 6 months
  • 5 download cycles
• In experiments, 100,000 page downloads in each download cycle

Comparison of Policies
[Figure: bar chart of ChangeRatio (0 to 0.8) for the policies RR (round robin), FRQ (frequency based), PRP (proportional), GRD (greedy), and ADP (adaptive)]

Optimal Sample Size
[Figure: ChangeRatio (0.4 to 0.8) versus sample size (0 to 250)]
• Optimal sample size: roughly 10 through 60
• $\sqrt{Nr} \approx 20$

Comparison of Long-Term Performance
• Problem: we have data for only 5 download cycles
• Solution: extrapolate the history by repeating it

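Repeating a short observed history to simulate many download cycles can be sketched as below. The function name and the list-of-cycles representation are hypothetical; each element stands for whatever per-cycle change data was recorded.

```python
from itertools import cycle, islice


def extrapolate_history(observed_cycles, total_cycles):
    """Extend a short history to `total_cycles` entries by repeating it in order."""
    return list(islice(cycle(observed_cycles), total_cycles))


history = ["c1", "c2", "c3", "c4", "c5"]  # the 5 observed download cycles
print(extrapolate_history(history, 12))
```

Repetition assumes that each page keeps changing the way it did during the observed cycles, which favors any policy that relies on a stable change frequency.
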
Frequency vs. Sampling
[Figure: ChangeRatio (0.5 to 0.9) versus download cycle (0 to 400) for the Frequency and Greedy policies]

Related Work
• Frequency-based policy
  • Coffman et al., Journal of Scheduling 1998
  • Cho et al., SIGMOD 2000
  • Edwards et al., WWW 2001
• Source cooperation
  • Olston et al., SIGMOD 2002

Conclusion
• Sampling-based policy
  • Great short-term performance
  • No change history required
• Frequency-based policy
  • Potentially good long-term performance if the change frequency does not change
• Greedy is easy to implement and shows high performance

Future Work
• Combination of sampling-based and frequency-based policies
  • Switch to the frequency-based policy after a while
• Good partitioning for sampling?
  • Site based? Directory based?
  • Content based?
  • Link-structure based?