Frequency Counts over Data Streams. Gurmeet Singh Manku, Stanford University, USA. August 21, 2002. VLDB 2002.
The Problem ... Stream: Identify all elements whose current frequency exceeds support threshold s = 0.1%.
A Related Problem ... Stream: Identify all subsets of items whose current frequency exceeds s = 0.1%. Frequent Itemsets / Association Rules.
Applications: Flow Identification at IP Router [EV01]. Iceberg Queries [FSGM+98]. Iceberg Datacubes [BR99, HPDW01]. Association Rules & Frequent Itemsets [AS94, SON95, Toi96, Hid99, HPY00, …].
Presentation Outline ...
1. Lossy Counting
2. Sticky Sampling
3. Algorithm for Frequent Itemsets
Algorithm 1: Lossy Counting. Step 1: Divide the stream into ‘windows’ (Window 1, Window 2, Window 3). Is window size a function of support s? Will fix later…
Lossy Counting in Action ... [Figure: frequency counts (initially empty) + first window.] At window boundary, decrement all counters by 1.
Lossy Counting continued ... [Figure: frequency counts + next window.] At window boundary, decrement all counters by 1.
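The two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `lossy_count` and `frequent` are hypothetical, the window size is taken as 1/ε from the Error Analysis slide, and counters that reach zero are dropped (an assumption; the slides only say "decrement").

```python
import math
from collections import defaultdict


def lossy_count(stream, eps):
    """Slide-level Lossy Counting: count every element, and subtract 1
    from all counters at each window boundary."""
    window_size = math.ceil(1 / eps)      # window size = 1/eps
    counts = defaultdict(int)
    n = 0
    for x in stream:
        counts[x] += 1
        n += 1
        if n % window_size == 0:          # window boundary reached
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:      # assumption: drop empty counters
                    del counts[key]
    return dict(counts), n


def frequent(counts, n, s, eps):
    """Report elements whose counter is at least (s - eps) * n."""
    return {x: c for x, c in counts.items() if c >= (s - eps) * n}
```

With `eps = 0.2` the window holds 5 elements, so a 10-element stream triggers two decrement passes and each surviving counter undercounts by at most 2.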
Error Analysis. How much do we undercount? If current size of stream = N and window-size = 1/ε, then frequency error ≤ #windows = εN. Rule of thumb: Set ε = 10% of support s. Example: Given support frequency s = 1%, set error frequency ε = 0.1%.
Output: Elements with counter values exceeding sN - εN. Approximation guarantees: Frequencies underestimated by at most εN. No false negatives. False positives have true frequency at least sN - εN. How many counters do we need? Worst case: 1/ε log(εN) counters [See paper for proof].
Enhancements ... Frequency Errors: For counter (X, c), true frequency in [c, c + εN]. Trick: Remember window-ids. For counter (X, c, w), true frequency in [c, c + w - 1]. If (w = 1), no error! Batch Processing: Decrements after k windows.
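A hedged sketch of the window-id trick follows. The slide gives only the interval [c, c + w - 1]; the deletion test used here (drop a counter at the end of window b when c + w - 1 ≤ b) is an assumption chosen to match that interval, and the function names are hypothetical.

```python
def lossy_count_with_window_ids(stream, eps):
    """Keep (c, w) per element: c = occurrences counted since the counter
    was created, w = 1-based id of the window in which it was created."""
    window_size = int(round(1 / eps))
    entries = {}                                   # element -> [c, w]
    n = 0
    for x in stream:
        n += 1
        window_id = (n - 1) // window_size + 1     # id of the current window
        if x in entries:
            entries[x][0] += 1
        else:
            entries[x] = [1, window_id]
        if n % window_size == 0:                   # window boundary
            # Assumed deletion rule: the upper bound c + w - 1 has fallen
            # to the number of windows seen so far.
            stale = [k for k, (c, w) in entries.items()
                     if c + w - 1 <= window_id]
            for k in stale:
                del entries[k]
    return entries, n


def frequency_bounds(c, w):
    """Interval containing the true frequency of a counter (X, c, w)."""
    return c, c + w - 1
```

A counter created in window 1 and never deleted has seen every occurrence, so its interval collapses to a single value, which is the "no error" case on the slide.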
Algorithm 2: Sticky Sampling. [Figure: sample stream of elements.] Create counters by sampling. Maintain exact counts thereafter. What rate should we sample?
Sticky Sampling contd... For finite stream of length N: Sampling rate = 2/(Nε) log 1/(sδ), where δ = probability of failure. Output: Elements with counter values exceeding sN - εN. Approximation guarantees (probabilistic): Frequencies underestimated by at most εN. No false negatives. False positives have true frequency at least sN - εN. Same error guarantees as Lossy Counting, but probabilistic. Same rule of thumb: Set ε = 10% of support s. Example: Given support threshold s = 1%, set error threshold ε = 0.1%, set failure probability δ = 0.01%.
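For the finite-stream case, the slide's recipe can be sketched as below. Several things are assumptions rather than statements from the slides: the logarithm is taken as natural, the rate is clamped to 1 when the formula exceeds it, and the names `sticky_sampling` and `frequent` are hypothetical.

```python
import math
import random


def sticky_sampling(stream, n, s, eps, delta, rng=None):
    """Sticky Sampling for a finite stream of known length n."""
    if rng is None:
        rng = random.Random()
    # Sampling rate = 2/(n*eps) * log(1/(s*delta)), clamped to 1 (assumption).
    rate = min(1.0, 2.0 / (n * eps) * math.log(1.0 / (s * delta)))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1             # exact counting once a counter exists
        elif rng.random() < rate:
            counts[x] = 1              # create the counter by sampling
    return counts


def frequent(counts, n, s, eps):
    """Report elements whose counter is at least (s - eps) * n."""
    return {x: c for x, c in counts.items() if c >= (s - eps) * n}
```

Because a counter only counts occurrences seen after it was created, every reported count is a lower bound on the true frequency; the probabilistic part of the guarantee concerns how far below it can fall.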
Sampling rate? Finite stream of length N: Sampling rate = 2/(Nε) log 1/(sδ). Infinite stream with unknown N: Gradually adjust sampling rate (see paper for details). In either case, expected number of counters = 2/ε log 1/(sδ). Independent of N!
Sticky Sampling, expected: 2/ε log 1/(sδ) counters. Lossy Counting, worst case: 1/ε log(εN) counters. [Figure: number of counters versus stream length N, on linear and log10 scales, for support s = 1% and error ε = 0.1%.]
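The two bounds are easy to evaluate numerically. The helper names below are hypothetical, and natural logarithms are assumed since the slides do not state the base.

```python
import math


def sticky_expected_counters(s, eps, delta):
    """Expected counters for Sticky Sampling: 2/eps * log(1/(s*delta))."""
    return 2.0 / eps * math.log(1.0 / (s * delta))


def lossy_worst_case_counters(eps, n):
    """Worst-case counters for Lossy Counting: 1/eps * log(eps*n)."""
    return 1.0 / eps * math.log(eps * n)
```

With s = 1%, ε = 0.1% and δ = 0.01%, the Sticky Sampling expression evaluates to roughly 27,600 regardless of N, while the Lossy Counting bound is roughly 6,900 at N = 10^6 and grows only logarithmically with N.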
From elements to sets of elements …
Frequent Itemsets Problem ... Stream: Identify all subsets of items whose current frequency exceeds s = 0.1%. Frequent Itemsets => Association Rules.
Three Modules: TRIE, SUBSET-GEN, BUFFER.
Module 1: TRIE. Compact representation of frequent itemsets in lexicographic order. [Figure: trie of sets with frequency counts.]
Module 2: BUFFER. [Figure: Windows 1 to 6 in main memory.] Compact representation as sequence of ints. Transactions sorted by item-id. Bitmap for transaction boundaries.
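One way to picture the BUFFER layout is the sketch below: all item-ids go into a single flat integer array, and one bit per position marks where a transaction ends. The function names are hypothetical, and using a Python integer as the bitmap is an illustrative choice, not the paper's data structure.

```python
from array import array


def pack_transactions(transactions):
    """Pack transactions into a flat int array plus a boundary bitmap.
    Bit i of the bitmap is set when items[i] ends a transaction."""
    items = array("i")
    bitmap = 0
    for t in transactions:
        ordered = sorted(set(t))          # items sorted by item-id
        if not ordered:
            continue                      # skip empty transactions
        items.extend(ordered)
        bitmap |= 1 << (len(items) - 1)   # mark the transaction boundary
    return items, bitmap


def unpack_transactions(items, bitmap):
    """Recover the list of transactions from the packed form."""
    out, current = [], []
    for pos, item in enumerate(items):
        current.append(item)
        if (bitmap >> pos) & 1:
            out.append(current)
            current = []
    return out
```

Packing `[[3, 1, 2], [5], [4, 2]]` stores six integers and sets bits 2, 3 and 5, so no per-transaction pointers or length fields are needed.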
Module 3: SUBSET-GEN. Input: BUFFER. Output: frequency counts of subsets in lexicographic order. [Figure: example stream of frequency counts.]
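A naive stand-in for SUBSET-GEN is sketched below to show the shape of its output: (subset, count) pairs in lexicographic order. It enumerates every subset by brute force, so it is not the fast implementation the talk refers to; the function name and the `max_size` cap are hypothetical.

```python
from collections import Counter
from itertools import combinations


def subset_counts(transactions, max_size=None):
    """Count every subset of every transaction and return the
    (subset, count) pairs sorted lexicographically."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        top = len(items) if max_size is None else min(max_size, len(items))
        for k in range(1, top + 1):
            counts.update(combinations(items, k))
    return sorted(counts.items())
```

Tuples compare lexicographically in Python, so `(1,)` sorts before `(1, 2)`, which sorts before `(2,)`; this matches a preorder walk of a trie keyed by item-id.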
Overall Algorithm ... [Figure: BUFFER feeds SUBSET-GEN; its frequency counts are combined with TRIE to produce new TRIE.] Problem: Number of subsets is exponential!
SUBSET-GEN Pruning Rules. A-priori Pruning Rule: If set S is infrequent, every superset of S is infrequent. Lossy Counting Pruning Rule: At each ‘window boundary’, decrement TRIE counters by 1. Actually, ‘Batch Deletion’: At each ‘main memory buffer’ boundary, decrement all TRIE counters by b. See paper for details ...
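The two rules can be combined in a small level-wise sketch. It follows the slide's wording only: batch counts are added to the old counters, every counter is decremented by b, and a set is counted only if all of its immediate subsets survived. The dictionary-based trie, the function name and the `max_size` cap are assumptions; the paper's actual update rule is more detailed.

```python
from collections import Counter
from itertools import combinations


def update_trie(trie, batch, b, max_size=3):
    """trie: dict mapping sorted item tuples to counters.
    batch: list of transactions held in the buffer.
    b: number of windows covered by the batch."""
    new_trie = {}
    for k in range(1, max_size + 1):
        level = Counter()
        for t in batch:
            for subset in combinations(sorted(set(t)), k):
                # A-priori rule: skip a set if any (k-1)-subset was pruned.
                if k > 1 and any(sub not in new_trie
                                 for sub in combinations(subset, k - 1)):
                    continue
                level[subset] += 1
        # Old entries absent from the batch still receive the decrement.
        candidates = set(level) | {x for x in trie if len(x) == k}
        for x in candidates:
            c = trie.get(x, 0) + level[x] - b   # batch deletion: minus b
            if c > 0:
                new_trie[x] = c
    return new_trie
```

Starting from an empty trie with the batch `[[1, 2], [1, 2], [1, 2, 3], [2]]` and b = 1, item 3 is pruned at level 1, so none of the sets containing it are ever counted.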
Bottlenecks ... [Figure: BUFFER, SUBSET-GEN, TRIE, new TRIE.] TRIE: Consumes main memory. SUBSET-GEN: Consumes CPU time.
Design Decisions for Performance. TRIE (main memory bottleneck): Compact linear array of (element, counter, level) in preorder traversal. No pointers! Tries are on disk; all of main memory devoted to BUFFER. Pair of tries: old and new (in chunks). mmap() and madvise(). SUBSET-GEN (CPU bottleneck): Very fast implementation. See paper for details.
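The pointer-free layout can be illustrated as follows. A nested dictionary stands in for the trie, and `flatten` writes it out as (element, counter, level) triples in preorder; `itemsets` then recovers each set from the array alone. Both function names and the nested-dict input format are hypothetical.

```python
def flatten(trie, level=0, out=None):
    """trie: {item: (count, child_trie)}. Returns a flat list of
    (element, counter, level) triples in preorder."""
    if out is None:
        out = []
    for item in sorted(trie):
        count, children = trie[item]
        out.append((item, count, level))
        flatten(children, level + 1, out)
    return out


def itemsets(flat):
    """Rebuild (itemset, count) pairs from the flat array, using the
    level field in place of parent pointers."""
    path, result = [], []
    for item, count, level in flat:
        del path[level:]               # return to the parent at this depth
        path.append(item)
        result.append((tuple(path), count))
    return result
```

Because the array is read strictly front to back, it suits sequential access to a disk-resident file, which fits the slide's note that the tries live on disk.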
Experiments ...
Dataset                                           N          Avg Tran Size  Input Size
IBM synthetic dataset T10.I4.1000K                1 Million  10             49 MB
IBM synthetic dataset T15.I6.1000K                1 Million  15             69 MB
Frequent word pairs in 100K web documents         100K       134            54 MB
Frequent word pairs in 806K Reuters news reports  806K       61             210 MB
What do we study? For each dataset, three independent variables: support threshold s, length of stream N, BUFFER size B. Fix one and vary two. Measure time taken t. Set ε = 10% of support s.
Varying support s and BUFFER B. [Figure: time in seconds versus BUFFER size in MB, one curve per support threshold (values shown range from s = 0.001 to s = 0.020); panels: IBM 1M transactions, Reuters 806K docs.] Fixed: Stream length N. Varying: BUFFER size B, support threshold s.
Varying length N and support s. [Figure: time in seconds versus length of stream in thousands, one curve per support threshold (s = 0.001, 0.002, 0.004); panels: IBM 1M transactions, Reuters 806K docs.] Fixed: BUFFER size B. Varying: Stream length N, support threshold s.
Varying BUFFER B and support s. [Figure: time in seconds versus support threshold s, one curve per BUFFER size (B = 4 MB, 16 MB, 28 MB, 40 MB); panels: IBM 1M transactions, Reuters 806K docs.] Fixed: Stream length N. Varying: BUFFER size B, support threshold s.
Comparison with fast A-priori
           A-priori         Our algorithm     Our algorithm
                            with 4 MB buffer  with 44 MB buffer
Support    Time    Memory   Time    Memory    Time    Memory
0.001      99 s    82 MB    111 s   12 MB     27 s    45 MB
0.002      25 s    53 MB    94 s    10 MB     15 s    45 MB
0.004      14 s    48 MB    65 s    7 MB      8 s     45 MB
0.006      13 s    48 MB    46 s    6 MB      6 s     45 MB
0.008      13 s    48 MB    34 s    5 MB      4 s     45 MB
0.010      14 s    48 MB    26 s    5 MB      4 s     45 MB
Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.
A-priori by Christian Borgelt: http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html
Comparison with Iceberg Queries. Query: Identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents. [FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory. Our single-pass algorithm: 4500 seconds with 26 MB memory. Our algorithm would be much faster if allowed multiple passes!
Lessons Learnt ... Optimizing for # passes is wrong! Small support s ⇒ Too many frequent itemsets! Time to redefine the problem itself? Interesting combination of Theory and Systems.
Work in Progress ... Frequency Counts over Sliding Windows. Multiple pass Algorithm for Frequent Itemsets. Iceberg Datacubes.
Summary. Lossy Counting: A practical algorithm for online frequency counting. First ever single pass algorithm for Association Rules with user specified error guarantees. Basic algorithm applicable to several problems.