Identifying Frequent Items in Sliding Windows over On-Line Packet Streams
Alejandro López-Ortiz
School of Computer Science, University of Waterloo
Joint work with Lukasz Golab (Waterloo), David DeHaan (Waterloo), Erik Demaine (MIT), and J. Ian Munro (Waterloo)
Application
• Real-time analysis of network traffic
• Find frequently appearing packet types
• Packet type: port number, protocol type, source IP
• But we are interested in recent usage trends
• E.g. for routing system analysis or anomaly detection
• So we want to find frequently appearing packet types in a sliding window of the N most recent packets
IMC ’03 Miami, Florida Alejandro Lopez-Ortiz
If we could store the entire window:
• Maintain a frequency count for each category in the window
• Update the counters as new packets arrive and old packets expire from the window
• Periodically scan the counters and return the packet types with the k largest counts (and possibly the counts themselves)
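The exact approach above can be sketched in a few lines of Python. This is only an illustration under the assumption that the whole window fits in memory; the class name `ExactWindowTopK` and the sample stream are hypothetical and do not come from the talk.

```python
from collections import Counter, deque


class ExactWindowTopK:
    """Exact top-k over the N most recent packets (stores the whole window)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()    # packet types, oldest first
        self.counts = Counter()  # frequency of each type inside the window

    def add(self, packet_type):
        # A new packet arrives: count it.
        self.window.append(packet_type)
        self.counts[packet_type] += 1
        # The oldest packet expires once the window is over capacity.
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts[expired] -= 1
            if self.counts[expired] == 0:
                del self.counts[expired]

    def top_k(self, k):
        # Periodic scan: the k largest counters with their counts.
        return self.counts.most_common(k)


# Hypothetical stream of packet types, window of the 5 most recent packets.
tracker = ExactWindowTopK(window_size=5)
for packet_type in ["a", "b", "a", "c", "a", "b", "a"]:
    tracker.add(packet_type)
top1 = tracker.top_k(1)
```

The memory cost is proportional to the window size N, which is what the sub-window technique on the following slides tries to avoid.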
What if we can’t store the entire window?
• Idea from [Zhu, Shasha, VLDB ’02]:
  • Divide the sliding window into sub-windows, i.e. use a coarser time grain of T packets
  • Store a summary for each sub-window
• Every T packets:
  • Expire the oldest sub-window
  • Add the most recent sub-window
  • Update the answer
• Space requirement: (window size / T) × summary size
Example: windowed SUM
SUM = 5 + … + 3 = 97
Old window: 5 8 4 9 11 6 8 5 3 20 8 7 3
New window: 8 4 9 11 6 8 5 3 20 8 7 3 7
SUM = SUM_OLD – 5 + 7 = 99
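A minimal sketch of this windowed SUM, assuming each number in the example is the stored summary (the sum) of one sub-window. The class name `SubWindowSum` is hypothetical. Only one value per sub-window is kept, and the running total is adjusted as in SUM = SUM_OLD – oldest + newest.

```python
from collections import deque


class SubWindowSum:
    """Sliding SUM maintained from per-sub-window summaries."""

    def __init__(self, num_subwindows):
        self.num_subwindows = num_subwindows
        self.summaries = deque()  # one stored sum per sub-window, oldest first
        self.total = 0

    def push(self, subwindow_sum):
        # Add the most recent sub-window summary.
        self.summaries.append(subwindow_sum)
        self.total += subwindow_sum
        # Expire the oldest sub-window once the window is full.
        if len(self.summaries) > self.num_subwindows:
            self.total -= self.summaries.popleft()
        return self.total


# The 13 values from the example, then the new sub-window with sum 7.
agg = SubWindowSum(num_subwindows=13)
for s in [5, 8, 4, 9, 11, 6, 8, 5, 3, 20, 8, 7, 3]:
    total_before = agg.push(s)
total_after = agg.push(7)
```

SUM works this way because it is distributive: the old sub-window's contribution is fully captured by its stored summary, so it can be subtracted exactly.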
Updating top-k counters
• T_b = current count for packets of type b
• Update: T_b = T_b - T_b(old sub-window) + T_b(new sub-window)
• Problem: T_b(old sub-window) might not be part of the summary stored for the old sub-window
…but let’s use the technique anyway
• Sub-window summary: IDs and counts of the k most frequent categories
• Threshold = sum, over all sub-windows, of the occurrence count of the least frequent item in that sub-window’s summary
• Compute the overall occurrence count for each packet type from the sub-window summaries
• Packet types whose overall count exceeds the threshold are reported as top-k
The algorithm
Let a, b, c, … be distinct packet types, and let k = 3. Each column is the top-3 summary of one sub-window:

a:17  a:14  d:16  c:22  e:15  b:24  b:21  e:13  c:18  c:12  d:20  f:15  f:17
g:9   c:6   g:12  f:10  k:12  h:8   f:6   a:8   n:11  a:6   e:13  d:7   d:6
e:4   h:4   a:3   j:3   b:4   c:4   m:6   k:4   b:4   b:4   p:8   h:3   r:5

• Threshold = 4+4+3+…+8+3+5 = 56
• Total frequency counts from the top-k lists: a=48, b=57, c=62, d=49, e=45, f=48, g=21, h=15, j=3, k=16, m=6, n=11, p=8, r=5
• Return b and c as frequent items in this window
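The worked example can be checked with a short Python sketch. It assumes the 39 entries on the slide are read as three rows of 13, so that each column is one sub-window's top-3 list; the function name `frequent_items` and the helper `parse` are hypothetical and serve only to illustrate the steps.

```python
from collections import Counter


def frequent_items(summaries):
    """summaries: one list of (packet_type, count) pairs per sub-window."""
    # Threshold: sum of the smallest count kept in each sub-window summary.
    threshold = sum(min(count for _, count in s) for s in summaries)
    # Overall count per packet type, using only the stored summaries.
    totals = Counter()
    for summary in summaries:
        for packet_type, count in summary:
            totals[packet_type] += count
    # Report the types whose overall count exceeds the threshold.
    reported = {t: c for t, c in totals.items() if c > threshold}
    return threshold, totals, reported


def parse(row):
    # "a:17 b:4" -> [("a", 17), ("b", 4)]
    return [(t, int(c)) for t, c in (item.split(":") for item in row.split())]


# Rows are ranks 1 to 3; zipping them yields one top-3 list per sub-window.
rank1 = parse("a:17 a:14 d:16 c:22 e:15 b:24 b:21 e:13 c:18 c:12 d:20 f:15 f:17")
rank2 = parse("g:9 c:6 g:12 f:10 k:12 h:8 f:6 a:8 n:11 a:6 e:13 d:7 d:6")
rank3 = parse("e:4 h:4 a:3 j:3 b:4 c:4 m:6 k:4 b:4 b:4 p:8 h:3 r:5")
summaries = [list(column) for column in zip(rank1, rank2, rank3)]

threshold, totals, reported = frequent_items(summaries)
```

With this data the threshold is 56 and only b (57) and c (62) exceed it. An item missing from a summary occurred at most as often as that summary's least frequent entry, so the threshold bounds the total count that could have gone unrecorded for any type.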
Hypothesis
• If categories are roughly uniformly distributed, the previous method may not work
• But under a power-law distribution, we expect a few heavy flows that should register on many top-k lists
• Experimented with a TCP trace
  • 1 month of traffic from Lawrence Berkeley Lab to the rest of the world; almost 800 000 packets in total
  • 1647 distinct source IP addresses, which we treated as distinct categories
Results: accuracy
[Figure: percentage of over-threshold items identified (y-axis, 0 to 100 percent) versus k (x-axis, 1 to 10), with one curve per sub-window size b = 20, 100, 500. Window size = 100 000 packets.]
Results: precision of the reported frequencies
[Figure: relative error in the reported frequencies (y-axis, 0 to 14 percent) versus k (x-axis, 1 to 10), with one curve per sub-window size b = 20, 100, 500. Window size = 100 000 packets.]
Conclusions
• Extended the sub-window model to a holistic aggregate
• Good results due to the non-uniform distribution of Internet traffic
• Low space requirements