Identifying Frequent Items in Sliding Windows over On-Line Packet Streams


  1. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams. Alejandro López-Ortiz, School of Computer Science, University of Waterloo. Joint work with Lukasz Golab (Waterloo), David DeHaan (Waterloo), Erik Demaine (MIT), and J. Ian Munro (Waterloo).

  2. Application
     - Real-time analysis of network traffic: find frequently appearing packet types.
     - Packet type: port number, protocol type, source IP.
     - But we are interested in recent usage trends, e.g. for routing system analysis or anomaly detection.
     - So we want to find frequently appearing packet types in a sliding window of the N most recent packets.
     (IMC ’03, Miami, Florida, Alejandro López-Ortiz)

  3. If we could store the entire window:
     - Maintain frequency counts of each category in the window.
     - Update the counters as new packets arrive and old packets expire out of the window.
     - Periodically scan the counters and return the packet types corresponding to the k largest counters (and possibly the actual counts too).
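The exact approach on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the class name `ExactWindowCounter` and its methods are hypothetical, and it assumes the whole window fits in memory.

```python
from collections import Counter, deque


class ExactWindowCounter:
    """Exact per-type counts over the last `window_size` packets (hypothetical helper)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()    # stored packet types, oldest first
        self.counts = Counter()  # packet type -> occurrences inside the window

    def add(self, packet_type):
        # A new packet arrives: store it and increment its counter.
        self.window.append(packet_type)
        self.counts[packet_type] += 1
        # If the window overflowed, expire the oldest packet and decrement its counter.
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts[expired] -= 1
            if self.counts[expired] == 0:
                del self.counts[expired]

    def top_k(self, k):
        # Scan the counters and return the k largest with their counts.
        return self.counts.most_common(k)


counter = ExactWindowCounter(window_size=5)
for p in "aabacbbbc":
    counter.add(p)
top2 = counter.top_k(2)
print(top2)  # [('b', 3), ('c', 2)]
```

The memory cost is proportional to the window size, which is what motivates the sub-window idea on the next slide.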

  4. What if we can’t store the entire window?
     - Idea from [Zhu, Shasha, VLDB ’02]: divide the sliding window into sub-windows, i.e. use a coarser time grain of T packets.
     - Store a summary for each sub-window.
     - Every T packets: expire the oldest sub-window, add the most recent sub-window, and update the answer.
     - Space requirement: (window size / T) × summary size.

  5. Example: windowed SUM
     - Window: 5 8 4 9 11 6 8 5 3 20 8 7 3, so SUM = 5 + … + 3 = 97.
     - The oldest value 5 expires and a new value 7 arrives: 8 4 9 11 6 8 5 3 20 8 7 3 7.
     - SUM = SUM_OLD - 5 + 7 = 99.
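The windowed SUM example can be reproduced with a short sketch. The function name `sliding_sums` is made up for illustration. Each value can be read as the summary (here, a partial sum) of one sub-window, so only one number per sub-window is kept.

```python
from collections import deque


def sliding_sums(values, window_len):
    """Return the running sum of the last `window_len` values, once the window is full."""
    window = deque()
    total = 0
    sums = []
    for v in values:
        window.append(v)
        total += v                    # add the newest summary
        if len(window) > window_len:
            total -= window.popleft() # subtract the expired summary
        if len(window) == window_len:
            sums.append(total)
    return sums


values = [5, 8, 4, 9, 11, 6, 8, 5, 3, 20, 8, 7, 3, 7]
sums = sliding_sums(values, 13)
print(sums)  # [97, 99]
```

SUM works this way because it is distributive: the contribution of an expired sub-window is known exactly from its summary.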

  6. Updating top-k counters
     - T_b = current count for packets of type b.
     - Update: T_b = T_b - T_b(old sub-window) + T_b(new sub-window).
     - Problem: T_b(old sub-window) might not be part of the summary stored for the old sub-window.

  7. …but let’s use the technique anyway
     - Sub-window summary: IDs and counts of the k most frequent categories.
     - Threshold = sum, over all sub-windows, of the occurrence count of the least frequent item in each sub-window’s summary.
     - Compute the overall occurrence count for each packet type from the sub-window summaries.
     - Packet types whose overall count exceeds the threshold are reported as top-k.
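A minimal streaming sketch of this technique follows. It is an illustration under assumptions, not the authors' implementation: the class `SubWindowTopK`, its parameters, and the choice to close a sub-window after exactly `subwindow_size` packets are all hypothetical.

```python
from collections import Counter, deque


class SubWindowTopK:
    """Approximate frequent items from per-sub-window top-k summaries (hypothetical sketch)."""

    def __init__(self, num_subwindows, subwindow_size, k):
        self.subwindow_size = subwindow_size
        self.k = k
        # A bounded deque drops the oldest summary automatically when a new one is added.
        self.summaries = deque(maxlen=num_subwindows)
        self.current = Counter()  # exact counts for the sub-window being filled
        self.seen = 0

    def add(self, packet_type):
        self.current[packet_type] += 1
        self.seen += 1
        if self.seen == self.subwindow_size:
            # Close the sub-window: keep only its k most frequent types and counts.
            self.summaries.append(self.current.most_common(self.k))
            self.current = Counter()
            self.seen = 0

    def frequent(self):
        totals = Counter()
        threshold = 0
        for summary in self.summaries:
            for packet_type, count in summary:
                totals[packet_type] += count
            # Count of the least frequent item kept in this summary.
            threshold += summary[-1][1]
        over = {t: c for t, c in totals.items() if c > threshold}
        return over, threshold


tracker = SubWindowTopK(num_subwindows=3, subwindow_size=4, k=2)
for p in "aaab" "aacd" "aabe":
    tracker.add(p)
frequent, threshold = tracker.frequent()
print(frequent, threshold)  # {'a': 7} 3
```

The threshold is useful because a type missing from a sub-window's summary can have occurred there at most as often as the least frequent item that was kept, so a type can be undercounted by at most the threshold in total.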

  8. The algorithm
     Let a, b, c, … be distinct packet types, and let k = 3. Each column below is the top-3 summary of one sub-window:

         1st:  a:17  a:14  d:16  c:22  e:15  b:24  b:21  e:13  c:18  c:12  d:20  f:15  f:17
         2nd:  g:9   c:6   g:12  f:10  k:12  h:8   f:6   a:8   n:11  a:6   e:13  d:7   d:6
         3rd:  e:4   h:4   a:3   j:3   b:4   c:4   m:6   k:4   b:4   b:4   p:8   h:3   r:5

     - Threshold = 4+4+3+…+8+3+5 = 56
     - Total frequency counts from the top-k lists: a=48, b=57, c=62, d=49, e=45, f=48, g=21, h=15, j=3, k=16, m=6, n=11, p=8, r=5
     - Return b and c as the frequent items in this window.
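The arithmetic on this slide can be checked with a short script. The three strings below transcribe the slide's summaries, reading the entries as three rows of 13 (most frequent, second, third item per sub-window); that row layout and the helper `parse` are assumptions made for this sketch.

```python
from collections import Counter

# Top-3 summaries of the 13 sub-windows, one row per rank.
first = "a:17 a:14 d:16 c:22 e:15 b:24 b:21 e:13 c:18 c:12 d:20 f:15 f:17"
second = "g:9 c:6 g:12 f:10 k:12 h:8 f:6 a:8 n:11 a:6 e:13 d:7 d:6"
third = "e:4 h:4 a:3 j:3 b:4 c:4 m:6 k:4 b:4 b:4 p:8 h:3 r:5"


def parse(row):
    """Turn 'a:17 b:4' into [('a', 17), ('b', 4)]."""
    pairs = []
    for item in row.split():
        name, count = item.split(":")
        pairs.append((name, int(count)))
    return pairs


# Transpose the rows so that each element is one sub-window's (1st, 2nd, 3rd) summary.
summaries = list(zip(parse(first), parse(second), parse(third)))

# Threshold: sum of the smallest count kept in each summary.
threshold = sum(summary[-1][1] for summary in summaries)

# Overall count per packet type, using only what the summaries contain.
totals = Counter()
for summary in summaries:
    for name, count in summary:
        totals[name] += count

frequent = sorted(name for name, count in totals.items() if count > threshold)
print(threshold, frequent)  # 56 ['b', 'c']
```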

  9. Hypothesis
     - If categories are roughly equally distributed, the previous method may not work.
     - But under a power-law distribution we expect a few heavy flows, which should register on many top-k lists.
     - Experimented with a TCP trace: 1 month of traffic from Lawrence Berkeley Lab to the rest of the world; almost 800 000 packets in total.
     - 1647 distinct source IP addresses, which we treated as distinct categories.

  10. Results: accuracy. [Figure: percentage of identified over-threshold items (y-axis, 0 to 100 percent) versus k (x-axis, 1 to 10), one curve per sub-window size b = 20, 100, 500. Window size = 100 000 packets.]

  11. Results: precision of the reported frequencies. [Figure: relative error in the reported frequencies (y-axis, 0 to 14 percent) versus k (x-axis, 1 to 10), one curve per sub-window size b = 20, 100, 500. Window size = 100 000 packets.]

  12. Conclusions
      - Extended the sub-window model to a holistic aggregate.
      - Good results due to the non-uniform distribution of Internet traffic.
      - Low space requirements.
