
Data Stream Processing Part II Stream Windows Lossy Counting - PowerPoint PPT Presentation



  1. Data Stream Processing Part II: Stream Windows, Lossy Counting, Sticky Sampling

  2. Data Streams (recap): a continuous, unbounded sequence of items; unpredictable arrival times; too large to store locally; one-pass, real-time processing required.

  3. Reservoir Sampling (recap): create a representative sample of the $N$ incoming data items by sampling uniformly into a reservoir of size $r$, so that each item ends up in the reservoir with probability $r/N$. [Figure: a stream feeding a reservoir, each item labelled $r/N$.]
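
The recap above can be illustrated with a minimal sketch of the classic replace-at-random scheme (often called Algorithm R). The function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this example; they do not come from the slides.

```python
import random


def reservoir_sample(stream, r, rng=None):
    """Return a uniform sample of up to r items from an iterable, in one pass.

    Hypothetical helper: the name and the `rng` parameter are illustrative.
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= r:
            # The first r items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability r/n.
            k = rng.randrange(n)  # uniform integer in [0, n)
            if k < r:
                reservoir[k] = item
    return reservoir
```

Only the reservoir is kept in memory, so the sketch fits the one-pass, bounded-storage setting described in the recap.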

  4. Today: counting algorithms based on stream windows (Lossy Counting, Sticky Sampling).

  5. Stream Windows: a mechanism for extracting a finite relation from an infinite stream.

  9. Window Example: the stream a d j u w s w y u j g d e d l, ordered from past to future. [Figure: the same stream drawn three times, with a sliding window at three successive positions.]

  11. Window Types: windows assume the existence of some attribute that defines the order of the stream elements (e.g. time). $w$ is the window length (size), expressed in units of the ordering attribute (e.g. seconds).
      - Sliding Window: window $i$ spans $t_i$ to $t_i'$ with $t_i' - t_i = w$; successive windows may overlap.
      - Tumbling Window: windows are delimited by $t_1, t_2, t_3, \dots$ with $t_{i+1} - t_i = w$; successive windows do not overlap.
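
The two window types can be sketched as follows. This is an illustrative sketch that assumes events arrive as `(timestamp, value)` pairs in timestamp order; the function names and the choice of a half-open interval `(t - w, t]` for the sliding window are assumptions, not something the slides specify.

```python
from collections import deque


def tumbling_windows(events, w):
    """Group (timestamp, value) pairs into non-overlapping windows of length w.

    Returns a dict mapping each window's start time to its values.
    """
    windows = {}
    for t, value in events:
        start = (t // w) * w  # start of the window that contains t
        windows.setdefault(start, []).append(value)
    return windows


def sliding_windows(events, w):
    """Yield (t, values) for each event, with values from the interval (t - w, t]."""
    buf = deque()
    for t, value in events:
        buf.append((t, value))
        # Evict everything that is w or more time units older than t.
        while buf[0][0] <= t - w:
            buf.popleft()
        yield t, [v for _, v in buf]


events = [(0, "a"), (1, "b"), (3, "c"), (4, "d"), (7, "e")]
tumbling = tumbling_windows(events, 3)
sliding = list(sliding_windows(events, 3))
```

With these events, each value lands in exactly one tumbling window, while the same value can appear in several consecutive sliding windows.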

  13. Count based Windows: the ordering attribute can cause problems when it contains duplicates (e.g. identical timestamps), so use count-based windows instead. Count-based windows are potentially unpredictable with respect to fluctuations in the input rate. [Figure: count-based windows delimited by $t_1, t_2, t_3$ and $t_1', t_2', t_3'$.]

  15. Punctuation based Windows: split the stream into windows at punctuations in the data (the figure uses \n as the punctuation). Potentially problematic if windows grow too large or too small.
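
A minimal sketch of punctuation-based windowing, assuming the stream is an iterable of items and a single marker value acts as the punctuation. The function name `punctuation_windows` is hypothetical.

```python
def punctuation_windows(stream, punctuation="\n"):
    """Yield one list of items per window, splitting at each punctuation item."""
    window = []
    for item in stream:
        if item == punctuation:
            # The punctuation closes the current window (which may be empty).
            yield window
            window = []
        else:
            window.append(item)
    if window:
        # Emit the trailing items that were never closed by a punctuation.
        yield window


windows = list(punctuation_windows("ab\nc\n\nde"))
```

Two consecutive punctuations produce an empty window here, and a long run without punctuation would produce an arbitrarily large one, which is the size problem the slide points out.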

  16. Window Standing Query Example: what is the average of the integers in the window? The input is a stream of integers, read through a count-based sliding window of size $w = 4$. For the first $w$ inputs, sum and count. Afterwards, change the average by adding $(i - j)/w$ to the previous window average, where $i$ is the newest value and $j$ the oldest.
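
The update rule follows from writing out both window sums. In the derivation below, $x_k$ denotes the $k$-th stream element and $n$ the index of the newest one; both symbols are introduced only for this derivation and do not appear in the slides.

```latex
\begin{align*}
\mathrm{avg}_{\mathrm{old}} &= \frac{1}{w} \sum_{k=n-w}^{n-1} x_k \\
\mathrm{avg}_{\mathrm{new}} &= \frac{1}{w} \sum_{k=n-w+1}^{n} x_k
  = \frac{1}{w} \Bigl( \sum_{k=n-w}^{n-1} x_k + x_n - x_{n-w} \Bigr)
  = \mathrm{avg}_{\mathrm{old}} + \frac{i - j}{w}
\end{align*}
```

Here $i = x_n$ is the newest value and $j = x_{n-w}$ is the oldest value, which leaves the window.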

  22. Window Standing Query Example on the stream 1 3 5 4 8 9 3 1 4 2 7 5 6 8 7:
      - First window (1, 3, 5, 4): $\frac{1+3+5+4}{4} = 3.25$.
      - Update rule: $3.25 + \frac{i - j}{w}$, with $i$ the newest value and $j$ the oldest value.
      - Second window (3, 5, 4, 8): $\frac{1+3+5+4}{4} + \frac{8 - 1}{4} = 5$.
      - Third window (5, 4, 8, 9): $5 + \frac{9 - 3}{4} = 6.5$.

      Which data structure should hold the window?

  24. Window Average

      ```python
      #!/usr/bin/env python3
      import sys
      from collections import deque

      WINDOW = 4

      elems = deque()            # values currently in the window, oldest first
      elem_sum = 0
      for _ in range(WINDOW):    # fill the first window and sum it
          val = int(sys.stdin.readline().strip())
          elems.append(val)
          elem_sum += val
      avg = elem_sum / WINDOW
      print(avg)

      for line in sys.stdin:     # slide the window by one element per input line
          val = int(line.strip())
          # Add the newest value and drop the oldest: avg += (i - j) / w
          avg += (val - elems.popleft()) / WINDOW
          elems.append(val)
          print(avg)
      ```

      A queue allows the calculation in a single pass, touching each element once as it enters the window and once as it leaves.

  25. Window based Algorithm: Lossy Counting

  29. Problem Description: maintain a count for each distinct element seen so far.
      - Examples: a Google web crawler counting URL encounters; detecting spam pages through content analysis; user login rankings for web services.
      - Straightforward solution: a hash table. On large streams it is too large for memory and too slow on disk.
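
For reference, the straightforward hash-table solution can be sketched as below. The function name `exact_counts` is hypothetical; `collections.Counter` is the standard-library dictionary subclass for counting.

```python
from collections import Counter


def exact_counts(stream):
    """Count every distinct element exactly, using one table entry per element."""
    counts = Counter()
    for item in stream:
        counts[item] += 1
    return counts


counts = exact_counts(["/a", "/b", "/a", "/c", "/a"])
```

The table holds one entry per distinct element, so its size grows with the number of distinct elements in the stream. That growth is what makes the approach impractical on unbounded streams.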

  30. Algorithm Parameters
      - Environment parameter: the number of elements seen so far, $N$.
      - User-specified parameters: support threshold $s \in (0, 1)$ and error parameter $\epsilon \in (0, 1)$.

  31. Algorithm Guarantees
      1. All items whose true frequency exceeds $sN$ are output. There are no false negatives.
      2. No item whose true frequency is less than $(s - \epsilon)N$ is output.
      3. Estimated frequencies are less than the true frequencies by at most $\epsilon N$.

  36. Example with $s = 10\%$, $\epsilon = 1\%$, $N = 1000$:
      1. All elements whose frequency exceeds $sN = 100$ will be output.
      2. No element with a frequency below $(s - \epsilon)N = 90$ is output. Elements with frequencies between 90 and 100 are false positives and might or might not be output.
      3. All estimated frequencies fall short of their true frequencies by at most $\epsilon N = 10$ instances.

      Rule of thumb: $\epsilon = 0.1 s$.
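
The slides up to this point state the guarantees but not the procedure. The sketch below follows the commonly described bucket-based formulation of Lossy Counting (Manku and Motwani), which is an assumption about what the lecture goes on to present. The class name `LossyCounter` and its method names are hypothetical.

```python
import math


class LossyCounter:
    """Sketch of bucket-based Lossy Counting (assumed formulation, not from the slides)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width, in elements
        self.n = 0                           # elements seen so far (N)
        self.entries = {}                    # item -> [count f, max undercount delta]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # A new item may have been seen and pruned up to bucket - 1 times before.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # At each bucket boundary, prune entries that cannot be frequent.
            self.entries = {
                e: fd for e, fd in self.entries.items() if fd[0] + fd[1] > bucket
            }

    def frequent(self, s):
        """Return estimated counts of items with estimate >= (s - epsilon) * N."""
        threshold = (s - self.epsilon) * self.n
        return {e: f for e, (f, _) in self.entries.items() if f >= threshold}


# Settings from the example: s = 10%, epsilon = 1%, N = 1000.
counter = LossyCounter(epsilon=0.01)
for k in range(1000):
    # "hot" makes up 20% of the stream; every other element occurs once.
    counter.add("hot" if k % 5 == 0 else f"rare-{k}")
result = counter.frequent(s=0.1)
```

In this run the element `hot` has a true frequency of 200, which exceeds $sN = 100$, so it has to be reported, and its estimate may fall short by at most $\epsilon N = 10$. The one-off elements stay far below $(s - \epsilon)N = 90$ and are pruned at the bucket boundaries.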

  38. Expected Errors
      1. High-frequency false positives.
      2. Small errors in the frequency estimates.

      Both are acceptable when $N$ is large.
