
Data Stream Processing Part I: Motivation, Data Streams, Reservoir Sampling


  1. Data Stream Processing, Part I. Outline: Motivation, Data Streams, Reservoir Sampling.

  2. Homework 1 is due this Friday, the 20th of October.

  3. Data Processing so far ... (diagram: input document, output document)

  4. Sensor Data Example (temperature readings in °C over time): an input of one 4-byte real per hour is 96 bytes per day.

  5. Sensor Data Example: an input of one 4-byte real every 100 ms is about 3.5 MB per day.

  6. Sensor Data Example: an input of one million 4-byte reals every 100 ms is about 3.5 TB per day.
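
The daily volumes on the sensor slides follow from simple arithmetic. The sketch below reproduces them, assuming 4-byte readings and decimal units (1 MB = 10^6 bytes); the function name `bytes_per_day` and the constants are illustrative and do not come from the slides.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day
BYTES_PER_READING = 4             # one 4-byte real, as on the slides


def bytes_per_day(readings_per_tick: int, interval_ms: int) -> int:
    """Bytes produced per day when `readings_per_tick` values arrive every `interval_ms`."""
    ticks_per_day = MS_PER_DAY // interval_ms
    return ticks_per_day * readings_per_tick * BYTES_PER_READING


print(bytes_per_day(1, 3_600_000))    # one reading per hour: 96
print(bytes_per_day(1, 100))          # one reading every 100 ms: 3456000
print(bytes_per_day(1_000_000, 100))  # one million readings every 100 ms: 3456000000000
```

The last two results are 3.456 MB and 3.456 TB, which the slides round to 3.5.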

  7. Sensor Data Example: the stream is large and unbounded. It is too large for memory, and disk latency is too high. We need real-time processing!

  8. Sensor Data Example: process the data stream directly.

  9. Data Streams

  10-11. What is a Data Stream? Definition (Golab and Ozsu, 2003): A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.

      Key properties: continuous and sequential input; typically unpredictable input rate; possibly large amounts of data; not error free.

  12. Data Stream Applications: online, real-time processing; event detection and reaction; aggregation; approximation.

  13-16. Data Stream Examples: stock monitoring; website traffic monitoring; network management; highway traffic.

  17-19. Data Stream Characteristics: all items have the same structure, for example a tuple or object such as (sender, recipient, text body). Timestamps: explicit vs. implicit, physical vs. logical.

  20. Database Management vs. Data Stream Management

  21. DBMS vs. DSMS

      Feature           DBMS                 DSMS
      Model             persistent relation  transient relation
      Relation          tuple set/bag        tuple sequence
      Data update       modifications        appends
      Query             transient            persistent
      Query answer      exact                approximate
      Query evaluation  arbitrary            one pass
      Query plan        fixed                adaptive
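
To make the "persistent query" and "one pass" rows of the comparison concrete, here is a minimal sketch of a standing query that keeps only constant-size state and emits an updated answer after every arrival. The function name `running_mean` and the sample readings are made up for illustration. This particular aggregate happens to be exact; the "approximate" row applies to queries that cannot be answered from a small summary.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all items seen so far, updating after each arrival."""
    count = 0
    total = 0.0
    for value in stream:      # each item is seen exactly once (one pass)
        count += 1
        total += value
        yield total / count   # the standing query emits a fresh answer


readings = [20.0, 22.0, 21.0]  # made-up sample values
print(list(running_mean(readings)))  # [20.0, 21.0, 21.0]
```

The generator never stores the stream itself, only `count` and `total`, which is what lets it run over unbounded input.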

  22. DSMS Architecture

  23. Data Stream Mining

  24-25. Data Stream Mining: event detection and reaction; counting the frequency of specific items; pattern detection; aggregation; approximation; sampling.

  26. Reservoir Sampling

  27. Problem: sampling lines from a large text file. Stream version: sample search engine queries, updated live.

  28-30. The Simple Way
      1. Scan the text file, counting lines.
      2. Generate random line numbers in [0, |lines|).
      3. Sort the line numbers.
      4. Scan the text file, outputting the selected lines.
      Cost: two scans. Impossible or impractical for a stream.
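
The four steps of the simple way can be sketched as follows. This is an illustration under stated assumptions, not the lecturer's code: `two_pass_sample` is a hypothetical name, `open_lines` is a hypothetical callable that reopens the input for each scan, and the line numbers are assumed to be drawn without replacement (the slides do not say).

```python
import random
from typing import Callable, Iterable, List


def two_pass_sample(open_lines: Callable[[], Iterable[str]], k: int,
                    rng: random.Random) -> List[str]:
    """Sample k distinct lines uniformly using two full scans."""
    # Step 1: first scan, only counting lines.
    n = sum(1 for _ in open_lines())
    # Steps 2 and 3: draw k distinct line numbers in [0, n) and sort them.
    wanted = sorted(rng.sample(range(n), k))
    # Step 4: second scan, emitting the selected lines in file order.
    selected = []
    pos = 0
    for number, line in enumerate(open_lines()):
        if pos < len(wanted) and number == wanted[pos]:
            selected.append(line)
            pos += 1
    return selected


lines = [f"line {i}" for i in range(100)]  # stand-in for a text file
print(two_pass_sample(lambda: iter(lines), 5, random.Random(0)))
```

The need to call `open_lines()` twice is exactly what rules this approach out for a live stream, which can only be read once.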

  31-33. The Simple Way for a Stream. Problem: sample 1000 queries.
      1. Assign each query a random number.
      2. Keep the queries with the 1000 highest random numbers.
      3. Discard the rest.
      Additional storage is required for the random numbers. So far this is not reservoir sampling!
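
A minimal single-pass sketch of the random-number approach follows. The slides do not prescribe a data structure; the min-heap used here is one possible choice, and the name `sample_by_random_tags` is hypothetical. The stored tags are the additional storage the slide mentions.

```python
import heapq
import random
from typing import Iterable, List, Tuple


def sample_by_random_tags(stream: Iterable[str], k: int,
                          rng: random.Random) -> List[str]:
    """Keep the k items whose random tags are highest, in a single pass."""
    heap: List[Tuple[float, int, str]] = []  # min-heap of (tag, arrival index, item)
    for index, item in enumerate(stream):
        entry = (rng.random(), index, item)  # step 1: assign a random number
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)   # step 2: evict the lowest tag kept
        # step 3: anything else is simply discarded
    return [item for _, _, item in heap]


queries = (f"query {i}" for i in range(10_000))  # stand-in for a query stream
sample = sample_by_random_tags(queries, 1000, random.Random(0))
print(len(sample))  # 1000
```

Because every item receives an independent tag, each set of 1000 queries is equally likely to end up with the highest tags, so the sample is uniform.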

  34-41. Sample One Line: what is the probability of keeping a line and dropping all others?
      keep 1st line: 1
      keep 2nd line: 1/2
      keep 3rd line: 1/2
      keep nth line: 1/2

  42-44. Sample One Line: flip a coin at each line. If it’s heads, record the line (and forget the others).

      #!/usr/bin/env python
      import sys
      import random

      # Start with the first line in the reservoir.
      reservoir = sys.stdin.readline().strip()
      for line in sys.stdin:
          # Coin flip: replace the reservoir with probability 1/2.
          if random.randint(0, 1) == 0:
              reservoir = line.strip()
      print(reservoir)

      This is biased. The last line has probability 1/2. It should be the same probability for each line!
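
The bias of the coin-flip sampler can be computed exactly rather than simulated. The sketch below is an illustration added here, with the hypothetical name `coin_flip_distribution`; it tracks, line by line, the probability that each line is the one finally printed.

```python
from fractions import Fraction
from typing import List


def coin_flip_distribution(n: int) -> List[Fraction]:
    """Probability that each of n lines is the one printed by the coin-flip sampler."""
    half = Fraction(1, 2)
    probs = [Fraction(1)]  # after line 1 the reservoir surely holds it
    for _ in range(2, n + 1):
        # A new line replaces the reservoir with probability 1/2,
        # so every earlier line survives this step with probability 1/2.
        probs = [p * half for p in probs] + [half]
    return probs


print(coin_flip_distribution(4))
```

For four lines the probabilities are 1/8, 1/8, 1/4 and 1/2: later lines are strongly favoured, and the last line always has probability 1/2 when there are at least two lines.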

  45-49. Uniformly Sample One Line
      keep 1st line: 1
      keep 2nd line: 1/2
      keep 3rd line: 1/3
      keep nth line: 1/n

      Resulting probability that each line seen so far is the one held:
      after 1 line:  1/1
      after 2 lines: 1/2 1/2
      after 3 lines: 1/3 1/3 1/3
      after n lines: 1/n 1/n ... 1/n
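
The keep probabilities 1, 1/2, 1/3, ..., 1/n translate directly into a one-item reservoir sampler. This is a minimal sketch, not the lecturer's code; the name `sample_one` and the explicit `rng` parameter are choices made here for illustration.

```python
import random
from collections import Counter
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def sample_one(stream: Iterable[T], rng: random.Random) -> Optional[T]:
    """Return one item chosen uniformly from a stream of unknown length."""
    reservoir: Optional[T] = None
    for i, item in enumerate(stream, start=1):
        # Keep the i-th item with probability 1/i.
        if rng.randrange(i) == 0:
            reservoir = item
    return reservoir


rng = random.Random(42)  # fixed seed only to make the demo repeatable
counts = Counter(sample_one(["a", "b", "c"], rng) for _ in range(30_000))
print(counts)  # each of the three lines should appear roughly 10,000 times
```

Line i ends up in the reservoir if it is kept (probability 1/i) and every later line is rejected: (1/i) · (i/(i+1)) · ... · ((n-1)/n) = 1/n, the same for every line.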
