CAI: Cerca i Anàlisi d’Informació Grau en Ciència i Enginyeria de Dades, UPC 7. Streaming January 5, 2020 Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldà, Department of Computer Science, UPC 1 / 73
Contents 7. Streaming Data streams everywhere The data stream model Sampling Counting Items Counting Distinct Items Keeping Frequent Elements Counting in Sliding Windows Distributed Sketching References and Resources 2 / 73
Data streams everywhere ◮ Telcos - phone calls ◮ Satellite, radar, sensor data ◮ Computer systems and network monitoring ◮ Search logs, access logs ◮ RSS feeds, social network activity ◮ Websites, clickstreams, query streams ◮ E-commerce, credit card sales ◮ . . . 3 / 73
Example 1: Online shop Thousands of visits / day ◮ Is this “customer” a robot? ◮ Does this customer want to buy? ◮ Is customer lost? Finding what s/he wants? ◮ What products should we recommend to this user? ◮ What ads should we show to this user? ◮ Should we get more machines from the cloud to handle incoming traffic? 4 / 73
Example 2: Web searchers Millions of queries / day ◮ What are the top queries right now? ◮ Which terms are gaining popularity now? ◮ What ads should we show for this query and user? 5 / 73
Example 3: Phone company Hundreds of millions of calls/day ◮ Each call about 1000 bytes per switch ◮ I.e., about 1 TB/month; must keep for billing ◮ Is this call fraudulent? ◮ Why do we get so many call drops in area X? ◮ Should we reroute differently tomorrow? ◮ Is this customer thinking of leaving us? ◮ How to cross-sell / up-sell this customer? 6 / 73
Example 4: Network link Several Gb/minute at UPC’s outlink Really impossible to store ◮ Detect abusive users ◮ Detect anomalous traffic patterns ◮ . . . DDoS attacks, intrusions, etc. 7 / 73
Others ◮ Social networks: Planet-scale streams ◮ Smart cities. Smart vehicles ◮ Internet of Things ◮ (more phones connected to devices than used by humans) ◮ Open data; governmental and scientific ◮ We generate far more data than we can store 8 / 73
Data Streams: Modern times data ◮ Data arrives as sequence of items ◮ At high speed ◮ Forever ◮ Can’t store them all ◮ Can’t go back; or too slow ◮ Evolving, non-stationary reality https://www.youtube.com/watch?v=ANXGJe6i3G8 9 / 73
In algorithmic words. . . The Data Stream axioms: 1. One pass 2. Low time per item - read, process, discard 3. Sublinear memory - only summaries or sketches 4. Anytime, real-time answers 5. The stream evolves over time 10 / 73
Computing in data streams ◮ Approximate answers are often OK ◮ Specifically, in learning and mining contexts ◮ Often computable with surprisingly low memory, one pass 11 / 73
Main Ingredients: Approximation and Randomization ◮ Algorithms use a source of independent random bits ◮ So different runs give different outputs ◮ But “most runs” are “approximately correct” 12 / 73
Randomized Algorithms (ε, δ)-approximation A randomized algorithm A (ε, δ)-approximates a function f : X → R iff for every x ∈ X, with probability ≥ 1 − δ ◮ (absolute approximation) |A(x) − f(x)| < ε ◮ (relative approximation) |A(x) − f(x)| < ε · f(x) Often ε, δ given as inputs to A ε = accuracy; δ = confidence 13 / 73
Randomized Algorithms In traditional statistics one roughly describes a random variable X by giving µ = E[X] and σ² = Var(X). Obtaining (ε, δ)-approximations For any X, there is an algorithm that takes m independent samples of X and outputs an estimate µ̂ such that Pr[|µ̂ − µ| ≤ ε] ≥ 1 − δ for m = O((σ²/ε²) · ln(1/δ)) This is general. For specific X there may be more sample-efficient methods. (Proof omitted; ask if interested). 14 / 73
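The slide states this sample bound without proof; one standard way to achieve it is the median-of-means estimator. Below is a minimal sketch, assuming only that `sample()` draws independent copies of X and that `sigma2` upper-bounds Var(X); the constants 4 and 8 (Chebyshev within each group, Hoeffding for the median) are one common choice and are not taken from the slides.

```python
import math
import random
import statistics

def estimate_mean(sample, sigma2, eps, delta):
    """Median-of-means: O((sigma2/eps**2) * ln(1/delta)) samples overall.

    Each group mean is within eps of the true mean with probability >= 3/4
    (Chebyshev); taking the median of ~8*ln(1/delta) group means boosts the
    confidence to 1 - delta (Hoeffding).
    """
    group_size = max(1, math.ceil(4 * sigma2 / eps ** 2))
    n_groups = max(1, math.ceil(8 * math.log(1 / delta)))
    means = [statistics.fmean(sample() for _ in range(group_size))
             for _ in range(n_groups)]
    return statistics.median(means)

# Example: estimate the mean of a fair die (mu = 3.5, sigma^2 = 35/12)
print(estimate_mean(lambda: random.randint(1, 6), 35 / 12, eps=0.1, delta=0.01))
```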
Five problems on Data Streams ◮ Keeping a uniform sample ◮ Counting total elements ◮ Counting distinct elements ◮ Counting frequent elements - heavy hitters ◮ Counting in a sliding window The solutions are interesting not only in streaming mode, but whenever you want to reduce memory. 15 / 73
Sampling: Dealing with Velocity At time t , process element t with probability α Compute your query on the sampled elements only You process about αt elements instead of t , then extrapolate. 16 / 73
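A minimal sketch of this α-sampling idea; the click stream and the "buy" query below are invented purely for illustration.

```python
import random

def bernoulli_sample(stream, alpha):
    """Keep each stream element independently with probability alpha (one pass)."""
    return [x for x in stream if random.random() < alpha]

# Hypothetical use: count "buy" events in the sample, then extrapolate
# by dividing by alpha.
stream = ["view", "buy", "view", "view", "buy"] * 1000   # 2000 buys in total
sample = bernoulli_sample(stream, alpha=0.01)
print(sample.count("buy") / 0.01)                         # roughly 2000
```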
Sampling: Dealing with Velocity AND Memory Similar problem: Keep a uniform sample S of elements of some size k At every time t , each of the first t elements is in S with probability k/t How to make early elements as likely to be in S as later elements? 17 / 73
Reservoir Sampling Reservoir Sampling [Vitter85] ◮ Add the first k stream elements to S ◮ Choose to keep t -th item with probability k/t ◮ If chosen, replace one element from S at random 18 / 73
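A direct Python rendering of the procedure above (Vitter's Algorithm R); the `range(...)` stream is only for illustration.

```python
import random

def reservoir_sample(stream, k):
    """Reservoir sampling: uniform sample of size k from a stream of
    unknown length, in one pass and O(k) memory."""
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            sample.append(x)                 # the first k items always enter
        elif random.random() < k / t:        # keep item t with probability k/t
            sample[random.randrange(k)] = x  # ...evicting a random current member
    return sample

print(reservoir_sample(range(1_000_000), k=10))
```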
Reservoir Sampling: why does it work? Claim: for every t, for every i ≤ t, P_{i,t} = Pr[s_i in sample at time t] = k/t Suppose true at time t. At time t + 1, P_{t+1,t+1} = Pr[s_{t+1} sampled] = k/(t + 1) and for i ≤ t, s_i is in the sample at time t + 1 iff it was before, and not (s_{t+1} sampled and it kicks out exactly s_i): P_{i,t+1} = (k/t) · (1 − (k/(t+1)) · (1/k)) = (k/t) · (1 − 1/(t+1)) = (k/t) · (t/(t+1)) = k/(t+1) 19 / 73
Counting Items
Counting Items How many items have we read so far? To count up to t elements exactly, log t bits are necessary Morris’ counter: count approximately using log log t bits Can count up to 1 billion with log log 10^9 ≈ 5 bits 21 / 73
Approximate counting: Saving 1 bit Approximate counting, v1 Init: c ← 0 Update: draw a random number x ∈ [0, 1]; if (x ≤ 1/2) c ← c + 1 Query: return 2c E[2c] = t, σ(c) ≃ √t / 2 Space log(t/2) = log t − 1 → we saved 1 bit! 22 / 73
Approximate counting: Saving k bits Approximate counting, v2 Init: c ← 0 Update: draw a random number x ∈ [0, 1]; if (x ≤ 2^{−k}) c ← c + 1 Query: return 2^k · c E[c] = t/2^k, σ(c) ≃ √(t/2^k) Memory log t − k → we saved k bits! Testing x ≤ 2^{−k}: AND of k random bits, log k memory 23 / 73
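A sketch covering both versions above (k = 1 is v1); it draws a float in [0, 1] instead of AND-ing k random bits, which is simpler to read but forgoes the log k memory trick mentioned on the slide. The class name and the example sizes are ours.

```python
import random

class ApproxCounter:
    """Probabilistic counter of v1/v2: add 1 to c with probability 2**-k,
    report c * 2**k. Larger k saves more bits of memory for c, at the cost
    of a larger standard deviation of the estimate."""

    def __init__(self, k):
        self.k = k
        self.c = 0

    def update(self):
        if random.random() < 2 ** -self.k:
            self.c += 1

    def query(self):
        return self.c * 2 ** self.k

cnt = ApproxCounter(k=4)
for _ in range(100_000):
    cnt.update()
print(cnt.c, cnt.query())   # c is around 6250; the estimate is around 100000
```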
Approximate counting: Morris’ counter Morris’ counter [Morris77] Init: c ← 0 Update: draw a random number x ∈ [0, 1]; if (x ≤ 2^{−c}) c ← c + 1 Query: return 2^c − 2 E[2^c − 2] ≃ t, E[c] ≃ log t, σ ≃ t/√2 Memory = bits used to hold c = log c = log log t bits 24 / 73
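A sketch of Morris' counter as stated above; a real implementation would pack c into its log log t bits, here c is an ordinary Python int.

```python
import random

class MorrisCounter:
    """Morris' counter: increment c with probability 2**-c."""

    def __init__(self):
        self.c = 0

    def update(self):
        if random.random() < 2 ** -self.c:
            self.c += 1

    def query(self):
        # The slide returns 2**c - 2; with c initialised to 0 the exactly
        # unbiased estimate is 2**c - 1, a negligible difference for large t.
        return 2 ** self.c - 2

m = MorrisCounter()
for _ in range(1_000_000):
    m.update()
print(m.c, m.query())   # c ends up near 20; the estimate is of the order of
                        # 10**6, but with large variance (next slides)
```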
Morris’ approximate counter From High Performance Python, M. Gorelick & I. Ozsvald. O’Reilly 2014 25 / 73
Morris’ approximate counter Problem: large variance, σ ≃ 0 . 7 t 26 / 73
Reducing the variance, method I ◮ Run r parallel, independent copies of the algorithm ◮ On Query, average their estimates ◮ E[Query] ≃ t, σ ≃ t/√(2r) ◮ Space r log log t ◮ Time per item multiplied by r 27 / 73
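A sketch of method I; `morris_update`, `AveragedMorris` and the choice r = 16 are ours, not from the slides.

```python
import random
import statistics

def morris_update(c):
    """One Morris update: increment c with probability 2**-c."""
    return c + 1 if random.random() < 2 ** -c else c

class AveragedMorris:
    """Method I: r independent Morris counters, averaged on Query; the
    standard deviation of the estimate shrinks by a factor of sqrt(r)."""

    def __init__(self, r):
        self.cs = [0] * r

    def update(self):
        self.cs = [morris_update(c) for c in self.cs]   # time per item times r

    def query(self):
        return statistics.fmean(2 ** c - 2 for c in self.cs)

a = AveragedMorris(r=16)
for _ in range(100_000):
    a.update()
print(a.query())   # typically much closer to 100000 than a single counter
```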
Reducing the variance, method II Use base b < 2 instead of base 2: ◮ Places t in the series 1, b, b², . . . , b^i, . . . (“resolution” b) ◮ E[b^c] ≃ t, σ ≃ √((b − 1)/2) · t ◮ Space log log t − log log b (> log log t, because b < 2) ◮ For b = 1.08, 3 extra bits, σ ≃ 0.2t 28 / 73
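The slides do not spell out the update rule for base b, so this is a sketch under one natural choice: keep the Morris update rule with b in place of 2 and rescale the estimate. Its standard deviation matches the √((b − 1)/2) · t quoted above.

```python
import random

class BaseBMorris:
    """Approximate counter with base b < 2: increment c with probability
    b**-c. After t updates E[b**c] = 1 + t*(b - 1), so the estimate
    (b**c - 1) / (b - 1) is unbiased; its relative error scales like
    sqrt((b - 1) / 2), so a smaller b buys accuracy at the cost of a few
    extra bits for c."""

    def __init__(self, b):
        self.b = b
        self.c = 0

    def update(self):
        if random.random() < self.b ** -self.c:
            self.c += 1

    def query(self):
        return (self.b ** self.c - 1) / (self.b - 1)

m = BaseBMorris(b=1.08)
for _ in range(1_000_000):
    m.update()
print(m.c, m.query())   # c stays below ~200; the estimate is around 10**6
```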
Counting Distinct Elements The Distinct Element Counting Problem How many different elements have we seen so far in the data stream? 29 / 73
Motivation Item spaces and # distinct elements can be large ◮ I’m a web searcher. How many different queries did I get? ◮ I’m a router. How many pairs (sourceIP, destinationIP) have I seen? ◮ item space: potentially 2^128 in IPv6 ◮ I’m a text message service. How many distinct messages have I seen? ◮ item space: essentially infinite ◮ I’m a streaming classifier builder. How many distinct values have I seen for this attribute x? 30 / 73
Counting distinct elements ◮ Item space I , cardinality n , identified with range [ n ] ◮ f i,t = # occurrences of i ∈ I among first t stream elements ◮ d t = number of i ’s for which f i,t > 0 ◮ Often omit subindex t 31 / 73
Counting distinct elements Solving exactly requires O(d) memory Approximate solutions: ◮ Bloom filters: O(d) bits ◮ Cohen’s filter: O(log d) bits ◮ HyperLogLog: O(log log d) bits 32 / 73
Probabilistic Counting [Flajolet-Martin 85] Choose a “good” hash function f : Items → [0..m − 1] Apply f to each item i in the stream, getting f(i) Observe the leading bits of f(i) Idea: To see f(i) = 0^{k−1}1 . . . , we must have seen about 2^k distinct values (Think why!) Algorithm: Keep track of the largest such k seen so far 33 / 73
Flajolet-Martin probabilistic counter Init: p ← 0 Update(x): ◮ let b be the position of the leftmost 1 bit of f(x) ◮ if (b > p) p ← b Query: return 2^p E[2^p] = d/ϕ, for a constant ϕ = 0.77 . . . Memory = (bits to store p) = log p = log log d_max bits 34 / 73
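A sketch of the counter above; MD5 truncated to 32 bits stands in for the "good" hash function (an assumption for illustration only), and, as on the slide, the query returns 2^p, which by the slide's E[2^p] = d/ϕ overestimates d by a factor of roughly 1.3 on average.

```python
import hashlib

class FlajoletMartin:
    """Single-hash Flajolet-Martin counter: remember p, the largest position
    (from the most significant end) of the leftmost 1 bit of f(x) seen so
    far, and estimate the number of distinct items as 2**p."""

    def __init__(self, bits=32):
        self.bits = bits
        self.p = 0

    def _hash(self, x):
        # Stand-in for a "good" hash: the low `bits` bits of MD5.
        digest = hashlib.md5(str(x).encode()).hexdigest()
        return int(digest, 16) & ((1 << self.bits) - 1)

    def update(self, x):
        v = self._hash(x)
        # 1-based position of the leftmost 1 bit; an all-zero hash counts as bits.
        b = self.bits - v.bit_length() + 1 if v else self.bits
        if b > self.p:
            self.p = b

    def query(self):
        return 2 ** self.p      # multiply by ~0.77 to correct the bias

fm = FlajoletMartin()
for x in range(50_000):         # 50000 distinct items, each seen once
    fm.update(x)
print(fm.query())               # same order of magnitude as 50000
```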
Flajolet-Martin: reducing the variance Solution 1: Use r independent copies, then average ◮ Problem 1: runtime multiplied by r ◮ Problem 2: independent runs require independent hash functions ◮ And we don’t know how to generate several truly independent hash functions Note: I am actually skipping the tricky issue of “good hash functions” 35 / 73