CAI: Cerca i Anàlisi d’Informació Grau en Ciència i Enginyeria de Dades, UPC 7. Streaming January 5, 2020 Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldà, Department of Computer Science, UPC 1 / 73
Contents 7. Streaming Data streams everywhere The data stream model Sampling Counting Items Counting Distinct Items Keeping Frequent Elements Counting in Sliding Windows Distributed Sketching References and Resources 2 / 73
Data streams everywhere ◮ Telcos - phone calls ◮ Satellite, radar, sensor data ◮ Computer systems and network monitoring ◮ Search logs, access logs ◮ RSS feeds, social network activity ◮ Websites, clickstreams, query streams ◮ E-commerce, credit card sales ◮ . . . 3 / 73
Example 1: Online shop Thousands of visits / day ◮ Is this “customer” a robot? ◮ Does this customer want to buy? ◮ Is customer lost? Finding what s/he wants? ◮ What products should we recommend to this user? ◮ What ads should we show to this user? ◮ Should we get more machines from the cloud to handle incoming traffic? 4 / 73
Example 2: Web searchers Millions of queries / day ◮ What are the top queries right now? ◮ Which terms are gaining popularity now? ◮ What ads should we show for this query and user? 5 / 73
Example 3: Phone company Hundreds of millions of calls/day ◮ Each call about 1000 bytes per switch ◮ I.e., about 1 TB/month; must keep for billing ◮ Is this call fraudulent? ◮ Why do we get so many call drops in area X? ◮ Should we reroute differently tomorrow? ◮ Is this customer thinking of leaving us? ◮ How to cross-sell / up-sell this customer? 6 / 73
Example 4: Network link Several Gb/minute at UPC’s outlink Really impossible to store ◮ Detect abusive users ◮ Detect anomalous traffic patterns ◮ . . . DDoS attacks, intrusions, etc. 7 / 73
Others ◮ Social networks: Planet-scale streams ◮ Smart cities. Smart vehicles ◮ Internet of Things ◮ (more phones connected to devices than used by humans) ◮ Open data; governmental and scientific ◮ We generate far more data than we can store 8 / 73
Data Streams: Modern times data ◮ Data arrives as sequence of items ◮ At high speed ◮ Forever ◮ Can’t store them all ◮ Can’t go back; or too slow ◮ Evolving, non-stationary reality https://www.youtube.com/watch?v=ANXGJe6i3G8 9 / 73
In algorithmic words. . . The Data Stream axioms: 1. One pass 2. Low time per item - read, process, discard 3. Sublinear memory - only summaries or sketches 4. Anytime, real-time answers 5. The stream evolves over time 10 / 73
Computing in data streams ◮ Approximate answers are often OK ◮ Specifically, in learning and mining contexts ◮ Often computable with surprisingly low memory, one pass 11 / 73
Main Ingredients: Approximation and Randomization ◮ Algorithms use a source of independent random bits ◮ So different runs give different outputs ◮ But “most runs” are “approximately correct” 12 / 73
Randomized Algorithms (ε, δ)-approximation A randomized algorithm A (ε, δ)-approximates a function f : X → R iff for every x ∈ X, with probability ≥ 1 − δ ◮ (absolute approximation) |A(x) − f(x)| < ε ◮ (relative approximation) |A(x) − f(x)| < ε · f(x) Often ε, δ given as inputs to A ε = accuracy; δ = confidence 13 / 73
Randomized Algorithms In traditional statistics one roughly describes a random variable X by giving µ = E[X] and σ² = Var(X). Obtaining (ε, δ)-approximations For any X, there is an algorithm that takes m independent samples of X and outputs an estimate µ̂ such that Pr[|µ̂ − µ| ≤ ε] ≥ 1 − δ for m = O((σ²/ε²) · ln(1/δ)) This is general. For specific X there may be more sample-efficient methods. (Proof omitted; ask if interested). 14 / 73
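The slide states this sample bound without proof; one standard way to achieve it is the median-of-means estimator. Below is a minimal sketch, assuming only that `sample()` draws independent copies of X and that `sigma2` upper-bounds Var(X); the constants 4 and 8 (Chebyshev within each group, Hoeffding for the median) are one common choice and are not taken from the slides.

```python
import math
import random
import statistics

def estimate_mean(sample, sigma2, eps, delta):
    """Median-of-means: O((sigma2/eps**2) * ln(1/delta)) samples overall.

    Each group mean is within eps of the true mean with probability >= 3/4
    (Chebyshev); taking the median of ~8*ln(1/delta) group means boosts the
    confidence to 1 - delta (Hoeffding).
    """
    group_size = max(1, math.ceil(4 * sigma2 / eps ** 2))
    n_groups = max(1, math.ceil(8 * math.log(1 / delta)))
    means = [statistics.fmean(sample() for _ in range(group_size))
             for _ in range(n_groups)]
    return statistics.median(means)

# Example: estimate the mean of a fair die (mu = 3.5, sigma^2 = 35/12)
print(estimate_mean(lambda: random.randint(1, 6), 35 / 12, eps=0.1, delta=0.01))
```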
Five problems on Data Streams ◮ Keeping a uniform sample ◮ Counting total elements ◮ Counting distinct elements ◮ Counting frequent elements - heavy hitters ◮ Counting in a sliding window The solutions are interesting not only in streaming mode, but whenever you want to reduce memory. 15 / 73
Sampling: Dealing with Velocity At time t , process element t with probability α Compute your query on the sampled elements only You process about αt elements instead of t , then extrapolate. 16 / 73
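A minimal sketch of this α-sampling idea; the click stream and the "buy" query below are invented purely for illustration.

```python
import random

def bernoulli_sample(stream, alpha):
    """Keep each stream element independently with probability alpha (one pass)."""
    return [x for x in stream if random.random() < alpha]

# Hypothetical use: count "buy" events in the sample, then extrapolate
# by dividing by alpha.
stream = ["view", "buy", "view", "view", "buy"] * 1000   # 2000 buys in total
sample = bernoulli_sample(stream, alpha=0.01)
print(sample.count("buy") / 0.01)                         # roughly 2000
```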
Sampling: Dealing with Velocity AND Memory Similar problem: Keep a uniform sample S of elements of some size k At every time t , each of the first t elements is in S with probability k/t How to make early elements as likely to be in S as later elements? 17 / 73
Reservoir Sampling Reservoir Sampling [Vitter85] ◮ Add the first k stream elements to S ◮ Choose to keep t -th item with probability k/t ◮ If chosen, replace one element from S at random 18 / 73
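A direct Python rendering of the procedure above (Vitter's Algorithm R); the `range(...)` stream is only for illustration.

```python
import random

def reservoir_sample(stream, k):
    """Reservoir sampling: uniform sample of size k from a stream of
    unknown length, in one pass and O(k) memory."""
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            sample.append(x)                 # the first k items always enter
        elif random.random() < k / t:        # keep item t with probability k/t
            sample[random.randrange(k)] = x  # ...evicting a random current member
    return sample

print(reservoir_sample(range(1_000_000), k=10))
```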
Reservoir Sampling: why does it work? Claim: for every t, for every i ≤ t, P_{i,t} = Pr[s_i in sample at time t] = k/t Suppose true at time t. At time t + 1, P_{t+1,t+1} = Pr[s_{t+1} sampled] = k/(t + 1) and for i ≤ t, s_i is in the sample at time t + 1 iff it was before, and not (s_{t+1} sampled and it kicks out exactly s_i): P_{i,t+1} = (k/t) · (1 − (k/(t+1)) · (1/k)) = (k/t) · (1 − 1/(t+1)) = (k/t) · (t/(t+1)) = k/(t+1) 19 / 73
Counting Items
Counting Items How many items have we read so far? To count up to t elements exactly, log t bits are necessary Morris’ counter: count approximately using log log t bits Can count up to 1 billion with log log 10^9 ≈ 5 bits 21 / 73
Approximate counting: Saving 1 bit Approximate counting, v1 Init: c ← 0 Update: draw a random number x ∈ [0, 1]; if (x ≤ 1/2) c ← c + 1 Query: return 2c E[2c] = t, σ(c) ≃ √t / 2 Space log(t/2) = log t − 1 → we saved 1 bit! 22 / 73
Approximate counting: Saving k bits Approximate counting, v2 Init: c ← 0 Update: draw a random number x ∈ [0, 1]; if (x ≤ 2^{−k}) c ← c + 1 Query: return 2^k · c E[c] = t/2^k, σ(c) ≃ √(t/2^k) Memory log t − k → we saved k bits! Testing x ≤ 2^{−k}: AND of k random bits, log k memory 23 / 73
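A sketch covering both versions above (k = 1 is v1); it draws a float in [0, 1] instead of AND-ing k random bits, which is simpler to read but forgoes the log k memory trick mentioned on the slide. The class name and the example sizes are ours.

```python
import random

class ApproxCounter:
    """Probabilistic counter of v1/v2: add 1 to c with probability 2**-k,
    report c * 2**k. Larger k saves more bits of memory for c, at the cost
    of a larger standard deviation of the estimate."""

    def __init__(self, k):
        self.k = k
        self.c = 0

    def update(self):
        if random.random() < 2 ** -self.k:
            self.c += 1

    def query(self):
        return self.c * 2 ** self.k

cnt = ApproxCounter(k=4)
for _ in range(100_000):
    cnt.update()
print(cnt.c, cnt.query())   # c is around 6250; the estimate is around 100000
```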
Approximate counting: Morris’ counter Morris’ counter [Morris77] Init: c ← 0 Update: draw a random number x ∈ [0, 1]; if (x ≤ 2^{−c}) c ← c + 1 Query: return 2^c − 2 E[2^c − 2] ≃ t, E[c] ≃ log t, σ ≃ t/√2 Memory = bits used to hold c = log c = log log t bits 24 / 73
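A sketch of Morris' counter as stated above; a real implementation would pack c into its log log t bits, here c is an ordinary Python int.

```python
import random

class MorrisCounter:
    """Morris' counter: increment c with probability 2**-c."""

    def __init__(self):
        self.c = 0

    def update(self):
        if random.random() < 2 ** -self.c:
            self.c += 1

    def query(self):
        # The slide returns 2**c - 2; with c initialised to 0 the exactly
        # unbiased estimate is 2**c - 1, a negligible difference for large t.
        return 2 ** self.c - 2

m = MorrisCounter()
for _ in range(1_000_000):
    m.update()
print(m.c, m.query())   # c ends up near 20; the estimate is of the order of
                        # 10**6, but with large variance (next slides)
```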
Morris’ approximate counter From High Performance Python, M. Gorelick & I. Ozsvald. O’Reilly 2014 25 / 73
Morris’ approximate counter Problem: large variance, σ ≃ 0 . 7 t 26 / 73
Reducing the variance, method I ◮ Run r parallel, independent copies of the algorithm ◮ On Query, average their estimates ◮ E[Query] ≃ t, σ ≃ t/√(2r) ◮ Space r log log t ◮ Time per item multiplied by r 27 / 73
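A sketch of method I; `morris_update`, `AveragedMorris` and the choice r = 16 are ours, not from the slides.

```python
import random
import statistics

def morris_update(c):
    """One Morris update: increment c with probability 2**-c."""
    return c + 1 if random.random() < 2 ** -c else c

class AveragedMorris:
    """Method I: r independent Morris counters, averaged on Query; the
    standard deviation of the estimate shrinks by a factor of sqrt(r)."""

    def __init__(self, r):
        self.cs = [0] * r

    def update(self):
        self.cs = [morris_update(c) for c in self.cs]   # time per item times r

    def query(self):
        return statistics.fmean(2 ** c - 2 for c in self.cs)

a = AveragedMorris(r=16)
for _ in range(100_000):
    a.update()
print(a.query())   # typically much closer to 100000 than a single counter
```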
Reducing the variance, method II Use base b < 2 instead of base 2: ◮ Places t in the series 1, b, b², . . . , b^i, . . . (“resolution” b) ◮ E[b^c] ≃ t, σ ≃ √((b − 1)/2) · t ◮ Space log log t − log log b (> log log t, because b < 2) ◮ For b = 1.08, 3 extra bits, σ ≃ 0.2t 28 / 73
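The slides do not spell out the update rule for base b, so this is a sketch under one natural choice: keep the Morris update rule with b in place of 2 and rescale the estimate. Its standard deviation matches the √((b − 1)/2) · t quoted above.

```python
import random

class BaseBMorris:
    """Approximate counter with base b < 2: increment c with probability
    b**-c. After t updates E[b**c] = 1 + t*(b - 1), so the estimate
    (b**c - 1) / (b - 1) is unbiased; its relative error scales like
    sqrt((b - 1) / 2), so a smaller b buys accuracy at the cost of a few
    extra bits for c."""

    def __init__(self, b):
        self.b = b
        self.c = 0

    def update(self):
        if random.random() < self.b ** -self.c:
            self.c += 1

    def query(self):
        return (self.b ** self.c - 1) / (self.b - 1)

m = BaseBMorris(b=1.08)
for _ in range(1_000_000):
    m.update()
print(m.c, m.query())   # c stays below ~200; the estimate is around 10**6
```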
Counting Distinct Elements The Distinct Element Counting Problem How many different elements have we seen so far in the data stream? 29 / 73
Motivation Item spaces and # distinct elements can be large ◮ I’m a web searcher. How many different queries did I get? ◮ I’m a router. How many pairs (sourceIP, destinationIP) have I seen? ◮ item space: potentially 2^128 in IPv6 ◮ I’m a text message service. How many distinct messages have I seen? ◮ item space: essentially infinite ◮ I’m a streaming classifier builder. How many distinct values have I seen for this attribute x? 30 / 73
Counting distinct elements ◮ Item space I , cardinality n , identified with range [ n ] ◮ f i,t = # occurrences of i ∈ I among first t stream elements ◮ d t = number of i ’s for which f i,t > 0 ◮ Often omit subindex t 31 / 73
Counting distinct elements Solving exactly requires O(d) memory Approximate solutions: ◮ Bloom filters: O(d) bits ◮ Cohen’s filter: O(log d) bits ◮ HyperLogLog: O(log log d) bits 32 / 73
Probabilistic Counting [Flajolet-Martin 85] Choose a “good” hash function f : Items → [0..m − 1] Apply f to each item i in the stream, getting f(i) Observe the leading bits of f(i) Idea: To see f(i) = 0^{k−1}1 . . . , we must have seen about 2^k distinct values (Think why!) Algorithm: Keep track of the largest such k seen so far 33 / 73
Flajolet-Martin probabilistic counter Init: p ← 0 Update(x): ◮ let b be the position of the leftmost 1 bit of f(x) ◮ if (b > p) p ← b Query: return 2^p E[2^p] = d/ϕ, for a constant ϕ = 0.77 . . . Memory = (bits to store p) = log p = log log d_max bits 34 / 73
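A sketch of the counter above; MD5 truncated to 32 bits stands in for the "good" hash function (an assumption for illustration only), and, as on the slide, the query returns 2^p, which by the slide's E[2^p] = d/ϕ overestimates d by a factor of roughly 1.3 on average.

```python
import hashlib

class FlajoletMartin:
    """Single-hash Flajolet-Martin counter: remember p, the largest position
    (from the most significant end) of the leftmost 1 bit of f(x) seen so
    far, and estimate the number of distinct items as 2**p."""

    def __init__(self, bits=32):
        self.bits = bits
        self.p = 0

    def _hash(self, x):
        # Stand-in for a "good" hash: the low `bits` bits of MD5.
        digest = hashlib.md5(str(x).encode()).hexdigest()
        return int(digest, 16) & ((1 << self.bits) - 1)

    def update(self, x):
        v = self._hash(x)
        # 1-based position of the leftmost 1 bit; an all-zero hash counts as bits.
        b = self.bits - v.bit_length() + 1 if v else self.bits
        if b > self.p:
            self.p = b

    def query(self):
        return 2 ** self.p      # multiply by ~0.77 to correct the bias

fm = FlajoletMartin()
for x in range(50_000):         # 50000 distinct items, each seen once
    fm.update(x)
print(fm.query())               # same order of magnitude as 50000
```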
Flajolet-Martin: reducing the variance Solution 1: Use r independent copies, then average ◮ Problem 1: runtime multiplied by r ◮ Problem 2: independent runs require independent hash functions ◮ And we don’t know how to generate several truly independent hash functions Note: I am actually skipping the tricky issue of “good hash functions” 35 / 73