Streaming Algorithms CSE 545 - Spring 2017

Big Data Analytics -- The Class
We will learn:
● to analyze different types of data:
  ○ high dimensional
  ○ graphs
  ○ infinite/never-ending
  ○ labeled
● to use different models of computation:
  ○ MapReduce
  ○ streams and online algorithms
  ○ single machine in-memory
  ○ Spark
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, www.mmds.org

Motivation
One often does not know when a set of data will end.
● Cannot store
● Not practical to access repeatedly
● Rapidly arriving
● Does not make sense to ever “insert” into a database
What if the data cannot fit on disk, but we would still like to generalize / summarize it?
Examples:
● Google search queries
● Satellite imagery data
● Text messages, status updates
● Click streams

Stream Queries
Standing Queries: Stored and permanently executing.
Ad-Hoc: One-time questions -- must store expected parts / summaries of streams
E.g. How would you handle: What is the mean of values seen so far?
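One way to answer the "mean of values seen so far" question as a standing query is to keep only a count and a running sum, never the stream itself. The sketch below is an illustration under that assumption; the class name `RunningMean` and its methods are hypothetical and do not come from the slides.

```python
class RunningMean:
    """Standing query: mean of all values seen so far, in O(1) memory."""

    def __init__(self):
        self.count = 0    # number of elements seen so far
        self.total = 0.0  # sum of the elements seen so far

    def update(self, value):
        """Consume one stream element."""
        self.count += 1
        self.total += value

    def mean(self):
        """Return the current mean, or None if nothing has arrived yet."""
        if self.count == 0:
            return None
        return self.total / self.count


# Usage: feed elements one at a time, query whenever needed.
running = RunningMean()
for element in [4, 1, 8, 5, 0, 2, 11, 3, 4]:
    running.update(element)
current_mean = running.mean()
```

Only two numbers are stored regardless of how long the stream runs, which is the kind of summary a standing query has to rely on.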

We will cover the following algorithms:
● Sampling
● Filtering Data
● Count Distinct Elements
● Counting Moments

General Stream Processing Model
[Diagram: Input stream (…, 4, 3, 11, 2, 0, 5, 8, 1, 4) → Processor → Output (Generalization, Summarization). The Processor is also labeled with: standing queries, ad-hoc queries, limited memory, archival storage.]
A stream of records (also often referred to as “elements” or “tuples”)

Sampling and Filtering Data
Sampling: Create a random sample for statistical analysis.
Basic Idea: generate a random number for each record; keep the record if the number is < sample%
Problem: records/rows usually are not the units of analysis for statistical analyses
Potential Solution:
● Assume we are provided some key as the unit of analysis to sample over
  ○ E.g. ip_address, user_id, document_id, etc.
● Want 1/20th of all “keys” (e.g. users)
  ○ Hash to 20 buckets; bucket 1 is “in”; others are “out”
  ○ Note: do not need to store anything (except hash functions); may be part of standing query
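The key-based sampling idea can be sketched as follows. This is a minimal illustration, not the course's implementation: the function names, the `user_id` field, and the choice of SHA-256 as the hash function are all assumptions made for the example. Because the decision depends only on the key, every record for a sampled user is kept and nothing has to be stored.

```python
import hashlib

NUM_BUCKETS = 20  # keep 1/20th of all keys
KEEP_BUCKET = 1   # bucket 1 is "in"; all others are "out"


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Map a key to one of num_buckets buckets with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def in_sample(key, num_buckets=NUM_BUCKETS, keep_bucket=KEEP_BUCKET):
    """True if this key belongs to the sample."""
    return bucket_of(key, num_buckets) == keep_bucket


def sample_stream(records, key_field):
    """Yield only the records whose key hashes to the kept bucket."""
    for record in records:
        if in_sample(record[key_field]):
            yield record


# Usage with hypothetical click records: two clicks per user.
clicks = [{"user_id": f"user{i % 1000}", "page": i} for i in range(2000)]
sampled = list(sample_stream(clicks, "user_id"))
```

A stable hash is used instead of Python's built-in `hash` because the built-in is randomized per process for strings, which would change the sample between runs.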

Sampling and Filtering Data
Filtering: Select elements with property x
Example: 40B email addresses to bypass spam filter
● The Bloom Filter (approximates; allows FPs, but not FNs)
  ○ Given:
    ■ |S| keys to filter; will be mapped to |B| bits
    ■ hashes = h_1, h_2, …, h_k independent hash functions
  ○ Algorithm:
      set all B to 0
      for each i in hashes, for each s in S:
          set B[h_i(s)] = 1
      … #usually embedded in other code
      while key x arrives next in stream:
          if B[h_i(x)] == 1 for all i in hashes:
              do as if x is in S
          else:
              do as if x not in S
What is the probability of a false positive?
● What fraction of |B| are 1s? Like throwing d = |S| * k darts at n = |B| targets.
  ○ 1 dart: hits a given target with probability $1/n$
  ○ d darts: $(1 - 1/n)^d$ = prob. that a given target is still 0, which is $\approx e^{-d/n}$; so a fraction $e^{-d/n}$ of the bits are 0s
  ○ thus, a fraction $(1 - e^{-d/n})$ are 1s
● Probability of all k hashes being 1? $(1 - e^{-(|S| \cdot k)/n})^k$
Note: Can expand S as the stream continues as long as |B| has room (e.g. adding verified email addresses)
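The Bloom filter pseudocode and the false-positive estimate can be sketched together as below. This is an illustration under stated assumptions, not the slide author's code: the class and function names are hypothetical, the k hash functions are simulated by salting SHA-256 with the hash index, and one byte is used per bit for readability rather than compactness.

```python
import hashlib
import math


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits        # n = |B|
        self.num_hashes = num_hashes    # k
        self.bits = bytearray(num_bits) # all bits start at 0

    def _positions(self, key):
        # Simulate k hash functions by salting one hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Set B[h_i(key)] = 1 for every hash function."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        """True only if B[h_i(key)] == 1 for all hash functions."""
        return all(self.bits[pos] for pos in self._positions(key))


def expected_false_positive_rate(num_keys, num_bits, num_hashes):
    """(1 - e^(-|S| * k / n))^k, the estimate derived on the slide."""
    return (1 - math.exp(-num_keys * num_hashes / num_bits)) ** num_hashes


# Usage: load the allowed addresses, then test keys as they stream in.
allowed = BloomFilter(num_bits=10_000, num_hashes=3)
for i in range(1_000):
    allowed.add(f"good{i}@example.com")
is_allowed = "good42@example.com" in allowed
estimate = expected_false_positive_rate(1_000, 10_000, 3)
```

With 1,000 keys, 10,000 bits, and 3 hashes, the formula gives an estimate of a little under 2% false positives; the measured rate on a real stream will vary around that.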

Counting Moments
Moments:
● Suppose m_i is the count of distinct element i in the data
● The kth moment of the stream is $\sum_{i} (m_i)^k$
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
0th moment
One Solution: Just keep a set (hashmap, dictionary, heap)
Problem: Can’t maintain that many in memory; disk storage is too slow
Streaming Solution: Flajolet-Martin Algorithm
    Pick a hash, h, to map each of n elements to log_2 n bits
    R = 0 #potential max number of zeros at tail
    for each stream element, e:
        r(e) = num of trailing 0s from h(e)
        R = r(e) if r(e) > R
    estimated_distinct_elements = 2^R
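The Flajolet-Martin pseudocode can be sketched as runnable Python along the following lines. The function names, the 32-bit hash width, and the use of SHA-256 as the hash h are assumptions made for this example. A single hash gives a coarse estimate that is always a power of two; combining several hashes is a common refinement that is not shown here.

```python
import hashlib


def trailing_zeros(value, width):
    """Number of trailing 0 bits in value; an all-zero value counts as width."""
    if value == 0:
        return width
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1


def flajolet_martin(stream, hash_bits=32):
    """Estimate the number of distinct elements as 2^R using one hash."""
    max_zeros = 0  # R: largest number of trailing zeros seen so far
    mask = (1 << hash_bits) - 1
    for element in stream:
        digest = hashlib.sha256(str(element).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:8], "big") & mask
        max_zeros = max(max_zeros, trailing_zeros(hashed, hash_bits))
    return 2 ** max_zeros


# Usage: repeated elements do not change the estimate, because the same
# element always hashes to the same value.
estimate = flajolet_martin(["a", "b", "c", "a", "b", "a"])
```

Only the single integer R is kept in memory, in contrast to the set-based solution, whose size grows with the number of distinct elements.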
