Algorithms for Big‐Data Management CompSci 590.02 Instructor: Ashwin Machanavajjhala Lecture 1 : 590.02 Spring 13 1
Administrivia
http://www.cs.duke.edu/courses/spring13/compsci590.2/
• Tue/Thu 3:05 – 4:20 PM
• “Reading Course + Project”
– No exams!
– Every class is based on 1 (or 2) assigned papers that students must read.
• Projects: (50% of grade)
– Individual or groups of size 2-3
• Class participation + assignments (other 50%)
• Office hours: by appointment
Administrivia
• Projects: (50% of grade)
– Ideas will be posted in the coming weeks
• Goals:
– Literature review
– Some original research/implementation
• Timeline (details will be posted on the website soon)
– By Feb 12: Choose project (ideas will be posted … new ideas welcome)
– Feb 21: Project proposal (1-4 pages describing the project)
– Mar 21: Mid-project review (2-3 page report on progress)
– Apr 18: Final presentations and submission (6-10 page conference-style paper + 20 minute talk)
Why should you take this course?
• Industry, academic, and government research identifies the value of analyzing large data collections in all walks of life.
– “What Next? A Half-Dozen Data Management Research Goals for Big Data and Cloud”, Surajit Chaudhuri, Microsoft Research
– “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute Report, 2011
Why should you take this course?
• Very active field with tons of interesting research. We will read papers in:
– Data Management
– Theory
– Machine Learning
– …
Why should you take this course?
• Intro to research by working on a cool project
– Read scientific papers
– Formulate a problem
– Perform a scientific evaluation
Today
• Course overview
• An algorithm for sampling
INTRODUCTION
What is Big Data?
[Infographic: http://visual.ly/what-big-data]
3 Key Trends
• Increased data collection
• (Shared nothing) Parallel processing frameworks on commodity hardware
• Powerful analysis of trends by linking data from heterogeneous sources
Big-Data impacts all aspects of our life
The value in Big-Data … Recommended links Top Searches Personalized News Interests +43% clicks +79% clicks +250% clicks vs. editor selected vs. randomly selected vs. editorial one size fits all
The value in Big-Data …
“If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.”
McKinsey Global Institute Report
Example: Google Flu
http://www.ccs.neu.edu/home/amislove/twittermood/
Course Overview
• Sampling
– Reservoir Sampling
– Sampling with indices
– Sampling from Joins
– Markov chain Monte Carlo sampling
– Graph Sampling & PageRank
Course Overview
• Sampling
• Streaming Algorithms
– Sketches
– Online Aggregation
– Windowed queries
– Online learning
Course Overview
• Sampling
• Streaming Algorithms
• Parallel Architectures & Algorithms
– PRAM
– Map Reduce
– Graph processing architectures: Bulk Synchronous Parallel and asynchronous models
– (Graph connectivity, Matrix Multiplication, Belief Propagation)
Course Overview
• Sampling
• Streaming Algorithms
• Parallel Architectures & Algorithms
• Joining datasets & Record Linkage
– Theta Joins: or how to optimally join two large datasets
– Clustering similar documents using minHash
– Identifying matching users across social networks
– Correlation Clustering
– Markov Logic Networks
SAMPLING
Why Sampling?
• Approximately compute quantities when
– Processing the entire dataset takes too long.
  Example: How many tweets mention Obama?
– Computation is intractable.
  Example: Number of satisfying assignments for a DNF.
– We do not have access to the entire data, or access is expensive.
  Examples: How many restaurants does Google know about? Number of users in Facebook whose birthday is today. What fraction of the population has the flu?
Zero-One Estimator Theorem
Input:
• A universe of items U (e.g., all tweets)
• A subset G of U (e.g., tweets mentioning Obama)
Goal: Estimate μ = |G|/|U|
Algorithm:
• Pick N samples {x1, x2, …, xN} uniformly at random from U
• For each sample, let Yi = 1 if xi ∈ G, and Yi = 0 otherwise
• Output: Y = (Σ Yi)/N
Theorem: Let ε < 2. If N > (1/μ) · (4 ln(2/δ) / ε²), then Pr[(1−ε)μ < Y < (1+ε)μ] > 1−δ
Proof of the Zero-One Estimator Theorem: Homework
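The estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: samples are drawn uniformly with replacement, a lower bound on μ is assumed to be known so that N can be computed from the theorem, and the names `required_samples`, `zero_one_estimate`, and the toy universe are hypothetical, not from the lecture.

```python
import math
import random


def required_samples(mu_lower, eps, delta):
    """Smallest integer N with N > (1/mu) * 4 ln(2/delta) / eps^2.

    mu_lower is an assumed lower bound on the true fraction mu.
    """
    bound = (1.0 / mu_lower) * 4.0 * math.log(2.0 / delta) / eps ** 2
    return math.floor(bound) + 1


def zero_one_estimate(universe, in_g, n_samples, rng):
    """Draw n_samples items uniformly (with replacement) and
    return Y, the fraction of samples that fall in G."""
    hits = sum(1 for _ in range(n_samples) if in_g(rng.choice(universe)))
    return hits / n_samples


# Toy universe (an assumption for illustration): G = multiples of 5, so mu = 0.2.
rng = random.Random(0)
universe = list(range(10_000))
n = required_samples(mu_lower=0.2, eps=0.1, delta=0.05)
estimate = zero_one_estimate(universe, lambda x: x % 5 == 0, n, rng)
print(n, estimate)
```

Note that N grows as 1/μ: the rarer the subset G, the more samples are needed for the same relative error ε.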
Simple Random Sample
• Given a table of size N, pick a subset of n rows, such that each subset of n rows is equally likely.
• How to sample n rows?
• … if we don’t know N?
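When N is known and the table supports random access, one simple approach is to draw n distinct row positions uniformly at random. The sketch below assumes an in-memory table and relies on Python's standard `random.sample`; the function name `simple_random_sample` and the example rows are hypothetical.

```python
import random


def simple_random_sample(table, n, rng=random):
    """Return n rows such that every size-n subset is equally likely.

    Assumes len(table) (that is, N) is known and rows can be indexed.
    """
    if n > len(table):
        raise ValueError("cannot sample more rows than the table holds")
    # random.sample draws n distinct positions without replacement.
    positions = rng.sample(range(len(table)), n)
    return [table[i] for i in positions]


rows = [f"row-{i}" for i in range(100)]
sample = simple_random_sample(rows, 10, random.Random(7))
print(sample)
```

This does not help when N is unknown or the data arrives as a stream that can be read only once.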
Reservoir Sampling
Highlights:
• Make one pass over the data.
• Maintain a reservoir of n records.
• After reading t rows, the reservoir is a simple random sample of the first t rows.
Reservoir Sampling [Vitter, ACM ToMS '85]
Algorithm R:
• Initialize the reservoir to the first n rows.
• For the (t+1)st row R (with t ≥ n):
– Pick an integer m uniformly at random between 1 and t+1 (inclusive)
– If m <= n, then replace the m-th row in the reservoir with R
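A minimal Python sketch of Algorithm R as described on this slide follows. The function name `reservoir_sample` and the example stream are illustrative assumptions; the rows can come from any iterable, so N need not be known in advance.

```python
import random
from itertools import islice


def reservoir_sample(rows, n, rng=random):
    """One pass over `rows`, keeping a reservoir of at most n rows."""
    it = iter(rows)
    # Initialize the reservoir to the first n rows.
    reservoir = list(islice(it, n))
    # t is the number of rows read before the current one,
    # so `row` is the (t+1)st row of the stream.
    for t, row in enumerate(it, start=n):
        m = rng.randint(1, t + 1)  # uniform integer in [1, t+1], inclusive
        if m <= n:
            reservoir[m - 1] = row  # replace the m-th reservoir row
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(1))
print(sample)
```

If the stream holds fewer than n rows, this sketch simply returns all of them.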
Proof (by induction on the number of rows N)
• If N = n, then P[row is in sample] = 1. Hence, the reservoir contains all the rows in the table.
• Suppose that for N = t, the reservoir is a simple random sample. That is, each row has an n/t chance of appearing in the sample.
• For N = t+1:
– The (t+1)st row is included in the sample with probability n/(t+1), the chance that m <= n.
– Any other row stays in the reservoir only if it was there after t steps and its slot is not chosen (each slot is chosen with probability 1/(t+1)):
  P[row is in reservoir] = P[row is in reservoir after t steps] · P[row is not replaced] = (n/t) · (1 − 1/(t+1)) = n/(t+1)
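The inclusion probability n/N from the proof can also be checked empirically. The sketch below is a sanity check rather than a proof: it reruns a compact version of Algorithm R many times and compares each row's observed inclusion frequency with n/N. The names `algorithm_r`, `trials`, and `max_deviation`, as well as the chosen sizes, are hypothetical.

```python
import random
from collections import Counter


def algorithm_r(rows, n, rng):
    """Compact Algorithm R: t rows have been read before `row`."""
    reservoir = []
    for t, row in enumerate(rows):
        if t < n:
            reservoir.append(row)  # fill the reservoir with the first n rows
        else:
            m = rng.randint(1, t + 1)  # uniform in [1, t+1]
            if m <= n:
                reservoir[m - 1] = row
    return reservoir


rng = random.Random(42)
N, n, trials = 20, 5, 20_000
counts = Counter()
for _ in range(trials):
    counts.update(algorithm_r(range(N), n, rng))

# Each row should appear in roughly n/N = 0.25 of the trials.
frequencies = {row: counts[row] / trials for row in range(N)}
max_deviation = max(abs(f - n / N) for f in frequencies.values())
print(max_deviation)
```

Equal inclusion frequencies are necessary for a simple random sample but do not by themselves show that every size-n subset is equally likely.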