Algorithms for Big-Data Management
CompSci 590.02
Instructor: Ashwin Machanavajjhala
Lecture 1 : 590.02 Spring 13

Administrivia
http://www.cs.duke.edu/courses/spring13/compsci590.2/
• Tue/Thu 3:05 – 4:20 PM
• “Reading Course + Project”
  – No exams!
  – Every class is based on 1 (or 2) assigned papers that students must read.
• Projects: (50% of grade)
  – Individual or groups of size 2-3
• Class participation + assignments (other 50%)
• Office hours: by appointment

Administrivia
• Projects: (50% of grade)
  – Ideas will be posted in the coming weeks
• Goals:
  – Literature review
  – Some original research/implementation
• Timeline (details will be posted on the website soon)
  – ≤ Feb 12: Choose project (ideas will be posted … new ideas welcome)
  – Feb 21: Project proposal (1-4 pages describing the project)
  – Mar 21: Mid-project review (2-3 page report on progress)
  – Apr 18: Final presentations and submission (6-10 page conference-style paper + 20 minute talk)

Why you should take this course?
• Industry, academic and government research identifies the value of analyzing large data collections in all walks of life.
  – “What Next? A Half-Dozen Data Management Research Goals for Big Data and Cloud”, Surajit Chaudhuri, Microsoft Research
  – “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute Report, 2011

Why you should take this course?
• Very active field and tons of interesting research.
• We will read papers in:
  – Data Management
  – Theory
  – Machine Learning
  – …

Why you should take this course?
• Intro to research by working on a cool project
  – Read scientific papers
  – Formulate a problem
  – Perform a scientific evaluation

Today
• Course overview
• An algorithm for sampling

INTRODUCTION

What is Big Data?

[Infographic] Source: http://visual.ly/what-big-data

3 Key Trends
• Increased data collection
• (Shared nothing) Parallel processing frameworks on commodity hardware
• Powerful analysis of trends by linking data from heterogeneous sources

Big-Data impacts all aspects of our life

The value in Big-Data …
[Figure: panels labeled Recommended links, Top Searches, Personalized, News Interests]
[Figure annotations: +43% clicks vs. editor selected; +79% clicks vs. randomly selected; +250% clicks vs. editorial one size fits all]

The value in Big-Data …
“If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.”
McKinsey Global Institute Report

Example: Google Flu

[Figure] Source: http://www.ccs.neu.edu/home/amislove/twittermood/

Course Overview
• Sampling
  – Reservoir Sampling
  – Sampling with indices
  – Sampling from Joins
  – Markov chain Monte Carlo sampling
  – Graph Sampling & PageRank

Course Overview
• Sampling
• Streaming Algorithms
  – Sketches
  – Online Aggregation
  – Windowed queries
  – Online learning

Course Overview
• Sampling
• Streaming Algorithms
• Parallel Architectures & Algorithms
  – PRAM
  – Map Reduce
  – Graph processing architectures: Bulk Synchronous Parallel and asynchronous models
  – (Graph connectivity, Matrix Multiplication, Belief Propagation)

Course Overview
• Sampling
• Streaming Algorithms
• Parallel Architectures & Algorithms
• Joining datasets & Record Linkage
  – Theta Joins: or how to optimally join two large datasets
  – Clustering similar documents using minHash
  – Identifying matching users across social networks
  – Correlation Clustering
  – Markov Logic Networks

SAMPLING

Why Sampling?
• Approximately compute quantities when
  – Processing the entire dataset takes too long.
      How many tweets mention Obama?
  – Computation is intractable.
      Number of satisfying assignments for a DNF.
  – We do not have access, or it is expensive to get access, to the entire data.
      How many restaurants does Google know about?
      Number of users in Facebook whose birthday is today.
      What fraction of the population has the flu?

Zero-One Estimator Theorem
Input: A universe of items $U$ (e.g., all tweets)
       A subset $G \subseteq U$ (e.g., tweets mentioning Obama)
Goal: Estimate $\mu = |G|/|U|$
Algorithm:
• Pick $N$ samples from $U$: $\{x_1, x_2, \ldots, x_N\}$
• For each sample, let $Y_i = 1$ if $x_i \in G$, and $Y_i = 0$ otherwise.
• Output: $Y = \frac{1}{N}\sum_{i=1}^{N} Y_i$
Theorem: Let $\varepsilon < 2$. If $N > \frac{1}{\mu}\cdot\frac{4\ln(2/\delta)}{\varepsilon^{2}}$, then
  $\Pr[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu] > 1-\delta$
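The estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say how the samples are drawn, so the sketch assumes independent uniform draws with replacement, and the names `required_samples`, `zero_one_estimate`, and the toy membership test are hypothetical. Since μ is the unknown being estimated, `required_samples` takes a lower bound on μ that the caller is assumed to know.

```python
import math
import random


def required_samples(mu_lower, eps, delta):
    """Smallest integer N with N > (1/mu) * 4 ln(2/delta) / eps^2.

    mu_lower is an assumed lower bound on mu (mu itself is unknown).
    """
    bound = (1.0 / mu_lower) * 4.0 * math.log(2.0 / delta) / eps ** 2
    return math.floor(bound) + 1


def zero_one_estimate(universe, in_g, num_samples, rng):
    """Estimate mu = |G|/|U| as the fraction of sampled items that lie in G.

    Assumption: samples are independent and uniform, drawn with replacement.
    """
    hits = 0
    for _ in range(num_samples):
        x = rng.choice(universe)   # one uniform sample x_i from U
        if in_g(x):                # Y_i = 1 if x_i is in G, else 0
            hits += 1
    return hits / num_samples      # Y = (sum of Y_i) / N


if __name__ == "__main__":
    rng = random.Random(0)
    universe = list(range(1000))            # toy universe U

    def in_g(x):
        return x % 4 == 0                   # toy subset G, so mu = 0.25

    n = required_samples(mu_lower=0.25, eps=0.2, delta=0.05)
    print(n, zero_one_estimate(universe, in_g, n, rng))
```

In practice the membership test `in_g` would be whatever check defines G (for instance, whether a tweet mentions a given name), and the universe would be sampled through an index or API instead of being held in a list.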

Zero-One Estimator Theorem
Algorithm:
• Pick $N$ samples from $U$: $\{x_1, x_2, \ldots, x_N\}$
• For each sample, let $Y_i = 1$ if $x_i \in G$, and $Y_i = 0$ otherwise.
• Output: $Y = \frac{1}{N}\sum_{i=1}^{N} Y_i$
Theorem: Let $\varepsilon < 2$. If $N > \frac{1}{\mu}\cdot\frac{4\ln(2/\delta)}{\varepsilon^{2}}$, then
  $\Pr[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu] > 1-\delta$
Proof: Homework
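To get a feel for the sample-size bound in the theorem, here is a worked instance with illustrative values that are not from the slides: $\mu = 0.01$, $\varepsilon = 0.1$, $\delta = 0.05$ (so $2/\delta = 40$).

```latex
\[
N \;>\; \frac{1}{\mu}\cdot\frac{4\ln(2/\delta)}{\varepsilon^{2}}
  \;=\; \frac{1}{0.01}\cdot\frac{4\ln(40)}{(0.1)^{2}}
  \;\approx\; 100\cdot\frac{14.756}{0.01}
  \;\approx\; 147{,}555
\]
```

So roughly $1.5\times10^{5}$ samples suffice for a 10% relative error with 95% confidence in this case. The bound grows as $1/\mu$, so rarer subsets need proportionally more samples, and it does not depend on $|U|$.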

Simple Random Sample
• Given a table of size N, pick a subset of n rows, such that each subset of n rows is equally likely.
• How to sample n rows?
• … if we don’t know N?
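For the easy case where N is known up front, one possible sketch is to draw n distinct row indices uniformly at random. The helper name `sample_row_indices` is hypothetical; it relies on Python's standard `random.Random.sample`, which selects without replacement. The case where N is unknown is what reservoir sampling addresses.

```python
import random


def sample_row_indices(num_rows, n, rng):
    """Return n distinct row indices from 0..num_rows-1.

    rng.sample draws without replacement, so every n-subset of
    indices is equally likely to be returned.
    """
    if n > num_rows:
        raise ValueError("cannot sample more rows than the table holds")
    return rng.sample(range(num_rows), n)


if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_row_indices(num_rows=1000, n=5, rng=rng))
```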

Reservoir Sampling
Highlights:
• Make one pass over the data.
• Maintain a reservoir of n records.
• After reading t rows, the reservoir is a simple random sample of the first t rows.

Reservoir Sampling [Vitter ACM ToMS ’85]
Algorithm R:
• Initialize the reservoir to the first n rows.
• For the (t+1)st row R,
  – Pick an integer m uniformly at random between 1 and t+1 (inclusive).
  – If m <= n, then replace the m-th row in the reservoir with R.
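The steps of Algorithm R can be sketched in Python as follows. This is a minimal illustration, not Vitter's original code: the function name `reservoir_sample` and the choice to pass in a `random.Random` instance are assumptions made here for testability.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], n: int, rng: random.Random) -> List[T]:
    """One pass over rows; returns up to n rows without knowing the total count."""
    reservoir: List[T] = []
    for t, row in enumerate(rows):       # t rows have been read before this one
        if t < n:
            reservoir.append(row)        # initialize with the first n rows
        else:
            m = rng.randint(1, t + 1)    # uniform integer in 1..t+1, inclusive
            if m <= n:
                reservoir[m - 1] = row   # replace the m-th reservoir row
    return reservoir


if __name__ == "__main__":
    rng = random.Random(7)
    print(reservoir_sample(range(1_000_000), 5, rng))
```

Because `rows` is consumed as an iterator, the same function works on a file, a database cursor, or any other stream whose length is not known in advance; only the n reservoir rows are kept in memory.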

Proof
• If N = n, then P[row is in sample] = 1. Hence, the reservoir contains all the rows in the table.
• Suppose for N = t, the reservoir is a simple random sample. That is, each row has an n/t chance of appearing in the sample.
• For N = t+1:
  – The (t+1)st row is included in the sample with probability n/(t+1).
  – Any other row:
    $P[\text{row is in reservoir}] = P[\text{row is in reservoir after } t \text{ steps}] \cdot P[\text{row is not replaced}] = \frac{n}{t}\left(1 - \frac{1}{t+1}\right) = \frac{n}{t+1}$
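The inclusion probability n/N derived above can also be checked empirically. The sketch below is an illustration only: it re-implements Algorithm R in a compact form, and the helper name `inclusion_frequencies` and the parameters (10 rows, reservoir of 3, 20,000 trials) are arbitrary choices made here. Matching frequencies are consistent with the proof but are not a substitute for it.

```python
import random
from collections import Counter


def reservoir_sample(rows, n, rng):
    """Compact Algorithm R: keep n rows from a stream of unknown length."""
    reservoir = []
    for t, row in enumerate(rows):
        if t < n:
            reservoir.append(row)
        else:
            m = rng.randint(1, t + 1)    # uniform integer in 1..t+1
            if m <= n:
                reservoir[m - 1] = row
    return reservoir


def inclusion_frequencies(num_rows, n, trials, rng):
    """Fraction of trials in which each row index ends up in the reservoir."""
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(num_rows), n, rng))
    return [counts[r] / trials for r in range(num_rows)]


if __name__ == "__main__":
    freqs = inclusion_frequencies(num_rows=10, n=3, trials=20000,
                                  rng=random.Random(1))
    # Each frequency should be close to n/N = 3/10.
    print([round(f, 3) for f in freqs])
```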
