algorithms for big data management

Algorithms for Big Data Management, CompSci 590.02 - PowerPoint PPT Presentation



  1. Algorithms for Big-Data Management
 CompSci 590.02
 Instructor: Ashwin Machanavajjhala
 Lecture 1 : 590.02 Spring 13


  2. Administrivia
 http://www.cs.duke.edu/courses/spring13/compsci590.2/
 • Tue/Thu 3:05 – 4:20 PM
 • “Reading Course + Project”
 – No exams!
 – Every class based on 1 (or 2) assigned papers that students must read.
 • Projects: (50% of grade)
 – Individual or groups of size 2-3
 • Class Participation + assignments (other 50%)
 • Office hours: by appointment


  3. Administrivia
 • Projects: (50% of grade)
 – Ideas will be posted in the coming weeks
 • Goals:
 – Literature review
 – Some original research/implementation
 • Timeline (details will be posted on the website soon)
 – ≤ Feb 12: Choose Project (ideas will be posted … new ideas welcome)
 – Feb 21: Project proposal (1-4 pages describing the project)
 – Mar 21: Mid-project review (2-3 page report on progress)
 – Apr 18: Final presentations and submission (6-10 page conference style paper + 20 minute talk)


  4. Why you should take this course?
 • Industry, academic and government research identifies the value of analyzing large data collections in all walks of life.
 – “What Next? A Half-Dozen Data Management Research Goals for Big Data and Cloud”, Surajit Chaudhuri, Microsoft Research
 – “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute Report, 2011


  5. Why you should take this course?
 • Very active field and tons of interesting research.
 We will read papers in:
 – Data Management
 – Theory
 – Machine Learning
 – …


  6. Why you should take this course?
 • Intro to research by working on a cool project
 – Read scientific papers
 – Formulate a problem
 – Perform a scientific evaluation


  7. Today
 • Course overview
 • An algorithm for sampling


  8. INTRODUCTION


  9. What is Big Data?


  10. http://visual.ly/what-big-data


  11. http://visual.ly/what-big-data


  12. 3 Key Trends
 • Increased data collection
 • (Shared nothing) Parallel processing frameworks on commodity hardware
 • Powerful analysis of trends by linking data from heterogeneous sources


  13. Big-Data impacts all aspects of our life


  14. The value in Big-Data …
 Recommended links
 Top Searches
 Personalized
 News Interests
 +43% clicks vs. editor selected
 +79% clicks vs. randomly selected
 +250% clicks vs. editorial one size fits all


  15. The value in Big-Data …
 “If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.”
 McKinsey Global Institute Report


  16. Example: Google Flu


  17. http://www.ccs.neu.edu/home/amislove/twittermood/


  18. Course Overview
 • Sampling
 – Reservoir Sampling
 – Sampling with indices
 – Sampling from Joins
 – Markov chain Monte Carlo sampling
 – Graph Sampling & PageRank


  19. Course Overview
 • Sampling
 • Streaming Algorithms
 – Sketches
 – Online Aggregation
 – Windowed queries
 – Online learning


  20. Course Overview
 • Sampling
 • Streaming Algorithms
 • Parallel Architectures & Algorithms
 – PRAM
 – MapReduce
 – Graph processing architectures: Bulk Synchronous Parallel and asynchronous models
 – (Graph connectivity, Matrix Multiplication, Belief Propagation)


  21. Course Overview
 • Sampling
 • Streaming Algorithms
 • Parallel Architectures & Algorithms
 • Joining datasets & Record Linkage
 – Theta Joins: or how to optimally join two large datasets
 – Clustering similar documents using minHash
 – Identifying matching users across social networks
 – Correlation Clustering
 – Markov Logic Networks


  22. SAMPLING


  23. Why Sampling?
 • Approximately compute quantities when
 – Processing the entire dataset takes too long.
   How many tweets mention Obama?
 – Computation is intractable.
   Number of satisfying assignments for a DNF.
 – We do not have access to the entire data, or getting access is expensive.
   How many restaurants does Google know about?
   Number of users in Facebook whose birthday is today.
   What fraction of the population has the flu?


  24. Zero-One Estimator Theorem
 Input: A universe of items U (e.g., all tweets)
        A subset G (e.g., tweets mentioning Obama)
 Goal: Estimate μ = |G|/|U|
 Algorithm:
 • Pick N samples from U: {x1, x2, …, xN}
 • For each sample, let Yi = 1 if xi ∈ G (and Yi = 0 otherwise).
 • Output: Y = (Σ Yi)/N
 Theorem: Let ε < 2. If N > (1/μ) (4 ln(2/δ)/ε²), then
 Pr[(1-ε)μ < Y < (1+ε)μ] > 1-δ
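
The estimator and the sample-size bound from the theorem can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's reference code: samples are drawn uniformly with replacement, a lower bound on μ is assumed to be known in advance (the theorem needs μ, which is the quantity being estimated), and the names `required_samples`, `zero_one_estimate`, and the toy universe are hypothetical.

```python
import math
import random


def required_samples(mu_lower, eps, delta):
    """Smallest integer N with N > (1/mu) * 4 ln(2/delta) / eps^2.

    mu_lower is an ASSUMED lower bound on the true fraction mu.
    """
    bound = (1.0 / mu_lower) * 4.0 * math.log(2.0 / delta) / (eps ** 2)
    return math.floor(bound) + 1


def zero_one_estimate(universe, in_g, n_samples, rng):
    """Estimate |G|/|U| from n_samples uniform draws (with replacement)."""
    hits = 0
    for _ in range(n_samples):
        x = rng.choice(universe)   # pick x_i uniformly from U
        if in_g(x):                # Y_i = 1 if x_i is in G, else 0
            hits += 1
    return hits / n_samples        # Y = (sum of Y_i) / N


def is_multiple_of_ten(x):
    """Membership test for the toy subset G."""
    return x % 10 == 0


# Toy universe: integers 0..9999; G = multiples of 10, so mu = 0.1.
universe = list(range(10_000))
n = required_samples(mu_lower=0.1, eps=0.2, delta=0.05)
estimate = zero_one_estimate(universe, is_multiple_of_ten, n, random.Random(0))
```

With these toy parameters the theorem says the estimate lies within 20% of μ with probability above 0.95. Note that the required N grows as 1/μ, so rare subsets need many more samples.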


  25. Zero-One Estimator Theorem
 Algorithm:
 • Pick N samples from U: {x1, x2, …, xN}
 • For each sample, let Yi = 1 if xi ∈ G (and Yi = 0 otherwise).
 • Output: Y = (Σ Yi)/N
 Theorem: Let ε < 2. If N > (1/μ) (4 ln(2/δ)/ε²), then
 Pr[(1-ε)μ < Y < (1+ε)μ] > 1-δ
 Proof: Homework
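
As a hint at where the sample-size bound comes from (not necessarily the intended homework proof), the threshold on N is what you get by solving a Chernoff-type tail bound for N. The sketch below assumes the samples are independent and uniform, and takes the two-sided multiplicative Chernoff bound in the stated form as an assumption.

```latex
\begin{align*}
\Pr\bigl[\,|Y-\mu| \ge \varepsilon\mu\,\bigr]
  &\le 2\exp\!\left(-\frac{N\mu\varepsilon^{2}}{4}\right)
  && \text{(assumed Chernoff-type bound, } \varepsilon < 2\text{)}\\[4pt]
2\exp\!\left(-\frac{N\mu\varepsilon^{2}}{4}\right) < \delta
  &\iff \frac{N\mu\varepsilon^{2}}{4} > \ln\frac{2}{\delta}
  \iff N > \frac{1}{\mu}\cdot\frac{4\ln(2/\delta)}{\varepsilon^{2}}\\[4pt]
\Pr\bigl[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu\bigr]
  &= 1 - \Pr\bigl[\,|Y-\mu| \ge \varepsilon\mu\,\bigr] > 1-\delta
\end{align*}
```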


  26. Simple Random Sample
 • Given a table of size N, pick a subset of n rows, such that each subset of n rows is equally likely.
 • How to sample n rows?
 • … if we don’t know N?
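
For the easy case where N is known and the table fits in memory, a simple random sample can be sketched with Python's standard library. This is only an illustration: the function name `simple_random_sample` and the toy table are hypothetical, and the in-memory assumption is exactly what fails for the "don't know N" case.

```python
import random


def simple_random_sample(rows, n, rng):
    """Return n distinct rows; every size-n subset is equally likely.

    ASSUMES the whole table is in memory, so N = len(rows) is known.
    """
    if n > len(rows):
        raise ValueError("sample size n exceeds table size N")
    return rng.sample(rows, n)  # sampling without replacement


table = [f"row-{i}" for i in range(100)]  # hypothetical table with N = 100
sample = simple_random_sample(table, 5, random.Random(42))
```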


  27. Reservoir Sampling
 Highlights:
 • Make one pass over the data
 • Maintain a reservoir of n records.
 • After reading t rows, the reservoir is a simple random sample of the first t rows.


  28. Reservoir Sampling
 [Vitter ACM ToMS ’85]
 Algorithm R:
 • Initialize reservoir to the first n rows.
 • For the (t+1)st row R,
 – Pick a random integer m uniformly between 1 and t+1 (inclusive)
 – If m <= n, then replace the mth row in the reservoir with R
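
The steps of Algorithm R above can be sketched in Python as follows. This is a minimal sketch, not Vitter's original code: the function name `reservoir_sample` and the toy stream are hypothetical, and the input is assumed to be any iterable whose length is not known in advance.

```python
import random


def reservoir_sample(stream, n, rng):
    """One-pass Algorithm R: keep a simple random sample of up to n rows."""
    reservoir = []
    for t, row in enumerate(stream):    # t rows were read before this one
        if t < n:
            reservoir.append(row)       # initialize with the first n rows
        else:
            m = rng.randint(1, t + 1)   # uniform integer in 1..t+1, inclusive
            if m <= n:
                reservoir[m - 1] = row  # replace the m-th reservoir row
    return reservoir


# A generator stands in for a table whose size N is unknown up front.
rows = (f"row-{i}" for i in range(1000))
sample = reservoir_sample(rows, 10, random.Random(7))
```

If the stream has fewer than n rows, this sketch simply returns all of them, which matches the base case N = n of the proof.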


  29. Proof


  30. Proof
 • If N = n, then P[row is in sample] = 1. Hence, reservoir contains all the rows in the table.
 • Suppose for N = t, the reservoir is a simple random sample. That is, each row has n/t chance of appearing in the sample.
 • For N = t+1:
 – (t+1)st row is included in the sample with probability n/(t+1)
 – Any other row:
   P[row is in reservoir] = P[row is in reservoir after t steps] * P[row is not replaced]
                          = n/t * (1 - 1/(t+1)) = n/(t+1)
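
The inductive claim can be checked exactly on a small case by enumerating every possible sequence of random draws m. The sketch below is only a sanity check under assumed toy sizes (N = 5, n = 2), with a hypothetical helper name. It uses exact fractions, so there is no sampling noise.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def exact_reservoir_distribution(total_rows, n):
    """Enumerate all draw sequences of Algorithm R and tally final reservoirs."""
    # For the (t+1)st row (0-based index t), m ranges over 1..t+1.
    draw_ranges = [range(1, t + 2) for t in range(n, total_rows)]
    outcomes = Counter()
    count = 0
    for draws in product(*draw_ranges):  # all sequences are equally likely
        reservoir = list(range(n))       # first n rows fill the reservoir
        for t, m in zip(range(n, total_rows), draws):
            if m <= n:
                reservoir[m - 1] = t     # row t replaces the m-th slot
        outcomes[frozenset(reservoir)] += 1
        count += 1
    return {subset: Fraction(c, count) for subset, c in outcomes.items()}


dist = exact_reservoir_distribution(total_rows=5, n=2)
# Probability that each row ends up in the reservoir; the proof predicts n/N.
inclusion = {
    row: sum(p for subset, p in dist.items() if row in subset)
    for row in range(5)
}
```

On this toy case every row has inclusion probability exactly n/N = 2/5, and each of the 10 possible two-row subsets is equally likely.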

