CS345a: Data Mining
Jure Leskovec, Stanford University
[Diagram: a single machine with CPU, Memory, and Disk. Machine Learning and Statistics work on data that fits in memory; "Classical" Data Mining works on data that lives on disk.]
20+ billion web pages x 20KB = 400+ TB
One computer reads 30-35 MB/sec from disk: ~4 months just to read the web
~1,000 hard drives just to store the web
Even more to do something with the data
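As a quick sanity check of those numbers, a short Python calculation using only the slide's rough figures (400 TB total, ~35 MB/sec per disk) reproduces the "~4 months" estimate:

  # Back-of-the-envelope check of the read-time claim, using the slide's figures
  web_size_bytes = 400e12            # 400+ TB of web pages
  read_rate_bytes_per_sec = 35e6     # ~30-35 MB/sec sequential read from one disk

  seconds = web_size_bytes / read_rate_bytes_per_sec
  days = seconds / (60 * 60 * 24)
  print(f"{days:.0f} days, roughly {days / 30:.1f} months")   # ~132 days, ~4.4 months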
Web data sets can be very large: tens to hundreds of terabytes
Cannot mine on a single server
Standard architecture emerging:
- Cluster of commodity Linux nodes
- Gigabit Ethernet interconnect
How to organize computations on this architecture?
- Mask issues such as hardware failure
Traditional big-iron box (circa 2003): 8 2GHz Xeons, 64GB RAM, 8TB disk, 758,000 USD
Prototypical Google rack (circa 2003): 176 2GHz Xeons, 176GB RAM, ~7TB disk, 278,000 USD
In Aug 2006 Google had ~450,000 machines
[Diagram: cluster network topology. A 2-10 Gbps backbone switch connects the rack switches; 1 Gbps between any pair of nodes in a rack. Each node has its own CPU, memory, and disk. Each rack contains 16-64 nodes.]
Large-scale computing for data mining problems on commodity hardware: PCs connected in a network
Need to process huge datasets on large clusters of computers
Challenges:
- How do you distribute computation?
- Distributed programming is hard
- Machines fail
Map-Reduce addresses all of the above
- Google's computational/data manipulation model
- Elegant way to work with big data
Yahoo's collaboration with academia
- Foster open research
- Focus on large-scale, highly parallel computing
Seed facility: M45, a Datacenter in a Box (DiB)
- 1,000 nodes, 4,000 cores, 3TB RAM, 1.5PB disk
- High-bandwidth connection to the Internet
- Located on the Yahoo! corporate campus
- Among the world's top 50 supercomputers
Implications of such a computing environment:
- Single-machine performance does not matter: just add more machines
- Machines break: one server may stay up 3 years (~1,000 days), so with 1,000 servers expect to lose about one per day
How can we make it easy to write distributed programs?
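A quick check of that failure estimate, again using only the slide's rough numbers:

  servers = 1_000
  mean_uptime_days = 1_000                      # ~3 years per machine
  expected_failures_per_day = servers / mean_uptime_days
  print(expected_failures_per_day)              # 1.0, i.e. about one machine lost per day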
Idea:
- Bring computation close to the data
- Store files multiple times for reliability
Need:
- Programming model: Map-Reduce
- Infrastructure: a distributed file system (Google: GFS; Hadoop: HDFS)
First-order problem: if nodes can fail, how can we store data persistently?
Answer: a distributed file system
- Provides a global file namespace
- Google GFS; Hadoop HDFS; Kosmix KFS
Typical usage pattern:
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common
Reliable distributed file system for petabyte scale
- Data kept in 64-megabyte "chunks" spread across thousands of machines
- Each chunk replicated, usually 3 times, on different machines
- Seamless recovery from disk or machine failure
[Diagram: chunks C0-C5 and D0-D1 replicated across Chunk server 1, Chunk server 2, Chunk server 3, ..., Chunk server N]
Bring computation directly to the data!
Chunk servers
- File is split into contiguous chunks, typically 16-64MB each
- Each chunk replicated (usually 2x or 3x)
- Try to keep replicas in different racks
Master node (a.k.a. Name Node in HDFS)
- Stores metadata
- Might be replicated
Client library for file access
- Talks to the master to find chunk servers
- Connects directly to chunk servers to access data
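To make that division of labor concrete, here is a toy, in-memory sketch of the read path just described: the client asks the master where a chunk lives, then fetches the bytes directly from a chunk server. All class and method names (Master, ChunkServer, Client, locate, read_chunk) are hypothetical illustrations, not the real GFS or HDFS APIs.

  class Master:
      """Stores only metadata: which chunk servers hold each chunk of a file."""
      def __init__(self):
          self.chunk_locations = {}   # (file_name, chunk_index) -> [ChunkServer, ...]

      def locate(self, file_name, chunk_index):
          return self.chunk_locations[(file_name, chunk_index)]

  class ChunkServer:
      """Holds the actual chunk bytes."""
      def __init__(self, name):
          self.name = name
          self.chunks = {}            # (file_name, chunk_index) -> bytes

  class Client:
      """Asks the master for metadata, then reads data directly from chunk servers."""
      def __init__(self, master):
          self.master = master

      def read_chunk(self, file_name, chunk_index):
          replicas = self.master.locate(file_name, chunk_index)
          return replicas[0].chunks[(file_name, chunk_index)]   # any replica will do

  # Hypothetical usage: one chunk of "crawl.dat" replicated on two servers
  master = Master()
  s1, s2 = ChunkServer("cs1"), ChunkServer("cs2")
  for s in (s1, s2):
      s.chunks[("crawl.dat", 0)] = b"...chunk bytes..."
  master.chunk_locations[("crawl.dat", 0)] = [s1, s2]
  print(Client(master).read_chunk("crawl.dat", 0))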
We have a large file of words, one word per line
Count the number of times each distinct word appears in the file
Sample application: analyze web server logs to find popular URLs
Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory (see the sketch below)
Case 3: File on disk, too many distinct words to fit in memory
  sort datafile | uniq -c
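A minimal sketch of Case 2: stream the file line by line so it never has to fit in memory, while keeping all <word, count> pairs in an in-memory dictionary. The file name words.txt is a made-up example of a one-word-per-line input.

  from collections import Counter

  counts = Counter()
  with open("words.txt") as f:        # hypothetical one-word-per-line input
      for line in f:
          word = line.strip()
          if word:
              counts[word] += 1

  for word, count in counts.most_common(10):
      print(word, count)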
To make it slightly harder, suppose we have a large corpus of documents
Count the number of times each distinct word occurs in the corpus
  words(docs/*) | sort | uniq -c
where words takes a file and outputs the words in it, one to a line
The above captures the essence of MapReduce
The great thing is that it is naturally parallelizable
Read a lot of data
Map: extract something you care about
Shuffle and sort
Reduce: aggregate, summarize, filter, or transform
Write the data
The outline stays the same; map and reduce change to fit the problem
The program specifies two primary methods:
- Map(k, v) -> <k', v'>*
- Reduce(k', <v'>*) -> <k', v''>*
All v' with the same k' are reduced together and processed in v' order
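As an illustration only (not how Google's or Hadoop's runtime actually works), a minimal single-machine sketch of this contract could look as follows: apply Map to every input record, group the intermediate values by key, then hand each group to Reduce.

  from collections import defaultdict

  def map_reduce(inputs, map_fn, reduce_fn):
      # Map phase: map_fn(k, v) yields intermediate (k', v') pairs
      intermediate = defaultdict(list)
      for k, v in inputs:
          for k2, v2 in map_fn(k, v):
              intermediate[k2].append(v2)    # "shuffle": group values by key
      # Reduce phase: reduce_fn(k', [v', ...]) yields output (k', v'') pairs
      output = []
      for k2, values in intermediate.items():
          output.extend(reduce_fn(k2, values))
      return output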
[Figure: the word-count data flow on an example document.
Big document (input text): "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need to do to build any work station or habitat structure on the moon or Mars,' said Allard Beutel."
MAP (provided by the programmer): reads the input and produces a set of (key, value) pairs, e.g. (the, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ... The input is read sequentially; only sequential reads are needed.
Group by key: collect all pairs with the same key, e.g. (crew, 1) (crew, 1); (space, 1); (the, 1) (the, 1) (the, 1); (recently, 1); ...
Reduce (provided by the programmer): collect all values belonging to each key and output the result, e.g. (crew, 2), (space, 1), (the, 3), (recently, 1), ...]
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
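Expressed in runnable Python, and reusing the map_reduce sketch shown after the Map/Reduce signatures above, the word-count example might look like this (the two-document corpus is made up for illustration):

  def wc_map(doc_name, text):
      for w in text.split():             # emit (w, 1) for every word in the document
          yield (w, 1)

  def wc_reduce(word, counts):
      yield (word, sum(counts))          # sum the counts for this word

  corpus = [("doc1", "the crew of the space shuttle"),
            ("doc2", "the recent assembly of the Dextre bot")]
  print(sorted(map_reduce(corpus, wc_map, wc_reduce)))
  # [('Dextre', 1), ('assembly', 1), ('bot', 1), ('crew', 1), ('of', 2),
  #  ('recent', 1), ('shuttle', 1), ('space', 1), ('the', 4)]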
The Map-Reduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Handling machine failures
- Managing required inter-machine communication
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed cluster
Big document -> MAP: reads the input and produces a set of key-value pairs -> Group by key: collect all pairs with the same key -> Reduce: collect all values belonging to the key and output