CS345a: Data Mining
Jure Leskovec, Stanford University
[Diagram: a single machine with CPU, Memory, and Disk. Machine Learning and Statistics work on data that fits in memory; "Classical" Data Mining works on data that lives on disk.]
20+ billion web pages x 20KB = 400+ TB
One computer reads 30-35 MB/sec from disk: ~4 months just to read the web
~1,000 hard drives just to store the web
Even more to do something with the data
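As a quick sanity check of those numbers, a short Python calculation using only the slide's rough figures (400 TB total, ~35 MB/sec per disk) reproduces the "~4 months" estimate:

  # Back-of-the-envelope check of the read-time claim, using the slide's figures
  web_size_bytes = 400e12            # 400+ TB of web pages
  read_rate_bytes_per_sec = 35e6     # ~30-35 MB/sec sequential read from one disk

  seconds = web_size_bytes / read_rate_bytes_per_sec
  days = seconds / (60 * 60 * 24)
  print(f"{days:.0f} days, roughly {days / 30:.1f} months")   # ~132 days, ~4.4 months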
Web data sets can be very large: tens to hundreds of terabytes
Cannot mine on a single server
Standard architecture emerging:
- Cluster of commodity Linux nodes
- Gigabit Ethernet interconnect
How to organize computations on this architecture?
- Mask issues such as hardware failure
Traditional big-iron box (circa 2003): 8 2GHz Xeons, 64GB RAM, 8TB disk, 758,000 USD
Prototypical Google rack (circa 2003): 176 2GHz Xeons, 176GB RAM, ~7TB disk, 278,000 USD
In Aug 2006 Google had ~450,000 machines
[Diagram: cluster network topology. A 2-10 Gbps backbone switch connects the rack switches; 1 Gbps between any pair of nodes in a rack. Each node has its own CPU, memory, and disk. Each rack contains 16-64 nodes.]
Large-scale computing for data mining problems on commodity hardware: PCs connected in a network
Need to process huge datasets on large clusters of computers
Challenges:
- How do you distribute computation?
- Distributed programming is hard
- Machines fail
Map-Reduce addresses all of the above
- Google's computational/data manipulation model
- Elegant way to work with big data
Yahoo's collaboration with academia
- Foster open research
- Focus on large-scale, highly parallel computing
Seed facility: M45, a Datacenter in a Box (DiB)
- 1,000 nodes, 4,000 cores, 3TB RAM, 1.5PB disk
- High-bandwidth connection to the Internet
- Located on the Yahoo! corporate campus
- Among the world's top 50 supercomputers
Implications of such a computing environment:
- Single-machine performance does not matter: just add more machines
- Machines break: one server may stay up 3 years (~1,000 days), so with 1,000 servers expect to lose about one per day
How can we make it easy to write distributed programs?
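A quick check of that failure estimate, again using only the slide's rough numbers:

  servers = 1_000
  mean_uptime_days = 1_000                      # ~3 years per machine
  expected_failures_per_day = servers / mean_uptime_days
  print(expected_failures_per_day)              # 1.0, i.e. about one machine lost per day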
Idea:
- Bring computation close to the data
- Store files multiple times for reliability
Need:
- Programming model: Map-Reduce
- Infrastructure: a distributed file system (Google: GFS; Hadoop: HDFS)
First-order problem: if nodes can fail, how can we store data persistently?
Answer: a distributed file system
- Provides a global file namespace
- Google GFS; Hadoop HDFS; Kosmix KFS
Typical usage pattern:
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common
Reliable distributed file system for petabyte scale
- Data kept in 64-megabyte "chunks" spread across thousands of machines
- Each chunk replicated, usually 3 times, on different machines
- Seamless recovery from disk or machine failure
[Diagram: chunks C0-C5 and D0-D1 replicated across Chunk server 1, Chunk server 2, Chunk server 3, ..., Chunk server N]
Bring computation directly to the data!
Chunk servers
- File is split into contiguous chunks, typically 16-64MB each
- Each chunk replicated (usually 2x or 3x)
- Try to keep replicas in different racks
Master node (a.k.a. Name Node in HDFS)
- Stores metadata
- Might be replicated
Client library for file access
- Talks to the master to find chunk servers
- Connects directly to chunk servers to access data
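To make that division of labor concrete, here is a toy, in-memory sketch of the read path just described: the client asks the master where a chunk lives, then fetches the bytes directly from a chunk server. All class and method names (Master, ChunkServer, Client, locate, read_chunk) are hypothetical illustrations, not the real GFS or HDFS APIs.

  class Master:
      """Stores only metadata: which chunk servers hold each chunk of a file."""
      def __init__(self):
          self.chunk_locations = {}   # (file_name, chunk_index) -> [ChunkServer, ...]

      def locate(self, file_name, chunk_index):
          return self.chunk_locations[(file_name, chunk_index)]

  class ChunkServer:
      """Holds the actual chunk bytes."""
      def __init__(self, name):
          self.name = name
          self.chunks = {}            # (file_name, chunk_index) -> bytes

  class Client:
      """Asks the master for metadata, then reads data directly from chunk servers."""
      def __init__(self, master):
          self.master = master

      def read_chunk(self, file_name, chunk_index):
          replicas = self.master.locate(file_name, chunk_index)
          return replicas[0].chunks[(file_name, chunk_index)]   # any replica will do

  # Hypothetical usage: one chunk of "crawl.dat" replicated on two servers
  master = Master()
  s1, s2 = ChunkServer("cs1"), ChunkServer("cs2")
  for s in (s1, s2):
      s.chunks[("crawl.dat", 0)] = b"...chunk bytes..."
  master.chunk_locations[("crawl.dat", 0)] = [s1, s2]
  print(Client(master).read_chunk("crawl.dat", 0))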
We have a large file of words, one word per line
Count the number of times each distinct word appears in the file
Sample application: analyze web server logs to find popular URLs
Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory (see the sketch below)
Case 3: File on disk, too many distinct words to fit in memory
  sort datafile | uniq -c
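A minimal sketch of Case 2: stream the file line by line so it never has to fit in memory, while keeping all <word, count> pairs in an in-memory dictionary. The file name words.txt is a made-up example of a one-word-per-line input.

  from collections import Counter

  counts = Counter()
  with open("words.txt") as f:        # hypothetical one-word-per-line input
      for line in f:
          word = line.strip()
          if word:
              counts[word] += 1

  for word, count in counts.most_common(10):
      print(word, count)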
To make it slightly harder, suppose we have a large corpus of documents
Count the number of times each distinct word occurs in the corpus
  words(docs/*) | sort | uniq -c
where words takes a file and outputs the words in it, one to a line
The above captures the essence of MapReduce
The great thing is that it is naturally parallelizable
Read a lot of data
Map: extract something you care about
Shuffle and sort
Reduce: aggregate, summarize, filter, or transform
Write the data
The outline stays the same; map and reduce change to fit the problem
The program specifies two primary methods:
- Map(k, v) -> <k', v'>*
- Reduce(k', <v'>*) -> <k', v''>*
All v' with the same k' are reduced together and processed in v' order
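As an illustration only (not how Google's or Hadoop's runtime actually works), a minimal single-machine sketch of this contract could look as follows: apply Map to every input record, group the intermediate values by key, then hand each group to Reduce.

  from collections import defaultdict

  def map_reduce(inputs, map_fn, reduce_fn):
      # Map phase: map_fn(k, v) yields intermediate (k', v') pairs
      intermediate = defaultdict(list)
      for k, v in inputs:
          for k2, v2 in map_fn(k, v):
              intermediate[k2].append(v2)    # "shuffle": group values by key
      # Reduce phase: reduce_fn(k', [v', ...]) yields output (k', v'') pairs
      output = []
      for k2, values in intermediate.items():
          output.extend(reduce_fn(k2, values))
      return output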
[Figure: the word-count data flow on an example document.
Big document (input text): "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need to do to build any work station or habitat structure on the moon or Mars,' said Allard Beutel."
MAP (provided by the programmer): reads the input and produces a set of (key, value) pairs, e.g. (the, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ... The input is read sequentially; only sequential reads are needed.
Group by key: collect all pairs with the same key, e.g. (crew, 1) (crew, 1); (space, 1); (the, 1) (the, 1) (the, 1); (recently, 1); ...
Reduce (provided by the programmer): collect all values belonging to each key and output the result, e.g. (crew, 2), (space, 1), (the, 3), (recently, 1), ...]
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
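Expressed in runnable Python, and reusing the map_reduce sketch shown after the Map/Reduce signatures above, the word-count example might look like this (the two-document corpus is made up for illustration):

  def wc_map(doc_name, text):
      for w in text.split():             # emit (w, 1) for every word in the document
          yield (w, 1)

  def wc_reduce(word, counts):
      yield (word, sum(counts))          # sum the counts for this word

  corpus = [("doc1", "the crew of the space shuttle"),
            ("doc2", "the recent assembly of the Dextre bot")]
  print(sorted(map_reduce(corpus, wc_map, wc_reduce)))
  # [('Dextre', 1), ('assembly', 1), ('bot', 1), ('crew', 1), ('of', 2),
  #  ('recent', 1), ('shuttle', 1), ('space', 1), ('the', 4)]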
The Map-Reduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Handling machine failures
- Managing required inter-machine communication
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed cluster
Big document -> MAP: reads the input and produces a set of key-value pairs -> Group by key: collect all pairs with the same key -> Reduce: collect all values belonging to the key and output