Large-scale Data Mining: MapReduce and Beyond
Part 2: Algorithms
Spiros Papadimitriou, IBM Research
Jimeng Sun, IBM Research
Rong Yan, Facebook
Part 2: Mining using MapReduce
Mining algorithms using MapReduce:
– Information retrieval
– Graph algorithms: PageRank
– Clustering: Canopy clustering, KMeans
– Classification: kNN, Naïve Bayes
MapReduce mining summary
MapReduce Interface and Data Flow
Map: (K1, V1) → list(K2, V2)
Combine: (K2, list(V2)) → list(K2, V2)
Partition: (K2, V2) → reducer_id
Reduce: (K2, list(V2)) → list(K3, V3)
[Figure: data flow of an inverted-index job across four hosts. Map turns (id, doc) into list(w, id); Combine deduplicates to list(unique_w, id) on each map host; Partition routes each word (w1, w2, …) to its reducer; Reduce merges to (w, list(unique_id)).]
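To make these four signatures concrete, here is a minimal in-memory sketch of the Map → shuffle → Reduce contract, with Combine and Partition folded into the shuffle for brevity. The helper run_mapreduce is illustrative, not a Hadoop API; the examples on the following slides all fit this shape.

```python
from itertools import groupby

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each (k1, v1) record yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle phase: group intermediate pairs by key k2 (this is where
    # Combine and Partition would act in a real framework).
    intermediate.sort(key=lambda kv: kv[0])
    output = []
    for k2, group in groupby(intermediate, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        # Reduce phase: each (k2, list(v2)) yields a list of (k3, v3) pairs.
        output.extend(reduce_fn(k2, values))
    return output
```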
Information retrieval using MapReduce
IR: Distributed Grep
Find the doc_id and line # of every match of a pattern.
Map: (id, doc) → list(id, line#)
Reduce: none (map-only job)
Example, grep "data mining" over docs 1–6: Map1 emits <1, 123>; Map2 emits <3, 717> and <5, 1231>; Map3 emits <6, 1012>.
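A minimal Python sketch of the grep mapper under this contract; grep_map, the default pattern, and the sample document are illustrative assumptions.

```python
def grep_map(doc_id, doc, pattern="data mining"):
    # Map: (id, doc) -> list((id, line#)) for each line matching the pattern.
    return [(doc_id, line_no)
            for line_no, line in enumerate(doc.splitlines(), start=1)
            if pattern in line]

# Map-only job: there is no reduce step.
print(grep_map(3, "hello world\nintro to data mining\nbye"))  # -> [(3, 2)]
```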
IR: URL Access Frequency
Map: (null, log) → (URL, 1)
Reduce: (URL, list(1)) → (URL, total_count)
Example: the maps emit <u1,1>, <u1,1>, <u2,1>, <u3,1>, <u3,1>; Reduce outputs <u1,2>, <u2,1>, <u3,2>.
Also described in Part 1.
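A sketch of the two functions, assuming the URL is the first whitespace-separated field of each log line; the helper names and sample logs are illustrative.

```python
def url_map(_, log_line):
    # Map: (null, log) -> (URL, 1); assume the URL is the first field.
    return [(log_line.split()[0], 1)]

def url_reduce(url, counts):
    # Reduce: (URL, list(1)) -> (URL, total_count).
    return [(url, sum(counts))]

logs = ["u1 GET /a", "u1 GET /b", "u2 GET /a", "u3 GET /c", "u3 GET /c"]
groups = {}
for line in logs:                      # map + in-memory shuffle
    for url, one in url_map(None, line):
        groups.setdefault(url, []).append(one)
print([url_reduce(u, vs)[0] for u, vs in sorted(groups.items())])
# -> [('u1', 2), ('u2', 1), ('u3', 2)]
```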
IR: Reverse Web-Link Graph
Map: (null, page) → (target, source) for each out-link
Reduce: (target, list(source)) → (target, list(source))
Example: the maps emit <t1,s2>, <t2,s3>, <t2,s5>, <t3,s5>; Reduce outputs <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>.
This is the same as a matrix transpose.
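A sketch, assuming each page record is a (source_url, [target_urls]) pair; this encoding and the function names are assumptions, and the driver reproduces the slide's example.

```python
def link_map(_, page):
    # Map: (null, page) -> (target, source) for every out-link.
    source, targets = page
    return [(t, source) for t in targets]

def link_reduce(target, sources):
    # Reduce: concatenate all sources pointing at this target.
    return [(target, sorted(sources))]

pages = [("s2", ["t1"]), ("s3", ["t2"]), ("s5", ["t2", "t3"])]
groups = {}
for page in pages:                     # map + in-memory shuffle
    for t, s in link_map(None, page):
        groups.setdefault(t, []).append(s)
print([link_reduce(t, ss)[0] for t, ss in sorted(groups.items())])
# -> [('t1', ['s2']), ('t2', ['s3', 's5']), ('t3', ['s5'])]
```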
IR: Inverted Index
Map: (id, doc) → list(word, id)
Reduce: (word, list(id)) → (word, list(id))
Example: the maps emit <w1,1>, <w2,2>, <w3,3>, <w1,5>; Reduce outputs <w1,[1,5]>, <w2,[2]>, <w3,[3]>.
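A sketch of the inverted-index job; emitting each word once per document in the mapper plays the role of the Combine step from the data-flow slide. Function names and the toy corpus are illustrative.

```python
def index_map(doc_id, doc):
    # Map: (id, doc) -> list((word, id)), one pair per unique word.
    return [(w, doc_id) for w in sorted(set(doc.lower().split()))]

def index_reduce(word, doc_ids):
    # Reduce: (word, list(id)) -> (word, sorted posting list).
    return [(word, sorted(set(doc_ids)))]

docs = {1: "w1", 2: "w2", 3: "w3", 5: "w1"}
groups = {}
for doc_id, text in docs.items():      # map + in-memory shuffle
    for w, d in index_map(doc_id, text):
        groups.setdefault(w, []).append(d)
print([index_reduce(w, ds)[0] for w, ds in sorted(groups.items())])
# -> [('w1', [1, 5]), ('w2', [2]), ('w3', [3])]
```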
Graph mining using MapReduce
PageRank
The PageRank vector q is defined by
    q = c·Aᵀ·q + ((1 − c)/N)·e
where the first term models browsing and the second models teleporting, and:
– A is the source-by-destination adjacency matrix; for the 4-node example graph (nodes 1–4):
    A = [ 0 1 1 1
          0 0 1 1
          0 0 0 1
          0 0 1 0 ]
– e is the all-ones vector
– N is the number of nodes
– c is a weight between 0 and 1 (e.g. 0.85)
PageRank indicates the importance of a page.
Algorithm: iterative powering to find the first eigenvector.
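A small worked example of the iterative powering on the 4-node graph above, using NumPy. One practical detail the slide leaves implicit: each row of A is normalized by its out-degree so that a page's rank is split evenly across its out-links.

```python
import numpy as np

# Source-by-destination adjacency matrix of the 4-node example graph.
A = np.array([[0, 1, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N, c = 4, 0.85

# Normalize each row by its out-degree (assumption: no dangling nodes).
M = A / A.sum(axis=1, keepdims=True)

q = np.full(N, 1.0 / N)          # start from the uniform vector
for _ in range(50):              # iterative powering
    q = c * M.T @ q + (1 - c) / N
print(q)                         # nodes 3 and 4 end up with the highest rank
```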
MapReduce: PageRank
PageRank Map()
Input: key = page x; value = (PageRank q_x, out-links [y_1 … y_m])
Output: key = page; value = partial PageRank contribution
1. Emit(x, 0)  // guarantees every page is emitted, even one with no in-links
2. For each outgoing link y_i: Emit(y_i, q_x / m)
Map distributes each page's PageRank q_x over its out-links (q1 … q4 in the 4-node example).
PageRank Reduce()
Input: key = page x; value = the list of partial contributions [partial_x]
Output: key = page x; value = updated PageRank q_x
1. q_x = 0
2. For each partial value d in the list: q_x += d
3. q_x = c·q_x + (1 − c)/N
4. Emit(x, q_x)
Reduce sums the contributions to produce each page's new PageRank; one Map/Reduce pass is one powering iteration. Check out Kang et al., ICDM'09.
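A sketch of this iteration in map/reduce form, with an in-memory shuffle standing in for the framework; the driver loop, graph encoding, and function names are illustrative.

```python
def pagerank_map(x, q_x, outlinks):
    # Map: distribute page x's PageRank evenly over its m out-links.
    yield (x, 0.0)                    # ensure x is emitted even with no in-links
    for y in outlinks:
        yield (y, q_x / len(outlinks))

def pagerank_reduce(x, partials, c=0.85, N=4):
    # Reduce: sum the partial contributions, then add the teleport term.
    return (x, c * sum(partials) + (1 - c) / N)

graph = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: [3]}  # the 4-node example
q = {x: 1.0 / 4 for x in graph}
for _ in range(50):                   # each pass is one MapReduce job
    groups = {}
    for x, links in graph.items():    # map + in-memory shuffle
        for y, d in pagerank_map(x, q[x], links):
            groups.setdefault(y, []).append(d)
    q = dict(pagerank_reduce(x, ps) for x, ps in groups.items())
print(q)
```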
Clustering using MapReduce
Canopy: single-pass clustering
Canopy creation constructs overlapping clusters, called canopies, while ensuring that no two canopies overlap too much. The key invariant: no two canopy centers are too close to each other.
Two thresholds with T1 > T2 control this: T1 bounds canopy membership (giving the overlapping clusters), and T2 keeps centers apart (preventing too much overlap).
[Figure: overlapping canopies C1–C4, each drawn with an outer radius T1 and an inner radius T2.]
McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD'00
Canopy creation
Input: (1) points; (2) thresholds T1, T2 with T1 > T2
Output: cluster centroids
Put all points into a queue Q
While Q is not empty:
– p = dequeue(Q); strongBound = false
– For each canopy c:
  • if dist(p, c) < T1: c.add(p)
  • if dist(p, c) < T2: strongBound = true
– If not strongBound: create a new canopy at p
For each canopy c:
– Set its centroid to the mean of all points in c
[Figure: points within T2 of a center are strongly bound to that canopy; points within T1 join it as the other points in the cluster; a point outside every T2 radius seeds a new canopy center (C1, C2, …).]
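A sequential Python sketch of the pass above; canopy_create and the tuple encoding of points are illustrative assumptions, and math.dist stands in for whatever cheap distance a real implementation would use.

```python
import math

def canopy_create(points, T1, T2):
    # Single-pass canopy creation; points are coordinate tuples.
    assert T1 > T2
    canopies = []                       # each canopy: (center, member list)
    for p in points:
        strong_bound = False
        for center, members in canopies:
            d = math.dist(p, center)
            if d < T1:
                members.append(p)       # p joins this canopy
            if d < T2:
                strong_bound = True     # p is strongly bound: no new canopy
        if not strong_bound:
            canopies.append((p, [p]))   # p seeds a new canopy
    # Final centroids: per-coordinate mean of each canopy's members.
    return [tuple(sum(xs) / len(ms) for xs in zip(*ms)) for _, ms in canopies]

print(canopy_create([(0, 0), (1, 0), (9, 0), (10, 0)], T1=5, T2=2))
# -> [(0.5, 0.0), (9.5, 0.0)]
```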
MapReduce Canopy: Map()
Input: a set of points P; thresholds T1, T2
Output: key = null; value = a list of local canopies, each summarized as (total, count)
For each p in P:
– strongBound = false
– For each canopy c:
  • if dist(p, c) < T1: c.total += p; c.count++
  • if dist(p, c) < T2: strongBound = true
– If not strongBound: create a new canopy at p
Close():
– For each canopy c: Emit(null, (c.total, c.count))
[Figure: Map1 and Map2 each build local canopies over their own partition of the points.]
MapReduce Canopy: Reduce()
Input: key = null; values = the local canopy summaries (total, count)
Output: key = null; value = cluster centroids
For each intermediate value:
– p = total / count  // the local canopy's centroid
– strongBound = false
– For each canopy c:
  • if dist(p, c) < T1: c.total += p; c.count++
  • if dist(p, c) < T2: strongBound = true
– If not strongBound: create a new canopy at p
Close():
– For each canopy c: Emit(null, c.total / c.count)
Remark: for simplicity this assumes a single reducer, which merges the local canopies from all mappers (Map1 results, Map2 results, …) into global canopy centers.
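A compact sketch of the full two-stage formulation: each mapper runs one canopy pass over its local points and emits (total, count) summaries from Close(); a single reducer reruns the same pass over the mappers' local centroids. All names, the tuple encoding, and the two-mapper driver are illustrative assumptions.

```python
import math

def canopy_pass(points, T1, T2):
    # One canopy pass; each canopy is keyed by its seed point, and only a
    # (total, count) summary is kept, per the Map() pseudocode above.
    canopies = []                                 # [center, total, count]
    for p in points:
        strong = False
        for c in canopies:
            d = math.dist(p, c[0])
            if d < T1:
                c[1] = tuple(t + x for t, x in zip(c[1], p))
                c[2] += 1
            if d < T2:
                strong = True
        if not strong:
            canopies.append([p, p, 1])
    return [(c[1], c[2]) for c in canopies]       # emitted in Close()

def canopy_reduce(summaries, T1, T2):
    # Single reducer: rerun the same pass over the mappers' local centroids.
    local = [tuple(t / n for t in total) for total, n in summaries]
    return [tuple(t / n for t in total)
            for total, n in canopy_pass(local, T1, T2)]

# Two mappers, each over its own partition of the points:
s1 = canopy_pass([(0, 0), (1, 0)], T1=5, T2=2)
s2 = canopy_pass([(9, 0), (10, 0)], T1=5, T2=2)
print(canopy_reduce(s1 + s2, T1=5, T2=2))         # -> [(0.5, 0.0), (9.5, 0.0)]
```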
Cluster assignment
For each point p: assign p to the closest canopy center.
MapReduce: Cluster Assignment Map()
Input: point p; the cluster centroids
Output: key = cluster id; value = point id
currentDist = ∞
For each cluster centroid c:
– if dist(p, c) < currentDist: bestCluster = c; currentDist = dist(p, c)
Emit(bestCluster, p)
Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied to get the output sorted on cluster id.
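A sketch of the assignment mapper, assuming the centroids are shipped to every mapper (e.g. as a side file or via the distributed cache); the names and dict encoding are illustrative.

```python
import math

def assign_map(point_id, p, centroids):
    # Map: emit (closest cluster id, point id); centroids is a dict
    # {cluster_id: center} available on every mapper.
    best, best_dist = None, math.inf
    for cid, center in centroids.items():
        d = math.dist(p, center)
        if d < best_dist:
            best, best_dist = cid, d
    return (best, point_id)

centroids = {"c1": (0.5, 0.0), "c2": (9.5, 0.0)}
print(assign_map("p7", (1.0, 1.0), centroids))  # -> ('c1', 'p7')
```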
KMeans: multi-pass clustering
Traditional KMeans():
While not converged:
– AssignCluster(): for each point p, assign p to the closest center c
– UpdateCentroids(): for each cluster, update the cluster center to the mean of its assigned points
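A self-contained sketch of multi-pass KMeans where each iteration plays the role of one MapReduce job: the assignment loop is the map phase and the per-cluster mean is the reduce phase. The in-memory shuffle and all names are illustrative assumptions.

```python
import math

def kmeans(points, centroids, max_iters=20):
    # Each iteration corresponds to one MapReduce job over all points.
    for _ in range(max_iters):
        groups = {}                               # in-memory shuffle
        for p in points:                          # "AssignCluster" (map)
            cid = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            groups.setdefault(cid, []).append(p)
        new = [tuple(sum(xs) / len(ps) for xs in zip(*ps))
               for _, ps in sorted(groups.items())]  # "UpdateCentroids" (reduce)
        if new == centroids:                      # converged (exact match here;
            break                                 # real code uses a tolerance)
        centroids = new                           # assumes no cluster empties
    return centroids

print(kmeans([(0, 0), (1, 1), (9, 9), (10, 10)], [(0, 0), (10, 10)]))
# -> [(0.5, 0.5), (9.5, 9.5)]
```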