Large-scale Data Mining: MapReduce and Beyond, Part 2: Algorithms


1. Large-scale Data Mining: MapReduce and Beyond
   Part 2: Algorithms
   Spiros Papadimitriou, IBM Research
   Jimeng Sun, IBM Research
   Rong Yan, Facebook

2. Part 2: Mining using MapReduce
   - Mining algorithms using MapReduce
     - Information retrieval
     - Graph algorithms: PageRank
     - Clustering: Canopy clustering, KMeans
     - Classification: kNN, Naïve Bayes
   - MapReduce Mining Summary

3. MapReduce Interface and Data Flow
   - Map: (K1, V1) → list(K2, V2)
   - Combine: (K2, list(V2)) → list(K2, V2)
   - Partition: (K2, V2) → reducer_id
   - Reduce: (K2, list(V2)) → list(K3, V3)
   [Figure: data flow for an inverted-index job. Hosts 1 and 2 run Map, turning (id, doc) into list(w, id), then Combine, which de-duplicates to list(unique_w, id); the partition function routes w1 to the reducer on host 3 and w2 to the reducer on host 4, which emit (w1, list(unique_id)) and (w2, list(unique_id)).]
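A minimal sketch of the four interfaces as plain Python generators, following the inverted-index flow in the figure; the function names and the driver that would wire them together are illustrative, not part of any specific Hadoop API.

```python
def map_fn(doc_id, doc):
    # Map: (K1, V1) -> list(K2, V2); here (id, doc) -> list(word, id)
    for word in doc.split():
        yield word, doc_id

def combine_fn(word, doc_ids):
    # Combine: (K2, list(V2)) -> list(K2, V2); local de-duplication per host
    for doc_id in set(doc_ids):
        yield word, doc_id

def partition_fn(word, num_reducers):
    # Partition: (K2, V2) -> reducer_id; decides which host reduces this key
    return hash(word) % num_reducers

def reduce_fn(word, doc_ids):
    # Reduce: (K2, list(V2)) -> list(K3, V3); here (word, list(unique_id))
    yield word, sorted(set(doc_ids))
```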

4. Information retrieval using MapReduce

5. IR: Distributed Grep
   - Find the doc_id and line# of every line matching a pattern
   - Map: (id, doc) → list(id, line#)
   - Reduce: none
   [Figure: grep "data mining" over docs 1-6; Map1 (docs 1-2) emits <1, 123>, Map2 (docs 3-5) emits <3, 717> and <5, 1231>, Map3 (doc 6) emits <6, 1012>.]
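Since there is no reduce step, a mapper alone does the work. A sketch, with the pattern hard-coded for illustration:

```python
import re

def grep_map(doc_id, doc, pattern="data mining"):
    # emit (doc_id, line#) for every line of the document matching the pattern
    for line_no, line in enumerate(doc.splitlines(), start=1):
        if re.search(pattern, line):
            yield doc_id, line_no
```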

6. IR: URL Access Frequency
   - Map: (null, log) → (URL, 1)
   - Reduce: (URL, list(1)) → (URL, total_count)
   [Figure: the mappers emit <u1,1>, <u2,1>, <u1,1>, <u3,1>, <u3,1>; Reduce outputs <u1,2>, <u2,1>, <u3,2>.]
   Also described in Part 1.
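A sketch in the same style, assuming for illustration that the URL is the first whitespace-separated field of each log line:

```python
def url_map(_, log_line):
    url = log_line.split()[0]   # assumed log format: URL first, rest ignored
    yield url, 1

def url_reduce(url, counts):
    yield url, sum(counts)      # total accesses for this URL
```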

7. IR: Reverse Web-Link Graph
   - Map: (null, page) → (target, source)
   - Reduce: (target, list(source)) → (target, list(source))
   [Figure: the mappers emit <t1,s2>, <t2,s3>, <t2,s5>, <t3,s5>; Reduce outputs <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>.]
   This is the same operation as a matrix transpose.
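A sketch, assuming each page has already been parsed into its source URL and the list of targets it links to:

```python
def reverse_map(source, targets):
    # flip each (source -> target) edge into (target, source)
    for target in targets:
        yield target, source

def reverse_reduce(target, sources):
    # gather all pages that link to this target
    yield target, list(sources)
```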

8. IR: Inverted Index
   - Map: (id, doc) → list(word, id)
   - Reduce: (word, list(id)) → (word, list(id))
   [Figure: the mappers emit <w1,1>, <w2,2>, <w3,3>, <w1,5>; Reduce outputs <w1,[1,5]>, <w2,[2]>, <w3,[3]>.]
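A sketch of both steps; de-duplicating inside the mapper stands in for the combiner:

```python
def index_map(doc_id, doc):
    for word in set(doc.split()):   # each (word, id) pair emitted once per doc
        yield word, doc_id

def index_reduce(word, doc_ids):
    yield word, sorted(doc_ids)     # the posting list for this word
```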

9. Graph mining using MapReduce

10. PageRank
   - The PageRank vector q is defined as q = c·Aᵀq + ((1 − c)/N)·e; the first term models browsing, the second teleporting.
   - A is the source-by-destination adjacency matrix. For the 4-node example graph:
         A = [ 0 1 1 1
               0 0 1 1
               0 0 0 1
               0 0 1 0 ]
   - e is the all-ones vector.
   - N is the number of nodes.
   - c is a weight between 0 and 1 (e.g., 0.85).
   - PageRank indicates the importance of a page.
   - Algorithm: iterative powering to find the first eigenvector.
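For concreteness, a small NumPy power-iteration sketch over the 4-node example. One assumption made explicit here: A's rows are normalized by out-degree, so that Aᵀ is column-stochastic and q stays a probability distribution.

```python
import numpy as np

def pagerank(A, c=0.85, iters=50):
    # power iteration for q = c * A^T q + (1 - c)/N * e
    N = A.shape[0]
    q = np.full(N, 1.0 / N)
    for _ in range(iters):
        q = c * (A.T @ q) + (1.0 - c) / N
    return q

# the slide's example graph, rows normalized by out-degree
A = np.array([[0, 1, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A /= A.sum(axis=1, keepdims=True)
print(pagerank(A))
```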

11. MapReduce: PageRank
   PageRank Map()
   - Input: key = page x, value = (PageRank q_x, links [y_1 … y_m])
   - Output: key = page x, value = partial_x
   1. Emit(x, 0)   // guarantee all pages will be emitted
   2. For each outgoing link y_i: Emit(y_i, q_x / m)
   PageRank Reduce()
   - Input: key = page x, value = the list of partial_x values
   - Output: key = page x, value = PageRank q_x
   1. q_x = 0
   2. For each partial value d in the list: q_x += d
   3. q_x = c·q_x + (1 − c)/N
   4. Emit(x, q_x)
   [Figure: Map distributes each q_i along the out-links of the 4-node example graph; Reduce gathers the partials to update each q_x.]
   See Kang et al., ICDM ’09.
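The same two functions as a Python sketch; N = 4 matches the example, and a production job would also re-emit each page's link list so the next iteration can reuse it:

```python
def pagerank_map(x, value):
    q_x, links = value                # (current PageRank of x, [y_1 ... y_m])
    yield x, 0.0                      # step 1: every page appears in the output
    for y in links:                   # step 2: distribute q_x over the out-links
        yield y, q_x / len(links)

def pagerank_reduce(x, partials, c=0.85, N=4):
    q_x = sum(partials)               # steps 1-2: accumulate partial values
    yield x, c * q_x + (1.0 - c) / N  # step 3: apply the teleport term
```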

12. Clustering using MapReduce

13. Canopy: single-pass clustering
   Canopy creation:
   - Construct overlapping clusters, called canopies.
   - Make sure no two canopies overlap too much.
   - Key: no two canopy centers are too close to each other.
   [Figure: two thresholds T1 > T2 define overlapping canopies C1-C4; canopies whose centers fall within T2 of each other would overlap too much.]
   McCallum, Nigam and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching", KDD ’00

14. Canopy creation
   Input: (1) points; (2) thresholds T1, T2 with T1 > T2
   Output: cluster centroids
   - Put all points into a queue Q.
   - While Q is not empty:
     - p = dequeue(Q)
     - For each canopy c:
       - if dist(p, c) < T1: c.add(p)
       - if dist(p, c) < T2: strongBound = true
     - If not strongBound: create a canopy at p.
   - For each canopy c: set its centroid to the mean of all points in c.
   (A sketch of this procedure follows.)
   McCallum, Nigam and Ungar, KDD ’00
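A sequential sketch of the pseudocode above, using scalar points and absolute distance for brevity (vector points work the same way):

```python
def canopy_centers(points, T1, T2, dist):
    canopies = []                          # list of [center, members]
    for p in points:                       # dequeue the points one by one
        strongly_bound = False
        for center, members in canopies:
            d = dist(p, center)
            if d < T1:
                members.append(p)          # p joins every canopy within T1
            if d < T2:
                strongly_bound = True      # p too close to an existing center
        if not strongly_bound:
            canopies.append([p, [p]])      # create a new canopy at p
    return [sum(ms) / len(ms) for _, ms in canopies]

# example: three well-separated groups of scalar points
centers = canopy_centers([1.0, 1.2, 5.0, 5.1, 9.0], T1=2.0, T2=0.5,
                         dist=lambda a, b: abs(a - b))
```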

(Slides 15-18 repeat this pseudocode, with figures highlighting the canopy centers, the strongly bound points within T2 of a center, and the other cluster points within T1.)

19. MapReduce Canopy: Map()
   - Input: a set of points P; thresholds T1, T2
   - Output: key = null; value = a list of local canopies as (total, count) pairs
   - For each p in P:
     - For each canopy c:
       - if dist(p, c) < T1: c.total += p; c.count++
       - if dist(p, c) < T2: strongBound = true
     - If not strongBound: create a canopy at p.
   Close():
   - For each canopy c: Emit(null, (total, count))
   [Figure: Map1 and Map2 each build canopies over their own partition of the points.]
   McCallum, Nigam and Ungar, KDD ’00

20. MapReduce Canopy: Reduce()
   - Input: key = null; values = the mappers' (total, count) pairs
   - Output: key = null; value = cluster centroids
   - For each intermediate value (total, count):
     - p = total / count
     - For each canopy c:
       - if dist(p, c) < T1: c.total += p; c.count++
       - if dist(p, c) < T2: strongBound = true
     - If not strongBound: create a canopy at p.
   Close():
   - For each canopy c: Emit(null, c.total / c.count)
   Remark: for simplicity, this assumes a single reducer.
   [Figure: the reducer merges the local canopies from Map1 and Map2 into global centroids.]
   McCallum, Nigam and Ungar, KDD ’00
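Both sides as a Python sketch: each mapper runs the canopy loop over its local partition and emits running sums; the single reducer repeats the same loop over the mappers' local centroids.

```python
def canopy_map(points, T1, T2, dist):
    canopies = []                        # local canopies as [center, total, count]
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = dist(p, c[0])
            if d < T1:
                c[1] += p                # accumulate running sum
                c[2] += 1
            if d < T2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append([p, p, 1])
    for _, total, count in canopies:     # Close(): emit the local sums
        yield None, (total, count)

def canopy_reduce(_, values, T1, T2, dist):
    canopies = []                        # same loop over the local centroids
    for total, count in values:
        p = total / count
        strongly_bound = False
        for c in canopies:
            d = dist(p, c[0])
            if d < T1:
                c[1] += p
                c[2] += 1
            if d < T2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append([p, p, 1])
    for _, total, count in canopies:     # Close(): emit the global centroids
        yield None, total / count
```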


22. Cluster Assignment
   - For each point p: assign p to the closest canopy center.
   McCallum, Nigam and Ungar, KDD ’00

23. MapReduce: Cluster Assignment
   Map()
   - Input: point p; the cluster centroids
   - Output: key = cluster id; value = point id
   - currentDist = inf
   - For each cluster centroid c:
     - If dist(p, c) < currentDist: bestCluster = c; currentDist = dist(p, c)
   - Emit(bestCluster, p)
   Results can be written directly back to HDFS without a reducer, or an identity reducer can be applied so the output comes out sorted by cluster id.
   McCallum, Nigam and Ungar, KDD ’00
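A sketch of the assignment mapper; here the centroid itself serves as the cluster key:

```python
import math

def assign_map(point_id, p, centroids, dist):
    best, best_dist = None, math.inf
    for c in centroids:               # scan every centroid for the nearest one
        d = dist(p, c)
        if d < best_dist:
            best, best_dist = c, d
    yield best, point_id              # key = cluster, value = point id
```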

24. KMeans: multi-pass clustering
   Traditional KMeans():
   - While not converged:
     - AssignCluster()
     - UpdateCentroids()
   AssignCluster():
   - For each point p: assign p to the closest centroid c.
   UpdateCentroids():
   - For each cluster: update the cluster center.
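One KMeans iteration in the same map/reduce style as the slides above (a sketch, not the deck's code): the map side performs AssignCluster() and emits partial sums, the reduce side performs UpdateCentroids(), and a driver loops until convergence, broadcasting the current centroids to every mapper on each pass.

```python
def kmeans_map(point_id, p, centroids, dist):
    # AssignCluster(): emit (nearest centroid index, partial sum) for p
    best = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
    yield best, (p, 1)

def kmeans_reduce(cluster_id, values):
    # UpdateCentroids(): new center = mean of the points assigned to the cluster
    total, count = 0.0, 0
    for p, n in values:
        total += p
        count += n
    yield cluster_id, total / count
```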
