
MapReduce and Streaming Algorithms for Center-Based Clustering in Doubling Spaces
Geppino Pucci, DEI, University of Padova, Italy
Based on joint works with M. Ceccarello, A. Mazzetto, and A. Pietracaprina

Center-based clustering
Center-based clustering in general metric spaces: given a pointset $S$ in a metric space with distance $d(\cdot,\cdot)$, determine a set $C^\star \subseteq S$ of $k$ centers minimizing:
◮ $\max_{p \in S} d(p, C^\star)$ (k-center)
◮ $\sum_{p \in S} d(p, C^\star)$ (k-median)
◮ $\sum_{p \in S} d(p, C^\star)^2$ (k-means)
where $d(p, C^\star) = \min_{c \in C^\star} d(p, c)$.
Remark: in general metric spaces it makes sense to require that $C^\star \subseteq S$. This assumption is often relaxed in Euclidean spaces (continuous vs. discrete version).
Variant: center-based clustering with $z$ outliers: disregard the $z$ largest distances when computing the objective function.
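The three objectives, and the outlier variant, can be sketched in a few lines of Python. This is only an illustration: the function name `clustering_cost`, its `objective` strings, and the default Euclidean distance `math.dist` are assumptions made for the example, not part of the talk.

```python
import math

def clustering_cost(points, centers, objective="center", z=0, d=math.dist):
    """Cost of `centers` on `points`, ignoring the z largest distances.

    objective: "center" (max), "median" (sum) or "means" (sum of squares).
    `d` is any metric; Euclidean distance is only a default for the example.
    """
    # Distance of each point from its closest center, in increasing order.
    dists = sorted(min(d(p, c) for c in centers) for p in points)
    if z > 0:
        dists = dists[:-z]          # drop the z farthest points (outliers)
    if not dists:
        return 0.0
    if objective == "center":
        return dists[-1]
    if objective == "median":
        return sum(dists)
    if objective == "means":
        return sum(x * x for x in dists)
    raise ValueError(f"unknown objective: {objective}")

if __name__ == "__main__":
    S = [(0, 0), (1, 0), (5, 0), (100, 0)]
    C = [(0, 0), (5, 0)]
    print(clustering_cost(S, C, "center"))        # radius dominated by the far point
    print(clustering_cost(S, C, "center", z=1))   # same centers, one outlier allowed
```

Note how a single far point dominates the k-center radius, which is what motivates the variant with outliers.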

Example: pointset instance (figure)

Example: solution to 4-center (figure)
Optimal radius $r^\star(k)$ = max distance of any $x \in S$ from $C^\star$

Example: solution to 4-center with 2 outliers (figure)
Optimal radius $r^\star(k, z)$ = max distance of any non-outlier $x \in S$ from $C^\star$

Center-based clustering for big data
1. Deal with very large pointsets
   ◮ MapReduce distributed setting
   ◮ Streaming setting
2. Aim: try to match the best sequential approximation ratios with limited local/working space
3. Very simple algorithms with good practical performance
4. Concentrate on k-center with and without outliers [CeccarelloPietracaprinaP, VLDB2019]
5. End of the talk: sketch of very recent results for k-median and k-means [MazzettoPietracaprinaP, arXiv 2019]

Outline
◮ Background
   ◮ MapReduce and Streaming models
   ◮ Previous work
   ◮ Doubling dimension
◮ k-center (with and without outliers):
   ◮ Summary of results
   ◮ Coreset selection: main idea
   ◮ MapReduce algorithms
   ◮ Porting to the Streaming setting
   ◮ Experiments
◮ Sketch of new results for k-median and k-means

Background: MapReduce and Streaming models
MapReduce
◮ Targets distributed cluster-based architectures
◮ Computation: sequence of rounds where data (key-value pairs) are mapped by key into subsets and processed in parallel by reducers equipped with small local memory
◮ Goals: few rounds, (substantially) sublinear local memory, linear aggregate memory
Streaming
◮ Data provided as a continuous stream and processed using small working memory
◮ Multiple passes over the data may be allowed
◮ Goals: 1 (or few) pass(es), (substantially) sublinear working memory
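To make the notion of a round concrete, here is a toy, single-machine simulation of one MapReduce round. It is a sketch for intuition only: `mapreduce_round`, `map_fn` and `reduce_fn` are hypothetical names, and a real framework would run the reducers in parallel on separate machines.

```python
from collections import defaultdict

def mapreduce_round(pairs, map_fn, reduce_fn):
    """Simulate one round: map each key-value pair, group by key, reduce each group.

    map_fn(key, value)     -> iterable of (new_key, new_value) pairs
    reduce_fn(key, values) -> iterable of output (key, value) pairs
    """
    groups = defaultdict(list)
    for key, value in pairs:
        for new_key, new_value in map_fn(key, value):
            groups[new_key].append(new_value)
    output = []
    # Each group would be handled by one reducer with small local memory.
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

if __name__ == "__main__":
    data = list(enumerate([5, 3, 8, 1, 9, 2]))
    # Split the values into 2 subsets by index, then take the max of each subset.
    result = mapreduce_round(
        data,
        map_fn=lambda i, v: [(i % 2, v)],
        reduce_fn=lambda key, vals: [(key, max(vals))],
    )
    print(result)
```

The local memory of a reducer corresponds to the size of the largest group, which is the quantity the talk aims to keep sublinear in the input size.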

Background: Previous work
◮ Sequential algorithms for general metric spaces:
   ◮ k-center: 2-approximation ($O(nk)$ time) and $2 - \epsilon$ inapproximability [Gonzalez85]
   ◮ k-center with z outliers: 3-approximation ($O(n^2 k \log n)$ time) [Charikar+01]
◮ MapReduce algorithms:

Reference            Rounds            Approx.   Local memory
k-center problem
[Ene+11] (w.h.p.)    $O(1/\epsilon)$   10        $O(k^2 |S|^{\epsilon})$
[Malkomes+15]        2                 4         $O((|S| k)^{1/2})$
k-center problem with z outliers
[Malkomes+15]        2                 13        $O((|S| (k+z))^{1/2})$

Background: Previous work (cont’d)
◮ Streaming algorithms:

Reference            Passes   Approx.          Working memory
k-center problem
[McCutchen+08]       1        $2 + \epsilon$   $O(k \epsilon^{-1} \log \epsilon^{-1})$
k-center problem with z outliers
[McCutchen+08]       1        $4 + \epsilon$   $O(k z \epsilon^{-1})$

Background: doubling dimension
Our algorithms are analyzed in terms of the doubling dimension $D$ of the metric space:
for every $r$, any ball of radius $r$ is covered by at most $2^D$ balls of radius $r/2$.
(Figure: a ball of radius $r$ covered by balls of radius $r/2$.)
Examples of spaces with bounded doubling dimension:
◮ Euclidean spaces
◮ Shortest-path distances of mildly expanding topologies
◮ Low-dimensional pointsets of high-dimensional spaces
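The bounds of the form $(c/\epsilon)^D$ that appear later in the talk come from applying this definition repeatedly. The short derivation below is a standard consequence of the definition, spelled out here for convenience; the index $j$ is introduced only for this illustration.

```latex
% Applying the definition j times: a ball of radius r is covered by
% at most 2^{jD} balls of radius r / 2^j.
\[
  B(x, r) \;\subseteq\; \bigcup_{i=1}^{2^{jD}} B\!\left(x_i, \frac{r}{2^j}\right).
\]
% For 0 < \epsilon \le 1, choose j = \lceil \log_2 (1/\epsilon) \rceil,
% so that r / 2^j \le \epsilon r. The number of balls is then
\[
  2^{jD} \;\le\; 2^{D(\log_2(1/\epsilon) + 1)} \;=\; \left(\frac{2}{\epsilon}\right)^{D}.
\]
```

In other words, shrinking the radius by a factor $\epsilon$ costs a factor $(2/\epsilon)^D$ in the number of balls, which is constant when $\epsilon$ and $D$ are constant.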

Summary of results
Our algorithms (values in parentheses: corresponding bounds from previous work)

Model                Rounds/Passes   Approx.                         Local/Working memory
k-center problem
MapReduce            2               $2 + \epsilon$ (4)              $O(\sqrt{|S| k}\,(4/\epsilon)^D)$   ($O(\sqrt{|S| k})$)
k-center problem with z outliers
MapReduce            2               $3 + \epsilon$ (13)             $O(\sqrt{|S| (k+z)}\,(24/\epsilon)^D)$   ($O(\sqrt{|S| (k+z)})$)
MapReduce (w.h.p.)   2               $3 + \epsilon$                  $O((\sqrt{|S| (k + \log |S|)} + z)\,(24/\epsilon)^D)$
Streaming            1               $3 + \epsilon$ ($4 + \epsilon$)   $O((k+z)\,(96/\epsilon)^D)$   ($O(kz/\epsilon)$)

◮ Substantial improvement in approximation quality at the expense of larger memory requirements (constant factor for constant $\epsilon$, $D$)
◮ MR algorithms are oblivious to $D$
◮ Large constants are due to the analysis. Experiments show the practicality of our approach.

Summary of results (cont’d)
Main features
◮ (Composable) coreset approach: select a small $T \subseteq S$ containing a good solution for $S$, then run (an adaptation of) the best sequential approximation algorithm on $T$
◮ Flexibility: coreset construction can be either distributed (MapReduce) or streamlined (Streaming)
◮ Adaptivity: memory/approximation tradeoffs expressed in terms of the doubling dimension $D$ of the pointset
◮ Quality: MR and Streaming algorithms using small memory and almost matching the best sequential approximations

Coreset selection: main idea
◮ Let $r^\star$ = max distance of any (non-outlier) $x \in S$ from its closest optimal center
◮ Select a coreset $T \subseteq S$ ensuring that $d(x, T) \leq \epsilon\, r^\star$ for all $x \in S - T$, using a sequential $h$-center approximation, for $h$ suitably larger than $k$. (Similar idea in [CeccarelloPietracaprinaPUpfal17] for diversity maximization → next talk)
◮ Obs: in general, $T$ must contain outliers

Example: pointset instance (figure)

Example: optimal solution for $k = 4$, $z = 2$ (figure)

Example: 10-point coreset $T$ (red points) (figure)

MapReduce algorithms
Basic primitive for coreset selection (based on [Gonzalez85])

Select($S'$, $h$, $\epsilon$):
   Input: subset $S' \subseteq S$, parameters $h$, $\epsilon > 0$
   Output: coreset $T \subseteq S'$ of size $\geq h$
   $T \leftarrow \{c_1\}$, for an arbitrary point $c_1 \in S'$
   $r(1) \leftarrow$ max distance of any $x \in S'$ from $T$
   for $i = 2, 3, \ldots$ do
      find the point $c_i \in S'$ farthest from $T$, and add it to $T$
      $r(i) \leftarrow$ max distance of any $x \in S'$ from $T$
      if ($i \geq h$) AND ($r(i) \leq (\epsilon/2)\, r(h)$) then return $T$

Lemma: let $r^\star(h)$ be the optimal $h$-center radius for the entire set $S$ and let $\mathit{last}$ be the index of the last iteration of Select. Then $r(\mathit{last}) \leq \epsilon\, r^\star(h)$.
Proof idea: by a simple adaptation of Gonzalez’s proof, $r(h) \leq 2\, r^\star(h)$; hence $r(\mathit{last}) \leq (\epsilon/2)\, r(h) \leq \epsilon\, r^\star(h)$.
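A direct Python transcription of Select might look as follows. It is a sketch under a few assumptions that are not in the slides: points are tuples compared with Euclidean distance by default, the "arbitrary" first point is simply the first one in the list, and the loop also stops when the radius reaches zero (a guard added so the sketch terminates on inputs with fewer than $h$ distinct points).

```python
import math

def select(points, h, eps, d=math.dist):
    """Farthest-first coreset selection, following Select(S', h, eps).

    Returns (T, radii) where radii[i - 1] is r(i), the max distance of any
    point from the first i selected points.
    """
    if not points:
        raise ValueError("points must be non-empty")
    T = [points[0]]                                   # arbitrary first point c_1
    dist_to_T = [d(p, points[0]) for p in points]     # d(x, T) for every x
    radii = [max(dist_to_T)]                          # r(1)
    while True:
        i = len(T)
        # Stopping rule of the slide, plus a zero-radius guard (assumption).
        if radii[-1] == 0 or (i >= h and radii[-1] <= (eps / 2) * radii[h - 1]):
            return T, radii
        far = max(range(len(points)), key=dist_to_T.__getitem__)
        T.append(points[far])                         # c_{i+1}: farthest from T
        dist_to_T = [min(old, d(p, points[far]))
                     for old, p in zip(dist_to_T, points)]
        radii.append(max(dist_to_T))                  # r(i+1)

if __name__ == "__main__":
    S = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (20, 0)]
    T, radii = select(S, h=2, eps=1.0)
    print(T, radii)
```

Maintaining `dist_to_T` incrementally keeps each iteration linear in the number of points, so selecting $m$ points costs $O(m\,|S'|)$ distance computations.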

MapReduce algorithms: k-center (figure)

MapReduce algorithms: k-center (cont’d)
Analysis
◮ Approximation quality: let $C = \{c_1, \ldots, c_k\}$ be the returned centers. For any $x \in S_j$ (arbitrary $j$), with $t \in T_j$ the coreset point closest to $x$:
   $d(x, C) \leq d(x, t) + d(t, C) \leq \epsilon\, r^\star(k) + 2\, r^\star(k) = (2 + \epsilon)\, r^\star(k)$
◮ Memory requirements: assume doubling dimension $D$
   ◮ set $\ell = \sqrt{|S|/k}$
   ◮ Technical lemma: $|T_j| \leq k\,(4/\epsilon)^D$, for every $1 \leq j \leq \ell$
   ⇒ Local memory = $O(\sqrt{|S| k}\,(4/\epsilon)^D)$, since each subset has $|S|/\ell = \sqrt{|S| k}$ points and the union of the $\ell$ coresets has at most $\ell\, k\,(4/\epsilon)^D = \sqrt{|S| k}\,(4/\epsilon)^D$ points.
Remarks:
◮ For constant $\epsilon$ and $D$: $(2 + \epsilon)$-approximation with the same memory requirements as the 4-approximation in [Malkomes+15]
◮ Our algorithm is oblivious to $D$
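Putting the pieces together, the two-round scheme analyzed above can be simulated sequentially. The sketch below assumes the following reading of the algorithm: round 1 runs Select with $h = k$ on each subset $S_j$, and round 2 runs the farthest-first algorithm of [Gonzalez85] on the union of the coresets. The names `farthest_first` and `mr_k_center`, the round-robin partition, and the Euclidean default metric are illustrative choices, not taken from the slides.

```python
import math

def farthest_first(points, h, eps=None, d=math.dist):
    """Farthest-first traversal.

    eps=None: stop at h centers (Gonzalez-style k-center).
    eps set : keep going until r(i) <= (eps/2) * r(h), as in Select.
    """
    T = [points[0]]
    dist_to_T = [d(p, points[0]) for p in points]
    radii = [max(dist_to_T)]
    while radii[-1] > 0:                      # zero radius: nothing left to cover
        i = len(T)
        if i >= h and (eps is None or radii[-1] <= (eps / 2) * radii[h - 1]):
            break
        far = max(range(len(points)), key=dist_to_T.__getitem__)
        T.append(points[far])
        dist_to_T = [min(old, d(p, points[far]))
                     for old, p in zip(dist_to_T, points)]
        radii.append(max(dist_to_T))
    return T

def mr_k_center(points, k, eps, num_parts=None, d=math.dist):
    """Sequential simulation of the two-round MapReduce k-center scheme."""
    if num_parts is None:
        # ell = sqrt(|S| / k), as in the memory analysis.
        num_parts = max(1, round(math.sqrt(len(points) / k)))
    # Round 1: arbitrary partition; each "reducer" builds a coreset T_j.
    parts = [points[j::num_parts] for j in range(num_parts)]
    coresets = [farthest_first(part, k, eps, d) for part in parts if part]
    # Round 2: a single reducer clusters the union of the coresets.
    union = [t for coreset in coresets for t in coreset]
    return farthest_first(union, k, None, d)

def radius(points, centers, d=math.dist):
    """Max distance of any point from its closest center."""
    return max(min(d(p, c) for c in centers) for p in points)

if __name__ == "__main__":
    square = [(0, 0), (1, 0), (0, 1), (1, 1)]
    S = [(x + dx, y + dy) for dx, dy in [(0, 0), (100, 0), (0, 100)]
         for x, y in square]
    C = mr_k_center(S, k=3, eps=1.0)
    print(C, radius(S, C))
```

On this toy input (three well-separated unit squares) the returned centers fall one per square, so the radius is far below the $(2 + \epsilon)\, r^\star(k)$ guarantee.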

MapReduce algorithms: k-center with z outliers
Similar approach to the case without outliers, but with some important differences:
1. Each coreset $T_j \subseteq S_j$ must contain $\geq k + z$ points (making room for outliers)
2. Each $t \in T_j$ has a weight $w(t)$ = number of points of $S_j - T_j$ for which $t$ is the proxy (i.e., the closest coreset point). Let $T_j^w$ denote the set $T_j$ with weights.
3. On $T^w = \bigcup_j T_j^w$, a suitable weighted variant of the algorithm in [Charikar+01] (dubbed Charikar$_w$) is run, which:
   ◮ determines $k$ suitable centers (final solution) covering most points of $T^w$
   ◮ uncovered points of $T^w$ have aggregate weight $z$ and are the proxies of the outliers
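The weighting step in item 2 can be sketched as below. The function name `weighted_coreset` is hypothetical, points are assumed to be distinct hashable tuples, and ties between equally close coreset points are broken by list order; none of these details come from the slides.

```python
import math

def weighted_coreset(part, coreset, d=math.dist):
    """Attach to each coreset point t the number of points of part - coreset
    whose closest coreset point (proxy) is t."""
    in_coreset = set(coreset)
    weights = {t: 0 for t in coreset}
    for x in part:
        if x in in_coreset:
            continue                       # only points of S_j - T_j are counted
        proxy = min(coreset, key=lambda t: d(x, t))
        weights[proxy] += 1
    return weights

if __name__ == "__main__":
    S_j = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
    T_j = [(0, 0), (10, 0)]
    print(weighted_coreset(S_j, T_j))
```

The weights let the second round reason about how many original points each coreset point stands for, which is what allows a bound of $z$ on the total weight left uncovered.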

MapReduce algorithms: k-center with z outliers (cont’d) (figure)

MapReduce algorithms: k-center with z outliers (cont’d)
Analysis
◮ Approximation quality: let $C = \{c_1, \ldots, c_k\}$ be the returned centers. For any non-outlier $x \in S_j$ (arbitrary $j$) with proxy $t \in T_j^w$:
   $d(x, t) \leq \epsilon\, r^\star(k, z)$ and $d(t, C) \leq (3 + 5\epsilon)\, r^\star(k, z)$
   ⇒ $(3 + \epsilon')$-approximation for every $\epsilon' > 0$.
◮ Memory requirements: assume doubling dimension $D$
   ◮ set $\ell = \sqrt{|S|/(k+z)}$
   ◮ Technical lemma: $|T_j| \leq (k+z)\,(4/\epsilon)^D$, for every $1 \leq j \leq \ell$
   ⇒ Local memory = $O(\sqrt{|S| (k+z)}\,(4/\epsilon)^D)$.
Remarks:
◮ For constant $\epsilon$ and $D$: $(3 + \epsilon)$-approximation with the same memory requirements as the 13-approximation in [Malkomes+15]
◮ Our algorithm is oblivious to $D$
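The step from the two distance bounds to the $(3 + \epsilon')$ guarantee is a triangle inequality followed by a rescaling of $\epsilon$. It is spelled out below as a reading aid; the choice $\epsilon = \epsilon'/6$ is inferred from the stated bounds rather than quoted from the slides.

```latex
\begin{align*}
d(x, C) &\le d(x, t) + d(t, C) \\
        &\le \epsilon\, r^\star(k, z) + (3 + 5\epsilon)\, r^\star(k, z) \\
        &= (3 + 6\epsilon)\, r^\star(k, z).
\end{align*}
% Setting \epsilon = \epsilon'/6 yields a (3 + \epsilon')-approximation,
% and the coreset size factor becomes
\[
  \left(\frac{4}{\epsilon}\right)^{D} = \left(\frac{24}{\epsilon'}\right)^{D},
\]
% which agrees with the (24/\epsilon)^D factor in the summary of results.
```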

MapReduce algorithms: k-center with z outliers (cont’d)
Randomized variant
◮ Create $S_1, S_2, \ldots, S_\ell$ as a random partition (⇒ $z' = O(z/\ell + \log |S|)$ outliers per partition w.h.p.)
◮ Execute the deterministic algorithm with $z'$ in lieu of $z$
Analysis
◮ Approximation quality: $(3 + \epsilon')$ (as before)
◮ Memory requirements: $O((\sqrt{|S| (k + \log |S|)} + z)\,(24/\epsilon)^D)$
Remark:
◮ For constant $\epsilon$ and $D$: $O(\sqrt{|S| (k + \log |S|)} + z)$ local memory (linear dependence on $z$ desirable)
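The effect of the random partition can be explored empirically with a small simulation. This sketch assumes that each point is assigned to one of the $\ell$ subsets independently and uniformly at random; the helper names `random_partition` and `max_marked_per_part` are hypothetical, and the "outliers" are simply a marked subset of the input.

```python
import random

def random_partition(points, num_parts, seed=None):
    """Assign each point to one of num_parts subsets uniformly at random."""
    rng = random.Random(seed)
    parts = [[] for _ in range(num_parts)]
    for p in points:
        parts[rng.randrange(num_parts)].append(p)
    return parts

def max_marked_per_part(parts, marked):
    """Largest number of marked points (e.g. outliers) in any single subset."""
    marked = set(marked)
    return max(sum(1 for p in part if p in marked) for part in parts)

if __name__ == "__main__":
    points = list(range(10_000))
    outliers = range(200)                     # z = 200 marked points
    parts = random_partition(points, num_parts=20, seed=1)
    # Compare with z / ell = 10: the maximum should stay close to it.
    print(max_marked_per_part(parts, outliers))
```

Because no subset is likely to receive many more than $z/\ell$ outliers, each reducer only needs to make room for $z'$ of them rather than all $z$.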
