Diversity maximization in MapReduce and Streaming Under Cardinality and Matroid Constraints
Andrea Pietracaprina, University of Padova
Joint work with: M. Ceccarello, G. Pucci (U. Padova), and E. Upfal (Brown U.)
[VLDB17] and [WSDM18]

Outline
◮ Problem definition and applications
◮ Background
◮ Summary of results
◮ Our approach (cardinality constraint):
  ◮ Core-set construction
  ◮ MapReduce implementation
  ◮ Streaming implementation
  ◮ Future space savings
◮ Partition and transversal matroids
◮ Experiments
◮ Conclusions and future work

Problem definition and applications

Diversity maximization
Objective: For a given dataset, determine the most diverse subset of given (small) size k

Applications
◮ News/document aggregators
◮ e-commerce
◮ Facility location

Diversity maximization: formal definition
Given:
1. Set $S$ of points in a metric space $\Delta$
2. Distance function $d : \Delta \times \Delta \to \mathbb{R}^+ \cup \{0\}$
3. (Distance-based) diversity function $\mathrm{div} : 2^{\Delta} \to \mathbb{R}^+ \cup \{0\}$
4. Integer $k > 1$
Return $S^* \subset S$ s.t. $|S^*| = k$ and $S^* = \operatorname{argmax}_{S' \subseteq S,\ |S'| = k} \mathrm{div}(S')$
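To make the definition concrete, here is a minimal brute-force sketch in Python. It assumes the standard definitions of two of the measures (remote-edge as the minimum pairwise distance, remote-clique as the sum of pairwise distances), which the slides name but do not spell out. The function names and the sample points are illustrative only. The search enumerates all k-subsets, so it is only usable on tiny inputs.

```python
from itertools import combinations
import math


def remote_edge(points, dist):
    """Minimum pairwise distance (assumed standard definition of remote-edge)."""
    return min(dist(p, q) for p, q in combinations(points, 2))


def remote_clique(points, dist):
    """Sum of all pairwise distances (assumed standard definition of remote-clique)."""
    return sum(dist(p, q) for p, q in combinations(points, 2))


def brute_force_diversity(S, k, div, dist):
    """Return a k-subset S' of S maximizing div(S'); exponential in k."""
    return max(combinations(S, k), key=lambda subset: div(subset, dist))


# Hypothetical toy dataset in the Euclidean plane.
S = [(0, 0), (1, 0), (0, 1), (5, 5), (10, 0)]
best = brute_force_diversity(S, 3, remote_edge, math.dist)
print(best, remote_edge(best, math.dist))
```

Swapping `remote_edge` for `remote_clique` changes the objective without touching the search, which mirrors how the formal definition is parameterized by div.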

Matroid constraints
Matroids make it possible to express more complicated constraints, such as a categorization of the elements.
◮ Partition matroid
◮ Transversal matroid
Matroid over a set $S$: $M = (S, I(S))$, where $I(S)$ is the family of independent sets
$\mathrm{rank}(M) = \max_{X \in I(S)} |X|$

Diversity maximization under matroid constraints
Given:
1. Set $S$ of points in a metric space $\Delta$
2. Distance function $d : \Delta \times \Delta \to \mathbb{R}^+ \cup \{0\}$
3. (Distance-based) diversity function $\mathrm{div} : 2^{\Delta} \to \mathbb{R}^+ \cup \{0\}$
4. Matroid $M = (S, I(S))$ of rank $\mathrm{rank}(M)$
5. Integer $1 \le k \le \mathrm{rank}(M)$
Return $S^* \subset S$ s.t. $S^* \in I(S)$, $|S^*| = k$ and $S^* = \operatorname{argmax}_{S' \in I(S),\ |S'| = k} \mathrm{div}(S')$

Diversity measures studied in this work
◮ remote-edge
◮ remote-clique
◮ remote-star
◮ remote-bipartition
◮ remote-cycle
◮ remote-tree
All measures are NP-hard to optimize

Background

Sequential approximation and hardness results

Problem              Seq. approx.   Lower bound
Remote-edge          2              ≥ 2
Remote-clique        2              ≥ 2 − ε
Remote-star          2              –
Remote-bipartition   3              –
Remote-tree          4              ≥ 2
Remote-cycle         3              ≥ 2

Specialized results (hardness and better approximation ratios) exist for remote-clique and remote-edge under Euclidean distances

(Composable) Core-set
$\beta$-core-set [Agarwal et al.'95]
◮ A small subset $T$ (core-set) of the input $S$ s.t. $\mathrm{div}_k(T) \ge (1/\beta)\, \mathrm{div}_k(S)$
◮ Compute the final solution on $T$.
$\beta$-composable core-set [Indyk et al.'14]
◮ Partitioned input $S = S_1 \cup S_2 \cup \cdots \cup S_\ell$
◮ $\beta$-composable core-sets $T_i \subset S_i$ ⇒ $\bigcup_i T_i$ is a $\beta$-core-set

Previous work
Known $\beta$-composable core-sets for diversity maximization under cardinality constraints ([Indyk et al.'14, Aghamolaei et al.'15])

Measure              $\beta$   $\alpha_{seq}$   $\beta \cdot \alpha_{seq}$
Remote-edge          3         2                6
Remote-clique        6 + ε     2                12 + ε
Remote-star          12        2                24
Remote-bipartition   18        3                54
Remote-tree          4         4                16
Remote-cycle         3         3                9

◮ $\alpha_{seq}$ = best sequential approximation ratio
◮ General metric spaces
◮ Core-set size: $k$

Previous work
The case of matroid constraints
◮ [Abbassi et al.'13]
◮ Remote-clique measure
◮ Sequential algorithm for remote-clique based on local search
◮ $(2 + \epsilon)$-approximation
◮ $\Omega(n^2)$ time

MapReduce and Streaming frameworks
MapReduce
◮ Data represented as a multiset of key-value pairs
◮ Algorithms execute as sequences of rounds
◮ Architecture: cluster of machines (workers)
◮ One round (distributed among workers):
  ◮ Map function: applied to each key-value pair
  ◮ Reduce function: applied to subsets of key-value pairs grouped by key
◮ Shuffle of data at each round
◮ Performance indicators: number of rounds and space required at each worker to execute the map/reduce functions
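The round structure described above can be sketched as a single-process Python simulation. This is only an illustration of the map / shuffle / reduce sequence, not a distributed implementation. The helper name `mapreduce_round` and the toy map and reduce functions are assumptions made for this example.

```python
from collections import defaultdict


def mapreduce_round(pairs, map_fn, reduce_fn):
    """Simulate one round: map every pair, group by key (shuffle), then reduce."""
    # Map phase: each input key-value pair yields zero or more new pairs.
    mapped = [kv for key, value in pairs for kv in map_fn(key, value)]
    # Shuffle phase: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one call per key, as if each key were sent to one worker.
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output


ELL = 3  # hypothetical number of partitions


def assign_partition(index, point):
    """Map: route each point to partition index mod ELL."""
    return [(index % ELL, point)]


def count_points(partition, points):
    """Reduce: report how many points landed in this partition."""
    return [(partition, len(points))]


data = list(enumerate(range(10)))  # (index, point) pairs
print(sorted(mapreduce_round(data, assign_partition, count_points)))
```

In this simulation the space used by a reducer corresponds to the length of the `values` list it receives, which is the per-worker space indicator mentioned in the slide.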

MapReduce and Streaming frameworks
Streaming
◮ One processor with limited space
◮ Input provided as a continuous stream: too large to fit in the available memory
◮ ≥ 1 passes over the input
◮ Performance indicators: number of passes and space available at the processor.
Proposition: The known composable core-sets for $k$-diversity maximization yield 2-round MapReduce and 1-pass Streaming algorithms using $O(\sqrt{k |S|})$ space.

Summary of Results

Summary of results
Our setting
Metric spaces of Bounded Doubling Dimension: $\exists D = O(1)$ s.t. any ball of radius $r$ is covered by $\le 2^D$ balls of radius $r/2$
◮ Euclidean spaces.
◮ Shortest-path distances of mildly expanding topologies.
◮ Low-dimensional pointsets from arbitrary metric space.

Summary of results
Our results (cardinality constraint):
◮ Improved $\beta$-(composable) core-sets: $\beta = 1 + \epsilon$
◮ Overall approximation: $\alpha_{seq} + \epsilon$
◮ 1-pass Streaming and 2-round MapReduce algorithms using space:

                Streaming                   MapReduce
r-edge/cycle    $O(k (c/\epsilon)^D)$       $O(\sqrt{k |S| (c/\epsilon)^D})$
other div's     $O(k^2 (c/\epsilon)^D)$     $O(k \sqrt{|S| (c/\epsilon)^D})$

for a suitable constant $c$.

Summary of results
◮ 1 extra pass/round brings the space bounds for the other div's down to those for r-edge/cycle

                Streaming                   MapReduce
all div's       $O(k (c/\epsilon)^D)$       $O(\sqrt{k |S| (c/\epsilon)^D})$

for a suitable constant $c$.

Summary of results
Our results (remote-clique, matroid constraints):
◮ 2 rounds in MapReduce, 1 pass in Streaming
◮ $(2 + \epsilon)$-approximation
◮ Space requirements:

                       Streaming                   MapReduce
Partition matroid      $O(k^2 (c/\epsilon)^D)$     $O(k \sqrt{|S| (c/\epsilon)^D})$
Transversal matroid    $O(k^3 (c/\epsilon)^D)$     $O(k^2 \sqrt{|S| (c/\epsilon)^D})$

for a suitable constant $c$.

Summary of results ◮ MapReduce algorithms are oblivious to D . ◮ Streaming algorithms can be made oblivious to D with 1 extra pass.

Our approach (cardinality constraint)

Core-set construction: algorithm
Input dataset: $S$
Optimal solution $\mathrm{OPT} \subset S$, with $|\mathrm{OPT}| = k$
MAIN IDEA: Compute a core-set $T$ such that each $o \in \mathrm{OPT}$ has a (distinct) proxy $p(o) \in T$ with "small" distance $d(o, p(o))$
1. Partition $S$ into $\tau > k$ clusters of small radius ($\tau$ is a function of the doubling dimension)
2. $T$ = { cluster centers }
3. If injectivity of $p(\cdot)$ is required (remote-clique/star/bipartition/tree): $T$ = { cluster centers } ∪ { ≤ $k - 1$ delegates for each cluster }.

Core-set construction: algorithm ◮ k = 3, τ = 8

Core-set construction: algorithm ◮ Compute τ -center clustering

Core-set construction: algorithm ◮ No injectivity required: T = { cluster centers } ( | T | = τ )

Core-set construction: algorithm ◮ Injectivity required: T = { k points per cluster } ( | T | ≤ k · τ )
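The construction walked through in the slides above can be sketched as follows. This is a minimal sketch, assuming Euclidean points and assuming that the clustering is computed with a farthest-first traversal, which is one admissible choice of clustering algorithm. The function names `farthest_first` and `core_set` and the choice of delegates (the first k − 1 non-center points met in each cluster) are illustrative, not the authors' implementation.

```python
import math


def farthest_first(S, tau, dist):
    """Farthest-first traversal: returns (center indices, cluster index of each point)."""
    centers = [0]                                  # arbitrary first center
    d = [dist(S[0], p) for p in S]                 # distance to the closest center so far
    cluster = [0] * len(S)
    while len(centers) < tau:
        far = max(range(len(S)), key=d.__getitem__)
        if d[far] == 0:                            # fewer than tau distinct points
            break
        centers.append(far)
        for i, p in enumerate(S):
            nd = dist(S[far], p)
            if nd < d[i]:                          # p is closer to the new center
                d[i], cluster[i] = nd, len(centers) - 1
    return centers, cluster


def core_set(S, k, tau, dist, injective):
    """Cluster centers, plus up to k - 1 delegates per cluster if injectivity is needed."""
    centers, cluster = farthest_first(S, tau, dist)
    T = [S[c] for c in centers]
    if injective:
        delegates = [0] * len(centers)
        center_ids = set(centers)
        for i, p in enumerate(S):
            j = cluster[i]
            if i not in center_ids and delegates[j] < k - 1:
                T.append(p)
                delegates[j] += 1
    return T


# Hypothetical input: a 10 x 10 grid, with k = 3 and tau = 8 as in the slides' example.
grid = [(x, y) for x in range(10) for y in range(10)]
T_centers = core_set(grid, 3, 8, math.dist, injective=False)
T_delegates = core_set(grid, 3, 8, math.dist, injective=True)
print(len(T_centers), len(T_delegates))
```

Here `tau` is passed in directly; choosing it as a function of the doubling dimension, as the analysis requires, is left outside this sketch.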

Core-set construction: analysis
◮ Radius: $r_k$ = minimum radius of a $k$-clustering of $S$
◮ Farness: $\rho_k$ = maximum, over all sets of $k$ points of $S$, of the minimum pairwise distance
Claim: For every $k$, $r_k \le \rho_k$
Proof: take $k$ centers with the algorithm of [Gonzalez85] (Farthest-First traversal). Their pairwise distance is at least the radius $r \ge r_k$ of the associated clustering, and at most $\rho_k$ by definition.
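The inequality used in the proof can be checked numerically. The sketch below assumes random points in the unit square and a fixed seed, both chosen only for illustration. It runs a farthest-first traversal and compares the radius of the resulting clustering with the minimum pairwise distance among the selected centers.

```python
import math
import random
from itertools import combinations


def farthest_first(S, k, dist):
    """Return k centers chosen by farthest-first traversal."""
    centers = [S[0]]                               # arbitrary first center
    d = [dist(S[0], p) for p in S]                 # distance to the closest center so far
    while len(centers) < k:
        i = max(range(len(S)), key=d.__getitem__)  # farthest point from current centers
        centers.append(S[i])
        d = [min(d[j], dist(S[i], q)) for j, q in enumerate(S)]
    return centers


random.seed(1)
S = [(random.random(), random.random()) for _ in range(100)]
k = 5
centers = farthest_first(S, k, math.dist)

# Radius of the clustering induced by the centers.
radius = max(min(math.dist(p, c) for c in centers) for p in S)
# Minimum pairwise distance among the centers.
min_pairwise = min(math.dist(a, b) for a, b in combinations(centers, 2))
print(radius <= min_pairwise)
```

Since the radius is at least $r_k$ and the minimum pairwise distance of any $k$ points is at most $\rho_k$, the printed comparison is an instance of the chain $r_k \le r \le \rho_k$.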

Core-set construction: analysis

Core-set construction: analysis
Claim: If $S$ has doubling dimension $D$ and $\tau = (16/\epsilon)^D k$, then $r_\tau \le (\epsilon/8)\, r_k$

Core-set construction: analysis
◮ Focus on remote-clique (similar for other div's)
◮ Let $\rho = \mathrm{div}(\mathrm{OPT}) / \binom{k}{2}$
◮ Observe that: $\rho \ge \rho_k \ge r_k$
Theorem: For $\epsilon < 1/2$ and $\tau = (16/\epsilon)^D k$, $T$ is a $(1 + \epsilon)$-core-set for $S$ of size $O(k^2 (16/\epsilon)^D)$

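Core-set construction: analysis
Proof.
◮ There exists an injective $p(\cdot)$ such that for each $o \in \mathrm{OPT}$, $p(o) \in T$ and
$d(o, p(o)) \le 2 r_\tau \le (\epsilon/4)\, r_k \le (\epsilon/4)\, \rho_k \le (\epsilon/4)\, \rho$
◮ Hence:
$\mathrm{div}_k(T) \ge \sum_{o_1, o_2 \in \mathrm{OPT}} d(p(o_1), p(o_2))$ (injectivity!)
$\ge \sum_{o_1, o_2 \in \mathrm{OPT}} [\, d(o_1, o_2) - d(o_1, p(o_1)) - d(o_2, p(o_2)) \,]$
$\ge \sum_{o_1, o_2 \in \mathrm{OPT}} d(o_1, o_2) - \binom{k}{2}\, 2 (\epsilon/4)\, \rho$
$= (1 - \epsilon/2)\, \mathrm{div}_k(S)$ (since $\binom{k}{2} \rho = \mathrm{div}(\mathrm{OPT}) = \mathrm{div}_k(S)$)
$\ge \mathrm{div}_k(S) / (1 + \epsilon)$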

Core-set construction: other issues
Good clustering?
◮ Optimal $k$-center clustering is NP-hard.
◮ An $O(1)$-approximation (e.g., [Gonzalez85]) suffices.
Composability?
◮ Let $S = S_1 \cup S_2 \cup \cdots \cup S_\ell$
◮ Extract a core-set $T_i \subset S_i$ as before
◮ $T = \bigcup_i T_i$ is a $(1 + \epsilon)$-core-set of size $O(\ell k^2 (1/\epsilon)^D)$.

MapReduce implementation

MapReduce implementation
◮ 2 rounds with $O(k \sqrt{n (c/\epsilon)^D})$ space.
◮ Obliviousness to $D$: use the Farthest-First Traversal algorithm for $\tau$-center, stopping at a $\tau > k$ that ensures a sufficiently small radius.
◮ Approximation guarantee: $\alpha_{seq} + \epsilon$.
◮ Random partition yields $O(\sqrt{k n \log n\, (c/\epsilon)^D})$ space w.h.p.
◮ Further decrease in space with multi-round recursion.
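The two-round scheme can be simulated sequentially for the remote-edge measure. The sketch below makes several assumptions: a fixed `tau` supplied by the caller rather than the radius-based stopping rule, a round-robin partition into `ell` parts, centers-only core-sets (remote-edge does not need delegates), and farthest-first traversal reused as the sequential algorithm in the second round. All names are hypothetical.

```python
import math
import random


def farthest_first(points, m, dist):
    """Return up to m points chosen by farthest-first traversal."""
    chosen = [points[0]]
    d = [dist(points[0], p) for p in points]
    while len(chosen) < min(m, len(points)):
        i = max(range(len(points)), key=d.__getitem__)
        if d[i] == 0:                              # no distinct point left
            break
        chosen.append(points[i])
        d = [min(d[j], dist(points[i], q)) for j, q in enumerate(points)]
    return chosen


def two_round_remote_edge(S, k, tau, ell, dist):
    """Simulate: round 1 builds one core-set per partition, round 2 solves on the union."""
    parts = [S[i::ell] for i in range(ell)]        # arbitrary (round-robin) partition
    # Round 1: each "reducer" sees one partition and emits tau cluster centers.
    core_sets = [farthest_first(part, tau, dist) for part in parts]
    # Round 2: a single "reducer" gathers the union and runs the sequential algorithm.
    T = [p for core in core_sets for p in core]
    return farthest_first(T, k, dist)


random.seed(0)
S = [(random.random(), random.random()) for _ in range(200)]
solution = two_round_remote_edge(S, k=4, tau=16, ell=5, dist=math.dist)
print(solution)
```

With `ell` partitions, the first round holds about |S| / ell points per worker and the second holds ell · tau points, which is the trade-off behind the square-root space bounds.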

Streaming implementation
Implementation
◮ Compute a $(1 + \epsilon)$-core-set using a variant of the $(2 + \delta)$-approximate $\tau$-center algorithm of [McCutchen et al.'08], with $\tau = O((c/\epsilon)^D)$ (knowledge of $D$ required!).
◮ Run the sequential approximation algorithm on the core-set.
Performance
◮ 1 pass, $O(k^2 (c/\epsilon)^D)$ space (no dependence on $|S|$!)
◮ Approximation guarantee: $\alpha_{seq} + \epsilon$.
◮ Obliviousness to $D$ can be obtained with an extra pass.
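As a rough illustration of how a one-pass clustering can stay within a fixed budget of τ centers, here is a simplified threshold-doubling sketch. It is not the (2 + δ)-approximation algorithm cited above. It is a simpler scheme in the same spirit, and its guarantee (every seen point lies within 4r of a kept center) is specific to this sketch. The class name and parameters are hypothetical.

```python
import math
import random
from itertools import combinations


class StreamingCenters:
    """Keep at most tau centers; every seen point stays within 4 * r of some center."""

    def __init__(self, tau, dist):
        self.tau, self.dist = tau, dist
        self.r = 0.0                               # current threshold
        self.centers = []

    def add(self, p):
        # A point within 2r of an existing center is already represented.
        if any(self.dist(p, c) <= 2 * self.r for c in self.centers):
            return
        self.centers.append(p)
        while len(self.centers) > self.tau:        # over budget: coarsen the clustering
            self._coarsen()

    def _coarsen(self):
        if self.r == 0.0:
            # First overflow: start from half the closest pair of centers.
            self.r = min(self.dist(a, b)
                         for a, b in combinations(self.centers, 2)) / 2
        else:
            self.r *= 2
        kept = []
        for c in self.centers:                     # keep centers more than 2r apart
            if all(self.dist(c, q) > 2 * self.r for q in kept):
                kept.append(c)
        self.centers = kept


random.seed(0)
stream = [(random.random(), random.random()) for _ in range(500)]
sc = StreamingCenters(tau=10, dist=math.dist)
for point in stream:
    sc.add(point)
coverage = max(min(math.dist(p, c) for c in sc.centers) for p in stream)
print(len(sc.centers), coverage <= 4 * sc.r)
```

The memory used is proportional to `tau` and independent of the stream length, which is the property the slide highlights for the streaming core-set.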
