MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension


  1. MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension
     Andrea Pietracaprina, Dip. Ingegneria dell'Informazione, Università di Padova
     Joint work with: M. Ceccarello, G. Pucci (U. Padova), and E. Upfal (Brown U.)
     VLDB'17

  2. Outline
     ◮ Problem definition and applications
     ◮ Background
     ◮ Summary of results
     ◮ Our approach:
        ◮ Core-set construction
        ◮ Streaming implementation
        ◮ MapReduce implementation
     ◮ Experiments
     ◮ Diversity maximization under matroid constraints
     ◮ Summary and future work

  3. Problem definition and applications

  5. Diversity maximization
     Objective: for a given dataset, determine the most diverse subset of a given (small) size k

  9. Applications
     Aggregator websites (e.g., Google News)
     ◮ Documents (possibly clustered into categories)
     ◮ Diversified set of representative docs (from various clusters)
     E-commerce
     ◮ Consideration set: products returned by a shopping portal in reply to a user query
     ◮ Diversity of returned products (w.r.t. unspecified attributes) correlates with user satisfaction
     Facility location
     ◮ Franchise location (noncompetition)
     ◮ Strategic facilities (dispersion against simultaneous attacks)

  13. Diversity maximization: formal definition
      Given:
      1. Set $S$ of $n$ points in a metric space $\Delta$
      2. Distance function $d : \Delta \times \Delta \to \mathbb{R}^+ \cup \{0\}$
      3. Integer $k > 1$
      4. (Distance-based) diversity function $\mathrm{div} : 2^{\Delta} \to \mathbb{R}^+ \cup \{0\}$
      Return $S^* \subset S$, $|S^*| = k$, such that $S^* = \operatorname{argmax}_{S' \subseteq S,\, |S'| = k} \mathrm{div}(S')$
      The $k$-diversity of $S$ is $\mathrm{div}_k(S) = \mathrm{div}(S^*)$
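The definition can be made concrete with a small exhaustive search. This is only a sketch for tiny inputs: it assumes Euclidean points stored as tuples and uses the minimum pairwise distance as the diversity function; the names `remote_edge` and `max_diversity_subset` are hypothetical and do not come from the slides.

```python
from itertools import combinations
import math

def remote_edge(points):
    """Diversity of a set: minimum distance over distinct pairs (illustrative choice)."""
    return min(math.dist(p, q) for p, q in combinations(points, 2))

def max_diversity_subset(S, k, div=remote_edge):
    """Try every size-k subset of S and return one maximizing div, with its value.

    Runs in time exponential in k, so it is only usable on toy inputs.
    """
    best = max(combinations(S, k), key=div)
    return list(best), div(best)

# Five points on a line; the value returned is the k-diversity div_k(S) for k = 3.
S = [(0, 0), (1, 0), (4, 0), (9, 0), (10, 0)]
subset, value = max_diversity_subset(S, 3)
```

The exponential cost of this search is what motivates approximation algorithms and small core-sets.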

  19. Diversity functions studied in this work

      | Problem            | $\mathrm{div}(S')$            |
      |--------------------|-------------------------------|
      | Remote-edge        | $\min_{p,q \in S'} d(p,q)$    |
      | Remote-clique      | $\sum_{p,q \in S'} d(p,q)$    |
      | Remote-star        | $w(\mathrm{MSTAR}(S'))$       |
      | Remote-bipartition | $w(\mathrm{MBIP}(S'))$        |
      | Remote-tree        | $w(\mathrm{MST}(S'))$         |
      | Remote-cycle       | $w(\mathrm{TSP}(S'))$         |

      ◮ MSTAR($S'$): min-weight star in $G(S')$ ($\equiv$ complete graph over $S'$)
      ◮ MBIP($S'$): min-weight balanced bipartition of $S'$ in $G(S')$
      ◮ MST($S'$): min-weight spanning tree of $G(S')$
      ◮ TSP($S'$): min-weight tour of $S'$ in $G(S')$
      Except for remote-clique, all problems are max-min optimizations.
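Four of these measures are easy to evaluate directly on a small set. The sketch below assumes Euclidean points, takes the minimum over distinct pairs, and sums remote-clique over unordered pairs; those conventions and all function names are assumptions made for illustration. Remote-bipartition is omitted for brevity, and remote-cycle is omitted because it requires solving a TSP instance.

```python
from itertools import combinations
import math

def remote_edge(points):
    # Minimum distance over distinct pairs.
    return min(math.dist(p, q) for p, q in combinations(points, 2))

def remote_clique(points):
    # Sum of distances over unordered pairs (assumed convention).
    return sum(math.dist(p, q) for p, q in combinations(points, 2))

def remote_star(points):
    # Weight of the cheapest star: best center c, edges from c to all others.
    return min(sum(math.dist(c, q) for q in points) for c in points)

def remote_tree(points):
    # Weight of a minimum spanning tree of the complete graph (Prim's algorithm).
    points = list(points)
    best = {i: math.dist(points[0], points[i]) for i in range(1, len(points))}
    total = 0.0
    while best:
        i = min(best, key=best.get)      # closest vertex not yet in the tree
        total += best.pop(i)
        for j in best:                   # relax distances through the new vertex
            best[j] = min(best[j], math.dist(points[i], points[j]))
    return total

# A 3-4-5 right triangle: pairwise distances are 3, 4 and 5.
triangle = [(0, 0), (3, 0), (3, 4)]
values = {f.__name__: f(triangle)
          for f in (remote_edge, remote_clique, remote_star, remote_tree)}
```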

  20. Background

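  21. Previous work
      Sequential approximation and hardness results

      | Problem            | Sequential approx. ratio | Lower bound          |
      |--------------------|--------------------------|----------------------|
      | Remote-edge        | 2                        | $\ge 2$              |
      | Remote-clique      | 2                        | $\ge 2 - \epsilon$   |
      | Remote-star        | 2                        | –                    |
      | Remote-bipartition | 3                        | –                    |
      | Remote-tree        | 4                        | $\ge 2$              |
      | Remote-cycle       | 3                        | $\ge 2$              |

      Specialized results (hardness and better approximation ratios) exist for remote-clique and remote-edge under Euclidean distances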

  26. Previous work
      $\beta$-core-set [Agarwal et al.'04]
      ◮ (Small) subset $T \subseteq S$ such that $\mathrm{div}_k(T) \ge (1/\beta)\,\mathrm{div}_k(S)$
      ◮ $T$ filters out redundancy
      ◮ An approximate solution can be computed on $T$.
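The core-set inequality can be checked numerically on a toy instance. The sketch below assumes the remote-edge measure and computes both k-diversities by exhaustive search, so it only illustrates the definition and is not a way to build core-sets; `div_k` and `smallest_beta` are hypothetical names.

```python
from itertools import combinations
import math

def div_k(points, k):
    """k-diversity under remote-edge, by exhaustive search (tiny inputs only)."""
    return max(
        min(math.dist(p, q) for p, q in combinations(sub, 2))
        for sub in combinations(points, k)
    )

def smallest_beta(S, T, k):
    """Smallest beta for which div_k(T) >= (1/beta) * div_k(S) holds."""
    return div_k(S, k) / div_k(T, k)

# One-dimensional points written as 1-tuples; T is a hand-picked subset of S.
S = [(0,), (1,), (2,), (10,), (11,), (20,)]
T = [(1,), (11,), (20,)]
beta = smallest_beta(S, T, 3)
```

Here the best triple in `S` has minimum gap 10 while `T` achieves 9, so `T` is a β-core-set for any β of at least 10/9.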

  31. Previous work
      $\beta$-composable core-set [Indyk et al.'14]
      ◮ Input partition $S = S_1 \cup S_2 \cup \cdots \cup S_\ell$
      ◮ $\beta$-composable core-sets $T_i \subset S_i$ $\Rightarrow$ $\bigcup_i T_i$ is a $\beta$-core-set
      ◮ Application to MapReduce and Streaming frameworks
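The composition pattern can be sketched as follows: split the input, extract a small subset from each part independently, and solve on the union. The local extraction below uses a greedy farthest-first traversal, which is an illustrative choice and is not stated on the slide; the round-robin partition and all function names are likewise assumptions.

```python
from itertools import combinations
import math

def farthest_first(points, k):
    """Greedy traversal: repeatedly add the point farthest from those already chosen."""
    chosen = [points[0]]
    while len(chosen) < min(k, len(points)):
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

def composed_core_set(S, num_parts, k):
    """Partition S round-robin, extract k points per part, and return the union."""
    parts = [S[i::num_parts] for i in range(num_parts)]
    local = [farthest_first(part, k) for part in parts if part]  # independent per part
    return [p for T_i in local for p in T_i]

def remote_edge(points):
    return min(math.dist(p, q) for p, q in combinations(points, 2))

# Twelve evenly spaced points, three parts, k = 3.
S = [(float(x), 0.0) for x in range(12)]
T = composed_core_set(S, num_parts=3, k=3)
solution = farthest_first(T, 3)      # final step runs only on the small union
value = remote_edge(solution)
```

Each part can be processed by a separate worker, since no part needs to see the others until the union is formed.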

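  35. Previous work
      Known $\beta$-composable core-sets for $k$-diversity maximization ([Indyk et al.'14, Aghamolaei et al.'15])

      | Problem            | $\beta$        | $\alpha_{seq}$ | $\beta \cdot \alpha_{seq}$ |
      |--------------------|----------------|----------------|----------------------------|
      | Remote-edge        | 3              | 2              | 6                          |
      | Remote-clique      | $6 + \epsilon$ | 2              | $12 + \epsilon$            |
      | Remote-star        | 12             | 2              | 24                         |
      | Remote-bipartition | 18             | 3              | 54                         |
      | Remote-tree        | 4              | 4              | 16                         |
      | Remote-cycle       | 3              | 3              | 9                          |

      ◮ $\alpha_{seq}$ = best sequential approximation ratio
      ◮ General metric spaces
      ◮ Core-set size: $k$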
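The last column follows from chaining the two guarantees. As a short derivation, assume $A$ is a sequential $\alpha_{seq}$-approximation algorithm (the symbol $A$ is introduced here for illustration) run on a $\beta$-core-set $T$ of $S$:

```latex
\mathrm{div}(A(T))
  \;\ge\; \frac{1}{\alpha_{seq}}\,\mathrm{div}_k(T)
  \;\ge\; \frac{1}{\alpha_{seq}\,\beta}\,\mathrm{div}_k(S)
```

The first inequality is the approximation guarantee of $A$ on input $T$, and the second is the core-set property. For remote-edge, for instance, $\beta = 3$ and $\alpha_{seq} = 2$ give an overall ratio of $3 \cdot 2 = 6$.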

  37. Computational Frameworks for Massive Data Analysis
      MapReduce
      ◮ Targets distributed cluster-based architectures
      ◮ Computation: sequence of rounds in which data are partitioned into (small) subsets that are processed in parallel
      ◮ Goals: few rounds, sublinear local space, linear overall space.
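A single round of this model can be imitated on one machine to see what "local space" refers to. This is a toy simulation under stated assumptions, not a real MapReduce API: `run_round`, `map_fn` and `reduce_fn` are hypothetical names, and local space is measured simply as the largest number of values sharing one key.

```python
from collections import defaultdict

def run_round(pairs, map_fn, reduce_fn):
    """Simulate one round: map every pair, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Largest group = space needed by the busiest reducer in this round.
    max_local = max((len(vs) for vs in groups.values()), default=0)
    output = [pair for k, vs in groups.items() for pair in reduce_fn(k, vs)]
    return output, max_local

# n = 16 items split into 4 parts, so each reducer holds 4 items.
n, parts = 16, 4
data = [(i, float(i)) for i in range(n)]
out, max_local = run_round(
    data,
    map_fn=lambda i, x: [(i % parts, x)],           # assign each item to a part
    reduce_fn=lambda part, xs: [(part, max(xs))],   # per-part summary
)
```

With 16 items and 4 parts, each reducer holds 4 items, which is the kind of sublinear local space the model aims for.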
