MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension
Andrea Pietracaprina, Dip. Ingegneria dell’Informazione, Università di Padova
Joint work with: M. Ceccarello, G. Pucci (U. Padova), and E. Upfal (Brown U.)
VLDB’17
Outline
◮ Problem definition and applications
◮ Background
◮ Summary of results
◮ Our approach:
  ◮ Core-set construction
  ◮ Streaming implementation
  ◮ MapReduce implementation
◮ Experiments
◮ Diversity maximization under matroid constraints
◮ Summary and future work
Problem definition and applications
Diversity maximization
Objective: For a given dataset, determine the most diverse subset of a given (small) size k
Applications
Aggregator websites (e.g., Google News)
◮ Documents (possibly clustered into categories)
◮ Diversified set of representative docs (from various clusters)
E-commerce
◮ Consideration set: products returned by a shopping portal in reply to a user query
◮ Diversity of returned products (w.r.t. unspecified attributes) correlates with user satisfaction
Facility location
◮ Franchise location (noncompetition)
◮ Strategic facilities (dispersion against simultaneous attacks)
Diversity maximization: formal definition
Given:
1. Set $S$ of $n$ points in a metric space $\Delta$
2. Distance function $d : \Delta \times \Delta \to \mathbb{R}^+ \cup \{0\}$
3. Integer $k > 1$
4. (Distance-based) diversity function $\mathrm{div} : 2^{\Delta} \to \mathbb{R}^+ \cup \{0\}$
Return $S^* \subset S$, $|S^*| = k$, s.t. $S^* = \operatorname{argmax}_{S' \subseteq S,\ |S'| = k} \mathrm{div}(S')$
The $k$-diversity of $S$ is $\mathrm{div}_k(S) = \mathrm{div}(S^*)$
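To make the definition concrete, here is a minimal brute-force sketch. It assumes points in the Euclidean plane and uses the minimum pairwise distance as the diversity function; both choices, and the names `remote_edge` and `k_diversity_bruteforce`, are illustrative and not taken from the talk. The enumeration is exponential in k and is only meant to mirror the argmax in the definition.

```python
from itertools import combinations
from math import dist

def remote_edge(points):
    """Minimum pairwise distance among the points (one possible div)."""
    return min(dist(p, q) for p, q in combinations(points, 2))

def k_diversity_bruteforce(S, k, div=remote_edge):
    """Return (S*, div_k(S)) by scanning every k-subset of S.

    Exponential in k: only meant to make the definition concrete.
    """
    best = max(combinations(S, k), key=div)
    return list(best), div(best)

# Two tight groups of points: a good 3-subset must reuse one group,
# so the value is the largest distance available inside a group.
S = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
subset, value = k_diversity_bruteforce(S, k=3)
```

On this toy input the selected subset keeps the two points of the first group that are farthest apart and adds one point from the second group.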
Diversity functions studied in this work

Problem              div(S′)
Remote-edge          $\min_{p, q \in S',\ p \neq q} d(p, q)$
Remote-clique        $\sum_{p, q \in S'} d(p, q)$
Remote-star          $w(\mathrm{MSTAR}(S'))$
Remote-bipartition   $w(\mathrm{MBIP}(S'))$
Remote-tree          $w(\mathrm{MST}(S'))$
Remote-cycle         $w(\mathrm{TSP}(S'))$

◮ MSTAR(S′): min-weight star in G(S′) (≡ complete graph over S′)
◮ MBIP(S′): min-weight balanced bipartition of S′ in G(S′)
◮ MST(S′): min-weight spanning tree of G(S′)
◮ TSP(S′): min-weight tour of S′ in G(S′)
Except for remote-clique, all problems are max-min optimizations.
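The six measures can be sketched directly from the definitions above. This is an illustrative sketch under stated assumptions, not the authors' code: points are Euclidean tuples, the remote-clique sum counts each unordered pair once, "balanced" is read as one side having ⌊|S′|/2⌋ points with the weight being the total length of crossing edges, and MBIP and TSP are computed by exhaustive search, which is only feasible for very small S′.

```python
from itertools import combinations, permutations
from math import dist, inf

def remote_edge(P):
    return min(dist(p, q) for p, q in combinations(P, 2))

def remote_clique(P):
    # Assumption: each unordered pair {p, q} is counted once.
    return sum(dist(p, q) for p, q in combinations(P, 2))

def remote_star(P):
    # Cheapest star: try every center c (dist(c, c) = 0 adds nothing).
    return min(sum(dist(c, q) for q in P) for c in P)

def remote_bipartition(P):
    # Exhaustive search over sides Q with floor(|P| / 2) points.
    P, best = list(P), inf
    idx = range(len(P))
    for Q in combinations(idx, len(P) // 2):
        side = set(Q)
        w = sum(dist(P[i], P[j]) for i in Q for j in idx if j not in side)
        best = min(best, w)
    return best

def remote_tree(P):
    # Prim's algorithm on the complete graph over P.
    P = list(P)
    cost = {i: dist(P[0], P[i]) for i in range(1, len(P))}
    total = 0.0
    while cost:
        j = min(cost, key=cost.get)
        total += cost.pop(j)
        for i in cost:
            cost[i] = min(cost[i], dist(P[j], P[i]))
    return total

def remote_cycle(P):
    # Exhaustive TSP with the first point fixed as the start of the tour.
    P, best = list(P), inf
    for perm in permutations(P[1:]):
        tour = [P[0], *perm, P[0]]
        best = min(best, sum(dist(a, b) for a, b in zip(tour, tour[1:])))
    return best

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
values = {f.__name__: f(square) for f in
          (remote_edge, remote_clique, remote_star,
           remote_bipartition, remote_tree, remote_cycle)}
```

On the unit square, the cheapest balanced bipartition separates the two diagonals, so only the four sides cross the cut.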
Background
Previous work
Sequential approximation and hardness results

Problem              Seq. approx.   Lower bound (LB)
Remote-edge          2              $\ge 2$
Remote-clique        2              $\ge 2 - \epsilon$
Remote-star          2              –
Remote-bipartition   3              –
Remote-tree          4              $\ge 2$
Remote-cycle         3              $\ge 2$

Specialized results (hardness and better approximation ratios) for remote-clique and remote-edge under Euclidean distances
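The slide does not say which algorithms attain these ratios. As one illustration, the classic farthest-first traversal greedy is known to give a factor-2 approximation for remote-edge. The sketch below assumes Euclidean points; the function name `farthest_first` and the choice of `S[0]` as the starting point are arbitrary.

```python
from math import dist

def farthest_first(S, k):
    """Greedy selection: start from S[0], then repeatedly add the point
    whose distance to the already chosen points is largest."""
    chosen = [S[0]]
    # d_to_chosen[i] = distance from S[i] to its nearest chosen point
    d_to_chosen = [dist(p, S[0]) for p in S]
    while len(chosen) < k:
        i = max(range(len(S)), key=d_to_chosen.__getitem__)
        chosen.append(S[i])
        d_to_chosen = [min(d, dist(p, S[i])) for d, p in zip(d_to_chosen, S)]
    return chosen

S = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
picked = farthest_first(S, k=3)
```

Each step costs O(n) distance evaluations, so the whole selection takes O(nk) time, which is why this greedy is a common building block for both the sequential and the core-set algorithms.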
Previous work
$\beta$-core-set [Agarwal et al.’04]
◮ (Small) subset $T \subseteq S$ such that $\mathrm{div}_k(T) \ge (1/\beta)\, \mathrm{div}_k(S)$
◮ $T$ filters out redundancy
◮ Approximate solution can be computed on $T$.
Previous work
$\beta$-composable core-set [Indyk et al.’14]
◮ Input partition $S = S_1 \cup S_2 \cup \dots \cup S_\ell$
◮ $\beta$-composable core-sets $T_i \subset S_i$ ⇒ $\bigcup_i T_i$ is a $\beta$-core-set
◮ Application to MapReduce and Streaming frameworks
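The partition-then-merge pattern behind composable core-sets can be sketched as follows. This is a hedged illustration for remote-edge only: it assumes farthest-first traversal as the local selection rule (the slide does not specify one), splits the input by simple striding, and uses the hypothetical names `farthest_first` and `solve_via_composable_coresets`.

```python
from math import dist

def farthest_first(S, k):
    """Greedy selection of k points, each farthest from those already chosen."""
    chosen = [S[0]]
    d_to_chosen = [dist(p, S[0]) for p in S]
    while len(chosen) < k:
        i = max(range(len(S)), key=d_to_chosen.__getitem__)
        chosen.append(S[i])
        d_to_chosen = [min(d, dist(p, S[i])) for d, p in zip(d_to_chosen, S)]
    return chosen

def solve_via_composable_coresets(S, k, num_parts):
    # 1. Partition S into S_1, ..., S_l (striding is an arbitrary choice).
    parts = [S[i::num_parts] for i in range(num_parts)]
    parts = [part for part in parts if part]
    # 2. Build a small core-set T_i from each part independently.
    coresets = [farthest_first(part, min(k, len(part))) for part in parts]
    # 3. The union of the T_i stands in for S in the final computation.
    union = [p for T in coresets for p in T]
    return farthest_first(union, k)

S = [(x, 0) for x in range(20)]
solution = solve_via_composable_coresets(S, k=2, num_parts=4)
```

Step 2 touches each part in isolation, so the parts can be processed on different machines or in different passes; only the union in step 3 has to fit in one place.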
Previous work
Known $\beta$-composable core-sets for $k$-diversity maximization ([Indyk et al.’14, Aghamolaei et al.’15])

Problem              $\beta$          $\alpha_{seq}$   $\beta \cdot \alpha_{seq}$
Remote-edge          3                2                6
Remote-clique        $6 + \epsilon$   2                $12 + \epsilon$
Remote-star          12               2                24
Remote-bipartition   18               3                54
Remote-tree          4                4                16
Remote-cycle         3                3                9

◮ $\alpha_{seq}$ = best sequential approximation ratio
◮ General metric spaces
◮ Core-set size: $k$
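The last column is the overall guarantee obtained by running a sequential algorithm on the core-set. A short derivation, using only the $\beta$-core-set inequality $\mathrm{div}_k(T) \ge (1/\beta)\,\mathrm{div}_k(S)$; the symbol $A$ for an $\alpha_{seq}$-approximation algorithm is introduced here for illustration:

```latex
\begin{align*}
\mathrm{div}(A(T)) &\ge \frac{1}{\alpha_{seq}}\,\mathrm{div}_k(T)
  && \text{($A$ is an $\alpha_{seq}$-approximation on $T$)} \\
&\ge \frac{1}{\alpha_{seq}\,\beta}\,\mathrm{div}_k(S)
  && \text{($T$ is a $\beta$-core-set of $S$)}
\end{align*}
% Example (remote-edge): \beta = 3, \alpha_{seq} = 2, so the overall factor is 3 \cdot 2 = 6.
```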
Computational Frameworks for Massive Data Analysis
MapReduce
◮ Targets distributed cluster-based architectures
◮ Computation: sequence of rounds where data are partitioned into (small) subsets, processed in parallel
◮ Goals: few rounds, sublinear local space, linear overall space.
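The round structure and the local-space goal can be illustrated with a small single-process simulation. This is a toy sketch, not a real MapReduce API: `run_round`, `map_fn`, and `reduce_fn` are hypothetical names, and the size of the largest group handed to one reducer is used as a stand-in for local space.

```python
from collections import defaultdict

def run_round(pairs, map_fn, reduce_fn):
    """One simulated MapReduce round over a list of (key, value) pairs.

    Returns the output pairs and the size of the largest group that a
    single reducer received (a proxy for local space in this round).
    """
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: group by key
    max_local = max((len(vs) for vs in groups.values()), default=0)
    output = [pair for key, vs in groups.items() for pair in reduce_fn(key, vs)]
    return output, max_local

# Toy use: maximum of n = 100 numbers in two rounds, sqrt(n) values per reducer.
data = list(enumerate(range(100)))
r1, m1 = run_round(data,
                   lambda i, x: [(i % 10, x)],        # 10 subsets of 10 values
                   lambda key, vs: [(0, max(vs))])    # local maximum per subset
r2, m2 = run_round(r1,
                   lambda key, x: [(key, x)],         # all local maxima together
                   lambda key, vs: [(key, max(vs))])  # global maximum
```

Both rounds keep at most 10 values in any single group, while the data across all groups in a round never exceeds the input size, which is the "sublinear local space, linear overall space" pattern in miniature.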