On the Approximability of Information Theoretic Clustering

  1. On the Approximability of Information Theoretic Clustering
     Ferdinando Cicalese, U. Verona
     Eduardo Laber, PUC-RIO
     Lucas Murtinho, PUC-RIO
     POSTER 165, Pacific Ballroom
     [Figure: plot of H(X)]

  2. Impurity Measures
     • An impurity measure maps a vector $v \in \mathbb{R}^d$ to a non-negative value
     • The more homogeneous v is with respect to its components, the larger the impurity
       – (1,0,0,19): small impurity
       – (5,5,5,5): large impurity
     • Well-known impurity measures
       – Entropy: $I_{Ent}(v) = \|v\|_1 \sum_{i=1}^{d} \frac{v_i}{\|v\|_1} \log \frac{\|v\|_1}{v_i}$
       – Gini: $I_{Gini}(v) = \|v\|_1 \sum_{i=1}^{d} \frac{v_i}{\|v\|_1} \left(1 - \frac{v_i}{\|v\|_1}\right)$
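The two measures on this slide can be written down directly. The following is a minimal Python sketch, assuming base-2 logarithms and the convention that zero components contribute nothing to the entropy sum (the slide states neither); the function names are hypothetical.

```python
import math

def entropy_impurity(v):
    """||v||_1 * sum_i (v_i/||v||_1) * log2(||v||_1 / v_i).

    Assumption: log base 2, and components equal to 0 are skipped.
    """
    total = sum(v)
    if total == 0:
        return 0.0
    # The leading ||v||_1 cancels the denominator of v_i / ||v||_1.
    return sum(x * math.log2(total / x) for x in v if x > 0)

def gini_impurity(v):
    """||v||_1 * sum_i (v_i/||v||_1) * (1 - v_i/||v||_1)."""
    total = sum(v)
    if total == 0:
        return 0.0
    return sum(x * (1 - x / total) for x in v)

skewed = (1, 0, 0, 19)   # nearly all mass on one component
uniform = (5, 5, 5, 5)   # mass spread evenly

print(entropy_impurity(skewed), entropy_impurity(uniform))
print(gini_impurity(skewed), gini_impurity(uniform))
```

Both vectors have the same total mass (20), so the comparison isolates how evenly that mass is spread: the skewed vector scores lower under both measures, matching the slide's two examples.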

  3. Clustering with minimum impurity
     Input
     • V: set of non-negative vectors in $\mathbb{R}^d$
     • I: impurity measure
     • k: number of clusters
     Goal
     • Partition V into k groups, $P = (V^{(1)}, \dots, V^{(k)})$, so that the impurity of the partition, $I(P) = \sum_{i=1}^{k} I(V^{(i)})$, is the minimum possible
     • $I(V^{(i)})$: impurity of the sum of the vectors in $V^{(i)}$
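The objective on this slide can be evaluated for any candidate partition. The sketch below is illustrative only: it uses the Gini impurity as the measure, and the function names and the toy input are assumptions, not taken from the slides.

```python
def gini_impurity(v):
    """||v||_1 * sum_i (v_i/||v||_1) * (1 - v_i/||v||_1)."""
    total = sum(v)
    if total == 0:
        return 0.0
    return sum(x * (1 - x / total) for x in v)

def partition_impurity(clusters, impurity):
    """I(P): sum over clusters of the impurity of the component-wise sum of its vectors."""
    result = 0.0
    for cluster in clusters:
        summed = [sum(column) for column in zip(*cluster)]
        result += impurity(summed)
    return result

# Toy input (hypothetical): four non-negative vectors in R^2, k = 2.
V = [(4, 0), (3, 1), (0, 5), (1, 2)]
grouped_by_shape = [[V[0], V[1]], [V[2], V[3]]]  # similar vectors together
mixed = [[V[0], V[2]], [V[1], V[3]]]             # dissimilar vectors together

print(partition_impurity(grouped_by_shape, gini_impurity))  # 3.5
print(partition_impurity(mixed, gini_impurity))
```

Grouping vectors whose mass sits on the same component gives the lower total, which is the behavior the minimization is after.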

  4. Applications / Motivations
     • Generalizes clustering using KL-divergence
       – Entropy impurity and KL-divergence of a clustering differ by a constant factor
     • Clustering probability distributions
     • Clustering nominal attributes in decision tree / random forest construction
     • Channel quantizer design [Inf. Theory]

  5. Our Contributions: Approximation Algorithms
     • 3-approximation for Gini in linear time (arbitrary k)
     • $O(\log^2(\min\{d,k\}))$-approximation for Entropy in polynomial time
       – First algorithm with an approximation guarantee independent of n that does not make any assumption on the input domain

  6. Our Contributions: Approximation Algorithms
     • 3-approximation for Gini in linear time (arbitrary k)
       – Projecting the vectors onto dimension k incurs a small additive loss
     • $O(\log^2(\min\{d,k\}))$-approximation for Entropy in polynomial time
       – First algorithm with an approximation guarantee independent of n that does not make any assumption on the input domain

  7. Our Contributions: Approximation Algorithms
     • 3-approximation for Gini in linear time (arbitrary k)
       – Projecting the vectors onto dimension k incurs a small additive loss
       – Each cluster is pure: all of its vectors have the same largest component
     • $O(\log^2(\min\{d,k\}))$-approximation for Entropy in polynomial time
       – First algorithm with an approximation guarantee independent of n that does not make any assumption on the input domain

  8. Our Contributions: Approximation Algorithms
     • 3-approximation for Gini in linear time (arbitrary k)
       – Projecting the vectors onto dimension k incurs a small additive loss
       – Each cluster is pure: all of its vectors have the same largest component
     • $O(\log^2(\min\{d,k\}))$-approximation for Entropy in polynomial time
       – First algorithm with an approximation guarantee independent of n that does not make any assumption on the input domain
       – There is a clustering with exactly one non-pure cluster and impurity $O(\log^2 d) \cdot \mathrm{OPT}$
       – Find this clustering in a 2-dimensional projection using dynamic programming
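To make the notion of a pure cluster concrete, here is a small Python sketch that checks purity and groups vectors by their largest component. It only illustrates the definition and is not the approximation algorithm from the slides: it produces up to d groups rather than a prescribed k, the helper names are hypothetical, and breaking ties toward the lowest index is an assumption.

```python
from collections import defaultdict

def dominant_component(v):
    """Index of the largest component of v (assumption: ties go to the lowest index)."""
    return max(range(len(v)), key=lambda i: v[i])

def is_pure(cluster):
    """A cluster is pure if all of its vectors share the same largest component."""
    return len({dominant_component(v) for v in cluster}) <= 1

def group_by_dominant_component(vectors):
    """Put each vector in the group indexed by its largest component."""
    groups = defaultdict(list)
    for v in vectors:
        groups[dominant_component(v)].append(v)
    return dict(groups)

vectors = [(5, 1, 0), (4, 0, 1), (0, 3, 1), (1, 1, 6)]
groups = group_by_dominant_component(vectors)
print(groups)
print(all(is_pure(cluster) for cluster in groups.values()))
```

Every group built this way is pure by construction; a cluster such as `[(5, 1, 0), (0, 3, 1)]` is not, because its vectors peak at different components.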

  9. Our Contributions: APX-Hardness for Entropy
     • Reduction from c-gap vertex cover in cubic graphs
     • Solves an open question from [Chaudhuri and McGregor, COLT08] and [Ackermann et al., ECCC11]

  10. Our Contributions: APX-Hardness for Entropy
      • Reduction from c-gap vertex cover in cubic graphs
        – Edges are mapped to vectors with two 1's, e.g. 0..010…010...00 and 0..000…010...01
        – Theorem. Let $k'(G,k) = (3 \log 3)\,|E| + 6(1 - \log 3)\,k$. Then
          MinVertexCover ≤ k ⇒ Opt-Impurity ≤ k'(G,k)
          MinVertexCover > ck ⇒ Opt-Impurity > c'k'(G,k)
      • Solves an open question from [Chaudhuri and McGregor, COLT08] and [Ackermann et al., ECCC11]

  11. Our Contributions: APX-Hardness for Entropy
      • Reduction from c-gap vertex cover in cubic graphs
        – Edges are mapped to vectors with two 1's, e.g. 0..010…010...00 and 0..000…010...01
        – Theorem. Let $k'(G,k) = (3 \log 3)\,|E| + 6(1 - \log 3)\,k$. Then
          MinVertexCover ≤ k ⇒ Opt-Impurity ≤ k'(G,k)
          MinVertexCover > ck ⇒ Opt-Impurity > c'k'(G,k)
        – Lemma. G cubic and MinVertexCover ≤ k ⇒ G decomposes into stars of sizes 2 and 3.
      • Solves an open question from [Chaudhuri and McGregor, COLT08] and [Ackermann et al., ECCC11]
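The expression for k'(G,k) can be sanity-checked numerically under one reading of the reduction. The sketch below assumes base-2 logarithms, that each edge becomes a 0/1 vector with 1's at its two endpoints, and that the k clusters are stars of 2 or 3 edges sharing a center; these are assumptions made for illustration, and all function names are hypothetical. With a stars of size 3 and b stars of size 2, the constraints 3a + 2b = |E| and a + b = k give a = |E| - 2k and b = 3k - |E|.

```python
import math

def entropy_impurity(v):
    """||v||_1 * sum_i (v_i/||v||_1) * log2(||v||_1 / v_i), skipping zeros."""
    total = sum(v)
    if total == 0:
        return 0.0
    return sum(x * math.log2(total / x) for x in v if x > 0)

def star_vector(num_edges):
    """Sum of the edge vectors of a star: the center lies on every edge, each leaf on one."""
    return [num_edges] + [1] * num_edges

def k_prime(num_edges_total, k):
    """k'(G,k) = (3 log 3)|E| + 6(1 - log 3)k, with log taken base 2 (assumption)."""
    return 3 * math.log2(3) * num_edges_total + 6 * (1 - math.log2(3)) * k

def stars_impurity(num_edges_total, k):
    """Total entropy impurity of k star clusters of sizes 2 and 3 covering all edges."""
    threes = num_edges_total - 2 * k   # number of 3-edge stars
    twos = 3 * k - num_edges_total     # number of 2-edge stars
    return (threes * entropy_impurity(star_vector(3))
            + twos * entropy_impurity(star_vector(2)))

# K4 is cubic with |E| = 6 and a vertex cover of size 3 (three 2-edge stars).
print(stars_impurity(6, 3), k_prime(6, 3))
# A cubic graph on 8 vertices has |E| = 12; take k = 5 as an example value.
print(stars_impurity(12, 5), k_prime(12, 5))
```

Under these assumptions a 2-edge star has impurity 6 and a 3-edge star has impurity 6 + 3 log 3, and the two printed values agree in each case, which is consistent with the formula in the theorem.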

  12. Our Contributions: Ratio-Greedy Algorithm
      • Built on top of the theoretical ideas
      • Promising preliminary experimental comparisons
        – Much faster than a k-means-based method
        – Achieves similar impurity

