Agglomerative 2-3 Hierarchical Clustering: theoretical improvements and tests
Sergiu Chelcea 1, Patrice Bertrand 1,2, Brigitte Trousse 1
1. Action AxIS, INRIA Sophia-Antipolis, France
2. ENST Bretagne, France
LastName.FirstName@inria.fr
GfKl 2003, 13 March 2003
Outline
• The classical case of AHC
• 2-3 Hierarchies
  - Definitions
  - Properties
• Algorithm of 2-3AHC
• Analysis of complexity
• Application on simulated data
  - Experimental Validation of Complexity
• Ongoing and Future Work
Context
[Diagram relating the cluster structures:]
  - Hierarchies
  - 2-3 Hierarchies [Bertrand 2002]
  - Pyramids [Diday 1984-86, Fichet 1987]
  - Weak Hierarchies [Bandelt, Dress 1989; Diatta, Fichet 1994]
Hierarchies (1/3)
We recall some definitions related to the hierarchical case that will be extended to the 2-3 hierarchies:
• Hierarchy:
  - each cluster is nonempty
  - E and the singletons are clusters
  - each pair of clusters (A, B) is hierarchical: A ∩ B ∈ {∅, A, B}
  [Figure: clusters A, B, A1, A2]
  Remark:
  - admits at most n − 1 non-trivial clusters
• Indexed hierarchy:
  - each cluster is associated with a positive real number f, where
    ∀ A, B ∈ S, A ⊂ B ⇒ f(A) < f(B)
Agglomerative Hierarchical Classification (2/3)
Vocabulary:
  - set inclusion order on the set of clusters:
    - predecessor/successor
    - comparable clusters
    - candidate clusters (unmarked) = maximal clusters
  - data input: dissimilarity δ : E × E → [0, ∞), with
    δ(a, b) = δ(b, a) ≥ δ(a, a) = 0, ∀ a, b ∈ E
  - aggregation index (link between clusters), µ:
    - single linkage
    - complete linkage
    - average linkage
  - usually f(X ∪ Y) = µ(X, Y)
Algorithm AHC (3/3)
1. Initialisation: iter ← 0; the clusters are the singletons of set E; f ← 0.
2. iter ← iter + 1; Merge X and Y, which are (in the sense of µ) the two nearest clusters; compute f(X ∪ Y).
3. Reduction: Eliminate the successors found on the same level f with their predecessor, if there are any.
4. Update µ, predecessor links, successor links.
5. Stopping rule: Repeat steps 2-4 until the set E becomes a cluster.
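The AHC loop above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: `delta` is a symmetric matrix given as a list of lists, µ is single linkage, f(X ∪ Y) = µ(X, Y), and the reduction step is left out. The names `ahc` and `single_linkage` are hypothetical, and the pair search is a naive scan rather than the ordered matrix used in the complexity analysis.

```python
from itertools import combinations


def single_linkage(x, y, delta):
    """Aggregation index mu: smallest dissimilarity between a member of x and one of y."""
    return min(delta[a][b] for a in x for b in y)


def ahc(delta, mu=single_linkage):
    """Return the merges [(X, Y, f(X u Y)), ...] of a classical AHC run on points 0..n-1."""
    # Step 1: the clusters are the singletons of E.
    candidates = [frozenset([i]) for i in range(len(delta))]
    merges = []
    # Step 5: repeat until E itself is a cluster.
    while len(candidates) > 1:
        # Step 2: pick the two nearest candidate clusters in the sense of mu.
        x, y = min(combinations(candidates, 2), key=lambda p: mu(p[0], p[1], delta))
        level = mu(x, y, delta)  # assumed choice f(X u Y) = mu(X, Y)
        merges.append((x, y, level))
        # Step 4: X and Y stop being candidates; their union becomes one.
        candidates = [c for c in candidates if c not in (x, y)] + [x | y]
    return merges
```

On three points with δ(0,1) = 1, δ(1,2) = 2 and δ(0,2) = 4, the sketch first merges {0} and {1} at level 1, then joins {2} at level 2.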
2-3 Hierarchies: Definitions
• Proper intersection:
  - A properly intersects B if A ∩ B ∉ {∅, A, B}
• Concept:
  - in a 2-3 hierarchy, for any three clusters at least two pairs of them are hierarchical
• 2-3 Hierarchy [Bertrand 2002]:
  - each cluster is nonempty
  - E and singletons are clusters
  - the proper intersection of two clusters is also a cluster
  - each cluster properly intersects no more than one other cluster
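The four conditions of the definition can be checked mechanically on a small family of sets. The sketch below is an illustration only, assuming clusters are given as Python sets; the function names `properly_intersects` and `is_two_three_hierarchy` are hypothetical.

```python
def properly_intersects(a, b):
    """True when A n B is neither empty, nor A, nor B."""
    common = a & b
    return bool(common) and common != a and common != b


def is_two_three_hierarchy(clusters, ground):
    """Check the four conditions of the definition for subsets of `ground`."""
    family = {frozenset(c) for c in clusters}
    ground = frozenset(ground)
    if frozenset() in family:
        return False  # each cluster must be nonempty
    if ground not in family or any(frozenset([e]) not in family for e in ground):
        return False  # E and the singletons must be clusters
    for a in family:
        partners = [b for b in family if properly_intersects(a, b)]
        if len(partners) > 1:
            return False  # a cluster properly intersects at most one other cluster
        if any(a & b not in family for b in partners):
            return False  # a proper intersection must itself be a cluster
    return True
```

For E = {1, 2, 3}, the singletons together with {1, 2}, {2, 3} and E pass the check, since {1, 2} and {2, 3} properly intersect only each other and share the cluster {2}. Adding {1, 3} makes every pair cluster properly intersect two others, so the check fails.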
2-3 Hierarchies: Properties [Bertrand 2002]
• The number of elements of a 2-3 hierarchy that are not reduced to singletons is at most (3/2)(n − 1).
• Each 2-3 hierarchical set system on E is a collection of intervals of some linear order defined on E.
[Figure: a 2-3 hierarchy shown next to a pyramid]
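To put the (3/2)(n − 1) bound in perspective, the short sketch below tabulates it next to the n − 1 bound recalled for hierarchies. Two things are assumptions added here for comparison and are not stated on the slides: the bound is rounded down because a cluster count is an integer, and the pyramid figure n(n − 1)/2 is the number of intervals with at least two elements in a linear order on n objects. The function name is hypothetical.

```python
def max_nonsingleton_clusters(n):
    """Upper bounds on the number of non-singleton clusters for n objects."""
    return {
        "hierarchy": n - 1,                   # bound recalled for hierarchies
        "2-3 hierarchy": (3 * (n - 1)) // 2,  # (3/2)(n - 1), rounded down (assumption)
        "pyramid": n * (n - 1) // 2,          # intervals of size >= 2 (added for comparison)
    }
```

For n = 5 this gives 4, 6 and 10 respectively, which suggests that 2-3 hierarchies allow some overlap while staying linear in n.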
Algorithm of 2-3AHC
1. Initialisation: iter ← 0; the clusters are the singletons of set E; f ← 0.
2. iter ← iter + 1; Merge X and Y, which are (in the sense of µ) the two nearest non-comparable clusters, such that at least one of them is maximal; compute f(X ∪ Y).
3. Merge X ∪ Y and the other predecessor of X or Y, if it exists; compute f(X ∪ Y).
4. Reduction: Eliminate the successors found on the same level f with their predecessor, if there are any.
5. Update µ, predecessor links, successor links.
6. Stopping rule: Repeat steps 2-5 until the set E becomes a cluster.
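The selection rule of step 2 differs from classical AHC in which pairs are eligible. The sketch below encodes only the two conditions stated in that step (non-comparable clusters, at least one of them maximal). The intermediate merge, the reduction and any further bookkeeping of the actual algorithm are not modelled. The names `is_maximal` and `select_pair` are hypothetical, and `mu` is assumed to be a function of two clusters.

```python
from itertools import combinations


def is_maximal(c, clusters):
    """A cluster is maximal when no other cluster strictly contains it."""
    return not any(c < other for other in clusters)


def select_pair(clusters, mu):
    """Return the nearest non-comparable pair with at least one maximal member, or None."""
    eligible = [
        (x, y)
        for x, y in combinations(clusters, 2)
        if not (x <= y or y <= x)  # non-comparable for set inclusion
        and (is_maximal(x, clusters) or is_maximal(y, clusters))
    ]
    return min(eligible, key=lambda p: mu(p[0], p[1]), default=None)
```

Once {0} and {1} have been merged into {0, 1}, the pair ({0}, {1}) is no longer eligible because neither cluster is maximal, but {1} may still be paired with a maximal cluster such as {2}. This is how a cluster can end up merged with two different clusters.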
Algorithm of 2-3AHC
• Generalizes the AHC:
  - a cluster can be merged with two different clusters
• Double single linkage [Jullien, Bertrand 2002]:
  f(X ∪ Y) = Min{ µ(X, Y), µ(X ∪ Y, Z) : Z candidate cluster }
• Complexity: O(n² log n)
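The double single linkage formula translates directly into code. In this sketch, `delta` is a symmetric matrix as a list of lists and µ is single linkage. It is assumed here, not stated on the slide, that Z ranges over candidate clusters disjoint from X ∪ Y, since an overlapping Z would give a single-linkage value of 0. The function names are hypothetical.

```python
def single_linkage(x, y, delta):
    """Smallest dissimilarity between a member of x and a member of y."""
    return min(delta[a][b] for a in x for b in y)


def double_single_linkage_level(x, y, candidates, delta, mu=single_linkage):
    """f(X u Y) = min{ mu(X, Y), mu(X u Y, Z) : Z candidate cluster }."""
    union = x | y
    values = [mu(x, y, delta)]
    for z in candidates:
        if z & union:
            continue  # assumption: only candidates disjoint from X u Y are compared
        values.append(mu(union, z, delta))
    return min(values)
```

With δ(0,1) = 5 and δ(0,2) = 1, merging X = {0} and Y = {1} while {2} is a candidate yields the level 1 rather than 5, because the union is already that close to {2}.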
Analysis of Complexity (1/3)
We use an ordered dissimilarity matrix on three levels:
  - dissimilarity values
  - cardinality of the two clusters
  - lexicographical order
Step 1. Initialisation: Compute and order the dissimilarity matrix, O(n² log n)
Step 2. Merge X and Y …: Retrieve (X, Y) from the data structure, and create X ∪ Y, O(1)
Step 3. Merge X ∪ Y and …: Intermediate merging with O(n) complexity
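The three-level ordering can be expressed as a composite sort key. The sketch below is one possible reading and makes two assumptions that the slide does not spell out: the cardinality level uses the total size of the two clusters with smaller pairs first, and the lexicographical level compares the sorted member lists. The names `pair_key` and `order_pairs` are hypothetical, and `mu` is assumed to be a function of two clusters.

```python
def pair_key(pair, mu):
    """Three-level key: dissimilarity, then cardinality, then lexicographical order."""
    x, y = pair
    return (
        mu(x, y),                # level 1: dissimilarity value
        len(x) + len(y),         # level 2: cardinality of the two clusters (assumed: sum)
        (sorted(x), sorted(y)),  # level 3: lexicographical order of the members
    )


def order_pairs(pairs, mu):
    """Sort cluster pairs so that the first entry is the next pair to merge."""
    return sorted(pairs, key=lambda p: pair_key(p, mu))
```

Sorting the n(n − 1)/2 initial pairs this way costs O(n² log n) comparisons, which matches the cost given for step 1, and the nearest pair is then simply the first entry.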
Analysis of Complexity (2/3)
Step 4. Reduction: We have five possible cases of reduction when merging a cluster:
[Figure: the reduction cases, with levels labelled α, β1, β2 and clusters X', Y', Z]
  - eliminate the successors found on the same level with their predecessor
  - complexity O(n)
Analysis of Complexity (3/3)
Step 5. Update µ:
  - compute new dissimilarities and store them in the matrix, O(n log n)
  - eliminate dissimilarities containing non-candidate clusters, O(n log n)
Total complexity of the algorithm:
  O(n² log n) + n × O(n log n) → O(n² log n)
  (first term: step 1; second term: steps 2-5)
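The total can be read as a per-iteration cost multiplied by the number of iterations. The tally below restates the step costs quoted in the analysis. The factor n is taken from the slide, and it is presumably justified by the fact that the number of non-singleton clusters created is linear in n. The symbols T_iter and T_total are introduced here for illustration only.

```latex
\begin{align*}
T_{\text{iter}} &= \underbrace{O(1)}_{\text{step 2}}
                 + \underbrace{O(n)}_{\text{step 3}}
                 + \underbrace{O(n)}_{\text{step 4}}
                 + \underbrace{O(n \log n)}_{\text{step 5}}
                 = O(n \log n) \\
T_{\text{total}} &= \underbrace{O(n^2 \log n)}_{\text{step 1}}
                  + n \cdot T_{\text{iter}}
                  = O(n^2 \log n)
\end{align*}
```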