Cluster Subspace Identification Via Conditional Entropy Calculations
James Diggans, George Mason University, jdiggans@gmu.edu
Jeffrey L. Solka, George Mason University, jsolka@gmu.edu
Outline
• Subspace identification: why?
• Conditional entropy and clusters in R^2.
• Ordering dimensions for easy subspace visualization and identification.
• Maximal cliques lead to automatic subspace identification.
Subspace identification
• Initial, high-level exploration of complex data can inform downstream analyses.
• Explore samples (observations) or genes (dimensions), depending on intent.
• Cluster structure in patients may be revealed only on a subset of genes, and vice versa (Getz et al.).
• Uninformed feature selection can discard informative features.
Conditional entropy and clusters in R^2
• Using conditional entropy gives us a measure that is:
  - Distribution-free
  - Robust to outliers and extreme values
  - Low in nuisance parameters
  - Robust to noise, as long as the noise exists in all subspaces
• Adapted from a method proposed by Guo et al. of the Geography department at Penn State.
Guo et al., Workshop on Clustering High-Dimensional Data and its Applications, 2003
Geography to … Microarrays?
• Guo et al. have data with many (~10,000) observations in a few (~50) dimensions (measurements).
[Figure: two data matrices with their observation ("Obs.") and dimension ("Dim.") axes labeled, illustrating the swap of roles]
• We have the opposite problem: we have many more 'dimensions' (genes) than we do observations ('samples' or 'patients') on those dimensions.
• We turn Guo's method on its ear: we treat observations as dimensions and vice versa.
The method
[Figure: pipeline flowchart. Gene Expression Data (n_g x n_s) -> Nested Means Matrix (n_r x n_s) -> CE Distance Matrix (n_s x n_s). From the CE Distance Matrix: Minimal Spanning Tree -> MST Order, and Clique Discovery -> Cliques.]
CE – what are we looking for?
Nested means discretization
Resistant to extreme outliers, unlike an equal-interval approach.
• We calculate nested mean vectors as follows:
  - Calculate the mean value of a dimension.
  - Divide the data into two parts at this mean.
  - Recursively divide each part at its own mean, building up a vector of 'nested mean' boundaries.
  - Stop once we have the required number of intervals (denoted r).
• We want enough intervals that, on average, each cell contains ~35 points (Cheng et al., 1999). Guo uses, with n the number of observations and r the number of intervals:
  $r^2 \approx n / 35, \qquad r = 2^k$
• Example: for n = 10,000, r = 16, because 16 × 16 = 256 and 256 × 35 = 8,960 < 10,000, whereas the next power of two, r = 32, would need 32 × 32 × 35 = 35,840 points.
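The recursion described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: the function names `num_intervals` and `nested_means` are hypothetical, and the rule for choosing r (largest power of two with r × r × 35 ≤ n) is one reading of the slide's example.

```python
def num_intervals(n, per_cell=35):
    """Largest power of two r with r * r * per_cell <= n (never below 2)."""
    r = 2
    while (2 * r) ** 2 * per_cell <= n:
        r *= 2
    return r


def nested_means(values, r):
    """Return the interior boundaries produced by recursive mean splits.

    r is assumed to be a power of two. Normally r - 1 boundaries come back;
    fewer are returned if a part runs out of points before r is reached.
    """
    def split(vals, parts):
        if parts == 1 or len(vals) < 2:
            return []
        m = sum(vals) / len(vals)
        lower = [v for v in vals if v <= m]   # values at the mean go low
        upper = [v for v in vals if v > m]
        return split(lower, parts // 2) + [m] + split(upper, parts // 2)

    return split(list(values), r)


if __name__ == "__main__":
    print(num_intervals(10000))
    print(nested_means([1, 2, 3, 4, 5, 6, 7, 8], 4))
```

Because each boundary is a mean of the points on its own side of the previous split, a single extreme value only stretches the outermost interval instead of emptying most of an equal-width grid.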
Calculating CE
• For every pair of dimensions (X and Y), discretize the 2D subspace using the nested-means intervals; each cell is then represented in a table by the number of observations that fall in that cell.
• Calculate the entropy of every row and every column; weight each by its row or column sum divided by the total number of observations.
• Add up the weighted row and column entropy values to get CE(Y|X) and CE(X|Y). The maximum of these two values is the final cluster tendency measure.
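The first step, turning a pair of dimensions into a table of cell counts, might look like the sketch below. It assumes the interval boundaries for each dimension have already been computed and are sorted; `contingency_table` is a hypothetical helper name, and sending values that sit exactly on a boundary to the lower interval is a choice made here, not something the slides specify.

```python
from bisect import bisect_left


def contingency_table(x, y, x_bounds, y_bounds):
    """Count observations per 2D cell.

    Rows index the intervals of x, columns the intervals of y.
    x_bounds and y_bounds are sorted lists of interior boundaries.
    """
    table = [[0] * (len(y_bounds) + 1) for _ in range(len(x_bounds) + 1)]
    for xv, yv in zip(x, y):
        # bisect_left puts a value equal to a boundary in the lower interval
        table[bisect_left(x_bounds, xv)][bisect_left(y_bounds, yv)] += 1
    return table


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    y = [1, 1, 4, 4]
    print(contingency_table(x, y, [2.5], [2.5]))
```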
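Calculating CE

$H(C) = -\sum_{x \in \chi} \frac{d(x) \log d(x)}{\log |\chi|}$

Here d(x) is the fraction of a row's (or column's) observations that fall in cell x, and |χ| is the number of cells in that row or column.

       X1    X2    X3    X4    X5    X6  |  Sum   Wt    CE
X1      0     1     3     0     0     0  |    4  .03  .314
X2      1     9     1     0     1     2  |   14  .09  .629
X3      7    14     3     7     6     0  |   37  .25  .835
X4      7     6    13    19    12     5  |   62  .41  .939
X5      0     4    14     5     1     1  |   25  .17  .668
X6      1     2     3     2     0     0  |    8  .05  .737
Sum    16    36    37    33    20     8
Wt    .11   .24   .25   .22   .13   .05
CE   .597  .847  .806  .615  .540  .502

CE(X|Y) = .812 (weighted sum of the row entropies)
CE(Y|X) = .700 (weighted sum of the column entropies)
CE max = .812

150 total values, r = 6 intervals; example taken from Guo et al.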
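As a check on the worked example, the following sketch recomputes the weighted row and column entropies from the 6 × 6 count table. The function names are hypothetical, and the entropy is normalized by the log of the number of cells in a row or column, which is the reading of the formula that reproduces the per-row CE values in the table.

```python
import math


def normalized_entropy(counts):
    """Entropy of a count vector, divided by log(number of cells)."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))


def weighted_entropy(vectors):
    """Sum of per-vector entropies, each weighted by its share of all counts."""
    n = sum(sum(v) for v in vectors)
    return sum(sum(v) / n * normalized_entropy(v) for v in vectors)


def conditional_entropies(table):
    """Return (weighted row entropy, weighted column entropy) for a count table."""
    rows = [list(r) for r in table]
    cols = [list(c) for c in zip(*table)]
    return weighted_entropy(rows), weighted_entropy(cols)


# Count table from the slide (example taken from Guo et al.)
TABLE = [
    [0, 1, 3, 0, 0, 0],
    [1, 9, 1, 0, 1, 2],
    [7, 14, 3, 7, 6, 0],
    [7, 6, 13, 19, 12, 5],
    [0, 4, 14, 5, 1, 1],
    [1, 2, 3, 2, 0, 0],
]

if __name__ == "__main__":
    ce_rows, ce_cols = conditional_entropies(TABLE)
    print(round(ce_rows, 3), round(ce_cols, 3), round(max(ce_rows, ce_cols), 3))
```

The two results should land close to the slide's 0.812 and 0.700, with the larger of the two taken as the cluster tendency measure for this pair of dimensions.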
Graph-theoretic analysis
• The CE calculation results in a distance matrix; visualizing the fully connected graph is of little use.
• We can use graph theory to answer two questions:
  - Topologically, is there a linear order that, when the data are sorted by it and imaged, can reveal cluster structure?
  - What fully connected subgraphs (cliques) exist in the data?
Sample ordering: the MST
• A minimum spanning tree (MST) of a weighted graph is a spanning tree whose total weight (the sum of the weights of its edges) is the smallest among all spanning trees.
• We can use the topological ordering of the MST to create a relative ordering of our samples. Sorting the samples in this way in a data image can reveal structure.
• We used Kruskal's algorithm, a greedy approach to generating an MST, as implemented in the RBGL R library (mstree.kruskal()).
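For readers without RBGL at hand, a pure-Python stand-in for the two steps is sketched below: Kruskal's algorithm on a full distance matrix, followed by a walk of the tree to get a linear order. The slides do not say exactly how the order is read off the MST, so the nearest-neighbour-first depth-first walk in `mst_order` is an assumption, and both function names are hypothetical.

```python
def kruskal_mst(dist):
    """Kruskal's algorithm on a symmetric distance matrix.

    Returns the tree as a list of (i, j, weight) edges.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # edge joins two components, so keep it
            parent[ri] = rj
            tree.append((i, j, w))
    return tree


def mst_order(dist, start=0):
    """Linear order from a depth-first walk of the MST (assumed convention)."""
    adj = {i: [] for i in range(len(dist))}
    for i, j, w in kruskal_mst(dist):
        adj[i].append((w, j))
        adj[j].append((w, i))
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # push the farthest neighbour first so the nearest is visited next
        for w, nxt in sorted(adj[node], reverse=True):
            if nxt not in seen:
                stack.append(nxt)
    return order


if __name__ == "__main__":
    points = [0, 10, 1, 11]   # two tight pairs on a line
    d = [[abs(a - b) for b in points] for a in points]
    print(kruskal_mst(d))
    print(mst_order(d))
```

In the toy example the order keeps each tight pair adjacent, which is the property that makes block structure visible when the CE matrix is re-sorted and imaged.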
Use of the MST to Induce Orderings on the Dimensions
• Similar to UPGMA tree-building
• The linear ordering can be viewed as a 1D compression of the resulting hierarchical tree
MST orderings on the image of the CE values
• After ordering the samples according to their MST order, R's image() function can generate the image at right.
• This ordering can show us previously hidden cluster structure without any presupposition.
Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph
• If we can see cluster structure, can we retrieve it automatically?
• On the fully connected graph, break all edges longer than a threshold distance (the threshold is somewhat subjective and varies between data sets).
Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph (continued)
• On the resulting graph, find all cliques (fully connected node sets).
• Clique finding used clique() from Dr. Marchette's graph library.
• Future work: a more efficient method is required.
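The thresholding and clique search can be illustrated with the basic Bron-Kerbosch algorithm. This is a sketch of a standard technique, not the clique() routine named on the slide, and `threshold_graph`, `maximal_cliques`, and the `cutoff` parameter are hypothetical names. The plain recursion is exponential in the worst case, which is consistent with the slide's note that a more efficient method is needed.

```python
def threshold_graph(dist, cutoff):
    """Keep only edges no longer than cutoff; returns node -> set of neighbours."""
    n = len(dist)
    return {
        i: {j for j in range(n) if j != i and dist[i][j] <= cutoff}
        for i in range(n)
    }


def maximal_cliques(adj):
    """All maximal cliques via basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already-explored nodes
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


if __name__ == "__main__":
    points = [0, 10, 1, 11]   # two tight pairs on a line
    d = [[abs(a - b) for b in points] for a in points]
    print(maximal_cliques(threshold_graph(d, cutoff=2)))
```

Nodes left with no edges after thresholding come back as single-node cliques, so a caller interested only in clusters may want to drop cliques of size one.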
Implementation details
• Nested means discretization and calculation of conditional entropy written in R
• MST ordering and dot files (our graph format of choice) written in Perl
• Graphs visualized using AT&T's Graphviz
• All input and output files are tab-delimited ASCII text
Anecdotal Results
Artificial Data Set
• 1000 observations in R^100, distributed N(0,1) in each of the variates
• Observations 1-250 translated by +3 in dimensions {5,6,7,8}
• Observations 251-500 translated by -3 in dimensions {24,25,26,27,28,29,30}
• Observations 501-750 translated by +5 in dimensions {55,56,57,58,59,60,61,62,63,64,65,66,67}
• Observations 751-1000 translated by -5 in dimensions {10,11,12,13,14}
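A data set with this layout can be generated as follows. This is a sketch using Python's standard library rather than the authors' tooling; `make_artificial`, `SHIFTS`, and the seed are illustrative names and values, and the slide's 1-based observation and dimension numbers are converted to 0-based indices.

```python
import random

# (observation indices, shift, 1-based dimension numbers) as listed on the slide
SHIFTS = [
    (range(0, 250), +3, range(5, 9)),       # observations 1-250, dims 5..8
    (range(250, 500), -3, range(24, 31)),   # observations 251-500, dims 24..30
    (range(500, 750), +5, range(55, 68)),   # observations 501-750, dims 55..67
    (range(750, 1000), -5, range(10, 15)),  # observations 751-1000, dims 10..14
]


def make_artificial(seed=0):
    """Return a 1000 x 100 list of lists: N(0,1) noise plus four shifted blocks."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(1000)]
    for obs, shift, dims in SHIFTS:
        for i in obs:
            for d in dims:
                data[i][d - 1] += shift   # d is 1-based on the slide
    return data


if __name__ == "__main__":
    data = make_artificial()
    block_mean = sum(data[i][4] for i in range(250)) / 250
    print(len(data), len(data[0]), round(block_mean, 2))
```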
Artificial dataset results - MST
Image of Sorted CE Values for the Artificial Dataset
Golub dataset
• An experiment to determine the ability of microarray data to separate acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
• Custom microarray, 7,129 genes
• 72 samples
  - 47 ALL samples (both B- and T-cell)
  - 25 AML samples
T.R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, vol. 286, 531 (1999)
Golub Dataset - MST
[Figure: MST of the Golub samples; legend: AML samples, ALL samples]
Image of Sorted CE Values for the Golub Dataset
ALL data set
• Acute lymphoblastic leukemia B- and T-cell data set contributed to Bioconductor by the Dana-Farber Cancer Institute.
• Affymetrix U95Av2 chip, 12,625 genes
• 128 samples
  - 95 B-cell samples
  - 33 T-cell samples
ALL - MST
[Figure: MST of the ALL samples; legend: B-cell samples, T-cell samples]
Image of Sorted CE Values for the ALL Dataset
Summary/Conclusions
• An informative technique for initial high-level data exploration
• Future directions:
  - Concretely determine sensitivity to noise
  - Develop a visualization tool for the MST ordering
  - Find a more efficient clique-discovery method
References
• Cheng, C., A. Fu, and Y. Zhang (1999). Entropy-based subspace clustering for mining numerical data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA.
• Getz, G., E. Levine, and E. Domany (2000). Coupled two-way clustering analysis of gene microarray data. PNAS 97:22, 12079.
• Guo, D. et al. Breaking Down Dimensionality: Effective and Efficient Feature Selection for High-Dimensional Clustering. [Name of Conference]. [date]
• Guo, D., D. Peuquet, and M. Gahegan (2002). Opening the Black Box: Interactive Hierarchical Clustering for Multivariate Spatial Patterns. The 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, USA.