
Simultaneous Subset Selection via Rate-Distortion Theory - PowerPoint PPT Presentation


  1. Simultaneous Subset Selection via Rate-Distortion Theory
     - with application to cluster and significance analysis of gene expression data
     Rebecka Jörnsten, Department of Statistics, Rutgers University
     rebecka@stat.rutgers.edu, http://www.stat.rutgers.edu/~rebecka
     Biostatistics Day, April 25, 2008

  2. Outline
     1 Analysis of high-dimensional data
     2 Simultaneous Selection via Rate-distortion Theory
     3 Cluster analysis
     4 Significance analysis
     5 Conclusion and Future work

  3. Analysis of high-dimensional data
     Clustering
     - Popular approach for dimension reduction.
     - Wide range of applications: engineering, geological data, social networks, high-throughput biology data.
       * Assign gene function via "guilt by association".
       * Suggestive of biological pathways and networks.
     Multiple testing
     - Massive number of tests performed: how do we control the number (or proportion) of false rejections?
     - Problem is encountered in e.g. clinical trials with multiple end-points, fMRI analysis, and proteomics and genomics.
       * Identify a set of genes whose expression levels differ between a set of experimental conditions.

  4. Thinking about the problems in terms of Model Selection
     Clustering
     1 How many clusters?
     2 Subset model selection: What is the most efficient description of a cluster profile?
       Example: We want to be able to state objectively that a cluster corresponds to a particular pattern across experimental conditions (e.g. static).
     Multiple testing
     1 How many rejections?
     2 Subset model selection: For each rejected null hypothesis, can we identify the alternative?
       Example: We want to identify the differentially expressed genes, and the discriminatory experimental conditions.

  5. Thinking about the problems in terms of Model Selection
     Why are these model selection tasks so important?
     1 Reduce the reliance on subjective interpretations of the analysis outcome.
       * "This cluster seems to represent a static expression profile."
       * "Selected genes appear to primarily represent differential expression between only one of the experimental factors."
     2 Waste not, want not! Spend the parameter budget where it is needed. If we use efficient representations of simple data structures (e.g. static cluster profiles), we may detect more subtle structures.

  6. Challenges
     Clustering
     - Clustering and subset model selection are not separable.
     - The search for the optimal cluster subset models is combinatorial in the number of clusters and experimental conditions.
     Significance analysis
     - Double multiplicity: multiple genes, and multiple model classes for each gene.
     Model space is HUGE!

  7. Proposed strategy: Simultaneous subset selection via rate-distortion theory
     - Challenge: clustering and subset model selection are not separable tasks.
     - We appeal to results in rate-distortion theory to develop a selection method that is simultaneous across clusters.
     - Generalizes to multiple testing.

  8. Bit-allocation and Rate-Distortion
     We will turn the combinatorial model selection problems into a simultaneous search using results from optimal bit allocation in Rate-Distortion Theory.
     [Figure: distortion (vertical axis) against rate (horizontal axis), with the selected model and the slope constraint marked.]
     Here: What is "Rate"? What is "Distortion"?

  9. Bit-allocation
     - Consider data blocks (block = single gene, or gene cluster).
     - For each data block $k$, model $M$ results in a distortion $D_k(M)$, with rate $R_k(M)$ (e.g. the number of parameters $p(M)$).
     - How do we allocate model complexity to each block fairly?
     [Figure: distortion against rate, with the selected model and the slope constraint marked.]
     RD theory: To minimize the overall distortion (e.g. MSE), the optimal allocation is obtained at points of equal slope on the block-wise rate-distortion curves.
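The equal-slope rule on this slide can be illustrated with the standard Lagrangian form of bit allocation: for a common slope penalty, every block independently picks the model that minimizes distortion plus penalty times rate. The sketch below is only an illustration under assumptions. The block names, the rate-distortion numbers, and the helpers `allocate` and `totals` are all made up here, and the talk may implement the principle differently.

```python
# Hypothetical block-wise rate-distortion tables: block -> list of (rate, distortion).
# The numbers are invented for illustration; rate could be a parameter count.
rd_curves = {
    "block_a": [(0, 10.0), (1, 4.0), (2, 2.0), (3, 1.5)],
    "block_b": [(0, 3.0), (1, 2.5), (2, 2.2), (3, 2.1)],
}


def allocate(rd_curves, lam):
    """For one common slope penalty lam, pick per block the model minimizing D + lam * R."""
    return {
        block: min(points, key=lambda rd: rd[1] + lam * rd[0])
        for block, points in rd_curves.items()
    }


def totals(choice):
    """Total rate and total distortion of an allocation."""
    total_rate = sum(rate for rate, _ in choice.values())
    total_dist = sum(dist for _, dist in choice.values())
    return total_rate, total_dist


# Equal-slope allocation at lam = 1.0: block_a gets rate 2, block_b gets rate 0.
choice = allocate(rd_curves, lam=1.0)

# A "fixed model" allocation spending the same total rate, one unit per block.
fixed = {"block_a": (1, 4.0), "block_b": (1, 2.5)}
```

With these made-up curves both allocations spend a total rate of 2, but the equal-slope choice has total distortion 5.0 against 6.5 for the fixed allocation, because the budget goes to the block whose distortion drops fastest.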

  10. The Bit-allocation/Equal Slope principle
     Why does the equal slope constraint give the optimal allocation?
     [Figure: distortion against rate, with the selected model and the slope constraint marked.]
     For any other solution, there is at least one block-pair for which the rate-of-change of the distortion differs, and a better solution is obtained by re-allocating model complexity between these blocks.
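The argument on this slide can be written out as a short worked derivation. It is a sketch under assumptions that the slide does not state: each block-wise curve $D_k(R_k)$ is treated as convex and differentiable in a continuous rate $R_k$, and $R$ denotes a fixed total rate budget.

```latex
% Allocation problem: minimize total distortion under a total rate budget R.
\min_{R_1,\dots,R_K} \sum_{k=1}^{K} D_k(R_k)
\quad \text{subject to} \quad \sum_{k=1}^{K} R_k = R .

% Lagrangian and stationarity condition: every block sits at the same slope.
L(R_1,\dots,R_K,\lambda) = \sum_{k=1}^{K} D_k(R_k) + \lambda \Big( \sum_{k=1}^{K} R_k - R \Big),
\qquad
\frac{\partial L}{\partial R_k} = D_k'(R_k) + \lambda = 0
\;\Longrightarrow\;
D_k'(R_k) = -\lambda \quad \text{for all } k .

% Exchange argument: if the slopes differ, say D_i'(R_i) < D_j'(R_j),
% moving a small amount of rate \varepsilon > 0 from block j to block i gives
\Delta D \approx \varepsilon \, \big( D_i'(R_i) - D_j'(R_j) \big) < 0 ,
% so the total distortion decreases and the original allocation was not optimal.
```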

  11. The Bit-allocation/Equal Slope principle
     [Figure: three image panels titled "8bpp original", "Fixed model .5bpp" and "Optimal allocation .5bpp".]

  12. Motivating example
     - Data: mRNA gene expression levels in two divergent neural stem cell lines (one becomes neurons, the other predominantly glia).
     - Timecourse: 0, 1 and 3 days after a growth factor is blocked in the media (initiates/speeds up proliferation).
     - Gene cluster expression profiles appear "parallel", "static", "diverging"...
     [Figure: log expression against time for eight numbered gene clusters, in two panels titled "glia" and "neuron".]

  13. Model formulation
     - For each gene $g$ we observe a feature vector $x_g$:
       $x_g \mid g \text{ in cluster } k \sim \mathrm{MVN}(\mu_k, \Sigma_k)$
     - We model each cluster profile as $\mu_k = W \theta_k$, where $W$ is a design matrix that reflects the biological question.
     - A sparse representation of $\mu_k$ is obtained if we set some of the parameters $\theta_k$ to 0.
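To make the sparse cluster profile concrete, here is a minimal sketch of $\mu_k = W \theta_k$ for a layout like the motivating example (two cell lines, three time points). The design columns, the parameter values and the helper names `profile` and `sample_gene` are assumptions chosen for illustration, not the design used in the talk, and the sampling step simplifies $\Sigma_k$ to a multiple of the identity.

```python
import random

# Hypothetical design: 6 conditions = 2 cell lines x 3 time points.
line = [-1, -1, -1, 1, 1, 1]   # cell-line contrast
time = [-1, 0, 1, -1, 0, 1]    # centered time trend
# Columns of W: overall level, cell-line contrast, time trend, line-by-time interaction.
W = [[1, l, t, l * t] for l, t in zip(line, time)]


def profile(W, theta):
    """Cluster profile mu = W theta, returned as a plain list."""
    return [sum(w * th for w, th in zip(row, theta)) for row in W]


theta_full = [1.0, 0.5, 0.8, -0.3]     # all four parameters in use
theta_static = [1.0, 0.5, 0.0, 0.0]    # time terms set to 0: a "static" profile

mu_full = profile(W, theta_full)
mu_static = profile(W, theta_static)   # constant over time within each cell line


def sample_gene(mu, sd, rng):
    """Draw one feature vector around mu, assuming Sigma_k = sd^2 * I (a simplification)."""
    return [rng.gauss(m, sd) for m in mu]


rng = random.Random(0)
x_g = sample_gene(mu_static, 0.2, rng)
```

Setting the two time-related entries of `theta_static` to 0 leaves a profile that is flat over time in each cell line, which is the kind of efficient description of a "static" cluster that the slides ask for.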

  14. Model selection
     Search strategies?
     Classification EM (CEM), a step-wise approach
     1 Cluster the data.
     2 Perform separate model selection for each cluster.
     3 Update the clustering given the parameter constraints.
     Combinatorial search
     1 Cluster the data.
     2 Iterate:
       - Consider reducing a cluster-specific model by one parameter.
       - Select the cluster $k$ for which the drop does the least "damage".
     3 Stop whenever the BIC increases.
     * CEM assumes clustering and cluster model selection are separable.
     * The combinatorial search is greedy and computationally intensive.
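The greedy combinatorial search described on this slide can be sketched as a backward elimination loop. This is a simplified illustration, not the author's implementation: the clustering is held fixed (no re-clustering step), the design columns are a hypothetical orthogonal set so each coefficient is a simple projection, all clusters share one spherical variance, "damage" is measured as the increase in residual sum of squares, and the BIC is taken as $N \log(\mathrm{RSS}/N) + p \log N$. The data values and all function names are made up.

```python
import math

# Hypothetical orthogonal design columns: 2 cell lines x 3 time points.
line = [-1, -1, -1, 1, 1, 1]
time = [-1, 0, 1, -1, 0, 1]
COLUMNS = [[1] * 6, line, time, [l * t for l, t in zip(line, time)]]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cluster_rss(genes, active):
    """RSS of one cluster when its profile uses only the active columns.
    Relies on the columns being orthogonal, so each coefficient is a projection."""
    mean = [sum(col) / len(genes) for col in zip(*genes)]
    fitted = [0.0] * len(mean)
    for j in active:
        theta_j = dot(COLUMNS[j], mean) / dot(COLUMNS[j], COLUMNS[j])
        fitted = [f + theta_j * w for f, w in zip(fitted, COLUMNS[j])]
    return sum((x - f) ** 2 for g in genes for x, f in zip(g, fitted))


def bic(clusters, active):
    """Assumed Gaussian BIC with one common variance: N log(RSS / N) + p log N."""
    n_obs = sum(len(g) for genes in clusters.values() for g in genes)
    rss = sum(cluster_rss(genes, active[k]) for k, genes in clusters.items())
    n_par = sum(len(a) for a in active.values())
    return n_obs * math.log(rss / n_obs) + n_par * math.log(n_obs)


def greedy_search(clusters):
    """Drop one parameter at a time, always the least damaging one, until BIC increases."""
    active = {k: set(range(len(COLUMNS))) for k in clusters}
    current = bic(clusters, active)
    while True:
        # Damage of each single-parameter drop = increase in that cluster's RSS.
        drops = [
            (cluster_rss(clusters[k], active[k] - {j}) - cluster_rss(clusters[k], active[k]), k, j)
            for k in active
            for j in active[k]
        ]
        if not drops:
            break
        _, k, j = min(drops)
        trial = {name: set(a) for name, a in active.items()}
        trial[k].discard(j)
        trial_bic = bic(clusters, trial)
        if trial_bic > current:
            break  # stop whenever the BIC increases
        active, current = trial, trial_bic
    return active, current


# Made-up data: two genes per cluster, already clustered.
clusters = {
    "static": [[0.6, 0.3, 0.6, 1.6, 1.3, 1.6], [0.4, 0.7, 0.4, 1.4, 1.7, 1.4]],
    "diverging": [[1.1, -0.2, -0.9, -0.9, -0.2, 1.1], [0.9, 0.2, -1.1, -1.1, 0.2, 0.9]],
}
full_bic = bic(clusters, {k: set(range(len(COLUMNS))) for k in clusters})
active, final_bic = greedy_search(clusters)
```

On this toy data the search keeps the overall level and the cell-line contrast for the "static" cluster and only the line-by-time interaction for the "diverging" cluster. Every iteration rescans all clusters and parameters, which is where the computational cost noted on the slide comes from.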
