A Combinatorial Approach to the Analysis of Differential Gene Expression Data: The Use of Graph Algorithms for Disease Prediction and Screening
The Goal
• To classify patients based on expression profiles
  – Presence of cancer
  – Type of cancer
  – Response to treatment
• To identify the genes required for accurate classification
  – Too many = unnecessary noise
  – Too few = insufficient information
Classic Clustering Problem
• Current techniques:
  – Hierarchical Clustering
  – K-Means Clustering
  – Self-Organizing Maps
  – Others
• Drawbacks:
  – Determining cluster boundaries difficult with diffuse data
  – Objects can only belong to one group
Algorithmic Training
Raw Data
→ Gene Scoring: Eliminate Poorly Discriminating Genes
→ Dominating Set: Eliminate Poorly Covering Genes
→ Calculate Sample Similarities
→ Apply Threshold
→ Maximal Cliques: Verify by Classification
→ Set of Discriminatory Genes, Scores
Algorithmic Training: Raw Data → Eliminate Poorly Discriminating Genes
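The Gene Scoring Function: Identifying Discriminators
[Figure: two example plots contrasted; axis ticks 0 to 10 and 0 to 8]
\[ \mathrm{score}(\mathrm{gene}_i) = \left| \mu_{classA} - \mu_{classB} \right| - \left( \sigma_{classA} + \sigma_{classB} \right) \]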
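The scoring idea can be sketched in a few lines. This is a minimal sketch, assuming the formula reads as the absolute difference of the class means minus the sum of the class standard deviations, and assuming population standard deviation (the slide does not say which). The function name `gene_score` and the expression values are hypothetical.

```python
from statistics import mean, pstdev

def gene_score(class_a, class_b):
    """Score one gene from its expression values in two classes.

    Assumed reading of the slide: gap between the class means minus
    the summed within-class standard deviations.
    """
    gap = abs(mean(class_a) - mean(class_b))
    spread = pstdev(class_a) + pstdev(class_b)
    return gap - spread

# Hypothetical expression values, for illustration only.
separating = gene_score([1.0, 1.2, 0.8], [5.0, 5.2, 4.8])   # tight, far apart
overlapping = gene_score([1.0, 5.0, 3.0], [2.0, 4.0, 3.0])  # same mean, wide
```

Under this reading, a gene whose classes are tight and far apart scores high, while a gene whose classes overlap scores at or below zero.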
Algorithmic Training: Raw Data → Eliminate Poorly Discriminating Genes → Eliminate Poorly Covering Genes
Eliminate Poorly Covering Genes
[Figure: genes plotted against samples, with samples grouped into Class 1 and Class 2]
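The slides do not define when a gene "covers" a sample, so the sketch below takes that relation as given input and only illustrates the selection step. It uses a standard greedy set-cover heuristic as a stand-in for the dominating set computation; it is not necessarily the method used in this work. The name `greedy_cover` and the example data are hypothetical.

```python
def greedy_cover(covers):
    """Pick genes greedily until every coverable sample is covered.

    covers: dict mapping gene -> set of samples that gene covers
            (the coverage relation itself is an assumption here).
    Returns the chosen genes in selection order.
    """
    uncovered = set().union(*covers.values())
    chosen = []
    while uncovered:
        # Take the gene that covers the most still-uncovered samples.
        gene = max(covers, key=lambda g: len(covers[g] & uncovered))
        chosen.append(gene)
        uncovered -= covers[gene]
    return chosen

# Hypothetical coverage relation, for illustration only.
covers = {"g1": {"s1", "s2", "s3"}, "g2": {"s3", "s4"}, "g3": {"s2"}}
kept_genes = greedy_cover(covers)
```

Genes never selected (here `g3`) are the ones this sketch would treat as poorly covering.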
Algorithmic Training: Raw Data → Eliminate Poorly Discriminating Genes → Eliminate Poorly Covering Genes → Calculate Sample Similarities → Apply Threshold
Create Unweighted Graph
• Complete, edge-weighted graph
  – Vertices = samples
  – Edge weight = similarity metric
• Remove edge weights
  – If edge weight < threshold, remove edge from graph
  – Otherwise, keep edge, ignore weight
• Result: incomplete unweighted graph
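The thresholding step above can be sketched as follows. This is a minimal illustration; the function name `threshold_graph`, the adjacency-set representation, and the example weights are assumptions, not part of the original slides.

```python
def threshold_graph(vertices, weights, threshold):
    """Turn a complete weighted graph into an unweighted one.

    weights: dict mapping (u, v) pairs to a similarity value.
    An edge is kept only if its weight is at least the threshold;
    the weight itself is then discarded.
    """
    adjacency = {v: set() for v in vertices}
    for (u, v), weight in weights.items():
        if weight >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency

# Hypothetical similarities between three samples.
samples = ["s1", "s2", "s3"]
weights = {("s1", "s2"): 0.9, ("s1", "s3"): 0.2, ("s2", "s3"): 0.6}
graph = threshold_graph(samples, weights, threshold=0.5)
```

With a threshold of 0.5, the weak `s1`-`s3` edge is dropped and the other two edges survive without their weights.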
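The Edge Weight Function
\[ \mathrm{weight}(j, k) = \sum_i \left[ \mathrm{score}(\mathrm{gene}_i) \cdot \left( 1 - \left| \mathrm{expression\_value}_{ij} - \mathrm{expression\_value}_{ik} \right| \right) \right] \]
where expression_value_ij = expression value of gene i for sample j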
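A small sketch of the edge weight computation, assuming the difference of expression values is taken in absolute value and that expression values have been normalized to the range 0 to 1 (neither detail is stated on the slide). The name `edge_weight` and the example numbers are hypothetical.

```python
def edge_weight(scores, expr_j, expr_k):
    """Score-weighted similarity between samples j and k.

    scores: one score per gene.
    expr_j, expr_k: expression value of each gene in sample j and k,
                    assumed normalized to [0, 1].
    """
    return sum(
        score * (1 - abs(value_j - value_k))
        for score, value_j, value_k in zip(scores, expr_j, expr_k)
    )

# Hypothetical scores and expression profiles for two genes.
scores = [2.0, 1.0]
similar = edge_weight(scores, [0.9, 0.1], [0.8, 0.1])
dissimilar = edge_weight(scores, [0.9, 0.1], [0.1, 0.9])
```

Samples that agree on high-scoring genes receive a large weight, so those edges are the ones most likely to survive thresholding.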
Algorithmic Training: Raw Data → Eliminate Poorly Discriminating Genes → Eliminate Poorly Covering Genes → Calculate Sample Similarities → Apply Threshold → Verify by Classification → Set of Discriminatory Genes, Scores
What is a Clique?
• A completely connected subset of vertices in a graph
• Maximal clique = local optimization (no further vertex can be added)
• Clique problem is NP-complete
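To make the definition concrete, the sketch below enumerates maximal cliques with the basic Bron-Kerbosch recursion. This is a textbook algorithm chosen for illustration; the slides do not say which clique algorithm was used. The function name and the example graph are hypothetical.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting).

    adjacency: dict mapping each vertex to the set of its neighbors.
    Returns a list of frozensets, one per maximal clique.
    """
    cliques = []

    def expand(current, candidates, excluded):
        # Nothing left to add and nothing excluded: current is maximal.
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            neighbors = adjacency[v]
            expand(current | {v}, candidates & neighbors, excluded & neighbors)
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(frozenset(), frozenset(adjacency), frozenset())
    return cliques

# Hypothetical graph: triangle a-b-c plus a pendant edge c-d.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
found = maximal_cliques(graph)
```

The worst-case running time is exponential in the number of vertices, which is consistent with the hardness noted on the slide.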
Classification Using Clique
[Figure: graph of samples with cliques labeled Class 1, Class 2, and Class 3]
A Selection of Discriminators
ADH1B  | alcohol dehydrogenase IB | alcohol dehydrogenase activity
FHL1   | four and a half LIM domains 1 | cell growth, cell differentiation
HBB    | hemoglobin, beta | oxygen transport
CYP4B1 | cytochrome P450 4B1 | electron transport
TNA    | tetranectin | plasminogen binding protein
TGFBR2 | transforming growth factor, beta receptor II | transmembrane receptor protein serine/threonine kinase signaling pathway
The Algorithm - Unsupervised
Raw Data + Set of Discriminatory Genes, Scores
→ Calculate Sample Similarities
→ Apply Threshold
→ Classify Unknown Samples
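The slides do not spell out how an unknown sample is assigned a class once the thresholded graph is built. One plausible reading, sketched here purely as an assumption, is to take the largest maximal clique containing the unknown sample and assign the majority label of its known members. The function name, the voting rule, and the example data are all hypothetical.

```python
from collections import Counter

def classify_by_clique(unknown, cliques, labels):
    """Assign a class to an unknown sample from the cliques it belongs to.

    cliques: iterable of vertex sets (for example, maximal cliques).
    labels:  dict mapping known samples to their class.
    Returns the majority label in the largest clique containing the
    unknown sample, or None if no labeled neighbor is available.
    """
    containing = [c for c in cliques if unknown in c and len(c) > 1]
    if not containing:
        return None
    largest = max(containing, key=len)
    votes = Counter(labels[s] for s in largest if s in labels)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical cliques and labels, for illustration only.
cliques = [frozenset({"u", "a", "b"}), frozenset({"u", "c"}), frozenset({"x"})]
labels = {"a": "tumor", "b": "tumor", "c": "normal"}
predicted = classify_by_clique("u", cliques, labels)
isolated = classify_by_clique("x", cliques, labels)
```

A sample that falls in no clique with labeled samples is left unclassified rather than forced into a class.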
Summary
• Intersection of clique and dominating set techniques improves results
• Combined orthogonal scoring identifies a limited number of discriminatory genes
• Clique offers a means of validating the obtained scores and weights
• Our technique identifies a different set of discriminatory genes than the original paper
• Clique-based classification is a viable complement to present clustering methods
Ongoing and Future Research
• Reverse training
• Train to distinguish among types of cancer
• Experiment with different weight functions (e.g., Pearson's correlation coefficient)
• Investigate using less stringent techniques
  – Near-cliques
  – Neighborhood search
  – K-dense subgraphs
• Port code to SGI Altix supercomputer
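As an illustration of the alternative weight function mentioned above, the sketch below computes Pearson's correlation coefficient between two samples' expression profiles. How it would be combined with gene scores is not stated in the slides, so this shows only the coefficient itself; the function name and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (norm_x * norm_y)

# Hypothetical expression profiles across three genes.
aligned = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposed = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the coefficient ranges from -1 to 1, a threshold chosen for the original weight function would need to be re-tuned before use with this one.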
Our Research Group
• Mike Langston, Ph.D.
• Lan Lin
• Chris Symons
• Xinxia Peng
• Bing Zhang, Ph.D.