TimesVector: A vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes
Inuk Jung, Hongryul Ahn, Kyuri Jo, Hyejin Kang, Youngjae Yu and Sun Kim
Inuk Jung (inukjung@snu.ac.kr), Bio and Health Informatics lab, Seoul National University
Goal of this study
Identify biologically meaningful gene clusters (triclusters) that have significantly similar or differential expression patterns in 3-dimensional time series data (Gene-Time-Condition)
Goal of this study – Example
• Organism: Mouse (18117 genes)
• Time points: day 0, day 3, day 7, day 14
• Conditions: Malaria infected intact female, gonadectomized* (gdx) female, intact male, gdx male
• 289872 expression values (G × T × C = 18117 × 4 × 4)
[Figure: example clusters showing Differentially Expressed Patterns (DEP) and a Similarly Expressed Pattern (SEP), with cluster sizes of 100, 80 and 200 genes]
* Removal of ovaries or testes
Two technical problem statements
1. High clustering complexity by dimensions
2. Technical difficulty of capturing differential expression patterns between two or more conditions (What are DEGs in time series data?)
P1. High clustering complexity by dimensions
• One dimension (C): DEG analysis used for time series analysis [1] (2000)
  • Does not take into account the sequential nature of time series expression data
• Two dimensions (GT, or GC): Biclustering algorithm developed for time series data [2] (2000)
  • Biclustering is NP-hard and is bound to 2-dimensional clustering (either gene-time or gene-condition)
• Three dimensions (GCT): First triclustering algorithm developed, TriCluster [3] (2005)
  • Only able to identify triclusters with similar expression patterns (SEP)
• Three dimensions (GCT): Triclustering tool that is able to identify DEPs [4] (2012)
  • Identification process of DEP is based on similarity measures, which gives poor performance
[1] Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 2000
[2] Cheng et al., Biclustering of expression data, ISMB 2000
[3] Zhao et al., The TriCluster algorithm, ACM SIGMOD 2005
[4] Tchagang et al., The OPTricluster algorithm, BMC Bioinformatics 2012
P2. Capturing differential expression patterns between two or more conditions
• Divergent pattern recognition is not available
  • Divergent: the expression pattern differs between all conditions
• OPTricluster performs pairwise (one-versus-rest) comparisons to detect clusters with divergent expression patterns
  • In the case of four conditions A, B, C, D:
  • A vs BCD, B vs ACD, C vs ABD, D vs ABC
  • Hence A!=B!=C!=D is not supported
TimesVector Framework Clustering Detecting patterns
Clustering – Dimension reduction
• Dimension reduction by stripping away the sample dimension and concatenating it to the time dimension
• Takes the burden off the clustering and post-processing procedures
• No information is lost
[Figure: a 3-dimensional G × C × T matrix (genes g1..gi, time points t1..t3, samples s1..sk) is concatenated into a 2-dimensional G × CT matrix whose columns are t1 t2 t3 repeated for each sample; axes: Genes (i), Time (j) ⋅ Conditions (k)]
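The concatenation step can be sketched in a few lines of Python. This is a minimal illustration, not the TimesVector implementation: the function name `flatten_conditions` and the nested-list layout `cube[gene][condition][time]` are assumptions, and the numbers are illustrative.

```python
def flatten_conditions(cube):
    """Turn cube[gene][condition][time] into matrix[gene][condition * T + time].

    Each gene's per-condition time series are concatenated in condition order,
    so every value is kept and only the shape changes.
    """
    return [[value for series in gene for value in series] for gene in cube]


# Two genes, two conditions, three time points (illustrative values).
cube = [
    [[15, 20, 10], [15, 10, 5]],
    [[39, 52, 31], [35, 22, 12]],
]
matrix = flatten_conditions(cube)
print(matrix[0])
```

Each row of `matrix` has C × T entries, which is the vector that is passed on to clustering.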
Spherical K-means clustering
• Spherical K-means (skmeans) is used for clustering the vectors
• A K-means variant that uses cosine dissimilarity as its distance measure
• Vectors are normalized to unit vectors, which projects them onto a sphere
• Minimize the cosine dissimilarity in all clusters:
  $\min \sum_{k=1}^{K} \sum_{i=1}^{N} \mu_{ik} \, (1 - \cos(x_i, p_k))$
  • $K$: total number of clusters
  • $N$: total number of genes
  • $x_i$: expression level vector of gene $i$
  • $\mu_{ik}$: indicator of gene $i$ having membership to cluster $k$
  • $p_k$: the centroid of cluster $k$
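The idea behind spherical K-means can be sketched as follows. This is a simplified, assumption-laden version (random initial centroids drawn from the data, hard assignments, a fixed iteration cap), not the skmeans package used in the study; all function names are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=50, seed=0):
    rng = random.Random(seed)
    data = [normalize(v) for v in vectors]
    centroids = rng.sample(data, k)  # assumption: initialise from random data points
    labels = None
    for _ in range(iterations):
        # For unit vectors, the largest dot product is the smallest cosine dissimilarity.
        new_labels = [max(range(k), key=lambda j: dot(x, centroids[j])) for x in data]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                # The new centroid is the re-normalised sum of the member vectors.
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


vectors = [[1, 2, 3], [1, 2, 4], [3, 2, 1], [4, 2, 1]]
labels, centroids = spherical_kmeans(vectors, k=2)
print(labels)
```

Because every vector is normalised first, genes are grouped by the shape of their expression profile rather than by its magnitude.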
Selecting K by silhouette score
• Using four microarray and RNA-seq time-series data sets, the K with the highest silhouette score was chosen
Data               C   T   C × T   K
GSE74465 (Rice)    2   3   6       100
GSE11651 (Yeast)   5   3   15      200
GSE4324 (Mouse)    4   4   16      500
GSE39429 (Rice)    4   6   24      600
[Figure: optimal K (y-axis, 0 to 700) plotted against Condition × Time points (x-axis, 0 to 30)]
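The selection rule can be illustrated with a small sketch that scores candidate clusterings by their mean silhouette under cosine distance and keeps the best one. The candidate labelings are hand-written here to keep the example self-contained; in practice each would come from one clustering run. The function names and the convention of scoring singleton clusters as 0 are assumptions.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_silhouette(vectors, labels):
    scores = []
    for i, (v, own) in enumerate(zip(vectors, labels)):
        # Mean distance from v to the members of every cluster, excluding v itself.
        mean_dist = {}
        for c in set(labels):
            d = [cosine_distance(v, w)
                 for j, (w, lab) in enumerate(zip(vectors, labels))
                 if lab == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        others = [d for c, d in mean_dist.items() if c != own]
        if own not in mean_dist or not others:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a, b = mean_dist[own], min(others)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


vectors = [[1, 2, 3], [1, 2, 4], [3, 2, 1], [4, 2, 1]]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}  # K -> cluster labels
best_k = max(candidates, key=lambda k: mean_silhouette(vectors, candidates[k]))
print(best_k)
```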
Detecting clusters with distinct expression patterns
• Re-introduce the condition dimension by splitting vectors by conditions
• The bZIP gene vector is dissected into as many vectors as there are conditions
  Conditions A, B, C, D, each measured at (0h, 1h, 6h):
  v(bZIP) = <1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3>
[Figure: the dissected vectors of conditions A, B, C and D and their centroid plotted in the space spanned by the 0h, 1h and 6h axes]
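Dissecting the bZIP vector from the slide can be sketched as below. The helper name `split_by_condition` is hypothetical, and the sketch assumes the conditions were concatenated in the order A, B, C, D with the same number of time points each.

```python
def split_by_condition(vector, n_conditions):
    """Cut a concatenated vector into one equal-length time series per condition."""
    n_time, remainder = divmod(len(vector), n_conditions)
    if remainder:
        raise ValueError("vector length must be a multiple of the number of conditions")
    return [vector[c * n_time:(c + 1) * n_time] for c in range(n_conditions)]


bzip = [1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3]
dissected = dict(zip("ABCD", split_by_condition(bzip, 4)))
print(dissected)
```

Each dissected vector holds the (0h, 1h, 6h) expression values of one condition.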
Three types of patterns are defined
• DEP (Differentially Expressed Pattern)
  • All samples in a cluster have different expression patterns
• ODEP (One Differentially Expressed Pattern)
  • One sample in a cluster has a different expression pattern from the others
• SEP (Similarly Expressed Pattern)
  • All samples in a cluster have similar expression patterns
Method – DEP pattern recognition
• Objective: Test whether the expression of conditions A, B, C satisfies A!=B!=C
• Build a centroid for each condition within each cluster
• Select the outermost centroid as the base centroid
[Figure: clusters C1, C2 and C3 in the space spanned by the 0h, 1h and 6h axes, with the dissected vectors of conditions A, B, C and the A, B and C centroids]
Method – DEP pattern recognition
1. Compute the cosine distance from each dissected vector to the base centroid for each cluster
2. Rank dissected vectors by cosine distance
3. Measure Mutual Information (MI) with X as the distance to the base centroid and Y as the condition
4. Measure the significance of MI by 1000 random permutation tests
[Figure: cluster C2 in the space spanned by the 0h, 1h and 6h axes, with the dissected vectors of conditions A, B, C, the A, B and C centroids, and the base centroid]
Example for cluster C2:
Phenotype          A     A     A     A     B     B     B     B     C     C     C     C
Vector             G1_A  G2_A  G3_A  G4_A  G1_B  G2_B  G3_B  G4_B  G1_C  G2_C  G3_C  G4_C
Distance (C2)      0.9   0.87  0.96  0.99  0.1   0.05  0.2   0.18  0.5   0.6   0.57  0.61
Rank               10    9     11    12    2     1     4     3     5     7     6     8
Discretized rank   3     3     3     3     1     1     1     1     2     2     2     2
MI: Log2(4) = 2
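Steps 2 to 4 can be sketched with the distances from the example table. This is a hedged illustration rather than the authors' code: the equal-width rank binning, the plug-in MI estimate in bits, and the add-one permutation p-value are all assumptions, as are the function names. With three equally sized, perfectly separated conditions this plug-in estimate evaluates to log2(3) ≈ 1.585 bits, so the value shown on the slide may follow a different convention.

```python
import math
import random
from collections import Counter


def discretized_ranks(distances, n_bins):
    """Rank the distances (ascending) and map the ranks onto n_bins equal-sized bins."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    bins = [0] * len(distances)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(distances) + 1
    return bins


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def permutation_p_value(xs, ys, n_perm=1000, seed=0):
    """Share of label permutations whose MI is at least the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


conditions = list("AAAABBBBCCCC")
distances = [0.9, 0.87, 0.96, 0.99, 0.1, 0.05, 0.2, 0.18, 0.5, 0.6, 0.57, 0.61]
bins = discretized_ranks(distances, n_bins=3)
mi, p_value = permutation_p_value(bins, conditions)
print(bins, round(mi, 3))
```

A high MI with a small permutation p-value indicates that the distance to the base centroid separates the conditions, which is the signature of a DEP cluster.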
Method – ODEP pattern recognition
Objective: Test whether the expression pattern of one condition among A, B, C differs from the others: A!=BC (B=C) or B!=AC (A=C) or C!=AB (A=B)
1. Compute a base centroid of the conditions being compared (BC, AC, AB)
2. Compute the cosine distance of the dissected vectors to the centroid for each combination
3. Perform ANOVA on the computed cosine distances of each combination
[Figure: clusters C1, C2 and C3 in the space spanned by the 0h, 1h and 6h axes, with the dissected vectors of conditions A, B, C and the A, B and C centroids]
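The ANOVA in step 3 can be illustrated with a hand-rolled one-way F statistic. How the groups are formed is an assumption here: the sketch groups the cosine distances to a hypothetical BC base centroid by condition, and the distance values are invented for illustration. Turning F into a p-value needs the F distribution, which is left out.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square.

    Assumes at least two groups and non-zero within-group variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Illustrative distances of each condition's dissected vectors to the BC centroid.
distance_to_bc = {
    "A": [0.90, 0.87, 0.96, 0.99],
    "B": [0.10, 0.05, 0.20, 0.18],
    "C": [0.12, 0.08, 0.15, 0.20],
}
f_stat = one_way_anova_f(list(distance_to_bc.values()))
print(round(f_stat, 1))
```

A large F for exactly one of the combinations (BC, AC, AB) would suggest that the left-out condition is the one that differs.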
Method – SEP pattern recognition
Objective: Test whether the expression of conditions A, B, C satisfies A=B=C
1. Compute a base centroid of all conditions within a cluster
2. Compute the cosine distance of the dissected vectors to the base centroid
3. Tightness threshold: the lower bound of the 99% confidence interval of the tightness of all clusters
4. Clusters with tightness below this lower bound are SEP clusters
[Figure: clusters C1, C2 and C3 in the space spanned by the 0h, 1h and 6h axes, with the dissected vectors of conditions A, B, C and the A, B and C centroids]
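Steps 3 and 4 can be sketched as a simple threshold on per-cluster tightness. The slide does not say how the confidence interval is built, so this sketch assumes a normal-approximation interval for the mean tightness across clusters; the tightness values and the function name `sep_clusters` are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev


def sep_clusters(tightness_by_cluster, confidence=0.99):
    """Return the CI lower bound and the clusters whose tightness falls below it."""
    values = list(tightness_by_cluster.values())
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 99%
    lower = mean(values) - z * stdev(values) / math.sqrt(len(values))
    selected = [cid for cid, t in tightness_by_cluster.items() if t < lower]
    return lower, selected


# Tightness = mean cosine distance of dissected vectors to the base centroid.
tightness = {"C1": 0.02, "C2": 0.30, "C3": 0.25, "C4": 0.28, "C5": 0.27, "C6": 0.33}
lower, selected = sep_clusters(tightness)
print(round(lower, 3), selected)
```

Only clusters whose dissected vectors sit unusually close to the shared centroid pass the threshold, which matches the intent of A=B=C.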
Results
• Data
  • Malaria infected / gonadectomized male and female mice
  • Rice plants treated with 4 phytohormones
  • Dehydration stress treated rice plants
  • Fermentation of five yeast strains
• Biologically significant clusters detected
• Performance compared with TriCluster and OPTricluster
Results – Cluster patterns C=4, T=4 C=4, T=6
Results – Malaria infected Mouse data (a) DEP cluster 51 (b) ODEP cluster 20 (c) SEP cluster 357
Results – Phytohormone treated rice plants
• 5 clusters were found that responded to the ABA (abscisic acid) phytohormone
• Genes in these clusters were gradually induced over time
• Enriched GO terms in these clusters were related to ‘Response to abscisic acid’
Results – Comparison with other tools
[Figure: tools compared on three measures: tightness (average within-cluster cosine distance), average number of genes per cluster, and weighted silhouette score]
Conclusion
• TimesVector is able to detect gene clusters in 3D time-series data that exhibit distinct expression patterns
• In particular, it is able to detect clusters with distinctively different expression patterns across conditions
• It showed significantly improved clustering quality compared to recent triclustering tools
Funding
• The Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01121102), Rural Development Administration (RDA), Republic of Korea
• The Bio & Medical Technology Development Program of the National Research Foundation (NRF), funded by the Ministry of Science, ICT & Future Planning (2012M3A9D1054622)
• The Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C3224)
Thank you for your attention