TimesVector: A vectorized clustering approach to the analysis of - PowerPoint PPT Presentation

TimesVector: A vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes
Inuk Jung, Hongryul Ahn, Kyuri Jo, Hyejin Kang, Youngjae Yu and Sun Kim
Inuk Jung (inukjung@snu.ac.kr), Bio and Health Informatics Lab, Seoul National University

Goal of this study: identify biologically meaningful gene clusters (triclusters) that have significantly similar or differential expression patterns in three-dimensional time series data (Gene-Time-Condition).

Goal of this study: example
• Organism: Mouse (18117 genes)
• Time points: day 0, day 3, day 7, day 14
• Conditions: malaria-infected intact female, gonadectomized* (gdx) female, intact male, gdx male
• 289872 expression values (G × T × C)
• Patterns sought: Differentially Expressed Patterns (DEP) and Similarly Expressed Patterns (SEP)
[Figure: example clusters of 100, 80 and 200 genes]
* Removal of ovaries or testes

Two technical problem statements
1. High clustering complexity by dimensions
2. Technical difficulty of capturing differential expression patterns between two or more conditions (What are DEGs in time series data?)

P1. High clustering complexity by dimensions
• One dimension (C): DEG analysis used for time series analysis [1] (2000). Does not take into account the sequential nature of time series expression data.
• Two dimensions (GT or GC): Biclustering algorithm developed for time series data [2] (2000). Biclustering is NP-hard and is bound to 2-dimensional clustering (either gene-time or gene-condition).
• Three dimensions (GCT): First triclustering algorithm developed, TriCluster [3] (2005). Only able to identify triclusters with similar expression patterns (SEP).
• Three dimensions (GCT): Triclustering tool that is able to identify DEPs [4] (2012). The DEP identification process is based on similarity measures, which gives poor performance.
[1] Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 2000
[2] Cheng et al., Biclustering of expression data, ISMB 2000
[3] Zhao et al., The Tricluster algorithm, ACM SIGMOD 2005
[4] Tchagang et al., The OPTricluster algorithm, BMC Bioinformatics 2012

P2. Capturing differential expression patterns between two or more conditions • Divergent pattern recognition is not available • Expression pattern differs between all patterns • OPTricluster performs a pairwise comparison for detecting divergent expression pattern clusters • In case of four conditions – A, B, C, D • A vs BCD, B vs ACD, C vs ABD, D vs ABC • Hence A!=B!=C!=D is not supported

TimesVector Framework • Clustering • Detecting patterns

Clustering – Dimension reduction
• Dimension reduction by stripping away the sample dimension and concatenating it to the time dimension
• Reduces the burden on the clustering and post-processing procedures
• No information is lost
[Figure: the 3-dimensional G × C × T matrix (genes i, time j, conditions k; samples s1 … sk, time points t1, t2, t3) is concatenated into a 2-dimensional G × CT matrix whose columns are Time (j) ⋅ Conditions (k)]
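
The concatenation step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `concatenate_conditions`, the nested-list layout `cube[gene][condition][time]`, and the toy values are all assumptions made for the example.

```python
def concatenate_conditions(cube):
    """Flatten a G x C x T nested list into a G x (C*T) matrix.

    cube[g][c][t] is the expression of gene g under condition c at time t.
    Each gene row keeps its conditions side by side, so no value is dropped.
    """
    return [[value for condition in gene for value in condition] for gene in cube]


# Toy data (illustrative values): 2 genes, 3 conditions, 3 time points.
cube = [
    [[39, 52, 31], [35, 22, 12], [55, 52, 48]],
    [[8, 16, 6], [7, 3, 1], [20, 18, 17]],
]
matrix = concatenate_conditions(cube)
print(matrix[0])  # first gene as one vector of length C * T
```

Because the reshaping only reorders indices, the original cube can be recovered by cutting each row back into C blocks of length T.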

Spherical K-means clustering
• Spherical K-means (skmeans) is used for clustering the vectors
• A K-means clustering algorithm with cosine dissimilarity as its distance measure
• Vectors are normalized to unit vectors, which projects them onto a sphere
• Minimize the cosine dissimilarity in all clusters:
\min \sum_{k=1}^{K} \sum_{i=1}^{N} \mu_{ik} \left(1 - \cos(x_i, p_k)\right)
K: total number of clusters
N: total number of genes
x_i: expression level vector of gene i
\mu_{ik}: indicator of gene i having membership to cluster k
p_k: the centroid of cluster k
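
A minimal sketch of spherical K-means follows, assuming hard memberships and random initialisation from the data. It is not the skmeans package used in the study; the function names, the `iterations` and `seed` parameters, and the toy vectors are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale v to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=50, seed=0):
    """Cluster vectors by cosine similarity; returns (labels, centroids)."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, k)  # initialise from k data points
    labels = None
    for _ in range(iterations):
        # Assignment: on unit vectors the dot product equals cosine similarity.
        new_labels = [max(range(k), key=lambda j: dot(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # memberships stopped changing
        labels = new_labels
        # Update: centroid = normalised sum of the member vectors.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


# Toy vectors (illustrative): three rising and three falling profiles.
vectors = [[1, 2, 3], [2, 4, 6.1], [10, 20, 29], [3, 2, 1], [6, 4, 2.1], [30, 19, 10]]
labels, centroids = spherical_kmeans(vectors, k=2)
print(labels)
```

Normalising first means that only the shape of a profile matters, not its magnitude, so `[1, 2, 3]` and `[10, 20, 29]` end up in the same cluster.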

Selecting K by silhouette score
• Using four microarray and RNA-seq time-series datasets, the K with the highest silhouette score was chosen
Data               C   T   C × T   K
GSE74465 (Rice)    2   3   6       100
GSE11651 (Yeast)   5   3   15      200
GSE4324 (Mouse)    4   4   16      500
GSE39429 (Rice)    4   6   24      600
[Figure: optimal K plotted against C × T (Condition × Time points)]
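
The selection rule can be sketched as below. The slides do not say how the silhouette was computed, so this assumes the standard mean silhouette width with cosine distance; `mean_silhouette`, the `candidates` dictionary, and the toy data are hypothetical, and the labelings would normally come from a clustering run for each K.

```python
import math


def cosine_distance(a, b):
    """1 - cos(a, b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def mean_silhouette(vectors, labels):
    """Average silhouette width over all vectors, using cosine distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the own cluster; b: mean distance to the nearest other cluster.
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (illustrative): candidate labelings for K = 2 and K = 3.
vectors = [[1, 2, 3], [2, 4, 6.1], [10, 20, 29], [3, 2, 1], [6, 4, 2.1], [30, 19, 10]]
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 1, 2, 2, 2]}
best_k = max(candidates, key=lambda k: mean_silhouette(vectors, candidates[k]))
print(best_k)
```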

Detecting clusters with distinct expression patterns
• Re-introduce the condition dimension by splitting vectors by conditions
• The bZIP gene vector is dissected into one vector per condition (A, B, C, D), each covering the time points (0h, 1h, 6h)
v(bZIP) = <1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3>
A = <1, 1, 1>, B = <3, 3, 3>, C = <3.5, 2.5, 3>, D = <3.7, 2.2, 3>
[Figure: the dissected vectors for conditions A to D and their centroid, plotted on 0h, 1h and 6h axes]
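
The dissection of the bZIP vector can be reproduced with a short sketch. The helper names `dissect` and `centroid` are hypothetical, and the centroid is assumed to be the plain element-wise mean of the dissected vectors.

```python
def dissect(vector, n_conditions):
    """Split a concatenated gene vector into one sub-vector per condition."""
    if len(vector) % n_conditions != 0:
        raise ValueError("vector length must be a multiple of n_conditions")
    n_time = len(vector) // n_conditions
    return [vector[c * n_time:(c + 1) * n_time] for c in range(n_conditions)]


def centroid(vectors):
    """Element-wise mean of equally long vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


# The bZIP vector from the slide: 4 conditions x 3 time points (0h, 1h, 6h).
v_bzip = [1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3]
parts = dict(zip("ABCD", dissect(v_bzip, 4)))
center = centroid(list(parts.values()))
print(parts["C"], center)
```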

Three types of patterns are defined
• DEP (Differentially Expressed Pattern): all samples in a cluster have different expression patterns
• ODEP (One Differentially Expressed Pattern): one sample in a cluster has a different expression pattern from the others
• SEP (Similarly Expressed Pattern): all samples in a cluster have similar expression patterns

Method – DEP pattern recognition
• Objective: test whether the expression of conditions A, B, C satisfies A!=B!=C
• Build a centroid for each condition within each cluster
• Select the outermost centroid as the base centroid
[Figure: clusters C1, C2 and C3 on 0h, 1h and 6h axes, each showing the dissected vectors of conditions A, B, C and the A, B and C centroids]

Method – DEP pattern recognition
1. Compute the cosine distance from each dissected vector to the base centroid for each cluster
2. Rank the dissected vectors by cosine distance
3. Measure Mutual Information (MI) with X as the distance to the base centroid and Y as the condition
4. Measure the significance of MI with 1000 random permutation tests
[Figure: cluster C2 on 0h, 1h and 6h axes with the A, B and C centroids and the base centroid]
Example for cluster C2:
Vector   Phenotype   Distance   Rank   Discretized rank
G1_A     A           0.9        10     3
G2_A     A           0.87       9      3
G3_A     A           0.96       11     3
G4_A     A           0.99       12     3
G1_B     B           0.1        2      1
G2_B     B           0.05       1      1
G3_B     B           0.2        4      1
G4_B     B           0.18       3      1
G1_C     C           0.5        5      2
G2_C     C           0.6        7      2
G3_C     C           0.57       6      2
G4_C     C           0.61       8      2
MI: Log2(4) = 2
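
Steps 2 to 4 can be sketched on the distances listed for cluster C2. This is an illustration under assumptions, not the authors' implementation: ranks are cut into as many equal-width bins as there are conditions, MI is the standard plug-in estimate in bits, and the p-value is the share of label shuffles that reach the observed MI. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def rank_and_discretize(distances, n_bins):
    """Rank distances (1 = smallest) and cut the ranks into n_bins equal bins."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    ranks = [0] * len(distances)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    per_bin = len(distances) / n_bins
    return ranks, [math.ceil(r / per_bin) for r in ranks]


def mutual_information(xs, ys):
    """Plug-in mutual information (bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def permutation_p_value(xs, ys, n_perm=1000, seed=0):
    """Share of label shuffles whose MI is at least the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Distances to the base centroid of cluster C2, as listed on the slide.
conditions = list("AAAABBBBCCCC")
distances = [0.9, 0.87, 0.96, 0.99, 0.1, 0.05, 0.2, 0.18, 0.5, 0.6, 0.57, 0.61]
ranks, bins = rank_and_discretize(distances, n_bins=3)
mi = mutual_information(bins, conditions)
p_value = permutation_p_value(bins, conditions)
print(ranks, bins, round(mi, 3), p_value)
```

With this plug-in estimate, three perfectly separated conditions give log2(3), about 1.585 bits, so the MI value quoted on the slide presumably follows a different convention or example.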

Method – ODEP pattern recognition
Objective: test whether the expression pattern of one condition among A, B, C differs from the rest: A!=BC (B=C) or B!=AC (A=C) or C!=AB (A=B)
1. Compute a base centroid of the conditions being compared (BC, AC, AB)
2. Compute the cosine distance of the dissected vectors to the centroid for each combination
3. Perform ANOVA on the computed cosine distance combinations
[Figure: clusters C1, C2 and C3 on 0h, 1h and 6h axes, each showing the dissected vectors of conditions A, B, C and the A, B and C centroids]
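
The ANOVA step can be illustrated with the F statistic of a one-way ANOVA. How the groups are formed is an assumption here: one group of cosine distances per condition, all measured against the centroid of the pooled conditions B and C. The function name and the distance values are hypothetical, and turning F into a p-value (which needs the F distribution) is left out.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf")  # groups are internally constant
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical cosine distances to the BC centroid, testing A != BC.
dist_a = [0.90, 0.87, 0.96, 0.99]
dist_b = [0.10, 0.05, 0.20, 0.18]
dist_c = [0.12, 0.08, 0.15, 0.21]
f_stat = one_way_anova_f([dist_a, dist_b, dist_c])
print(round(f_stat, 1))
```

A large F indicates that at least one condition sits at a clearly different distance from the pooled centroid; a full ODEP test would repeat this for the AC and AB combinations.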

Method – SEP pattern recognition
Objective: test whether the expression of conditions A, B, C satisfies A=B=C
1. Compute a base centroid of all conditions within a cluster
2. Compute the cosine distance of the dissected vectors to the base centroid
3. Tightness threshold: the lower bound of the 99% confidence interval over all clusters
4. Clusters whose tightness is below the lower bound of the 99% CI are SEP clusters
[Figure: clusters C1, C2 and C3 on 0h, 1h and 6h axes, each showing the dissected vectors of conditions A, B, C and the A, B and C centroids]

Results
• Data
  • Malaria-infected, gonadectomized male and female mice
  • Rice plants treated with 4 phytohormones
  • Dehydration-stress-treated rice plants
  • Fermentation of five yeast strains
• Biologically significant clusters detected
• Performance compared with TriCluster and OPTricluster

Results – Cluster patterns [Figure panels: C=4, T=4; C=4, T=6]

Results – Malaria infected Mouse data (a) DEP cluster 51 (b) ODEP cluster 20 (c) SEP cluster 357

Results – Phytohormone treated rice plants
• 5 clusters were found that responded to the ABA (abscisic acid) phytohormone
• Genes were gradually induced over time
• Enriched GO terms in these clusters were related to ‘Response to abscisic acid’

Results – Comparison with other tools [Figure panels: tightness (average within-cluster cosine distance); average number of genes per cluster; weighted silhouette score]

Conclusion
• TimesVector is able to detect gene clusters in 3D time-series data that exhibit distinct expression patterns
• In particular, it is able to detect clusters with distinctly different expression patterns across conditions
• It showed significantly improved clustering quality compared to recent triclustering tools

Funding
• The Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01121102), Rural Development Administration (RDA), Republic of Korea
• The Bio & Medical Technology Development Program of the National Research Foundation (NRF), funded by the Ministry of Science, ICT & Future Planning (2012M3A9D1054622)
• The Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C3224)

Thank you for your attention
