Protein Clustering: Parallelizing an Expensive, Irregular Computation James Larus EPFL AACBB February 23, 2019 San Diego, CA
PhD research
PhD “Parallel and Scalable Bioinformatics”, April 2020
Stuart Byma
What’s a protein?
§ Linear polymer of amino acids
  • Fold into complex 3D structures
§ Perform many biological functions
Central dogma of molecular biology
§ Gene expression
  • DNA → Protein
§ Encoded by genes in genome
§ 19,000 – 20,000 proteins in humans
  • 1.5% of human genome
§ Composed of 20 amino acids
[Diagram: DNA → (transcription) → RNA → (translation) → Protein]
Transcription
§ Transcribe DNA to RNA inside the nucleus
Translation
§ Once in cytoplasm, mRNA is translated to polypeptide
[Figure source: https://en.wikipedia.org/wiki/Translation_(biology)#/media/File:Ribosome_mRNA_translation_en.svg]
Folding
§ Polypeptides fold spontaneously, or are assisted by chaperone proteins
Proteins & evolution
§ Homologous – similar due to shared ancestry
§ Ortholog – similar proteins diverged through speciation
§ Similarities between proteins are proxies for similarities between genes
  • Infer function of new protein because of its similarity to known protein
    § Extrapolation from small number of model organisms
  • Infer evolutionary relationships between species
    § X evolved from Y
    § X, Y have common ancestor
§ Several of the 100 most-cited scientific papers are on sequence homology
Sequence homology
[Figure: alignment showing protein similarity between hemoglobin α-subunits from human (Homo sapiens) and bonobo (Pan paniscus) proteins]
Identifying similar proteins
§ Input → sequenced proteins
§ Output → sets of homologous proteins
§ All-against-all comparison
  • O(n²) in number of sequences
  • Sequence comparison also O(n²) in length of sequences (Smith-Waterman)
§ OMA protein database contains proteins from 2000 genomes
  • Required more than 10 million CPU hours
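The two quadratic costs above can be made concrete with a minimal Python sketch. It is an illustration only: the scoring parameters (`match`, `mismatch`, `gap`) and the linear gap penalty are assumptions chosen for brevity, whereas real protein alignment normally uses a substitution matrix and affine gaps. The function names are hypothetical.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score of strings a and b.

    Fills a len(a) x len(b) table row by row, so the cost is quadratic
    in sequence length. Scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_against_all(seqs):
    """Score every unordered pair: n*(n-1)/2 alignments for n sequences."""
    return {
        (i, j): smith_waterman_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }
```

With n sequences the dictionary holds n(n-1)/2 entries, each of which costs a full alignment table. That product is what makes the all-against-all approach expensive.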
Improvement needed!
“Computing orthologs between all complete proteomes has recently gone from typically a matter of CPU weeks to hundreds of CPU years, and new, faster algorithms and methods are called for.”
– Quest for Orthologs consortium, 2014
Incremental greedy protein clustering
§ “Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology”, PeerJ, 2014, Wittwer, Pilizota, Altenhoff, Dessimoz.
§ Cluster similar proteins, then perform all-against-all comparison within each cluster
§ Reduces computation time by ~75%
§ Identify >99.6% of pairs found by all-vs-all
[Diagram: Cluster 0 … Cluster N]
Cluster representative
§ Input sequences compared against a cluster representative
  • Homologies are transitive
    § A, B homologous; B, C homologous ⇒ A, C homologous
§ No matches? Create a new cluster!
[Diagram: clusters with representatives R_1 … R_n, each holding sequences S_1 … S_m; a new cluster with representative R_n+1 holding S_1]
Proteins not transitive
Clustering, v2
§ Multiple representatives
§ Ensure all sequences in a cluster are covered (± T residues)
[Diagram: cluster with representatives R_1 … R_n and sequences S_1 … S_m, m >> n]
Incremental greedy protein clustering
§ Reduction in computation time of ~75%
  • Clusters are small, on average
§ Accuracy is excellent
  • Maintain >99.6% of all pairs identified by all-against-all (naive)
But,
§ Algorithm is not easily parallelized
§ Order in which clusters and representatives are chosen affects result
§ Data (clusters) is shared – difficult to distribute
Our approach: precise clustering
§ Precise clustering (PC)
  • All significant pairs are members of at least one cluster
  • Compare within cluster and find similarity
§ A pair of proteins is significant if their similarity is above a threshold
  • f(p_1, p_2) > T
§ PC is not a partition – a protein can be in more than one cluster
  • Relation f is not transitive, i.e. similarity is not equivalence
[Diagram: proteins p_1 … p_4 grouped into clusters, with p_2 shown twice]
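The definition of a precise clustering can be written down as a small check. The Python sketch below is an illustration under assumptions: the helper name `missed_pairs` is hypothetical, and the toy similarity (integers that differ by at most 2) merely stands in for f(p_1, p_2) > T. It also shows why a precise clustering cannot always be a partition.

```python
from itertools import combinations


def missed_pairs(elements, clusters, similar):
    """Return the significant pairs that share no cluster.

    An empty result means the clustering is precise in the sense above.
    """
    missed = []
    for a, b in combinations(elements, 2):
        if similar(a, b) and not any(a in c and b in c for c in clusters):
            missed.append((a, b))
    return missed


def similar(a, b):
    """Toy stand-in for f(a, b) > T: similar if the integers differ by <= 2."""
    return abs(a - b) <= 2


elements = [0, 2, 4]
overlapping = [{0, 2}, {2, 4}]  # element 2 belongs to two clusters
partition = [{0, 2}, {4}]       # forcing a partition separates 2 and 4

print(missed_pairs(elements, overlapping, similar))  # no pair is missed
print(missed_pairs(elements, partition, similar))    # the pair (2, 4) is missed
```

Because 0 and 4 are not similar while 2 is similar to both, only the overlapping clustering keeps every significant pair together.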
Cluster representative
§ Each cluster has a unique representative R_C
  • ∀e ∈ C, f(e, R_C) > T
§ Two elements in cluster may not be similar: e_1, e_2 ∈ C ⊬ f(e_1, e_2) > T
[Diagram: p_1, p_2]
Approach 1
§ New element e is compared against cluster representatives
  • If similar, e is added to cluster
§ This does not work!
  • e, other than representative, will not be compared against subsequent elements
  • Because f is not transitive, clustering will not be precise – may miss matches
[Diagram: proteins p_1 … p_5 grouped into clusters, with p_2 shown twice]
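A tiny Python example makes the failure concrete. It is a sketch under assumptions: `approach_1` is a hypothetical name for the representative-only scheme, and the toy similarity (integers that differ by at most 2) stands in for a non-transitive f.

```python
def approach_1(elements, similar):
    """Representative-only clustering: compare each new element against
    cluster representatives only, and add it to every cluster it matches.

    Each cluster is a list whose first entry is its representative.
    """
    clusters = []
    for e in elements:
        matched = [c for c in clusters if similar(e, c[0])]
        for c in matched:
            c.append(e)
        if not matched:
            clusters.append([e])  # no representative matched: start a cluster
    return clusters


def similar(a, b):
    """Toy non-transitive similarity: integers that differ by at most 2."""
    return abs(a - b) <= 2


clusters = approach_1([0, 2, 4], similar)
print(clusters)
```

Element 2 joins the cluster represented by 0, but 4 is only ever compared against 0, so it starts its own cluster. The significant pair (2, 4) never shares a cluster, and the result is not precise.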
Transitive similarity
§ Transitivity R(e_1, e_2) implies e_2 will be similar to e_3 if e_1 is similar to e_3
  • ∀ i, j, k ∈ S, R(i, j) ⇒ f(i, k) > T ∧ f(j, k) > T
[Diagram: p_1, p_2, p_3 with edges labelled R(p_1, p_2), R(p_1, p_3), f(p_2, p_3)]
Protein similarity
§ Similarity function f
  • Smith-Waterman alignment > T (threshold parameter)
§ Not transitive
§ Comparison order matters
[Diagram: Seq A (rep) with f > T to Seq B and f > T to Seq C; comparison orders A B C and B A C]
Protein transitivity
R(X, Y): score > minT, uY < maxU
R(Y, X): score > minT, uX < maxU
[Diagram: sequences X and Y joined by an alignment with an S-W score; uX and uY mark the uncovered subsequence of each]
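The two conditions can be sketched as predicates over an alignment summary. This Python sketch rests on assumptions: the `Alignment` record and its field names are hypothetical, and "uncovered" is taken to mean sequence length minus the number of residues inside the aligned region. The thresholds `min_t` and `max_u` correspond to minT and maxU, and the numbers in the example are made up.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    score: int      # Smith-Waterman score of the local alignment
    x_len: int      # length of sequence X
    y_len: int      # length of sequence Y
    x_aligned: int  # residues of X inside the aligned region
    y_aligned: int  # residues of Y inside the aligned region


def r_xy(aln, min_t, max_u):
    """R(X, Y): score > minT and uncovered residues of Y (uY) < maxU."""
    return aln.score > min_t and (aln.y_len - aln.y_aligned) < max_u


def r_yx(aln, min_t, max_u):
    """R(Y, X): score > minT and uncovered residues of X (uX) < maxU."""
    return aln.score > min_t and (aln.x_len - aln.x_aligned) < max_u


# A short Y aligned almost end to end against a much longer X.
aln = Alignment(score=250, x_len=300, y_len=120, x_aligned=110, y_aligned=110)
print(r_xy(aln, min_t=100, max_u=15))  # uY = 10
print(r_yx(aln, min_t=100, max_u=15))  # uX = 190
```

The example shows that the relation is not symmetric: the same alignment can satisfy R(X, Y) while failing R(Y, X) when one sequence is much longer than the other.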
Incremental greedy precise clustering
§ Construct clusters one element at a time
§ First element becomes cluster representative
[Diagram: p_1]
Incremental greedy precise clustering
§ Compare subsequent elements against cluster representative
[Diagram: p_2 compared with p_1: R? f?]
Incremental greedy precise clustering
§ If transitively similar, add to cluster
[Diagram: R holds between p_2 and p_1 ➔ cluster {p_1, p_2}]
Incremental greedy precise clustering
§ If only similar, add to cluster and create a new cluster
[Diagram: only f holds between p_2 and p_1 ➔ cluster {p_1, p_2} and new cluster {p_2}]
Incremental greedy precise clustering
§ Continue until all elements clustered
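The incremental procedure can be sketched in a few lines of Python. This is one reading of the slides, not the authors' implementation, and it makes its own assumptions: an element is added to every cluster whose representative it is similar to, and it founds a new cluster unless it is transitively similar to at least one representative. The toy data model is also an assumption: elements are sets of features, two elements are similar if they share a feature, and `transitive(e, rep)` holds when `e` is a subset of `rep`, so anything similar to `e` is also similar to `rep`.

```python
from itertools import combinations


def precise_clustering(elements, similar, transitive):
    """Incremental greedy precise clustering (illustrative sketch).

    Each cluster is a list whose first entry is its representative.
    """
    clusters = []
    for e in elements:
        covered = False
        for c in clusters:
            rep = c[0]
            if similar(e, rep):
                c.append(e)              # similar: e joins the cluster
                if transitive(e, rep):
                    covered = True       # rep will also catch e's matches
        if not covered:
            clusters.append([e])         # e becomes a new representative
    return clusters


def similar(a, b):
    """Toy similarity: the two feature sets overlap."""
    return len(a & b) > 0


def transitive(e, rep):
    """Toy transitivity: rep contains every feature of e."""
    return e <= rep


def is_precise(elements, clusters, similar):
    """True if every similar pair shares at least one cluster."""
    return all(
        any(a in c and b in c for c in clusters)
        for a, b in combinations(elements, 2)
        if similar(a, b)
    )


data = [frozenset("AB"), frozenset("A"), frozenset("BC"), frozenset("CD")]
clusters = precise_clustering(data, similar, transitive)
print(clusters)
print(is_precise(data, clusters, similar))
```

In this toy, `{A}` is absorbed by the cluster of `{A, B}`, while `{B, C}` is only similar to `{A, B}` and therefore both joins that cluster and becomes a representative itself, which is what later lets `{C, D}` meet it.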
Parallelism
§ Unlike original Wittwer algorithm, order does not matter for precise clustering
§ Clusters can be constructed independently and merged
[Diagram: R() ?]
Merging clusters
[Diagram: R() ?]
Merging clusters
[Diagram: R() ? f() ? f()]
Merging sets of clusters
[Diagram: Set 1 and Set 2 hold clusters 1, 2, 3, 4; the merged result holds clusters 1, 4 and 2 & 3]
Cluster merge
Parallelization 1
Parallelization 2
§ Parallelize merge of two large sets
§ Each computation is a partial merge