15th Symposium on Computer Architecture and High Performance Computing

15th Symposium on Computer Architecture and High Performance Computing 1/12
November 10 to 12 - São Paulo, SP

Comparison of Genomes using High-Performance Parallel Computing

N. F. Almeida Jr, Universidade Federal de Mato Grosso do Sul
C. E. R. Alves, Universidade São Judas Tadeu
E. N. Cáceres, Universidade Federal de Mato Grosso do Sul
S. W. Song, Universidade de São Paulo

Comparison of Entire Genomes 2/12

• Comparison of genomes is useful to investigate common functionalities of the corresponding organisms.
• Our purpose is twofold:
  – Use parallel computing so that more expensive alignment methods (dynamic programming) can be used.
  – Locate and compare not only homologous genes, but also the regions between corresponding homologous genes.
• As an example, we compare:
  – Xanthomonas axonopodis pv. citri, with 5,175,554 base pairs and 4,313 protein-coding genes
  – Xanthomonas campestris pv. campestris, with 5,076,187 base pairs and 4,182 protein-coding genes.

Motivations and Previous Works 3/12

• Homology: two genes share a common evolutionary past.
• Often, similarity between two DNA or amino-acid sequences may imply homology.
• Homology in turn may determine function.

Thus: Similarity → homology → function

Rasera, Setubal, Almeida et al. [2002] compare the whole genomes of Xanthomonas axonopodis pv. citri and Xanthomonas campestris pv. campestris and conclude that both share more than 80% of the genes.

Comparison Strategy - Main Ideas 4/12

Given two genomes G and H and their gene locations:

1. Find and label pairs of homologous genes.

[Figure: genes labeled 1 2 3 4 5 6 along g, joined by lines to their homologs along h, which appear in the order 1 2 4 3 5 6.]

2. Find the non-crossing pairs of homologous genes.

[Figure: the non-crossing pairs 1 2 3 5 6 along g and h, with g, g′ and h, h′ marked.]

Comparison Strategy (continued) 5/12

[Figure: the non-crossing pairs 1 2 3 5 6 along g and h, with g, g′ and h, h′ marked, repeated from the previous slide.]

3. Align each pair of homologous genes.
4. Align each pair of intergenic regions (e.g. [g, g′] and [h, h′]).
5. Join all alignments.

Comparison Strategy - Details 6/12

1: Find pairs of homologous genes:
   For each gene g of G, obtain the gene h of H such that
   DP-score(g, h) = max { DP-score(g, w) for all w of H }
2: Label the homologous genes of G:
   Label the homologous genes of G as 1, 2, ..., m in the same order as their positions in the genome G. Let LabelG denote the sequence of labels obtained in this step.
3: Label the corresponding homologous genes of H:
   For all pairs of homologous genes (g, h), g of G, h of H, label gene h with the same label as g. Let LabelH denote the sequence of labels obtained in this step.
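Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for DP-score (it just counts agreeing positions), genes are plain strings listed in genome order, and every gene of G is assumed to pick a different best hit in H. The all-against-all search in step 1 evaluates the score once per (g, w) pair, which for the two Xanthomonas genomes is 4,313 × 4,182 = 18,036,966 evaluations.

```python
def toy_score(a, b):
    """Hypothetical placeholder for DP-score: count positions where a and b agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def best_hits(genes_g, genes_h, score=toy_score):
    """Step 1: for each gene index i of G, pick the index j of H with maximal score."""
    pairs = []
    for i, g in enumerate(genes_g):
        j = max(range(len(genes_h)), key=lambda k: score(g, genes_h[k]))
        pairs.append((i, j))
    return pairs


def label_genes(pairs):
    """Steps 2 and 3: build LabelG and LabelH from the homologous pairs.

    Genes of G get labels 1..m in genome order; each gene of H inherits the
    label of its partner, and LabelH lists those labels in H's genome order.
    Assumes no two genes of G share the same best hit.
    """
    label_g = list(range(1, len(pairs) + 1))
    by_h_position = sorted((j, label) for label, (_, j) in zip(label_g, pairs))
    label_h = [label for _, label in by_h_position]
    return label_g, label_h


# Toy genomes: the second and third genes are swapped in H.
genes_g = ["aaaa", "cccc", "gggg", "tttt"]
genes_h = ["aaaa", "gggg", "cccc", "tttt"]
pairs = best_hits(genes_g, genes_h)
label_g, label_h = label_genes(pairs)
print(label_g, label_h)
```

With these toy genes the swap shows up as a crossing in LabelH, which is the situation step 4 resolves.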

7/12

4: Find the non-crossing pairs of homologous genes:
   Obtain LCS(LabelG, LabelH). The LCS obtained contains only the non-crossing pairs.
5: Align each pair of homologous genes:
   For each non-crossing homologous pair (g, h), do DP-align(g, h).
6: Align each pair of intergenic regions:
   For each intergenic region [g, g′], where g and g′ are two consecutive genes of the LCS in G, obtain the corresponding intergenic region [h, h′] in H and do DP-align([g, g′], [h, h′]).
7: Join all the alignments:
   Concatenate the alignments of the homologous genes and the intergenic regions, in the same order they appear in the genomes.
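Step 4 and the pairing used in step 6 can be illustrated with a textbook sequential LCS. This is a sketch, not the parallel LCS algorithm used in this work; the function names are hypothetical. The label sequences are the ones from the figure on slide 4/12. An LCS is not always unique: here 1 2 4 5 6 is equally long, and the tie-break below happens to return the 1 2 3 5 6 shown in the figure.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    n, m = len(a), len(b)
    # table[i][j] = length of an LCS of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    out = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i][j + 1] >= table[i + 1][j]:
            j += 1  # on ties, skip an element of b
        else:
            i += 1
    return out


def consecutive_pairs(kept):
    """Label pairs (g, g') of consecutive kept genes; each names one intergenic region."""
    return list(zip(kept, kept[1:]))


label_g = [1, 2, 3, 4, 5, 6]
label_h = [1, 2, 4, 3, 5, 6]
kept = lcs(label_g, label_h)
print(kept)                     # labels of the non-crossing pairs
print(consecutive_pairs(kept))  # regions to align in step 6
```

Each pair returned by `consecutive_pairs` identifies a region [g, g′] in G and, through the shared labels, the corresponding region [h, h′] in H.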

Computing Similarity of Two Strings 8/12

A simple example of string alignment:

A      a  c  t  t  c  a  –  t
C      a  t  t  c  –  a  c  g
Score  1  0  1  0  0  1  0  0   (total 3)

A      a  c  t  t  c  a  –  t
C      a  –  t  t  c  a  c  g
Score  1  0  1  1  1  1  0  0   (total 5)

Using dynamic programming (gives better quality alignments):

[Figure: dynamic-programming grid running from (0, 0) to (8, 10), with the string b a a b c a b c a b along one axis; cell (i, j) is shown together with its neighbors (i − 1, j − 1), (i − 1, j) and (i, j − 1).]
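The example above can be checked with a short Python sketch. The scoring scheme is an assumption inferred from the example (1 for a matching column, 0 for a mismatch or a gap); the slides do not state the scoring used in the actual experiments. `column_scores` scores a given alignment column by column, and `similarity` fills the grid so that cell (i, j) is computed from cells (i − 1, j − 1), (i − 1, j) and (i, j − 1).

```python
GAP = "-"


def column_scores(top, bottom):
    """Score a given alignment column by column: 1 for a match, 0 otherwise."""
    return [1 if x == y and x != GAP else 0 for x, y in zip(top, bottom)]


def similarity(a, b, match=1, mismatch=0, gap=0):
    """Best alignment score of strings a and b by dynamic programming."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s[i][j] = max(diagonal, s[i - 1][j] + gap, s[i][j - 1] + gap)
    return s[n][m]


print(sum(column_scores("acttca-t", "attc-acg")))  # first alignment
print(sum(column_scores("acttca-t", "a-ttcacg")))  # second alignment
print(similarity("acttcat", "attcacg"))            # best score under this scheme
```

Under this assumed scoring, the second alignment in the example already reaches the best possible score for the two strings.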

The Parallel Solution 9/12

• Finding homologous pairs (the most time-consuming phase): compare all the genes of one genome with all the genes of the other, which takes more than 18 million alignments by dynamic programming. Two types of parallelism are used:
  – The master distributes the alignment tasks to slave processors.
  – When the lengths of the sequences to be aligned exceed 5,000 base pairs, parallel dynamic programming is used.
• Finding the non-crossing homologous gene pairs: we used a parallel LCS (longest common subsequence) algorithm. (We could have used an LIS, longest increasing subsequence, algorithm.)
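The remark about LIS can be made concrete. Because LabelG is 1, 2, ..., m in increasing order, a common subsequence of LabelG and LabelH is the same thing as an increasing subsequence of LabelH, provided each label occurs exactly once in LabelH (an assumption of this sketch). The code below is the classic sequential O(m log m) method based on binary search, not a parallel algorithm and not something taken from the slides; the function name is hypothetical.

```python
from bisect import bisect_left


def lis(seq):
    """Return one longest strictly increasing subsequence of seq."""
    tails = []        # tails[k]: smallest last value of an increasing run of length k + 1
    tails_index = []  # position in seq of tails[k]
    previous = [-1] * len(seq)  # back-pointers used to rebuild the subsequence
    for i, x in enumerate(seq):
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
            tails_index.append(i)
        else:
            tails[k] = x
            tails_index[k] = i
        previous[i] = tails_index[k - 1] if k > 0 else -1
    out = []
    i = tails_index[-1] if tails_index else -1
    while i != -1:
        out.append(seq[i])
        i = previous[i]
    return out[::-1]


label_h = [1, 2, 4, 3, 5, 6]
print(lis(label_h))  # labels of the non-crossing pairs
```

On the label sequence from the figure on slide 4/12 this selects the same genes as the LCS of LabelG and LabelH.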

The Parallel Platform Used 10/12

• 64-node Beowulf cluster: low-cost microcomputers with 256 MB RAM, 256 MB swap memory, Intel Pentium III CPU at 448.956 MHz, 512 KB cache.
• 100 Mb Fast Ethernet switch.
• Code in standard ANSI C and LAM-MPI Version 6.5.6.

Preliminary Implementation Results 11/12

• Finding homologous pairs (most time-consuming):
  – Sequential solution using Blast and EGG: 3 hours.
  – Parallel solution using dynamic programming: 1 hour 15 minutes.
• Finding non-crossing pairs (surely not the dominant step):
  – Sequential solution using Blast and EGG: not available.
  – Parallel solution using dynamic programming: 20 seconds.

Conclusion 12/12

We compared the whole genomes of two organisms:

• We exploited parallelism in two ways:
  – A standard master-slave approach distributes comparison tasks (sequential dynamic programming) to slave processors.
  – To compute the similarity between two sequences longer than 5,000 base pairs, we used parallel dynamic programming.
• The gain does not seem to be so significant; however, we used a dynamic programming approach that gives better quality results.
• Our comparison strategy also compares the intergenic regions between two consecutive homologous genes in each genome. The relevance of this from a biological viewpoint is yet to be investigated.
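The two levels of parallelism summarized above can be sketched as a task-distribution pattern. This is an illustration only: the actual code is ANSI C with LAM-MPI, whereas this sketch uses a Python thread pool as a stand-in for the master and its workers, so it shows the structure and not the performance. All names are hypothetical, `score_pair` is a simple stand-in score, and treating a pair as "long" when either sequence exceeds the threshold is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

LONG_THRESHOLD = 5000  # base pairs, the limit quoted in the slides


def score_pair(g, h):
    """Stand-in sequential DP score (LCS length), computed with two rows."""
    prev = [0] * (len(h) + 1)
    for x in g:
        cur = [0]
        for j, y in enumerate(h, 1):
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + (x == y)))
        prev = cur
    return prev[-1]


def split_tasks(genes_g, genes_h, threshold=LONG_THRESHOLD):
    """Separate all (i, j) gene pairs into short and long alignment tasks."""
    short, long_ = [], []
    for i, g in enumerate(genes_g):
        for j, h in enumerate(genes_h):
            (long_ if max(len(g), len(h)) > threshold else short).append((i, j))
    return short, long_


def run_master(genes_g, genes_h, workers=4, threshold=LONG_THRESHOLD):
    """Master: farm out short tasks to workers, handle long tasks separately."""
    short, long_ = split_tasks(genes_g, genes_h, threshold)
    scores = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda t: score_pair(genes_g[t[0]], genes_h[t[1]]), short)
        for task, value in zip(short, results):
            scores[task] = value
    for i, j in long_:
        # Placeholder: this is where a parallel DP would run on several
        # processors; the sketch simply reuses the sequential score.
        scores[(i, j)] = score_pair(genes_g[i], genes_h[j])
    return scores


# Tiny threshold so that the toy data exercises both paths.
print(run_master(["acgt", "aaaaaa"], ["acg", "tt"], threshold=5))
```

Keeping the short tasks independent is what makes the master-slave distribution straightforward; only the long pairs need cooperation among processors.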
