UPC++ for Bioinformatics: A Case Study Using Genome-Wide Association - PowerPoint PPT Presentation

UPC++ for Bioinformatics: A Case Study Using Genome-Wide Association Studies
Jan C. Kässens*, Jorge González-Domínguez**, Lars Wienbrandt*, Bertil Schmidt**
*Department of Computer Science, Christian-Albrechts-University of Kiel, Germany, {jka,lwi}@informatik.uni-kiel.de
**Parallel and Distributed Architectures Group, Johannes Gutenberg University of Mainz, Germany, {j.gonzalez,bertil.schmidt}@uni-mainz.de
IEEE International Conference on Cluster Computing (Cluster 2014)

Outline
1. Introduction
2. Methodology
3. UPC++ Implementation
4. Experimental Evaluation
5. Conclusion

Introduction: Genome-Wide Association Studies (I)
Analyses of genetic influence on diseases
- M individuals: K cases and C controls
- N genetic markers, Single Nucleotide Polymorphisms (SNPs)
- 3 genotypes:
  - Homozygous wild (w, AA, 0)
  - Heterozygous (h, Aa, 1)
  - Homozygous variant (v, aa, 2)

Introduction: Genome-Wide Association Studies (II)

         Cases              Controls
SNP 1    0 1 2 0 1 2 0 1    2 0 1 2 0 1 2 1
SNP 2    0 1 1 0 2 0 0 0    1 2 2 1 0 1 1 2
SNP 3    0 0 0 0 0 0 0 0    1 2 1 1 1 2 1 1
SNP 4    0 1 0 1 0 1 0 1    2 2 2 2 1 1 1 1
SNP 5    0 2 2 2 0 1 1 1    1 0 0 1 1 0 2 2
SNP 6    1 0 1 0 1 0 1 0    1 2 1 2 1 2 2 1

Introduction: Genome-Wide Association Studies (and III)
Definition: two SNPs present epistasis (interaction) if:
- Their joint genotype frequencies show a statistically significant difference between cases and controls, which potentially explains the effect of the genetic variation leading to disease.
- The difference between cases and controls shown by the joint values is significantly higher than when using only the individual SNP values.

Introduction: BOOST (BOolean Operation-based Screening and Testing)
- Binary traits
- Exhaustive search
- Statistical regression
- Good accuracy (used by biologists)
- Returns a list of SNP pairs with high interaction probability
- Fastest available tool. On an Intel Core i7 at 3.20 GHz:
  - 40,000 SNPs and 3,200 individuals (about 800 million pairs): 51 minutes
  - 500,000 SNPs and 5,000 individuals (about 125 billion pairs, a moderate size): estimated 7 days

Introduction: GBOOST
- CUDA version for GPUs
- Same accuracy as BOOST
- 40,000 SNPs and 6,400 individuals (about 800 million pairs): 28 seconds on a GTX Titan
- 500,000 SNPs and 5,000 individuals (about 125 billion pairs, a moderate size): 1 hour on a GTX Titan
- High-throughput genotyping technologies collect a few million SNPs of an individual within a few minutes → expected datasets with 5M SNPs and 10,000 individuals
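The pair counts quoted for BOOST and GBOOST follow from the number of unordered pairs among N SNPs, N(N-1)/2. The short sketch below reproduces that arithmetic for the dataset sizes mentioned on the slides; the function name `snp_pairs` is only an illustrative choice.

```python
def snp_pairs(n_snps: int) -> int:
    """Number of unordered SNP pairs an exhaustive search must test."""
    return n_snps * (n_snps - 1) // 2


# Dataset sizes taken from the slides: 40,000, 500,000 and 5M SNPs.
for n in (40_000, 500_000, 5_000_000):
    print(f"{n:>9,} SNPs -> {snp_pairs(n):,} pairs")
```

The quadratic growth is what makes the expected 5M-SNP datasets so much more demanding than the 500,000-SNP case: ten times more SNPs means roughly a hundred times more pairs.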

Introduction: UPC++ (I)
- Unified Parallel C++
- Novel extension of ANSI C++
  Y. Zheng, A. Kamil, M. Driscoll, H. Shan, and K. Yelick. UPC++: a PGAS Extension for C++. In Proc. 28th IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS'14), Phoenix, AZ, USA, 2014.
- Follows the Partitioned Global Address Space (PGAS) programming model
- Single Program Multiple Data (SPMD) execution model
- Works on shared and distributed memory systems

Introduction: UPC++ (and II)
- Global memory is logically partitioned among processes
- Processes can directly access (read/write) any part of the global memory
- Memory with affinity to a process is usually mapped on the same node (faster accesses)

Methodology: Creation of Contingency Tables (I)
For each SNP pair → number of occurrences of each combination of genotypes

Cases      SNP2=0   SNP2=1   SNP2=2
SNP1=0     n_000    n_010    n_020
SNP1=1     n_100    n_110    n_120
SNP1=2     n_200    n_210    n_220

Controls   SNP2=0   SNP2=1   SNP2=2
SNP1=0     n_001    n_011    n_021
SNP1=1     n_101    n_111    n_121
SNP1=2     n_201    n_211    n_221

Methodology: Creation of Contingency Tables (and II)

         Cases              Controls
SNP 4    0 1 0 1 0 1 0 1    2 2 2 2 1 1 1 1
SNP 6    1 0 1 0 1 0 1 0    1 2 1 2 1 2 2 1

Cases      SNP6=0   SNP6=1   SNP6=2
SNP4=0     0        4        0
SNP4=1     4        0        0
SNP4=2     0        0        0

Controls   SNP6=0   SNP6=1   SNP6=2
SNP4=0     0        0        0
SNP4=1     0        2        2
SNP4=2     0        2        2
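As a minimal sketch of this step, the following counts genotype combinations for one SNP pair, using the SNP 4 and SNP 6 vectors from the example. It assumes the first 8 individuals are the cases and the remaining 8 the controls; the function name `contingency_tables` and the list-of-lists layout are illustrative choices, not the authors' implementation.

```python
def contingency_tables(snp_a, snp_b, n_cases):
    """Return (cases, controls) 3x3 count tables for one SNP pair.

    Rows index the genotype of snp_a, columns the genotype of snp_b.
    Assumption: the first n_cases individuals are cases, the rest controls.
    """
    cases = [[0] * 3 for _ in range(3)]
    controls = [[0] * 3 for _ in range(3)]
    for idx, (a, b) in enumerate(zip(snp_a, snp_b)):
        table = cases if idx < n_cases else controls
        table[a][b] += 1
    return cases, controls


# Genotype vectors of SNP 4 and SNP 6 from the example (0, 1 or 2 per individual).
snp4 = [0, 1, 0, 1, 0, 1, 0, 1, 2, 2, 2, 2, 1, 1, 1, 1]
snp6 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 2, 1, 2, 1, 2, 2, 1]

cases, controls = contingency_tables(snp4, snp6, n_cases=8)
for name, table in (("Cases", cases), ("Controls", controls)):
    print(name)
    for row in table:
        print("  ", row)
```

Each table sums to the number of individuals in its group, which is a quick sanity check on any implementation of this step.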

Methodology: Filtering Stage (I)
Measuring interaction via log-linear models

Log-Linear Measure (I)

$$\hat{L}_S - \hat{L}_H = N \sum_{i,j,k} \hat{\pi}_{ijk} \log \frac{\hat{\pi}_{ijk}}{\hat{p}_{ijk}}$$

- $\hat{L}_S$: log-likelihood of the saturated regression model
- $\hat{L}_H$: log-likelihood of the homogeneous association model
- $\hat{\pi}_{ijk}$: joint distribution obtained under the saturated model
- $\hat{p}_{ijk}$: distribution obtained under the homogeneous association model

Methodology: Filtering Stage (II)
Measuring interaction via log-linear models

Log-Linear Measure (II)

$$\hat{L}_S - \hat{L}_H = N \sum_{i,j,k} \hat{\pi}_{ijk} \log \frac{\hat{\pi}_{ijk}}{\hat{p}_{ijk}}$$

- $T$: the threshold for epistasis
- If $\hat{L}_S - \hat{L}_H > T$ ⇒ epistasis
- Computationally expensive: $\hat{p}_{ijk}$ is computed through iterative methods
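The slides state only that the homogeneous association distribution is obtained iteratively. One standard way to fit a model with all two-way terms and no three-way term is iterative proportional fitting (IPF) on the three two-way margins; the sketch below uses IPF as an assumption, and it is not necessarily the scheme used by BOOST. The names `fit_homogeneous` and `log_linear_measure`, the fixed iteration count, and the example counts are all hypothetical.

```python
import math
from itertools import product

# Cells (i, j, k): genotype of SNP 1, genotype of SNP 2, k = 0 cases / 1 controls.
CELLS = list(product(range(3), range(3), range(2)))


def fit_homogeneous(pi, iterations=100):
    """IPF: rescale p until its (i,j), (i,k) and (j,k) margins match those of pi."""
    p = {c: 1.0 / len(CELLS) for c in CELLS}
    for _ in range(iterations):
        for axes in ((0, 1), (0, 2), (1, 2)):
            target, current = {}, {}
            for c in CELLS:
                key = tuple(c[a] for a in axes)
                target[key] = target.get(key, 0.0) + pi[c]
                current[key] = current.get(key, 0.0) + p[c]
            for c in CELLS:
                key = tuple(c[a] for a in axes)
                p[c] = p[c] * target[key] / current[key] if current[key] > 0 else 0.0
    return p


def log_linear_measure(counts, iterations=100):
    """N * sum(pi * log(pi / p)), skipping empty cells (0 * log 0 = 0)."""
    n = sum(counts.values())
    pi = {c: counts.get(c, 0) / n for c in CELLS}
    p = fit_homogeneous(pi, iterations)
    return n * sum(pi[c] * math.log(pi[c] / p[c]) for c in CELLS if pi[c] > 0)


# Hypothetical counts with an XOR-like pattern that differs between cases and controls.
counts = {c: 10 for c in CELLS}
for c in ((0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)):
    counts[c] += 50
print(log_linear_measure(counts))
```

A pair would be flagged when the returned value exceeds the threshold T. The inner loop over iterations is what makes this measure costly when repeated for billions of pairs.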

Methodology: Filtering Stage (III)
Kirkwood Superposition Approximation (KSA)

$$\hat{L}_S - \hat{L}_{KSA} = N \sum_{i,j,k} \hat{\pi}_{ijk} \log \frac{\hat{\pi}_{ijk}}{\hat{p}^{K}_{ijk}}$$

$$\hat{p}^{K}_{ijk} = \frac{1}{\eta} \, \frac{\pi_{ij\cdot}\,\pi_{i\cdot k}\,\pi_{\cdot jk}}{\pi_{i\cdot\cdot}\,\pi_{\cdot j\cdot}\,\pi_{\cdot\cdot k}}, \qquad \eta = \sum_{i,j,k} \frac{\pi_{ij\cdot}\,\pi_{i\cdot k}\,\pi_{\cdot jk}}{\pi_{i\cdot\cdot}\,\pi_{\cdot j\cdot}\,\pi_{\cdot\cdot k}}$$

Upper bound: $\hat{L}_S - \hat{L}_H \le \hat{L}_S - \hat{L}_{KSA}$
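Unlike an iterative fit, the KSA distribution is available in closed form from the one-way and two-way margins. The sketch below evaluates the KSA measure directly from a table of counts; the function name `ksa_measure`, the dictionary layout keyed by (i, j, k), and the example counts are illustrative assumptions rather than the authors' code.

```python
import math
from itertools import product

# Cells (i, j, k): genotype of SNP 1, genotype of SNP 2, k = 0 cases / 1 controls.
CELLS = list(product(range(3), range(3), range(2)))


def ksa_measure(counts):
    """N * sum(pi * log(pi / pK)) with pK from the Kirkwood superposition formula."""
    n = sum(counts.values())
    pi = {c: counts.get(c, 0) / n for c in CELLS}

    def margin(axes):
        # Sum pi over every index not listed in axes.
        m = {}
        for c in CELLS:
            key = tuple(c[a] for a in axes)
            m[key] = m.get(key, 0.0) + pi[c]
        return m

    ij, ik, jk = margin((0, 1)), margin((0, 2)), margin((1, 2))
    mi, mj, mk = margin((0,)), margin((1,)), margin((2,))

    raw = {}  # unnormalised Kirkwood product for each cell
    for i, j, k in CELLS:
        denom = mi[(i,)] * mj[(j,)] * mk[(k,)]
        raw[(i, j, k)] = ij[(i, j)] * ik[(i, k)] * jk[(j, k)] / denom if denom > 0 else 0.0
    eta = sum(raw.values())  # normalisation constant

    # pK = raw / eta, so pi / pK = pi * eta / raw; empty cells contribute nothing.
    return n * sum(pi[c] * math.log(pi[c] * eta / raw[c]) for c in CELLS if pi[c] > 0)


# Hypothetical counts with an XOR-like pattern that differs between cases and controls.
counts = {c: 10 for c in CELLS}
for c in ((0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)):
    counts[c] += 50
print(ksa_measure(counts))
```

Because the KSA value bounds the log-linear measure from above, a pair whose KSA value does not exceed T cannot exceed T under the exact measure either, so it can be discarded without running the iterative fit.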
