Accelerating Gene Set Enrichment Analysis on CUDA-Enabled GPUs

Accelerating Gene Set Enrichment Analysis on CUDA-Enabled GPUs Bertil Schmidt Christian Hundt

Contents
• Gene Set Enrichment Analysis (GSEA)
  – Background
  – Algorithmic details
• cudaGSEA
• Performance evaluation

GSEA and Bioinformatics
• High-throughput technologies generate large-scale gene expression data sets
  – RNA-Seq
  – Microarrays
• GSEA uses annotated gene sets to mine a given gene expression matrix
  – MSigDB contains over 10K signatures, each containing around 100 gene identifiers on average
• Typical GSEA study:
  – identify metabolic pathways that are differentially changed in human type-2 diabetes

Gene Set Enrichment Analysis
• Reveals correlation between gene sets and diseases using gene expression data
• State-of-the-art tool with over 10,000 citations
• Written in (multi-threaded) Java
• Highly time-consuming
  – analyzing 20,639 genes measured in 200 patients with 4,726 pathways and 1M permutations takes around 1 week with the GSEA 2.2.2 software (BroadGSEA) on a CPU
• We present
  – GSEA parallelization on a GPU using CUDA (cudaGSEA)
  – cudaGSEA is around two orders of magnitude faster than BroadGSEA

GSEA Algorithm – Gene Ranking
• Gene expression matrix D obtained from RNA-Seq or microarray experiments
• For each gene i and patient j with associated (binary) phenotype C, the expression value D[i, j] is stored
• Diseases are driven by complex gene interactions, so simply reporting top-ranked genes produces many false positives
• Domain experts provide sets of genes that might explain the observed phenotypes

GSEA Algorithm – Enrichment Score
• The enrichment score (ES) measures the correlation between a given gene set S and the calculated gene ranking
  – Report the maximum deviation of a running sum over the ranked genes
  – The sum increases if we hit a member of S and decreases otherwise
• How significant is ES = 0.857? → p-value calculation using permutation testing
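The running-sum idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the cudaGSEA or BroadGSEA implementation: the function name `enrichment_score` and the equal step sizes (1/#hits up, 1/#misses down) are assumptions made for the example, and the published GSEA method may additionally weight hits by the ranking metric.

```python
def enrichment_score(ranked_genes, gene_set):
    """Return the signed maximum deviation from zero of the running sum.

    ranked_genes: gene identifiers ordered by the ranking metric.
    gene_set: identifiers belonging to the gene set S.
    """
    members = set(gene_set)
    n_hits = sum(1 for gene in ranked_genes if gene in members)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("S must contain some, but not all, of the ranked genes")

    step_up = 1.0 / n_hits      # assumed: every hit carries the same weight
    step_down = 1.0 / n_misses  # assumed: every miss carries the same weight

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        # Increase on a member of S, decrease otherwise.
        running += step_up if gene in members else -step_down
        if abs(running) > abs(best):
            best = running
    return best


ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_heavy = enrichment_score(ranking, {"g1", "g2"})     # S sits at the top of the ranking
bottom_heavy = enrichment_score(ranking, {"g5", "g6"})  # S sits at the bottom
```

With these step sizes the running sum always returns to zero at the end of the list, so the score lies in [-1, 1] and its sign tells whether S clusters at the top or the bottom of the ranking.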

GSEA Algorithm – Permutation testing

GSEA Algorithm
• Histogram of 1,000,000 enrichment scores obtained by permuting patient phenotypes (tails marked at -|ES| and |ES|)
• Estimate the p-value by counting events in both tails
• Why so many permutations?
  – When testing 1,000 gene sets at significance level p < 0.001, we need more than 1,000,000 samples to reject the null hypothesis at 1,000 × p < 0.001 (Bonferroni correction)
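The tail counting and the Bonferroni arithmetic can be checked with a short Python sketch. The helper names `two_sided_p_value` and `min_permutations` are hypothetical, and the sketch assumes the plain estimator "extreme count divided by number of permutations", whose smallest non-zero value is 1/n.

```python
from fractions import Fraction
import math


def two_sided_p_value(observed_es, null_scores):
    """Fraction of permutation scores at least as extreme as the observed ES."""
    if not null_scores:
        raise ValueError("need at least one permutation score")
    extreme = sum(1 for score in null_scores if abs(score) >= abs(observed_es))
    return extreme / len(null_scores)


def min_permutations(n_gene_sets, alpha):
    """Smallest n for which 1/n is below the Bonferroni-corrected level.

    alpha should be a Fraction (or a string such as "0.001") so that the
    arithmetic is exact; a binary float would introduce rounding error.
    """
    per_set_alpha = Fraction(alpha) / n_gene_sets
    return math.floor(1 / per_set_alpha) + 1


p = two_sided_p_value(0.8, [0.9, -0.85, 0.1, -0.2])
needed = min_permutations(1000, Fraction(1, 1000))
```

For 1,000 gene sets at level 0.001 the per-set threshold is 10^-6, so `needed` comes out just above one million, matching the figure on the slide.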

CUDA Parallelization
• Transpose D to ensure coalesced memory accesses
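Why a transposition helps can be seen from plain index arithmetic. The sketch below rests on assumptions that the slide does not state: D is stored row-major as genes × patients, and neighbouring GPU threads work on neighbouring genes while reading the same patient column. The function names are hypothetical.

```python
def flat_index(row, col, n_cols):
    """Offset of element (row, col) in a row-major array with n_cols columns."""
    return row * n_cols + col


def neighbour_stride(n_genes, n_patients, transposed):
    """Distance in elements between the addresses read by the threads
    handling gene 0 and gene 1 for the same patient (patient 0)."""
    if transposed:
        # Transposed layout is patients x genes: genes run along a row.
        return flat_index(0, 1, n_genes) - flat_index(0, 0, n_genes)
    # Original layout is genes x patients: genes run down a column.
    return flat_index(1, 0, n_patients) - flat_index(0, 0, n_patients)


original_stride = neighbour_stride(20639, 200, transposed=False)
transposed_stride = neighbour_stride(20639, 200, transposed=True)
```

Under these assumptions neighbouring threads are 200 elements apart in the original layout and adjacent after the transposition, which is the access pattern that the hardware can combine into few memory transactions.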

CUDA Parallelization

CUDA Implementation Details
• Support for single precision and double precision
• Resulting matrix of enrichment scores (#gene sets × #permutations) can be large
  – e.g. 5K × 1M × 8 B = 40 GB
• p-value estimation, family-wise error rate (FWER), and normalized enrichment score (NES) computation can be accomplished on the GPU with (sum/max) reduction kernels without the need for storing this matrix
• For false discovery rate (FDR) computation, this matrix is transferred to the CPU for post-processing
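The memory estimate and the idea of reducing scores as they are produced can be illustrated as follows. This is a sequential Python sketch, not a GPU kernel; `es_matrix_bytes` and `RunningReduction` are hypothetical names, and which exact quantities cudaGSEA reduces for FWER and NES is not specified on the slide.

```python
def es_matrix_bytes(n_gene_sets, n_permutations, bytes_per_score):
    """Size of the full enrichment-score matrix in bytes."""
    return n_gene_sets * n_permutations * bytes_per_score


class RunningReduction:
    """Keeps count, sum, and maximum of the null scores of one gene set
    in constant memory, so the scores themselves need not be stored."""

    def __init__(self, observed_es):
        self.observed_abs = abs(observed_es)
        self.count = 0      # number of permutation scores seen
        self.extreme = 0    # scores at least as extreme as the observed ES
        self.total = 0.0    # sum reduction
        self.max_abs = 0.0  # max reduction

    def update(self, null_es):
        self.count += 1
        self.total += null_es
        self.max_abs = max(self.max_abs, abs(null_es))
        if abs(null_es) >= self.observed_abs:
            self.extreme += 1


full_matrix = es_matrix_bytes(5_000, 1_000_000, 8)  # double precision

reduction = RunningReduction(observed_es=0.5)
for score in (0.2, -0.9, 0.5):
    reduction.update(score)
```

The full matrix needs 4 × 10^10 bytes (40 GB), whereas the running reduction holds a handful of numbers per gene set regardless of the number of permutations.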

cudaGSEA Features
• Reading data sets directly in Broad Institute-compatible file formats
• Supporting several local deviation measures
  – Mean-based measures (difference/quotient/log-quotient of means)
  – Mean and standard deviation-based measures (signal-to-noise ratio, t-tests, one/two-pass estimation)
  – Numerically stable summation schemes for local measures and ES (Kahan etc.)
• Package for the R framework and standalone application
• Multi-threaded CPU version in C++ using OpenMP
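One of the listed measures, the signal-to-noise ratio, can be sketched with a two-pass estimate and Kahan summation. The code is an illustration under stated assumptions: the ratio is taken as (mean of phenotype 1 minus mean of phenotype 0) divided by the sum of the two sample standard deviations (n - 1 in the denominator), and any further adjustments applied by BroadGSEA or cudaGSEA are not reproduced. All function names are hypothetical.

```python
import math


def kahan_sum(values):
    """Compensated summation: carries the rounding error of each addition."""
    total = 0.0
    compensation = 0.0
    for value in values:
        corrected = value - compensation
        new_total = total + corrected
        compensation = (new_total - total) - corrected
        total = new_total
    return total


def mean_and_sd(values):
    """Two-pass estimate: mean first, then squared deviations from it."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples per phenotype")
    mean = kahan_sum(values) / n
    squared = kahan_sum([(v - mean) ** 2 for v in values])
    return mean, math.sqrt(squared / (n - 1))


def signal_to_noise(expression, phenotypes):
    """Local deviation measure for one gene across all patients."""
    group1 = [x for x, c in zip(expression, phenotypes) if c == 1]
    group0 = [x for x, c in zip(expression, phenotypes) if c == 0]
    mean1, sd1 = mean_and_sd(group1)
    mean0, sd0 = mean_and_sd(group0)
    if sd0 + sd1 == 0.0:
        raise ValueError("both phenotypes have zero variance")
    return (mean1 - mean0) / (sd0 + sd1)


s2n = signal_to_noise([1.0, 2.0, 3.0, 5.0, 7.0, 9.0], [0, 0, 0, 1, 1, 1])
```

Computing this value for every gene and sorting by it yields a gene ranking of the kind the enrichment score operates on.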

Performance Evaluation
• GSE19429 dataset
  – collapsed to 20,639 gene symbols; 200 patients (183 cases + 17 controls)
• Hallmark: 50 gene sets
  – MSigDB 5.1 smallest gene set collection
• GeForce Titan X (single precision) / Tesla K40c (double precision, ECC off), CUDA 7.5
• 10-core Xeon E5-2660v3 @ 2.60 GHz, 20 threads, Ubuntu 14.04, gcc 4.8.4, 64-bit OpenJDK
• BroadGSEA v.2.2.2

Performance Evaluation
• GSE19429 dataset
  – collapsed to 20,639 gene symbols; 200 patients (183 cases + 17 controls)
• C2: 4,726 gene sets
  – MSigDB 5.1 largest gene set collection
• GeForce Titan X (single precision) / Tesla K40c (double precision, ECC off), CUDA 7.5
• 10-core Xeon E5-2660v3 @ 2.60 GHz, 20 threads, Ubuntu 14.04, gcc 4.8.4, 64-bit OpenJDK
• BroadGSEA v.2.2.2

Conclusion
• High-throughput technologies establish the need for scalable bioinformatics tools that can process large-scale gene expression data sets
• CUDA is a suitable technology to address this need
• cudaGSEA on one GPU achieves a speedup of around two orders of magnitude versus BroadGSEA on a CPU
  – analyzing 20,639 genes measured in 200 patients with 4,726 pathways and 1M permutations takes around 1 week with GSEA 2.2.2 on a Xeon E5-2660v3 CPU, but less than 1 hour on a GeForce Titan X
• Source code available at:
  – https://github.com/gravitino/cudaGSEA
• Group website:
  – https://www.hpc.informatik.uni-mainz.de/

Thank you!
Accelerating Gene Set Enrichment Analysis on CUDA-Enabled GPUs
Bertil Schmidt, Christian Hundt
Institute of Computer Science, Johannes Gutenberg University Mainz
{bertil.schmidt, hundt}@uni-mainz.de
