Accelerating Sequencing with GPU Computing and Deep Learning Johnny Israeli
COMPUTE TRENDS: GPU-Computing perf: 1.5X per year. Single-threaded perf: 1.5X per year, then 1.1X per year. Stack: APPLICATIONS, ALGORITHMS, SYSTEMS, CUDA, ARCHITECTURE. [Figure: performance over time, log scale from 10^2 to 10^7]
COMPUTE TRENDS Publications 3
Sequencing Trends 4
SEQUENCING TRENDS: Sequencing Data Growing in Volume and Complexity. Decreasing Cost; Increasing Read Length; Rise of Single Cell Data
SEQUENCING TRENDS 6
SEQUENCING TRENDS: Worldwide Annual Sequencing Capacity [Figure: capacity on a log scale from 10^12 to 10^21, plotted over 2000 to 2025]
Sequencing Data Types 8
SEQUENCING TRENDS: Genomics *ENA Database 9
SEQUENCING TRENDS: Transcriptomics *ENA Database 10
SEQUENCING TRENDS: Epigenomics *ENA Database 11
SEQUENCING TRENDS: Nanopore Long Read Sequencing *ENA Database 12
Variant Calling 13
Variant Calling
Workflow: Sequence DNA, Illumina Reads, Map to Reference
Reference  TGGATTTGAAAAC G GAGCAAATGACTG
Reads      TGGATTTGAAAAC G GAGCAAATGACTG
           TGGATTTGAAAAC G GAGCAAATGACTG
           TGGATTTGAAAAC A GAGCAAATGACTG
           TGGATTTGAAAAC A GAGCAAATGACTG
           TGGATTTGAAAAC A GAGCAAATGACTG
           TGGATTTGAAAAC G GAGCAAATGACTG
Likely heterozygous variant
● Identify sites with potential mismatch
● True variants or instrument errors?
● SNPs or insertions or deletions?
● Heterozygous or homozygous variants?
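The questions on this slide can be made concrete with a naive pileup scan. The following is a minimal sketch, not the method used by GATK, DeepVariant, or Parabricks: it assumes gap-free reads of equal length, ignores base qualities and indels, and uses hypothetical allele-fraction thresholds (0.2 and 0.8) chosen only for illustration. The sequences are the ones shown on the slide.

```python
from collections import Counter

# Flanking sequence and reference base taken from the slide.
LEFT, RIGHT = "TGGATTTGAAAAC", "GAGCAAATGACTG"
REFERENCE = LEFT + "G" + RIGHT
# Six aligned reads: three carry the reference G, three carry an A.
READS = [LEFT + base + RIGHT for base in "GGAAAG"]


def candidate_sites(reference, reads, min_alt_fraction=0.2, hom_fraction=0.8):
    """Scan pileup columns and label every column that has a mismatch.

    The thresholds are hypothetical; production callers use probabilistic
    or learned models instead of fixed cut-offs.
    """
    sites = []
    for pos, ref_base in enumerate(reference):
        column = Counter(read[pos] for read in reads)
        alt_counts = {b: n for b, n in column.items() if b != ref_base}
        if not alt_counts:
            continue  # every read agrees with the reference here
        alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
        fraction = alt_n / len(reads)
        if fraction < min_alt_fraction:
            call = "likely error"
        elif fraction < hom_fraction:
            call = "heterozygous"
        else:
            call = "homozygous"
        sites.append((pos, ref_base, alt_base, fraction, call))
    return sites


print(candidate_sites(REFERENCE, READS))
```

With half of the reads supporting each allele, the single mismatching column is labeled heterozygous, matching the annotation on the slide.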
Example Pileup Input Data Read Index Heterozygous SNP Position
GATK Variant Calling Pipeline: Align to Reference, Sort, Mark Duplicates, Calibrate, Call Variants, Joint Call, Filter Variants
Accelerated GATK Variant Calling Pipeline: Align to Reference, Sort, Mark Duplicates, Calibrate, Call Variants, Joint Call, Filter Variants. Parabricks stages: Alignment, Preprocessing, Variant Calling, Joint Genotyping, Variant Processing
Accelerated Variant Calling Pipelines: Whole Genome Processing in Minutes. Tools: Alignment + Preprocessing, Haplotype Caller, Mutect2, GenotypeGVCF, DeepVariant. Parabricks pipelines: Germline, Copy Number, Somatic. Stages: Alignment, Preprocessing, Variant Calling, Joint Genotyping, Variant Processing
Deep Averaging Network (DAN)
DAN Development ● PyTorch-based 1D model ● Learned embeddings of bases ● Encoding variant proposals ● Downsample easy variant candidates during training
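To illustrate the idea of a deep averaging network over learned base embeddings, here is a minimal forward-pass sketch. It is written in plain Python rather than PyTorch so that it runs without dependencies, and it is not the model from the talk: the weights are random and untrained, biases are omitted, and the embedding size, hidden size, and three output classes are hypothetical choices.

```python
import math
import random

BASES = "ACGTN"  # vocabulary for the embedding table (N = unknown base)


def make_dan(embed_dim=4, hidden_dim=8, n_classes=3, seed=0):
    """Create random (untrained) parameters; all sizes are hypothetical."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": matrix(len(BASES), embed_dim),  # one vector per base
        "w1": matrix(hidden_dim, embed_dim),     # hidden layer weights
        "w2": matrix(n_classes, hidden_dim),     # output layer weights
    }


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dan_forward(params, sequence):
    """Embed each base, average the embeddings, then apply a small MLP."""
    vectors = [params["embed"][BASES.index(base)] for base in sequence]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    hidden = [max(0.0, h) for h in matvec(params["w1"], mean)]  # ReLU
    logits = matvec(params["w2"], hidden)
    peak = max(logits)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


params = make_dan()
print(dan_forward(params, "ACGTTGCA"))
```

Because the embeddings are averaged before the dense layers, the output does not depend on the order of the bases. That is the defining trade-off of this architecture: it is cheap to evaluate but discards positional information unless position is encoded in the input.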
Variant Calling Errors
Variant Calling Error Breakdown
Atac Sequencing 23
DNA: Open And Closed Closed DNA inactive Open DNA active Open DNA changes affect development & disease 24
Atac Sequencing Mapping Open DNA Sites Sequence Map & Count Open DNA Reads Open DNA site Open DNA site 25
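The "map and count" step can be sketched as building a per-base coverage track from aligned read intervals and then reporting stretches of high coverage. This is a simplified illustration under stated assumptions: reads are given as half-open (start, end) intervals on a single region, and the fixed depth threshold in open_sites is hypothetical, whereas dedicated peak callers use statistical models.

```python
def coverage_track(read_intervals, region_length):
    """Per-base read depth from half-open (start, end) alignments."""
    diff = [0] * (region_length + 1)  # difference array: +1 at start, -1 at end
    for start, end in read_intervals:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    track, depth = [], 0
    for delta in diff[:region_length]:
        depth += delta  # running sum turns the differences into depth
        track.append(depth)
    return track


def open_sites(track, min_depth):
    """Return half-open intervals where depth stays at or above min_depth."""
    sites, start = [], None
    for pos, depth in enumerate(track):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            sites.append((start, pos))
            start = None
    if start is not None:
        sites.append((start, len(track)))
    return sites


track = coverage_track([(0, 3), (2, 5), (2, 4)], region_length=6)
print(track, open_sites(track, min_depth=2))
```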
Atac-seq Limits: Atac-seq signal degrades due to: • Less sequencing • Low quality sample preparation • Small cell populations
AtacWorks SDK: AI-Denoised ATAC-seq Data Processing. Workflow: Sequence Open DNA; Map, Align, Count; Low Quality Sequencing Denoised with AtacWorks AI. [Figure: High Quality Sequencing vs. Low Quality Sequencing]
AtacWorks Model: Denoising + Open Chromatin Identification. Input: Noisy ATAC-Seq data. Network: Resblock 1 to Resblock 7, each resblock built from Conv and ReLU layers with a skip connection (⊕). Outputs: Predicted Coverage (Evaluation: MSE, Pearson correlation) and Predicted open chromatin (Evaluation: AUPRC)
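The two regression metrics named on this slide, MSE and Pearson correlation, compare a predicted coverage track against a clean reference track. A minimal sketch in plain Python follows; the function names and the toy tracks are illustrative and are not part of the AtacWorks API.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length coverage tracks."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def pearson(pred, target):
    """Pearson correlation coefficient between two equal-length tracks."""
    n = len(target)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant track")
    return cov / math.sqrt(var_p * var_t)


# Toy tracks: a clean signal and a hypothetical denoised prediction.
clean = [0, 1, 4, 9, 4, 1, 0]
denoised = [0, 2, 4, 8, 4, 2, 0]
print(mse(denoised, clean), pearson(denoised, clean))
```

MSE is sensitive to the absolute scale of the prediction, while Pearson correlation only measures how well the shape of the track is recovered, which is one reason to report both.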
Denoising Low Sequencing Data AtacWorks identifies open chromatin from low-coverage data 50 Million Reads 1 Million Reads 1 Million Reads + AtacWorks 29
Genome-wide Sequencing Reduction AtacWorks Reduces Sequencing Requirements 3x 1M Reads 1M Reads + AtacWorks 30
Denoising Low Quality Sample AtacWorks improves signal-to-noise ratio in low quality samples Distance from transcription start site 31
Denoising Single Cell Atac-seq Data AtacWorks Improves Open DNA Detection From Few Cells Open DNA Detection auPRC 90 Cells 90 Cells With AtacWorks 32
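The auPRC reported here summarizes how well per-position scores rank truly open positions above closed ones. One common estimator is average precision, sketched below in plain Python. The labels and scores are made-up toy values, and tied scores are not treated specially in this sketch.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive.

    labels are 1 (open) or 0 (closed); scores are predicted probabilities.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy example: two open positions, ranked first and third.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Unlike accuracy, this metric stays informative when open positions are rare, which is the usual situation genome-wide.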
AtacWorks SDK. SDK on Clara Genomics: https://github.com/clara-genomics/AtacWorks AtacWorks Preprint: https://www.biorxiv.org/content/10.1101/829481v1 Benefits: Reduce Sequencing Cost, Improve Sample Quality, Increase Single Cell Resolution. [Figure panels: 1M Reads vs. 1M Reads + AtacWorks; 90 Cells vs. 90 Cells + AtacWorks]
Genome Assembly 34
Long Read De Novo Assembly. Step 1: Mapping to detect overlaps between reads. Step 2: Overlap graph traversal to generate draft genomes. Step 3: Error correction to polish genomes. [Figure labels: Original reads, Draft genome; example sequences ACTCGGTCATTCGTGCTTTATC and GCGTTATCGTCTACTTCGT]
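Step 1, detecting overlaps between reads, can be illustrated with an exact suffix-prefix match. This is a toy sketch only: real long reads are noisy, so tools such as minimap2 use approximate, minimizer-based mapping instead of exact string comparison. The two reads below are hypothetical fragments cut from the first example sequence on the slide, and the minimum overlap length is an arbitrary choice.

```python
def suffix_prefix_overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no exact overlap of at least min_length exists.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a, b, min_length=3):
    """Join two reads across their overlap, or return None if none is found."""
    overlap = suffix_prefix_overlap(a, b, min_length)
    return a + b[overlap:] if overlap else None


# Hypothetical reads that overlap by the five bases "CATTC".
read_1 = "ACTCGGTCATTC"
read_2 = "CATTCGTGCTTTATC"
print(suffix_prefix_overlap(read_1, read_2), merge_reads(read_1, read_2))
```

Merging the two reads across their overlap recovers the 22-base sequence from the slide. An assembler repeats this idea over all read pairs to build the overlap graph that Step 2 traverses.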
Genome Assembly Workflow Genome Assembly Pipeline Overlap Assemble Align Polish DL Polish MiniMap MiniASM Racon x 5 Medaka 36
Accelerated Genome Assembly Workflow Before ClaraGenomicsAnalysis Genome Assembly Pipeline Overlap Assemble Align Polish DL Polish MiniMap MiniASM Racon x 5 Medaka cuDNN 37
Accelerated Genome Assembly Workflow ClaraGenomicsAnalysis 0.1 Genome Assembly Pipeline Overlap Assemble Align Polish DL Polish MiniMap MiniASM Racon x 5 Medaka cudaPOA cuDNN 38
Accelerated Genome Assembly Workflow ClaraGenomicsAnalysis 0.2 Genome Assembly Pipeline Overlap Assemble Align Polish DL Polish MiniMap MiniASM Racon x 5 Medaka cudaAligner cudaPOA cuDNN 39
Accelerated Genome Assembly Workflow ClaraGenomicsAnalysis 0.3 Genome Assembly Pipeline Overlap Assemble Align Polish DL Polish MiniMap MiniASM Racon x 5 Medaka cudaAligner cudaPOA cuDNN cudaMapper 40
ClaraGenomicsAnalysis SDK Enabling Accelerated Genome Assembly Bacteria Genome Assembly Acceleration Assembly Pipeline Overlap Assemble Align Polish DL Polish MiniMap MiniASM Racon x 5 Medaka cudaAligner cudaPOA cuDNN cudaMapper Azure v32 CPU V100 GPU 41
CLARA GENOMICS SW: Open Source CUDA-Accelerated Sequencing Analysis Tools. APPLICATIONS: Reference Applications (BASECALLING, GENOME ASSEMBLY, AI-DENOISED ATAC-SEQ); Integration with 3rd Party Applications and Workflows. C++ and Python APIs: ClaraGenomicsAnalysis SDK (Optimized C++ API, Python API); AtacWorks SDK (Transfer Learning, Inference). CUDA Accelerated HPC and CUDA Deep Learning Modules: cudaAligner, cudaMapper, cudaPOA, Genomics I/O, Reference Models
Useful Links
• Parabricks: https://www.parabricks.com
• ClaraGenomicsAnalysis
  • SDK on GitHub: https://github.com/clara-genomics/ClaraGenomicsAnalysis
  • C++ API Examples: cudapoa, cudaaligner
  • Python API Examples: cudapoa, cudaaligner
• AtacWorks
  • SDK on GitHub: https://github.com/clara-genomics/AtacWorks
  • AtacWorks Preprint: https://www.biorxiv.org/content/10.1101/829481v1
• 3rd party integrations:
  • Racon: https://github.com/lbcb-sci/racon
  • Raven: https://github.com/lbcb-sci/raven
  • Bonito: https://github.com/nanoporetech/bonito
• Additional GPU Accelerated Genomics Applications:
  • Kipoi Model Zoo: https://ngc.nvidia.com/catalog/containers/hpc:kipoi
  • SigProfiler: https://github.com/AlexandrovLab/SigProfilerExtractor