A Quick Guide to the Analytics Behind Genomic Testing
Elaine Gee, PhD, Director, Bioinformatics, ARUP Laboratories
Learning Objectives
• Catalogue various types of bioinformatics analyses that support clinical genomic testing
• Enumerate types of variant classes
• Describe algorithmic methods for variant detection by NGS
• Compare and contrast germline and somatic clinical bioinformatics pipeline methodologies
• Discuss the infrastructure complexity required to support analytics for NGS testing at scale in the cloud
• Explain validation strategies for bringing best-in-class pipelines into clinical production
The Human Reference Genome
~3B base pairs structured into 23 chromosome pairs
• 3,098,825,702 base pairs
• 20,805 coding genes
• 14,181 pseudogenes
• 196,501 gene transcripts
Why Genomic Testing?
1 in 4 cancer deaths are from lung cancer. ~222,500 new cases of lung cancer in the U.S. in 2017.
[Figure label: KRAS G12D]
Genomic Testing
Short-Read Sequencers: Illumina, Ion-Torrent
Long-Read Sequencers: PacBio, NanoPore
10X, Nanostring
Types of NGS Testing: Somatic & Germline
Types of NGS Testing: cfDNA and ctDNA
• Non-Invasive Prenatal Testing (NIPT): Trisomy 21 (Down Syndrome)
• Liquid Biopsy: EGFR, non-small cell lung cancer
Types of NGS Testing: Infectious Disease
Types of NGS Testing: RNA-Seq
• Alternate transcripts
• Novel gene isoforms
• Gene fusions
Role of Clinical Bioinformatics
• Build pipelines
• Provide supplemental information for clinical interpretation and quality control
Other computationally heavy analytics are involved in evaluating:
• Design of new panels
• Identification of genetic patterns in patient cohorts
• Discovery of gene pathways
Understanding bioinformatics requires understanding the laboratory process.
Variant Calling Pipeline
Steps in a bioinformatics pipeline:
1. Sample demultiplexing
2. Read alignment
3. BAM polishing steps
4. Variant calling
5. Variant annotations
6. QC calculations
Step 1: Sample Demultiplexing
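The slide names demultiplexing without showing how it works. A minimal sketch of the idea, assuming each read carries an index (barcode) sequence and each sample has one expected barcode: reads are binned by the closest barcode within a mismatch tolerance. The function names, the barcodes, and the one-mismatch default are illustrative assumptions, not any vendor tool's behavior.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_barcodes, max_mismatches=1):
    """Bin reads by sample barcode (hypothetical helper).

    reads: iterable of (index_sequence, read_sequence) tuples.
    sample_barcodes: dict mapping sample name -> expected barcode.
    Reads matching zero or several barcodes go to "undetermined".
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        hits = [
            sample
            for sample, barcode in sample_barcodes.items()
            if len(barcode) == len(index_seq)
            and hamming(barcode, index_seq) <= max_mismatches
        ]
        sample = hits[0] if len(hits) == 1 else "undetermined"
        bins[sample].append(read_seq)
    return dict(bins)


# Made-up barcodes and reads, for illustration only.
barcodes = {"sample_A": "ACGTAC", "sample_B": "TGCATG"}
reads = [("ACGTAC", "GATTACA"), ("TGCATT", "CCGGTTAA"), ("GGGGGG", "TTTT")]
result = demultiplex(reads, barcodes)
print(result)
```

The second read has one sequencing error in its index and is still assigned to sample_B; the third matches nothing and is set aside.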
Step 2: Read Alignment
Read Alignment; SAM Format
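For readers unfamiliar with the SAM format named on this slide, here is a small sketch that splits one alignment line into the eleven mandatory tab-separated fields defined by the SAM specification and checks two FLAG bits. The record itself is made up for illustration, and `parse_sam_line` is a hypothetical helper rather than part of any library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities, Phred+33 encoded


def parse_sam_line(line):
    """Parse the 11 mandatory fields of a SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(rec):
    return bool(rec.flag & 0x4)


def is_duplicate(rec):
    return bool(rec.flag & 0x400)


# Illustrative record, not real data.
line = "read1\t99\tchr1\t100\t60\t10M\t=\t260\t170\tACGTACGTAC\tIIIIIIIIII"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.mapq, is_unmapped(rec), is_duplicate(rec))
```

In practice a library such as pysam or htslib would do this parsing; the sketch only shows what the columns mean.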
Step 3: BAM Polishing Steps
• PCR Duplicate Removal
• Base Quality Score Recalibration
Q30 Phred base quality score → 99.9% base call accuracy → 1 error in 1,000 bases
[Figure: reads aligned against a reference across a homopolymer run; label "homopolymer +1%"]
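The Q30 figure on this slide follows from the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. A short sketch of the conversion in both directions, plus decoding of a Phred+33 quality string as used in FASTQ and SAM files (the function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_phred33(qual_string):
    """Convert a Phred+33 ASCII quality string into integer scores."""
    return [ord(c) - 33 for c in qual_string]


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")

print(decode_phred33("I?5"))
```

Base quality score recalibration adjusts these reported scores so that they better match empirically observed error rates; the sketch covers only the score definition, not the recalibration model.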
Step 4: Variant Calling by Class
Example Variant Calling Algorithms
SNV/Insertion/Deletion
• Position based callers (GATK Unified Genotyper, LoFreq)
• Local de-novo assembly of haplotypes (GATK Haplotype Caller)
• Graph based variant callers (Graph Genome)
• Neural networks (Deep Variant)
Duplications/Structural variants
• Pattern growth approach (Pindel)
• Split reads, discordant paired-end reads (Manta, DELLY, CREST)
• kmer + de-novo assembly (BreaKmer)
• Unmapped or partially mapped reads (ITD Assembler)

• Depth of coverage + background error correction + principal component analysis (XHMM)
• Tumor/normal
• B allele frequency
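To make "position based" calling concrete, here is a deliberately simplified sketch: at one reference position, count the bases observed in the overlapping reads and report any non-reference base whose allele fraction clears a threshold. This is a toy illustration, not how GATK Unified Genotyper or LoFreq work internally; production callers also model base quality, strand bias, and sequencing error. The thresholds and the function name are assumptions.

```python
from collections import Counter


def call_position(ref_base, pileup_bases, min_depth=10, min_alt_fraction=0.05):
    """Return (ref, alt, allele_fraction) tuples for one pileup column.

    pileup_bases: string of base calls from reads covering the position.
    Positions with fewer than min_depth reads are not called.
    """
    depth = len(pileup_bases)
    if depth < min_depth:
        return []
    counts = Counter(pileup_bases)
    calls = []
    for base, n in sorted(counts.items()):
        if base != ref_base and n / depth >= min_alt_fraction:
            calls.append((ref_base, base, n / depth))
    return calls


# Illustrative pileup: 90 reference reads and 10 reads supporting G>A.
pileup = "G" * 90 + "A" * 10
calls = call_position("G", pileup)
print(calls)
```

A germline pipeline would typically expect allele fractions near 50% or 100%, while a somatic pipeline has to detect much lower fractions, which is why the threshold is a parameter here.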
Example KRAS G12D Variant Call
Step 5: Variant Annotations
VCF variant → Annotated variant
The VCF variant includes:
• chromosome
• position
• ID
• reference base
• alternate base
• variant quality
• filter flags
• meta-information
• information and individual format fields
The annotated variant includes:
• Gene
• Gene Transcript
• Nucleotide change (cdot)
• Protein change (pdot)
• Variant Type
  – Polymorphism
  – Synonymous
  – Non-synonymous
    • Nonsense
    • Missense
    • Frame shift
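The VCF fields listed on this slide correspond to the fixed columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by FORMAT and per-sample columns. A minimal parsing sketch follows; the example line and its values are invented for illustration, and `parse_vcf_line` is a hypothetical helper, not a library function.

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        if item == ".":
            continue
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True

    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }
    # FORMAT keys describe the colon-separated values in each sample column.
    if len(fields) > 9:
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record


# Made-up record for illustration.
line = "chr12\t100\t.\tC\tT\t50\tPASS\tDP=200;AF=0.12\tGT:AD\t0/1:176,24"
rec = parse_vcf_line(line)
print(rec["chrom"], rec["pos"], rec["ref"], rec["alt"], rec["info"], rec["samples"])
```

Annotation then adds the gene, transcript, cdot, pdot, and variant type to each such record, usually with a dedicated annotation tool and transcript database.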
Step 6: QC Calculations
[Figure: Sample, Sequencing, QC metrics, Report]
Run Level
• Cluster density
• Base call quality score
• Fragment size
Sample Level
• Depth coverage
• Uniformity
• Mapping quality
• Duplication rate
Variant Level
• Novel variants
• Known variants
• Transition-to-transversion ratio
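As one example of a variant-level metric, the transition-to-transversion (Ti/Tv) ratio can be computed directly from a list of SNVs. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A↔G, C↔T); every other single-base substitution is a transversion. The input list below is illustrative.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Compute the Ti/Tv ratio from (ref, alt) single-base pairs."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = sum(1 for ref, alt in snvs
             if ref != alt and (ref, alt) not in TRANSITIONS)
    return ti / tv if tv else float("inf")


# Illustrative SNV list: four transitions and two transversions.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(snvs)
print(ratio)
```

A ratio far from the value expected for the assay's target region can indicate an excess of false-positive calls, which is why it is tracked as a QC metric.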
Sample-Level QC Metrics for Targeted Capture
• Uniformity
• Minimum Depth of Coverage
• On-Target
• Off-Target
• Mapping Quality
• Read Depth
• Duplication Rate
[Figure labels: Intronic regions, Exon, Gene]
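Several of these metrics can be derived from per-base depths over the target regions. The sketch below is one possible formulation, with assumptions stated plainly: uniformity is taken as the fraction of target bases covered at 0.2x the mean depth or more (laboratories define this differently), the minimum depth threshold of 100 is arbitrary, and the depths are made up.

```python
def coverage_metrics(depths, min_depth=100, uniformity_fraction=0.2):
    """Summarize per-base depths across target regions.

    depths: list of read depths, one per targeted base.
    """
    n = len(depths)
    mean_depth = sum(depths) / n
    at_min_depth = sum(d >= min_depth for d in depths) / n
    uniformity = sum(d >= uniformity_fraction * mean_depth for d in depths) / n
    return {
        "mean_depth": mean_depth,
        "fraction_at_min_depth": at_min_depth,
        "uniformity": uniformity,
    }


def on_target_rate(on_target_reads, total_mapped_reads):
    """Fraction of mapped reads that overlap the capture targets."""
    return on_target_reads / total_mapped_reads


# Illustrative depths for ten targeted bases.
depths = [200, 180, 220, 150, 40, 250, 210, 160, 190, 10]
metrics = coverage_metrics(depths)
print(metrics)
print(on_target_rate(850, 1000))
```

Here two bases fall below the minimum depth, and one of them is also low enough to count against uniformity.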
Compute Infrastructure for Data Processing
How does a bioinformatics job get executed in clinical production?
[Figure: Job 1, Job 2, Job 3, Job 4, Job 5]
Data Storage Infrastructure
Cloud based: Object Storage (“hot”), Archive (“cold”), Database
99.9% availability, 99.999999999% durability
• BCL (500–550 GB): HiSeq 4000, raw output for a single run
• FASTQs (12–15 GB): exome, ~150x
• FASTQs, BAMs, VCFs: bioinformatics
In-house
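The availability and durability figures on this slide are easier to interpret as concrete numbers. The sketch below assumes that availability is measured over a year and that durability is an annual per-object survival probability, which is how cloud providers commonly state it; the object count is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_pct):
    """Hours per year a service may be unavailable at a given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def expected_objects_lost_per_year(n_objects, durability_pct):
    """Expected number of objects lost per year at a given durability."""
    return n_objects * (1 - durability_pct / 100)


downtime = annual_downtime_hours(99.9)
lost = expected_objects_lost_per_year(10_000_000, 99.999999999)
print(f"99.9% availability: about {downtime:.2f} hours of downtime per year")
print(f"11 nines durability, 10 million objects: about {lost:.4f} objects lost per year")
```

Under these assumptions, 99.9% availability permits under nine hours of downtime a year, and eleven nines of durability means that storing ten million objects would be expected to lose roughly one object every ten thousand years.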
Bioinformatics Pipeline Validation
• Recommendations from CAP/AMP
  – 17 recommendation statements
  – 59 variants tested in each variant class
• Example Statistics
  – Positive percentage agreement (PPA)
  – Positive predictive value (PPV)
  – Reproducibility
  – Allelic fraction lower limit of detection
• Validation required prior to use in clinical production
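PPA and PPV follow directly from counts of true positives (TP), false negatives (FN), and false positives (FP) against a truth set: PPA = TP / (TP + FN) and PPV = TP / (TP + FP). The sketch below uses hypothetical counts. It also shows one reading of the figure of 59: it matches a zero-failure sample size at 95% confidence and 95% reliability, since 0.95^59 is just under 0.05. That interpretation is an assumption here and is not stated on the slide.

```python
import math


def ppa(tp, fn):
    """Positive percentage agreement: TP / (TP + FN)."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


def zero_failure_sample_size(confidence=0.95, reliability=0.95):
    """Smallest n such that reliability**n <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))


# Hypothetical validation counts for one variant class.
tp, fn, fp = 58, 1, 2
print(f"PPA = {ppa(tp, fn):.3f}")
print(f"PPV = {ppv(tp, fp):.3f}")
print(zero_failure_sample_size())
```

These statistics are normally reported per variant class, since a pipeline can perform well on SNVs and poorly on insertions, deletions, or structural variants.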
Summary
• Catalogue various types of bioinformatics analyses that support clinical genomic testing
• Enumerate types of variant classes
• Describe algorithmic methods for variant detection by NGS
• Compare and contrast germline and somatic clinical bioinformatics pipeline methodologies
• Discuss the infrastructure complexity required to support analytics for NGS testing at scale in the cloud
• Explain validation strategies for bringing best-in-class pipelines into clinical production
Questions?
Elaine Gee, PhD, Director of Bioinformatics, ARUP Laboratories
elaine.gee@aruplab.com