Next Generation Sequencing in Molecular Diagnostics
Wilfred van IJcken, PhD
Erasmus MC Center for Biomics
Nov 2 2017, Molecular Diagnostics Course XI
Learning objectives
- Next generation sequencing (NGS): the basics
- Illumina sequencing technology
- Terminology
- Enrichment technology
- Clinical applications
- Targeted gene panels vs exome vs whole genome
- NIPT
- Future directions
Next next next generation sequencing…
1st generation sequencing technique: amplified multiple-molecule sequencing
- Sanger sequencing
2nd generation sequencing techniques: amplified single-molecule sequencing
- 454 sequencing (Roche)
- SBS sequencing (Illumina)
- SOLiD sequencing (Applied Biosystems/Life Technologies)
- Ion Torrent (Life Technologies)
3rd generation sequencing techniques: single-molecule sequencing
- Helicos tSMS
- PacBio SMRT (real-time DNA sequencing)
- Nanopore technologies
NGS systems on the market Desktop High Throughput Special
Sequence technology dynamics Desktop High Throughput Special
What is next generation sequencing?
- Sequencing technology developed after Sanger
- Millions of reads in parallel (MPS = massively parallel sequencing)
- Shorter (<400 bp) sequencing reads
- Enables analysis of complex mixtures of DNA or RNA
- Enables a genome-wide approach
- Different vendors with different approaches
NGS flow: Intake -> Isolate -> Library -> Sequence -> Report
[Figure labels under the stages: ID, sex, disease, phenotype; blood, plasma, saliva, FFPE, cells; DNA or RNA; quality, amount; select region of interest; enzymes, PCR, capture; chemistry, signal; yield; variation detection; match phenotype?]
Illumina systems
[Figure labels: data amount from 8 Gb to 6 Tb per run; run costs; purchase cost. Systems shown: HiSeq X Ten, NovaSeq6000, HiSeq 4000, HiSeq 2500, NextSeq 500, MiSeq, MiniSeq]
Simplified sample preparation
[Figure labels: DNA; RNA, reverse transcriptase; Adaptor 1; Adaptor 2]
Bridge amplification: each DNA molecule hybridizes at a different location in the flowcell lane
Clustering and Sequencing
- Cluster growth
- Sequencing: each base has a different fluorescent dye coupled
- Image acquisition at every cycle (1 2 3 4 5 6 7 8 9)
- Base calling: T G C T A C G A T …
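The base-calling step can be illustrated with a toy sketch. It assumes one intensity value per dye channel per cycle and simply picks the brightest channel. The intensity numbers and the name `call_bases` are made up for illustration, and real base callers also correct for effects such as channel crosstalk that are ignored here.

```python
def call_bases(intensities_per_cycle):
    """Return one base per cycle by picking the channel with the highest intensity."""
    return "".join(max(cycle, key=cycle.get) for cycle in intensities_per_cycle)


# Hypothetical intensities for a single cluster over nine sequencing cycles.
example_cycles = [
    {"A": 0.1, "C": 0.2, "G": 0.1, "T": 0.9},
    {"A": 0.2, "C": 0.1, "G": 0.8, "T": 0.1},
    {"A": 0.1, "C": 0.9, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.2, "T": 0.8},
    {"A": 0.9, "C": 0.1, "G": 0.1, "T": 0.2},
    {"A": 0.2, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.2, "G": 0.9, "T": 0.1},
    {"A": 0.8, "C": 0.1, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

read = call_bases(example_cycles)
print(read)
```

The number of cycles equals the length of the resulting read, which is why read length is set by how many cycles are run.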
Output file from basecalling
- Many file types: qseq, fastq, etc. Each system has its own format.
- Large file sizes: ~150 million reads per lane
- Fields per read: Instrument, Run ID, Lane, Tile, X-coord, Y-coord, Index #, Read #, Sequence, Q-score (ASCII character), PF (0,1)
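As a rough illustration of what such a record looks like in code, the sketch below parses one tab-separated line into the fields named on the slide. The field order (instrument, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) is an assumption based on the commonly described qseq layout, and the instrument name and values in the example line are invented.

```python
from dataclasses import dataclass


@dataclass
class QseqRecord:
    instrument: str
    run_id: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    read_number: int
    sequence: str
    quality: str          # one ASCII character per base
    passed_filter: bool   # PF flag: 1 = passed, 0 = failed


def parse_qseq_line(line: str) -> QseqRecord:
    """Split one tab-separated line into the 11 assumed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    (instrument, run_id, lane, tile, x, y,
     index, read_number, sequence, quality, pf) = fields
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return QseqRecord(instrument, run_id, int(lane), int(tile), int(x), int(y),
                      index, int(read_number), sequence, quality, pf == "1")


# Invented example line, for illustration only.
example_line = "HWUSI-EAS100R\t1\t3\t42\t1021\t1988\t0\t1\tTGCTACGAT\thhhhhhhhh\t1\n"
record = parse_qseq_line(example_line)
print(record.lane, record.sequence, record.passed_filter)
```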
Data analysis not trivial due to data volumes and complexity
HiSeq 2000, 200G run
- Image Data: total 32 TB, final 0
- Intensity Data: total 2 TB, final 0 (optionally transferred)
- Base Call / Quality Score Data: total 0.25 TB, final 0.25 TB (1 byte/base (raw), assuming qseq generation offline)
- Alignment Output: total 6 TB (3 TB), final 1.2 TB (remove intermediate files)
GA IIx, 50G run
- Image Data: total 6.9 TB, final 0 (optionally transferred)
- Intensity Data: total 0.93 TB, final 0.93 TB
- Base Call / Quality Score Data: total 0.17 TB, final 0.17 TB
- Alignment Output: total 1.2 TB, final 1.2 TB
Need data storage and compute to handle up to petabytes of data
Core facilities needed
Terminology
Next generation sequencing, AKA:
- Deep sequencing
- MPS = massively parallel sequencing
Cluster
# of sequencing cycles (1 2 3 4 5 6 7 8 9) = read length
Read: T G C T A C G A T …
Single-end, paired-end, index read
- Single read = sequence from one side of the fragment
- Paired-end read = sequence from both sides of the fragment
[Figure labels: Index read (GATCG), Single read, Paired-end read]
Indexing enables sample multiplexing
- Patient 1: index GATCG
- Patient 2: index CGTGA
- Patient 3: index ATCGG
- Patient 4: index TCTCT
Index = different nucleic acid code per sample, introduced during sample prep and read during the index read
Enables multiple samples in one flowcell lane
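A minimal sketch of how index reads can be used to sort reads back to their samples after a multiplexed run. It assumes the four index sequences on the slide belong to Patients 1 to 4 in the order shown, uses exact matching only, and the reads themselves are invented. Production demultiplexers usually also tolerate sequencing errors in the index, which is left out here.

```python
from collections import defaultdict

# Index -> sample table; the pairing follows the order on the slide (an assumption).
SAMPLE_BY_INDEX = {
    "GATCG": "Patient 1",
    "CGTGA": "Patient 2",
    "ATCGG": "Patient 3",
    "TCTCT": "Patient 4",
}


def demultiplex(reads, sample_by_index):
    """Group (index_read, sequence) pairs by sample using exact index matches."""
    bins = defaultdict(list)
    for index_read, sequence in reads:
        sample = sample_by_index.get(index_read, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


# Invented reads from one flowcell lane: (index read, insert read).
lane_reads = [
    ("GATCG", "TGCTACGAT"),
    ("TCTCT", "AACGTTAGC"),
    ("GATCG", "CCGTAGGTA"),
    ("GGGGG", "TTTTTTTTT"),  # index not in the table
]

by_sample = demultiplex(lane_reads, SAMPLE_BY_INDEX)
for sample, sequences in sorted(by_sample.items()):
    print(sample, len(sequences))
```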
Alignment, Mapping
Reference sequence: AAAACGCGCTTAGCCTTT T TTCGACTGTCGAGTGGA A CGCCGCTAGCTAGGCGC
Consensus sequence: AAAACGCGCTTAGCCTTT T TTCGACTGTCGAGTGGA T CGCCGCTAGCTAGGCGC
Aligned reads:
- TAGCCTTT T TTCGACTGTCGAGTGGATCGCCG
- AGCCTTT T TTCGACTGTCGAGTGGATCGCCGC
- GCCTTT G TTCGACTGTCGAGTGGATCGCCGCT
- CCTTT G TTCGACTGTCGAGTGGATCGCCGCTA
Heterozygous SNP: the reads carry T or G at the first marked position
Mismatch: the consensus carries T where the reference has A at the second marked position
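The idea of reading variants off a stack of aligned reads can be sketched as a simple pileup. This is a toy illustration, not a real variant caller: it assumes ungapped reads with known 0-based start positions, the sequences are invented, and the 20% allele-fraction cutoff (`min_fraction`) is an arbitrary assumption.

```python
from collections import Counter


def pileup(reference, aligned_reads):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1
    return columns


def call_variants(reference, aligned_reads, min_fraction=0.2):
    """Report positions where the supported alleles differ from the reference."""
    calls = []
    for pos, counts in enumerate(pileup(reference, aligned_reads)):
        depth = sum(counts.values())
        if depth == 0:
            continue  # no reads here, nothing to call
        alleles = sorted(b for b, n in counts.items() if n / depth >= min_fraction)
        if not alleles or alleles == [reference[pos]]:
            continue
        kind = "heterozygous" if len(alleles) > 1 else "homozygous"
        calls.append((pos, reference[pos], "/".join(alleles), kind))
    return calls


# Invented toy data: (0-based start, read sequence).
reference = "ACGTTAGCA"
reads = [
    (0, "ACGTTAGAA"),
    (0, "ACGGTAGAA"),
    (1, "CGTTAGAA"),
    (2, "GGTAGAA"),
]

calls = call_variants(reference, reads)
for call in calls:
    print(call)
```

In this toy data position 3 shows two alleles in the reads (a heterozygous site), while at position 7 every read disagrees with the reference.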
Read depth (aka depth of coverage)
[Figure: depth values 1, 5 and 7 marked at different positions along the alignment]
AAAACGCGCTTAGCCTTT T TTCGACTGTCGAGTGGA T CGCCGCTAGCTAGGCGC
Aligned reads:
- TAGCCTTT T TTCGACTGTCGAGTGGATCGCCG
- AGCCTTT T TTCGACTGTCGAGTGGATCGCCGC
- GCCTTT G TTCGACTGTCGAGTGGATCGCCGCT
- CCTTT G TTCGACTGTCGAGTGGATCGCCGCTA
- GACTGTCGAGTGGATCGCCGCTAGCTAGG
- CTGTCGAGTGGATCGCCGCTAGCTAGG
Average read depth can differ a lot from the read depth at an individual position!
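The difference between average depth and per-position depth is easy to show with a small sketch. It assumes ungapped reads described only by a 0-based start and a length, and the numbers are invented.

```python
def depth_per_position(ref_length, aligned_reads):
    """Return the number of reads covering each reference position."""
    depth = [0] * ref_length
    for start, length in aligned_reads:
        for pos in range(max(start, 0), min(start + length, ref_length)):
            depth[pos] += 1
    return depth


# Invented reads on a 10-base reference: (0-based start, read length).
reads = [(0, 5), (2, 6), (3, 4)]
depth = depth_per_position(10, reads)

average_depth = sum(depth) / len(depth)
print("per position:", depth)
print("average:", average_depth, "min:", min(depth), "max:", max(depth))
```

Here the average is 1.5x, yet two positions have no coverage at all and others are covered three times, which is why coverage is usually reported per position or as the fraction of bases above a threshold.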
Accuracy, error rate, quality score
Single base error rate = total number of mismatched bases found in mapped sequence reads from a sequencing run, divided by the mappable yield
Quality scores (Q scores / Phred scores)
- derived from an examination of the intensity peaks around each base
- range from 0 to 41; higher corresponds to higher quality
- Q = -10 log10(p), where p is the base call error probability
Quality score   Probability of incorrect base call   Base call accuracy
10 (Q10)        1 in 10                              90%
20 (Q20)        1 in 100                             99%
30 (Q30)        1 in 1000                            99.9%
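The relation Q = -10 log10(p) can be checked with a few lines of code. The conversion functions follow directly from the formula; the ASCII decoding step assumes an offset of 33, which is common in FASTQ files but differs between file formats and instrument generations.

```python
import math


def phred_to_error_probability(q):
    """p = 10^(-Q/10)"""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Q = -10 * log10(p)"""
    return -10 * math.log10(p)


def decode_quality_string(quality, offset=33):
    """Turn one ASCII character per base into integer Q scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")

scores = decode_quality_string("I5+")
print(scores)
```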
NGS systems on the market: Desktop, High Throughput, Special
Different characteristics:
- Sequencing technology
- Read length
- Speed
- Output
- Applications
- Run cost
NGS Applications
- Whole genome de novo sequencing
- Epigenetic profiling (DNA methylation)
- Gene expression analysis
- Discovery of novel transcripts, splice variants, miRNAs
- Protein-DNA/RNA interactions (ChIP-Seq)
- Genomic DNA interactions (3C, 4C, 5C Seq)
- Targeted DNA sequencing
- Exome sequencing
- Whole genome re-sequencing
[Slide label "Clinical use" appears beside the last three applications]
Diagnostic applications
- Targeted sequencing: cardiomyopathies, ciliopathies, cancer hotspot panel, Noonan, neurodegenerative diseases, …
- Exome sequencing: unknown disease, de novo
- Whole genome sequencing: unknown disease, non-exonic
- Non-invasive diagnostics: prenatal plasma, T21 testing (NIPT)
- Cancer sequencing: germline mutations, therapy
- HLA typing: transplantation
Enrichment technology Exome = all coding regions (~ exons) of genome
Choose your baits
- Vendors: Agilent, Nimblegen (Roche), Illumina, IDT, …
- Exome, panel or other targets
- CRE: boosted coverage for ~5000 clinically relevant genes
[Figure labels: CRE, halo, V4, gene]
Exome performance
- Target coverage: >20X coverage for 95% of genes
- Even coverage: read depth distribution
- Specificity of capture
- False pos / neg variants
- High homology genes
Exome data analysis overview
- Mapping: mapping %, on/off target
- Coverage: % >20x, min, max, bases not sequenced
- Sanger: for bases <20x, add Sanger amplicons
- Variants: GATK (SNP + indel), plus low-frequency variants + indel
- Filtering
- Annotation: >100 databases, function
- Copy number: Exome depth
- Inheritance: dominant, recessive, etc.
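The coverage step of this overview can be sketched as follows: report the fraction of target bases at or above 20x and list the stretches below 20x that would need Sanger amplicons. The depth values are invented and the function names are hypothetical; only the 20x threshold comes from the slide.

```python
def coverage_summary(depths, threshold=20):
    """Summarise per-base depth over a target region."""
    covered = sum(1 for d in depths if d >= threshold)
    return {
        "fraction_at_threshold": covered / len(depths),
        "min": min(depths),
        "max": max(depths),
        "bases_not_sequenced": sum(1 for d in depths if d == 0),
    }


def low_coverage_intervals(depths, threshold=20):
    """Return half-open (start, end) intervals where depth is below the threshold."""
    intervals = []
    start = None
    for pos, d in enumerate(depths):
        if d < threshold and start is None:
            start = pos
        elif d >= threshold and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(depths)))
    return intervals


# Invented per-base depths for a short target region.
depths = [25, 30, 18, 5, 22, 40, 0, 0, 21]
summary = coverage_summary(depths)
gaps = low_coverage_intervals(depths)
print(summary)
print("candidate regions for Sanger amplicons:", gaps)
```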
Quality
- High throughput
- ISO 15189/17025 accreditation needed for clinical use in NL
- Sample swap is a real possibility
- Spike-in to uniquely identify each sample after sequencing
[Figure labels: Spike-in, Shear, Capture, Sequencing, QC, QC; A1, B1, C1]
What does a targeted sequencing result look like?
Zooming in on the sequence result
Variation is not only SNP
- SNPs: GATTTAGATCGCGATAGAG vs GATTTAGATCTCGATAGAG. ~0.1% of the genomes of any two individuals differ due to SNPs
- Short InDels: GATT------------GAG vs GATTTAGATCTCGATAGAG
- Structural variants (SVs) [e.g. kb-Mb-sized deletions, insertions, inversions, fusion genes]: more difficult to detect than SNPs; presumably >0.1% of the genome
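The two example alignments on this slide can be classified with a short sketch. It assumes both sequences are already aligned to equal length with '-' marking gaps, and it treats the sequence GATTTAGATCTCGATAGAG as the reference, which is an assumption about the figure. Detecting structural variants needs quite different methods and is not covered.

```python
def classify_differences(ref_aln, alt_aln):
    """List (kind, 0-based start, length) for each difference in a pairwise alignment."""
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    events = []
    pos = 0
    n = len(ref_aln)
    while pos < n:
        r, a = ref_aln[pos], alt_aln[pos]
        if r == a:
            pos += 1
        elif a == "-":
            start = pos
            while pos < n and alt_aln[pos] == "-":
                pos += 1
            events.append(("deletion", start, pos - start))
        elif r == "-":
            start = pos
            while pos < n and ref_aln[pos] == "-":
                pos += 1
            events.append(("insertion", start, pos - start))
        else:
            events.append(("SNP", pos, 1))
            pos += 1
    return events


reference = "GATTTAGATCTCGATAGAG"
snp_events = classify_differences(reference, "GATTTAGATCGCGATAGAG")
indel_events = classify_differences(reference, "GATT------------GAG")
print(snp_events)
print(indel_events)
```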
Recent case report
- 2005: 5-week-old girl hospitalized with RS virus, with artificial respiration
- 2008: Developmental delay, maybe due to brain damage by hypoxia
- 2011: Re-evaluation by clinical geneticist: possibly Sotos syndrome. SNP array, Sanger NSD1, PTEN, AOA, fraX, metabolism: negative
- 2015: Re-evaluation: speech affected. WES trio, filter for ID genes: de novo c.1216C>T, p.Gln406* mutation in MECP2 -> atypical form of RETT syndrome
- 2016: RETT specialist: 5 other girls found with atypical RETT syndrome with C-terminal frameshift mutations in MECP2 (unpublished)
WES helps to solve previously unsolved cases
Evidence is increasing for using WES as first-tier care
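The trio filtering step in this case (keep variants that are in the child, absent from both parents, and located in a gene of interest) can be sketched as a set operation. All variants, positions and gene names below are invented placeholders, and `de_novo_in_panel` is a hypothetical helper, not the pipeline actually used for the case.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: str


def variant_key(variant):
    """Identify a variant by position and alleles, ignoring annotation."""
    return (variant.chrom, variant.pos, variant.ref, variant.alt)


def de_novo_in_panel(child, mother, father, gene_panel):
    """Keep child variants that neither parent carries and that fall in the panel."""
    parental = {variant_key(v) for v in mother} | {variant_key(v) for v in father}
    return [v for v in child
            if variant_key(v) not in parental and v.gene in gene_panel]


# Invented placeholder data.
child = [
    Variant("chr1", 1000, "C", "T", "GENE_A"),
    Variant("chr2", 2000, "G", "A", "GENE_B"),
    Variant("chrX", 3000, "C", "T", "GENE_C"),
]
mother = [Variant("chr1", 1000, "C", "T", "GENE_A")]
father = [Variant("chr5", 5000, "A", "G", "GENE_D")]
id_gene_panel = {"GENE_A", "GENE_C"}

candidates = de_novo_in_panel(child, mother, father, id_gene_panel)
print(candidates)
```

Only the chrX placeholder variant survives: the chr1 variant is inherited from the mother and the chr2 variant lies outside the panel.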
Human and disease, what to sequence?
• Most Mendelian diseases are caused by mutations in the exome
• Exome is only ~1.6% of the human genome (50 Mbp)
                Panel     Exome    Whole genome
Genome          >0.01%    1.6%     95%
Sequencing      1/400x    1x       60x
Interpretation  ++        +        +/-
Validation      ++        +        +/-
Speed           ++        +        -
Cost (est.)     € 500     € 700    € 3000
Whole genome sequencing
[Figure labels: X Ten; Outsource?; $1000 genome, 30x; $1000 genome, 40x]
Comparison of exome and genome sequencing
Non-invasive trisomy testing (NIPT)
- 10 weeks pregnancy, 5% fetal DNA
- Workflow: DNA isolation -> Prepare -> NGS -> Analysis -> Trisomy report
NIPT: determine fetal chromosomal copy number
[Figure: fetal cfDNA and maternal cfDNA from Chr 21, compared between a euploid pregnancy and a fetal trisomy]
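One common way to turn read counts into a copy number call is a z-score on the chromosome 21 read fraction, sketched below. The slide does not say which statistic is used, so this is an assumed approach; the reference fractions, the 1.3% chr21 baseline and the cutoff of 3 are illustrative values only. With a fetal fraction f, a fetal trisomy raises the chr21 signal by roughly a factor (1 + f/2), ignoring the small change in the total read count.

```python
from statistics import mean, stdev


def expected_trisomy_fraction(euploid_fraction, fetal_fraction):
    """Approximate chr21 read fraction when the fetus carries three copies."""
    return euploid_fraction * (1 + fetal_fraction / 2)


def z_score(sample_fraction, reference_fractions):
    """Distance of the sample from euploid reference samples, in standard deviations."""
    return (sample_fraction - mean(reference_fractions)) / stdev(reference_fractions)


# Illustrative chr21 read fractions from euploid reference pregnancies (invented).
reference_fractions = [0.0129, 0.0130, 0.0131, 0.0130, 0.0130]

# 5% fetal DNA, as on the NIPT slide.
sample_fraction = expected_trisomy_fraction(0.0130, 0.05)
z = z_score(sample_fraction, reference_fractions)

print(f"chr21 fraction: {sample_fraction:.6f}")
print(f"z-score: {z:.2f}", "-> elevated" if z > 3 else "-> within reference range")
```

The expected shift is only about 2.5% of the chr21 fraction at 5% fetal DNA, which is why many reads and a tight reference distribution are needed to see it.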
Future of NGS
MinION
- USB-sized sequencer
- One-time use, $900
- 500 nanopores
- > 1 Gbp
- User-defined runtime
- Electrode lifetime is limiting (days)
- No sample prep
- Measure directly from blood