Next Generation Sequencing: The basics
Wilfred van IJcken, Erasmus MC Center for Biomics
Biomedical Research Techniques (XVIth ed.), Nov 6
Learning objectives
Next generation sequencing (NGS): The basics
- Background
- Illumina sequencing technology
- Terminology
Next presentation
- Research applications
- Diagnostic applications
- Future directions
What is next generation sequencing?
- Sequencing technology developed after Sanger sequencing
- Millions of reads in parallel (MPS = massively parallel sequencing)
- Shorter (<400 bp) sequencing reads
- Enables analysis of complex mixtures of DNA or RNA
- Enables a genome-wide approach
- Different vendors with different approaches
NGS systems on the market
- System classes: Desktop, High Throughput, Special
- Different characteristics: sequencing technology, read length, speed, output, applications, run cost
Illumina systems
[Figure: Illumina systems compared by data amount per run (8 Gb up to 6 Tb), run costs and purchase cost. Systems shown: HiSeq X Ten, NovaSeq6000, HiSeq 4000, HiSeq 2500, NextSeq 500, MiSeq, MiniSeq]
NGS flow
Intake, Isolate, Library, Sequence, Report
- Sample types: blood, plasma, saliva, FFPE, cells
- Sample information: ID, sex, disease
- Isolate: DNA or RNA; yield, quality, amount
- Library: select region of interest (enzymes, PCR, capture)
- Sequence: chemistry, signal
- Report: variation detection; match phenotype?
DNA library prep
Sequencing by Synthesis: cluster generation (flowcell, lane)
Bridge amplification
Sequencing incorporated
Sequencing and basecalling (Read 1)
[Figure: image acquisition over cycles 1 to 9 for the bases A, G, T, C, followed by base calling: C A A G T A A C …]
Single-end, paired-end, index read
[Figure: a fragment with its index read (GATCG), single read and paired end read]
- Single read = sequence from one side of the fragment
- Paired end = sequence from both sides of the fragment
Indexing enables sample multiplexing
- Index = different nucleic acid code per sample
- Introduced during sample prep
- Read during the index read
- Enables multiple samples in one flowcell lane
Example indexes shown for Patients 1 to 4: GATCG, CGTGA, ATCGG, TCTCT
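
Sorting the reads from one lane back to their samples by index can be sketched as follows. This is a minimal illustration in Python, not the instrument software's method: it assumes each read arrives as an (index read, sequence) pair, and the pairing of index to patient, the function names and the `max_mismatches` parameter are all hypothetical.

```python
from collections import defaultdict

# The four index sequences come from the slide; which index belongs to
# which patient is an assumption made for this illustration.
SAMPLE_BY_INDEX = {
    "GATCG": "Patient 1",
    "CGTGA": "Patient 2",
    "ATCGG": "Patient 3",
    "TCTCT": "Patient 4",
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, max_mismatches=0):
    """Group (index_read, sequence) pairs by sample.

    A read is assigned only if exactly one known index lies within
    max_mismatches of its index read; otherwise it is 'undetermined'.
    """
    bins = defaultdict(list)
    for index_read, sequence in reads:
        matches = [
            sample
            for index, sample in SAMPLE_BY_INDEX.items()
            if len(index) == len(index_read)
            and hamming(index, index_read) <= max_mismatches
        ]
        sample = matches[0] if len(matches) == 1 else "undetermined"
        bins[sample].append(sequence)
    return dict(bins)


# Hypothetical reads: the third index read contains one sequencing error.
reads = [("GATCG", "AAAA"), ("TCTCT", "CCCC"), ("GATCC", "GGGG")]
strict = demultiplex(reads)
tolerant = demultiplex(reads, max_mismatches=1)
```

With exact matching the third read stays undetermined; allowing one mismatch rescues it, which only works safely when the indexes differ from each other at enough positions.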
Sequence Index 1
Sequence Index 2
Sequence Read 2
[Figure: image acquisition over cycles 1 to 9, followed by base calling: C A A G T A A C …]
Summary sequencing technology
Read order: Read 1, Index 1, Index 2, Read 2
Simplified RNA sample preparation
RNA is converted to DNA by reverse transcriptase, after which Adaptor 1 and Adaptor 2 are attached.
Output file from basecalling
- Many file types: qseq, fastq, etc.
- Each system has its own format
- Large file sizes: >400 million reads per lane
Fields per read: Instrument, Run ID, Lane, Tile, X-coord, Y-coord, Index #, Read #, Sequence (e.g. C A A G T A A C …), Q-score (ASCII character), PF (0,1)
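
To make the "ASCII character = Q-score" encoding concrete, here is a minimal sketch that reads four-line FASTQ records and decodes the quality string. It assumes the Phred+33 offset; older Illumina pipelines used +64, so the offset must match the data. The record name `read1` and the quality string are invented for illustration; only the sequence comes from the slide.

```python
import io

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality_line = handle.readline().rstrip("\n")
        # Each ASCII character encodes one Q-score.
        qualities = [ord(char) - PHRED_OFFSET for char in quality_line]
        yield header[1:], sequence, qualities


# Hypothetical single-record file held in memory.
example = io.StringIO("@read1\nCAAGTAAC\n+\nIIIIII5#\n")
records = list(read_fastq(example))
```

For real data a maintained parser is usually the better choice; the sketch only shows how the text encoding maps to numeric scores.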
Data analysis is not trivial due to data volumes and complexity
Data volume per run, listed as Total / Final (comment)
HiSeq 2000, 200G run
- Image data: 32 TB / 0
- Intensity data: 2 TB / 0 (optionally transferred)
- Base call / quality score data: 0.25 TB / 0.25 TB (1 byte/base raw, assuming qseq generation offline)
- Alignment output: 6 TB (3 TB) / 1.2 TB (remove intermediate files)
GA IIx, 50G run
- Image data: 6.9 TB / 0 (optionally transferred)
- Intensity data: 0.93 TB / 0.93 TB
- Base call / quality score data: 0.17 TB / 0.17 TB
- Alignment output: 1.2 TB / 1.2 TB
150 M reads x 8 lanes x 100 bp x 2 (paired end) = 240 Gbp
Storage and compute needed: core facilities
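
The yield arithmetic on this slide can be checked with a few lines of Python. The function name and the storage estimate are illustrative assumptions; the estimate simply applies the slide's "1 byte/base (raw)" figure to the computed yield.

```python
def run_yield_bp(reads_per_lane, lanes, read_length, paired_end=True):
    """Total sequenced bases in one run."""
    ends = 2 if paired_end else 1
    return reads_per_lane * lanes * read_length * ends


# 150 M reads x 8 lanes x 100 bp x 2 (paired end)
yield_bp = run_yield_bp(150_000_000, 8, 100)
yield_gbp = yield_bp / 1e9

# Rough raw storage at 1 byte per base, in terabytes (1 TB = 1e12 bytes).
raw_base_call_tb = yield_bp * 1 / 1e12

print(f"{yield_gbp:.0f} Gbp, about {raw_base_call_tb:.2f} TB at 1 byte/base")
```

This covers base calls only; images, intensities and alignment output come on top, as the figures above show.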
Terminology
Next generation sequencing, AKA:
- Deep sequencing
- MPS = massively parallel sequencing
Cluster
Read: T G C T A C G A T …
# of sequencing cycles (1 2 3 4 5 6 7 8 9) = read length
Alignment, Mapping
Reference: AAAACGCGCTTAGCCTTT T TTCGACTGTCGAGTGGA A CGCCGCTAGCTAGGCGC
Consensus: AAAACGCGCTTAGCCTTT T TTCGACTGTCGAGTGGA T CGCCGCTAGCTAGGCGC
Reads:               TAGCCTTT T TTCGACTGTCGAGTGGA T CGCCG
                      AGCCTTT T TTCGACTGTCGAGTGGA T CGCCGC
                       GCCTTT G TTCGACTGTCGAGTGGA T CGCCGCT
                        CCTTT G TTCGACTGTCGAGTGGA T CGCCGCTA
Heterozygous SNP: at the first marked position the reads show both T and G.
Mismatch: at the second marked position the reads and the consensus show T where the reference has A.
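
The comparison of aligned reads with the reference can be sketched as a simple pileup. This is an illustrative toy, not a variant caller: the sequences are those on the slide, while the 0-based start positions are inferred from the slide layout and the function names are hypothetical.

```python
from collections import Counter

# Reference from the slide; the two marked bases are kept as separate pieces.
REFERENCE = "AAAACGCGCTTAGCCTTT" + "T" + "TTCGACTGTCGAGTGGA" + "A" + "CGCCGCTAGCTAGGCGC"

# (0-based start on the reference, read sequence).
# Start positions are an assumption inferred from the slide layout.
ALIGNED_READS = [
    (10, "TAGCCTTT" + "T" + "TTCGACTGTCGAGTGGATCGCCG"),
    (11, "AGCCTTT" + "T" + "TTCGACTGTCGAGTGGATCGCCGC"),
    (12, "GCCTTT" + "G" + "TTCGACTGTCGAGTGGATCGCCGCT"),
    (13, "CCTTT" + "G" + "TTCGACTGTCGAGTGGATCGCCGCTA"),
]


def pileup(reads, reference_length):
    """Count the bases observed at each reference position."""
    columns = [Counter() for _ in range(reference_length)]
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    return columns


def candidate_variants(reference, columns):
    """Return {position: base counts} where any read disagrees with the reference."""
    variants = {}
    for position, counts in enumerate(columns):
        if any(base != reference[position] for base in counts):
            variants[position] = dict(counts)
    return variants


variants = candidate_variants(REFERENCE, pileup(ALIGNED_READS, len(REFERENCE)))
for position, counts in sorted(variants.items()):
    print(position, REFERENCE[position], counts)
```

A position where the reads split between two bases is a candidate heterozygous SNP, while one where all reads agree on a non-reference base is a candidate homozygous difference. Real callers also weigh base quality, mapping quality and depth.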
Read depth (aka depth of coverage)
Depths marked along the alignment: 1, 5, 7
AAAACGCGCTTAGCCTTT T TTCGACTGTCGAGTGGA T CGCCGCTAGCTAGGCGC
          TAGCCTTT T TTCGACTGTCGAGTGGA T CGCCG
           AGCCTTT T TTCGACTGTCGAGTGGA T CGCCGC
            GCCTTT G TTCGACTGTCGAGTGGA T CGCCGCT
             CCTTT G TTCGACTGTCGAGTGGA T CGCCGCTA
                        GACTGTCGAGTGGA T CGCCGCTAGCTAGG
                          CTGTCGAGTGGA T CGCCGCTAGCTAGG
The average read depth can differ a lot from the read depth at an individual position!
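
Per-position depth versus average depth can be computed directly from the alignment intervals. The sketch below assumes that the full-length top sequence counts as a read, which is what reproduces the depths 1, 5 and 7 marked on the slide. Start positions (0-based) and lengths were counted from the sequences shown, and the function name is hypothetical.

```python
def depth_per_position(alignments, reference_length):
    """Number of reads covering each reference position."""
    depth = [0] * reference_length
    for start, length in alignments:
        for position in range(start, start + length):
            depth[position] += 1
    return depth


# (0-based start, read length); inferred from the slide layout (assumption).
ALIGNMENTS = [
    (0, 54),   # full-length top sequence, counted as a read
    (10, 32),
    (11, 32),
    (12, 32),
    (13, 32),
    (22, 29),
    (24, 27),
]

depth = depth_per_position(ALIGNMENTS, 54)
average_depth = sum(depth) / len(depth)

print("depth at first position:", depth[0])
print("depth at the T/G position:", depth[18])
print("maximum depth:", max(depth))
print(f"average depth: {average_depth:.2f}")
```

The average sits between the extremes, so a region can look adequately covered on average while individual positions are covered by a single read.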
Accuracy, error rate, quality score
Single base error rate = total number of mismatched bases found in mapped sequence reads from a sequencing run, divided by the mappable yield.
Quality scores (Q scores / phred scores)
- Derived from an examination of the intensity peaks around each base
- Range from 0 to 41; higher corresponds to higher quality
- $Q = -10 \log_{10} p$, where $p$ is the base call error probability
Quality score: probability of incorrect base call, base call accuracy
- 10 (Q10): 1 in 10, 90%
- 20 (Q20): 1 in 100, 99%
- 30 (Q30): 1 in 1000, 99.9%
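
The relation between Q-score and error probability, and the error rate definition, translate directly into code. A minimal sketch with hypothetical function names:

```python
import math


def error_probability(q):
    """Probability that a base call with quality score q is incorrect."""
    return 10 ** (-q / 10)


def quality_score(p):
    """Phred quality score Q = -10 * log10(p) for error probability p."""
    return -10 * math.log10(p)


def single_base_error_rate(mismatched_bases, mapped_bases):
    """Mismatched bases in mapped reads divided by the mappable yield."""
    return mismatched_bases / mapped_bases


for q in (10, 20, 30):
    p = error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.1f}%")
```

Because the scale is logarithmic, every 10 points of Q corresponds to a tenfold drop in the expected error probability.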
Traditional vs NextGen Sequencing
- Sanger sequencing: 1 sequence read per base pair
- NGS: multiple sequence reads per base pair
Erasmus Center for Biomics Genomics core facility at ErasmusMC www.biomics.nl w.vanijcken@erasmusmc.nl LNA