Errors, biases and quality control in next gen sequencing

Errors, biases and Quality control in Next Gen Sequencing Dr David Humphreys d.humphreys@victorchang.edu.au - Lab scientist : Bioinformatician - RNA biologist - small RNAs (miRNA) Victor Chang Cardiac Research Institute, Sydney, Australia


  1. Errors, biases and Quality control in Next Gen Sequencing
     Dr David Humphreys, d.humphreys@victorchang.edu.au
     - Lab scientist : Bioinformatician
     - RNA biologist - small RNAs (miRNA)
     Victor Chang Cardiac Research Institute, Sydney, Australia

  2. Testing hypotheses and theories
     Data points. Errors/biases in HTS/NGS:
     - Present in all experiments
     - Be aware/informed
     - Minimise
     - Test
     [Figure: time line with 1994, 2009, 2013 ("ME!") and 2013 ("You???")]
     Next generation sequencing:
     - Series of experiments
     - Biases/errors accumulate!

  3. Anscombe's Quartet
     Anscombe F.J. (1973) American Statistician. Image source: Wikipedia.
     - Maths is a tool for analysis.
     - Summary statistics alone let you blindly ignore biases and errors in data sets.
       - The mean, stdev, variance and correlation of all four data sets are (nearly) the same, although the plots look completely different!
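The point of the quartet can be checked numerically. The sketch below uses the published Anscombe (1973) values and computes the summary statistics named on the slide; the function and variable names are illustrative only and are not part of the original talk.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet (Anscombe 1973). Data sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def summarise(xs, ys):
    """Return the summary statistics that hide the differences between the sets."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }


summaries = {name: summarise(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four rows come out essentially identical, which is why plotting the data remains a necessary QC step.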

  4. High Throughput Sequencing
     Workflow: Sample preparation → Library preparation → Clonal amplification → Sequencing → Bioinformatics
     Terms shown around the workflow: Molarity, Fluorescence, Gels, Absorbance, Stains, Titrations, Quantification, Purity, Genes, Genome, SNPs, Cores, CPU, RAM, Threads, Scripts, Command line
     Cumulative error
     Challenges:
     (1) Awareness
     (2) QC considerations
     Related terms: Community, Network, Literature; Time, Cost, Throughput, Consumption, Sensitivity/specificity

  5. Quantification: Nanodrop spectrophotometer
     - Quick
     - Consumes 1-2ul sample
     - Large dynamic range (10 – 10,000ng/ul)
     - Can identify contaminations
     Contaminants:
     - 230nm: EDTA, carbohydrates, sodium acetate*, tris*
     - 270nm: Phenol (plus at 230nm*)
     - 280nm: DTT
     Ratios:
     - 260/280: 1.8 (DNA), 2.0 (RNA)
     - 260/270: 1.2 – 1.3?
     - 260/230: 2.0 – 2.2
     Solution: Re-precipitate/buffer exchange
     WARNING:
     - Careful of accuracy < 50ng/ul
     - Careful of concentrations > 1ug/ul
     - Does not assess quality!!
     WARNING:
     - Contaminants can impact on downstream enzymatic reactions
     * http://seqanswers.com/forums/showthread.php?t=21280
     http://www.nanodrop.com/Library/CVStech_17_11_FINAL.pdf
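The ratio checks on this slide are simple arithmetic, so they are easy to script. The following is a minimal sketch, assuming absorbance readings are already available as numbers; the function name, the flag wording and the ±0.1 tolerance on the 260/280 target are assumptions made for illustration, while the target ratios and concentration limits come from the slide.

```python
# Targets taken from the slide; the tolerance is an illustrative assumption.
EXPECTED_260_280 = {"DNA": 1.8, "RNA": 2.0}
MIN_260_230 = 2.0                    # slide gives 2.0 - 2.2 as the expected range
RATIO_TOLERANCE = 0.1                # assumption, not from the slide
LOW_CONC, HIGH_CONC = 50.0, 1000.0   # ng/ul, accuracy warnings on the slide


def assess_purity(a260, a280, a230, kind, conc_ng_ul=None):
    """Return (260/280 ratio, 260/230 ratio, list of warning flags)."""
    ratio_280 = a260 / a280
    ratio_230 = a260 / a230
    flags = []
    if abs(ratio_280 - EXPECTED_260_280[kind]) > RATIO_TOLERANCE:
        flags.append("260/280 off target")
    if ratio_230 < MIN_260_230:
        flags.append("260/230 low (possible contaminant absorbing at 230nm)")
    if conc_ng_ul is not None and not LOW_CONC <= conc_ng_ul <= HIGH_CONC:
        flags.append("concentration outside the range where readings are reliable")
    return ratio_280, ratio_230, flags


clean_rna = assess_purity(1.0, 0.5, 0.5, "RNA")
dirty_dna = assess_purity(1.0, 0.5, 1.0, "DNA", conc_ng_ul=20)
print(clean_rna)
print(dirty_dna)
```

A sample that passes these checks is only free of obvious contaminants; as the slide notes, the ratios say nothing about integrity.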

  6. Quantification: Qubit fluorimeter
     - More sensitive than Nanodrop
     - Consumes small amount of sample
     - Specific assays
     WARNING:
     - Known biases in quantifying ssRNA < 50ng/ul
     - Cannot quantitate ssDNA in presence of dsDNA

  7. Quantification: Agilent Bioanalyzer
     - Consumes small amount of sample
     - Quantification
     - Estimating nucleic acid size
     Application and quantitative range, one line per chip:
     - Total RNA*: 5-500ng/ul; mRNA: 25-250ng/ul
     - Total RNA*: 50-5000pg/ul; mRNA: 250-5000pg/ul
     - dsDNA: 5-500pg/ul (50-7000bp)
     * RNA integrity number (RIN), Schroeder et al (2006) BMC Mol Bio.
       - Use at least 50ng for meaningful RIN
     WARNING:
     - Each chip has a quantitative range
     - Limitations on size range
     - Sensitive to salts
     - Not accurate quantitating broad smears

  8. Sample Purification/Assessment/Processing

     Criteria | RNA | DNA | QC
     High complexity | Trizol vs column based | Phenol:chloroform vs column based | qPCR, Northern blotting??
     High quality | RIN > 8 | Unfragmented | Bioanalyzer, gel electrophoresis
     Accurate quantification | pg - ng - ug | pg - ng - ug | Qubit/Nanodrop, Agilent Bioanalyser
     Contamination (salts, organics) | A260/280 = 2, A260/230 > 2 | A260/280 = 1.8, A260/230 > 2 | Qubit, Nanodrop
     Enrichment | Deplete ribosomes | Exome capture | qPCR/Agilent
     Fragment | Uniform peaks better than broad | Uniform peaks better than broad | Agilent

     1) Library manual as provided by the manufacturer
     2) http://nxseq.bitesizebio.com/articles/
     GOAL: to have a final sample with high complexity

  9. Purification biases
     Kim et al., (2012) Molecular Cell 46, 893-895
     - [Figure: ratio 141/200c by cell number, Low = 500,000, High = 800,000, 1mL Trizol]
     - Small RNAs precipitate (ppt) with longer RNA
     - Most susceptible: low GC content, 2ndary structure
     Kim et al., (2011) Molecular Cell 43, 1005-1014
     - Library prep + sequence; cell number (L) = 200,000, (H) = 800,000
     - miRNAs -141, -29b, -21, -106b, -15a, -34a decreased in cells grown at low confluence/loss of adhesion

  10. miRNA library biases
      Hafner et al., (2011) “RNA-ligase-dependent biases in miRNA ….. cDNA libraries” RNA 17(9), 1-16
      Input:
      - Pool A = Equimolar
        - 770 synthetic miRNAs
        - 45 designed RNAs
      - Pool B = 10 fold serial dilution
      Ligation biases:
      - Enzyme
      - Temperature
      - Sequence
      Reverse transcription bias:
      - Not a significant source of sequence specific biases
      PCR bias:
      - Dilute 1:10000, 5 x 10 PCR cycles
      - No appreciable distortion!
      WARNING:
      - Don't compare NGS data sets from different library preps
      - Be consistent with incubation times/temperatures

  11. Sequencing platforms
      - Platforms: Ion Torrent, Illumina, Complete Genomics
      - Reagents: Kapa Biosystems, standard reagents
      - Flowcell/lane variations do occur, but are smaller than those observed between platforms
      References:
      - Ross et al., Characterizing and measuring bias in sequence data. Genome Biology 2013
      - Bragg et al., Shining a light on dark sequencing: characterising errors. PLoS Comp Biol 2013
      - Loman et al., Performance comparison of benchtop HTS platforms. Nature Biotech 2012
      - Quail et al., Tale of three NGS platforms. BMC Genomics 2012
      - Lam et al., Performance comparison of whole genome sequencing platforms. Nat Biotech 2012

  12. Raw sequencing files → Assessing sequence quality → Align (pipeline) → Assessing alignment data

  13. The Basics
      File types: fastq, csfasta, qual, fasta, xsq
      - Header: Coordinates/other
      - Sequence: A T C G N/.
      - Quality values: Phred score
      Quality encodings for Phred scores 0 to 40:
      Numerical: 0 . . . 10 . . . 20 . . . 30 . . . 40
      Phred+33: ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I
      Phred+64: @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h
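The two encodings differ only in the ASCII offset added to the Phred score, so decoding is a subtraction. Below is a small sketch of that conversion, together with the standard relation between a Phred score Q and the error probability, P = 10^(-Q/10). The function names are illustrative, and `guess_offset` is a rough heuristic added here, not something described in the talk.

```python
def decode_phred(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the chosen offset; wrong encoding?")
    return scores


def error_probability(phred):
    """Probability that a base call is wrong, P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def guess_offset(quality):
    """Heuristic only: characters below '@' (ASCII 64) cannot be Phred+64.

    A Phred+33 string made up entirely of high scores is ambiguous and
    will be reported as 64, so check several reads before trusting this.
    """
    return 33 if min(map(ord, quality)) < 64 else 64


print(decode_phred("!5?I"))        # Phred+33 characters
print(decode_phred("@Th", 64))     # Phred+64 characters
print(error_probability(20))
```

Decoding a file with the wrong offset shifts every score by 31, which is enough to make good data look unusable or poor data look excellent.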

  14. Free java utility that can assess QC metrics of HTS data sets.
      - GUI
      - Command line
      - Can create html output
      Input: fastq (standard, gzip, colorspace, casava), SAM/BAM
      Not all data sets require full complement of green ticks!!

  15. [Figure: per-cycle quality box plot. Box and whiskers mark the 10%, 25%, median, 75% and 90% values, with the mean overlaid. Quality bands: very good, reasonable, poor.]
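The values named in the box plot (10%, 25%, median, 75%, 90% and the mean) can be computed per sequencing cycle directly from the quality strings. This is a rough sketch assuming Phred+33 input and a nearest-rank percentile definition; QC tools may use a different percentile method, and all names here are illustrative.

```python
from statistics import mean


def nearest_rank(sorted_values, percent):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, -(-percent * len(sorted_values) // 100))  # integer ceiling
    return sorted_values[rank - 1]


def per_cycle_summary(quality_strings, offset=33):
    """Summarise Phred scores at each cycle across reads of varying length."""
    n_cycles = max(len(quality) for quality in quality_strings)
    summary = []
    for cycle in range(n_cycles):
        scores = sorted(
            ord(quality[cycle]) - offset
            for quality in quality_strings
            if len(quality) > cycle
        )
        summary.append({
            "cycle": cycle + 1,
            "mean": mean(scores),
            "p10": nearest_rank(scores, 10),
            "p25": nearest_rank(scores, 25),
            "median": nearest_rank(scores, 50),
            "p75": nearest_rank(scores, 75),
            "p90": nearest_rank(scores, 90),
        })
    return summary


# Toy reads whose quality drops towards the 3' end.
reads = ["II?5", "I?5!", "I55!", "I?!"]
summary = per_cycle_summary(reads)
for row in summary:
    print(row)
```

A falling median across cycles, as in the toy reads, is the typical pattern that motivates quality trimming before alignment.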

  16. - Identifies if subset of sequences have low quality
      - May identify cycles that are unreliable
      - Identify adaptors and primers
      Helps assess raw data files prior to mapping:
      - low quality data may cause incorrect alignments
      - low quality data may incorrectly call variations
      - Sequence with trailing adaptor sequences will not map
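Because reads with trailing adaptor sequence will not map, the adaptor is normally removed before alignment. The sketch below shows the idea with exact string matching only; it is not how any particular trimming tool works, real trimmers also tolerate mismatches, and the adaptor and read used here are example values chosen for illustration.

```python
def trim_3p_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adaptor (full or truncated at the read end), exact match only."""
    # Case 1: the whole adaptor is present somewhere in the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends part-way through the adaptor.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


# Example values (assumptions for illustration, not from the talk).
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"
INSERT = "TGAGGTAGTAGGTTGTATAGTT"

full = trim_3p_adapter(INSERT + ADAPTER + "AAAA", ADAPTER)
partial = trim_3p_adapter(INSERT + ADAPTER[:14], ADAPTER)
untouched = trim_3p_adapter("ACGTACGTACGT", ADAPTER)
print(full, partial, untouched)
```

The `min_overlap` parameter is a trade-off: a small value removes more adaptor remnants but also clips genuine bases that happen to match the start of the adaptor.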

  17. Aligners
      - Be aware of the default options
        - Accepted errors
        - Multimappers
      - Different aligners can give different results. Benchmarking short sequence mapping tools, Hatem et al (BMC Bioinformatics, 2013)
      Reference
      - Choose a suitable reference.
      - Include mitochondrial sequence
      - Design a filter set to capture repeated sequences (rRNA, tRNA)

  18. Assessing alignment data
      Mapping statistics (include a filter):
      - % mapped
      - % mapped at what length
      Alignment feature statistics:
      - Coverage
      - Expression
      - Discovery
      [Flowchart: Pass / Questionable; Filter raw data (filter sequence, trim quality); Test]
      Important:
      - Know your mapping statistics
      - Know what to expect from your data sets
      - Test on existing data set
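Both mapping statistics listed above (% mapped, and % mapped at what length) can be derived from the FLAG and SEQ columns of a SAM file. The following sketch assumes uncompressed SAM text with the SEQ field present on primary records; it counts each read once by skipping secondary (0x100) and supplementary (0x800) records and treats FLAG bit 0x4 as unmapped. The function name and demo records are made up for illustration.

```python
from collections import defaultdict


def mapping_stats(sam_lines):
    """Return (% mapped overall, {read length: % mapped}) from SAM text lines."""
    total = mapped = 0
    by_length = defaultdict(lambda: [0, 0])  # length -> [mapped, total]
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header records
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # secondary or supplementary alignment of a read seen elsewhere
        length = len(fields[9])
        is_mapped = not flag & 0x4
        total += 1
        mapped += is_mapped
        by_length[length][0] += is_mapped
        by_length[length][1] += 1
    percent = 100 * mapped / total if total else 0.0
    per_length = {n: 100 * m / t for n, (m, t) in sorted(by_length.items())}
    return percent, per_length


# Made-up records: two 22-mers (one unmapped), one 18-mer with a secondary hit.
demo = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t22M\t*\t0\t0\t" + "A" * 22 + "\t" + "I" * 22,
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t" + "C" * 22 + "\t" + "I" * 22,
    "r3\t16\tchr1\t200\t60\t18M\t*\t0\t0\t" + "G" * 18 + "\t" + "I" * 18,
    "r3\t272\tchr2\t300\t0\t18M\t*\t0\t0\t*\t*",
]
percent_mapped, percent_by_length = mapping_stats(demo)
print(round(percent_mapped, 1), percent_by_length)
```

Running this on an existing, trusted data set first gives the baseline against which a new run can be judged as pass or questionable.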

  19. Take home messages
      - NGS is a collection of experiments
      - Biases/errors can/will occur at all steps of a high throughput sequencing study
      - QC measures should be applied at all steps of a high throughput sequencing study
      - Don't be alarmed, stay informed
      - Be familiar with existing data sets

  20. miRNA sequencing profiling: miRspring
      Humphreys D.T., and Suter C.M. Nucleic Acids Research 2013. http://miRspring.victorchang.edu.au
      - Small (<2MB) HTML document that replicates the miRNA aligned sequencing data.
      - Needs NO internet connectivity.
      - Provides visualization of sequence data
      - Reports on miRNA processing
      - Complete transparency.

  21. microRNAs
      - Small non-coding RNAs (22nt)
      - Bind to 3'UTRs → decay and/or translational repression
      - Biogenesis: derived from longer stem loop precursors
      miRspring reporting tools:
      i) 5' isomiRs
      ii) 3' isomiRs
      iii) Non-canonical
      iv) Arm bias
      v) miRNA length
      vi) RNA editing (A → G, C → T)
      [Figure: stem loop precursor, 5' to 3', annotated with i to vi]

  22. miRspring
      miRNA clusters:
      - Mono-cistronic (genomic)
      - Poly-cistronic (genomic)
      Seed analysis:

      miRNA | Sequence | Seed
      miR-196a | UAGGUAGUUUCCUGUUGUUGGG | AGGUAGU
      let-7a | UGAGGUAGUAGGUUGUAUAGUUU | GAGGUAG
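The seeds listed for miR-196a and let-7a correspond to nucleotides 2 to 8 of each mature sequence. A minimal sketch of that extraction follows, using the two sequences as they appear on the slide; it is not miRspring's own code, and the function names are illustrative.

```python
from collections import defaultdict


def seed(mirna_sequence, start=2, end=8):
    """Return the seed, nucleotides start..end (1-based, inclusive)."""
    return mirna_sequence[start - 1:end]


def group_by_seed(mirnas):
    """Map each seed to the sorted names of the miRNAs that carry it."""
    groups = defaultdict(list)
    for name, sequence in mirnas.items():
        groups[seed(sequence)].append(name)
    return {s: sorted(names) for s, names in groups.items()}


# Sequences as shown on the slide.
MIRNAS = {
    "miR-196a": "UAGGUAGUUUCCUGUUGUUGGG",
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUUU",
}
seed_groups = group_by_seed(MIRNAS)
print(seed_groups)
```

Note that a 5' isomiR shifts the start of the read by one or more nucleotides and therefore changes the seed, which is one reason isomiR reporting and seed analysis are presented together.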
