mutation detection in massively parallel sequencing



  1. Mutation detection in massively parallel sequencing 2012 Winter School in Mathematical and Computational Biology Ann-Marie Patch

  2. Sequencing literature. From medicine to evolution: large-scale sequencing data is impacting all of our research. Virology: 11 new complete genomes in this month’s issue of Journal of Virology.

  3. Sequencing literature. From medicine to evolution: large-scale sequencing data is impacting all of our research. Bacterial genome sequencing (BMC Microbiology).

  4. Sequencing literature. From medicine to evolution: large-scale sequencing data is impacting all of our research. Fungal and plant genome sequencing.

  5. Sequencing literature. From medicine to evolution: large-scale sequencing data is impacting all of our research. Vertebrate evolution (Nature).

  6. Sequencing literature. From medicine to evolution: large-scale sequencing data is impacting all of our research. Human cancer genetics (Nature).

  7. Why do we want to sequence?
     • The International Cancer Genome Consortium (ICGC) aims to obtain a comprehensive description of genomic, transcriptomic, and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical importance across the globe (International Cancer Genome Consortium, Nature 2010; http://www.icgc.org/).
     • The 1000 Genomes Project is an international collaboration to produce an extensive public catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts. This resource will support genome-wide association studies and other medical research studies (The 1000 Genomes Project Consortium, Nature 2010; http://www.1000genomes.org/).
     • Basic level: comparing the characteristics of two or more populations of cells and recording the differences.
     Talk on metagenomics tomorrow morning.

  8. How many changes are we going to find?
     [Illustration: human genome]
     The genomes of any two people will typically have:
     • 1 single-base difference every ~1000 bp (0.1%)
     • Adding in all other classes of mutation, typically <0.5% of two human genomes differs
     • In Craig Venter’s genome, 4.1 million DNA variants were reported (12,500 in the exome), encompassing 12.3 Mb in total (0.41%)
     Levy et al 2007, Ng et al 2008

  9. What sort of differences will we find in DNA sequencing?
     Mutation = a change in nucleic acid sequence
     • Single nucleotide variations (SNVs)
     • Small insertions and deletions (INDELs)
     • Large chromosome rearrangements
     • Copy number changes
     RNA talks later today. Cloonan 2010

  10. Mutation Detection
      • Sequencing basics
      • Small mutations, SNVs and indels
      • Genomic rearrangements
      • Copy number changes
      • Verification

  11. Sequencing process recap: library preparation. Genomic DNA → sheared DNA fragments → captured or size-selected DNA fragments → paired-end sequencing.

  12. Sequencing process recap: library preparation. Genomic DNA → sheared DNA fragments → captured or size-selected DNA fragments → align back to a reference genome.

  13. Read pairs are mapped to a reference genome
      [Figure: paired-end sequences mapped to a reference genome, with the coverage depth shown along it]
      • The number of reads that stack on top of each other at any one position is called the coverage depth.
      • Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine mutations and structural rearrangements.
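
The definition of coverage depth above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: reads are given as hypothetical (start, length) pairs with 1-based starts, they align without gaps, and the toy coordinates are invented for the example rather than taken from the talk.

```python
def coverage_depth(reads, genome_length):
    """Return the number of reads covering each position 1..genome_length.

    reads: iterable of (start, length) pairs, 1-based start, ungapped.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (genome_length + 2)
    for start, length in reads:
        end = min(start + length - 1, genome_length)
        delta[start] += 1
        delta[end + 1] -= 1

    depth = []
    running = 0
    for position in range(1, genome_length + 1):
        running += delta[position]
        depth.append(running)
    return depth


# Three toy reads of length 5 on a 10 bp reference (invented coordinates).
toy_reads = [(1, 5), (3, 5), (4, 5)]
print(coverage_depth(toy_reads, 10))
```

The difference-array approach touches each read once and each position once, which matters when the same idea is applied to millions of reads.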

  14. Read pairs are mapped to a reference genome
      [Figure: read pairs mapped to a reference genome, labelled with: small insertions and deletions (INDELs); single nucleotide variant; large chromosome rearrangement; copy number loss (homozygous); copy number loss (heterozygous); copy number gain]
      We convert our data into positional information and counts: “How many bases out of the total are different to the reference at any position?”

  15. Mutation Detection
      • Sequencing basics
      • Small mutations, SNVs and indels
      • Genomic rearrangements
      • Copy number changes
      • Verification

  16. Detecting single nucleotide variants (SNVs)
      Reference, aligned reads (coverage) and SNV call:
              10        20        30
      ACGATATTACACGTACACTCAAGTCGTTCGGAACCT   Reference
      ACGATATTACACGTACATTCAAATCGT
      ACGTTATTACACGTACATTCAACTCGT
      ACGATATTACACGCACATTCAAGTCGT
       CGATCTTACACGTACATTCAAGTCGTT
         ATATTTCACGTACATTCAAGTCGTTCG
         ATATTAAA-GTACATTCAAGTCGTTCG
           ATTACACGTACATTCAAGTCGATCGGA
           ATTACACGTACATTCACGTCGTTCGGA
               CACGTACATTCGAGTCGTTCGGAACCT
      -----------------T------------------   SNV call
      Mutation = homozygous 18 C>T: 9 bases out of a total of 9 reads covering this position do not match the reference.
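
The count on this slide can be reproduced with a small pileup sketch. The reference and read sequences are copied from the slide; the 1-based start positions are an assumption, inferred by lining each read up against the reference, and the helper names are hypothetical. Real variant callers also weigh base and mapping quality, which this sketch ignores.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"

# (assumed 1-based start position, read sequence as shown on the slide)
READS = [
    (1, "ACGATATTACACGTACATTCAAATCGT"),
    (1, "ACGTTATTACACGTACATTCAACTCGT"),
    (1, "ACGATATTACACGCACATTCAAGTCGT"),
    (2, "CGATCTTACACGTACATTCAAGTCGTT"),
    (4, "ATATTTCACGTACATTCAAGTCGTTCG"),
    (4, "ATATTAAA-GTACATTCAAGTCGTTCG"),  # '-' marks a deleted base
    (6, "ATTACACGTACATTCAAGTCGATCGGA"),
    (6, "ATTACACGTACATTCACGTCGTTCGGA"),
    (10, "CACGTACATTCGAGTCGTTCGGAACCT"),
]


def pileup(reads, position):
    """Count the bases that the aligned reads show at a 1-based position."""
    counts = Counter()
    for start, sequence in reads:
        offset = position - start
        if 0 <= offset < len(sequence):
            counts[sequence[offset]] += 1
    return counts


def non_reference_count(reference, reads, position):
    """Return (reference base, depth, number of non-reference bases)."""
    counts = pileup(reads, position)
    ref_base = reference[position - 1]
    depth = sum(counts.values())
    return ref_base, depth, depth - counts[ref_base]


print(pileup(READS, 18))
print(non_reference_count(REFERENCE, READS, 18))
```

At position 18 every covering read shows T where the reference has C, which is the 9 out of 9 reported on the slide.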

  17. What counts are acceptable?
      • 4 out of 9: heterozygous C>T SNV call
      • 2 out of 9: heterozygous C>T SNV call?
      How good is the data?
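
One way to reason about these counts is to compare how likely each observation is under different genotypes. The sketch below uses a deliberately naive binomial model and is not the method of any caller named in this talk: the per-base error rate of 0.01 is an assumed value, every read is treated as independent, and base qualities are ignored.

```python
from math import comb


def genotype_likelihoods(alt_reads, depth, error_rate=0.01):
    """Binomial likelihood of seeing alt_reads non-reference bases in depth reads."""
    # Expected fraction of non-reference bases under each genotype (assumed model).
    expected_alt_fraction = {
        "homozygous reference": error_rate,
        "heterozygous": 0.5,
        "homozygous variant": 1.0 - error_rate,
    }
    return {
        genotype: comb(depth, alt_reads)
        * p ** alt_reads
        * (1.0 - p) ** (depth - alt_reads)
        for genotype, p in expected_alt_fraction.items()
    }


def most_likely_genotype(alt_reads, depth, error_rate=0.01):
    likelihoods = genotype_likelihoods(alt_reads, depth, error_rate)
    return max(likelihoods, key=likelihoods.get)


for alt_reads in (9, 4, 2):
    print(f"{alt_reads} out of 9:", most_likely_genotype(alt_reads, 9))
```

Under these assumptions 2 out of 9 still favours a heterozygous call, but the margin shrinks quickly as the assumed error rate rises, which is why the slide asks how good the data is.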

  18. Controlling the quality of sequencing data
      Filtering data can take place at various stages.
      Workflow: Pre-filter reads → Variant calling → Evaluate variants → Annotate variants → Rank variants → Verify variants
      Pre-alignment:
      • Remove or trim reads where base quality is low, e.g. SolexaQA (Cox et al 2010)
      [Figure: per-base quality at the 3’ end of sequence reads (read positions 64 to 76), with a q20 threshold]
      Talk this afternoon on trimming and errors
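
Trimming low-quality bases from the 3’ end of a read can be sketched as follows. This is a minimal illustration and not the algorithm SolexaQA uses: it assumes qualities are encoded as Phred+33 characters (older Illumina data used an offset of 64), it takes the q20 threshold from the figure, and the function name is hypothetical.

```python
def trim_3prime(sequence, quality, min_quality=20, phred_offset=33):
    """Remove trailing bases whose Phred quality is below min_quality.

    sequence and quality are the two strings of a FASTQ record;
    phred_offset=33 is an assumption about the quality encoding.
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - phred_offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# Toy read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2, '!' encodes Q0.
trimmed_sequence, trimmed_quality = trim_3prime("ACGTACGT", "IIIII5#!")
print(trimmed_sequence, trimmed_quality)
```

A read that ends up very short after trimming would normally be dropped as well; the minimum length to keep is a separate choice not covered here.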

  19. Control the quality of input data before calling variants
      Post-alignment, alignment thresholds:
      • Set minimums for mapping quality
      • Mark duplicates, e.g. Picard (http://picard.sourceforge.net)
      • Set a maximum number of mismatches for a read
      • Flag reads that map to more than one location in the genome
      [Figure: a read with 3 mismatches; PCR duplicate reads]
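
The post-alignment checks above can be sketched with plain Python objects. Everything here is illustrative: the `Alignment` record, its field names and the default thresholds are hypothetical, and the duplicate rule (same chromosome, start, mate start and strand, keeping the alignment with the highest mapping quality) is a simplification rather than a description of what Picard does.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    name: str
    chrom: str
    start: int              # 1-based leftmost mapped position
    mate_start: int
    reverse_strand: bool
    mapping_quality: int
    mismatches: int
    is_duplicate: bool = False


def mark_duplicates(alignments):
    """Flag all but one alignment in each group sharing the same coordinates."""
    best = {}
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.mate_start, aln.reverse_strand)
        if key not in best or aln.mapping_quality > best[key].mapping_quality:
            best[key] = aln
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.mate_start, aln.reverse_strand)
        aln.is_duplicate = best[key] is not aln
    return alignments


def passes_filters(aln, min_mapping_quality=20, max_mismatches=3):
    """Apply assumed thresholds; tune them to the platform and aligner."""
    return (
        not aln.is_duplicate
        and aln.mapping_quality >= min_mapping_quality
        and aln.mismatches <= max_mismatches
    )


alignments = mark_duplicates([
    Alignment("r1", "chr1", 100, 300, False, 60, 0),
    Alignment("r2", "chr1", 100, 300, False, 40, 1),  # same coordinates as r1
    Alignment("r3", "chr1", 150, 350, False, 10, 0),  # low mapping quality
    Alignment("r4", "chr1", 200, 400, True, 60, 5),   # too many mismatches
])
kept = [aln.name for aln in alignments if passes_filters(aln)]
print(kept)
```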

  20. Software for calling variants
      Many software tools are available, more than are listed here:
      • GATK - McKenna et al 2010, Genome Res
      • SAMtools (mpileup and BCFtools) - Li 2009, Bioinformatics
      • DiBayes - SOLiD software, http://www.lifetechnologies.com
      • InGAP - Qi 2011, Nucleic Acids Res
      • MAQGene - Bigelow 2009, Nat. Methods (C. elegans only)
      • PolyBayesShort - http://bioinformatics.bc.edu/marthlab/PbShort
      • SomaticSniper - Larson 2012, Bioinformatics
      • Sniper - Simola 2011, Genome Biol
      • Strelka - Saunders 2012, Bioinformatics
      • Dindel - Albers 2011, Genome Res
      • SNiPlay - Dereeper 2011, BMC Bioinformatics
      • SRiC - Zang 2011, BMC Genomics
      • qSNP - QCMG manuscript in preparation
      http://seqanswers.com/wiki/Software/list

  21. Evaluate the overall calling of variants by the software
      SNP concordance with genotyping arrays (germline variants; effective median coverage 37):
      • SOLiD sequencing (+), Illumina array (+): 339,935 True Positives
      • SOLiD sequencing (+), Illumina array (-): 1,453 False Positives
      • SOLiD sequencing (-), Illumina array (+): 5,806 False Negatives
      • SOLiD sequencing (-), Illumina array (-): 434,554 True Negatives
      (+) variant called by technology; (-) variant not called
      Sensitivity 97% = TP/(TP+FN); Specificity 99% = TN/(TN+FP)
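
The two formulas on this slide can be applied directly to the four counts. The sketch below assumes the counts map onto the confusion matrix as labelled on the slide. Computed this way they give roughly 98.3% sensitivity and 99.7% specificity, in the same range as the 97% and 99% quoted but not identical to them, so the quoted figures may have been derived from a slightly different set of calls.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of array-positive sites that sequencing also called: TP/(TP+FN)."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of array-negative sites that sequencing left uncalled: TN/(TN+FP)."""
    return true_negatives / (true_negatives + false_positives)


# Counts as transcribed from the slide.
TP, FP, FN, TN = 339_935, 1_453, 5_806, 434_554

print(f"sensitivity = {sensitivity(TP, FN):.1%}")
print(f"specificity = {specificity(TN, FP):.1%}")
```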

  22. Visualise the variants called by the software
      Visualising SNV calls in IGV (http://www.broadinstitute.org/software/igv/):
      • Assess coverage and quality
      • Check for hidden duplicates
      • Examine sequence context

  23. Visualise the variants called by the software
      Visualising small INDELs in IGV (http://www.broadinstitute.org/software/igv/):
      • Assess coverage and quality
      • Check for hidden duplicates
      • Examine sequence context

  24. Solutions for annotating variants
      • SeattleSeq - http://snp.gs.washington.edu/SeattleSeqAnnotation131/
      • MU2A - Garla V et al 2010, Bioinformatics
      • Segtor - Renaud et al 2011, PLoS One
      • Galaxy - http://galaxy.psu.edu/
      • ANNOVAR - http://www.openbioinformatics.org/annovar/
      • Ensembl Perl API - http://www.ensembl.org
      • And more ...
      Annotated variant classes:
      • Downstream, Upstream (5kb)
      • Intergenic
      • Intronic
      • Essential splice site
      • 5’UTR, 3’UTR
      • Synonymous coding
      • Non-synonymous coding
      • Stop gained, Stop lost
      • Within non-coding gene
      • miRNA

  25. Measuring how damaging a variant might be
      Rank variants on a number of characteristics:
      • Conservation of amino acid across species, e.g. GERP, phastCons, phyloP
      • Assess potential for damage using e.g. SIFT, PolyPhen2, MutationTaster
      • Manual curation, i.e. variation within a known candidate gene or locus of interest
      • Pathway analysis
      PolyPhen2 (http://genetics.bwh.harvard.edu/pph2/). Talk on Thursday about PolyPhen.
      PolyPhen2: Adzhubei 2010; SIFT: Kumar 2009; MutationTaster: Schwarz 2010; GERP: Davydov 2010, Goode 2010; phastCons and phyloP: Pollard 2009, Siepel 2005

  26. Summary of workflow for mutation detection: QCMG workflow
      Pre-filter:
      • Remove duplicates
      • Alignment length > 34 or (F5 and in proper pair)
      • Mapping quality > 14
      • Fewer than 3 mismatches to reference
      qSNP:
      • Pileup of variants in tumour and normal BAMs
      • Coverage minimum of 12 reads in the normal
      • Calls somatic if not in pileup of matched normal
      • Flags if variant has been seen in the normal of another patient
      • Annotation using Ensembl API
      Evaluation of variants:
      • > 3 novel starts supporting mutation/variant
      • Coverage
      • IGV review
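
The tumour versus normal comparison in this workflow can be sketched as below. This is not the qSNP implementation: the function, its labels and its input format are hypothetical. The thresholds (at least 12 reads in the normal, more than 3 novel starts) are taken from the slide, "novel starts" is interpreted here as distinct start positions among the supporting tumour reads, and labelling a variant that is present in the normal as "germline" is an assumption.

```python
from collections import Counter

MIN_NORMAL_COVERAGE = 12   # from the slide: coverage minimum of 12 reads in the normal
MIN_NOVEL_STARTS = 3       # from the slide: more than 3 novel starts required


def classify_variant(alt_base, tumour_reads, normal_reads):
    """Classify one candidate site.

    tumour_reads and normal_reads are lists of (read_start, base_at_site)
    pairs that have already passed the pre-filter.
    """
    normal_bases = Counter(base for _, base in normal_reads)
    if sum(normal_bases.values()) < MIN_NORMAL_COVERAGE:
        return "insufficient normal coverage"

    # Distinct start positions of tumour reads that carry the variant base.
    novel_starts = {start for start, base in tumour_reads if base == alt_base}
    if len(novel_starts) <= MIN_NOVEL_STARTS:
        return "insufficient support"

    if normal_bases[alt_base] == 0:
        return "somatic"
    return "germline"


# Toy data: four tumour reads with distinct starts support C>T, normal shows only C.
tumour = [(101, "T"), (104, "T"), (109, "T"), (112, "T"), (115, "C")]
normal = [(100 + i, "C") for i in range(12)]
print(classify_variant("T", tumour, normal))
```

Requiring distinct start positions guards against a variant that is supported only by copies of the same fragment, which ties the evaluation step back to the duplicate removal in the pre-filter.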
