challenges of ancient genomics and pan genomics

Challenges of ancient genomics and pan-genomics
Kay Nieselt, Center for Bioinformatics Tübingen, University of Tübingen


  1. Challenges of ancient genomics and pan-genomics
     Kay Nieselt
     Center for Bioinformatics Tübingen, University of Tübingen

  2. Overview
     • Introduction:
       • ancient DNA
       • (microbial) paleogenomics
     • Challenges of computational paleogenomics

  3. Ancient DNA - Paleogenomics
     With the advent of NGS, studying extinct organisms has become possible. The most prominent extinct organisms are:
     • Neandertals
     • Other humans, e.g. Denisovans
     • Mammoth
     • Lately also: bacteria and viruses
     Field of paleogenomics: reconstruction and analysis of ancient genomes

  4. Microbial Paleogenomics
     Motivation:
     • Emergence of human diseases
     • Evolution of bacteria and viruses
     • Co-evolution with host

  5. Computational Paleogenomics
     1. Genome assembly:
        • reference-based (reconstruction of genome via mapping)
        • de novo
     2. SNP calling
     3. Genome comparison
     4. Phylogeny reconstruction

  6. Ancient DNA - Workflow
     (Figure: ancient DNA workflow. Source of figure: Janet Kelso, Keynote BioVis 2016)

  7. Ancient DNA - Issues
     • Short fragments: mean fragment length between 60 and 150 bp
     • Damaged: aDNA is chemically damaged, leading to C->T (and G->A) conversions at the 5' (3') end of a fragment
     • Metagenomes: sample contains a complex background (mixture of ancient and modern DNA)
     • Contamination: the complex background can also be due to contamination
     • Low amount of endogenous DNA

  8. Ancient DNA - Approaches
     • Short fragments: produce paired-end reads longer than the mean fragment length; after sequencing these are merged into a common single read
     (Figure: forward read and reverse read, their overlap, and the resulting merged read)
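The merging step on this slide can be illustrated with a minimal Python sketch. It assumes adapters are already clipped, requires an exact match in the overlap, and ignores base qualities; the function names and the `min_overlap` parameter are hypothetical and this is not the algorithm used by any particular tool such as Clip&Merge.

```python
from typing import Optional

# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str, min_overlap: int = 10) -> Optional[str]:
    """Merge a read pair into one read if the two reads overlap.

    The reverse read is first brought into the orientation of the forward
    read. The longest exact suffix/prefix overlap of at least `min_overlap`
    bases is used; None is returned if no such overlap exists.
    """
    rev = reverse_complement(reverse)
    for olen in range(min(len(forward), len(rev)), min_overlap - 1, -1):
        if forward[-olen:] == rev[:olen]:
            return forward + rev[olen:]
    return None


# Example: a 30 bp fragment sequenced with 20 bp reads from both ends.
fragment = "ACGTTGCAAGGCTTACGGATCCATGCAAGT"
fwd = fragment[:20]
rev_read = reverse_complement(fragment[10:])
merged = merge_pair(fwd, rev_read)
```

Real merging tools additionally tolerate mismatches in the overlap and combine base qualities there, which matters for damaged aDNA reads.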

  9. Ancient DNA - Approaches
     • Damage: used for authentication; afterwards samples are often treated with UDG (uracil-DNA glycosylase)
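Using damage for authentication usually means checking that C->T mismatches are enriched near the 5' end of mapped reads. The following Python sketch computes such a per-position frequency under simplifying assumptions: the input is a list of hypothetical, gap-free `(reference, read)` string pairs already oriented 5' to 3' along the read. It only illustrates the idea and is not how MapDamage or any other tool is implemented.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def c_to_t_profile(alignments: Iterable[Tuple[str, str]], window: int = 5) -> List[float]:
    """Fraction of reference C positions read as T, per distance from the 5' end.

    Each alignment is a (reference, read) pair of equal-length, gap-free
    strings. Only the first `window` positions of each read are inspected.
    """
    c_total = Counter()   # number of reference Cs seen at each position
    c_to_t = Counter()    # number of those read as T
    for ref, read in alignments:
        for pos, (r, q) in enumerate(zip(ref[:window], read[:window])):
            if r == "C":
                c_total[pos] += 1
                if q == "T":
                    c_to_t[pos] += 1
    return [c_to_t[p] / c_total[p] if c_total[p] else 0.0 for p in range(window)]


# Toy example: two of three reference Cs at position 0 are read as T.
toy = [("CACGT", "TACGT"), ("CCGTA", "TCGTA"), ("CGCAT", "CGCAT"), ("ACCGT", "ACTGT")]
profile = c_to_t_profile(toy)
```

A profile that decays with distance from the read end is what one would expect from authentic ancient fragments; a flat profile near zero would rather suggest modern DNA or UDG-treated libraries.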

  10. Ancient DNA - Issues
      • Metagenomes: sample contains a complex background (mixture of ancient and modern DNA)
        Possible approach: use MALT¹ to characterize the taxonomic content
      • Low amount of endogenous DNA:
        Approach: specific enrichment for the target species
        Problems: modern reference, no de novo assembly, …
      ¹ Herbig et al, bioRxiv 2016

  11. EAGER¹: a fully automated (ancient) genome reconstruction pipeline
      (Pipeline figure; stages, tools and file formats as labelled in the figure)
      • Preprocessing (FastQ): FastQC, Clip&Merge
      • Read mapping (FastQ to SAM/BAM): BWA, BWA-mem, …, MapDamage, QualiMap, Samtools, DeDup, CircularMapper, Preseq, Schmutzi
      • Genotyping (VCF, FastA): GATK (preprocessing, UnifiedGenotyper, HaplotypeCaller, VariantFiltration), VCF2Genome, ANGSD (likelihood genotyping)
      ¹ Peltzer et al, Genome Biol. 2016

  12. Challenge 1: Genome Reconstruction
      Question: how to reconstruct the genome of the DNA of interest from the raw reads
      • by mapping against a reference (only one?), or
      • by de novo assembly, or
      • by a hybrid approach?

  13. Challenge 1: Genome Reconstruction
      Approaches:
      • by mapping against a reference:
        Problems: 1) single-reference mapping, 2) for aDNA only modern references are used
      • by de novo assembly:
        Problems: low coverage, almost never mate pairs, ...
        Possible approach: MADAM¹, improved ancient DNA genome assembly
      • by a hybrid of both: not aware of how to do this
      ¹ Seitz and Nieselt, PeerJ Preprints 2016

  14. Challenge 2: Calling SNPs
      To call SNPs after mapping, the typical (best practice) approach is:
      • Apply tools such as GATK, freebayes or ANGSD to mapping assemblies (i.e., BAM files)
      Challenge: how to efficiently call SNPs from de novo assemblies when the input is low-coverage (ancient) DNA?

  15. Challenge 3: Genome Comparison
      Comparison of ancient and modern genomes:
      • Large-scale variations: genomic rearrangements (translocations, inversions)
      • Small-scale variations: gene content, insertions, deletions, SNPs

  16. Challenge 3: Genome Comparison
      Ancient genomes: mostly built from a single common reference
      Modern genomes: assembled
      Challenge: how to compare?

  17. Challenge 3: Genome Comparison
      Comparative analyses based on genomic positions are challenging due to different coordinate systems across genomes.
      Possible approaches:
      • compare genomes via one specific reference
        But: genomic regions that cannot be aligned to the reference are lost.
      or

  18. Build a metareference that contains the superset of all genomes: SuperGenome¹ (others call this the pan-genome)
      ¹ Herbig et al, Bioinformatics 2012

  19. SuperGenome
      Input: WGA (whole-genome alignment) of all known genomes of the species of interest
      SuperGenome: common coordinate system of all aligned genomes, independent of a prechosen reference genome, together with an injective mapping of each genome onto the SuperGenome
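The idea of a common coordinate system with an injective mapping per genome can be sketched in Python for the simplest possible case: a single collinear alignment block given as gapped strings. Each alignment column then serves as one SuperGenome coordinate. The function names are hypothetical, and the sketch deliberately ignores rearrangements and multiple alignment blocks, which the actual SuperGenome has to handle.

```python
from typing import Dict, List, Optional


def build_supergenome_maps(alignment: Dict[str, str]) -> Dict[str, List[int]]:
    """Map every ungapped genome position to its alignment column (0-based).

    maps[name][i] is the SuperGenome coordinate of position i in genome
    `name`. The mapping is strictly increasing and therefore injective.
    """
    if len({len(row) for row in alignment.values()}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return {
        name: [col for col, base in enumerate(row) if base != "-"]
        for name, row in alignment.items()
    }


def to_genome(maps: Dict[str, List[int]], name: str, column: int) -> Optional[int]:
    """Inverse lookup: genome position for a SuperGenome column, or None at a gap."""
    try:
        return maps[name].index(column)
    except ValueError:
        return None


# Toy alignment of two genomes with one insertion each.
wga = {"A": "ACG-TT", "B": "A-GCTT"}
maps = build_supergenome_maps(wga)
```

With this mapping, position 1 of genome A (column 1) has no counterpart in genome B, yet it still has a well-defined SuperGenome coordinate, which is exactly what is lost when everything is expressed relative to one chosen reference.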

  20. The SuperGenome and Pan-Genome
      Nice "side effect" of the SuperGenome: it allows
      • consistent assignment of coordinates to genomic annotations (e.g. genes)
      • straightforward determination of the pan-genome* by determining the orthologs from overlapping coordinates in the SuperGenome
      (* For us the pan-genome (aka supra-genome) is the full complement of all genes of a clade, e.g. species)
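Determining orthologs from overlapping coordinates can be illustrated as an interval-grouping problem. The Python sketch below assumes genes are already given as hypothetical `(genome, gene_id, start, end)` tuples in SuperGenome coordinates (inclusive) and treats any overlap as evidence of orthology; a real analysis would likely require a minimum overlap fraction and consider strand.

```python
from typing import List, Tuple

Gene = Tuple[str, str, int, int]  # (genome, gene_id, start, end), inclusive


def group_by_overlap(genes: List[Gene]) -> List[List[Tuple[str, str]]]:
    """Group genes whose SuperGenome intervals overlap (sweep over sorted starts)."""
    groups: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    current_end = -1
    for genome, gene_id, start, end in sorted(genes, key=lambda g: (g[2], g[3])):
        if current and start <= current_end:
            # Overlaps the running group: extend it.
            current.append((genome, gene_id))
            current_end = max(current_end, end)
        else:
            # Start a new group.
            if current:
                groups.append(current)
            current = [(genome, gene_id)]
            current_end = end
    if current:
        groups.append(current)
    return groups


genes = [
    ("g1", "a1", 0, 99), ("g2", "b1", 5, 104),   # overlapping: one gene family
    ("g1", "a2", 200, 299),                      # only in g1
    ("g2", "b2", 400, 450),                      # only in g2
]
groups = group_by_overlap(genes)
pan_genome_size = len(groups)
core = [g for g in groups if {genome for genome, _ in g} == {"g1", "g2"}]
```

In this toy case the pan-genome consists of three gene groups, of which one is shared by both genomes.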

  21. Challenge 3A: WGA
      Sheer size of available genomes of a single species: currently up to thousands
      Challenge: how to compute a WGA of thousands of genomes?

  22. Challenge 4: Phylogenetics
      Phylogenetic trees from aDNA and modern genomes are reconstructed
      1. to assess the history of evolution,
      2. to date the 'root' and/or the emergence of specific clades.

  23. Challenge 4: Phylogenetics
      How reliable are phylogenetic trees built from genomic data reconstructed from short reads?
      Typical workflow:
      1. map short reads to a single reference sequence,
      2. extract single nucleotide polymorphisms (SNPs),
      3. infer phylogenetic tree from the aligned SNP positions.
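Step 2 of this workflow, reducing reference-based genome reconstructions to their variable positions, can be sketched as follows. The Python example assumes all sequences share the reference coordinate system (equal length, no gaps) and that `N` marks uncalled positions; the function names are hypothetical. The resulting SNP alignment would then be handed to a tree-inference program of choice for step 3.

```python
from typing import Dict, List


def snp_columns(seqs: Dict[str, str]) -> List[int]:
    """Indices of columns in which at least two different called bases occur."""
    cols = []
    for i, column in enumerate(zip(*seqs.values())):
        called = {base for base in column if base in "ACGT"}  # ignore N
        if len(called) > 1:
            cols.append(i)
    return cols


def snp_alignment(seqs: Dict[str, str]) -> Dict[str, str]:
    """Restrict every sequence to the SNP columns."""
    cols = snp_columns(seqs)
    return {name: "".join(seq[i] for i in cols) for name, seq in seqs.items()}


seqs = {
    "ref": "ACGTACGT",
    "s1":  "ACGTACGA",
    "s2":  "ACCTACGA",
    "s3":  "ACCTNCGT",
}
cols = snp_columns(seqs)
reduced = snp_alignment(seqs)
```

Note that the reduced alignment discards all invariant sites, which is one of the potential sources of bias raised on the following slides.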

  24. Biased phylogenies? - I
      1. Source of common alignment? Reads mapped to one reference -> bias?
      Possible answer: REALPHY¹ (Reference sequence Alignment-based Phylogeny builder)
      ¹ Bertels et al, Mol. Biol. Evol. 2014

  25. Biased phylogenies? - II
      2. SNPs versus whole genome-based phylogeny?
      Bias of ML trees reconstructed from SNP-only positions shown by Bertels¹
      Possible approach to investigate further: TreeToReads², a pipeline for simulating raw reads from phylogenies
      ¹ Bertels et al, Mol. Biol. Evol. 2014
      ² McTavish et al, bioRxiv 2016

  26. Biased phylogenies? - III
      3. Influence of 'Ns' on phylogeny?
      Many aDNA samples suffer from a low amount of endogenous DNA -> low coverage genomes -> missing positions ('Ns')
      Question / Challenges:
      • impact of 'complete column deletion' versus 'partial column deletion'
      • threshold of coverage for a 'good' phylogeny
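The difference between complete and partial column deletion can be made concrete with a small Python sketch. It assumes an alignment given as equal-length strings with `N` for missing positions; the function name and the `max_missing_fraction` parameter are hypothetical. A threshold of 0 corresponds to complete deletion (any column containing an N is dropped), while a larger threshold gives partial deletion.

```python
from typing import Dict


def filter_columns(seqs: Dict[str, str], max_missing_fraction: float = 0.0) -> Dict[str, str]:
    """Keep only columns whose fraction of N is at most `max_missing_fraction`."""
    n = len(seqs)
    keep = [
        i
        for i, column in enumerate(zip(*seqs.values()))
        if sum(base == "N" for base in column) / n <= max_missing_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in seqs.items()}


aln = {"a": "ACGTN", "b": "ANGTN", "c": "ACGTA", "d": "ACNTA"}
complete = filter_columns(aln)                             # complete deletion
partial = filter_columns(aln, max_missing_fraction=0.25)   # tolerate 1 of 4 missing
```

In this toy alignment complete deletion keeps two of five columns, while tolerating one missing sample per column keeps four, which shows how quickly low-coverage samples can shrink the usable alignment.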

  27. Biased phylogenies? - IV
      4. SNP-based phylogenies without alignment and/or genome reconstruction?
         Possible approach: kSNP3.0³
      5. General challenge: alignment-free versus alignment-based phylogenies?⁴
      ³ Gardner et al, Bioinformatics 2015
      ⁴ Haubold, Brief. Bioinformatics 2014
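The general idea behind alignment-free, k-mer-based SNP detection can be sketched in a few lines of Python: look for odd-length k-mers that agree between genomes in both flanks but differ in the central base. This is only a simplified illustration of the principle with hypothetical function names; it ignores reverse complements and many other details, and it is not the kSNP3.0 implementation.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def kmer_snps(genomes: Dict[str, str], k: int = 7) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Find SNPs as central-base differences between k-mers with identical flanks.

    Returns {(left_flank, right_flank): {genome: central_base}} for every
    flank pair whose central base differs between genomes.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that a central base exists")
    mid = k // 2
    alleles: Dict[Tuple[str, str], Dict[str, Set[str]]] = defaultdict(dict)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            key = (kmer[:mid], kmer[mid + 1:])
            alleles[key].setdefault(name, set()).add(kmer[mid])

    snps = {}
    for key, per_genome in alleles.items():
        # Skip flank pairs that are ambiguous within a single genome.
        if any(len(bases) > 1 for bases in per_genome.values()):
            continue
        calls = {genome: next(iter(bases)) for genome, bases in per_genome.items()}
        if len(set(calls.values())) > 1:
            snps[key] = calls
    return snps


# Two toy genomes differing at a single position (A versus G).
toy_genomes = {"g1": "TTGACCATGGCA", "g2": "TTGACCGTGGCA"}
found = kmer_snps(toy_genomes, k=7)
```

No reference and no alignment are needed here, which is what makes this family of methods attractive when genome reconstruction itself is the uncertain step.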

  28. Summary
      Challenges of ancient (pan-)genomics:
      1. Reconstruction of ancient genomes from bacteria (or viruses) via mapping, assembly or a hybrid approach
      2. SNP calling of low coverage genomes
      3. WGA of ancient and modern genomes
      4. Phylogenetic tree reconstruction from ancient and modern genomes

  29. References
      • Bertels F, Silander OK, Pachkov M, Rainey PB & van Nimwegen E (2014). Automated reconstruction of whole-genome phylogenies from short-sequence reads. Molecular Biology and Evolution, 31, 1077-1088.
      • Gardner SN, Slezak T & Hall BG (2015). kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome. Bioinformatics, 31, 2877.
      • Haubold B (2014). Alignment-free phylogenetics and population genetics. Briefings in Bioinformatics, 15, 407-418.
      • Herbig A, Jäger G, Battke F & Nieselt K (2012). GenomeRing: alignment visualization based on SuperGenome coordinates. Bioinformatics, 28, i7-i15.
      • Herbig A, Maixner F, Bos KI, Zink A, Krause J & Huson DH (2016). MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, 050559.
      • McTavish EJ, Pettengill J, Davis S, Rand H, Strain E, Allard M & Timme RE (2016). TreeToReads: a pipeline for simulating raw reads from phylogenies. bioRxiv, 037655.
      • Peltzer A, Jäger G, Herbig A, Seitz A, Kniep C, Krause J & Nieselt K (2016). EAGER: efficient ancient genome reconstruction. Genome Biology, 17, 1.
      • Seitz A & Nieselt K (2016). Improving ancient genome assembly. PeerJ Preprints.
