A base composition analysis of natural patterns for the preprocessing of metagenome sequences



  1. A base composition analysis of natural patterns for the preprocessing of metagenome sequences. Oliver Bonham-Carter, Dhundy Bastola, Hesham Ali. College of Information Science & Technology, School of Interdisciplinary Informatics, Peter Kiewit Institute, University of Nebraska at Omaha, Omaha, NE, USA. 26 April 2013.

  2. Outline. 1 Introduction: Problem, Motivation. 2 Contribution: Our Study. 3 Materials and Methods: Spectrum Sets, Examples of Spectrum Sets, Proportions, Data, Experiment and Flow Chart, Association. 4 Results: Phylogeny. 5 Conclusions: References.

  3. A Preprocessing Step to de novo Sequencing. A genetic sequence is reconstructed by merging smaller pieces (reads) together to make the whole. Contigs are made of combined reads.

  4. Sequence Assembly: Similar to a Jigsaw Puzzle. Smaller pieces come together to build the whole.

  5. Mixing Pieces Makes a Harder Jigsaw Puzzle. Puzzle building is frustrated by foreign pieces in the mix.

  6. Assembly of Multiple Sequences by de novo Technologies. The sequencing pool often contains multiple sequences.

  7. Sequencing Alignment. End-regions of reads are analyzed for adjacency properties. The added foreign reads make more of this analysis necessary.
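
The adjacency check on read end-regions can be illustrated with a minimal sketch. It assumes exact matching with no sequencing errors, and both the helper name `suffix_prefix_overlap` and the example reads are hypothetical; production assemblers use more tolerant and far faster methods.

```python
def suffix_prefix_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if no overlap of at least `min_len` exists."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


# Hypothetical reads: the first two share an end-region, the third is foreign.
reads = ["ATGCGTAC", "GTACCTGA", "TTTTAAAA"]
overlap_related = suffix_prefix_overlap(reads[0], reads[1])
overlap_foreign = suffix_prefix_overlap(reads[0], reads[2])
print(overlap_related, overlap_foreign)
```

Every foreign read in the pool adds comparisons of the second kind, which cost time but never extend a contig.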

  8. Contribution of this Study: Base Composition Analysis. We propose a statistical method to cluster related sequence data into groups. This step reduces the search space when aligning the individual reads of the pool. The method was verified with synthetic and biological data.

  9. Contribution of this Study: Base Composition Analysis

  10. Spectrum Sets from Restriction Sites for Statistical Analysis. Restriction sites are regions in foreign DNA (e.g., viral DNA) where bacterial enzymes cut to destroy the DNA of an invading threat.

  11. Spectrum Sets from Restriction Sites for Statistical Analysis. Restriction sites are used to create lists of DNA words (spectrum sets). The proportional content of these words (motifs) is used to determine sequence relatedness.

  12. Four Spectrum Sets from All Known Restriction Sites (RSs) of Length 6

  13. Examples of “Home Grown” Spectrum Sets. The RSs used by Clostridium and Staphylococcus are different.

  14. Collecting Proportions of Motifs Over Sequence Data: Length 6 Motifs. A proportion is computed for each motif over each read, where m_i is a motif, S_L is a read sequence, count(m_i) is the number of occurrences of m_i found in S_L, and |m_i| and |S_L| are the lengths of the motif and the sequence, respectively. Since we are not using the entire sample space (all possible length-n motifs), proportions were appropriate.
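
The formula itself did not survive on this slide, so the following is only a sketch built from the quantities the slide defines. It assumes the proportion is count(m_i) * |m_i| / |S_L| and that overlapping occurrences are counted; both choices are assumptions, as are the function names and the example read.

```python
def count_overlapping(motif: str, seq: str) -> int:
    """Count occurrences of `motif` in `seq`, allowing overlaps (an assumption)."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == motif)


def motif_proportion(motif: str, seq: str) -> float:
    """ASSUMED formula: count(m_i) * |m_i| / |S_L|, the fraction of the read
    covered by the motif when occurrences do not overlap."""
    if not seq:
        return 0.0
    return count_overlapping(motif, seq) * len(motif) / len(seq)


# Hypothetical 24-base read containing the length-6 motif twice.
read = "AATTTAGGAATTTA" + "C" * 10
occurrences = count_overlapping("AATTTA", read)
proportion = motif_proportion("AATTTA", read)
print(occurrences, proportion)
```

Dividing by the read length keeps values comparable across reads of different lengths, which matters when only a subset of all length-6 words is counted.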

  15. Organisms.

      Organism                     Contig Originator   Division
      Bifidobacterium longum       NC_004307           Actinobacteria
      Mycobacterium bovis          NC_002945           Actinobacteria
      Clostridium tetani           NC_004557           Firmicutes
      Staphylococcus aureus        NC_007622           Firmicutes
      Burkholderia pseudomallei    NC_012695           Proteobacteria
      Campylobacter jejuni         NC_008787           Proteobacteria

      Ten trials of C(6, 2) = 15 combinations give 10 * 15 = 150 experiments, each with fresh sequence reads. The Contig Originator column lists the fully assembled sequences processed via MetaSim (http://ab.inf.uni-tuebingen.de/software/metasim/) to make synthetic reads.
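
The experiment count on this slide can be checked with a few lines. The sketch assumes that the 15 comes from pairing the six listed organisms, which the slide implies but does not state outright.

```python
from itertools import combinations

organisms = [
    "Bifidobacterium longum",
    "Mycobacterium bovis",
    "Clostridium tetani",
    "Staphylococcus aureus",
    "Burkholderia pseudomallei",
    "Campylobacter jejuni",
]

# Assumption: one experiment per unordered pair of organisms per trial.
pairs = list(combinations(organisms, 2))
trials = 10
experiments = trials * len(pairs)
print(len(pairs), experiments)
```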

  16. Flowchart of Algorithm

  17. Association. DNA sequences appear to naturally have a unique base composition. Related sequences cluster (associate) together.
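
One way to picture association is to compare motif-proportion profiles by distance. The sketch below swaps in a simple nearest-centroid assignment, which is not the heatmap and tree analysis shown in the deck. The motif subset, group names, and sequences are all hypothetical, and the proportion formula count * |m| / |S| is an assumption.

```python
import math

# Hypothetical subset of a spectrum set (permutations of AAATTT).
MOTIFS = ["AATTTA", "ATTTAA", "TTAAAT", "TAAATT"]


def profile(seq, motifs=MOTIFS):
    """Vector of assumed motif proportions, count * |m| / |S|, for one sequence."""
    def count(m):
        return sum(seq[i:i + len(m)] == m for i in range(len(seq) - len(m) + 1))
    return [count(m) * len(m) / len(seq) for m in motifs]


def nearest_group(unknown, groups):
    """Assign `unknown` to the group whose mean profile is closest (Euclidean)."""
    target = profile(unknown)
    best_name, best_dist = None, math.inf
    for name, seqs in groups.items():
        profiles = [profile(s) for s in seqs]
        centroid = [sum(col) / len(col) for col in zip(*profiles)]
        dist = math.dist(target, centroid)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


# Hypothetical groups of already-binned sequences.
groups = {
    "group_A": ["AATTTAAATTTAGCGC", "GCAATTTAGCAATTTA"],
    "group_B": ["TTAAATTTAAATGCGC", "GCTTAAATGCTTAAAT"],
}
label = nearest_group("AATTTAGCGCAATTTA", groups)
print(label)
```

A sequence whose profile sits near one group's centroid and far from the other's is a candidate member of that group.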

  18. Clostridium and Staphylococcus Genomes, CCCGGG-Spectrum Set. Figure legend: Clostridium, Staphylococcus.

  19. Clostridium and Staphylococcus Genomes, AAATTT-Spectrum Set. Figure legend: Clostridium, Staphylococcus.

  20. An Unknown Sequence Joins the Pool Party

  21. An Unknown Sequence Joins the Pool Party. Addition of Clostridium Sequence. Spectrum set: CCCGGG. Figure legend: Clostridium, Staphylococcus.

  22. An Unknown Sequence Joins the Pool Party. Addition of Clostridium Sequence. Spectrum set: AAATTT. Figure legend: Clostridium, Staphylococcus.

  23. Clostridium and Staphylococcus, CCCGGG-Spectrum Set, with Clostridium Contigs. Figure legend: Clostridium, Staphylococcus.

  24. Mixed Contigs: Clostridium, Staphylococcus and Burkholderia (Bacterial Genomes). AAATTT-Spectrum: one of the three contrasts sharply with the other two. Remove this set and rerun the test. Figure legend: Clostridium, Staphylococcus, Burkholderia.

  25. Phylogeny Remark. We successfully used our method to assign the first chromosomes of the following organisms to their correct phylogenetic groupings.

      Organism                      Common Name
      Caenorhabditis elegans        Worm
      Canis lupus familiaris        Dog
      Drosophila melanogaster       Fruit fly
      Mus musculus                  Mouse
      Mycoplasma hyorhinis GDL-1    Bacteria
      Oryctolagus cuniculus         Rabbit
      Rattus norvegicus             Rat

  26. Taxonomy Tree from GenBank. http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi

  27. CCGGAT: Best Tree. Note: The heatmap graphic is removed to show only the tree.

  28. AAATTT: Second Best Tree. Note: The heatmap graphic is removed to show only the tree.

  29. Limitations. Information based: successful phylogeny grouping requires ample sequence data (> 700 bp). Next-generation sequencing trends indicate that improving technology yields longer reads each year. Contigs, longer sequences made from combined reads, are also suitable. Spectrum set behavior: spectrum sets do not perform similarly on every data set. Contrast-based analysis depends on knowledge of how organisms naturally use restriction sites.

  30. Conclusions: Preprocessing for Multi-Sequence Assembly. The Separation of Mixed Reads. We proposed a binning preprocessing method that separates and partitions related sequence data. This method can reduce the search space when aligning reads in assembly tasks, which expedites the sequence assembly process. The structural properties of sequence material can be used to infer phylogenetic properties.
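
The search-space claim can be made concrete with a small worked example. It assumes a naive cost model in which every pair of reads in a pool is compared once, and the pool and bin sizes are hypothetical; real assemblers index reads, so the numbers only indicate the direction of the saving.

```python
from math import comb

# Hypothetical pool: 1000 reads from two organisms, split 600 / 400 by binning.
pool_size = 1000
bin_sizes = [600, 400]

# Assumed cost model: one comparison per unordered pair of reads.
comparisons_unbinned = comb(pool_size, 2)
comparisons_binned = sum(comb(n, 2) for n in bin_sizes)
saved_fraction = 1 - comparisons_binned / comparisons_unbinned

print(comparisons_unbinned, comparisons_binned, round(saved_fraction, 3))
```

Under this model, the comparisons that binning removes are exactly the cross-organism pairs, which could never contribute to a correct contig.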

  31. References.

      Bonham-Carter O, Ali H, Bastola D, “A base composition analysis of natural patterns for the pre-processing of metagenome sequences,” BMC Bioinformatics (accepted, 2013).
      Bonham-Carter O, Ali H, Bastola D, “A Meta-genome Sequencing and Assembly Preprocessing Algorithm Inspired by Restriction Site Base Composition,” 2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW).
      Wang Y, Leung HC, Yiu SM, Chin FY, “MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample,” Bioinformatics, 2012 Sep 15;28(18):i356-i362.

  32. Acknowledgments. We would like to thank the students, faculty and staff of the UNO Bioinformatics Core Facility for their support. This project has been funded by grants from the National Center for Research Resources (5P20RR016469) and the National Institute of General Medical Sciences (NIGMS) (8P20GM103427). Thank You! Questions? obonhamcarter@unomaha.edu IS&T Bioinformatics http://bioinformatics.ist.unomaha.edu/

  33. Motifs.

      Set Seed    Available Motifs
      AAATTT      12
      CCCGGG      12
      AATTCG      156
      CCGGAT      156

      The number of available motifs belonging to each spectrum set. The motifs in a spectrum set are non-palindromic permutations of the set seed.
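
The motif counts on this slide can be reproduced by enumeration. The sketch assumes that "palindromic" means equal to its own reverse complement, the usual sense for restriction sites; the function names are illustrative, not taken from the authors' code.

```python
from itertools import permutations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Complement each base, then reverse the string."""
    return motif.translate(COMPLEMENT)[::-1]


def spectrum_set(seed: str) -> list:
    """Distinct permutations of `seed` that are not reverse-complement
    palindromes (assumed meaning of 'non-palindromic')."""
    words = {"".join(p) for p in permutations(seed)}
    return sorted(w for w in words if w != reverse_complement(w))


sizes = {seed: len(spectrum_set(seed))
         for seed in ("AAATTT", "CCCGGG", "AATTCG", "CCGGAT")}
print(sizes)
```

Under this reading, AAATTT has 20 distinct permutations, of which 8 are reverse-complement palindromes, leaving 12; AATTCG has 180, of which 24 are palindromes, leaving 156. Note that the seeds AAATTT and CCCGGG are themselves palindromic, so they name their sets without belonging to them.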
