AMRtime: Precise identification of antimicrobial resistance determinants from metagenomic data


  1. AMRtime: Precise identification of antimicrobial resistance determinants from metagenomic data. Finlay Maguire (finlaymaguire@gmail.com), Faculty of Computer Science, Dalhousie University. December 3, 2019.

  2. Table of contents: (1) Background; (2) AMRtime Overview; (3) Filtering out non-AMR reads; (4) Sensitive Homology Classification

  3. Background

  4. AMR-metagenomics: Genomes → Sequencing → Reads → AMR detection → AMR Genes

  5. Comprehensive Antibiotic Resistance Database (CARD): card.mcmaster.ca

  6. Why is AMR metagenomics difficult?

  7. AMR genes are rare genomically. [Figure: AMR Reads in Metagenome (0.643%); y-axis: log(Read Count); bars: All (~324M) vs AMR (~2.1M); 2184 CARD-Prevalence genomes at 1-10X abundance]

  8. AMR genes have wildly different abundances. [Figure: 1236 AMR PATRIC genomes]

  9. AMR sequence space overlaps. [Figure: MDS of CARD Proteins by BLASTP %ID; panels: Actual Families and Affinity Clusters (Adj. Rand = 0.30041)]

  10. AMRtime Overview

  11-12. AMRtime structure. [Flowchart]
     • Input files: Metagenomic Reads, CARD
     • Processes: AMR Filtering, Sensitive Homology Classification, Variant Identification, Metamodels
     • Intermediate files: Filtered reads
     • Output files: Homology predictions, Variant predictions, Metamodel predictions

  13. AMRtime structure. [Flowchart]
     • Input files: Metagenomic Reads, CARD
     • Processes: Read Filtering, Sensitive AMR Classification
     • Intermediate files: Filtered Reads, Features
     • Output files: ARO Predictions

  14. Filtering out non-AMR reads

  15. Testing sequence similarity search tools. [Workflow figure: ESKAPE Genomes; Resistance Gene Identifier + CARD; ART Read Simulator; Labeled Simulated Metagenome; ORFM; Predicted ORF Protein Sequences]
     • NT Query & NT CARD database: BLASTN, bowtie2, BWA-MEM, biobloom*, groot, HMMSearch
     • NT Query & AA CARD database: BLASTX, DIAMOND BLASTX, PALADIN
     • AA Query & AA CARD database: BLASTP, DIAMOND BLASTP, HMMSearch

  16. Terminology refresher interlude: precision and recall (figure: https://commons.wikimedia.org/wiki/File:Precisionrecall.svg)

  17. DNA subject best for precision, Protein subject best for recall. [Figure: precision-recall curves by Domain: DNA Query/DB; DNA Query, Protein DB; Protein Query/DB] Simulated MiSeq v3 250bp reads: 30.31M reads (7.21M AMR-derived).

  18. K-mer methods perform poorly. [Figure: precision-recall curves by Paradigm: BWT, BLAST, k-mer, HMM] BWT: bowtie2, bwa-mem, paladin; BLAST: blast, diamond; HMM: hmmsearch; k-mer: biobloom, groot.

  19. DIAMOND-BLASTX best compromise. [Figure: precision-recall curves by Tool: blastx, bwa, diamond_blastx, paladin, blastp, diamond_blastp] DIAMOND-BLASTX 'more sensitive' setting (min < 1e-10): 4.926 hours with 2 cores and 8.3 GB of memory. AMR reads: 7.15M detected, 59.26K missed, 1.87M false positives.
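
The read counts on slide 19 are enough to recompute precision and recall for that setting. The following is a minimal sketch, assuming that "detected" reads are true positives and "missed" reads are false negatives; the function and variable names are illustrative and do not come from the talk.

```python
def precision_recall(true_pos, false_neg, false_pos):
    """Return (precision, recall) from raw read counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Counts reported on the slide, in reads.
detected = 7.15e6    # AMR reads found (assumed true positives)
missed = 59.26e3     # AMR reads not found (assumed false negatives)
false_hits = 1.87e6  # non-AMR reads reported as AMR (false positives)

precision, recall = precision_recall(detected, missed, false_hits)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

With these counts the sketch gives a precision of about 0.79 and a recall of about 0.99. Detected plus missed reads sum to roughly the 7.21M AMR-derived reads quoted on slide 17.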

  20. Why not just use these sequence searches?

  21. Poor gene-level accuracy. [Figure: ARO Accuracy; x-axis: Proportion of reads per ARO correct; tools: groot, diamond_blastp, diamond_blastx, blastp, blastx, paladin, blastn, bowtie2, bwa] Performance at optimal settings for ARO accuracy.

  22. Good family-level accuracy. [Figure: Correct Family; x-axis: Proportion of reads per family correct; tools: groot, hmmsearch_nt, bowtie2, bwa, hmmsearch_aa, blastn, paladin, diamond_blastp, diamond_blastx, blastx, blastp] Performance at optimal settings for Family accuracy.

  23. Sensitive Homology Classification

  24-26. Initial classifier: Training Data → Classifier → ARO predictions. NB 7-mer average precision: 0.63.
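
The slides do not say how the 7-mer features for the initial classifier were built. As a rough illustration only, overlapping 7-mers could be counted per read as below; the function name, the uppercase conversion, and the handling of ambiguous bases are all assumptions.

```python
from collections import Counter


def kmer_counts(read, k=7):
    """Count overlapping k-mers in a read, skipping windows with non-ACGT characters."""
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# A 10 bp toy read yields 10 - 7 + 1 = 4 overlapping 7-mers.
example = kmer_counts("ACGTACGTAC")
print(sorted(example))
```

Counts like these would then feed whichever classifier is used. With k = 7 there are at most 4^7 = 16,384 distinct features.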

  27. Revised classifier structure: exploiting the ARO. Training Data → AMR Family Classifier → AMR Families (Family 1 ... Family N); for each family: SMOTE → Family Data → Family Classifier → ARO predictions.
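
The two-level structure on slide 27 can be sketched as a router: one family-level model picks the AMR family, then that family's own model picks the ARO term. This is a hedged illustration, not the AMRtime implementation. The class, the gene and family names, and the "highest bitscore wins" stand-in models are all hypothetical, and the SMOTE resampling step (which applies at training time) is omitted.

```python
class HierarchicalClassifier:
    """Route a read to a family-level model, then to that family's ARO-level model."""

    def __init__(self, family_model, aro_models):
        self.family_model = family_model  # callable: features -> family name
        self.aro_models = aro_models      # dict: family name -> callable(features) -> label

    def predict(self, features):
        family = self.family_model(features)
        aro_model = self.aro_models.get(family)
        if aro_model is None:
            return family, None  # no per-family model available
        return family, aro_model(features)


# Hypothetical gene-to-family map used by the stand-in models below.
GENE_TO_FAMILY = {"gene_a1": "family_A", "gene_a2": "family_A", "gene_b1": "family_B"}


def top_gene(features, genes):
    """Return the gene in `genes` with the highest bitscore (stand-in for a trained model)."""
    return max(genes, key=lambda g: features.get(g, 0.0))


def family_model(features):
    return GENE_TO_FAMILY[top_gene(features, GENE_TO_FAMILY)]


def make_aro_model(family):
    members = [g for g, f in GENE_TO_FAMILY.items() if f == family]
    return lambda features: top_gene(features, members)


clf = HierarchicalClassifier(
    family_model,
    {family: make_aro_model(family) for family in set(GENE_TO_FAMILY.values())},
)
prediction = clf.predict({"gene_a1": 120.0, "gene_a2": 310.0, "gene_b1": 45.0})
print(prediction)
```

In a real pipeline each callable would be replaced by a trained classifier. The point of the sketch is only the routing: each per-family model sees only reads already assigned to its family.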

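  28. Read encoding: each read is represented by its row of a sequence bitscore matrix (rows are reads, columns are genes).

     $$
     \text{Sequence bitscore matrix} =
     \begin{array}{c|ccccc}
      & \text{gene } 1 & \text{gene } 2 & \cdots & \text{gene } j-1 & \text{gene } j \\ \hline
     \text{read } 1 & 1256 & 0 & \cdots & 0 & 63 \\
     \text{read } 2 & 0 & 0 & \cdots & 0 & 0 \\
     \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
     \text{read } i-1 & 0 & 512 & \cdots & 0 & 0 \\
     \text{read } i & 0 & 0 & \cdots & 785 & 129
     \end{array}
     $$

     Advantages: read length invariant, low dimensionality, uses filtering data.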
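
A small sketch of how such a matrix could be assembled from the hits of the filtering step. It assumes hits arrive as (read, gene, bitscore) tuples, for example parsed from tabular search output. The function names are illustrative, and the row-maximum scaling is only one possible reading of the "normalised bitscore" on the next slide, not a confirmed detail.

```python
def bitscore_matrix(hits, reads, genes):
    """Build a reads x genes matrix holding the best bitscore per pair; no hit means 0."""
    row = {read_id: i for i, read_id in enumerate(reads)}
    col = {gene_id: j for j, gene_id in enumerate(genes)}
    matrix = [[0.0] * len(genes) for _ in reads]
    for read_id, gene_id, bitscore in hits:
        i, j = row[read_id], col[gene_id]
        matrix[i][j] = max(matrix[i][j], bitscore)
    return matrix


def normalise_rows(matrix):
    """Scale each row by its maximum (an assumed normalisation); all-zero rows stay zero."""
    scaled = []
    for values in matrix:
        top = max(values)
        scaled.append([v / top if top > 0 else 0.0 for v in values])
    return scaled


reads = ["read_1", "read_2"]
genes = ["gene_1", "gene_2", "gene_3"]
hits = [("read_1", "gene_1", 1256.0), ("read_1", "gene_3", 63.0)]

matrix = bitscore_matrix(hits, reads, genes)
features = normalise_rows(matrix)
print(matrix)
```

Every read gets a vector with one entry per reference gene regardless of its length, which is the "read length invariant" property listed on the slide.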

  29. Held-out test results: Normalised Bitscore Random Forest. [Figure: Family Test Performance; proportions for Precision and Recall] Mean Precision: 0.995, Mean Recall: 0.985.

  30. ARO level classification more variable. [Figure: Median Precision-Recall Within Families; x-axis: Ordered AMR Family Index; y-axis: Proportion; series: Precision, Recall]

  31. On-going work
     • Soft-threshold (i.e. propagating probabilities through layers).
     • Multiset labels based on sequence redundancy within families.
     • Threshold identification for variant model counts.
     • Metamodel rule parsing.
     • Galaxy bindings (CARD/IRIDA integration).
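
One plausible reading of "propagating probabilities through layers" is to multiply the family-level probability by the within-family ARO probability instead of committing to a single family. The sketch below shows that product rule under this assumption. The function name, the labels, and the probabilities are made up for illustration.

```python
def propagate(family_probs, aro_probs_by_family):
    """Combine P(family) with P(ARO | family) into P(ARO) using the product rule."""
    joint = {}
    for family, p_family in family_probs.items():
        for aro, p_aro in aro_probs_by_family.get(family, {}).items():
            joint[aro] = joint.get(aro, 0.0) + p_family * p_aro
    return joint


# Hypothetical outputs of the family-level and per-family classifiers for one read.
family_probs = {"family_A": 0.75, "family_B": 0.25}
aro_probs = {
    "family_A": {"aro_a1": 0.5, "aro_a2": 0.5},
    "family_B": {"aro_b1": 1.0},
}

joint = propagate(family_probs, aro_probs)
print(joint)
```

Here the read is most likely from family_A, yet aro_b1 still keeps a probability of 0.25 that a hard family assignment would have discarded.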

  32. Summary

  33-37. Conclusions
     • Direct homology searches are surprisingly poor for AMR metagenomics.
     • K-mer based approaches fall flat with sequencing error, low coverage and sparse labels.
     • Direct homology search results ARE useful when combined with machine learning.
     • The Antibiotic Resistance Ontology provides useful structure to improve predictions.
     • AMRtime: coming soon to CARD and your local government genomic epidemiology platform.

  38. Acknowledgements

  39. Acknowledgements
     • McMaster University: Brian Alcock and Andrew McArthur
     • Simon Fraser University: Fiona Brinkman
     • Dalhousie University: Robert Beiko
     • Funding: Donald Hill Family Fellowship, Genome Canada Grant.

  40. Questions?

  41. Insufficient Intrafamily Signal. [Figure: Intra-Family Shared 250mers; x-axis: Number of Shared 250mers; y-axis: AMR Family; beta-lactamase families shown: TEM, SHV, OCH, MIR, LEN, GES, PDC, NDM, GOB, KPC, SME, GIM, TMB, BEL, CfxA, VEB]

  42. Interfamily Collisions

  43. Interfamily Collisions
