AMRtime: Precise identification of antimicrobial resistance determinants from metagenomic data
Finlay Maguire (finlaymaguire@gmail.com)
December 3, 2019
Faculty of Computer Science, Dalhousie University
Table of contents
1. Background
2. AMRtime Overview
3. Filtering out non-AMR reads
4. Sensitive Homology Classification
Background
AMR-metagenomics
[Diagram: Genomes, Sequencing, Reads, AMR detection, AMR Genes]
Comprehensive Antibiotic Resistance Database
card.mcmaster.ca
Why is AMR metagenomics difficult?
AMR genes are rare genomically
[Figure: AMR Reads in Metagenome (0.643%); y-axis: log(Read Count); series: All (~324M), AMR (~2.1M)]
2184 CARD-Prevalence Genomes at 1-10X abundance
AMR genes have wildly different abundances
1236 AMR PATRIC genomes
AMR sequence space overlaps
[Figure: MDS of CARD Proteins (BLASTP-%ID); panels: Actual Families, Affinity Clusters (Adj. Rand=0.30041)]
AMRtime Overview
AMRtime structure
Input files: Metagenomic Reads, CARD
Processes: AMR Filtering, Sensitive Homology Classification, Variant Identification, Metamodels
Intermediate files: Filtered reads
Output files: Homology predictions, Variant predictions, Metamodel predictions
AMRtime structure
Input Files: Metagenomic Reads, CARD
Processes: Read Filtering, Sensitive AMR Classification
Intermediate Files: Filtered Reads, Features
Output Files: ARO Predictions
Filtering out non-AMR reads
Testing sequence similarity search tools
[Diagram: ESKAPE Genomes; Resistance Gene Identifier + ART Read Simulator + CARD; Labeled Simulated Metagenome; ORFM; Predicted ORF Protein Sequences]
NT Query & NT CARD Database Methods
- BLASTN
- bowtie2
- BWA-MEM
- biobloom*
- groot
- HMMSearch
NT Query & AA CARD Database Methods
- BLASTX
- DIAMOND BLASTX
- PALADIN
AA Query & AA CARD Database Methods
- BLASTP
- DIAMOND BLASTP
- HMMSearch
Terminology refresher interlude
https://commons.wikimedia.org/wiki/File:Precisionrecall.svg
DNA subject best for precision, Protein subject best for recall
[Figure: Precision vs. Recall by Domain (DNA Query/DB; DNA Query, Protein DB; Protein Query/DB)]
Simulated MiSeq v3 250bp reads, 30.31M reads (7.21M AMR derived)
K-mer methods perform poorly
[Figure: Precision vs. Recall by Paradigm (BWT, BLAST, k-mer, HMM)]
BWT: bowtie2, bwa-mem, paladin; BLAST: blast, diamond; HMM: hmmsearch; K-MER: biobloom, groot.
DIAMOND-BLASTX best compromise
[Figure: Precision vs. Recall by Tool (blastx, bwa, diamond_blastx, paladin, blastp, diamond_blastp)]
DIAMOND-BLASTX ‘more sensitive’ setting (min < 1e-10): 4.926 hours with 2 cores and 8.3Gb of memory.
AMR Reads: 7.15M detected, 59.26K missed, 1.87M false positives.
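As a worked example of the precision and recall definitions, the read counts quoted on this slide can be plugged in directly. This is only a sketch that takes the three counts at face value (detected = true positives, missed = false negatives); the function name is illustrative, and the result need not match the plotted operating points, which may use different thresholds or counting.

```python
from typing import Tuple


def precision_recall(true_pos: float, false_neg: float, false_pos: float) -> Tuple[float, float]:
    """Return (precision, recall) from raw read counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Counts quoted on the slide, in reads (assumed mapping to TP/FN/FP).
detected = 7.15e6    # AMR reads detected (true positives)
missed = 59.26e3     # AMR reads missed (false negatives)
false_pos = 1.87e6   # non-AMR reads reported as AMR (false positives)

precision, recall = precision_recall(detected, missed, false_pos)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Note that detected plus missed reads (about 7.21M) agrees with the number of AMR-derived reads in the simulated metagenome.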
Why not just use these sequence searches?
Poor gene-level accuracy
[Figure: ARO Accuracy; x-axis: Proportion of reads per ARO correct; y-axis: Tool (groot, diamond_blastp, diamond_blastx, blastp, blastx, paladin, blastn, bowtie2, bwa)]
Performance at optimal settings for ARO accuracy
Good family-level accuracy
[Figure: Correct Family; x-axis: Proportion of reads per family correct; y-axis: Tool (groot, hmmsearch_nt, bowtie2, bwa, hmmsearch_aa, blastn, paladin, diamond_blastp, diamond_blastx, blastx, blastp)]
Performance at optimal settings for Family accuracy
Sensitive Homology Classification
Initial classifier
[Diagram: Training Data, Classifier, ARO predictions]
NB 7-mer Average Precision: 0.63
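To make the "NB 7-mer" idea concrete, here is a minimal sketch, assuming it means a multinomial naive Bayes classifier over overlapping 7-mer counts. The class name, toy sequences, and labels are all made up for illustration; this is not the AMRtime implementation, whose details the slide does not give.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=7):
    """Yield every overlapping k-mer in a nucleotide sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


class KmerNaiveBayes:
    """Multinomial naive Bayes over k-mer counts with add-one smoothing."""

    def __init__(self, k=7):
        self.k = k
        self.kmer_counts = defaultdict(Counter)  # label -> k-mer -> count
        self.label_counts = Counter()            # label -> number of training reads
        self.vocab = set()                       # all k-mers seen in training

    def fit(self, reads, labels):
        for read, label in zip(reads, labels):
            self.label_counts[label] += 1
            for km in kmers(read, self.k):
                self.kmer_counts[label][km] += 1
                self.vocab.add(km)
        return self

    def predict(self, read):
        total_reads = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_reads in self.label_counts.items():
            counts = self.kmer_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus smoothed log likelihood of each k-mer in the read
            score = math.log(n_reads / total_reads)
            for km in kmers(read, self.k):
                score += math.log((counts[km] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: hypothetical sequences and labels, not real CARD entries.
train_reads = ["ATGAGTATTCAACATTTCCGTGTCGCC", "ATGCGTTATATTCGCCTGTGTATTATC"]
train_labels = ["ARO_A", "ARO_B"]
model = KmerNaiveBayes(k=7).fit(train_reads, train_labels)
print(model.predict("AGTATTCAACATTTCC"))
```

Because every feature is an exact 7-mer, a single sequencing error changes up to seven features of a read, which is one plausible reason such a model struggles on error-containing reads.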
Revised classifier structure: exploiting the ARO
[Diagram: Training Data; AMR Family Classifier; AMR Families; per-family branches (Family 1 ... Family N), each with SMOTE, Family Data, and a Family Classifier; ARO predictions]
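The per-family SMOTE step can be illustrated with a small sketch, assuming SMOTE here refers to the Synthetic Minority Over-sampling Technique: new training points are made by interpolating between an under-represented sample and one of its nearest neighbours from the same class. The function name, parameters, and feature vectors below are hypothetical, and the slide does not say which implementation AMRtime uses.

```python
import random


def smote(samples, n_new, k=2, seed=0):
    """Create n_new synthetic points from one class.

    Each synthetic point lies on the line segment between a randomly chosen
    sample and one of its k nearest neighbours (squared Euclidean distance).
    Requires at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples of this class by distance to the base point.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical feature vectors for a rare ARO within one family.
rare_aro = [[1256.0, 0.0], [1100.0, 40.0], [980.0, 63.0]]
extra = smote(rare_aro, n_new=4)
print(len(extra), "synthetic samples")
```

Oversampling within each family, rather than across the whole database, keeps the synthetic points close to sequences that are actually related.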
Read encoding
Sequence bitscore matrix =
            gene 1   gene 2   ...   gene j-1   gene j
read 1        1256        0   ...          0       63
read 2           0        0   ...          0        0
...            ...      ...   ...        ...      ...
read i-1         0      512   ...          0        0
read i           0        0   ...        785      129
Advantages: read length invariant, low dimensionality, uses filtering data
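The encoding above can be sketched as follows: each read becomes a fixed-length row holding its best bitscore against every reference gene, with zeros where the filtering search reported no hit. The function name, identifiers, and the (read, gene, bitscore) tuple format are assumptions for illustration; the values echo the example matrix.

```python
def bitscore_matrix(reads, genes, hits):
    """Encode reads as rows of best bitscores against each reference gene.

    reads: ordered read identifiers (rows)
    genes: ordered gene identifiers (columns)
    hits:  iterable of (read_id, gene_id, bitscore) from the filtering search
    """
    col = {gene: j for j, gene in enumerate(genes)}
    matrix = {read: [0.0] * len(genes) for read in reads}
    for read, gene, bitscore in hits:
        j = col[gene]
        # Keep only the best-scoring alignment per (read, gene) pair.
        matrix[read][j] = max(matrix[read][j], bitscore)
    return matrix


reads = ["read_1", "read_2", "read_3"]
genes = ["gene_1", "gene_2", "gene_3", "gene_4"]
hits = [
    ("read_1", "gene_1", 1256.0),
    ("read_1", "gene_4", 63.0),
    ("read_3", "gene_3", 785.0),
    ("read_3", "gene_4", 129.0),
    ("read_3", "gene_4", 90.0),  # weaker duplicate hit, ignored
]
matrix = bitscore_matrix(reads, genes, hits)
for read in reads:
    print(read, matrix[read])
```

Every row has one entry per gene regardless of read length, which is what makes the representation read length invariant.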
Held-out test results
[Figure: Normalised Bitscore Random Forest, Family Test Performance; y-axis: Proportion; series: Precision, Recall]
Mean Precision: 0.995, Mean Recall: 0.985
ARO level classification more variable
[Figure: Median Precision-Recall Within Families; x-axis: Ordered AMR Family Index; y-axis: Proportion; series: Precision, Recall]
On-going work
• Soft-threshold (i.e. propagating probabilities through layers).
• Multiset labels based on sequence redundancy within families.
• Threshold identification for variant model counts.
• Metamodel rule parsing.
• Galaxy bindings (CARD/IRIDA integration).
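One possible reading of the soft-threshold item is that, instead of routing each read to the single most probable family, the family probability is multiplied through to each ARO within it. The sketch below assumes that interpretation; the function name and all probabilities are invented for illustration, and the listed work items do not specify the actual design.

```python
def propagate(family_probs, aro_probs_by_family):
    """Combine P(family | read) with P(ARO | family, read) into P(ARO | read)."""
    joint = {}
    for family, p_family in family_probs.items():
        for aro, p_aro in aro_probs_by_family.get(family, {}).items():
            joint[aro] = joint.get(aro, 0.0) + p_family * p_aro
    return joint


# Made-up probabilities for one read.
family_probs = {"TEM beta-lactamase": 0.55, "SHV beta-lactamase": 0.45}
aro_probs_by_family = {
    "TEM beta-lactamase": {"TEM-1": 0.5, "TEM-2": 0.5},
    "SHV beta-lactamase": {"SHV-1": 0.95, "SHV-2": 0.05},
}

joint = propagate(family_probs, aro_probs_by_family)
best = max(joint, key=joint.get)
print(best, round(joint[best], 4))
```

In this toy case a hard threshold would commit to the TEM family and then be unable to separate its two AROs, while the propagated probabilities favour the SHV ARO that the second layer is confident about.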
Summary
Conclusions
• Direct homology searches are surprisingly poor for AMR metagenomics.
• K-mer based approaches fall flat with sequencing error, low coverage and sparse labels.
• Direct homology search results ARE useful when combined with machine learning.
• The Antibiotic Resistance Ontology provides useful structure to improve predictions.
• AMRtime: coming soon to CARD and your local government genomic epidemiology platform.
Acknowledgements
• McMaster University: Brian Alcock and Andrew McArthur
• Simon Fraser University: Fiona Brinkman
• Dalhousie University: Robert Beiko
• Funding: Donald Hill Family Fellowship, Genome Canada Grant.
Questions?
Insufficient Intrafamily Signal
[Figure: Intra-Family Shared 250mers; x-axis: Number of Shared 250mers; y-axis: AMR Family (TEM, SHV, OCH, MIR, LEN, GES, PDC, NDM, GOB, KPC, SME, GIM, TMB, BEL, CfxA and VEB beta-lactamases)]
Interfamily Collisions