AMR and machine-learning
Prediction of AMR from metagenomes among other things
Finlay Maguire
finlaymaguire@gmail.com
December 3, 2019
Faculty of Computer Science, Dalhousie University
Table of contents
1. Genomic Phenotype Prediction
2. Non-Bioinformatics Interlude
3. AMRtime
Genomic Phenotype Prediction
Antibiotic Susceptibility Testing (Bradley et al., 2015)
AAFC Salmonella Data-set (figure: tree of isolates; leaf labels are isolate IDs with a code in parentheses, e.g. 3132 (Q2), 3314 (S1); scale bar 0.056229)
Genomic RGI Predictions
Linking AMR determinants to Phenotype (McArthur et al., 2013)
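Logistic Regression

RGI (rows: genome 1 ... genome I; columns: amr 1 ... amr J):
\[ \mathrm{RGI} = \begin{bmatrix} 1 & 0 & \cdots & 1 \\ 0 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \]

AST (rows: genome 1 ... genome I; columns: abx 1 ... abx K):
\[ \mathrm{AST} = \begin{bmatrix} S & S & \cdots & R \\ R & R & \cdots & S \\ \vdots & \vdots & \ddots & \vdots \\ S & S & \cdots & S \end{bmatrix} \]

\[ \beta \, \mathrm{RGI} = \mathrm{AST} \]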
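The relation β RGI = AST can be sketched, for a single antibiotic, as an ordinary logistic regression over the binary RGI matrix. This is a minimal illustration only: the toy matrix, the labels, the learning rate and the helper names (`fit_logistic`, `predict_resistant`) are assumptions and are not the data or code behind the talk.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit one weight per AMR determinant plus a bias by batch gradient descent."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            p = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
            err = p - label  # gradient of the log-loss w.r.t. the linear score
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def predict_resistant(weights, bias, row):
    """True if the predicted probability of resistance is at least 0.5."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) >= 0.5

# Hypothetical RGI matrix: rows are genomes, columns are AMR determinants.
rgi = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
# Hypothetical AST calls for one antibiotic (1 = R, 0 = S).
# In this toy set resistance tracks the first determinant.
ast = [1, 0, 1, 0, 1, 0]

weights, bias = fit_logistic(rgi, ast)
predictions = [int(predict_resistant(weights, bias, row)) for row in rgi]
```

With K antibiotics, one such model would be fitted per AST column, and the fitted weights are what a "learnt weights" plot would show.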
Set-Covering Machines
Genomes -> Decompose into K-mers -> Genomic K-mers
Genomic K-mers + AST -> Set-Covering Machine -> Boolean K-mer Rules
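The pipeline on this slide (genomes decomposed into k-mers, then boolean k-mer rules learnt against AST) can be sketched as a simplified greedy conjunction learner in the spirit of a set-covering machine. The utility function, the `penalty` parameter, the value k=4 and the toy sequences are all assumptions chosen for illustration; this is not the implementation used in the talk.

```python
def kmers(sequence, k):
    """Set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def violates(rule, kmer_set):
    """A rule is (kmer, must_be_present); it is violated when presence differs."""
    kmer, must_be_present = rule
    return (kmer in kmer_set) != must_be_present

def utility(rule, kmer_sets, negatives, positives, penalty):
    """Negatives correctly rejected minus a penalty for positives lost."""
    rejected = sum(violates(rule, kmer_sets[i]) for i in negatives)
    lost = sum(violates(rule, kmer_sets[i]) for i in positives)
    return rejected - penalty * lost

def train_scm(genomes, labels, k=4, max_rules=3, penalty=1.0):
    """Greedily build a conjunction of presence/absence rules (1 = resistant)."""
    kmer_sets = [kmers(g, k) for g in genomes]
    vocabulary = sorted(set().union(*kmer_sets))
    candidates = [(km, present) for km in vocabulary for present in (True, False)]
    positives = {i for i, y in enumerate(labels) if y == 1}
    negatives = {i for i, y in enumerate(labels) if y == 0}
    rules = []
    while candidates and negatives and len(rules) < max_rules:
        best = max(candidates,
                   key=lambda r: utility(r, kmer_sets, negatives, positives, penalty))
        if utility(best, kmer_sets, negatives, positives, penalty) <= 0:
            break
        rules.append(best)
        # Keep only the examples that still satisfy every rule chosen so far.
        negatives = {i for i in negatives if not violates(best, kmer_sets[i])}
        positives = {i for i in positives if not violates(best, kmer_sets[i])}
    return rules

def predict_scm(rules, genome, k=4):
    """Predict resistant (1) only if the genome satisfies every rule."""
    found = kmers(genome, k)
    return int(all(not violates(rule, found) for rule in rules))

# Hypothetical toy genomes: the resistant ones share the motif GATTACA.
genomes = ["ACGTGATTACAACGT", "TTGCGATTACATTGC", "ACGTACGTACGTACG", "TTGCTTGCTTGCTTG"]
labels = [1, 1, 0, 0]
rules = train_scm(genomes, labels)
```

The learnt model is a short, human-readable list of k-mer rules, which is what makes this family of models attractive for interpretation.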
AST Prediction Performance
Figure panels: A: RGI, B: RGI-efflux, C: Logistic Regression, D: Set Covering Machines.
A Major Disagreement is an overprediction of resistance (predicted R, observed S); a Very Major Disagreement is an underprediction (predicted S, observed R).
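The two disagreement types defined on this slide can be counted directly from paired predicted and observed calls. A minimal sketch follows; the choice of denominators (observed-susceptible isolates for major, observed-resistant isolates for very major) is an assumption, since the slide does not say how the rates were normalised, and the function name is hypothetical.

```python
def disagreement_rates(predicted, observed):
    """Return (major_rate, very_major_rate) for paired 'S'/'R' calls.

    Major: predicted R but observed S (overprediction of resistance).
    Very major: predicted S but observed R (underprediction).
    Denominators are an assumption, not taken from the slides.
    """
    pairs = list(zip(predicted, observed))
    major = sum(p == "R" and o == "S" for p, o in pairs)
    very_major = sum(p == "S" and o == "R" for p, o in pairs)
    n_susceptible = sum(o == "S" for _, o in pairs)
    n_resistant = sum(o == "R" for _, o in pairs)
    major_rate = major / n_susceptible if n_susceptible else 0.0
    very_major_rate = very_major / n_resistant if n_resistant else 0.0
    return major_rate, very_major_rate

# Hypothetical calls for six isolates and one antibiotic.
predicted = list("RRSSRS")
observed = list("RSSRRS")
major_rate, very_major_rate = disagreement_rates(predicted, observed)
```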
Learnt features/weights (figure panels A and B)
Extending beyond Salmonella: ARO Predictions (Kara Tsang)
Extending beyond Salmonella: Logistic Regression
Genomic AST Prediction
• Using direct annotations works very poorly across different organisms and resistance mechanisms.
• Even very simple logistic regression models greatly improve predictions.
• Investigation of learnt weights and features can be very scientifically informative.
Non-Bioinformatics Interlude
• Non-profits have data and lots of contextualising knowledge.
• No time or resources to analyse or use it.
• Informaticians have the skills and resources but no specific understanding of the context.
• Many low-hanging fruit that can make big differences.
Refugee Women’s Health Clinic
Staff Scheduling
Language Development in Autism: Qualitative Social Media Analysis (Tamara Sorenson-Duncan)
Alpha Diversity of Posting Activity
Beta Diversity of Posting Activity
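The slides do not say which diversity indices were applied to posting activity. As a hedged illustration of the idea, the sketch below uses Shannon entropy for alpha diversity (within one poster) and Bray-Curtis dissimilarity for beta diversity (between two posters). The topic names and counts are invented, and both inputs are assumed to be non-empty.

```python
import math

def shannon_diversity(counts):
    """Alpha diversity: Shannon entropy (natural log) of one count profile."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)

def bray_curtis(counts_a, counts_b):
    """Beta diversity: Bray-Curtis dissimilarity between two count profiles."""
    keys = set(counts_a) | set(counts_b)
    shared = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 1.0 - 2.0 * shared / total

# Hypothetical posts-per-topic counts for two posters.
poster_a = {"speech": 5, "school": 5}
poster_b = {"diet": 4}

alpha_a = shannon_diversity(poster_a)
beta_ab = bray_curtis(poster_a, poster_b)
```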
Other on-going Projects
• Halifax Community Learning Network
• Shelter Nova Scotia
• 211 Nova Scotia
AMRtime
AMR-metagenomics
Genomes -> (Sequencing) -> Reads -> (AMR detection) -> AMR Genes
Why is this difficult?
AMR genes are rare genomically (figure: AMR Reads in Metagenome (0.643%); log(Read Count) for All (~324M) vs AMR (~2.1M) reads; 2184 CARD-Prevalence Genomes at 1-10X abundance)
AMR genes have wildly different abundances (figure: 1236 AMR PATRIC genomes)
AMR genes have highly variable diversity
AMR sequence space overlaps (figure: MDS of CARD Proteins by BLASTP-%ID, two panels: Actual Families and Affinity Clusters; Adj. Rand = 0.30041)
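The "Adj. Rand" value quoted on this slide is presumably the adjusted Rand index between the curated family labels and the cluster labels. A minimal sketch of that index from its standard pair-counting formula is given below; the function name and the toy labelings are assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of items placed together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Hypothetical labelings of four proteins.
families = ["fam1", "fam1", "fam2", "fam2"]
same_partition = ["c2", "c2", "c1", "c1"]   # identical up to renaming
crossed = ["c1", "c2", "c1", "c2"]          # splits every family

ari_same = adjusted_rand_index(families, same_partition)
ari_crossed = adjusted_rand_index(families, crossed)
```

A value of 1 means the two partitions agree exactly, values near 0 mean agreement no better than chance, and negative values mean worse than chance.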
Insufficient Signal in 250bp Fragments (figure: NDM Multiple Sequence Alignment)
Other constraints
• No point doing what we do if people can’t use it.
• Limited hardware requirements (a standard workstation or instance: < 8-12 GB, 1-8 cores).
• Fast enough (< 12 hours).
• Easy to install/configure.
• Easy to use.
• Easy to update.
AMRtime
AMRtime structure (flow diagram)
Legend: Input files, Processes, Intermediate files, Output files
Nodes, in diagram order: Metagenomic Reads; AMR Filtering; Filtered reads; CARD; Sensitive Homology Classification; Homology predictions; Variant Identification; Metamodels; Variant predictions; Metamodel predictions
Read filtering
Homology Filter Approaches (figure: Relative Computational Demands; Max Resident Memory (GB) vs Elapsed Time (hours) for blastn, biobloom, groot, bwa, bowtie2, hmmsearch_nt, blastx, diamond_blastx, paladin, blastp, diamond_blastp, hmmsearch_aa)
Precision-Recall of Homology Search (figure: Precision vs Recall by paradigm: BWT, BLAST, k-mer, HMM)
Optimising for recall (figure: Precision vs Recall, both axes 0.90 to 1.00, for blastx, bwa, diamond_blastx, paladin, blastp, diamond_blastp)
Sensitive Homology Classification
Dealing with imbalanced training data
Simulated AMR Reads (.fq) + Labels (.tsv) -> Encoding -> Encoded Reads
Encoded Reads -> Stratified Test-Train (20%) Split -> Training Data + Testing Data
Training Data -> SMOTE -> Resampled Training Data -> Stratified 5-fold CV -> Training Data Folds
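The first two stages of this workflow, a stratified 20% hold-out followed by SMOTE on the training portion only, can be sketched in plain Python. This is a simplified SMOTE (interpolation towards one of the k nearest same-class neighbours, oversampling every class to the majority size); the function names, k=2, the seeds and the toy vectors are assumptions, and the stratified 5-fold CV step is left out.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(vectors, labels, k=2, seed=0):
    """Oversample each minority class to the majority size by interpolation."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for vector, label in zip(vectors, labels):
        by_label[label].append(vector)
    target = max(len(members) for members in by_label.values())
    out_vectors, out_labels = list(vectors), list(labels)
    for label, members in by_label.items():
        for _ in range(target - len(members)):
            base = rng.choice(members)
            others = [m for m in members if m is not base]
            neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
            partner = rng.choice(neighbours) if neighbours else base
            gap = rng.random()  # position along the segment base -> partner
            out_vectors.append([b + gap * (p - b) for b, p in zip(base, partner)])
            out_labels.append(label)
    return out_vectors, out_labels

# Hypothetical encoded reads: 10 from a common class, 5 from a rare one.
vectors = [[float(i), 0.0] for i in range(10)] + [[float(i), 5.0] for i in range(5)]
labels = ["common"] * 10 + ["rare"] * 5

train_idx, test_idx = stratified_split(labels)
train_X = [vectors[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]
resampled_X, resampled_y = smote(train_X, train_y)
class_sizes = Counter(resampled_y)
```

Resampling after the split keeps synthetic points out of the held-out test data.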
What is balance?
• Different gene lengths within families (coverage vs read number)?
• Different family sizes?
• Different family diversity?
• Using a generator to improve on SMOTE.
Initial classifier
Training Data -> Classifier -> ARO predictions
NB 7-mer Average Precision: 0.63 %
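An "NB 7-mer" classifier can be read as a naive Bayes model over 7-mer counts. The sketch below is a multinomial naive Bayes with Laplace smoothing written from scratch; the class name, the smoothing value, the toy reads and the placeholder labels are assumptions, and the slides do not specify which naive Bayes variant was used.

```python
import math
from collections import Counter, defaultdict

def kmer_counts(read, k=7):
    """Count every overlapping k-mer in a read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

class KmerNaiveBayes:
    """Multinomial naive Bayes over k-mer counts with Laplace smoothing."""

    def __init__(self, k=7, alpha=1.0):
        self.k = k
        self.alpha = alpha

    def fit(self, reads, labels):
        self.class_counts = Counter(labels)
        self.kmer_totals = defaultdict(Counter)
        self.vocabulary = set()
        for read, label in zip(reads, labels):
            counts = kmer_counts(read, self.k)
            self.kmer_totals[label].update(counts)
            self.vocabulary.update(counts)
        return self

    def predict(self, read):
        counts = kmer_counts(read, self.k)
        n_reads = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, class_count in self.class_counts.items():
            total = sum(self.kmer_totals[label].values())
            score = math.log(class_count / n_reads)  # class prior
            for kmer, count in counts.items():
                if kmer not in self.vocabulary:
                    continue  # k-mers never seen in training carry no signal
                likelihood = (self.kmer_totals[label][kmer] + self.alpha) / (
                    total + self.alpha * vocab_size)
                score += count * math.log(likelihood)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical training reads with placeholder labels (not real ARO terms).
reads = ["ACGTACGTACGTACGT", "ACGTACGTTCGTACGA",
         "GGCCTTAAGGCCTTAA", "GGCCTTAAGGCATTAA"]
labels = ["ARO:hypothetical_A", "ARO:hypothetical_A",
          "ARO:hypothetical_B", "ARO:hypothetical_B"]
model = KmerNaiveBayes().fit(reads, labels)
```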
Revised classifier structure: exploiting the ARO
Training Data -> AMR Family Classifier -> AMR Families (Family 1 ... Family N)
Per family: Family n Data -> Family n SMOTE -> Family n Classifier -> ARO predictions
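The two-level structure on this slide, one AMR family classifier that routes each read to a family-specific ARO classifier, can be sketched as a thin wrapper around any model with `fit` and `predict`. The nearest-centroid base model, the class names and the toy data are assumptions for illustration, and the per-family SMOTE step is omitted.

```python
from collections import defaultdict

class NearestCentroid:
    """Tiny stand-in classifier: predict the label with the closest mean vector."""

    def fit(self, X, y):
        grouped = defaultdict(list)
        for row, label in zip(X, y):
            grouped[label].append(row)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, row):
        return min(
            self.centroids,
            key=lambda label: sum((a - b) ** 2
                                  for a, b in zip(self.centroids[label], row)),
        )

class HierarchicalAroClassifier:
    """Family-level model first, then one ARO-level model per family."""

    def __init__(self, make_classifier=NearestCentroid):
        self.make_classifier = make_classifier

    def fit(self, X, families, aros):
        self.family_model = self.make_classifier().fit(X, families)
        rows_by_family = defaultdict(list)
        aros_by_family = defaultdict(list)
        for row, family, aro in zip(X, families, aros):
            rows_by_family[family].append(row)
            aros_by_family[family].append(aro)
        self.aro_models = {
            family: self.make_classifier().fit(rows, aros_by_family[family])
            for family, rows in rows_by_family.items()
        }
        return self

    def predict(self, row):
        family = self.family_model.predict(row)
        return family, self.aro_models[family].predict(row)

# Hypothetical encoded reads with placeholder family and ARO labels.
X = [[0, 0], [0, 1], [2, 0], [2, 1], [10, 10], [10, 11], [12, 10], [12, 11]]
families = ["fam1"] * 4 + ["fam2"] * 4
aros = ["a1", "a1", "a2", "a2", "b1", "b1", "b2", "b2"]
model = HierarchicalAroClassifier().fit(X, families, aros)
```

Each family-level model only has to separate the AROs within its own family, so the label space per classifier stays small.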
Sequence similarity encoding

Sequence bitscore matrix (rows: read 1 ... read i; columns: gene 1 ... gene j):
\[ \begin{bmatrix} 1256 & 0 & \cdots & 0 & 63 \\ 0 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 512 & \cdots & 0 & 0 \\ 0 & 0 & \cdots & 785 & 129 \end{bmatrix} \]

Advantages: read length invariant, low dimensionality, uses filtering data computation
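Building the bitscore matrix from homology-search hits amounts to a pivot from (read, gene, bitscore) records to one row per read. A minimal sketch follows; the tuple layout, the identifiers and the rule of keeping the best bitscore per read-gene pair are assumptions, with the values chosen to mirror the example matrix on this slide.

```python
def bitscore_matrix(hits, reads, genes):
    """Pivot (read, gene, bitscore) hits into a dense reads x genes matrix.

    Reads with no hit to a gene get 0; if a read hits the same gene more
    than once, the highest bitscore is kept (an assumption).
    """
    row_of = {read: i for i, read in enumerate(reads)}
    col_of = {gene: j for j, gene in enumerate(genes)}
    matrix = [[0.0] * len(genes) for _ in reads]
    for read, gene, bitscore in hits:
        i, j = row_of[read], col_of[gene]
        matrix[i][j] = max(matrix[i][j], float(bitscore))
    return matrix

# Hypothetical hits, e.g. parsed from tabular homology-search output.
reads = ["read_1", "read_2", "read_3", "read_4"]
genes = ["gene_1", "gene_2", "gene_3", "gene_4"]
hits = [
    ("read_1", "gene_1", 1256),
    ("read_1", "gene_4", 63),
    ("read_3", "gene_2", 512),
    ("read_4", "gene_3", 785),
    ("read_4", "gene_4", 129),
    ("read_4", "gene_4", 90),   # weaker duplicate hit, ignored
]
encoded = bitscore_matrix(hits, reads, genes)
```

Every read becomes a vector whose length is the number of reference genes, whatever the read length, and the scores come from the search already run for filtering.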
Cross-Validation
• Encodings:
  • Raw sequence
  • Filtering homology search family similarity/dissimilarity
  • Manual feature extraction (GC/TNF/compositional)
  • One-hot K-mer representation
  • K-mer embeddings (DNA2vec/BioVec)
• Classifiers:
  • Random Forests
  • Naive Bayes
  • Logistic Regression
  • Neural Networks of varying architecture (Torch)
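As one concrete instance of the encodings listed, the manual GC/TNF features can be sketched as a fixed-length vector of GC content followed by tetranucleotide frequencies. The 256-dimensional TNF layout (no reverse-complement merging), the handling of ambiguous bases and the function names are assumptions.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order, so every read maps to the same layout.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def gc_content(read):
    """Fraction of called bases (A/C/G/T) that are G or C."""
    read = read.upper()
    called = sum(read.count(base) for base in "ACGT")
    return (read.count("G") + read.count("C")) / called if called else 0.0

def tetranucleotide_frequencies(read):
    """Relative frequency of each tetramer; windows with ambiguous bases are skipped."""
    read = read.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(read) - 3):
        kmer = read[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]

def encode_read(read):
    """GC content followed by the 256 tetranucleotide frequencies."""
    return [gc_content(read)] + tetranucleotide_frequencies(read)

features = encode_read("ACGTACGTAC")
```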
Cross-validation (figure: Family Cross-Validation Performance; Precision and Recall as proportions, per Model)
Held-out test results (figure: Family Test Performance; Precision and Recall as proportions for the Normalised Bitscore Random Forest)
ARO level classification more variable (figure: Median Precision-Recall Within Families; Precision and Recall as proportions vs Ordered AMR Family Index)
Family diversity as explanation? (figure: Precision vs AMR Family Cardinality)
Within family label imbalance (figure: Precision vs ARO Proportion of Family Size)