A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling
Ari Ugarte, Riccardo Vicedomini, Juliana Silva Bernardes, Alessandra Carbone
9 September, 2018
Laboratory of Computational and Quantitative Biology (LCQB) - Sorbonne University
Introduction
Metagenomic analysis workflow
Functional profiling
Protein domain identification in metagenomic (MG) and metatranscriptomic (MT) data makes it possible to study which main functions are expressed in a specific environment.
Protein domains
Size of individual structural domains:
• from 36 aa to 692 aa
• most of them have < 200 residues
• average of ∼100 residues
Domain identification in short MG/MT reads
• Very short fragments (150-300 bp)
Two possible approaches:
• Assembly-based (e.g., HMM-GRASPx)
• Direct read annotation (e.g., MetaCLADE, UProC)
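As a back-of-the-envelope illustration of why such reads are hard to annotate, the sketch below converts read length into the maximum number of encoded residues (one codon is 3 bp). The helper name `max_residues` is hypothetical and is not part of any tool mentioned here.

```python
def max_residues(read_length_bp: int) -> int:
    """Upper bound on the amino acids encoded by a read (3 bp per codon)."""
    return read_length_bp // 3


# Read lengths quoted on the slide: 150-300 bp.
for bp in (150, 300):
    print(f"{bp} bp -> at most {max_residues(bp)} aa")
```

A 150-300 bp read therefore covers at most 50-100 residues, so it often spans only part of a domain, given that domains average about 100 residues.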
How can we represent a domain?
• Multiple sequence alignment (MSA) of homologous domain sequences
• Probabilistic representations:
  - profile hidden Markov models (pHMMs)
  - position-specific scoring matrices (PSSMs)
[Figure: sequence logo of a domain alignment, generated with WebLogo 3.6.0]
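To make the PSSM idea concrete, here is a minimal sketch that turns a toy gap-free MSA into per-column log-odds scores. It assumes a uniform background distribution and a simple additive pseudocount. Both are simplifications chosen for illustration, and the function names `build_pssm` and `score` are hypothetical. Real PSSMs (for example those produced by PSI-BLAST) use more elaborate weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Assumption: uniform background frequency (1/20) for every residue.
    """
    length = len(msa[0])
    assert all(len(s) == length for s in msa), "aligned sequences must have equal length"
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Gap characters (or any non-standard symbol) are ignored.
        counts = Counter(s[col] for s in msa if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, fragment):
    """Sum the column scores of a fragment aligned to the start of the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, fragment))
```

With a toy alignment such as `["WSW", "WSY", "WAW"]`, a fragment that matches the conserved columns scores higher than an unrelated one.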
The Pfam database
• A large collection of protein domain families
• Each family is represented by two MSAs (Full and Seed) and a profile HMM
CLADE (CLoser sequences for Annotations Directed by Evolution)
MetaCLADE
MetaCLADE - Main features
• Extends CLADE to handle MG/MT data
• Puts all domain hits in a two-dimensional space
• Uses two-dimensional thresholds (defined with a Naive Bayes classifier) to assign confidence values to predictions
• Provides data visualization of functional annotation
MetaCLADE - General overview
[Figure: pipeline overview.
Domain-dependent probability space pre-computation: for each domain D_i in the CLADE model library (CCMs and global-consensus models), sets of positive and negative sequences are used to identify domain-specific separating parameters.
Hit identification: predicted CDS/ORFs in MG/MT reads are the input sequences (e.g., MAKLKVANDKA...); conservation profiles give the domain hits on each input sequence.
Domain prediction: 1. removal of overlapping hits; 2. selection of hits with prob ≥ 0.9; 3. selection of hits with best bit-score and % identity.]
MetaCLADE - Model construction
• Inherited from CLADE
• Several models are built in order to represent each known Pfam family
• For each Pfam domain D_i (i = 1, ..., M):
  - global consensus model: pHMM_i, built from the Seed_i alignment
  - clade-centered models (CCMs): sequences S_i^1, ..., S_i^{n_i} (n_i ≤ 350) are taken from Full_i, and each is run with PSI-BLAST against NR to build CCM_i^1, ..., CCM_i^{n_i}
• CLADE library: global consensus models (pHMM_1, ..., pHMM_M) and clade-centered models (CCM_1^1, ..., CCM_M^{n_M})
• Pfam27: 14 831 domains ⇒ 2 389 235 CCMs (PSSMs) and 14 831 pHMMs
MetaCLADE - General idea
[Figure: candidate domain hits (D1, D2, D3, D6, D8, D11) placed on an input sequence]
MetaCLADE - Training set
• The set of positive sequences for a domain D_i is based on suffixes, prefixes and random fragments from Seed_i.
• The set of negative sequences for a domain D_i is based on three different methods:
  1. 2-mer shuffling
  2. sequence reversal
  3. Markov model (probabilities based on 4-mers of Seed_i)
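The sketch below illustrates how such training sequences could be generated. It is a simplified reading of the slide and not the MetaCLADE implementation. All function names are hypothetical, the 2-mer shuffling permutes non-overlapping 2-mer blocks (one simple variant among several), and the Markov model is an order-3 chain estimated from the 4-mers of the seed sequences.

```python
import random
from collections import defaultdict


def positive_fragments(seq, frag_len, n_random, rng):
    """Prefix, suffix and random fragments of one seed sequence."""
    frag_len = min(frag_len, len(seq))
    frags = [seq[:frag_len], seq[-frag_len:]]
    for _ in range(n_random):
        start = rng.randrange(len(seq) - frag_len + 1)
        frags.append(seq[start:start + frag_len])
    return frags


def shuffle_2mers(seq, rng):
    """Negative example: permute non-overlapping 2-mer blocks (simplified variant)."""
    blocks = [seq[i:i + 2] for i in range(0, len(seq), 2)]
    rng.shuffle(blocks)
    return "".join(blocks)


def reverse(seq):
    """Negative example: the reversed sequence."""
    return seq[::-1]


def markov_sample(seeds, length, rng, k=4):
    """Negative example: sample from an order-(k-1) chain built from seed k-mers.

    Assumes at least one seed has k-1 or more residues.
    """
    transitions = defaultdict(list)
    for s in seeds:
        for i in range(len(s) - k + 1):
            transitions[s[i:i + k - 1]].append(s[i + k - 1])
    alphabet = sorted(set("".join(seeds)))
    out = rng.choice([s[:k - 1] for s in seeds if len(s) >= k - 1])
    while len(out) < length:
        choices = transitions.get(out[-(k - 1):])
        # Fall back to a uniform draw when the current context was never observed.
        out += rng.choice(choices) if choices else rng.choice(alphabet)
    return out[:length]
```

Passing an explicit `random.Random` instance keeps the generation reproducible, which helps when the positive and negative sets need to be regenerated.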
MetaCLADE - Training set
• Negative sequences are generated until |negative sequences| ≥ 1/2 |positive sequences|
• Generation of a two-dimensional space by considering the bit score and the mean bit score of domain hits
MetaCLADE - Hit classification
• A discrete version of a naive Bayes classifier is used in order to partition the hit space into regions with an associated probability
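A minimal sketch of a discretized naive Bayes classifier over a two-dimensional hit space is given below. It assumes each hit is a pair (bit score, mean bit score), bins each coordinate with a fixed width, and applies Laplace smoothing. The class name, the bin widths and the smoothing constant are illustrative assumptions and are not parameters taken from MetaCLADE.

```python
import math
from collections import Counter


class DiscreteNaiveBayes:
    """Two-class naive Bayes over binned 2-D points (hypothetical sketch)."""

    def __init__(self, widths=(10.0, 0.25), alpha=1.0):
        self.widths = widths  # bin width per coordinate (assumed values)
        self.alpha = alpha    # Laplace smoothing constant

    def _cell(self, point):
        return tuple(int(v // w) for v, w in zip(point, self.widths))

    def fit(self, positives, negatives):
        """Estimate priors and per-coordinate bin counts; both sets must be non-empty."""
        data = {1: [self._cell(p) for p in positives],
                0: [self._cell(p) for p in negatives]}
        total = len(positives) + len(negatives)
        self.prior = {c: len(cells) / total for c, cells in data.items()}
        self.sizes = {c: len(cells) for c, cells in data.items()}
        self.counts = {c: [Counter(cell[d] for cell in cells) for d in range(2)]
                       for c, cells in data.items()}
        # Number of bins seen per coordinate, plus one slot for unseen bins.
        self.n_bins = [len({cell[d] for cells in data.values() for cell in cells}) + 1
                       for d in range(2)]
        return self

    def prob_positive(self, point):
        """Posterior probability that the hit at `point` is a true domain hit."""
        cell = self._cell(point)
        log_joint = {}
        for c in (0, 1):
            lp = math.log(self.prior[c])
            for d in range(2):
                num = self.counts[c][d][cell[d]] + self.alpha
                den = self.sizes[c] + self.alpha * self.n_bins[d]
                lp += math.log(num / den)
            log_joint[c] = lp
        top = max(log_joint.values())
        norm = sum(math.exp(v - top) for v in log_joint.values())
        return math.exp(log_joint[1] - top) / norm
```

Because the coordinates are binned, every cell of the grid receives a single posterior value, which is one way to obtain regions of the hit space with an associated probability.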
MetaCLADE - Hit filtering and domain prediction
1. Removal of overlapping hits
2. Selection of hits with prob ≥ 0.9
3. Selection of hits with best bit-score and % identity
[Figure: candidate domain hits (D1, D2, D3, D6, D8, D11) on the input sequence are filtered step by step. Final prediction: D2 D8 D11]
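The three filtering steps can be sketched as follows. This is one plausible reading of the slide and not the actual MetaCLADE code. It assumes that step 1 resolves overlaps among hits of the same domain, that step 3 resolves the remaining overlaps across domains, and that "best" means highest bit score with ties broken by % identity. The `Hit` record and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str
    start: int        # inclusive coordinate on the input sequence
    end: int          # inclusive coordinate on the input sequence
    bit_score: float
    identity: float   # fraction of identical positions
    prob: float       # confidence assigned by the classifier


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def best_non_overlapping(hits):
    """Greedy selection: visit hits from best to worst, keep those that fit."""
    kept = []
    for h in sorted(hits, key=lambda h: (h.bit_score, h.identity), reverse=True):
        if not any(overlaps(h, k) for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h.start)


def predict_domains(hits, min_prob=0.9):
    # Step 1 (assumed reading): drop overlapping hits of the same domain.
    by_domain = {}
    for h in hits:
        by_domain.setdefault(h.domain, []).append(h)
    step1 = [h for group in by_domain.values() for h in best_non_overlapping(group)]
    # Step 2: keep only confident hits.
    step2 = [h for h in step1 if h.prob >= min_prob]
    # Step 3: among hits that still overlap, keep the best bit score / identity.
    return best_non_overlapping(step2)
```

The greedy selection is a deliberate simplification. An exact alternative would solve a weighted interval scheduling problem, at the cost of more code.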
Results
MetaCLADE on metatranscriptomics
• Marine eukaryotic phytoplankton metatranscriptomes
• 1.5M high-quality cDNA sequences, average length of 242 bp
MetaCLADE - Functional Annotation
MetaCLADE - MetaCLADE vs HMMER (Ion transport)
MetaCLADE - Higher resolution
MetaCLADE - Comparison with other methods
Guerrero Negro Hypersaline Microbial Mats (GNHM): 100/200 bp

100 bp
Tool              TP        FP       FN        TPR    PPV    F-score
UProC             336 302   20 258   448 249   42.9   94.3   58.9
MetaCLADE         323 009   12 145   461 542   41.2   96.4   57.7
MetaCLADE+UProC   405 734   20 145   378 817   51.7   95.3   67.0
UProC+MetaCLADE   406 370   25 965   378 181   51.8   94.0   66.8
HMMGRASP          328 729   37 224   455 822   41.9   89.8   57.1

200 bp
Tool              TP        FP       FN        TPR    PPV    F-score
UProC             264 787   19 060   220 368   54.6   93.3   68.9
MetaCLADE         347 936   18 138   137 219   71.7   95.0   81.7
MetaCLADE+UProC   363 667   21 479   121 488   75.0   94.4   83.6
UProC+MetaCLADE   364 641   28 444   120 514   75.2   92.8   83.0
HMMGRASP          290 155   37 189   195 000   59.8   88.6   71.4

TPR, PPV and F-score are percentages.
[Figure: Precision-Recall curve (precision vs. recall). UProC (AUC=0.529), MetaCLADE (AUC=0.699), MetaCLADE+UProC (AUC=0.727)]
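For reference, the three summary columns follow from the raw counts by the standard definitions TPR = TP/(TP+FN), PPV = TP/(TP+FP) and F-score = 2·PPV·TPR/(PPV+TPR). The short sketch below (the function name `metrics` is hypothetical) recomputes them as percentages. Applied to the counts TP = 347 936, FP = 18 138 and FN = 137 219 that appear in the comparison, it gives the values 71.7, 95.0 and 81.7 reported alongside them.

```python
def metrics(tp: int, fp: int, fn: int):
    """Return (TPR, PPV, F-score) as percentages rounded to one decimal."""
    tpr = tp / (tp + fn)               # recall / sensitivity
    ppv = tp / (tp + fp)               # precision
    f_score = 2 * ppv * tpr / (ppv + tpr)
    return tuple(round(100 * x, 1) for x in (tpr, ppv, f_score))


print(metrics(347_936, 18_138, 137_219))
```

The same check confirms that TP + FN is constant within each read length, since it equals the number of true domain occurrences in the dataset.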
Future improvements
• More domains and new models for an improved annotation
• Constructing a library of conserved small motifs
• Annotation of longer sequences
• Reduction of the number of redundant models
• New criteria to filter overlapping hits
Conclusions
• Learning about the functional activity of the community and its sub-communities is a crucial step towards understanding species interactions and large-scale environmental impact
• Functional annotation methods need to be as precise as possible in identifying remote homology
• MetaCLADE allows for the discovery of patterns in highly divergent sequences
• The number of unknown sequences will keep growing, hence probabilistic models are expected to play a major role in the annotation of sequences spanning unrepresented sequence spaces
Thank you for your attention!
Acknowledgments
• Ari Ugarte
• Juliana Silva Bernardes
• Alessandra Carbone
References
A. Ugarte, R. Vicedomini, J.S. Bernardes, A. Carbone, "A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling," Microbiome, 2018, 6:149.
J.S. Bernardes, C. Vaquero, G. Zaverucha, A. Carbone, "Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence," PLoS Computational Biology, 2016, 12(7):e1005038.