A multi-source domain annotation pipeline for quantitative - PowerPoint PPT Presentation

A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling
Ari Ugarte, Riccardo Vicedomini, Juliana Silva Bernardes, Alessandra Carbone
9 September, 2018
Laboratory of Computational and Quantitative Biology (LCQB) - Sorbonne University

Introduction

Metagenomic analysis workflow
Functional profiling
Protein domain identification in MG/MT (metagenomic/metatranscriptomic) data makes it possible to study which main functions are expressed in a specific environment.

Protein domains
Size of individual structural domains:
• from 36 aa to 692 aa
• most of them have < 200 residues
• average of ∼ 100 residues

Domain identification in short MG/MT reads
• Very short fragments (150 - 300 bp)
Two possible approaches:
• Assembly-based (e.g., HMM-GRASPx)
• Direct read annotation (e.g., MetaCLADE, UProC)

How can we represent a domain?
• Multiple sequence alignment (MSA) of homologous domain sequences
• Probabilistic representations:
  - profile hidden Markov models (pHMMs)
  - position-specific scoring matrices (PSSMs)
[Figure: sequence logo of a domain alignment, generated with WebLogo 3.6.0]
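
To make the PSSM idea concrete, here is a minimal sketch that turns a toy alignment into per-column log-odds scores. The uniform background frequency, the pseudocount value, the toy alignment, and the function names `build_pssm` and `score` are all assumptions made for illustration; they are not taken from the slides or from any particular tool.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gap characters and other non-residue symbols are ignored.
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, fragment):
    """Sum the column scores of an ungapped fragment aligned at position 0."""
    return sum(column[aa] for column, aa in zip(pssm, fragment))


# Toy alignment (hypothetical data): four homologous 4-residue fragments.
toy_msa = ["WSWM", "WSYM", "WTWM", "WSWL"]
pssm = build_pssm(toy_msa)
```

A fragment that matches the conserved columns, such as `score(pssm, "WSWM")`, scores higher than an unrelated one such as `score(pssm, "AAAA")`.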

The Pfam database
• A large collection of protein domain families
• Each family is represented by two MSAs (Full and Seed) and a profile HMM

CLADE (CLoser sequences for Annotations Directed by Evolution)

MetaCLADE

MetaCLADE - Main features
• Extends CLADE to handle MG/MT data
• Puts all domain hits in a two-dimensional space
• Uses two-dimensional thresholds (defined with a Naive Bayes classifier) to assign confidence values to predictions
• Provides data visualization of functional annotation

MetaCLADE - General overview
[Figure: pipeline overview. Pre-computation: the CLADE model library (CCMs and global-consensus models for domains D1 ... DN), sets of positive and negative sequences, and identification of domain-specific separating parameters in a domain-dependent probability space. Annotation: hit identification on predicted CDS/ORFs in MG/MT reads (input sequence MAKLKVANDKA...), then domain prediction (1. removal of overlapping hits, 2. selection of hits with prob ≥ 0.9, 3. selection of hits with best bit-score and % identity). A further panel shows conservation profiles.]

MetaCLADE - Model construction
• Inherited from CLADE
• Several models are built in order to represent each known Pfam family
• Global consensus models: for each Pfam domain D_i (i = 1 ... M), one pHMM_i is built from Seed_i
• Clade-centered models: for each sequence S_i^1 ... S_i^{n_i} taken from Full_i (n_i ≤ 350), PSI-BLAST on NR builds a model CCM_i^1 ... CCM_i^{n_i}
• Pfam27: 14 831 domains ⇒ 2 389 235 CCMs (PSSMs) and 14 831 pHMMs
[Figure: the CLADE library, combining the global consensus models pHMM_1 ... pHMM_M with the clade-centered models CCM_1^1 ... CCM_M^{n_M}]

MetaCLADE - General idea
[Figure: candidate domain hits (D1, D2, D3, D6, D8, D11) placed on an input sequence]

MetaCLADE - Training set
• The set of positive sequences for a domain D_i is based on suffixes, prefixes and random fragments from Seed_i.
• The set of negative sequences for a domain D_i is based on three different methods:
  1. 2-mer shuffling
  2. sequence reversal
  3. Markov model (probabilities based on 4-mers of Seed_i)

MetaCLADE - Training set
• Negative sequences are generated until |negative sequences| ≥ 1/2 |positive sequences|
• Generation of a 2-dimensional space by considering the bit score and the mean bit score of domain hits
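
The three ways of producing negative sequences and the stopping rule on the training-set slides can be sketched as follows. This is an illustration under stated assumptions, not the MetaCLADE implementation: "2-mer shuffling" is read here as shuffling non-overlapping two-residue blocks, the Markov model is read as an order-3 chain estimated from 4-mers, the three methods are applied in rotation, and the 0.5 ratio is one reading of the stopping rule on the slide. All function names and the toy sequences are hypothetical.

```python
import random
from collections import Counter, defaultdict


def shuffle_2mers(seq, rng):
    """Shuffle non-overlapping 2-residue blocks (assumed reading of '2-mer shuffling')."""
    blocks = [seq[i:i + 2] for i in range(0, len(seq), 2)]
    rng.shuffle(blocks)
    return "".join(blocks)


def reverse_sequence(seq):
    """Read the sequence backwards."""
    return seq[::-1]


def train_markov(seqs, k=4):
    """For every (k-1)-residue context, count the residues that follow it."""
    transitions = defaultdict(Counter)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            transitions[seq[i:i + k - 1]][seq[i + k - 1]] += 1
    return transitions


def sample_markov(transitions, length, rng, k=4):
    """Emit a random sequence whose k-mers follow the trained counts."""
    contexts = sorted(transitions)
    if not contexts:
        raise ValueError("training sequences are shorter than k")
    out = rng.choice(contexts)
    while len(out) < length:
        followers = transitions.get(out[-(k - 1):])
        if not followers:
            # Context never seen in training: restart from a known context.
            out += rng.choice(contexts)
            continue
        residues, weights = zip(*sorted(followers.items()))
        out += rng.choices(residues, weights=weights)[0]
    return out[:length]


def generate_negatives(positives, seed_seqs, ratio=0.5, rng=None):
    """Rotate through the three methods until enough negatives exist."""
    rng = rng or random.Random(0)
    transitions = train_markov(seed_seqs)
    negatives = []
    i = 0
    while len(negatives) < ratio * len(positives):
        template = positives[i % len(positives)]
        if i % 3 == 0:
            negatives.append(shuffle_2mers(template, rng))
        elif i % 3 == 1:
            negatives.append(reverse_sequence(template))
        else:
            negatives.append(sample_markov(transitions, len(template), rng))
        i += 1
    return negatives


# Toy data (hypothetical): two seed sequences and positive fragments cut from them.
seeds = ["MAKLKVANDKA", "MAKLRVANDKG"]
positives = ["MAKLKV", "VANDKA", "KLKVAN", "MAKLRV", "VANDKG", "KLRVAN"]
negatives = generate_negatives(positives, seeds)
```

Passing a seeded `random.Random` keeps the generated set reproducible between runs.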

MetaCLADE - Hit classification
• A discrete version of a naive Bayes classifier is used in order to partition the hit space in regions with an associated probability
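
One way to picture a discrete naive Bayes classifier over a two-dimensional hit space is sketched below. Each feature is cut into equal-width bins, the per-bin likelihoods are estimated separately for positive and negative hits, and every grid cell receives a posterior probability of being a true hit. The equal-width binning, the Laplace smoothing, the class name `DiscreteNaiveBayes2D`, and the toy points are assumptions for illustration only.

```python
from collections import Counter


class DiscreteNaiveBayes2D:
    """Naive Bayes over two binned features, e.g. (bit score, mean bit score)."""

    def __init__(self, width_x, width_y, n_bins=10, alpha=1.0):
        self.width_x = width_x
        self.width_y = width_y
        self.n_bins = n_bins
        self.alpha = alpha  # Laplace smoothing (assumption)

    def _bins(self, point):
        x, y = point
        bx = min(max(int(x // self.width_x), 0), self.n_bins - 1)
        by = min(max(int(y // self.width_y), 0), self.n_bins - 1)
        return bx, by

    def fit(self, positives, negatives):
        self.totals = {1: len(positives), 0: len(negatives)}
        self.counts = {1: (Counter(), Counter()), 0: (Counter(), Counter())}
        for label, points in ((1, positives), (0, negatives)):
            for point in points:
                bx, by = self._bins(point)
                self.counts[label][0][bx] += 1
                self.counts[label][1][by] += 1
        return self

    def _likelihood(self, label, bx, by):
        # Features are treated as independent given the class (the "naive" step).
        denom = self.totals[label] + self.alpha * self.n_bins
        px = (self.counts[label][0][bx] + self.alpha) / denom
        py = (self.counts[label][1][by] + self.alpha) / denom
        return px * py

    def prob_positive(self, point):
        """Posterior probability that a hit in this grid cell is a true hit."""
        bx, by = self._bins(point)
        total = self.totals[0] + self.totals[1]
        pos = self.totals[1] / total * self._likelihood(1, bx, by)
        neg = self.totals[0] / total * self._likelihood(0, bx, by)
        return pos / (pos + neg)


# Toy training points (hypothetical values).
pos_hits = [(80, 1.6), (90, 1.8), (100, 2.0), (85, 1.7)]
neg_hits = [(10, 0.2), (15, 0.3), (20, 0.4), (12, 0.25)]
clf = DiscreteNaiveBayes2D(width_x=20, width_y=0.5).fit(pos_hits, neg_hits)
```

With these toy points, `clf.prob_positive((90, 1.8))` lands in a high-probability region and `clf.prob_positive((10, 0.2))` in a low-probability one.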

MetaCLADE - Hit filtering and domain prediction
1. Removal of overlapping hits
2. Selection of hits with prob ≥ 0.9
3. Selection of hits with best bit-score and % identity
[Figure: the domain hits on the input sequence are reduced step by step to the final prediction D2, D8, D11]
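
The filtering steps listed on this slide can be approximated with a greedy selection of non-overlapping hits. The sketch below is a simplification under assumptions the slide does not specify: two hits overlap if they share any position, hits are ranked by bit score with % identity as tie-breaker, and the three steps are collapsed into one pass. The `Hit` fields and all toy values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str
    start: int        # first covered position (inclusive)
    end: int          # last covered position (inclusive)
    bit_score: float
    identity: float   # fraction of identical residues
    prob: float       # confidence assigned by the classifier


def overlaps(a, b):
    """True if the two hits share at least one position (assumed criterion)."""
    return a.start <= b.end and b.start <= a.end


def filter_hits(hits, min_prob=0.9):
    """Keep confident hits, best first, skipping any that overlap a kept hit."""
    confident = [h for h in hits if h.prob >= min_prob]
    ranked = sorted(confident, key=lambda h: (h.bit_score, h.identity), reverse=True)
    kept = []
    for hit in ranked:
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Toy hits (hypothetical values).
hits = [
    Hit("D2", 1, 30, 50.0, 0.40, 0.95),
    Hit("D1", 5, 35, 40.0, 0.30, 0.97),
    Hit("D8", 40, 70, 45.0, 0.50, 0.92),
    Hit("D3", 45, 75, 60.0, 0.50, 0.50),
    Hit("D11", 80, 100, 30.0, 0.30, 0.99),
]
final = [h.domain for h in filter_hits(hits)]
```

In this toy case D3 is dropped for low probability and D1 for overlapping the stronger D2 hit, leaving D2, D8 and D11.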

Results

MetaCLADE on metatranscriptomics
• Marine eukaryotic phytoplankton metatranscriptomes
• 1.5M high quality cDNA sequences, average length of 242 bp

MetaCLADE - Functional Annotation

MetaCLADE - MetaCLADE vs HMMER (Ion transport)

MetaCLADE - Higher resolution

MetaCLADE - Comparison with other methods
Guerrero Negro Hypersaline Microbial Mats (GNHM): 100/200 bp
100 bp
Tool              TP        FP       FN        TPR    PPV    F-score
UProC             336 302   20 258   448 249   42.9   94.3   58.9
MetaCLADE         323 009   12 145   461 542   41.2   96.4   57.7
MetaCLADE+UProC   405 734   20 145   378 817   51.7   95.3   67.0
UProC+MetaCLADE   406 370   25 965   378 181   51.8   94.0   66.8
HMMGRASP          328 729   37 224   455 822   41.9   89.8   57.1
200 bp
Tool              TP        FP       FN        TPR    PPV    F-score
UProC             264 787   19 060   220 368   54.6   93.3   68.9
MetaCLADE         347 936   18 138   137 219   71.7   95.0   81.7
MetaCLADE+UProC   363 667   21 479   121 488   75.0   94.4   83.6
UProC+MetaCLADE   364 641   28 444   120 514   75.2   92.8   83.0
HMMGRASP          290 155   37 189   195 000   59.8   88.6   71.4
[Figure: precision-recall curve (x-axis: recall, y-axis: precision) for UProC (AUC=0.529), MetaCLADE (AUC=0.699) and MetaCLADE+UProC (AUC=0.727)]
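
The TPR, PPV and F-score columns follow from the TP, FP and FN counts by the standard definitions. As a small check, the sketch below recomputes them for the UProC+MetaCLADE counts reported at 200 bp (TP 364 641, FP 28 444, FN 120 514); the function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn):
    """Return (TPR, PPV, F-score) as percentages rounded to one decimal."""
    tpr = tp / (tp + fn)                  # recall / sensitivity
    ppv = tp / (tp + fp)                  # precision
    f_score = 2 * ppv * tpr / (ppv + tpr) # harmonic mean of precision and recall
    return tuple(round(100 * value, 1) for value in (tpr, ppv, f_score))


# UProC+MetaCLADE on 200 bp reads, counts as reported on the slide.
print(metrics(364641, 28444, 120514))  # (75.2, 92.8, 83.0)
```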

Future improvements
• More domains and new models for an improved annotation
• Constructing a library of conserved small motifs
• Annotation of longer sequences
• Reduction of the number of redundant models
• New criteria to filter overlapping hits

Conclusions
• Learning about the functional activity of the community and its sub-communities is a crucial step toward understanding species interactions and large-scale environmental impact
• Functional annotation methods need to be as precise as possible in identifying remote homology
• MetaCLADE allows for the discovery of patterns in highly divergent sequences
• The number of unknown sequences will keep growing, hence probabilistic models are expected to play a major role in the annotation of sequences spanning unrepresented sequence spaces

Thank you for your attention!
Acknowledgments
• Ari Ugarte
• Juliana Silva Bernardes
• Alessandra Carbone
References
A. Ugarte, R. Vicedomini, J.S. Bernardes, A. Carbone, “A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling,” Microbiome, 2018, 6:149.
J.S. Bernardes, C. Vaquero, G. Zaverucha, A. Carbone, “Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence,” PLoS Computational Biology, 2016, 12(7):e1005038.
