Bioinformatics Tools for Analyzing Enzyme Families Greg Butler Department of Computer Science and Software Engineering Centre for Structural and Functional Genomics Concordia University, Montreal www.cs.concordia.ca/~gregb gregb@cs.concordia.ca
Outline Overview of Tools PipeAlign Panther FlowerPower Case Study on Cellulases
Overview of Tools: Problems Addressed
Multiple Sequence Alignment (MSA)
  Problem: Given a set of protein sequences and an objective function, determine the optimal alignment of the sequences.
  Subproblems: selecting homologues to align; choice of objective function; high-quality MSA; removing outlier sequences
Phylogenetic Tree Construction
  Problem: Given a set of protein sequences and their pairwise distances (using some distance metric), construct a phylogenetic tree.
Classifier for Family (usually a Hidden Markov Model, HMM)
  Problem: Given a family of protein sequences and a multiple sequence alignment (MSA) for the family, construct a classifier which, given a protein sequence, can determine whether or not the protein is a member of the family.
Split Family into Subfamilies
  Problem: Given a family of protein sequences, and a multiple sequence alignment (MSA) for the family, and ..., determine a clustering of the sequences in the family into subfamilies.
Consistency of MSA, Tree, Classifiers for Family and Subfamilies
Overview of Tools
PipeAlign
  Given a seed sequence, constructs MSAs for the family and subfamilies.
  - (optional) include candidate family members
  http://igbmc.u-strasbg.fr/PipeAlign/
Panther
  DB of sequences, MSAs, trees, and HMM classifiers for protein families and subfamilies, built semi-automatically for human, mouse, ...
  Given a query protein sequence, classifiers determine family and subfamily
  - can download all HMM classifiers
  https://panther.appliedbiosystems.com/
FlowerPower
  Given a seed protein sequence, determines the family and subfamily, their MSAs, trees, and HMM classifiers.
  - like improved PSI-BLAST against UniProt for MSAs
  - postprocess MSAs using BÊTE for trees, GTREE for display
  http://phylogenomics.berkeley.edu/cgi-bin/flowerpower/input flowerpower.py
What is an Enzyme? An enzyme is a protein that catalyses a reaction.
What is an Enzyme? Enzymes are very specific. Enzymes are very efficient catalysts.
Enzyme Families and (Some) Classification Schemes
Aim: To classify and organize enzymes.
EC (Enzyme Commission) numbers
  “To consider the classification and nomenclature of enzymes and coenzymes, their units of activity and standard methods of assay, together with the symbols used in the description of enzyme kinetics.”
GO (Gene Ontology)
  three classifications of gene products
  - molecular function
  - biological process
  - cellular component
CATH: Class, Architecture, Topology, Homology
  “There is no objective definition. A family is clearly related by sequence similarity, a superfamily is composed of families whose sequence relationship isn’t clear, but which are believed on structural and functional grounds to be homologous, and a fold is a group of superfamilies that share a common structural topology but are not necessarily homologous.”
InterPro
  combination of many classification schemes
Gene Ontology — Entry
InterPro
Multiple Sequence Alignment (MSA)
Problem: Given a set of protein sequences and an objective function, determine the optimal alignment of the sequences.
Why? Amino acid sequence determines protein structure, which in turn determines enzyme function.
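To make "objective function" concrete, here is a minimal sketch of one common choice, the sum-of-pairs score: every column is scored by summing a pairwise score over all pairs of rows. The match, mismatch, and gap values are illustrative assumptions, not the parameters of any tool named in these slides; real systems typically use a substitution matrix and affine gap penalties.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment (list of equal-length gapped strings) column by column.

    The scoring values are assumptions chosen for illustration only.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue            # gap against gap carries no information
            if a == "-" or b == "-":
                total += gap        # residue aligned against a gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


score = sum_of_pairs(["ACD-", "ACDE", "A-DE"])
```

An MSA program then searches for the arrangement of gaps that maximizes such a score, which is where the "math vs biology" tension arises: the highest-scoring alignment need not be the most biologically meaningful one.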
MSA Issues
Multiple sequence alignment is a complicated task
- choice of the sequences
- choice of an objective function
- the optimization of the objective function
Issues
- math vs biology (an optimal MSA is not necessarily a “good” MSA for a biologist)
- outliers affect results
- divergence can affect choice of parameters/algorithms
- multi-domain sequences are problematic
- many sequences or long sequences are costly
Ideal
- align closely related sequences
- trim so only one domain is present
- feed in lots of constraints, e.g. structural information
Approaches to MSA
Progressive
- sequences are added one by one to the multiple alignment according to a precomputed order
Iterative
- iteratively modify a sub-optimal solution
Stochastic iterative
- randomly modify
- result is either kept or discarded depending on an acceptance function
- convergence via more stringent acceptance function
Consistency-based
  “given a set of independent observations, the most consistent are often closer to the truth”
- optimal MSA is one that agrees the most with all the possible optimal pair-wise alignments
Constraint-based
- use prior information as constraints on the alignment
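The stochastic iterative approach can be sketched generically. The example below assumes a Metropolis-style acceptance function with a cooling temperature, which is one standard way to make acceptance "more stringent" over time; it is not taken from any specific MSA system. The names `accept`, `anneal`, `score`, and `propose` are hypothetical. In an MSA setting the state would be an alignment, `score` the objective function, and `propose` a random gap shift.

```python
import math
import random


def accept(old_score, new_score, temperature, rng=random):
    """Keep improvements always; keep a worse result with a probability
    that shrinks as the temperature drops."""
    if new_score >= old_score:
        return True
    return rng.random() < math.exp((new_score - old_score) / temperature)


def anneal(state, score, propose, steps=1000, start_temp=1.0, cooling=0.995, seed=0):
    """Generic stochastic iterative optimization (higher score is better)."""
    rng = random.Random(seed)
    current, current_score = state, score(state)
    temperature = start_temp
    for _ in range(steps):
        candidate = propose(current, rng)          # random modification
        candidate_score = score(candidate)
        if accept(current_score, candidate_score, temperature, rng):
            current, current_score = candidate, candidate_score
        temperature *= cooling                     # acceptance becomes stricter
    return current, current_score


# Toy usage: the "alignment" is just an integer and the optimum is at 5.
best, best_score = anneal(
    0,
    score=lambda x: -(x - 5) ** 2,
    propose=lambda x, rng: x + rng.choice([-1, 1]),
)
```

Early on, worse candidates are sometimes kept, which lets the search escape local optima; as the temperature falls, the procedure behaves more and more like plain hill climbing.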
Recent MSA Algorithms and Systems Partial Order Alignment (POA) Progressive POA MUSCLE and SATCHMO Indonesia system (Uppsala) using structural constraints DIALIGN with User Constraints
Splitting Families into Subfamilies
Problem: Given the sequences for a family of enzymes, determine how to delineate cohesive subfamilies.
Why? More homologous means easier to study
- easier to build better alignments
- easier to build better classifiers
Subproblem: remove outliers from the set of sequences
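As a rough illustration of the clustering problem, the sketch below groups aligned sequences by single-linkage clustering on pairwise percent identity. This is a deliberately simple stand-in and not the method used by any system in these slides; the 60% threshold, the function names, and the toy sequences are all assumptions. Singleton groups can be read as candidate outliers.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def split_into_subfamilies(alignment, threshold=60.0):
    """Single-linkage clustering of a {name: aligned_sequence} dict.

    Two sequences join the same group when their identity reaches the
    threshold (an assumed value); groups are merged with union-find.
    """
    names = list(alignment)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]   # path compression
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if percent_identity(alignment[a], alignment[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


toy = {
    "s1": "ACDEFG",
    "s2": "ACDEFA",
    "s3": "WWYYHH",
    "s4": "WWYYHK",
    "s5": "MMMMMM",
}
subfamilies = split_into_subfamilies(toy)
```

Single linkage is sensitive to chaining, so in practice the grouping would be checked against the phylogenetic tree rather than trusted on its own.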
Building Classifiers for Enzyme Families
Problem: Given the sequences for a family of enzymes, determine how to decide membership in the family.
“In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint.”
A profile or weight matrix is a table of position-specific amino acid weights and gap costs.
A domain is a conserved protein region.
- “independently folding structural unit”
A fingerprint is a group of conserved motifs used to characterise a protein family.
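The idea of a profile can be sketched as a small log-odds weight matrix built from ungapped motif instances. This is a simplified illustration under stated assumptions: a uniform background frequency of 1/20, a pseudocount of 1, no gap costs, and invented motif instances. It is far simpler than the HMM classifiers discussed in these slides.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(motifs, pseudocount=1.0):
    """Build position-specific log-odds weights from equal-length motif instances."""
    background = 1.0 / len(AMINO_ACIDS)           # assumed uniform background
    profile = []
    for position in range(len(motifs[0])):
        column = [motif[position] for motif in motifs]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        weights = {}
        for aa in AMINO_ACIDS:
            frequency = (column.count(aa) + pseudocount) / total
            weights[aa] = math.log2(frequency / background)
        profile.append(weights)
    return profile


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (score, start) of the best window.

    Only the 20 standard residues are handled; other letters raise KeyError.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[aa] for column, aa in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical motif instances, invented for this example.
profile = build_profile(["GDSL", "GDSL", "GDSI", "GESL"])
score, start = best_hit(profile, "MKTAGDSLVV")
```

A membership decision then reduces to comparing the best score against a cutoff, which would have to be calibrated on known members and non-members.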
Panther System from Celera
“The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human, and Drosophila melanogaster.”
PipeAlign System from Strasbourg
PipeAlign Tools
Ballast
- getting a better set of homologues
- conservation profile
DbClustal
- combined local and global alignment using anchors
NorMD
- a reliable objective function
- normalized Mean Distance scores
RASCAL and LEON
- detection and correction of alignment errors
- removal of outliers
- realignment of blocks and inter-block regions
Secator and DPC
- split into subfamilies
Inferring Paralogs and Orthologs: Basic Idea
Compare phylogenetic tree of sequences with species tree
Systems: RIO, Orthostrapper, BÊTE
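The basic idea can be sketched with a textbook reconciliation: map each node of the gene tree to the smallest species-tree clade that contains all of its species, and call the node a duplication when it maps to the same clade as one of its children. Two genes are orthologs when their last common ancestor in the gene tree is a speciation node. This is a minimal sketch, not the implementation of RIO, Orthostrapper, or BÊTE; trees are assumed to be nested tuples, gene trees binary, and all gene and species names are invented.

```python
from itertools import product


def clades(tree):
    """Return (species under this node, list of clades for all nodes below and here)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    top, found = frozenset(), []
    for child in tree:
        child_top, child_found = clades(child)
        top |= child_top
        found.extend(child_found)
    found.append(top)
    return top, found


def smallest_clade(species, all_clades):
    """Smallest species-tree clade containing the given set of species."""
    return min((c for c in all_clades if species <= c), key=len)


def orthologs(gene_tree, species_of, species_tree):
    """Return the set of ortholog pairs (as frozensets of two gene names)."""
    _, all_clades = clades(species_tree)
    pairs = set()

    def visit(node):
        if isinstance(node, str):
            return [node], smallest_clade(frozenset([species_of[node]]), all_clades)
        left_genes, left_map = visit(node[0])
        right_genes, right_map = visit(node[1])
        here = smallest_clade(left_map | right_map, all_clades)
        if here != left_map and here != right_map:    # speciation node
            pairs.update(frozenset(p) for p in product(left_genes, right_genes))
        return left_genes + right_genes, here         # otherwise a duplication

    visit(gene_tree)
    return pairs


species_tree = (("human", "mouse"), "fly")
species_of = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse", "f1": "fly"}
gene_tree = ((("h1", "m1"), ("h2", "m2")), "f1")      # duplication before human/mouse split
found = orthologs(gene_tree, species_of, species_tree)
```

In this toy example h1 and m1 are orthologs, while h1 and h2 are paralogs because their common ancestor is a duplication. The result depends entirely on the gene tree being correct.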
RIO: Resampled Inference of Orthologs
Problem with Basic Idea: inaccuracy of the phylogenetic tree
Solution: bootstrap resampling of the tree gives the probability that the phylogenetic tree indicates an ortholog
New concepts from RIO: super-ortholog, ultra-paralog, subtree neighbour
Still the problem of inferring function (even) from orthologs!
Cellulase Case Study
Dataset of biochemically characterized cellulases for Kwang-Bo Joung:
• 27 endoglucanases (egl) EC 3.2.1.4
• 23 cellobiohydrolases (cbh) EC 3.2.1.91
• 28 beta-glucosidases (bgl) EC 3.2.1.21
Characterized into 92 families of Glycosyl Hydrolases (GH)
Kwang-Bo Joung aligned domains using ClustalW and combined into tree. Used Prosite patterns to clarify subfamily membership.
Noted misclassifications:
- P46236, GH Family 6, in SP as egl; literature says “cellulase” (i.e. egl or cbh); should be cbh
- P37698, GH Family 48, should be cbh not egl
Kwang-Bo Joung’s classification (see tree):
  egl-A of size 4, 2 in GH Family 45, 2 in GH Family 7 (has cbh)
  egl-B of size 1 in GH Family 6 (has cbh)
  egl-C of size 2 in GH Family 9 (has cbh)
  egl-D of size 5 in GH Family 12 and 8
  egl-E of size 15 in GH Family 5
  cbh-A of size 14 in GH Family 7 (has egl)
  cbh-B of size 5 in GH Family 6 (has egl)
  cbh-C of size 2 in GH Family 9 (has egl)
  cbh-D of size 3 in GH Family 48
  bgl-A of size 13 in GH Family 3
  bgl-B of size 15 in GH Family 1
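Checking sequences against Prosite patterns can be sketched by translating the pattern syntax into a regular expression. The converter below handles the common elements (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` / `>` anchors at the pattern ends) and ignores rarer cases such as `>` inside brackets. The pattern and sequences shown are invented for illustration and are not real Prosite entries or cellulase signatures.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern string into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # forbidden residues
        element = element.replace("(", "{").replace(")", "}")    # repeat counts
        element = element.replace("x", ".")                      # any residue
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(element)
    return "".join(parts)


# Hypothetical pattern, not taken from the Prosite database.
regex = prosite_to_regex("[LIVM]-x(2)-E-{P}-G.")
hit = re.search(regex, "AALKKEAGTT")
miss = re.search(regex, "AALKKEPGTT")
```

A sequence that matches a subfamily-specific pattern gives independent support for its placement in the tree; a sequence that fails to match is a candidate for closer inspection.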