Genome Annotation
The steps in genome sequencing ● Generate genome sequence – Assembly – ORF calling – tRNA identification – rRNA identification – Functional annotation
Annotating Genomes ● Identifying which protein performs which function
www.sigmaaldrich.com
Why annotate a genome? ● Catalog what's there ● Identify what's missing – but should be there! – Things you don't know ● In vitro growth – Mycoplasma pneumoniae ● Comparative genomics ● Hypothesis generation
The goals of annotation ● Exchange information with others ● Compare annotations between organisms
How to annotate a genome? ● Sequence ● Assemble ● Identify open reading frames – Putative proteins
Putative protein ● Open Reading Frame (ORF) – A stretch of codons with no stop codon ● Coding Sequence (CDS) – An ORF that could encode a protein ● Protein encoding gene (PEG) – An ORF that could encode a protein ● Hypothetical protein = putative protein – Something that has not been experimentally shown ● Polypeptide – Short stretch of ~50 amino acids. Often a domain
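The ORF definition above can be sketched in a few lines of Python. This is only an illustration, not how production gene callers work: it assumes an ORF starts at ATG and ends at the first in-frame stop codon, and the names `find_orfs` and `min_codons` are made up for this example.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> List[Tuple[str, int, int, int]]:
    """Scan all six reading frames for ATG...stop stretches.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned (for "-" that
    is the reverse-complemented sequence).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the current candidate start codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    # Codons between start and stop, start codon included.
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs
```

Each tuple is a putative protein in the sense of the slide: a candidate that still needs evidence before it is called a gene.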
PEGS ● E. coli – 4,391 genes – 4,288 genes that make proteins (pegs)
ORF Calling
Traditional genome annotation
Traditional genome annotation BLAST Similarities
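A minimal sketch of annotation by BLAST similarity: take the best-scoring hit for each query protein and copy over its function. It assumes BLAST tabular output (`-outfmt 6`, the default 12 columns). The E-value cutoff, the function names, and the `subject_functions` lookup table are illustrative assumptions, not part of the slides.

```python
import csv
from typing import Dict, Iterable


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, dict]:
    """Keep the highest-bitscore hit per query from BLAST tabular rows.

    Expected columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    best: Dict[str, dict] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # too weak to support transferring a function
        if query not in best or bitscore > best[query]["bitscore"]:
            best[query] = {
                "subject": subject,
                "pident": float(row[2]),
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


def transfer_annotations(hits: Dict[str, dict],
                         subject_functions: Dict[str, str]) -> Dict[str, str]:
    """Copy the function of each best hit onto the query protein."""
    return {
        query: subject_functions.get(hit["subject"], "hypothetical protein")
        for query, hit in hits.items()
    }
```

Queries with no hit under the cutoff are simply absent from the result, which is where "hypothetical protein" calls usually come from.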
Protein Families
Gene Ontology ● Ontology – A “hierarchy” of functions – Does not need to be linear ● Directed Acyclic Graph ● Controlled Vocabulary – Decides which words or phrases to use
GO ● Gene ontology – A eukaryotic focus ● Drosophila ● Mus ● Saccharomyces ● Homo
GO ● Cellular component – The parts of a cell ● Molecular function – e.g. ligand binding ● Biological processes – What things do
GO Terms ● [GO ID, function] ● e.g: – GO:0004743 – Ontology: molecular function – Name: pyruvate kinase activity ● Mainly assigned by BLAST/HMMER/... etc
Directed Acyclic Graph ● Molecular function → Catalytic activity → Transferase activity → Transferase activity, transferring phosphorus-containing groups ● Two children of that term: Kinase activity; Phosphotransferase activity, alcohol group as acceptor ● Pyruvate kinase activity sits below both
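The pyruvate kinase example can be written as a small parent map to show why GO is a graph and not a tree. The edges below are read from the slide's figure, so treat them as an assumption; the function names are made up for this sketch.

```python
from typing import Dict, List, Set

# term -> direct parents ("is a" edges), as read from the slide
PARENTS: Dict[str, List[str]] = {
    "molecular function": [],
    "catalytic activity": ["molecular function"],
    "transferase activity": ["catalytic activity"],
    "transferase activity, transferring phosphorus-containing groups": [
        "transferase activity"
    ],
    "kinase activity": [
        "transferase activity, transferring phosphorus-containing groups"
    ],
    "phosphotransferase activity, alcohol group as acceptor": [
        "transferase activity, transferring phosphorus-containing groups"
    ],
    "pyruvate kinase activity": [
        "kinase activity",
        "phosphotransferase activity, alcohol group as acceptor",
    ],
}


def ancestors(term: str, parents: Dict[str, List[str]] = PARENTS) -> Set[str]:
    """All terms reachable by following parent edges upward."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def count_paths_to_root(term: str, parents: Dict[str, List[str]] = PARENTS) -> int:
    """Number of distinct upward paths from term to a root term."""
    if not parents[term]:
        return 1
    return sum(count_paths_to_root(p, parents) for p in parents[term])
```

A term with more than one path to the root, as pyruvate kinase activity has here, is what makes the structure a directed acyclic graph: a gene annotated with it inherits every ancestor term.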
Problems ● Annotation by committee ● Eukaryotic focus – Some efforts to counter that ● Owen White ● Ariane Toussaint ● Not very deep ● Strict controlled vocabulary
Alternatives
Basic biology lacI lacZ lacY lacA Jacob & Monod, 1961
Different types of clustering ● < 80% identity
Purine metabolism
Heme / chlorophyll metabolism is conserved ● They are both porphyrins
Occurrence of clustering in different genomes ● Figure: fraction of genes in clusters (left axis, 0 to 1) and number of genomes (right axis, 0 to 120), by group of genomes ● Series: clusters of genes with maximum 80% identity; genes in subsystems in clusters; total number of genomes in group
The Subsystems Approach to Annotation ● Subsystem is a generalization of “pathway” – A collection of functional roles jointly involved in a biological process or complex ● Functional Role is the abstract biological function of a gene product – Atomic, or user-defined; examples: 6-phosphofructokinase (EC 2.7.1.11); LSU ribosomal protein L31p; Streptococcal virulence factors – Should not contain “putative”, “thermostable”, etc. ● Populated subsystem is the complete spreadsheet of functions and roles
Histidine Degradation ● Conversion of histidine to glutamate ● Functional roles defined in table ● Inclusion in subsystem is only by functional role ● Controlled vocabulary …
Subsystem: Histidine Degradation
No.  Abbrev.  Functional role
1    HutH     Histidine ammonia-lyase (EC 4.3.1.3)
2    HutU     Urocanate hydratase (EC 4.2.1.49)
3    HutI     Imidazolonepropionase (EC 3.5.2.7)
4    GluF     Glutamate formiminotransferase (EC 2.1.2.5)
5    HutG     Formiminoglutamase (EC 3.5.3.8)
6    NfoD     N-formylglutamate deformylase (EC 3.5.1.68)
7    ForI     Formiminoglutamic iminohydrolase (EC 3.5.3.13)
Subsystem Spreadsheet ● Column headers taken from table of functional roles ● Rows are selected genomes or organisms ● Cells are populated with specific, annotated genes ● Functional variants defined by the annotated roles ● Variant code -1 indicates subsystem is not functional ● Clustering shown by color
Role columns: HutH, HutU, HutI, GluF, HutG, NfoD, ForI
Organism                       Variant  Annotated genes
Bacteroides thetaiotaomicron   1        Q8A4B3 Q8A4A9 Q8A4B1 Q8A4B0
Desulfotalea psychrophila      1        gi51246205 gi51246204 gi51246203 gi51246202
Halobacterium sp.              2        Q9HQD5 Q9HQD8 Q9HQD6 Q9HQD7
Deinococcus radiodurans        2        Q9RZ06 Q9RZ02 Q9RZ05 Q9RZ04
Bacillus subtilis              2        P10944 P25503 P42084 P42068
Caulobacter crescentus         3        P58082 Q9A9MI P58079 Q9A9M0 Q9A9L9
Pseudomonas putida             3        Q88CZ7 Q88CZ6 Q88CZ9 Q88D00 Q88CZ3
Xanthomonas campestris         3        Q8PAA7 P58988 Q8PAA6 Q8PAA8 Q8PAA5
Listeria monocytogenes         -1       (none)
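A populated subsystem is easy to hold as a plain mapping, which makes the variant codes queryable. The sketch below copies the rows of the Histidine Degradation spreadsheet as they appear on the slide; the structure and the function names are illustrative assumptions, not the format of any particular annotation system.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# organism -> (annotated genes as listed on the slide, variant code)
SPREADSHEET: Dict[str, Tuple[List[str], int]] = {
    "Bacteroides thetaiotaomicron": (["Q8A4B3", "Q8A4A9", "Q8A4B1", "Q8A4B0"], 1),
    "Desulfotalea psychrophila": (
        ["gi51246205", "gi51246204", "gi51246203", "gi51246202"], 1),
    "Halobacterium sp.": (["Q9HQD5", "Q9HQD8", "Q9HQD6", "Q9HQD7"], 2),
    "Deinococcus radiodurans": (["Q9RZ06", "Q9RZ02", "Q9RZ05", "Q9RZ04"], 2),
    "Bacillus subtilis": (["P10944", "P25503", "P42084", "P42068"], 2),
    "Caulobacter crescentus": (
        ["P58082", "Q9A9MI", "P58079", "Q9A9M0", "Q9A9L9"], 3),
    "Pseudomonas putida": (
        ["Q88CZ7", "Q88CZ6", "Q88CZ9", "Q88D00", "Q88CZ3"], 3),
    "Xanthomonas campestris": (
        ["Q8PAA7", "P58988", "Q8PAA6", "Q8PAA8", "Q8PAA5"], 3),
    "Listeria monocytogenes": ([], -1),
}


def organisms_by_variant(sheet: Dict[str, Tuple[List[str], int]]) -> Dict[int, List[str]]:
    """Group organism names by variant code, sorted within each group."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for organism, (_genes, variant) in sheet.items():
        groups[variant].append(organism)
    return {variant: sorted(names) for variant, names in groups.items()}


def is_functional(variant: int) -> bool:
    """Variant code -1 marks a subsystem that is not functional."""
    return variant != -1
```

Grouping by variant is the comparison the spreadsheet is built for: organisms sharing a code carry the same set of annotated roles.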
“The Populated Subsystem” ● The Histidine Degradation functional-role table (roles 1 to 7, HutH through ForI) shown together with its subsystem spreadsheet (organisms, annotated genes, variant codes)
Nan-operon within Sialic Acid Metabolism ● Microbial sialic acid metabolism has now been firmly established as a virulence determinant in a range of infectious diseases
The nan operon