DoTS: integrated gene indices for human and mouse built from transcribed sequences

Running Title: DoTS gene indices

Y Thomas Gan 1,2, Brian Brunk 1, Jonathan Crabtree 1,2, Deborah Pinney 1,2, Steve Fischer 1,2, Joan Mazzarelli 1,2, Otto Valladares 2, Maja Bucan 2, Christian J. Stoeckert, Jr. 1,2

1 Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA
2 Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA

Y Thomas Gan: 215-746-7013 (tel), 215-573-3111 (fax), ygan@pcbi.upenn.edu (email)
Brian Brunk: 215-573-3118 (tel), 215-573-3111 (fax), brunkb@pcbi.upenn.edu (email)
Jonathan Crabtree: 215-573-3115 (tel), 215-573-3111 (fax), crabtree@pcbi.upenn.edu (email)
Deborah Pinney: 215-573-3116 (tel), 215-573-3111 (fax), pinney@pcbi.upenn.edu (email)
Steve Fischer: 215-573-2280 (tel), 215-573-3111 (fax), sfischer@pcbi.upenn.edu (email)
Joan Mazzarelli: 215-573-4413 (tel), 215-573-3111 (fax), mazz@pcbi.upenn.edu (email)
Otto Valladares: 215-898-0021 (tel), 215-573-2041 (fax), OttoV@mail.med.upenn.edu (email)
Maja Bucan: 215-898-0020 (tel), 215-573-2041 (fax), bucan@pobox.upenn.edu (email)

Corresponding author: Christian J. Stoeckert, Jr.: 215-573-4409 (tel), 215-573-3111 (fax), stoeckrt@pcbi.upenn.edu (email)
Genome Biology

Abbreviations used in this paper:
EST: expressed sequence tag
DoTS: database of transcribed sequences
DT: DoTS Transcript
DG: DoTS Gene
sDG: similarity-based DoTS Gene
gDG: genome-based DoTS Gene
TC: tentative consensus
BLAST: basic local alignment search tool
BLAT: BLAST-like alignment tool
UTR: untranslated region
ORF: open reading frame
CDS: (protein) coding sequence
Abstract

Background
Although sequences for large eukaryotic genomes are being completed, it remains a challenge to identify all genes encoded by them and to determine or predict their functions. To help address this challenge, we have built a Database of Transcribed Sequences (DoTS). We cluster and assemble ESTs and mRNAs into DoTS Transcripts (DTs). We further group DTs representing transcripts from the same genes into DoTS Genes (DGs). We describe human and mouse DoTS here, although DoTS is generic and applicable to other species such as apicomplexa [1].

Results
We have built an integrated transcriptome resource, DoTS, for human and mouse. In DoTS we catalogue, categorize, and annotate known and predicted transcripts and genes. We have identified 48,994 human and 37,984 mouse high confidence DGs, of which 25,326 human and 22,024 mouse DGs are predicted to be protein-coding genes. Using these data, we can predict novel genes, as demonstrated using a 75 Mb proximal region on mouse chromosome 5. We have found that DGs can significantly enrich the models of known genes by predicting extended UTRs, novel exons, and alternative transcription starts. DoTS also enables the study of non-coding genes and singleton transcripts (DTs with only one input EST or mRNA), in addition to other studies such as the investigation of alternative splicing. A powerful query interface for human and mouse DoTS is available at http://www.allgenes.org [2].

Conclusion
DoTS Transcripts and DoTS Genes, which are extensively annotated and significantly curated, present a unique, integrated, non-redundant, and genome-mapped view of the millions of ESTs and mRNAs in the public domain. They are categorized into various subsets such as high
confidence genes, protein-coding genes, and non-coding genes. They predict many putative novel genes, enrich gene models of known genes, and enable data mining in novel directions.

Background and significance
In a post-genomic era, identifying all genes and studying their functions and relationships are among the ongoing challenges in the field of functional genomics. Transcribed sequences (mRNAs and ESTs) may be used to build integrated transcriptome data resources to help address such challenges.

Genomic data integration
Much progress has been made recently in sequencing large eukaryotic genomes. We now have an essentially complete sequence for the human genome [3-5] and a draft for mouse [6]. Coincident with the explosion of genomic sequence data is the rapidly growing availability of vast amounts of functional genomics data such as expressed sequence tags (ESTs), proteomes, protein domains, and microarray gene expression data. For example, as of October 2003, there were 5.4 million human and 3.9 million mouse ESTs in the public EST repository dbEST [7]. It is necessary to integrate these diverse types of data to facilitate gene identification and functional annotation.

Transcribed sequences for data integration
Transcribed sequences are a good integration point. First, they are the products of gene transcription, and they are abundant as a result of the large-scale EST sequencing efforts. Therefore, they can be used for gene discovery and for analysis of gene structure (e.g. exon-intron structures, alternative splicing) in genomic sequences via alignments. Second, expression
information is usually available for ESTs, based on the libraries from which they originate. In addition, ESTs are commonly used to generate features on microarrays. Therefore, transcribed sequences allow easy integration of expression information with genes, providing the basis for expression analyses. Third, transcribed sequences may be translated to allow protein sequence analyses (e.g. domain-based functional annotation, ortholog identification). Fourth, they may be aligned with genomic sequences to identify regulatory regions. Finally, they may originate from genes that do not encode proteins; they therefore allow the identification of non-coding genes.

Existing transcriptome data resources
Human and mouse genome and transcriptome data are available from several sites [8]. Although there is overlap in the information presented, the sites generally provide unique views or emphases. This is expected, as we are far from a complete understanding of the wealth of information provided by genome sequencing, EST sequencing, and microarray experiments. Groups such as Ensembl [9, 10] or the UCSC Genome Browser team [11] use the genome as their reference point. Another approach is to use shared identifiers (accessions) from different resources to organize and integrate information, as is done by GeneCards [12] and MGI [13], which focus on known genes and emphasize phenotypes. These approaches are complementary, and they provide different views and different interpretations of the data. For example, transcribed sequences that cannot be properly aligned to the genome would fail to be seen as primary entities in genome-based views. Unigene [14] and the TIGR gene indices [15] represent multiple-species transcriptome data resources organized around transcribed sequences. Other efforts in this class include MGC [16], RefSeq [17], STACK [18], and MIPS [19]. Unigene uses sequence similarity to cluster all ESTs and mRNAs but does not generate consensus sequences.
Essentially, the Unigene clusters represent ESTs associated with the same gene. The great strength of Unigene is its currency but
one of its weaknesses is the lack of persistent identifiers. TIGR gene indices provide consensus sequences and persistent identifiers, and they also have data on orthologs for species other than human and mouse, which enables comparative genomics studies using more than two species. TIGR assemblies (TCs) represent transcripts rather than genes; they are therefore a transcript-centric rather than gene-centric resource. MGC focuses on full-length cDNAs, and RefSeq emphasizes known and curated genes; both are therefore limited in scope.

DoTS as a transcriptome resource
DoTS, short for Database of Transcribed Sequences, is a collective name for DoTS Transcripts (DTs) and DoTS Genes (DGs). A DT is an assembly of transcribed sequences representing transcripts of the same splice form, and a DG is a group of DTs representing transcripts from the same gene. The goal of DoTS is to generate relationships among genes, RNAs, proteins, and their sequences to assist in discovering new genes, functions, genomic relationships (e.g. clusters by location), and regulation of gene expression. Allgenes.org is the website for public access to DoTS. As a human and mouse transcriptome resource, DoTS organizes its data around transcribed sequences, as do Unigene and TIGR TCs. DoTS and TIGR TCs provide consensus sequences and persistent identifiers, both of which Unigene lacks. Although DoTS and TIGR TCs are very similar in the degree of annotation performed and, as recently reported, in the assemblies generated [20], the two are not identical because of differences in the details of their clustering and assembly processes. For example, DoTS has more consensus transcripts but a smaller number of sequences per transcript than TIGR TCs. This may be due to less trimming of low-quality sequences from the ends, a choice made for DoTS to better preserve representation of differentially processed transcripts.
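The relationship between input sequences, DTs, and DGs described above can be illustrated with a minimal sketch. This is not the DoTS schema or implementation; the class names, field names, and identifiers below are hypothetical and chosen only to mirror the definitions in the text: a DT is an assembly of input ESTs and mRNAs, a DG is a group of DTs, and a singleton is a DT with only one input sequence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InputSequence:
    """One transcribed sequence used as assembly input (hypothetical model)."""
    accession: str  # placeholder identifier, not a real accession
    kind: str       # "EST" or "mRNA"


@dataclass
class DoTSTranscript:
    """A DT: an assembly of input sequences for one splice form."""
    dt_id: str
    members: List[InputSequence] = field(default_factory=list)

    @property
    def is_singleton(self) -> bool:
        # A singleton DT has exactly one input EST or mRNA.
        return len(self.members) == 1


@dataclass
class DoTSGene:
    """A DG: a group of DTs representing transcripts of the same gene."""
    dg_id: str
    transcripts: List[DoTSTranscript] = field(default_factory=list)

    def input_sequence_count(self) -> int:
        # Total number of input sequences across all DTs in this DG.
        return sum(len(dt.members) for dt in self.transcripts)

    def singletons(self) -> List[DoTSTranscript]:
        # DTs in this DG that are supported by a single input sequence.
        return [dt for dt in self.transcripts if dt.is_singleton]


# Illustrative data only; all identifiers are made up.
dt_a = DoTSTranscript(
    "DT.example1",
    [InputSequence("mRNA_1", "mRNA"), InputSequence("EST_1", "EST")],
)
dt_b = DoTSTranscript("DT.example2", [InputSequence("EST_2", "EST")])
gene = DoTSGene("DG.example1", [dt_a, dt_b])

print(gene.dg_id, gene.input_sequence_count(),
      [dt.dt_id for dt in gene.singletons()])
```

In this sketch the gene-centric view (DG) is a container over the transcript-centric view (DT), which reflects the distinction drawn above between gene-centric and transcript-centric resources.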
The DoTS transcript indices also differ from TIGR TCs in some of the annotations performed on the consensus sequences (e.g. gene trap associations,