Vol. 18 no. 6 2002 BIOINFORMATICS Pages 873–879
© Oxford University Press 2002

SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size

Eldar Giladi 1,∗, Michael G. Walker 1, James Z. Wang 2 and Wayne Volkmuth 1

1 Incyte Pharmaceuticals, 3174 Porter Drive, Palo Alto, CA 94304, USA and 2 Department of Computer Science, Pennsylvania State University, University Park, PA 16802, USA

∗ To whom correspondence should be addressed.

Received on January 11, 2001; revised on January 7, 2002; accepted on January 29, 2002

ABSTRACT

Motivation: Searches for near-exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics. The time and cost of performing these large-scale sequence-similarity searches is prohibitive using even the fastest of the extant algorithms. Faster algorithms are desired.

Results: We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near-exact matches, in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fragments of fixed length called 'windows' using multiple offsets. Each window is mapped into a vector of dimension 4^k which contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4–6. Then we create a tree-structured index of the windows in vector space, with tree-structured vector quantization (TSVQ). We identify the nearest neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest-neighbor windows in the database. When the tree is balanced this yields an O(log n) complexity for the search. This complexity was observed in our computations. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequences or matching ESTs to genomic sequence. The algorithm is also an effective filtration method. Specifically, it can be used as a preprocessing step for other search methods to reduce the complexity of searching one large database against another. For the problem of identifying overlapping fragments in the assembly of 120 000 fragments from a 1.5 megabase genomic sequence, SST is 15 times faster than BLAST when we consider both building and searching the tree. For searching alone (i.e. after building the tree index), SST is 27 times faster than BLAST.

Availability: Request from the authors.

Contact: egiladi@incyte.com; mwalker@incyte.com

1 INTRODUCTION

In the current efforts to generate and interpret the complete genome sequences of humans and model organisms, large-scale searches for near-exact matches are frequently performed. Examples include programs that assemble DNA from shotgun sequencing projects which initially search for overlapping fragments, large-scale searches of EST databases against genomic databases to determine the location of genes, and cross-species genomic comparisons between very closely related genomes. Faster algorithms are needed because the time and cost of performing these large-scale sequence-similarity searches using even the fastest of the extant algorithms is prohibitive.

1.1 Previous related research

We now review previous results related to the Sequence Search Tree (SST) algorithm for sequence alignment, tree-structured indexes, and k-tuple encoding and filtration. In this discussion we shall refer to the length of a query sequence by the letter 'm'. The size of the database refers to the sum of the lengths of all the sequences in the database, and is represented by the letter 'n'.

1.1.1 Sequence alignment. Extant widely used sequence-similarity-finding programs include Needleman–Wunsch (Needleman and Wunsch, 1970), Smith–Waterman (Smith and Waterman, 1981), FASTA (Pearson and Lipman, 1988; Pearson, 1996) and BLAST (Altschul et al., 1990, 1997).

The Needleman–Wunsch and Smith–Waterman algorithms perform global and local sequence alignment using a dynamic programming algorithm. Their computational complexity is O(mn).
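To make the O(mn) cost of dynamic programming concrete, here is a minimal Python sketch of Smith–Waterman local alignment scoring. It is an illustration and not code from the paper: the function name, the scoring values (match, mismatch) and the linear gap penalty are all assumptions chosen for brevity. The two nested loops fill one table cell per (query position, target position) pair, which is where the m × n work comes from.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of query (length m) against target (length n).

    Scoring parameters are illustrative assumptions, with a linear gap penalty.
    Only two rows of the DP table are kept, so memory is O(n) while time stays O(m * n).
    """
    m, n = len(query), len(target)
    prev = [0] * (n + 1)  # row i-1 of the DP table
    best = 0
    for i in range(1, m + 1):
        curr = [0] * (n + 1)  # row i of the DP table
        for j in range(1, n + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a fresh local alignment
                prev[j - 1] + sub,  # align query[i-1] with target[j-1]
                prev[j] + gap,      # gap in the target
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # An exact copy of the query embedded in a longer target.
    print(smith_waterman_score("ACGT", "TTACGTTT"))
```

When the target is an entire database of total length n, every query pays this full m × n cost, which is the expense that index-based methods aim to avoid.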
