Sequence Alignment and Approaches to Database Searching
Jessica Kissinger
WHO-TDR Delhi 2005

Ian Korf, and M. Yandell, O’Reilly Publishing

The Growth of GenBank
[Figure: GenBank sequence records (millions) and total base pairs (billions) per year, ’83 to ’04]
• Release 140: 32.5 million records, 37.9 billion nucleotides
• Average doubling time ≈ 12 months
http://www.ncbi.nlm.nih.gov/BLAST/

Outline
• Back up: talk about genesis of an idea
• Global alignment: Needleman-Wunsch
• Local alignment: Smith-Waterman
• Scoring matrices
• Need for a heuristic
• FASTA
• How does BLAST work?
• How you optimize BLAST searches
• BLAST variants

Why do we align sequences?
• To discover functional, structural and evolutionary similarities
• Because “similarity” may be an indicator of “homology” and thus provide some insight into function or gene identification.

Origins of “similar” sequences
[Figure: gene A undergoes gene duplication into A1 and A2; after speciation, Species A and Species B each carry A1 and A2]
• Gene duplication
• Speciation
• Gene conversion
• Horizontal gene transfer
• Genome biases for A/T, G/C, Serine/Glycine rich sequences
• Low complexity sequences, e.g. LLLLLLLL, ATATATATA

Algorithms: definition
Webster’s definition: “a procedure for solving a mathematical problem in a finite number of steps that frequently involves a repetition of an operation; or broadly: a step-by-step procedure for solving a problem or accomplishing some end”

Alignments
• Alignment types:
  – global/local
  – gapped/ungapped
  – pairwise/multiple
• In what follows we will focus on pairwise alignments.

Pairwise Alignment
• There are two types of pairwise alignments
  – Global (Needleman-Wunsch)
    • Compare two sequences in their entirety
    • Insert gaps as necessary to make the sequences the same lengths
  – Local (Smith-Waterman)
    • Compare a portion of one sequence to a portion of another
    • Look for the “best” possible alignment of sub-regions

Global vs Local Alignment
• Global
    L G P S S K Q T G K G S - S R I W D N
    L N - I T K S A G K G A I M R L G D -
• Local
    - - - - - - - - G K G - - - - - - - -
    - - - - - - - - G K G - - - - - - - -

Substitutions, Insertions, Deletions
• Mutation: one of
  – switch from one nucleotide to another
  – insertion
  – deletion
• Substitution: a switch in nucleotides which spreads throughout most of a species.
• Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other):
    cggtatgcca → cgggtatccaa
    cggtatgcca → ccctaggtccca

Example
• For the previous example, cggtatgcca → cgggtatccaa, ccctaggtccca, the two descendent sequences align as follows:
    c g g g t a - - t - c c a a
    c c c - t a g g t c c c - a
• “-” (indel) represents an insertion or deletion.
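
The column types in the alignment above can be tallied mechanically. The following is a minimal sketch, not part of the original slides; the function name `classify_columns` is a hypothetical helper introduced only for illustration. It assumes both rows are given as equal-length strings with "-" marking an indel.

```python
def classify_columns(top: str, bottom: str) -> dict:
    """Count matches, mismatches and indels in a gapped pairwise alignment.

    Hypothetical helper for illustration; '-' marks an indel.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot hold a gap in both rows")
        if a == "-" or b == "-":
            counts["indel"] += 1      # insertion or deletion
        elif a == b:
            counts["match"] += 1      # identical nucleotides
        else:
            counts["mismatch"] += 1   # substitution
    return counts


# The two descendent sequences from the slide, written without spaces.
counts = classify_columns("cgggta--t-ccaa", "ccc-taggtccc-a")
print(counts)
```

Removing the "-" characters from each row gives back the two descendent sequences, which is a quick sanity check that an alignment has been transcribed correctly.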

Alignments (cont.)
• Given two sequences, find an “optimal” alignment between them and use it to answer the questions stated above.
• What is an “optimal” alignment?
• Need a way to compare alignments.
  – Attach a score to each alignment.
  – This should reflect the likelihood that this alignment was produced as a consequence of divergence from a common ancestor.

Scoring schemes
• Given a scoring scheme,
  – an optimal alignment between two sequences is one with the best score (there might be more than one optimal alignment).
  – the score of the sequence pair is such a best score.
• Using the scores of sequence pairs one can:
  – investigate the hypothesis that two sequences diverged from a common ancestor
  – use the alignment of a pair of sequences that are judged to be related in order to discover common patterns.
  – by comparing scores among different species, get information to help reconstruct the phylogenetic tree that relates them all.

Types of scores
• Similarity scores: the higher the score, the more closely related are the two aligned sequences.
• Distance scores (or distance measures): the lower the score, the more closely related the sequences.
In what follows we will use similarity scores.

GAP PENALTIES
• Linear = #gaps x penalty
• Affine = Opening penalty + #gaps x extension penalty
(#gaps is the number of gapped positions, i.e. the length of the gap)
[Figure: penalty of gap versus length of gap, comparing the simple gap penalty with the affine gap penalty]

Substitution Matrices
• A 4 × 4 (NA) or a 20 × 20 (AA) symmetric matrix.
• Example:
  1. s(X,Y)=1 if X=Y, -1 otherwise.
• In what follows we will assume that a scoring scheme, consisting of a substitution matrix and a gap penalty function, is given.

Example
• Let s(X,Y)=1 if X=Y, -1 otherwise and use a linear gap score of -2 per gapped position (a gap penalty of d=2). Then the score of the alignment
    c t t a g - g - -
    c a t - g a g a a
  is 1 - 1 + 1 - 2 + 1 - 2 + 1 - 2 - 2 = -5
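
The scoring example above can be reproduced in a few lines. This is a minimal sketch under the slide's scheme (match +1, mismatch -1, linear gap penalty of 2 per gapped position). The function names are hypothetical, and the affine parameters `opening=5` and `extension=1` are illustrative assumptions that do not come from the slides.

```python
def linear_gap(length: int, penalty: int = 2) -> int:
    """Linear gap penalty: #gaps x penalty."""
    return length * penalty


def affine_gap(length: int, opening: int = 5, extension: int = 1) -> int:
    """Affine gap penalty: opening penalty + #gaps x extension penalty.

    The default values are illustrative assumptions, not from the slides.
    """
    return 0 if length == 0 else opening + length * extension


def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -1, d: int = 2) -> int:
    """Score a gapped alignment column by column with a linear gap penalty d."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total -= d                      # each gapped position costs d
        else:
            total += match if a == b else mismatch
    return total


# The alignment from the example slide, written without spaces.
print(score_alignment("cttag-g--", "cat-gagaa"))

# Penalty for one gap of length 3 under each model.
print(linear_gap(3), affine_gap(3))
```

With these assumed affine parameters a single long gap is cheaper per position than several short ones, which is the usual motivation for the affine model.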

Naïve approach
Exhaustive search:
• List all possible global gapped alignments of x and y.
• For each such alignment, compute its score using the given scoring scheme.
• Find the maximum of the scores and the corresponding alignment(s).

Needleman-Wunsch algorithm (1970)
Gotoh’s version (1982)
• This is an example of a dynamic programming algorithm:
  – break the problem into sub-problems of the same kind
  – build the final solution using the solutions for the sub-problems.

Align: COELACANTH and PELICAN
Scoring system: Match = +1, Mismatch = -1, Gaps = -1
• Two possible (out of many) global alignments
    COELACANTH      COELACANTH
    P-ELICAN--      -PELICAN--
• The best local alignment
    --ELACAN--
    --ELICAN--
[Figure: dynamic programming matrix with COELACANTH along the top and PELICAN down the side]
Sequences align when we are on the diagonal; when gaps are introduced, we move vertically (or horizontally).

Alignment types and their scores
In each recurrence the three terms correspond to the last column of the alignment being X_i over Y_j, X_i over “-”, or “-” over Y_j; s(i,j) is the substitution score for X_i and Y_j, and d is the linear gap penalty.
• Global Alignment (Needleman-Wunsch), linear gap model
    B(i,j) = max { B(i-1,j-1) + s(i,j), B(i-1,j) - d, B(i,j-1) - d }
    L G P S S K Q T G K G S - S R I W D N
    L N - I T K S A G K G A I M R L G D -
  Global: penalize all gaps
• Fitting one sequence into another, linear gap model
    F(i,j) = max { F(i-1,j-1) + s(i,j), F(i-1,j) - d, F(i,j-1) - d }
  Fit one inside another: only penalize gaps in the shorter sequence (same recurrence as the global case; gaps that overhang the ends of the shorter sequence are not charged)
• Local Alignment (Smith-Waterman), linear gap model
    L(i,j) = max { 0, L(i-1,j-1) + s(i,j), L(i-1,j) - d, L(i,j-1) - d }
    - - - - - - - - G K G - - - - - - - -
    - - - - - - - - G K G - - - - - - - -
  Local: only penalize gaps within the region aligned (the 0 term lets an alignment restart wherever the running score would drop below zero)
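
The global and local recurrences can be sketched as follows. This is a minimal illustration with a linear gap model, not the slides' own implementation; the function names are hypothetical. The local variant uses the standard Smith-Waterman form, which additionally clamps every cell at zero and reports the best cell anywhere in the matrix.

```python
def s(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Substitution score: +1 for identical residues, -1 otherwise."""
    return match if a == b else mismatch


def global_align(x: str, y: str, d: int = 1):
    """Needleman-Wunsch with linear gap penalty d.

    Returns (score, aligned_x, aligned_y) for one optimal alignment.
    """
    m, n = len(x), len(y)
    B = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # leading gaps in y
        B[i][0] = -i * d
    for j in range(1, n + 1):          # leading gaps in x
        B[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            B[i][j] = max(B[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          B[i - 1][j] - d,
                          B[i][j - 1] - d)
    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and B[i][j] == B[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and B[i][j] == B[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return B[m][n], "".join(reversed(top)), "".join(reversed(bottom))


def local_score(x: str, y: str, d: int = 1) -> int:
    """Smith-Waterman best local score with linear gap penalty d."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            L[i][j] = max(0,
                          L[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          L[i - 1][j] - d,
                          L[i][j - 1] - d)
            best = max(best, L[i][j])
    return best


score, ax, ay = global_align("COELACANTH", "PELICAN", d=1)
print(score)
print(ax)
print(ay)
print(local_score("COELACANTH", "PELICAN", d=1))
```

When several alignments share the optimal score, the traceback here returns only one of them; which one depends on the order in which the three cases are tested.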

Example
• x=gaatct, y=catt (m=6 and n=4)
• s(X,Y)=1 if X=Y, -1 otherwise
• d=2

Example (cont.)
• 3 optimal alignments (each with score -2):
    gaatct      gaatct      gaatct
    c-at-t      ca-t-t      -cat-t

Database Searching
• Database Searching ≠ Sequence alignment
• Similarity ≠ Homology
• Similarity is a measure of “sameness”. It is expressed as a percentage, and it does not imply any reasons for the observed sameness; it is simply a measure of the observed likeness.
• Homology is an evolutionary term used to describe relationship via descent from a common ancestor. Homologous things are often similar, but not always, for example the flipper of a whale and your arm, or the DNA sequence for Actin in humans and chickens. Homology is NEVER expressed as a percent; either you are related or you aren’t.

Similarity and Homology
• Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.
• Non-homology CANNOT be inferred from non-similarity because non-similar things can still share a common ancestor.
• Homologous proteins share common structures, but not necessarily common sequence or function.

Origins of similarity NOT based on common ancestry
• Similarity is often observed in regions of low sequence complexity, e.g. SSSSSS or ATATATATATAT; such similarity is also almost always local and will not span the length of the sequences being compared.
• Similarity can also be caused by underlying biases in nucleotide or amino acid usage
• Similarity can be caused by shared motifs that have been acquired.

Similarity Assessment
• Our assumption is that unrelated sequences will behave like random sequences
• Biological sequences are not random, and the score of a best alignment is an extreme value, so the statistics of extreme value distributions apply.
• Scores for matches are influenced by the scoring matrix used
• Sensitivity and Selectivity are affected by choice of matrix and choice of database (redundancy and size).
• Choice of search molecule (query)