
Sequence Alignment (chapter 6)



  1. Sequence Alignment (chapter 6)
     • The biological problem
     • Global alignment
     • Local alignment
     • Multiple alignment
     Introduction to bioinformatics, Autumn 2006

  2. Background: comparative genomics
     • Basic question in biology: what properties are shared among organisms?
     • Genome sequencing allows comparison of organisms at DNA and protein levels
     • Comparisons can be used to
       − Find evolutionary relationships between organisms
       − Identify functionally conserved sequences
       − Identify corresponding genes in human and model organisms: develop models for human diseases

  3. Homologs
     • Two genes or characters $g_B$ and $g_C$ that evolved from the same ancestor $g_A$ are called homologs
     • Homologs usually exhibit conserved functions
     • Close evolutionary relationship => expect a high number of homologs

       $g_A$ = agtgtccgttaagtgcgttc
       $g_B$ = agtgccgttaaagttgtacgtc
       $g_C$ = ctgactgtttgtggttc

  4. Sequence similarity
     • Intuitively, similarity of two sequences refers to the degree of match between corresponding positions in the sequences

       agtgccgttaaagttgtacgtc
       ctgactgtttgtggttc

     • What about sequences that differ in length?
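
One naive way to make "degree of match between corresponding positions" concrete is to count identical symbols position by position. The sketch below is only an illustration, not the course's definition of similarity. The helper name `count_matching_positions` is hypothetical, and the function simply ignores the unpaired tail of the longer sequence, which is the weakness the slide's closing question points at.

```python
def count_matching_positions(seq_a: str, seq_b: str) -> int:
    """Count positions at which both sequences carry the same symbol."""
    # zip() stops at the end of the shorter sequence, so the extra
    # positions of the longer sequence are not compared at all.
    return sum(1 for x, y in zip(seq_a, seq_b) if x == y)


# The two sequences shown on the slide (lengths 22 and 17).
g_b = "agtgccgttaaagttgtacgtc"
g_c = "ctgactgtttgtggttc"
print(count_matching_positions(g_b, g_c), "matching positions out of",
      min(len(g_b), len(g_c)), "compared")
```

Because no shifting or gaps are allowed here, a single insertion near the start of one sequence would destroy almost all matches; alignment removes that limitation.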

  5. Similarity vs homology
     • Sequence similarity is not sequence homology
       − If the two sequences $g_B$ and $g_C$ have accumulated enough mutations, the similarity between them is likely to be low

       #mutations  sequence                 #mutations  sequence
                0  agtgtccgttaagtgcgttc             64  acagtccgttcgggctattg
                1  agtgtccgttatagtgcgttc           128  cagagcactaccgc
                2  agtgtccgcttatagtgcgttc          256  cacgagtaagatatagct
                4  agtgtccgcttaagggcgttc           512  taatcgtgata
                8  agtgtccgcttcaaggggcgt          1024  acccttatctacttcctggagtt
               16  gggccgttcatgggggt              2048  agcgacctgcccaa
               32  gcagggcgtcactgagggct            4096  caaac

     Homology is more difficult to detect over greater evolutionary distances.

  6. Similarity vs homology (2)
     • Sequence similarity can occur by chance
       − Similarity does not imply homology
     • Similarity is an expected consequence of homology

  7. Orthologs and paralogs
     • We distinguish between two types of homology
       − Orthologs: homologs from two different species
       − Paralogs: homologs within a species
     [Figure: organism A carries $g_A$, organisms B and C carry $g_B$ and $g_C$; a second panel shows gene A copied within organism A, giving $g_A$ and $g_A'$.]

  8. Orthologs and paralogs (2)
     • Orthologs typically retain the original function
     • In paralogs, one copy is free to mutate and acquire a new function (no selective pressure)
     [Figure: organism A carries $g_A$, organisms B and C carry $g_B$ and $g_C$; a second panel shows gene A copied within organism A, giving $g_A$ and $g_A'$.]

  9. Sequence alignment
     • Alignment specifies which positions in two sequences match

       acgtctag         acgtctag         acgtctag
       ||                  |||||         || |||||
       actctag-         -actctag         ac-tctag

       2 matches        5 matches        7 matches
       5 mismatches     2 mismatches     0 mismatches
       1 not aligned    1 not aligned    1 not aligned
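
The counts under the three alignments can be reproduced by walking over the alignment columns. The following is a minimal sketch; the function name `tally_columns` and the use of `-` as the gap symbol are illustrative assumptions rather than anything defined in the slides.

```python
def tally_columns(top: str, bottom: str, gap: str = "-"):
    """Return (matches, mismatches, not_aligned) for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = not_aligned = 0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            not_aligned += 1      # one symbol has no partner in this column
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, not_aligned


# The three candidate alignments of acgtctag and actctag from the slide.
for bottom in ("actctag-", "-actctag", "ac-tctag"):
    print(bottom, tally_columns("acgtctag", bottom))
```

The same pair of sequences gives very different tallies depending on where the gap is placed, which is why the choice of alignment matters.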

  10. Mutations: insertions, deletions and substitutions

        acgtctag
           |||||
        -actctag

      • Indel: insertion or deletion of a base with respect to the ancestor sequence
      • Mismatch: substitution (point mutation) of a single base
      • Insertions and/or deletions are called indels
        − We can't tell whether the ancestor sequence had a base or not at the indel position

  11. Problems
      • What sorts of alignments should be considered?
      • How to score alignments?
      • How to find optimal or good scoring alignments?
      • How to evaluate the statistical significance of scores?
      In this course, we discuss the first three problems. The course Biological sequence analysis tackles all four in depth.

  12. Sequence Alignment (chapter 6)
      • The biological problem
      • Global alignment
      • Local alignment
      • Multiple alignment

  13. Global alignment
      • Problem: find the optimal scoring alignment between two sequences (Needleman & Wunsch 1970)
      • We give a score for each position in the alignment
        − Identity (match): $+1$
        − Substitution (mismatch): $-\mu$
        − Indel: $-\delta$

        WHAT
        ||
        WH-Y

        $S(\text{WHAT}/\text{WH-Y}) = 1 + 1 - \delta - \mu$
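
The scoring rule can be written as a sum over alignment columns. The sketch below assumes a helper named `score_alignment` (hypothetical) and uses $\mu = 1$, $\delta = 2$, the values from the worked example later in the deck, purely for illustration.

```python
def score_alignment(top: str, bottom: str, mu: float, delta: float) -> float:
    """Sum the column scores: +1 for a match, -mu for a mismatch, -delta for an indel."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total -= delta        # indel column
        elif x == y:
            total += 1            # identity
        else:
            total -= mu           # substitution
    return total


# S(WHAT/WH-Y) = 1 + 1 - delta - mu
print(score_alignment("WHAT", "WH-Y", mu=1, delta=2))
```

With these example penalties the alignment WHAT/WH-Y scores 1 + 1 - 2 - 1 = -1.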

  14. Representing alignments and scores

        WHAT
        ||
        WH-Y

             -   W   H   A   T
         -
         W       X
         H           X   X
         Y                   X

  15. Representing alignments and scores

        WHAT
        ||
        WH-Y

             -   W   H   A            T
         -   0
         W       1
         H           2   $2-\delta$
         Y                            $2-\delta-\mu$

      Global alignment score $S_{3,4} = 2 - \delta - \mu$

  16. Dynamic programming
      • How to find the optimal alignment?
      • We use previous solutions for optimal alignments of smaller subsequences
      • This general approach is known as dynamic programming

  17. Filling the alignment matrix
      Consider the alignment process at the shaded square.
      • Case 1. Align H against H (match or substitution).
      • Case 2. Align H in WHY against - (indel) in WHAT.
      • Case 3. Align H in WHAT against - (indel) in WHY.
      [Figure: alignment matrix with columns -, W, H, A, T and rows -, W, H, Y; the shaded square is the cell in row H, column H. The neighbouring cells are labelled Case 1 and Case 2 (row W) and Case 3 (row H).]

  18. Filling the alignment matrix (2)
      Scoring the alternatives.
      • Case 1. $S_{2,2} = S_{1,1} + s(2, 2)$
      • Case 2. $S_{2,2} = S_{1,2} - \delta$
      • Case 3. $S_{2,2} = S_{2,1} - \delta$
      $s(i, j) = 1$ for matching positions, $s(i, j) = -\mu$ for substitutions.
      Choose the case (path) that yields the maximum score. Keep track of path choices.
      [Figure: alignment matrix with columns -, W, H, A, T and rows -, W, H, Y; the cells feeding the shaded square are labelled Case 1, Case 2 and Case 3.]
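
As a worked instance of the three cases for WHAT against WHY, the block below evaluates $S_{2,2}$ symbolically. It assumes $\delta > 0$ and $\mu > 0$, and it uses $S_{1,1} = 1$ (W aligned with W) and $S_{1,2} = S_{2,1} = 1 - \delta$ (the matched W followed by one indel); these intermediate values are derived here under those assumptions and are not stated on the slide.

```latex
\begin{align*}
\text{Case 1: } & S_{1,1} + s(2,2) = 1 + 1 = 2 \\
\text{Case 2: } & S_{1,2} - \delta = (1 - \delta) - \delta = 1 - 2\delta \\
\text{Case 3: } & S_{2,1} - \delta = (1 - \delta) - \delta = 1 - 2\delta \\
S_{2,2} &= \max\{\,2,\; 1 - 2\delta,\; 1 - 2\delta\,\} = 2
\end{align*}
```

Case 1 wins, so the path into the shaded square comes from the diagonal neighbour, which agrees with the value 2 shown for that cell in the score matrix on slide 15.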

  19. Global alignment: formal development
      $A = a_1 a_2 a_3 \ldots a_n$, $B = b_1 b_2 b_3 \ldots b_m$
      • Any alignment can be written as a unique path through the matrix
      • Score for aligning A and B up to positions i and j: $S_{i,j} = S(a_1 a_2 a_3 \ldots a_i,\ b_1 b_2 b_3 \ldots b_j)$
      [Figure: matrix with columns 0 to 4 labelled -, $b_1$, $b_2$, $b_3$, $b_4$ and rows 0 to 3 labelled -, $a_1$, $a_2$, $a_3$, showing the path of the example alignment

         $b_1$  $b_2$  $b_3$  $b_4$  -
         -      $a_1$  -      $a_2$  $a_3$ ]

  20. Scoring partial alignments
      • Alignment of $A = a_1 a_2 a_3 \ldots a_n$ with $B = b_1 b_2 b_3 \ldots b_m$ can end in three ways
        − Case 1:  $(a_1 a_2 \ldots a_{i-1})\ a_i$
                   $(b_1 b_2 \ldots b_{j-1})\ b_j$
        − Case 2:  $(a_1 a_2 \ldots a_{i-1})\ a_i$
                   $(b_1 b_2 \ldots b_j)\ -$
        − Case 3:  $(a_1 a_2 \ldots a_i)\ -$
                   $(b_1 b_2 \ldots b_{j-1})\ b_j$

  21. Scoring alignments
      • Scores for each case:
        − Case 1:  $(a_1 a_2 \ldots a_{i-1})\ a_i$
                   $(b_1 b_2 \ldots b_{j-1})\ b_j$
          $s(a_i, b_j) = \begin{cases} +1 & \text{if } a_i = b_j \\ -\mu & \text{otherwise} \end{cases}$
        − Case 2:  $(a_1 a_2 \ldots a_{i-1})\ a_i$
                   $(b_1 b_2 \ldots b_j)\ -$
        − Case 3:  $(a_1 a_2 \ldots a_i)\ -$
                   $(b_1 b_2 \ldots b_{j-1})\ b_j$
          For Cases 2 and 3: $s(a_i, -) = s(-, b_j) = -\delta$

  22. Scoring alignments (2)
      • First row and first column correspond to initial alignment against indels: $S(i, 0) = -i\delta$, $S(0, j) = -j\delta$
      • Optimal global alignment score $S(A, B) = S_{n,m}$

                    0            1            2            3            4
                    -            $b_1$        $b_2$        $b_3$        $b_4$
         0  -       0            $-\delta$    $-2\delta$   $-3\delta$   $-4\delta$
         1  $a_1$   $-\delta$
         2  $a_2$   $-2\delta$
         3  $a_3$   $-3\delta$

  23. Algorithm for global alignment

        Input sequences A, B, n = |A|, m = |B|
        Set $S_{i,0} := -\delta i$ for all i
        Set $S_{0,j} := -\delta j$ for all j
        for i := 1 to n
            for j := 1 to m
                $S_{i,j} := \max\{S_{i-1,j} - \delta,\ S_{i-1,j-1} + s(a_i, b_j),\ S_{i,j-1} - \delta\}$
            end
        end

      Algorithm takes O(nm) time and space.
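
The pseudocode translates almost line by line into Python. This is a minimal sketch rather than a reference implementation: the function name `global_alignment_matrix` is an assumption, and the default penalties $\mu = 1$, $\delta = 2$ and the test sequences are taken from the deck's own worked example.

```python
def global_alignment_matrix(a: str, b: str, mu: float = 1, delta: float = 2):
    """Fill the (n+1) x (m+1) score matrix S; S[n][m] is the optimal global score."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i              # a_1..a_i aligned against indels only
    for j in range(1, m + 1):
        S[0][j] = -delta * j              # b_1..b_j aligned against indels only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else -mu
            S[i][j] = max(S[i - 1][j] - delta,      # a_i against an indel
                          S[i - 1][j - 1] + s,      # a_i against b_j
                          S[i][j - 1] - delta)      # b_j against an indel
    return S


S = global_alignment_matrix("ATCGT", "TGGTG", mu=1, delta=2)
for row in S:
    print(row)
```

Both loops visit every cell once and each cell looks at three neighbours, which is where the O(nm) time and space bound comes from.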

  24. Global alignment: example
      $\mu = 1$, $\delta = 2$

             -    T    G    G    T    G
        -    0   -2   -4   -6   -8  -10
        A   -2
        T   -4
        C   -6
        G   -8
        T  -10                        ?

  25. Global alignment: example (2)
      $\mu = 1$, $\delta = 2$

             -    T    G    G    T    G
        -    0   -2   -4   -6   -8  -10
        A   -2   -1   -3   -5   -7   -9
        T   -4   -1   -2   -4   -4   -6
        C   -6   -3   -2   -3   -5   -5
        G   -8   -5   -2   -1   -3   -4
        T  -10   -7   -4   -3    0   -2

        ATCGT-
         | ||
        -TGGTG
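
The alignment shown under the matrix can be recovered by tracing back from the bottom-right cell and asking, at each step, which of the three cases produced the stored score. The sketch below is one way to do this under stated assumptions: the name `global_align` is hypothetical, ties are broken in favour of the diagonal move (a choice made here, not by the slides), and instead of storing path choices during the fill it re-derives them during traceback.

```python
def global_align(a: str, b: str, mu: float = 1, delta: float = 2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    def s(x, y):
        return 1 if x == y else -mu

    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i
    for j in range(1, m + 1):
        S[0][j] = -delta * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] - delta,
                          S[i][j - 1] - delta)

    # Walk back from S[n][m] to S[0][0], collecting columns in reverse order.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            top.append(a[i - 1])          # a_i aligned with b_j
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or S[i][j] == S[i - 1][j] - delta):
            top.append(a[i - 1])          # a_i aligned with an indel
            bottom.append("-")
            i -= 1
        else:
            top.append("-")               # b_j aligned with an indel
            bottom.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = global_align("ATCGT", "TGGTG", mu=1, delta=2)
print(score)
print(row_a)
print(row_b)
```

For the example sequences this reproduces the score in the bottom-right cell and the alignment ATCGT- over -TGGTG.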

  26. Sequence Alignment (chapter 6)
      • The biological problem
      • Global alignment
      • Local alignment
      • Multiple alignment

  27. Local alignment: rationale
      • Otherwise dissimilar proteins may have local regions of similarity -> proteins may share a function
      Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles a 291 aa region in TGF-β receptor (right). The shared function here is protein kinase.

  28. Local alignment: rationale
      [Figure: sequences A and B with their regions of similarity marked]
      • Global alignment would be inadequate
      • Problem: find the highest scoring local alignment between two sequences
      • Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981)

  29. From global to local alignment
      • Modifications to the global alignment algorithm
        − Look for the highest-scoring path in the alignment matrix (not necessarily through the whole matrix)
        − Allow preceding and trailing indels without penalty
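
A common way to realise these two modifications is the standard Smith-Waterman formulation, which this slide does not spell out: the first row and column are set to 0, every cell is floored at 0 so that a path may start anywhere, and the best alignment ends at the highest-scoring cell. The sketch below follows that formulation; the name `local_align`, the test sequences, and the penalties $\mu = 1$, $\delta = 2$ are illustrative assumptions.

```python
def local_align(a: str, b: str, mu: float = 1, delta: float = 2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    # First row and column stay 0: preceding indels cost nothing.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else -mu
            H[i][j] = max(0,                        # start a new alignment here
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] - delta,
                          H[i][j - 1] - delta)
            if H[i][j] > best:                      # the path may end anywhere
                best, best_i, best_j = H[i][j], i, j

    # Trace back from the best cell until a cell with score 0 is reached.
    top, bottom = [], []
    i, j = best_i, best_j
    while H[i][j] > 0:
        s = 1 if a[i - 1] == b[j - 1] else -mu
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] - delta:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Two otherwise dissimilar sequences that share the region ACGT.
print(local_align("TTTACGTCCC", "GGGACGTAAA"))
```

Stopping the traceback at the first 0 is what makes the trailing and preceding parts of both sequences free: they are simply left out of the reported alignment.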
