Sequence Alignment (chapter 6) The biological problem l Global - PowerPoint PPT Presentation

Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l Multiple alignment l Introduction to bioinformatics, Autumn 2006 22

Background: comparative genomics Basic question in biology: what properties are shared l among organisms? Genome s equencing allows comparison of organisms l at DNA and protein levels Comparisons can be used to l − Find evolutionary relationships between organisms − Identify functionally conserved sequences − Identify corresponding genes in human and model organisms: develop models for human diseases Introduction to bioinformatics, Autumn 2006 23

Homologs • Two genes or characters g B and g C evolved from the same ancestor g A are g A = agt gt ccgt t aagt gcgt t c called homologs g B = agt gccgt t aaagt t gt acgt c • Homologs usually exhibit conserved functions g C = ct gact gt t t gt ggt t c • Close evolutionary relationship => expect a high number of homologs Introduction to bioinformatics, Autumn 2006 24

Sequence similarity Intuitively, similarity of two sequences refers to the l degree of match between corresponding positions in sequence agt gccgt t aaagt t gt acgt c ct gact gt t t gt ggt t c What about sequences that differ in length? l Introduction to bioinformatics, Autumn 2006 25

Similarity vs homology Sequence similarity is not sequence homology l − If the two sequences g B and g C have accumulated enough mutations, the similarity between them is likely to be low #mutations #mutations 0 agt gt ccgt t aagt gcgt t c 64 acagt ccgt t cgggct at t g 1 agt gt ccgt t at agt gcgt t c 128 cagagcact accgc 2 agt gt ccgct t at agt gcgt t c 256 cacgagt aagat at agct 4 agt gt ccgct t aagggcgt t c 512 t aat cgt gat a 8 agt gt ccgct t caaggggcgt 1024 accct t at ct act t cct ggagt t 16 gggccgt t cat gggggt 2048 agcgacct gcccaa 32 gcagggcgt cact gagggct 4096 caaac Homology is more difficult to detect over greater evolutionary distances. Introduction to bioinformatics, Autumn 2006 26

Similarity vs homology (2) Sequence similarity can occur by chance l − Similarity does not imply homology Similarity is an expected consequence of homology l Introduction to bioinformatics, Autumn 2006 27

Orthologs and paralogs We distinguish between two types of homology l − Orthologs: homologs from two different species − Paralogs: homologs within a species Organism A g A g A Gene A is copied g A g A’ within organism A g B g C g B g C Organism B Organism C Introduction to bioinformatics, Autumn 2006 28

Orthologs and paralogs (2) Orthologs typically retain the original function l In paralogs, one copy is free to mutate and acquire l new function (no selective pressure) Organism A g A g A Gene A is copied g A g A’ within organism A g B g C g B g C Organism B Organism C Introduction to bioinformatics, Autumn 2006 29

Sequence alignment Alignment specifies which positions in two sequences l match acgtctag acgtctag acgtctag || ||||| || ||||| actctag- -actctag ac-tctag 2 matches 5 matches 7 matches 5 mismatches 2 mismatches 0 mismatches 1 not aligned 1 not aligned 1 not aligned Introduction to bioinformatics, Autumn 2006 30

Mutations: Insertions, deletions and substitutions acgtctag Indel: insertion or Mismatch: substitution ||||| deletion of a base (point mutation) of with respect to the a single base -actctag ancestor sequence Insertions and/or deletions are called indels l − We can’t tell whether the ancestor sequence had a base or not at indel position Introduction to bioinformatics, Autumn 2006 31

Problems What sorts of alignments should be considered? l How to score alignments? l How to find optimal or good scoring alignments? l How to evaluate the statistical significance of scores? l In this course, we discuss the first three problems. Course Biological sequence analysis tackles all four in- depth. Introduction to bioinformatics, Autumn 2006 32

Global alignment Problem: find optimal scoring alignment between two l sequences (Needleman & Wunsch 1970) We give score for each position in alignment l WHAT − Identity (match) +1 − Substitution (mismatch) -µ || �� − Indel WH-Y S(WHAT/WH-Y) = 1 + 1 – � – µ Introduction to bioinformatics, Autumn 2006 34

Representing alignments and scores WHAT - W H A T || - WH-Y W X H X X Y X Introduction to bioinformatics, Autumn 2006 35

Representing alignments and scores WHAT - W H A T || - 0 WH-Y W 1 H 2 2- � Global alignment Y 2- � -µ score S 3,4 = 2- � -µ Introduction to bioinformatics, Autumn 2006 36

Dynamic programming How to find the optimal alignment? l We use previous solutions for optimal alignments of l smaller subsequences This general approach is known as dynamic l programming Introduction to bioinformatics, Autumn 2006 37

Filling the alignment matrix - W H A T Consider the alignment process at shaded square. - Case 1. Align H against H (match or substitution). W Case 1 Case 2. Align H in WHY against Case 2 – (indel) in WHAT. H Case 3 Case 3. Align H in WHAT against – (indel) in WHY. Y Introduction to bioinformatics, Autumn 2006 38

Filling the alignment matrix (2) - W H A T Scoring the alternatives. Case 1. S 2,2 = S 1,1 + s(2, 2) - Case 2. S 2,2 = S 1,2 � � W Case 3. S 2,2 = S 2,1 � � Case 1 Case 2 s(i, j) = 1 for matching positions, H s(i, j) = - µ for substitutions. Case 3 Y Choose the case (path) that yields the maximum score. Keep track of path choices. Introduction to bioinformatics, Autumn 2006 39

Global alignment: formal development A = a 1 a 2 a 3 …a n , 0 1 2 3 4 B = b 1 b 2 b 3 …b m - b 1 b 2 b 3 b 4 b 1 b 2 b 3 b 4 - 0 - - a 1 - a 2 a 3 l Any alignment can be written 1 a 1 as a unique path through the matrix 2 a 2 l Score for aligning A and B up to positions i and j: 3 a 3 S i,j = S(a 1 a 2 a 3 …a i , b 1 b 2 b 3 …b j ) Introduction to bioinformatics, Autumn 2006 40

Scoring partial alignments Alignment of A = a 1 a 2 a 3 …a n with B = b 1 b 2 b 3 …b m can end in l three ways − Case 1: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j-1 ) b j − Case 2: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j ) - − Case 3: (a 1 a 2 …a i ) – (b 1 b 2 …b j-1 ) b j Introduction to bioinformatics, Autumn 2006 41

Scoring alignments Scores for each case: l +1 if a i = b j s(a i , b j ) = { -µ otherwise − Case 1: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j-1 ) b j − Case 2: (a 1 a 2 …a i-1 ) a i (b 1 b 2 …b j ) – s(a i , -) = s(-, b j ) = - � − Case 3: (a 1 a 2 …a i ) – (b 1 b 2 …b j-1 ) b j Introduction to bioinformatics, Autumn 2006 42

Scoring alignments (2) • First row and first column 0 1 2 3 4 correspond to initial alignment against indels: - b 1 b 2 b 3 b 4 S(i, 0) = -i � S(0, j) = -j � �� -2 � -3 � -4 � 0 0 - • Optimal global alignment �� 1 a 1 score S(A, B) = S n,m -2 � 2 a 2 -3 � 3 a 3 Introduction to bioinformatics, Autumn 2006 43

Algorithm for global alignment I nput sequences A, B, n = | A|, m = |B| Set S i,0 := - � i f or all i Set S 0,j := - � j f or all j f or i := 1 t o n f or j := 1 t o m S i,j := max{S i-1,j – � , S i-1,j -1 + s(a i ,b j ), S i,j -1 – � } end end Algorithm takes O(nm) time and space. Introduction to bioinformatics, Autumn 2006 44

Global alignment: example - T G G T G µ = 1 - 0 -2 -4 -6 -8 -10 � = 2 A -2 T -4 C -6 G -8 T -10 ? Introduction to bioinformatics, Autumn 2006 45

Global alignment: example (2) - T G G T G µ = 1 - 0 -2 -4 -6 -8 -10 � = 2 A -2 -1 -3 -5 -7 -9 T -4 -1 -2 -4 -4 -6 C -6 -3 -2 -3 -5 -5 ATCGT- G -8 -5 -2 -1 -3 -4 | || T -10 -7 -4 -3 0 -2 -TGGTG Introduction to bioinformatics, Autumn 2006 46

Local alignment: rationale • Otherwise dissimilar proteins may have local regions of similarity -> Proteins may share a function Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles 291 aa region in TGF- � receptor (right). The shared function here is protein kinase. Introduction to bioinformatics, Autumn 2006 48

Local alignment: rationale A B Regions of similarity • Global alignment would be inadequate • Problem: find the highest scoring local alignment between two sequences • Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981) Introduction to bioinformatics, Autumn 2006 49

From global to local alignment Modifications to the global alignment algorithm l − Look for the highest-scoring path in the alignment matrix (not necessarily through the matrix) − Allow preceding and trailing indels without penalty Introduction to bioinformatics, Autumn 2006 50

Sequence Alignment (chapter 6) The biological problem l Global - PowerPoint PPT Presentation

Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l Multiple alignment l Introduction to bioinformatics, Autumn 2006 22 Background: comparative genomics Basic question in biology: what properties

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

This week CSE 527 Sequence alignment Computational Biology More sequence alignment

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence Alignment Mark Voorhies 5/20/2015 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 5/29/2013 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 4/12/2018 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 4/24/2012 Mark Voorhies Sequence Alignment Exercise:

CSE 421 Algorithms Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE 427 Computational Biology Winter 2008 Sequence Alignment; DNA Replication 1 Sequence

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE421 Algorithms Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Part 2 Comparative Analysis of RNAs S.Will, 18.417, Fall 2011 Example Given: set of

CSE 182: Biological Data Analysis Instructor: Vineet Bafna TA: Ryan Kelley www. www.cse cse.

INF 111 / CSE 121: Software Tools and Methods Lecture Notes for Fall Quarter, 2007 Michele

Information & Entropy Comp 595 DM Professor Wang Information & Entropy Information

Outline What is EMBOSS? Major programs Running EMBOSS Programs from the Unix

Sequence Analysis with TraMineR Gilbert Ritschard Institute for Demographic and Life Course

GPU accelerated partial order multiple sequence alignment for long reads self-correction

Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy

Sequence Alignment (chapter 6) The biological problem l Global - PowerPoint PPT Presentation

Sequence Alignment (chapter 6) The biological problem l Global alignment l Local alignment l Multiple alignment l Introduction to bioinformatics, Autumn 2006 22 Background: comparative genomics Basic question in biology: what properties

Sequence Alignment Gerhard Jger ESSLLI 2016 Gerhard Jger Sequence Alignment ESSLLI 2016 1

This week CSE 527 Sequence alignment Computational Biology More sequence alignment

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence Alignment Mark Voorhies 5/20/2015 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 5/29/2013 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 4/12/2018 Mark Voorhies Sequence Alignment Exercise: Scoring

Sequence Alignment Mark Voorhies 4/24/2012 Mark Voorhies Sequence Alignment Exercise:

CSE 421 Algorithms Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE 427 Computational Biology Winter 2008 Sequence Alignment; DNA Replication 1 Sequence

CSE 427 Comp Bio Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

CSE421 Algorithms Sequence Alignment 1 Sequence Alignment What Why A Dynamic Programming

SEQUENCE ANALYSIS The term &quot; sequence analysis &quot; in biology implies subjecting a DNA or

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Part 2 Comparative Analysis of RNAs S.Will, 18.417, Fall 2011 Example Given: set of

CSE 182: Biological Data Analysis Instructor: Vineet Bafna TA: Ryan Kelley www. www.cse cse.

INF 111 / CSE 121: Software Tools and Methods Lecture Notes for Fall Quarter, 2007 Michele

Information &amp; Entropy Comp 595 DM Professor Wang Information &amp; Entropy Information

Outline What is EMBOSS? Major programs Running EMBOSS Programs from the Unix

Sequence Analysis with TraMineR Gilbert Ritschard Institute for Demographic and Life Course

GPU accelerated partial order multiple sequence alignment for long reads self-correction

Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy

SEQUENCE ANALYSIS The term " sequence analysis " in biology implies subjecting a DNA or

Information & Entropy Comp 595 DM Professor Wang Information & Entropy Information