CSI5126 Algorithms in Bioinformatics


Substitution Score
CSI5126 Algorithms in Bioinformatics
Marcel Turcotte

Outline: Preamble, Significance, Models, Substitutions, Markov Chains, PAM


  1. Probabilistic Interpretation of a Sequence Alignment
     Only ungapped alignments are considered. The interpretation requires weighing two outcomes:
     1. the sequences are related (Match Model, M);
     2. the sequences are unrelated (Random Model, R).
     Consider two aligned sequences, S1 and S2; for simplicity, both have the same length n:
         S1(1) S1(2) ... S1(n)
         S2(1) S2(2) ... S2(n)

  2. Match Model
     In the match model, we have
         P(S1, S2 | M) = ∏_i q(S1(i), S2(i))
     where q(a, b) represents the probability that residues a and b have both been derived independently from an ancestral residue c.
     [Diagram: positions S1(i) and S2(i) both descend from position S(i) of an ancestral sequence S.]

  3. Random Model
     Whilst the random model is simply
         P(S1, S2 | R) = ∏_i p_{S1(i)} ∏_j p_{S2(j)}
     but since we assumed that |S1| = |S2|,
         P(S1, S2 | R) = ∏_i p_{S1(i)} p_{S2(i)}
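To make the two models concrete, here is a minimal sketch, not from the slides: the two-letter alphabet, the background frequencies p and the joint (match-model) probabilities q are made-up illustrative numbers; real values would be estimated from data.

```python
import math

# Hypothetical background frequencies p[a] and match-model joint probabilities
# q[(a, b)] for a tiny two-letter alphabet (illustrative numbers only).
p = {"A": 0.6, "B": 0.4}
q = {("A", "A"): 0.5, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.3}

def match_likelihood(s1, s2):
    """P(S1, S2 | M) = product over i of q(S1(i), S2(i))."""
    return math.prod(q[(a, b)] for a, b in zip(s1, s2))

def random_likelihood(s1, s2):
    """P(S1, S2 | R) = product over i of p(S1(i)) * p(S2(i))."""
    return math.prod(p[a] * p[b] for a, b in zip(s1, s2))

s1, s2 = "AAB", "ABB"
print(match_likelihood(s1, s2))   # 0.5 * 0.1 * 0.3 = 0.015
print(random_likelihood(s1, s2))  # (0.6*0.6) * (0.6*0.4) * (0.4*0.4) = 0.013824
```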

  4. The ratio of the two likelihoods is called an odds ratio (or likelihood ratio),
         P(S1, S2 | M) / P(S1, S2 | R) = ∏_i q(S1(i), S2(i)) / (p_{S1(i)} p_{S2(i)})
     Taking the logarithm leads to a quantity known as the log-odds ratio,
         S(S1, S2) = ∑_i log( q(S1(i), S2(i)) / (p_{S1(i)} p_{S2(i)}) )
     where each
         s(a, b) = log( q(a, b) / (p_a p_b) )
     represents the log-likelihood ratio that the residue pair (a, b) will occur as an aligned pair, as opposed to unaligned.
     In the case of proteins, s(a, b) is a 20 × 20 matrix, known as a score matrix or substitution matrix.
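Continuing with the same toy numbers, here is a sketch of the resulting log-odds scores and of the total score of an ungapped alignment; the base-2 logarithm is an arbitrary choice (any base changes the scores only by a constant factor).

```python
import math

# Same hypothetical p and q as in the previous sketch.
p = {"A": 0.6, "B": 0.4}
q = {("A", "A"): 0.5, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.3}

def s(a, b):
    """Log-odds score s(a, b) = log( q(a, b) / (p_a * p_b) )."""
    return math.log2(q[(a, b)] / (p[a] * p[b]))

def alignment_score(s1, s2):
    """S(S1, S2) = sum over i of s(S1(i), S2(i)) for an ungapped alignment."""
    return sum(s(a, b) for a, b in zip(s1, s2))

print(round(s("A", "A"), 2))   # positive: the pair is more likely aligned than by chance
print(round(s("A", "B"), 2))   # negative: an unfavourable substitution
print(round(alignment_score("AAB", "ABB"), 2))
```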

  5. The score is interpreted as the logarithm of the relative likelihood that the sequences are related vs not related.
     Positive terms represent substitutions that are more likely than would be expected by chance.
     Negative terms represent unfavorable substitutions.
     Finally, when the two hypotheses are equally likely, the log-likelihood ratio will be zero.
     We see that such a substitution matrix can be used for calculating local sequence alignments, since likely alignments will have a positive score and unlikely alignments will have a negative score.
     In this view, the total score of an alignment is the sum of all the terms for the aligned pairs of residues and gaps.
     An additive scoring scheme means that positions along the sequence are considered independent from one another, i.e. mutations at different sites have occurred independently. It's a working hypothesis.

  6. What about the Substitution Scores?
     The substitution scores that we used were rather arbitrary: either the identity matrix or some hand-made matrix.
     Let's have a look at scoring schemes that are appropriate for protein sequences.
     Certain amino acids have similar properties (structure, volume, charge, hydrophobicity, etc.).
     Looking at the genetic code, you can see that certain pairs of amino acids are such that the minimum number of mutations at the codon level to change the encoding from one amino acid type to another is only one (Ala and Asp, GCC and GAC), while other pairs need a minimum of two mutations (Ala and Arg, CGA and GCA) or even three (Asn and Trp, AAC or AAU and UGG).
     The substitution score is expected to reflect both of these effects.

  7. The (20) Amino Acids
     A (Ala), R (Arg), N (Asn), D (Asp), C (Cys), Q (Gln), E (Glu), G (Gly), H (His), I (Ile), L (Leu), K (Lys), M (Met), F (Phe), P (Pro), S (Ser), T (Thr), W (Trp), Y (Tyr), V (Val)

  8. [Diagram: Venn-style grouping of the 20 amino acids by shared physico-chemical properties: aliphatic, aromatic, hydrophobic, tiny, small, polar, charged, positive, negative.]

  9. Genetic Code
             U            C            A            G
     U   UUU Phe      UCU Ser      UAU Tyr      UGU Cys     U
         UUC Phe      UCC Ser      UAC Tyr      UGC Cys     C
         UUA Leu      UCA Ser      UAA Stop     UGA Stop    A
         UUG Leu      UCG Ser      UAG Stop     UGG Trp     G
     C   CUU Leu      CCU Pro      CAU His      CGU Arg     U
         CUC Leu      CCC Pro      CAC His      CGC Arg     C
         CUA Leu      CCA Pro      CAA Gln      CGA Arg     A
         CUG Leu      CCG Pro      CAG Gln      CGG Arg     G
     A   AUU Ile      ACU Thr      AAU Asn      AGU Ser     U
         AUC Ile      ACC Thr      AAC Asn      AGC Ser     C
         AUA Ile      ACA Thr      AAA Lys      AGA Arg     A
         AUG Met      ACG Thr      AAG Lys      AGG Arg     G
     G   GUU Val      GCU Ala      GAU Asp      GGU Gly     U
         GUC Val      GCC Ala      GAC Asp      GGC Gly     C
         GUA Val      GCA Ala      GAA Glu      GGA Gly     A
         GUG Val      GCG Ala      GAG Glu      GGG Gly     G
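Returning to the claim about minimum codon-level changes a few slides back, it can be checked directly against this table; here is a minimal sketch that hard-codes only the codons involved (RNA alphabet), although the full table could be handled the same way.

```python
from itertools import product

# Codons for the amino acids mentioned on the substitution-scores slide.
codons = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Asp": ["GAU", "GAC"],
    "Arg": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "Asn": ["AAU", "AAC"],
    "Trp": ["UGG"],
}

def min_mutations(aa1, aa2):
    """Minimum number of nucleotide changes turning a codon of aa1 into a codon of aa2."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1, c2 in product(codons[aa1], codons[aa2]))

print(min_mutations("Ala", "Asp"))  # 1 (e.g. GCC -> GAC)
print(min_mutations("Ala", "Arg"))  # 2 (e.g. GCA -> CGA)
print(min_mutations("Asn", "Trp"))  # 3 (AAC or AAU -> UGG)
```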

 10. Deriving Scores
     Substitution scores could be derived from first principles (chemical properties, etc.), or they could be estimated from the data.

 11. Pitfalls
     Sampling problem: sequences come in families.
     Time dependence: for distant sequences, we'd expect the probability of a substitution to be large, and low if the two sequences are close homologues.
     For short time periods, the influence of the genetic code is expected to be stronger than the chemical properties; the trend should be reversed for longer intervals.

 12. Alignment 1 (86.5% identity; global alignment score: 786):
       A  VLSAADKGNVKAAWGKVGGHAAEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGA
       B  VLSAADKANVKAAWGKVGGQAGAHGAEALERMFLGFPTTKTYFPHFNLSHGSDQVKAHGQ
     Alignment 2 (24.8% identity; global alignment score: 46):
       A  VLSAADKGNVKAAWGKVGGHAAEYGAEALERMFLSFPTTKTYFPHFD-LSHGSAQ--VKG
       B  SLSAAQKDNVKSSWAKA---SAAWGTAGPEFFMALFDAHDDVFAKFSGLFSGAAKGTVKN
     ⇒ Consider the substitution s(Gly, Ala) at position 8 of the first alignment and the same substitution at position 15 in the second alignment: are those two substitutions equally likely?

 13. Markov Chains
     We need a framework to model substitutions: discrete-time homogeneous finite Markov chain models.

 14. Our presentation will be informal. An entire course could be taught on Markov chains and stochastic processes.
     MAT 4374 Modern Computational Statistics: simulation, including the rejection method and importance sampling; applications to Monte Carlo Markov chains. Resampling methods such as the bootstrap and jackknife, with applications. Smoothing methods in curve estimation.
     MAT 5198 Stochastic Models: Markov systems, stochastic networks, queuing networks, spatial processes, approximation methods in stochastic processes and queuing theory. Applications to the modelling and analysis of computer-communications systems and other distributed networks.

 15. Markov Chains
     Like finite state automata (FSA): finite Markov chains allow us to model processes that can be represented by a finite number of states.
     A process can be in any of these states at a given time, for some discrete units of time t = 0, 1, 2, ...
     E.g. the amino acid type at a given sequence position at time t.

 16. Markov Chains
     Unlike FSAs: the transitions from one state to another are stochastic (not deterministic).
     If the current state of the process at time t is E_i, then at time t + 1 the process either stays in E_i or moves to E_j, for some j, according to a well defined probability.
     E.g. at time t + 1 the amino acid type at a given sequence position either stays the same or is substituted by one of the remaining 19 amino acid types, according to a well defined probability, to be estimated.

 17. Markov Chains
     [Diagram: a three-state chain over E1, E2, E3, with transition probabilities 0.8, 0.1, 0.1 out of E1, 0.6, 0.4 out of E2, and 0.6, 0.4 out of E3; the corresponding matrix appears a few slides below.]

 18. Properties
     A (first-order) Markovian process must conform to the following 2 properties:
     1. Memoryless. If a process is in state E_i at time t, then the probability that it will be in state E_j at time t + 1 only depends on E_i (and not on the previous states visited at times t' < t; no history). This is called a first-order Markovian process.
     2. Homogeneity of time. If a process is in state E_i at time t, then the probability that it will be in state E_j at time t + 1 is independent of t.

 19. Mutations are often modeled as the result of a Markovian process. For a given protein, if the amino acid type found at a certain position is A at time t, then:
     1. The probability that A is replaced by B at time t + 1 depends only on the current amino acid type found at this position at time t, which is A; the fact that C was previously found at this position for some t' < t does not influence the probability of A being substituted by B.
     2. Also, the probability of A being replaced by B at t + 1 is independent of t, i.e. the fact that this event is occurring now or 250 million years ago does not affect the probability of A being substituted by B.
     Sometimes the concept of time is replaced by that of space. This allows one to model dependencies along a protein or DNA sequence.

 20. Markov Chain
     A (first-order) Markov chain is a sequence of random variables
         X_0, ..., X_{t-1}, X_t
     that satisfies the following property:
         P(X_t = x_t | X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}, ..., X_0 = x_0) = P(X_t = x_t | X_{t-1} = x_{t-1})

 21. Markov Chain
     More generally, an m-order Markov chain is a sequence of random variables
         X_0, ..., X_{t-1}, X_t
     that satisfies the following property:
         P(X_t = x_t | X_{t-1} = x_{t-1}, ..., X_0 = x_0) = P(X_t = x_t | X_{t-1} = x_{t-1}, ..., X_{t-m} = x_{t-m})
     A 0-order model is known as a Bernoulli model. Markov chain models are denoted Mm, where m is the order of the model, e.g. M0, M1, M2, M3, etc.
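As a small, tangential illustration of the Mm notation, here is a sketch comparing maximum-likelihood M0 (Bernoulli) and M1 fits of a DNA string; the sequence and the estimation-by-counting are made up for illustration only.

```python
import math
from collections import Counter

seq = "ACGTACGTAACCGGTTACGT"          # made-up DNA string

# M0 (Bernoulli): independent symbols, P(x) estimated from single-symbol counts.
p0 = {s: c / len(seq) for s, c in Counter(seq).items()}
loglik_m0 = sum(math.log(p0[s]) for s in seq)

# M1 (first-order): P(x_t | x_{t-1}) estimated from dinucleotide counts.
pair_counts = Counter(zip(seq, seq[1:]))
first_counts = Counter(a for a, _ in zip(seq, seq[1:]))
p1 = {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}
loglik_m1 = math.log(p0[seq[0]]) + sum(math.log(p1[ab]) for ab in zip(seq, seq[1:]))

print(loglik_m0, loglik_m1)           # log-likelihoods of the same string under M0 and M1
```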

 22. Transition Probabilities
     The transition probabilities, p_ij, can be represented graphically or as a transition probability matrix:
         P = | 0.8  0.1  0.1 |
             | 0.6  0.4  0.0 |
             | 0.6  0.0  0.4 |

 23. Transition Probabilities
         P = | 0.8  0.1  0.1 |
             | 0.6  0.4  0.0 |
             | 0.6  0.0  0.4 |
     where p_ij is understood as the probability of a transition from state i (row) to state j (column).
     The values in a row represent all the transitions from state i, i.e. all outgoing arcs, and therefore their sum must be 1.
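A minimal sketch, assuming numpy, that checks the row-sum property for the three-state matrix above and simulates the chain for a few steps; the helper name simulate is our own.

```python
import numpy as np

# Transition probability matrix of the 3-state chain E1, E2, E3 shown above;
# row i holds the probabilities of all transitions leaving state E_{i+1}.
P = np.array([[0.8, 0.1, 0.1],
              [0.6, 0.4, 0.0],
              [0.6, 0.0, 0.4]])

assert np.allclose(P.sum(axis=1), 1.0)  # every row of outgoing probabilities sums to 1

def simulate(P, start, steps, seed=0):
    """Run the chain for `steps` transitions, returning the visited state indices."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(simulate(P, start=0, steps=10))   # one random trajectory through E1, E2, E3
```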

 24. The framework allows us to answer elegantly questions such as this one: "a Markovian random variable is in state E_i at time t; what is the probability that it will be in state E_j at t + 2?"
     [Diagram: a five-state chain over E1, ..., E5; its transition matrix is given a few slides below.]
     For the Markovian process graphically depicted above, knowing that a random variable is in state E2 at time t, what is the probability that it will be in state E5 at t + 2, i.e. after two transitions?

 25. There are exactly 3 paths of length 2 leading from E2 to E5: (E2, E2, E5), (E2, E3, E5) and (E2, E4, E5).
     The probability that (E2, E2, E5) is followed is 0.2 × 0.2 = 0.04.
     The probability that (E2, E3, E5) is followed is 0.1 × 0.4 = 0.04.
     The probability that (E2, E4, E5) is followed is 0.1 × 0.4 = 0.04.
     Therefore, the probability that the random variable is found in E5 at t + 2, knowing that it was in E2 at t, is 0.04 + 0.04 + 0.04 = 0.12.
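Here is a small sketch that reproduces the 0.12 above by enumerating every length-2 path from E2 to E5; the 5-state transition matrix used is the one shown a couple of slides further down.

```python
# Transition matrix of the 5-state chain E1..E5 (taken from the P2 slide below).
P = [[0.2, 0.8, 0.0, 0.0, 0.0],
     [0.4, 0.2, 0.1, 0.1, 0.2],
     [0.0, 0.6, 0.0, 0.0, 0.4],
     [0.0, 0.6, 0.0, 0.0, 0.4],
     [0.0, 0.0, 0.5, 0.5, 0.0]]

start, end = 1, 4  # 0-based indices for E2 and E5

# Enumerate every intermediate state k and keep the paths with non-zero probability.
paths = [(start, k, end, P[start][k] * P[k][end])
         for k in range(len(P)) if P[start][k] * P[k][end] > 0]

for i, k, j, prob in paths:
    print(f"E{i+1} -> E{k+1} -> E{j+1}: {prob:.2f}")   # three paths, each 0.04

print(sum(prob for *_, prob in paths))                  # 0.04 + 0.04 + 0.04 = 0.12
```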

 26. In general, the probability that a random variable is found in state E_j at t + 2, knowing that it was in E_i at t, is
         p_ij^(2) = ∑_k p_ik p_kj

 27. ... which is the product of row i by column j of the transition probability matrix. This is also the element (i, j) in the matrix P²!
     Hence, P² gives all the transition probabilities for moving from state E_i to E_j in two units of time (steps).
         P  = | 0.2  0.8  0.0  0.0  0.0 |
              | 0.4  0.2  0.1  0.1  0.2 |
              | 0.0  0.6  0.0  0.0  0.4 |
              | 0.0  0.6  0.0  0.0  0.4 |
              | 0.0  0.0  0.5  0.5  0.0 |

         P² = | 0.36  0.32  0.08  0.08  0.16 |
              | 0.16  0.48  0.12  0.12  0.12 |
              | 0.24  0.12  0.26  0.26  0.12 |
              | 0.24  0.12  0.26  0.26  0.12 |
              | 0.00  0.60  0.00  0.00  0.40 |
     ⇒ What are all those zeros?
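The same answer can be obtained by matrix multiplication; a sketch assuming numpy: squaring P gives every two-step probability at once.

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0, 0.0, 0.0],
              [0.4, 0.2, 0.1, 0.1, 0.2],
              [0.0, 0.6, 0.0, 0.0, 0.4],
              [0.0, 0.6, 0.0, 0.0, 0.4],
              [0.0, 0.0, 0.5, 0.5, 0.0]])

P2 = P @ P                 # element (i, j) is sum_k p_ik * p_kj
print(np.round(P2, 2))     # matches the P^2 matrix shown above
print(round(P2[1, 4], 2))  # E2 -> E5 in two steps: 0.12
```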

 28. In general, P^n (P to the nth power) gives all the "n-step" transition probabilities.
         P^5  = | 0.1974  0.3827  0.1280  0.1280  0.1638 |
                | 0.1914  0.3894  0.1182  0.1182  0.1827 |
                | 0.1536  0.4406  0.1085  0.1085  0.1888 |
                | 0.1536  0.4406  0.1085  0.1085  0.1888 |
                | 0.2304  0.2688  0.1688  0.1688  0.1632 |

         P^25 = | 0.1899  0.3797  0.1266  0.1266  0.1772 |
                | 0.1899  0.3797  0.1266  0.1266  0.1772 |
                | 0.1899  0.3797  0.1266  0.1266  0.1772 |
                | 0.1899  0.3797  0.1266  0.1266  0.1772 |
                | 0.1899  0.3797  0.1266  0.1266  0.1772 |

 29. In three steps, we have
         p_ij^(3) = ∑_k p_ik p_kj^(2)
     and for n steps,
         p_ij^(n) = ∑_k p_ik p_kj^(n-1)
     In other words,
         P^(n) = P × P × ... × P   (n times)
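A sketch, assuming numpy, that recomputes P^5 and P^25 with np.linalg.matrix_power; by P^25 every row is (nearly) the same, which is why all rows of the matrix two slides above agree: after many steps the chain essentially forgets its starting state.

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0, 0.0, 0.0],
              [0.4, 0.2, 0.1, 0.1, 0.2],
              [0.0, 0.6, 0.0, 0.0, 0.4],
              [0.0, 0.6, 0.0, 0.0, 0.4],
              [0.0, 0.0, 0.5, 0.5, 0.0]])

for n in (5, 25):
    Pn = np.linalg.matrix_power(P, n)   # n-step transition probabilities
    print(f"P^{n}:\n{np.round(Pn, 4)}")

# After 25 steps every row is ~[0.1899, 0.3797, 0.1266, 0.1266, 0.1772]:
# the distribution no longer depends on the starting state.
```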

 30. PAM Matrices
     Dayhoff, M., Schwartz, R. and Orcutt, B. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, 5, 345-352.
     PAM stands for "Point Accepted Mutation": a mutation which has not only occurred but has also been retained and has spread to the entire population (species).
     The PAM1 matrix is a Markov chain matrix corresponding to a period of time such that 1% of the amino acids have undergone a point accepted mutation.

 31. Margaret Dayhoff (1925-1983)
     Georgetown University Medical Center professor and bioinformatics pioneer!

 32. PAM Matrix: Construction
     Just like for the BLOSUM matrix, which is another popular substitution scheme, the probabilities are estimated from data.
     The starting point is a collection of ungapped multiple alignments.
     The sequences have to be sufficiently close (homologues) that they can be reliably aligned (with a trivial substitution matrix). Dayhoff et al. decided that all the sequences in an alignment had to be no more than 15% different from any other sequence.
     The choice of the cutoff was also dictated by the fact that they wanted to avoid the possibility that more than one mutation had occurred at a given site, which is important since substitution matrices for longer periods of time will be derived from PAM1 by raising it to the nth power.
     With the low amount of data available at the time, and the above constraints, they were able to collect 71 families.

 33. Phylogenetic Trees
     From the sequences, phylogenetic trees are reconstructed. The method that they used is called maximum parsimony. It produces trees such that the total number of substitutions across the whole tree is minimum.
     In the following trees, only one mutational event is necessary to explain the actual sequences:
     [Diagram: small trees whose leaves are labeled A and B; a single substitution, s(A,B) or s(B,A), on one branch explains the observed leaves.]

 34. Phylogenetic Trees (continued)
     On the other hand, the following tree necessitates 2 events; that is not the minimum, and therefore it is not the most parsimonious tree.
     [Diagram: a tree requiring two substitutions, s(A,B) and s(B,A), to explain the same leaves.]

 35. The trees are such that the leaves are labeled with the actual (contemporary) sequences and the internal nodes are labeled with ancestral (reconstructed) sequences. Therefore, contemporary sequences are never compared directly.
     [Diagram: internal nodes carry reconstructed ancestral sequences (e.g. ...SAQ..., ...SDQ...), while the leaves carry actual sequences (e.g. ...SDQ..., ...TDQ..., ...SAK...).]

 36. Estimation
     Pairs (i, j) are counted for adjacent nodes in all the trees, and divided by the number of trees if there is more than one "most parsimonious tree".
     The likelihood of a substitution i to j is assumed to be the same as the likelihood of a substitution j to i. Therefore, when counting the number of substitutions, cells A_{i,j} and A_{j,i} are both incremented.
     The result is a matrix, A, such that A_{ij} counts the number of observed substitutions from/to the amino acid type i to/from the amino acid type j.
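A minimal sketch of this symmetric counting step; the (parent, child) sequence pairs below are made up, loosely echoing the little tree sketched earlier, and the real input would be the edges of the most parsimonious trees.

```python
from collections import defaultdict

# Hypothetical aligned sequence pairs, one per edge of a reconstructed tree
# (parent sequence, child sequence).
edges = [("SDQ", "SAQ"), ("SDQ", "TDQ"), ("SAQ", "SAK")]

A = defaultdict(int)  # A[(i, j)] counts observed substitutions between types i and j

for parent, child in edges:
    for a, b in zip(parent, child):
        if a != b:
            # Direction is assumed irrelevant: increment both cells.
            A[(a, b)] += 1
            A[(b, a)] += 1

print(dict(A))
# {('D','A'): 1, ('A','D'): 1, ('S','T'): 1, ('T','S'): 1, ('Q','K'): 1, ('K','Q'): 1}
```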

 37. Our task is to estimate the transition probabilities of the Markov chain matrix; the following quantity moves us one step closer:
         a_ij = A_ij / ∑_k A_ik
     [Diagram: one row of the count matrix for amino acid A: A(A,A), A(A,C), A(A,D), ..., A(A,Y).]

 38. For reasons that will be explained in a moment, the a_ij are scaled by a factor c. For i ≠ j, let
         p_ij = c · a_ij
     and
         p_ii = 1 − c · ∑_{k≠i} a_ik
     i.e.
         p_ii = 1 − ∑_{k≠i} p_ik
     and ∑_j p_ij = 1 by definition.

 39. The expected proportion of the amino acids that will change after one unit of time is given by
         ∑_i p_i ∑_{j≠i} p_ij
     where the frequency of occurrence of each amino acid type, p_i, is estimated from the observed distribution found in the original data.

 40. The constant c is defined such that the expected proportion of amino acid changes, after one unit of time, is 1%, i.e.
         0.01 = ∑_i p_i ∑_{j≠i} p_ij
              = ∑_i p_i ∑_{j≠i} c · a_ij
              = c ∑_i p_i ∑_{j≠i} a_ij
     so that c = 0.01 / ( ∑_i p_i ∑_{j≠i} a_ij ).
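Putting the last few slides together, here is a sketch, assuming numpy, of turning a count matrix A into PAM1-style transition probabilities on a toy 3-letter alphabet with made-up counts and frequencies: normalize the counts, choose c so that the expected change per unit of time is 1%, fill in the diagonal, and obtain longer-distance matrices by matrix powers.

```python
import numpy as np

# Toy, made-up inputs: symmetric substitution counts A (zero diagonal) and
# background frequencies p for a 3-letter alphabet; the real construction uses
# the 20 amino acids and the counts collected from the 71 families.
A = np.array([[ 0.0, 30.0, 10.0],
              [30.0,  0.0, 20.0],
              [10.0, 20.0,  0.0]])
p = np.array([0.5, 0.3, 0.2])

a = A / A.sum(axis=1, keepdims=True)       # a_ij = A_ij / sum_k A_ik

# c is chosen so that the expected proportion of residues changing in one unit
# of time is 1%: 0.01 = c * sum_i p_i * sum_{j!=i} a_ij.  A has a zero diagonal
# here, so summing over all j equals summing over j != i.
c = 0.01 / np.sum(p[:, None] * a)

P1 = c * a                                  # p_ij = c * a_ij for i != j
np.fill_diagonal(P1, 1.0 - P1.sum(axis=1))  # p_ii = 1 - sum_{k!=i} p_ik

assert np.allclose(P1.sum(axis=1), 1.0)     # rows are probability distributions
print(np.sum(p * (1.0 - np.diag(P1))))      # expected change per unit of time: 0.01

# Matrices for longer time periods follow by matrix powers, e.g. PAM250:
P250 = np.linalg.matrix_power(P1, 250)
```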
