Markov chains and the number of occurrences of a word in a sequence




  1. Markov chains and the number of occurrences of a word in a sequence (4.5–4.9, 11.1, 11.2, 11.4, 11.6). Prof. Tesler, Math 283, Fall 2018.

  2. Locating overlapping occurrences of a word. Consider a (long) single-stranded nucleotide sequence τ = τ_1 ... τ_N and a (short) word w = w_1 ... w_k, e.g., w = GAGA.

     for i = 1 to N-3 {
         if (τ_i τ_{i+1} τ_{i+2} τ_{i+3} == GAGA) { ... }
     }

     This scan takes up to ≈ 4N comparisons to locate all occurrences of GAGA (kN comparisons for a word w of length k). A finite state automaton (FSA) is a "machine" that can locate all occurrences while examining each letter of τ only once.
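The brute-force scan above can be sketched in Python (a hedged illustration; the helper name count_naive is ours, not from the slides):

```python
def count_naive(tau: str, w: str) -> int:
    """Count (possibly overlapping) occurrences of w in tau by brute force.

    Each of the ~N start positions may need up to k = len(w) character
    comparisons, hence roughly k*N comparisons in the worst case.
    """
    k = len(w)
    return sum(1 for i in range(len(tau) - k + 1) if tau[i:i + k] == w)

# The sequence from slide 4 contains GAGA once (starting at position 9, 1-based)
print(count_naive("CAGAGGTCGAGAGT", "GAGA"))  # -> 1
```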

  3. Overlapping occurrences of GAGA. [State diagram of automaton M_1: nodes ∅ (drawn as 0), G, GA, GAG, GAGA, with transition edges labeled by nucleotides.] The states are the nodes ∅, G, GA, GAG, GAGA (the prefixes of w). For w = w_1 w_2 ··· w_k there are k + 1 states, one for each prefix. Start in the state ∅. Scan τ = τ_1 τ_2 ... τ_N one character at a time, left to right. Transition edges: when examining τ_j, move from the current state to the next state along the edge labeled τ_j. To build the edges: for each node u = w_1 ··· w_r and each letter x ∈ {A, C, G, T}, determine the longest suffix s (possibly ∅) of w_1 ··· w_r x that is among the states, and draw an edge from u to s labeled x. The number of times the scan is in state GAGA is the desired count of occurrences.
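A minimal sketch of this construction (function names are ours; alphabet {A, C, G, T} assumed): for each state u and letter x, the next state is the longest suffix of u + x that is itself a prefix of w.

```python
def build_fsa(w, alphabet="ACGT"):
    """Transition table for the prefix automaton of w.

    States are the prefixes of w (including the empty prefix ''). From state
    u, reading x moves to the longest suffix of u + x that is also a state.
    """
    states = [w[:r] for r in range(len(w) + 1)]  # '', w1, w1w2, ..., w
    delta = {}
    for u in states:
        for x in alphabet:
            s = u + x
            while s not in states:
                s = s[1:]  # drop leading letters until we hit a state ('' always works)
            delta[(u, x)] = s
    return delta

def count_overlapping(tau, w):
    """Scan tau once, counting visits to the accepting state w."""
    delta = build_fsa(w)
    state, hits = "", 0
    for x in tau:
        state = delta[(state, x)]
        if state == w:
            hits += 1
    return hits
```

Each letter of τ is examined exactly once; all the suffix reasoning is done ahead of time when the table is built.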

  4. Overlapping occurrences of GAGA in τ = CAGAGGTCGAGAGT... Using M_1, the state at time t is the state after reading τ_1 ... τ_{t−1}:

       t   state   τ_t        t   state   τ_t
       1   ∅       C          9   ∅       G
       2   ∅       A         10   G       A
       3   ∅       G         11   GA      G
       4   G       A         12   GAG     A
       5   GA      G         13   GAGA    G
       6   GAG     G         14   GAG     T
       7   G       T         15   ∅       ...
       8   ∅       C

     The occurrence of GAGA is registered at t = 13, when the scan is in state GAGA.

  5. Non-overlapping occurrences of GAGA. [State diagrams of M_1 (overlaps allowed) and M_2 (no overlaps); both have nodes ∅, G, GA, GAG, GAGA.] For non-overlapping occurrences of w: replace the outgoing edges from w by copies of the outgoing edges from ∅. On the previous slide, the time 13 → 14 transition GAGA → GAG (on G) changes to GAGA → G (on G).
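The M_2 modification can be sketched by restarting from the empty state after each full match (a self-contained sketch; the function name is ours):

```python
def count_nonoverlapping(tau, w):
    """Count non-overlapping occurrences of w in tau (automaton M_2).

    Same longest-suffix transitions as M_1, except that the outgoing
    edges of the state w are copies of the outgoing edges of ''.
    """
    states = [w[:r] for r in range(len(w) + 1)]
    def step(u, x):
        s = u + x
        while s not in states:
            s = s[1:]
        return s
    state, hits = "", 0
    for x in tau:
        # M_2's rule: after a full match, continue as if from the empty state
        state = step("" if state == w else state, x)
        if state == w:
            hits += 1
    return hits
```

For example, GAGAGA contains two overlapping but only one non-overlapping occurrence of GAGA.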

  6. Motif {GAGA, GTGA}, overlaps permitted. [State diagram with nodes ∅, G, GA, GAG, GAGA, GT, GTG, GTGA.] States: all prefixes of all words in the motif. If a prefix occurs multiple times, only create one node for it. Transition edges: they may jump from one word of the motif to another, e.g., GTGA → GAG on G. Count the number of times we reach the state for any word in the motif (GAGA or GTGA).
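The multi-word automaton can be sketched the same way (names ours; this simple visit count suffices for motifs like {GAGA, GTGA}, where no motif word is a proper suffix of another state):

```python
def count_motif(tau, words):
    """Count total (overlapping) occurrences of any word in the motif.

    States are all prefixes of all words; from state u on letter x, move
    to the longest suffix of u + x that is a state, which may jump from
    one word of the motif to another (e.g., GTGA -G-> GAG).
    """
    states = {w[:r] for w in words for r in range(len(w) + 1)}
    accepting = set(words)
    state, hits = "", 0
    for x in tau:
        s = state + x
        while s not in states:
            s = s[1:]  # '' is always a state, so this terminates
        state = s
        if state in accepting:
            hits += 1
    return hits
```

In GTGAGA, the scan reaches GTGA and then jumps via GAG to GAGA, counting both words.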

  7. Markov chains. A Markov chain is similar to a finite state machine, but incorporates probabilities. Let S be a set of "states." We will take S to be a discrete finite set, such as S = {1, 2, ..., s}. Let t = 1, 2, ... denote the "time." Let X_1, X_2, ... denote a sequence of random variables with values in S. The X_t's form a (first order) Markov chain if they obey these rules:
     (1) The probability of being in a certain state at time t + 1 depends only on the state at time t, not on any earlier states:
         P(X_{t+1} = x_{t+1} | X_1 = x_1, ..., X_t = x_t) = P(X_{t+1} = x_{t+1} | X_t = x_t)
     (2) The probability of transitioning from state i at time t to state j at time t + 1 depends only on i and j, but not on the time t:
         P(X_{t+1} = j | X_t = i) = p_ij at all times t,
     for some values p_ij, which form an s × s transition matrix.
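Both rules can be seen in a small simulation sketch (ours, not from the slides): the next state is drawn using only the current state's row of the transition matrix, and the same row is used at every time step.

```python
import random

def simulate_chain(P, start, steps, rng):
    """Sample a trajectory X_1, ..., X_steps of a first-order Markov chain.

    P is a list of rows; from state i, the next state j is chosen with
    probability P[i][j] (no dependence on earlier states or on the time).
    """
    path = [start]
    for _ in range(steps - 1):
        i = path[-1]
        path.append(rng.choices(range(len(P)), weights=P[i])[0])
    return path

rng = random.Random(1)  # fixed seed so the sketch is reproducible
path = simulate_chain([[0.9, 0.1], [0.5, 0.5]], 0, 20, rng)
```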

  8. Transition matrix. The transition matrix P_1 of the Markov chain M_1, with rows and columns indexed by the states 1: ∅, 2: G, 3: GA, 4: GAG, 5: GAGA, and (i,j) entry P_ij = P(X_{t+1} = j | X_t = i), is

                   to ∅          to G   to GA  to GAG  to GAGA
       1: ∅     [ p_A+p_C+p_T    p_G    0      0       0    ]
       2: G     [ p_C+p_T        p_G    p_A    0       0    ]
       3: GA    [ p_A+p_C+p_T    0      0      p_G     0    ]
       4: GAG   [ p_C+p_T        p_G    0      0       p_A  ]
       5: GAGA  [ p_A+p_C+p_T    0      0      p_G     0    ]

     Notice that the entries in each row sum up to p_A + p_C + p_G + p_T = 1. A matrix with all entries ≥ 0 and all row sums equal to 1 is called a stochastic matrix. The transition matrix of a Markov chain is always stochastic. "All row sums = 1" can be written P·1⃗ = 1⃗, where 1⃗ = (1, 1, ..., 1)^T, so 1⃗ is a right eigenvector of P with eigenvalue 1.
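The stochastic property and the right-eigenvector identity are easy to check numerically (a sketch on a small illustrative matrix of our own, not from the slides):

```python
def is_stochastic(P, tol=1e-12):
    """True iff all entries are >= 0 and every row sums to 1."""
    return all(all(x >= 0 for x in row) and abs(sum(row) - 1.0) <= tol
               for row in P)

def apply_to_ones(P):
    """Compute P @ (1, ..., 1)^T, which is just the vector of row sums.

    For a stochastic P this returns all ones: the all-ones vector is a
    right eigenvector with eigenvalue 1.
    """
    return [sum(row) for row in P]

Q = [[0.5, 0.5, 0.0],
     [0.25, 0.25, 0.5],
     [1.0, 0.0, 0.0]]
```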

  9. Transition matrices for GAGA. [Diagrams of M_1 and M_2 with edge labels replaced by probabilities, e.g., p_C + p_T.] The matrices are shown for the case that all nucleotides have equal probabilities 1/4:

             [ 3/4  1/4  0    0    0   ]          [ 3/4  1/4  0    0    0   ]
             [ 1/2  1/4  1/4  0    0   ]          [ 1/2  1/4  1/4  0    0   ]
     P_1 =   [ 3/4  0    0    1/4  0   ]   P_2 =  [ 3/4  0    0    1/4  0   ]
             [ 1/2  1/4  0    0    1/4 ]          [ 1/2  1/4  0    0    1/4 ]
             [ 3/4  0    0    1/4  0   ]          [ 3/4  1/4  0    0    0   ]

     P_2 (no overlaps) is obtained from P_1 (overlaps allowed) by replacing the last row with a copy of the first row.
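The two matrices can be transcribed directly (a sketch; states in the order ∅, G, GA, GAG, GAGA, all nucleotide probabilities 1/4):

```python
# P_1: overlaps allowed (from GAGA, reading G leads back to GAG)
P1 = [[0.75, 0.25, 0.0, 0.0, 0.0],
      [0.5, 0.25, 0.25, 0.0, 0.0],
      [0.75, 0.0, 0.0, 0.25, 0.0],
      [0.5, 0.25, 0.0, 0.0, 0.25],
      [0.75, 0.0, 0.0, 0.25, 0.0]]

# P_2: no overlaps, obtained by replacing the last row with a copy of the first
P2 = [row[:] for row in P1]
P2[4] = P1[0][:]
```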

  10. Other applications of automata. Automata / state machines are also used in other applications in Math and Computer Science. The transition weights may be defined differently, and the matrices usually aren't stochastic.
     Combinatorics: count walks through the automaton (instead of computing their probabilities) by setting each transition weight from u to s on x to 1.
     Computer Science (formal languages, classifiers, ...): does the string τ contain GAGA? Output 1 if it does, 0 otherwise. Modify M_1: remove the outgoing edges of GAGA. On reaching state GAGA, terminate with output 1. If the end of τ is reached, terminate with output 0. This is called a deterministic finite acceptor (DFA).
     Markov chains: instead of considering a specific string τ, we'll compute probabilities, expected values, ... over the sample space of all strings of length n.
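The acceptor variant described here can be sketched as follows (names ours):

```python
def accepts(tau: str, w: str) -> int:
    """DFA acceptor: return 1 if tau contains w, else 0.

    Same longest-suffix transitions as M_1, but the accepting state w has
    its outgoing edges removed: we terminate at the first visit to w.
    """
    states = [w[:r] for r in range(len(w) + 1)]
    state = ""
    for x in tau:
        s = state + x
        while s not in states:
            s = s[1:]
        state = s
        if state == w:
            return 1  # reached the accepting state; stop immediately
    return 0  # end of tau reached without a match
```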

  11. Other Markov chain examples. A Markov chain is kth order if the probability of X_t = i depends on the values of X_{t−1}, ..., X_{t−k}. It can be converted to a first order Markov chain by making new states that record more history.
     Positional independence: instead of a null hypothesis that a DNA sequence is generated by repeated rolls of a biased four-sided die, we could use a Markov chain. The simplest is a one-step transition matrix

             [ p_AA  p_AC  p_AG  p_AT ]
     P =     [ p_CA  p_CC  p_CG  p_CT ]
             [ p_GA  p_GC  p_GG  p_GT ]
             [ p_TA  p_TC  p_TG  p_TT ]

     P could be the same at all positions. In a coding region, it could be different for the first, second, and third positions of codons.
     Nucleotide evolution: there are models of random point mutations over the course of evolution based on Markov chains of the same form P, in which X_t is the state (A, C, G, or T) of the nucleotide at a given position in a sequence at time (generation) t.
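The "record more history" conversion can be sketched for a 2nd-order chain (our own illustrative helper, not from the slides): the first-order states are pairs of consecutive values, and a pair (a, b) can only move to a pair (c, d) when the histories overlap, i.e., c = b.

```python
from itertools import product

def lift_second_order(states, p2):
    """Convert a 2nd-order chain to a 1st-order chain on pair-states.

    p2[(a, b, c)] = P(X_t = c | X_{t-2} = a, X_{t-1} = b).
    Returns the pair-states and a dict P with
    P[((a, b), (c, d))] = P(next pair = (c, d) | current pair = (a, b)).
    """
    pairs = list(product(states, repeat=2))
    P = {}
    for (a, b) in pairs:
        for (c, d) in pairs:
            # the histories must overlap: (a,b) -> (c,d) requires c == b
            P[((a, b), (c, d))] = p2[(a, b, d)] if c == b else 0.0
    return pairs, P

# A uniform 2nd-order chain on two symbols, for illustration
p2 = {(a, b, c): 0.5 for a in "AB" for b in "AB" for c in "AB"}
pairs, P = lift_second_order("AB", p2)
```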

  12. Questions about Markov chains.
     (1) What is the probability of being in a particular state after n steps?
     (2) What is the probability of being in a particular state as n → ∞?
     (3) What is the "reverse" Markov chain?
     (4) If you are in state i, what is the expected number of time steps until the next time you are in state j? What is the variance of this? What is the complete probability distribution?
     (5) Starting in state i, what is the expected number of visits to state j before reaching state k?

  13. Transition probabilities after two steps. [Diagram: all paths from state i at time t through each state r = 1, ..., 5 at time t + 1 to state j at time t + 2, with edge weights P_ir and P_rj.] To compute the probability of going from state i at time t to state j at time t + 2, consider all the states it could go through at time t + 1:

     P(X_{t+2} = j | X_t = i)
       = Σ_r P(X_{t+1} = r | X_t = i) · P(X_{t+2} = j | X_{t+1} = r, X_t = i)
       = Σ_r P(X_{t+1} = r | X_t = i) · P(X_{t+2} = j | X_{t+1} = r)
       = Σ_r P_ir P_rj = (P²)_ij
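The identity can be checked numerically with the P_1 of slide 9 (a sketch; matmul is our plain matrix product, its inner sum being exactly Σ_r A_ir B_rj):

```python
def matmul(A, B):
    """(AB)[i][j] = sum_r A[i][r] * B[r][j]."""
    return [[sum(A[i][r] * B[r][j] for r in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# P_1 for GAGA with uniform nucleotide probabilities (states ∅, G, GA, GAG, GAGA)
P1 = [[0.75, 0.25, 0.0, 0.0, 0.0],
      [0.5, 0.25, 0.25, 0.0, 0.0],
      [0.75, 0.0, 0.0, 0.25, 0.0],
      [0.5, 0.25, 0.0, 0.0, 0.25],
      [0.75, 0.0, 0.0, 0.25, 0.0]]
P1_sq = matmul(P1, P1)
# The only two-step route from ∅ to GA is ∅ -G-> G -A-> GA,
# so (P^2)[∅][GA] = (1/4)(1/4) = 1/16
```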

  14. Transition probabilities after n steps. For n ≥ 0, the transition matrix from time t to time t + n is P^n:

     P(X_{t+n} = j | X_t = i)
       = Σ_{r_1, ..., r_{n−1}} P(X_{t+1} = r_1 | X_t = i) · P(X_{t+2} = r_2 | X_{t+1} = r_1) ··· P(X_{t+n} = j | X_{t+n−1} = r_{n−1})
       = Σ_{r_1, ..., r_{n−1}} P_{i r_1} P_{r_1 r_2} ··· P_{r_{n−1} j} = (P^n)_ij

     (sum over possible states r_1, ..., r_{n−1} at times t + 1, ..., t + (n − 1))
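A sketch of computing P^n by repeated multiplication (the helper name is ours); since each factor is stochastic, every power is stochastic as well:

```python
def matpow(P, n):
    """Return P^n, where (P^n)[i][j] = P(X_{t+n} = j | X_t = i)."""
    size = len(P)
    R = [[float(i == j) for j in range(size)] for i in range(size)]  # P^0 = I
    for _ in range(n):
        R = [[sum(R[i][r] * P[r][j] for r in range(size))
              for j in range(size)] for i in range(size)]
    return R

# Small worked example: two-state chain with dyadic entries, so the
# products below are exact in floating point.
P = [[0.5, 0.5], [0.25, 0.75]]
# (P^2)[0][0] = 0.5*0.5 + 0.5*0.25 = 0.375, etc.
```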
