The number of occurrences of a word (5.7) and motif (5.9) in a DNA sequence, allowing overlaps
Covariance (2.4) and indicators (2.9)

Prof. Tesler
Math 283, Fall 2016

Prof. Tesler # occurrences of a word Math 283 / Fall 2016

Covariance

Let $X$ and $Y$ be random variables, possibly dependent.

$$
\begin{aligned}
\operatorname{Var}(X+Y) &= E\big((X+Y-\mu_X-\mu_Y)^2\big) \\
&= E\Big(\big((X-\mu_X)+(Y-\mu_Y)\big)^2\Big) \\
&= E\big((X-\mu_X)^2\big) + E\big((Y-\mu_Y)^2\big) + 2\,E\big((X-\mu_X)(Y-\mu_Y)\big) \\
&= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)
\end{aligned}
$$

where the covariance of $X$ and $Y$ is defined as

$$\operatorname{Cov}(X,Y) = E\big((X-\mu_X)(Y-\mu_Y)\big).$$

Expanding gives an alternate formula, $\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y)$:

$$
\begin{aligned}
\operatorname{Cov}(X,Y) &= E\big((X-\mu_X)(Y-\mu_Y)\big) \\
&= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X\mu_Y \\
&= E(XY) - E(X)E(Y)
\end{aligned}
$$

Covariance properties

- $\operatorname{Cov}(X,X) = \operatorname{Var}(X)$
- $\operatorname{Cov}(X,Y) = \operatorname{Cov}(Y,X)$
- If $X, Y$ are independent then $\operatorname{Cov}(X,Y) = 0$ and $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$. Beware, this is not reversible: $\operatorname{Cov}(X,Y)$ could be 0 for dependent variables.
- $\operatorname{Cov}(aX+b,\, cY+d) = ac\operatorname{Cov}(X,Y)$
- $\operatorname{Var}(X_1 + X_2 + \cdots + X_n) = \operatorname{Var}(X_1) + \cdots + \operatorname{Var}(X_n) + 2\sum_{1 \le i < j \le n} \operatorname{Cov}(X_i, X_j)$

Sign of covariance

- When $\operatorname{Cov}(X,Y)$ is positive: there is a tendency to have $X > \mu_X$ when $Y > \mu_Y$ and vice versa, and $X < \mu_X$ when $Y < \mu_Y$ and vice versa.
- When $\operatorname{Cov}(X,Y)$ is negative: there is a tendency to have $X > \mu_X$ when $Y < \mu_Y$ and vice versa, and $X < \mu_X$ when $Y > \mu_Y$ and vice versa.
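
The warning that zero covariance does not imply independence can be checked on a tiny example. The sketch below is only an illustration: the joint distribution (X uniform on {-1, 0, 1} with Y = X^2) and the helper names `expect`, `cov`, and `var` are assumptions chosen for this example, not part of the course material. It also checks the identity Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) with exact fractions.

```python
from fractions import Fraction

def expect(pmf, f):
    """E[f(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())

def cov(pmf):
    """Cov(X, Y) = E(XY) - E(X) E(Y)."""
    ex = expect(pmf, lambda x, y: x)
    ey = expect(pmf, lambda x, y: y)
    return expect(pmf, lambda x, y: x * y) - ex * ey

def var(pmf, f):
    """Variance of the random variable f(X, Y)."""
    m = expect(pmf, f)
    return expect(pmf, lambda x, y: (f(x, y) - m) ** 2)

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X^2,
# so Y is completely determined by X (strongly dependent).
third = Fraction(1, 3)
pmf = {(-1, 1): third, (0, 0): third, (1, 1): third}

c = cov(pmf)
lhs = var(pmf, lambda x, y: x + y)
rhs = var(pmf, lambda x, y: x) + var(pmf, lambda x, y: y) + 2 * c

# Dependence check: P(X=0, Y=0) differs from P(X=0) * P(Y=0).
p_joint = pmf[(0, 0)]
p_x0 = sum(p for (x, y), p in pmf.items() if x == 0)
p_y0 = sum(p for (x, y), p in pmf.items() if y == 0)

print("Cov(X, Y) =", c)
print("Var(X + Y) =", lhs, "and Var(X) + Var(Y) + 2 Cov(X, Y) =", rhs)
print("independent?", p_joint == p_x0 * p_y0)
```

Here the covariance is exactly 0 even though Y is a function of X, which is the non-reversibility the slide warns about.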

Occurrences of a word in a sequence: notation

Consider a (long) single-stranded nucleotide sequence $\tau = \tau_1 \ldots \tau_N$ and a (short) word $w = w_1 \ldots w_k$:

τ = τ_1 … τ_19 = CTATAGATAGATAGACAGT
w = w_1 … w_9 = ATAGATAGA

Say $w$ occurs in $\tau$ at position $j$ when $w$ is in $\tau$ ending at position $j$:

j      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
τ_j    C  T  A  T  A  G  A  T  A  G  A  T  A  G  A  C  A  G  T

so $w$ occurs in $\tau$ at 11 and 15 (the copies of $w$ occupy positions 3 to 11 and 7 to 15).

Let

$$
I_j = \begin{cases} 1 & \text{if } w \text{ occurs in } \tau \text{ at } j; \\ 0 & \text{otherwise.} \end{cases}
$$

In this example, $I_{11} = I_{15} = 1$ and every other $I_j = 0$.

$I_j$ is an indicator variable (1 when a condition is true, 0 when false).

$Y = I_k + I_{k+1} + \cdots + I_N$ is the number of times $w$ occurs in $\tau$. Here, $Y = 2$.
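
The indicator notation translates directly into a few lines of code. The following is a minimal sketch, assuming 1-based end positions as in the slide; the function name `occurrence_ends` is made up for this illustration.

```python
def occurrence_ends(tau, w):
    """Return the 1-based positions j where w occurs in tau ending at j.

    Overlapping occurrences are all reported, since every j is tested
    independently of the others.
    """
    k = len(w)
    # tau[j - k:j] is the length-k window occupying positions j-k+1 .. j.
    return [j for j in range(k, len(tau) + 1) if tau[j - k:j] == w]

tau = "CTATAGATAGATAGACAGT"
w = "ATAGATAGA"

ends = occurrence_ends(tau, w)

# Indicator variables I_k, ..., I_N and their sum Y.
indicators = {j: int(j in ends) for j in range(len(w), len(tau) + 1)}
Y = sum(indicators.values())

print(ends, Y)  # [11, 15] 2
```

Note that `str.count` would not work here, because it counts non-overlapping matches only.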

Computing mean number of occurrences µ = E(Y)

Suppose $\tau$ is generated by $N$ independent rolls of a 4-sided die, whose sides have probabilities $p_A, p_C, p_G, p_T$ adding up to 1.

The probability of a word being generated by rolling such a die is the product of the probabilities of its nucleotides:

$$\pi(w) = p_{w_1} \cdots p_{w_k}, \qquad \pi(\mathtt{ATAGATAGA}) = p_A^5\, p_T^2\, p_G^2 .$$

The probability of $w$ occurring at $j = k, k+1, \ldots, N$ is $\pi(w)$.

The $I_j$'s are indicator variables, so

$$E(I_j) = 0 \cdot P(I_j = 0) + 1 \cdot P(I_j = 1) = P(I_j = 1) = \pi(w) \quad \text{for } j = k, k+1, \ldots, N.$$

$Y = I_k + I_{k+1} + \cdots + I_N$, so the mean number of occurrences is

$$\mu = E(Y) = E(I_k) + \cdots + E(I_N) = (N - k + 1)\,\pi(w).$$
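
As a small illustration of the two formulas above, the sketch below computes π(w) and µ with exact fractions. The uniform probabilities (1/4 each) and the second, non-uniform set `q` are assumptions picked for the example; `word_prob` and `mean_occurrences` are hypothetical helper names.

```python
from fractions import Fraction
from math import prod

def word_prob(w, p):
    """pi(w): product of the per-letter probabilities of w."""
    return prod(p[c] for c in w)

def mean_occurrences(w, p, N):
    """E(Y) = (N - k + 1) * pi(w) for a random sequence of length N."""
    return (N - len(w) + 1) * word_prob(w, p)

w = "ATAGATAGA"

# Assumption for illustration: all four nucleotides equally likely.
p = {c: Fraction(1, 4) for c in "ACGT"}
pi_uniform = word_prob(w, p)
mu_uniform = mean_occurrences(w, p, 19)

# Assumption for illustration: an arbitrary non-uniform die.
q = {"A": Fraction(2, 5), "C": Fraction(1, 10), "G": Fraction(1, 5), "T": Fraction(3, 10)}
pi_q = word_prob(w, q)

print(pi_uniform, mu_uniform)
print(pi_q == q["A"] ** 5 * q["T"] ** 2 * q["G"] ** 2)
```

The last line confirms that the letter-by-letter product agrees with the grouped form $p_A^5 p_T^2 p_G^2$.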

Dependencies between positions

Occurrences at different positions have dependencies, because of how shifts of $w$ may overlap with each other.

$w$ = ATAGATAGA cannot occur at both 14 and 15, because the two copies disagree where they overlap (for example, position 7 would have to be both T and A):

j              1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
w ending at 14 .  .  .  .  .  A  T  A  G  A  T  A  G  A  .  .  .  .  .
w ending at 15 .  .  .  .  .  .  A  T  A  G  A  T  A  G  A  .  .  .  .

But $w$ can occur at both 11 and 15:

j              1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
τ_j            C  T  A  T  A  G  A  T  A  G  A  T  A  G  A  C  A  G  T

This is equivalent to

$$w_1 \ldots w_k\, w_{r+1} \ldots w_k = w_1 \ldots w_9\, w_6 \ldots w_9 = \mathtt{ATAG\ ATAGA\ TAGA}$$

occurring at 15, where $k = 9$ is the word length and $r = 5$ is the overlap length (the middle block ATAGA is shared by both copies).

Chapter 5.8 considers counting occurrences without overlaps.
Chapters 4 and 11 do the more general problem of Markov chains.

Self-overlaps of a word

Define

$$
\varepsilon_r = \begin{cases} 1 & \text{if the first } r \text{ letters of } w \text{ equal the last } r \text{ letters of } w \text{ in the exact same order (string equality);} \\ 0 & \text{otherwise.} \end{cases}
$$

This lets us account for dependencies between $I_j$ and $I_{j+k-r}$. Shifting by $k - r$ positions corresponds to an overlap of size $r$.

For $w$ = ATAGATAGA:

r  first r letters  last r letters  ε_r
9  ATAGATAGA        ATAGATAGA       1
8  ATAGATAG         TAGATAGA        0
7  ATAGATA          AGATAGA         0
6  ATAGAT           GATAGA          0
5  ATAGA            ATAGA           1
4  ATAG             TAGA            0
3  ATA              AGA             0
2  AT               GA              0
1  A                A               1
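
The definition of ε_r is a plain prefix/suffix comparison, so it can be tabulated mechanically. A minimal sketch follows; the function name `self_overlaps` is an assumption made for this example.

```python
def self_overlaps(w):
    """Return {r: epsilon_r} for r = 1..k, where epsilon_r is 1 exactly
    when the first r letters of w equal the last r letters of w."""
    k = len(w)
    return {r: int(w[:r] == w[k - r:]) for r in range(1, k + 1)}

eps = self_overlaps("ATAGATAGA")
overlap_sizes = [r for r, e in eps.items() if e == 1]

print(overlap_sizes)  # [1, 5, 9]
```

The case r = k is always 1 (the word overlaps itself completely); only r = 1, ..., k - 1 enter the covariance terms.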

Computing σ² = Var(Y)

Since the $I_j$'s have dependencies, the variance of their sum $Y = I_k + \cdots + I_N$ is NOT necessarily the sum of their variances. We must consider covariance terms as well:

$$\operatorname{Var}(Y) = \sum_{j=k}^{N} \operatorname{Var}(I_j) + 2 \sum_{j,\ell:\ k \le j < \ell \le N} \operatorname{Cov}(I_j, I_\ell)$$

First sum: Note that $I_j^2 = I_j$ since $I_j = 0$ or $1$, so

$$\operatorname{Var}(I_j) = E(I_j^2) - (E(I_j))^2 = \pi(w) - \pi(w)^2$$

and the first sum in $\operatorname{Var}(Y)$ is

$$\sum_{j=k}^{N} \operatorname{Var}(I_j) = (N - k + 1)\big(\pi(w) - \pi(w)^2\big).$$

Second sum: next few slides.

Covariances $2 \sum_{j,\ell:\ k \le j < \ell \le N} \operatorname{Cov}(I_j, I_\ell)$

The covariance sum is complicated:

- If $\ell - j \ge k$, the words ending at $j$ and $\ell$ share no positions, so $I_j, I_\ell$ are independent and $\operatorname{Cov}(I_j, I_\ell) = 0$.
- If $0 < \ell - j < k$, the words ending at $\ell$ and $j$ overlap by $r = k - (\ell - j)$ letters. Rewrite $\ell$ as $\ell = j + k - r$:

$$\operatorname{Cov}(I_j, I_\ell) = \operatorname{Cov}(I_j, I_{j+k-r}) = E(I_j I_{j+k-r}) - E(I_j)\,E(I_{j+k-r})$$

$I_j I_{j+k-r} = 1$ iff $w_1 \ldots w_k\, w_{r+1} \ldots w_k$ occurs at position $j + k - r$ in $\tau$, which is possible only when $\varepsilon_r = 1$.
E.g., $w_1 \ldots w_k\, w_{r+1} \ldots w_k = w_1 \ldots w_9\, w_6 \ldots w_9 = \mathtt{ATAG\ ATAGA\ TAGA}$.

$$E(I_j I_{j+k-r}) = \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k).$$

$$
\begin{aligned}
\operatorname{Cov}(I_j, I_{j+k-r}) &= E(I_j I_{j+k-r}) - E(I_j)\,E(I_{j+k-r}) \\
&= \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2 .
\end{aligned}
$$

Note that this depends on $r$ but not $j$.
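
Covariances $2 \sum_{j,\ell:\ k \le j < \ell \le N} \operatorname{Cov}(I_j, I_\ell)$ (continued)

The covariance sum becomes

$$
\begin{aligned}
\sum_{j,\ell:\ k \le j < \ell \le N} \operatorname{Cov}(I_j, I_\ell)
&= \sum_{r=1}^{k-1} \sum_{j=k}^{N-k+r} \Big( \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2 \Big) \\
&= \sum_{r=1}^{k-1} (N - 2k + r + 1) \Big( \varepsilon_r \cdot \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) - (\pi(w))^2 \Big) \\
&= \left( \sum_{r=1}^{k-1} \varepsilon_r \cdot (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) \right)
 - \frac{\big((N - 2k + 2) + (N - k)\big)(k - 1)}{2}\, (\pi(w))^2
\end{aligned}
$$

The second line uses the fact that the summand does not depend on $j$, and $j = k, \ldots, N - k + r$ takes $N - 2k + r + 1$ values. The last line sums the arithmetic series $\sum_{r=1}^{k-1} (N - 2k + r + 1)$, whose first term is $N - 2k + 2$ and last term is $N - k$.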

Mean and variance of number of occurrences

Combining all the parts together and simplifying gives

Mean number of occurrences

$$E(Y) = (N - k + 1)\,E(I_k) = (N - k + 1)\,\pi(w)$$

Variance of number of occurrences

$$
\begin{aligned}
\operatorname{Var}(Y) = {} & (N - k + 1)\,\pi(w) - \big((2k - 1)N - 3k^2 + 4k - 1\big)\,(\pi(w))^2 \\
& + 2 \sum_{r=1}^{k-1} \varepsilon_r \cdot (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k)
\end{aligned}
$$
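
One way to gain confidence in the closed-form mean and variance is to compare them with a brute-force enumeration over every sequence of a small length. The sketch below does this with exact fractions. The test word "ATA", the length N = 6, and the probabilities are arbitrary assumptions for illustration, and the helper names are hypothetical. The formula is only applied for N ≥ 2k − 2 here, so that each count N − 2k + r + 1 is nonnegative.

```python
from fractions import Fraction
from itertools import product
from math import prod

def word_prob(w, p):
    """pi(w): product of the per-letter probabilities of w."""
    return prod(p[c] for c in w)

def mean_var_formula(w, p, N):
    """Closed-form E(Y) and Var(Y) for the overlapping occurrence count."""
    k = len(w)
    pi = word_prob(w, p)
    mean = (N - k + 1) * pi
    # Sum over overlap sizes r with epsilon_r = 1; w + w[r:] is the
    # merged word w_1..w_k w_{r+1}..w_k.
    overlap = sum(
        (N - 2 * k + r + 1) * word_prob(w + w[r:], p)
        for r in range(1, k)
        if w[:r] == w[k - r:]
    )
    var = mean - ((2 * k - 1) * N - 3 * k * k + 4 * k - 1) * pi ** 2 + 2 * overlap
    return mean, var

def mean_var_bruteforce(w, p, N):
    """Exact E(Y) and Var(Y) by enumerating all sequences of length N."""
    k = len(w)
    m1 = m2 = 0
    for letters in product(p, repeat=N):
        tau = "".join(letters)
        prob = word_prob(tau, p)
        y = sum(tau[j - k:j] == w for j in range(k, N + 1))
        m1 += prob * y
        m2 += prob * y * y
    return m1, m2 - m1 ** 2

# Assumed probabilities for illustration only.
p = {"A": Fraction(2, 5), "C": Fraction(1, 10), "G": Fraction(1, 5), "T": Fraction(3, 10)}

formula = mean_var_formula("ATA", p, 6)
brute = mean_var_bruteforce("ATA", p, 6)
print(formula == brute, formula)
```

The enumeration visits 4^N sequences, so it is only practical for very small N; it serves as a check on the formula, not as a way to compute the variance for real genomes.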

Computation for $w = w_1 \ldots w_9 = \mathtt{ATAGATAGA}$ ($k = 9$) over all $\tau$ of length $N$

$\pi(w) = p_A^5\, p_T^2\, p_G^2$ and $w$ self-overlaps at $r = 1, 5$.

$$E(Y) = (N - k + 1)\,\pi(w) = (N - 8)\,\pi(w) = (N - 8)\, p_A^5\, p_T^2\, p_G^2$$

$$
\begin{aligned}
\operatorname{Var}(Y) = {} & (N - k + 1)\,\pi(w) - \big((2k - 1)N - 3k^2 + 4k - 1\big)\,(\pi(w))^2 \\
& + 2 \sum_{r=1}^{k-1} \varepsilon_r \cdot (N - 2k + r + 1)\, \pi(w_1 \ldots w_k\, w_{r+1} \ldots w_k) \\
= {} & (N - 8)\,\pi(w) - (17N - 208)\,(\pi(w))^2 \\
& + 2(N - 16)\,\pi(\mathtt{ATAGATAG\ A\ TAGATAGA}) + 2(N - 12)\,\pi(\mathtt{ATAG\ ATAGA\ TAGA}) \\
= {} & (N - 8)\, p_A^5\, p_T^2\, p_G^2 - (17N - 208)\, p_A^{10}\, p_T^4\, p_G^4 \\
& + 2(N - 16)\, p_A^9\, p_G^4\, p_T^4 + 2(N - 12)\, p_A^7\, p_G^3\, p_T^3
\end{aligned}
$$

Here $N - 16 = N - 2k + 2$ (the $r = 1$ term) and $N - 12 = N - 2k + 6$ (the $r = 5$ term).

Frequencies of words and motifs in SARS

The genome of SARS described previously has $N = 29751$ bases:

Nucleotide  Frequency  Proportion
A           8481       p_A ≈ 0.2851
C           5940       p_C ≈ 0.1997
G           6187       p_G ≈ 0.2080
T           9143       p_T ≈ 0.3073
Total       N = 29751  1

These were used below to compute "Estimated" µ and σ.
"Observed frequency" y was determined from the DNA sequence.

Word     Estimated µ  Estimated σ  Observed y = Freq.  z = (y − µ)/σ  Φ(z)
GAGA     104.5456     10.6943      106                 0.1360         0.5541
GCGA     73.2226      8.4830       37                  −4.2700        10^−5
TGCG     78.9381      8.8018       59                  −2.2652        0.0118
motif M  256.7064     17.6583      202                 −3.0980        10^−3

(M consists of all three words; details on computing µ, σ are later.)
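
The single-word rows of the table can be recomputed from the nucleotide counts with the mean and variance formulas for overlapping occurrences. The sketch below is an illustration under two assumptions: that the proportions are the unrounded ratios count/N, and that Φ is the standard normal CDF (computed here via `math.erf`). The helper names are hypothetical, and the motif M row is not attempted since its computation is covered later.

```python
from math import erf, prod, sqrt

counts = {"A": 8481, "C": 5940, "G": 6187, "T": 9143}
N = sum(counts.values())
p = {c: n / N for c, n in counts.items()}  # unrounded proportions (assumption)

def word_prob(w, p):
    """pi(w): product of the per-letter probabilities of w."""
    return prod(p[c] for c in w)

def mean_sd(w, p, N):
    """Estimated mean and standard deviation of the occurrence count of w."""
    k = len(w)
    pi = word_prob(w, p)
    mean = (N - k + 1) * pi
    overlap = sum(
        (N - 2 * k + r + 1) * word_prob(w + w[r:], p)
        for r in range(1, k)
        if w[:r] == w[k - r:]  # epsilon_r = 1
    )
    var = mean - ((2 * k - 1) * N - 3 * k ** 2 + 4 * k - 1) * pi ** 2 + 2 * overlap
    return mean, sqrt(var)

def normal_cdf(z):
    """Standard normal CDF Phi(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

observed = {"GAGA": 106, "GCGA": 37, "TGCG": 59}  # y values from the table
rows = {}
for w, y in observed.items():
    mu, sigma = mean_sd(w, p, N)
    z = (y - mu) / sigma
    rows[w] = (mu, sigma, z, normal_cdf(z))
    print(f"{w}  mu={mu:.4f}  sigma={sigma:.4f}  y={y}  z={z:.4f}  Phi(z)={rows[w][3]:.4g}")
```

Under these assumptions the recomputed values agree with the tabulated µ, σ, and z to the displayed precision. GAGA is the only one of the three words with a nontrivial self-overlap (r = 2), which is why its σ is noticeably larger than the square root of its mean.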