From Poisson Approximations to the Blueprint of Life
Ming-Ying Leung
Division of Mathematics and Statistics
University of Texas at San Antonio
San Antonio, TX 78249

Outline:
• DNA Sequence Analysis
• Scan Statistics
• Poisson Type Approximations
• Herpes Genomes
In 1940, Avery announced his finding that a macromolecule, DNA, inside the chromosome is responsible for transmitting hereditary material from parents to offspring.
In 1953, Watson and Crick confirmed the double helical structure of a DNA molecule.
The age of molecular genetics begins…

With the rapid accumulation of genetic sequence data, mathematical, statistical, and computational methods play an increasingly important role in molecular biology. This has led to the birth of a new interdisciplinary field of study called Bioinformatics.
Challenges in Bioinformatics
• Sequence Assembly
• Database Technology
• Gene Finding
• Structure Prediction
• Molecular Evolution
• Locating extragenic functional sites
The four different nucleotide bases of DNA:
• Adenine (A)
• Thymine (T)
• Cytosine (C)
• Guanine (G)
NETFETCH of: query May 5, 2000 12:57 from server: www.ncbi.nlm.nih.gov 1 Sequences Requested 1 Sequences Returned LOCUS BHV1CGEN 135301 bp DNA VRL 07-APR-2000 DEFINITION Bovine herpesvirus type 1.1 complete genome. ACCESSION AJ004801 VERSION AJ004801.1 GI:2653291 KEYWORDS complete genome. SOURCE Bovine herpesvirus type 1.1. ORGANISM Bovine herpesvirus type 1.1 Viruses; dsDNA viruses, no RNA stage; Herpesviridae; Alphaherpesvirinae; Varicellovirus. References … FEATURES … sequence ggcccagcccccgcgcggggggcgcggagaaaaaaaaaattttttccgcgcggcgcgtgc attgcggcgggcgggggcggggtgggggatgggcgcggagcgcgagggtagggttggcac actgccaagatcaccaagcatgtgcgcggccatcttgcttccaaactcattagcataccc cgcccattattccattctcatttgcatacccaccgttgcacatgccgccatattgctcct cctccctcgctcctcctccctcgctcctcctccctcgctcctcctccctcgctcctcctc cctcgctcctcctccctcgctcctcctccctcgctcctcctccctcgctcctcctccctc gctcctcctccctcgctcctcctccctcgctcctcctccctcgctcctcctccctcgctc ctcctccctcgctcctcttcaaaacactaccgcgggcgtccgctctcactagcttcggcg ccgtcatgggtgcccgcgcctccgcgcctgctgccggcccgcccccagcccacgctgttc tactagatgcgctctccgggggcacgattgacctgcctggcggcgacgaggccgtctttg tgtcctgcccgacgacgcgccccgtgtaccaccacatgcgccgcggccgcacggcccaca ctacacccgtgcacttcgttggccgcgcctatgccatcttgccctgccgcaagtttatgc tgtatctgatgcgcggtggtgccgtttacggctacgagcccaccactggcctgcaccgcc tcgccgattcactgcacgactttcttactactgccggactacagcagcgagacctacact gcctcgatgtcacggtgcttgacgcgcagatggacccggtgacgttcaccacccccgaga tcctcatcgagctcgaggcggacccggccttcccaccgccgccctcggcccgcgcgcgcc gctccacgctgcgccgggcgtctatgcgccggcccgcacgcaccttctgcccccaccagc tgctagcagagggctccattctggacctctgctcgccagagcaagcggcggcgccgggct gttcgctgctccccgcctgtgactctggagacgccgcgtgcccctgcgacgctggcgaga
Arrows indicate the direction of the elongation of new strands (shaded) from the 5' to the 3' end.
Palindrome: A stretch of DNA that reads the same in both the direct and the complementary strands (each read in the 5' to 3' direction). E.g.,

5' ….. GCAATATTGC ….. 3'
3' ….. CGTTATAACG ….. 5'

• Short palindromes occur frequently by chance. To screen out the random noise, focus only on palindromes of length ≥ 10.
• Significant clusters of palindromes are found around origins of replication and regulatory regions in viral genomes (Masse et al. 1992, Leung and Yamashita 1999).
• Modeling the occurrences of palindromes on the DNA sequence as points on the unit interval, the scan statistic can be used to detect the presence of nonrandom palindrome clusters.
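The palindrome definition above can be sketched in code. This is a minimal illustration, not the procedure used in the cited studies: the function names `reverse_complement` and `find_palindromes` are hypothetical, and reporting each maximal palindrome once (rather than every nested one) is an assumption about how palindromes are counted.

```python
# Watson-Crick base pairing used to build the complementary strand.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(seq):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.lower()))


def find_palindromes(seq, min_len=10):
    """Return (start, length) of each maximal palindrome with length >= min_len.

    start is a 0-based index into seq. A DNA palindrome always has even
    length, so every candidate is grown outward from a gap between two bases.
    """
    s = seq.lower()
    hits = []
    for center in range(1, len(s)):  # gap between s[center - 1] and s[center]
        left, right = center - 1, center
        # Extend while the outer bases still pair with each other.
        while left >= 0 and right < len(s) and COMPLEMENT.get(s[left]) == s[right]:
            left -= 1
            right += 1
        length = right - left - 1
        if length >= min_len:
            hits.append((left + 1, length))
    return hits


example = "gcaatattgc"  # the palindrome shown on the slide
print(reverse_complement(example) == example)
print(find_palindromes("tt" + example + "gg"))
```

Bases other than a, c, g, t (for example an ambiguity code such as n) never pair in `find_palindromes`, so they simply end a palindrome.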
Sliding Window Plot

Figure 1: A sliding window plot is generated by choosing a window of fixed length and sliding it along the genome, beginning at the first base and continuing until the window reaches the last base of the genome. The window moves forward in steps of a pre-specified size. At each position of the window, the number of palindromes contained in it is counted and plotted against the window position. This is the sliding window plot for the human cytomegalovirus genome with a 1000 base window and a step size of 500 bases. The peaks observed at window positions 92001 and 194501 suggest that there may be nonrandom palindrome clusters at these locations.
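The counting step behind a sliding window plot can be sketched as follows. The function name `sliding_window_counts` is hypothetical, window positions are taken to be 1-based start coordinates (which matches the reported positions 92001 and 194501 for a step of 500), and the six palindrome positions used here are only those listed for r = 4 in the slides' table of significant clusters, not the full HCMV palindrome list, which the slides do not give.

```python
def sliding_window_counts(positions, genome_length, window=1000, step=500):
    """Count the points falling in each window [start, start + window - 1].

    Window starts are 1-based: 1, 1 + step, 1 + 2 * step, ...
    The last window is the last one that still fits inside the genome.
    """
    counts = []
    start = 1
    while start + window - 1 <= genome_length:
        end = start + window - 1
        counts.append((start, sum(start <= p <= end for p in positions)))
        start += step
    return counts


# Illustrative subset of positions only (assumption, see the text above).
subset = [92526, 92570, 92643, 92701, 195032, 195112]
counts = dict(sliding_window_counts(subset, genome_length=229354))
print(counts[92001], counts[194501])
```

Plotting the counts against the window starts gives the sliding window plot; any plotting library can be used for that step.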
The Scan Statistic

Notation
• $U_1, U_2, \ldots, U_n \sim$ i.i.d. Uniform(0,1)
• $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$: their order statistics
• $S_i = U_{(i+1)} - U_{(i)}$: the $i$th spacing
• $N_w(i)$: number of points contained in $[U_{(i)}, U_{(i)} + w]$
• $A_r(i) = S_i + \cdots + S_{i+r-1}$: sum of $r$ adjoining spacings
The w- and r-Scan Statistics

w-Scan Statistic: $N_w = \max_i N_w(i)$
r-Scan Statistic: $A_r = \min_i A_r(i)$
Duality Relationship: $\{ N_w(i) \ge r + 1 \} = \{ A_r(i) \le w \}$

If $N_w$ is too big, or equivalently, $A_r$ is too small, an unusual cluster is present. The probability distribution of either $N_w$ or $A_r$ will help determine which clusters are unlikely to occur by chance.
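Both statistics and the duality relationship can be checked numerically. The sketch below uses hypothetical function names (`r_scans`, `w_counts`) and integer positions taken from the slides' table of significant clusters purely as sample data; integers are used so that the comparison $A_r(i) \le w$ is exact.

```python
from bisect import bisect_right


def r_scans(points, r):
    """A_r(i) = U_(i+r) - U_(i): sum of r adjoining spacings starting at point i."""
    u = sorted(points)
    return [u[i + r] - u[i] for i in range(len(u) - r)]


def w_counts(points, w):
    """N_w(i): number of points in the closed interval [U_(i), U_(i) + w]."""
    u = sorted(points)
    return [bisect_right(u, x + w) - i for i, x in enumerate(u)]


# Sample positions (illustrative data only).
pos = [91490, 91637, 91953, 92526, 92570, 92643, 92701, 195032, 195112]
r, w = 2, 200

a = r_scans(pos, r)
n_w = w_counts(pos, w)

print(min(a))    # r-scan statistic A_r
print(max(n_w))  # w-scan statistic N_w

# Duality: N_w(i) >= r + 1 exactly when A_r(i) <= w.
print(all((n_w[i] >= r + 1) == (a[i] <= w) for i in range(len(a))))
```

The duality holds because $A_r(i) \le w$ says that the $r$ points following $U_{(i)}$ all lie within distance $w$ of it, so the interval contains at least $r + 1$ points.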
Poisson Approximation

Dembo and Karlin (1992) derive the limiting distribution

\[ \lim_{n \to \infty} P\left( A_r > \frac{x}{n^{1+1/r}} \right) = e^{-x^r / r!} \]

This follows from a Poisson limiting distribution for the count $C_r$ of those $A_r(i)$'s not exceeding $x / n^{1+1/r}$. If the above limiting distribution is used as an approximation for large $n$, one can easily obtain a critical value

\[ c = \left( \frac{-r! \, \ln(1 - \alpha)}{n^{r+1}} \right)^{1/r} \]

for $A_r$ below which a significant cluster is considered present.
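The critical value follows from setting $e^{-x^r / r!} = 1 - \alpha$ and solving for $c = x / n^{1+1/r}$. A minimal sketch of that arithmetic is given below; the function names are hypothetical, and converting $c$ from the unit interval to bases by multiplying by the genome length is an assumption about how the scaling is done.

```python
import math


def r_scan_critical_value(n, r, alpha=0.05):
    """Critical value c on the unit interval from the limiting distribution.

    Solves exp(-x**r / r!) = 1 - alpha with c = x / n**(1 + 1/r).
    """
    return (-math.factorial(r) * math.log(1.0 - alpha) / n ** (r + 1)) ** (1.0 / r)


def limiting_prob_exceeds(n, r, c):
    """Limiting approximation to P(A_r > c)."""
    x = c * n ** (1.0 + 1.0 / r)
    return math.exp(-(x ** r) / math.factorial(r))


# HCMV figures from the slides: 296 palindromes on a 229,354 base genome.
n, genome_length = 296, 229354
for r in (1, 4, 8):
    c = r_scan_critical_value(n, r)
    print(r, c, c * genome_length)  # c in unit-interval and in bases (assumed scaling)
```

Substituting the critical value back into the limiting distribution should return $1 - \alpha$, which is a quick consistency check on the formula.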
Better approximate probabilities for $A_r$ can be derived from better approximate distributions of $C_r$:
• Finite Poisson approximation (Dembo & Karlin 1992).
• Local declumping approximation (Glaz 1994) based on a declumping idea put forth by Arratia et al. (1990).
• Compound Poisson approximation (Glaz 1994) based on a coupling method proposed by Roos (1993).

A recursive algorithm for computing scan statistic probabilities to any desired degree of accuracy was developed by Huffer and Lin (1998).
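One way to see why refinements of the limiting formula are of interest is to compare it with a simulation at finite $n$. The sketch below is a plain Monte Carlo estimate, not any of the approximation methods listed above; the function names, the number of trials, and the seed are all illustrative assumptions.

```python
import math
import random


def min_r_scan(points, r):
    """r-scan statistic A_r = min_i (U_(i+r) - U_(i))."""
    u = sorted(points)
    return min(u[i + r] - u[i] for i in range(len(u) - r))


def simulate_cluster_prob(n, r, c, trials=2000, seed=0):
    """Monte Carlo estimate of P(A_r <= c) for n i.i.d. Uniform(0,1) points."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        points = [rng.random() for _ in range(n)]
        if min_r_scan(points, r) <= c:
            hits += 1
    return hits / trials


def limiting_cluster_prob(n, r, c):
    """Limiting approximation to P(A_r <= c)."""
    x = c * n ** (1.0 + 1.0 / r)
    return 1.0 - math.exp(-(x ** r) / math.factorial(r))


n, r, alpha = 296, 4, 0.05
c = (-math.factorial(r) * math.log(1.0 - alpha) / n ** (r + 1)) ** (1.0 / r)
print(limiting_cluster_prob(n, r, c))  # equals alpha by construction
print(simulate_cluster_prob(n, r, c))  # simulated value at finite n
```

A gap between the two printed values indicates how far the limiting distribution is from the finite-$n$ probability for that choice of $n$ and $r$.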
Herpesvirus Genomes

Genome    Palindromes    Genome Length
HCMV      296            229,354
EBV       113            172,282
HSE       194            150,223
HSI       111            134,226
HSS       131            112,930
HSV1      220            152,260
VZV       122            124,885
Significant Palindrome Clusters on

r     Positions of significant (α = 0.05) clusters
1     None
2     None
3     None
4     92526  92570  92643  92701  195032  195112
5     92526  92570  92643  195032
6     92526  92570  92643
7     92526  92570  92643
8     91953  92526  92570  92643  92701
9     91490  91953  92526  92570  92643
10    91490  91637  91953  92526  92570
11    91490  91637  91953  92526  92570
12    91490  91637  91953  92526
13    91490  91637  91953
14    91490
15    91490
Regions of the herpes genomes with statistically significant clusters

Genome    Cluster Location    Biological Feature
HCMV      91490-92643         Origin of replication (oriLyt)
          190532-195112       Transcriptional regulator
HSV1      129511              Transcriptional regulator
          146228              Origin of replication (oriS)
EBV       52787-53311         Origin of replication (oriLyt)
          85174
HSE       115125-115893
          144717-146485
HSS       112418-112422
          109081-109238
VZV       1542
The Q-Q Plot

Figure 2: Q-Q plot for the palindrome positions of the human cytomegalovirus against quantiles of the uniform distribution. Here, we focus only on those palindromes with length ≥ 10 bases in order to screen out the random noise generated by frequent fortuitous occurrences of very short palindromes (see Leung et al. 1994 for a full explanation). The overall straight-line appearance of the Q-Q plot suggests that it would be reasonable to model the occurrences of palindromes above a prescribed length along the genome sequence as i.i.d. points uniformly distributed over (0,1) and evaluate the significance of palindrome clusters with the scan statistic distribution.
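The coordinates of such a Q-Q plot can be computed as sketched below. The function name `uniform_qq_points` is hypothetical, and the choice of $i/(n+1)$ as the theoretical quantile (the expected value of the $i$th uniform order statistic) is one common plotting convention, assumed here rather than taken from the slides.

```python
def uniform_qq_points(positions, genome_length):
    """Pair each scaled, sorted position with a Uniform(0,1) quantile.

    Returns a list of (theoretical, observed) pairs, where the theoretical
    quantile of the i-th smallest of n points is i / (n + 1).
    """
    observed = sorted(p / genome_length for p in positions)
    n = len(observed)
    return [((i + 1) / (n + 1), x) for i, x in enumerate(observed)]


def max_deviation(qq_points):
    """Largest vertical distance of the Q-Q points from the diagonal."""
    return max(abs(obs - theo) for theo, obs in qq_points)


# Evenly spaced toy positions fall exactly on the diagonal.
qq = uniform_qq_points([75, 25, 50], genome_length=100)
print(qq)
print(max_deviation(qq))
```

Points that stay close to the diagonal support the uniform model; a run of points that departs from it marks a region where positions are denser or sparser than expected.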