an efficient algorithm for an efficient algorithm for
play

An Efficient Algorithm for An Efficient Algorithm for Simulating - PowerPoint PPT Presentation

An Efficient Algorithm for An Efficient Algorithm for Simulating Coalescence with Simulating Coalescence with Recombination Recombination Katy L. Simonsen (Statistics) (Statistics) Katy L. Simonsen Dan A. Noland (CS and ITaP ITaP) ) Dan


  1. An Efficient Algorithm for An Efficient Algorithm for Simulating Coalescence with Simulating Coalescence with Recombination Recombination Katy L. Simonsen (Statistics) (Statistics) Katy L. Simonsen Dan A. Noland (CS and ITaP ITaP) ) Dan A. Noland (CS and Chinh Le ( Le (ITaP ITaP), ), Chinh Purdue University Purdue University Katy L. Simonsen Interface 2004 1 Katy L. Simonsen Interface 2004 1

  2. Coalescence with Coalescence with Recombination Recombination � The coalescent process is a Markov Chain � The coalescent process is a Markov Chain process which models the ancestry of a process which models the ancestry of a random sample of DNA sequences from a random sample of DNA sequences from a population population � Recombination allows genetic information � Recombination allows genetic information from two separate individuals to be from two separate individuals to be combined on a single chromosome combined on a single chromosome � Tree � Tree- -like structure describes possibly like structure describes possibly different ancestry at different loci different ancestry at different loci Katy L. Simonsen Interface 2004 2 Katy L. Simonsen Interface 2004 2

  3. Model Parameters Model Parameters Parameters: Parameters: � n � n = 4 sequences = 4 sequences � m � m = 3 loci = 3 loci � recombination rate � recombination rate r i < 0.5 r i < 0.5 � effective � effective population size N population size N � R � R i = 2Nr i for i = 1 i = 2Nr i for i = 1 … m- -1 1 … m Katy L. Simonsen Interface 2004 3 Katy L. Simonsen Interface 2004 3

  4. Why simulate coalescence? Why simulate coalescence? � Simulating CWR � Simulating CWR � allows the generation of large samples of DNA allows the generation of large samples of DNA � sequences with linkage sequences with linkage � statistical properties can empirically be determined statistical properties can empirically be determined � � Realistic levels of linkage disequilibrium are � Realistic levels of linkage disequilibrium are needed to discover effective strategies for needed to discover effective strategies for mapping genes in natural populations mapping genes in natural populations � Recently popular in human genetics literature � Recently popular in human genetics literature Katy L. Simonsen Interface 2004 4 Katy L. Simonsen Interface 2004 4

  5. Original algorithm Original algorithm (Simonsen Simonsen & Churchill, 1997) & Churchill, 1997) ( � steps through Markov Chain to build tree � steps through Markov Chain to build tree � series of recombination and coalescent events series of recombination and coalescent events � � To calculate transition probabilities we need � To calculate transition probabilities we need � � the number of each type of individual present the number of each type of individual present m possible types � there are 2 there are 2 m possible types � � the probability that each type has a recombination event the probability that each type has a recombination event � � Stores a state vector Y of length 2 � Stores a state vector Y of length 2 m m � Impossible to go beyond Impossible to go beyond m m = 32 (or 64) = 32 (or 64) � � Can’t even store the indices of the vector let alone the vector! Can’t even store the indices of the vector let alone the vector! � � Limited by exponential nature � Limited by exponential nature � Could not simulate more than Could not simulate more than m m = 24 loci = 24 loci � � Very slow, especially for large Very slow, especially for large m m and and R R � Katy L. Simonsen Interface 2004 5 Katy L. Simonsen Interface 2004 5

  6. Goal Goal � Want to be able to simulate hundreds, � Want to be able to simulate hundreds, even thousands of loci ( m m ) ) even thousands of loci ( � Also want large � Also want large R R = 2Nr where = 2Nr where 3 – � N is very large (10 N is very large (10 3 – 10 10 6 ), 6 ), � � 0 < r < 0.5 0 < r < 0.5 � � as � as m m increases, increases, r r decreases in such a decreases in such a way that mr mr = constant (genome size) = constant (genome size) way that Katy L. Simonsen Interface 2004 6 Katy L. Simonsen Interface 2004 6

  7. What is taking so much space? What is taking so much space? � Vector � Vector Y Y changes at each step changes at each step m × � Matrix � Matrix Λ Λ (0 (0/1 /1 indicator) size 2 indicator) size 2 m × m m – –1 (fixed) 1 (fixed) � whether recombination can happen for each type whether recombination can happen for each type � Λ R R at every step at every step � Need all entries of � Need all entries of Y Y Λ � to calculate transition probabilities for next step to calculate transition probabilities for next step � � number of steps in the chain is random: � number of steps in the chain is random: � G G = 2H + = 2H + n n – – 1 where r.v. H depends on 1 where r.v. H depends on m m , , n n , , R R � � End result is a tree with End result is a tree with G G “ “events events” ” � � Have to store all this: takes � Have to store all this: takes memory memory � Have to calculate all this: takes � Have to calculate all this: takes time time Katy L. Simonsen Interface 2004 7 Katy L. Simonsen Interface 2004 7

  8. How can we fix it? How can we fix it? � Don � Don’ ’t want to store t want to store Y Y � Don � Don’ ’t want to store t want to store Λ Λ R R � Don � Don’ ’t want to do all that multiplication t want to do all that multiplication � Don � Don’ ’t want to store the big tree with t want to store the big tree with G G events events Katy L. Simonsen Interface 2004 8 Katy L. Simonsen Interface 2004 8

  9. Exploit structure: Y Y Exploit structure: � Y � Y contains integers in [0, contains integers in [0, n n ] ] � Sum of entries in � Sum of entries in Y Y is is ≤ ≤ nm nm � At most nm non � At most nm non- -zero elements out of zero elements out of 2 2 m m � Y � Y starts with one non starts with one non- -zero element (= zero element (= n n ) and ) and ends with one non- -zero element (=1) zero element (=1) ends with one non � Y � Y must be very sparse! must be very sparse! � At most 3 entries change at any one step � At most 3 entries change at any one step � Can store � Can store Y Y cleverly to exploit this structure cleverly to exploit this structure Katy L. Simonsen Interface 2004 9 Katy L. Simonsen Interface 2004 9

  10. Exploit structure: Λ Λ R R Exploit structure: � Λ � Λ turns out not to be sparse turns out not to be sparse � Indicates where recombination is relevant (0 Indicates where recombination is relevant (0 – – 1) 1) � � Looks kind of sparse for small Looks kind of sparse for small m m � � Becomes more dense with Becomes more dense with m m , and is completely , and is completely � dense in the limit (Lei Liu) dense in the limit (Lei Liu) � Can Can’ ’t get away with clever storage here t get away with clever storage here � � BUT since � BUT since Y Y multiplies multiplies Λ Λ R R we don we don’ ’t need it all t need it all � Only need rows of Only need rows of Λ Λ R R with >0 entries of with >0 entries of Y Y � � Will recalculate � Will recalculate Λ Λ R R “ “on the fly on the fly” ” as entries of as entries of Y Y become >0 become >0 � Seems inefficient but trades recalculation for space Seems inefficient but trades recalculation for space � � Use bitwise calculations for speed Use bitwise calculations for speed � Katy L. Simonsen Interface 2004 10 Katy L. Simonsen Interface 2004 10

  11. Key data representations: Y Y Λ Λ R R Key data representations: � Store � Store Y Y as a linked list as a linked list � only the non only the non- -zero entries zero entries � � Along with each element of � Along with each element of Y Y , store its , store its corresponding Λ Λ R R row row corresponding � Store index of � Store index of Y Y (type) as a multi (type) as a multi- -precision precision integer since this is still of length 2 m m integer since this is still of length 2 � use the Gnu use the Gnu Multiprecision Multiprecision ( (gmp gmp) Library (Dan ) Library (Dan’ ’s idea) s idea) � � At each step, only recalculate � At each step, only recalculate Λ Λ R R row for row for “ “new new” ” Y entries entries Y Katy L. Simonsen Interface 2004 11 Katy L. Simonsen Interface 2004 11

  12. Exploit structure: Trees Exploit structure: Trees � Trees are � Trees are complicated when complicated when mingled but simple if mingled but simple if looked at one by one looked at one by one � Store the � Store the “ “marginal marginal” ” trees instead of the trees instead of the “joint joint” ” tree: tree: “ Katy L. Simonsen Interface 2004 12 Katy L. Simonsen Interface 2004 12

Recommend


More recommend