An Efficient Algorithm for An Efficient Algorithm for Simulating Coalescence with Simulating Coalescence with Recombination Recombination Katy L. Simonsen (Statistics) (Statistics) Katy L. Simonsen Dan A. Noland (CS and ITaP ITaP) ) Dan A. Noland (CS and Chinh Le ( Le (ITaP ITaP), ), Chinh Purdue University Purdue University Katy L. Simonsen Interface 2004 1 Katy L. Simonsen Interface 2004 1
Coalescence with Coalescence with Recombination Recombination � The coalescent process is a Markov Chain � The coalescent process is a Markov Chain process which models the ancestry of a process which models the ancestry of a random sample of DNA sequences from a random sample of DNA sequences from a population population � Recombination allows genetic information � Recombination allows genetic information from two separate individuals to be from two separate individuals to be combined on a single chromosome combined on a single chromosome � Tree � Tree- -like structure describes possibly like structure describes possibly different ancestry at different loci different ancestry at different loci Katy L. Simonsen Interface 2004 2 Katy L. Simonsen Interface 2004 2
Model Parameters Model Parameters Parameters: Parameters: � n � n = 4 sequences = 4 sequences � m � m = 3 loci = 3 loci � recombination rate � recombination rate r i < 0.5 r i < 0.5 � effective � effective population size N population size N � R � R i = 2Nr i for i = 1 i = 2Nr i for i = 1 … m- -1 1 … m Katy L. Simonsen Interface 2004 3 Katy L. Simonsen Interface 2004 3
Why simulate coalescence? Why simulate coalescence? � Simulating CWR � Simulating CWR � allows the generation of large samples of DNA allows the generation of large samples of DNA � sequences with linkage sequences with linkage � statistical properties can empirically be determined statistical properties can empirically be determined � � Realistic levels of linkage disequilibrium are � Realistic levels of linkage disequilibrium are needed to discover effective strategies for needed to discover effective strategies for mapping genes in natural populations mapping genes in natural populations � Recently popular in human genetics literature � Recently popular in human genetics literature Katy L. Simonsen Interface 2004 4 Katy L. Simonsen Interface 2004 4
Original algorithm Original algorithm (Simonsen Simonsen & Churchill, 1997) & Churchill, 1997) ( � steps through Markov Chain to build tree � steps through Markov Chain to build tree � series of recombination and coalescent events series of recombination and coalescent events � � To calculate transition probabilities we need � To calculate transition probabilities we need � � the number of each type of individual present the number of each type of individual present m possible types � there are 2 there are 2 m possible types � � the probability that each type has a recombination event the probability that each type has a recombination event � � Stores a state vector Y of length 2 � Stores a state vector Y of length 2 m m � Impossible to go beyond Impossible to go beyond m m = 32 (or 64) = 32 (or 64) � � Can’t even store the indices of the vector let alone the vector! Can’t even store the indices of the vector let alone the vector! � � Limited by exponential nature � Limited by exponential nature � Could not simulate more than Could not simulate more than m m = 24 loci = 24 loci � � Very slow, especially for large Very slow, especially for large m m and and R R � Katy L. Simonsen Interface 2004 5 Katy L. Simonsen Interface 2004 5
Goal Goal � Want to be able to simulate hundreds, � Want to be able to simulate hundreds, even thousands of loci ( m m ) ) even thousands of loci ( � Also want large � Also want large R R = 2Nr where = 2Nr where 3 – � N is very large (10 N is very large (10 3 – 10 10 6 ), 6 ), � � 0 < r < 0.5 0 < r < 0.5 � � as � as m m increases, increases, r r decreases in such a decreases in such a way that mr mr = constant (genome size) = constant (genome size) way that Katy L. Simonsen Interface 2004 6 Katy L. Simonsen Interface 2004 6
What is taking so much space? What is taking so much space? � Vector � Vector Y Y changes at each step changes at each step m × � Matrix � Matrix Λ Λ (0 (0/1 /1 indicator) size 2 indicator) size 2 m × m m – –1 (fixed) 1 (fixed) � whether recombination can happen for each type whether recombination can happen for each type � Λ R R at every step at every step � Need all entries of � Need all entries of Y Y Λ � to calculate transition probabilities for next step to calculate transition probabilities for next step � � number of steps in the chain is random: � number of steps in the chain is random: � G G = 2H + = 2H + n n – – 1 where r.v. H depends on 1 where r.v. H depends on m m , , n n , , R R � � End result is a tree with End result is a tree with G G “ “events events” ” � � Have to store all this: takes � Have to store all this: takes memory memory � Have to calculate all this: takes � Have to calculate all this: takes time time Katy L. Simonsen Interface 2004 7 Katy L. Simonsen Interface 2004 7
How can we fix it? How can we fix it? � Don � Don’ ’t want to store t want to store Y Y � Don � Don’ ’t want to store t want to store Λ Λ R R � Don � Don’ ’t want to do all that multiplication t want to do all that multiplication � Don � Don’ ’t want to store the big tree with t want to store the big tree with G G events events Katy L. Simonsen Interface 2004 8 Katy L. Simonsen Interface 2004 8
Exploit structure: Y Y Exploit structure: � Y � Y contains integers in [0, contains integers in [0, n n ] ] � Sum of entries in � Sum of entries in Y Y is is ≤ ≤ nm nm � At most nm non � At most nm non- -zero elements out of zero elements out of 2 2 m m � Y � Y starts with one non starts with one non- -zero element (= zero element (= n n ) and ) and ends with one non- -zero element (=1) zero element (=1) ends with one non � Y � Y must be very sparse! must be very sparse! � At most 3 entries change at any one step � At most 3 entries change at any one step � Can store � Can store Y Y cleverly to exploit this structure cleverly to exploit this structure Katy L. Simonsen Interface 2004 9 Katy L. Simonsen Interface 2004 9
Exploit structure: Λ Λ R R Exploit structure: � Λ � Λ turns out not to be sparse turns out not to be sparse � Indicates where recombination is relevant (0 Indicates where recombination is relevant (0 – – 1) 1) � � Looks kind of sparse for small Looks kind of sparse for small m m � � Becomes more dense with Becomes more dense with m m , and is completely , and is completely � dense in the limit (Lei Liu) dense in the limit (Lei Liu) � Can Can’ ’t get away with clever storage here t get away with clever storage here � � BUT since � BUT since Y Y multiplies multiplies Λ Λ R R we don we don’ ’t need it all t need it all � Only need rows of Only need rows of Λ Λ R R with >0 entries of with >0 entries of Y Y � � Will recalculate � Will recalculate Λ Λ R R “ “on the fly on the fly” ” as entries of as entries of Y Y become >0 become >0 � Seems inefficient but trades recalculation for space Seems inefficient but trades recalculation for space � � Use bitwise calculations for speed Use bitwise calculations for speed � Katy L. Simonsen Interface 2004 10 Katy L. Simonsen Interface 2004 10
Key data representations: Y Y Λ Λ R R Key data representations: � Store � Store Y Y as a linked list as a linked list � only the non only the non- -zero entries zero entries � � Along with each element of � Along with each element of Y Y , store its , store its corresponding Λ Λ R R row row corresponding � Store index of � Store index of Y Y (type) as a multi (type) as a multi- -precision precision integer since this is still of length 2 m m integer since this is still of length 2 � use the Gnu use the Gnu Multiprecision Multiprecision ( (gmp gmp) Library (Dan ) Library (Dan’ ’s idea) s idea) � � At each step, only recalculate � At each step, only recalculate Λ Λ R R row for row for “ “new new” ” Y entries entries Y Katy L. Simonsen Interface 2004 11 Katy L. Simonsen Interface 2004 11
Exploit structure: Trees Exploit structure: Trees � Trees are � Trees are complicated when complicated when mingled but simple if mingled but simple if looked at one by one looked at one by one � Store the � Store the “ “marginal marginal” ” trees instead of the trees instead of the “joint joint” ” tree: tree: “ Katy L. Simonsen Interface 2004 12 Katy L. Simonsen Interface 2004 12
Recommend
More recommend