Efficient and Accurate Clustering for Large-Scale Genetic Mapping
V. Strnadová (Neeley), Aydın Buluç, Jarrod Chapman, Joseph Gonzalez, John Gilbert, Stefanie Jegelka, Daniel Rokhsar, Leonid Oliker
Lawrence Berkeley National Labs, UC Santa Barbara, UC Berkeley, Joint Genome Institute
Motivation
• High-throughput sequencing methods have produced a flood of inexpensive genetic information
• Genetic maps are important to breeding studies, but genetic mapping software is prohibitively slow on large data sets
The Genetic Mapping Problem
Data (one row per marker m_1 to m_15, one column per i_1 to i_6; "-" = missing data)

        i_1  i_2  i_3  i_4  i_5  i_6
m_1      A    B    -    -    A    -
m_2      A    B    A    A    B    A
m_3      A    A    -    -    -    B
m_4      A    -    B    -    B    B
m_5      B    -    B    A    -    A
m_6      A    A    B    A    -    -
m_7      -    -    -    A    B    B
m_8      A    B    A    B    -    A
m_9      A    B    -    B    -    -
m_10     B    B    B    -    A    A
m_11     A    A    A    A    B    B
m_12     B    -    A    B    A    -
m_13     B    B    -    A    A    -
m_14     -    -    -    B    A    A
m_15     B    -    -    A    A    B
[Figure: clustering partitions the markers m_1 to m_15 into two linkage groups, labeled Linkage group 1 and Linkage group 2.]
The Need for Large-Scale Clustering in Genetic Mapping
• Hundreds of thousands of genetic markers are available, but current software can only handle up to ~10,000 markers
• A major bottleneck is the linkage-group-finding phase
• Popular mapping tools all handle this phase the same way, with an $O(M^2)$ clustering algorithm for $M$ markers
Our solution: a fast, scalable clustering algorithm tailored to genetic marker data
Standard Approach to Genetic Marker Clustering
[Figure: the marker data matrix is clustered into Linkage group 1 and Linkage group 2.]
(1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices
• The similarity function is the “LOD score”, a logarithm of the odds that two markers are genetically linked
(2) Cut all edges below a LOD threshold
(3) The resulting connected components are the linkage groups
LOD Score
Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance:

$$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{P(\text{linkage}_{ij})}{P(\text{no linkage}_{ij})}$$

Formally,

$$\mathrm{LOD} = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}}$$

Where:
$R_{ij}$ = number of recombinant offspring
$NR_{ij}$ = number of nonrecombinant offspring
$\theta_{ij}$ = recombination fraction, i.e. $\frac{R_{ij}}{R_{ij} + NR_{ij}}$

Example: for $m_i$ = A B - - A - and $m_j$ = A B A A B A,

$$\mathrm{LOD}(m_i, m_j) = \log_{10} \frac{\left(1 - \frac{1}{3}\right)^{2} \left(\frac{1}{3}\right)^{1}}{0.5^{3}} = 0.074$$
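The LOD formula on this slide can be turned into a few lines of code. The sketch below is illustrative only and is not taken from the slides: the function name `lod_score` and the string encoding of markers are hypothetical, and it assumes that a mismatching pair of genotypes counts as a recombinant, a matching pair counts as a nonrecombinant, and positions where either marker is missing ("-") are skipped. That reading reproduces the slide's example value of 0.074.

```python
import math

MISSING = "-"  # assumed symbol for missing data, as in the slide's table


def lod_score(marker_a: str, marker_b: str) -> float:
    """LOD score of two markers given as equal-length genotype strings.

    Assumption: mismatch = recombinant (R), match = nonrecombinant (NR),
    and positions with missing data in either marker are ignored.
    """
    if len(marker_a) != len(marker_b):
        raise ValueError("markers must have the same number of genotypes")

    r = nr = 0
    for a, b in zip(marker_a, marker_b):
        if a == MISSING or b == MISSING:
            continue
        if a == b:
            nr += 1
        else:
            r += 1

    n = r + nr
    if n == 0:
        return 0.0  # no shared observations, so no evidence either way

    theta = r / n  # recombination fraction R / (R + NR)

    # Work in log space; skip a term whose exponent is zero so that
    # log10(0) is never evaluated when theta is exactly 0 or 1.
    log_linked = 0.0
    if nr:
        log_linked += nr * math.log10(1.0 - theta)
    if r:
        log_linked += r * math.log10(theta)
    return log_linked - n * math.log10(0.5)


# The two markers from the slide's worked example.
print(round(lod_score("AB--A-", "ABAABA"), 3))  # 0.074
```

Note that the sketch applies the formula literally, so a pair that mismatches at nearly every position also receives a high score; how real mapping tools treat that case is outside what the slides state.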
Standard Approach to Genetic Marker Clustering
[Figure: step (2), the marker graph after edges below the LOD threshold are cut, showing Linkage group 1 and Linkage group 2.]
(1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices
• The similarity function is the “LOD score”, a logarithm of the odds that two markers are genetically linked
(2) Cut all edges below a LOD threshold
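The standard approach (score all pairs, cut edges below a LOD threshold, take connected components) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any particular mapping tool: the names `lod_score` and `linkage_groups`, the marker strings, and the threshold value are all hypothetical, mismatches are assumed to count as recombinants, and a union-find structure stands in for explicitly building and cutting the complete graph (the resulting components are the same).

```python
import math
from itertools import combinations


def lod_score(marker_a, marker_b, missing="-"):
    """LOD score; assumes mismatch = recombinant, match = nonrecombinant."""
    pairs = [(a, b) for a, b in zip(marker_a, marker_b)
             if a != missing and b != missing]
    r = sum(1 for a, b in pairs if a != b)
    nr = len(pairs) - r
    if not pairs:
        return 0.0
    theta = r / (r + nr)
    log_linked = 0.0
    if nr:
        log_linked += nr * math.log10(1.0 - theta)
    if r:
        log_linked += r * math.log10(theta)
    return log_linked - (r + nr) * math.log10(0.5)


def linkage_groups(markers, threshold):
    """Group markers into connected components of the thresholded LOD graph.

    markers: dict mapping marker name -> genotype string.
    All M*(M-1)/2 pairs are scored, which is the O(M^2) bottleneck.
    """
    parent = {name: name for name in markers}

    def find(x):
        # Find the component representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Steps (1) and (2): score every pair, keep only edges at or above
    # the threshold, and merge the endpoints of each kept edge.
    for u, v in combinations(markers, 2):
        if lod_score(markers[u], markers[v]) >= threshold:
            parent[find(u)] = find(v)

    # Step (3): each remaining component is one linkage group.
    groups = {}
    for name in markers:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


# Hypothetical toy data: m1/m2 and m3/m4 each differ at one position.
toy = {"m1": "AAAAAAAA", "m2": "AAAAAAAB", "m3": "ABABABAB", "m4": "ABABABAA"}
print(linkage_groups(toy, threshold=0.5))  # [['m1', 'm2'], ['m3', 'm4']]
```

The double loop over `combinations` makes the quadratic cost explicit: doubling the number of markers roughly quadruples the number of LOD evaluations.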