Optical Mapping Data: Data Generation and Algorithms


  1. Optical Mapping Data: Data Generation and Algorithms

  2. Sample Preparation Fragments Sequencing Reads Assembly Contigs Analysis

  3. What is an Optical Map? Optical maps are ordered, genome-wide, high-resolution restriction maps. GGCTT CCGA CCACCACAA CCGA ATTATGAAGGATA CCGA A 6,19,35 - Much longer than reads. For example, the average map size for goat covers 360,000 bp - Now commercially available

  4. Microfluidic device. Isolated DNA is elongated and cleaved on the optical mapping surface. Epifluorescence microscope with CCD camera

  5. 6 3 3 9 4

  6. 3 9 4 6 3 6 3 9 4 Genome wide optical map

  7. “There is [..] a critical need for the continued development and public release of software tools for processing optical mapping data ..” -GigaScience 2014

  8. Pipeline labels: Sample Preparation, Optical Map Data, Genome-wide optical map, Sequencing, Assembly, Contigs, Analysis. Goal: a tool to align a contig to a segment of an optical map.

  9. Challenges • Previous approaches use dynamic programming • Burrows-Wheeler Transform (BWT) would improve time efficiency • Challenges in applying BWT: (1) sizing error and (2) alphabet size. Sizing error example: actual optical map values 9 4 6 3; values obtained from experiment 9.5 6 5 4; sizing error 0.5 2 1 1.
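
The sizing-error example can be checked with a few lines. This is a minimal sketch that assumes the actual fragment sizes are 9, 4, 6, 3, which is the ordering that makes the slide's error row (0.5, 2, 1, 1) consistent with the measured values; the function name `sizing_errors` is illustrative.

```python
def sizing_errors(actual, measured):
    """Absolute per-fragment difference between true and measured sizes."""
    if len(actual) != len(measured):
        raise ValueError("maps must have the same number of fragments")
    return [abs(m - a) for a, m in zip(actual, measured)]


actual = [9, 4, 6, 3]        # assumed true fragment sizes
measured = [9.5, 6, 5, 4]    # sizes obtained from the experiment
errors = sizing_errors(actual, measured)
print(errors)                # [0.5, 2, 1, 1]

# None of the measured sizes equals its true size, so an index that only
# supports exact symbol matches would miss every fragment in this map.
exact_hits = sum(1 for a, m in zip(actual, measured) if a == m)
print(exact_hits)            # 0
```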


  11. Challenges • Previous approaches use dynamic programming • Burrows-Wheeler Transform (BWT) would improve time efficiency • Challenges in applying BWT: (1) sizing error and (2) alphabet size. Alphabet size: unique fragment sizes > 16,000!

  12. Sample Preparation Optical Map Data Genome-wide optical map Twin Sequencing Contigs Assembly Alignment of contigs to optical map Analysis

  13. Contig 2 Contig 4 Contig 1 Contig 3 Contig 5

  14. Twin Algorithm 1. In silico digest contigs into optical maps. TTT CCGA CCACTTTT CCGA ATTATGA CCGA A 4,13,24
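
An in silico digest can be sketched as a scan for the enzyme's recognition site. The code below is an illustration rather than Twin's implementation: it assumes the recognition site is CCGA (as in the slide), records each site at its 1-based start position, and takes the distance between consecutive sites as the fragment size. The toy contig and the name `in_silico_digest` are made up for this example.

```python
def in_silico_digest(sequence, site):
    """Return (site_positions, fragment_lengths) for one recognition site.

    Positions are the 1-based start of each occurrence of `site`;
    fragment lengths are the distances between consecutive occurrences.
    """
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(site, start + 1)
    fragments = [b - a for a, b in zip(positions, positions[1:])]
    return positions, fragments


# Toy contig (hypothetical): AA CCGA TTTTT CCGA GG CCGA T
contig = "AACCGATTTTTCCGAGGCCGAT"
positions, fragments = in_silico_digest(contig, "CCGA")
print(positions)   # [3, 12, 18]
print(fragments)   # [9, 6]
```

Where exactly an enzyme cuts relative to its recognition site, and whether the end fragments are kept, are choices this sketch leaves out.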

  15. Twin Algorithm 1. In silico digest contigs into optical maps. 2. Build FM-index* and auxiliary data structures on the genome-wide optical map. * a data structure that allows compression of the input text while still permitting fast substring queries

  16. BWT and FM-index A suffix array (SA) of string S is an array of the suffixes of S sorted into alphabetical order. For S = acaaacg$, the suffixes in text order are 1 acaaacg$, 2 caaacg$, 3 aaacg$, 4 aacg$, 5 acg$, 6 cg$, 7 g$, 8 $. In sorted order (with the terminator $ sorting last) they are 3 aaacg$, 4 aacg$, 1 acaaacg$, 5 acg$, 2 caaacg$, 6 cg$, 7 g$, 8 $.

  17. BWT and FM-index The suffix array clusters all the occurrences of every pattern together into a contiguous range! For S = acaaacg$, the sorted order is 3 aaacg$, 4 aacg$, 1 acaaacg$, 5 acg$, 2 caaacg$, 6 cg$, 7 g$, 8 $, so the four suffixes that begin with "a" occupy the first four rows.
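
The suffix array for the slide's string can be rebuilt with a naive sort. This is a sketch for illustration only (real indexes use linear-time or near-linear-time construction); it assumes the string is acaaacg with a terminator `$` that sorts after every letter, which is what the slide's row order implies. The names `suffix_array` and `pattern_range` are illustrative.

```python
def suffix_array(text, terminator="$"):
    """Naive suffix array: 1-based start positions of the sorted suffixes."""
    def key(start):
        # (is_terminator, char) makes the terminator compare greater than any letter.
        return [(ch == terminator, ch) for ch in text[start:]]
    return [start + 1 for start in sorted(range(len(text)), key=key)]


def pattern_range(text, sa, pattern):
    """Half-open range of suffix-array rows whose suffixes start with `pattern`."""
    hits = [row for row, pos in enumerate(sa) if text.startswith(pattern, pos - 1)]
    return (hits[0], hits[-1] + 1) if hits else (0, 0)


S = "acaaacg$"
SA = suffix_array(S)
print(SA)                          # [3, 4, 1, 5, 2, 6, 7, 8]
print(pattern_range(S, SA, "ac"))  # (2, 4): rows 2 and 3 are adjacent
print(pattern_range(S, SA, "a"))   # (0, 4)
```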


  21. BWT and FM-index The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1]. For S = acaaacg$, writing each sorted suffix followed by the characters that precede it in S gives the rows 3 aaacg$ac, 4 aacg$aca, 1 acaaacg$, 5 acg$acaa, 2 caaacg$a, 6 cg$acaaa, 7 g$acaaac, 8 $acaaacg. Extracting the last column gives BWT = c a $ a a a c g.

  22. BWT and FM-index The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1]. For S = acaaacg$ with SA = 3 4 1 5 2 6 7 8: BWT = c a $ a a a c g and rank = 0 0 0 1 2 3 1 0. rank_K(i): return the number of K's in S[1,i]

  23. BWT and FM-index The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1]. For S = acaaacg$ with SA = 3 4 1 5 2 6 7 8: BWT = c a $ a a a c g and rank = 0 0 0 1 2 3 1 0. Example: rank_a[5] = 2 (row 5, suffix 2 caaacg$, BWT symbol a). rank_K(i): return the number of K's in S[1,i]

  24. BWT and FM-index The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1]. For S = acaaacg$ with SA = 3 4 1 5 2 6 7 8: BWT = c a $ a a a c g and rank = 0 0 0 1 2 3 1 0. The FM-index is the compressed version of the BWT and rank.
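
The BWT and rank columns can be recomputed from the formula BWT[i] = S[SA[i] - 1]. A small sketch, assuming the string acaaacg with terminator `$` and the 1-based suffix array 3 4 1 5 2 6 7 8 from the slide. The rank column here counts earlier occurrences of the same symbol (an exclusive count), which is the convention that reproduces the slide's numbers; the function names are illustrative.

```python
def bwt_from_sa(text, sa):
    """BWT[i] = S[SA[i] - 1] for a 1-based SA; SA[i] = 1 wraps to the last character."""
    # 1-based position p maps to 0-based index p - 1, so its predecessor is p - 2.
    return "".join(text[pos - 2] for pos in sa)


def rank_column(bwt):
    """For each row, the number of earlier rows holding the same BWT symbol."""
    seen = {}
    column = []
    for ch in bwt:
        column.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return column


S = "acaaacg$"
SA = [3, 4, 1, 5, 2, 6, 7, 8]   # taken from the slide
bwt = bwt_from_sa(S, SA)
ranks = rank_column(bwt)
print(bwt)      # ca$aaacg
print(ranks)    # [0, 0, 0, 1, 2, 3, 1, 0]
print(ranks[4]) # 2, the slide's rank_a[5] with 1-based rows
```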

  25. Twin Algorithm 1. In silico digest contigs into optical maps. 2. Build FM-index and auxiliary data structures on the genome-wide optical map. 3. Using the FM-index we find all alignments between the optical map and the in silico digested contigs. - Modified FM-index Backward Search Algorithm

  26. FM-Index Backward Search A recursive algorithm for finding substrings using rank and BWT rank[a] rank[a] rank[c]
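
Backward search can be sketched on the same toy string. The version below is a loop rather than a recursion, uses a naive suffix sort and a linear-scan `rank` in place of a compressed FM-index, and keeps the terminator `$` sorting last as in the slides. All function names are illustrative.

```python
def build_index(text, terminator="$"):
    """Return (suffix array, BWT, first-row table) built by naive sorting."""
    order = lambda ch: (ch == terminator, ch)        # terminator sorts last
    sa = sorted(range(len(text)), key=lambda i: [order(ch) for ch in text[i:]])
    bwt = [text[i - 1] for i in sa]                  # 0-based SA; i == 0 wraps around
    starts = {}
    for row, ch in enumerate(sorted(text, key=order)):
        starts.setdefault(ch, row)                   # first row whose suffix begins with ch
    return sa, bwt, starts


def rank(bwt, ch, i):
    """Occurrences of ch in bwt[0:i]; a real FM-index answers this without scanning."""
    return bwt[:i].count(ch)


def backward_search(pattern, bwt, starts):
    """Half-open range of rows whose suffixes start with `pattern`."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):                     # consume the pattern right to left
        if ch not in starts:
            return 0, 0
        lo = starts[ch] + rank(bwt, ch, lo)
        hi = starts[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


S = "acaaacg$"
sa, bwt, starts = build_index(S)
lo, hi = backward_search("aa", bwt, starts)
positions = sorted(sa[row] + 1 for row in range(lo, hi))
print((lo, hi), positions)   # (0, 2) [3, 4]
```

Each step narrows the row range using only `rank` on the BWT, so the text itself is never scanned during the search.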

  27. Modified FM-Index Backward Search • Sizing error and alphabet size are challenges to overcome • We cannot afford a brute force enumeration of the alphabet at each step in the backward search • Novelty for optical maps: Wavelet Tree

  28. Wavelet Tree A Wavelet Tree converts a string into a balanced binary tree of bit vectors, where a 0 replaces half of the symbols, and a 1 replaces the other half. This definition is applied recursively.

  29. Wavelet Tree {A,C,G,T} is encoded as {0,0,1,1} ACGTATATAGGAAGA 001101010110010


  31. Wavelet Tree {A,C} is encoded as {0,1}. Root: ACGTATATAGGAAGA → 001101010110010. Child for root bit 0: ACAAAAAA → 01000000. No ambiguity!

  32. Wavelet Tree {G,T} is encoded as {0,1}. Root: ACGTATATAGGAAGA → 001101010110010. Child for root bit 0: ACAAAAAA → 01000000. Child for root bit 1: GTTTGGG → 0111000. Which symbols in {A, G} exist in input string?
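
The bit vectors in the wavelet tree example can be reproduced by a short recursive construction. This sketch assumes the sorted alphabet is split in half at every node, with the lower half encoded as 0, and it stores plain Python lists instead of rank-enabled bit vectors. The class name `WaveletNode` is illustrative.

```python
class WaveletNode:
    """One node of a wavelet tree over `sequence` (plain lists, no compression)."""

    def __init__(self, sequence, alphabet=None):
        self.alphabet = sorted(set(sequence)) if alphabet is None else alphabet
        self.bits = []
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return                                   # leaf: a single symbol, no bits needed
        mid = (len(self.alphabet) + 1) // 2          # lower half of the alphabet gets bit 0
        low = set(self.alphabet[:mid])
        self.bits = [0 if s in low else 1 for s in sequence]
        self.left = WaveletNode([s for s in sequence if s in low], self.alphabet[:mid])
        self.right = WaveletNode([s for s in sequence if s not in low], self.alphabet[mid:])

    def bit_string(self):
        return "".join(str(b) for b in self.bits)


root = WaveletNode("ACGTATATAGGAAGA")
print(root.bit_string())        # 001101010110010
print(root.left.bit_string())   # 01000000  (ACAAAAAA)
print(root.right.bit_string())  # 0111000   (GTTTGGG)
```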

  33. Modified FM-Index Backward Search To match x we need to find all the substrings within the range x +/- y, for tolerance y.

  34. Modified FM-Index Backward Search To match 9 we need to find all the substrings within the range [6, 12], for tolerance 3. Genome-wide optical map: 2,11,10,23,53,3,5,10,14,9,11. Root bit vector: 0,1,0,1,1,0,0,0,1,0,1

  35. Modified FM-Index Backward Search To match 9 we need to find all the substrings within the range [6, 12], for tolerance 3. Wavelet tree, level by level (values → bits). Root: 2,11,10,23,53,3,5,10,14,9,11 → 0,1,0,1,1,0,0,0,1,0,1. Level 2: 2,10,3,5,10,9 → 0,1,0,0,1,1 and 11,23,53,14,11 → 0,1,1,0,0. Level 3: 2,3,5 → 0,0,1; 10,9,10 → 0,1,0; 11,14,11 → 0,1,0; 23,53 → 0,1. Level 4: 2,3 → 0,1; 5 → 1.
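
The range query on the wavelet tree can be sketched as a pruned descent: a subtree is skipped as soon as its alphabet cannot intersect [x - y, x + y]. The code below is an illustration under assumptions, not Twin's code: the alphabet is halved at each node with the lower half on the 0 side, and a prefix count of 0 bits stands in for a succinct rank structure. `WaveletTree` and `symbols_in_range` are hypothetical names.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        self.zeros = [0]   # zeros[i] = number of 0 bits among the first i symbols
        if len(self.alphabet) > 1:
            mid = (len(self.alphabet) + 1) // 2
            low = set(self.alphabet[:mid])
            for s in seq:
                self.zeros.append(self.zeros[-1] + (s in low))
            self.left = WaveletTree([s for s in seq if s in low], self.alphabet[:mid])
            self.right = WaveletTree([s for s in seq if s not in low], self.alphabet[mid:])

    def symbols_in_range(self, start, end, lo, hi):
        """Distinct symbols v with lo <= v <= hi among positions [start, end)."""
        if start >= end or not self.alphabet:
            return []
        if self.alphabet[-1] < lo or self.alphabet[0] > hi:
            return []                      # prune: nothing below this node can match
        if self.left is None:
            return [self.alphabet[0]]      # leaf: one symbol, present and in range
        z_start, z_end = self.zeros[start], self.zeros[end]
        return (self.left.symbols_in_range(z_start, z_end, lo, hi)
                + self.right.symbols_in_range(start - z_start, end - z_end, lo, hi))


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
tree = WaveletTree(optical_map)
whole = tree.symbols_in_range(0, len(optical_map), 6, 12)
prefix = tree.symbols_in_range(0, 3, 6, 12)
print(whole)    # [9, 10, 11]
print(prefix)   # [10, 11]
```

Only the symbols that actually occur in the queried interval are visited, which avoids enumerating every value in [6, 12] at each step.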


  37. Modified FM-Index Backward Search A recursive algorithm for finding substrings using rank and BWT rank[a] rank[a] rank[c] Wavelet Tree Query
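
Putting the pieces together, a tolerance-aware backward search branches once per candidate fragment size at each step. The sketch below is a simplified illustration and not Twin's implementation: it assumes a fixed absolute tolerance per fragment, builds the index by naive sorting, and uses a plain scan of the BWT interval where the slides use a wavelet tree query. The names `build_index`, `search`, and `align` are hypothetical.

```python
def build_index(fragments):
    """Naive suffix array, BWT, and first-row table over a list of fragment sizes."""
    end = float("inf")                        # terminator, sorts after every size
    text = list(fragments) + [end]
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    starts = {}
    for row, i in enumerate(sa):
        starts.setdefault(text[i], row)       # first row whose suffix begins with that size
    return sa, bwt, starts


def rank(bwt, symbol, i):
    return bwt[:i].count(symbol)


def search(query, tol, bwt, starts, lo=None, hi=None):
    """Yield (lo, hi, matched_sizes) for row ranges matching `query` within +/- tol."""
    if lo is None:
        lo, hi = 0, len(bwt)
    if not query:
        yield lo, hi, []
        return
    x = query[-1]
    # Stand-in for the wavelet tree query: distinct sizes in bwt[lo:hi] within tolerance.
    candidates = sorted({s for s in bwt[lo:hi] if x - tol <= s <= x + tol})
    for s in candidates:
        new_lo = starts[s] + rank(bwt, s, lo)
        new_hi = starts[s] + rank(bwt, s, hi)
        for r_lo, r_hi, matched in search(query[:-1], tol, bwt, starts, new_lo, new_hi):
            yield r_lo, r_hi, matched + [s]


def align(query, tol, fragments):
    """Return sorted (0-based start index, matched sizes) pairs."""
    sa, bwt, starts = build_index(fragments)
    hits = []
    for lo, hi, matched in search(query, tol, bwt, starts):
        hits.extend((sa[row], matched) for row in range(lo, hi))
    return sorted(hits)


optical_map = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
print(align([12, 9, 21], 3, optical_map))   # [(1, [11, 10, 23])]
```

Because candidates are drawn from the current BWT interval, every branch corresponds to at least one real occurrence, so the search never explores sizes that are absent from the map.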

  38. Twin Algorithm 1. In silico digest contigs into optical maps. 2. Build FM-index and auxiliary data structures on the genome-wide optical map. 3. Using the FM-index we find all alignments between the optical map and the in silico digested contigs. 4. Output the alignments in PSL format.

  39. TWIN Test Datasets

  40. TWIN Results

  41. TWIN: Optical Map Aligner • Twin is the first alignment method that is capable of handling large genome sizes • The only index-based tool, and orders of magnitude faster than existing approaches (patent pending) • Pine tree (20 Gb) would take ~84 machine years with SOMA but a couple of hours with Twin

  42. CORRECTING ERRORS IN GENOMES

  43. Mis-assembly in Genomes Mis-assembly: a significantly large insertion, deletion, inversion, or rearrangement that is the result of decisions made by the assembly program. Correct assembly: A R R B. Rearrangement: B A R R. Deletion: A R B. Insertion: A R R R B.

  44. Extensive vs. Local Mis-assemblies Extensive Mis-assembly: 1 kbp in size and regions align to different strands or different chromosomes. Local Mis-assembly: smaller in size and on the same strand and same chromosome.

  45. De Bruijn Graph of a Genome Example Genome: ABCDEFGHICDEFGKL. Nodes: ABC BCD CDE DEF EFG FGH GHI HIC ICD FGK GKL (figure labels: 1, 2, 3)
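
The node list for the example genome can be reproduced with a small graph builder. This is a sketch under the assumption that k = 3 and that nodes are k-mers linked when they are consecutive in the genome, which matches the three-letter node labels on the slide; `de_bruijn` is an illustrative name.

```python
from collections import defaultdict


def de_bruijn(genome, k):
    """Map each k-mer to the set of k-mers that directly follow it in the genome."""
    graph = defaultdict(set)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        graph[kmer]                              # register the node even without successors
        if i + k < len(genome):
            graph[kmer].add(genome[i + 1:i + k + 1])
    return dict(graph)


graph = de_bruijn("ABCDEFGHICDEFGKL", 3)
branching = sorted(node for node, nxt in graph.items() if len(nxt) > 1)
print(len(graph))            # 11 distinct nodes
print(sorted(graph["EFG"]))  # ['FGH', 'FGK']
print(branching)             # ['EFG']
```

The repeat CDEFG collapses into a single path, so EFG is the one node with two outgoing edges and CDE the one node entered from two predecessors.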
