Optical Mapping Data: Data Generation and Algorithms
Sample Preparation → Fragments → Sequencing → Reads → Assembly → Contigs → Analysis
What is an Optical Map?
Optical maps are ordered, genome-wide, high-resolution restriction maps.
Example: GGCTT CCGA CCACCACAA CCGA ATTATGAAGGATA CCGA A → 6,19,35
- Much longer than reads. For example, the average map size for goat covers 360,000 bp
- Now commercially available
Microfluidic device. Isolated DNA. DNA is elongated and cleaved on the optical mapping surface. Epifluorescence microscope with CCD camera.
[Figure: fragment sizes from individual molecules (6, 3, 3, 9, 4) are assembled into a genome-wide optical map.]
“There is [..] a critical need for the continued development and public release of software tools for processing optical mapping data ..” -GigaScience 2014
Pipeline: Sample Preparation; Optical Map Data; Genome-wide optical map; Sequencing; Assembly; Contigs; Analysis.
Goal: a tool to align each contig to a segment of an optical map.
Challenges
• Previous approaches use dynamic programming
• Burrows-Wheeler Transform (BWT) would improve time efficiency
• Challenges in applying BWT: (1) sizing error and (2) alphabet size
Sizing error:
  Actual optical map values: 6 9 4 3
  Optical map obtained from experiment: 9.5 6 5 4
  Sizing error: 0.5 2 1 1
Alphabet size: unique fragment sizes > 16,000
Pipeline with Twin: Sample Preparation; Optical Map Data; Genome-wide optical map; Sequencing; Assembly; Contigs; Analysis.
Twin: alignment of contigs to the optical map.
[Figure: Contig 2, Contig 4, Contig 1, Contig 3 and Contig 5 shown against the optical map.]
Twin Algorithm
1. In silico digest contigs into optical maps.
Example: TTT CCGA CCACTTTT CCGA ATTATGA CCGA A → 4,13,24
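A minimal sketch of step 1, assuming the recognition site is CCGA (as in the slide examples) and that the enzyme cuts at the first base of each site. The function name `in_silico_digest`, the cut convention, and the toy contig are illustrative assumptions, not Twin's actual code.

```python
def in_silico_digest(contig: str, site: str = "CCGA") -> tuple[list[int], list[int]]:
    """Return (1-based start positions of each site, fragment lengths between cuts).

    Assumption: the cut falls immediately before the first base of the site.
    """
    positions = []
    start = contig.find(site)
    while start != -1:
        positions.append(start + 1)           # convert to 1-based coordinates
        start = contig.find(site, start + 1)  # continue after this hit
    # Cut points in 0-based coordinates, bracketed by the contig ends.
    cuts = [0] + [p - 1 for p in positions] + [len(contig)]
    fragments = [b - a for a, b in zip(cuts, cuts[1:])]
    return positions, fragments


positions, fragments = in_silico_digest("AACCGATTTCCGAG")
print(positions)   # site start positions
print(fragments)   # fragment lengths, i.e. the in silico optical map
```

The list of fragment lengths is what gets aligned against the genome-wide optical map in the later steps.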
Twin Algorithm 1. In silico digest contigs into optical maps. 2. Build FM-index* and auxiliary data structures on the genome-wide optical map. * a data structure that allows compression of the input text while still permitting fast substring queries
BWT and FM-index
A suffix array (SA) of a string S is an array of the suffixes of S sorted into alphabetical order. Example for S = acaaacg$ (the end-of-string symbol sorts last here):
Suffixes in text order: 1 acaaacg$, 2 caaacg$, 3 aaacg$, 4 aacg$, 5 acg$, 6 cg$, 7 g$, 8 $
Suffixes in sorted order: 3 aaacg$, 4 aacg$, 1 acaaacg$, 5 acg$, 2 caaacg$, 6 cg$, 7 g$, 8 $
The suffix array clusters all the occurrences of every pattern together into a contiguous range!
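A short sketch of the suffix array for the slide's example string, assuming the end-of-string symbol is written `$` and sorts after every letter. It uses naive sorting for clarity; real indexes use specialised construction algorithms. The helper names are hypothetical.

```python
def suffix_array(s: str, terminator: str = "$") -> list[int]:
    """Return 1-based start positions of the suffixes of s in sorted order.

    Assumption: the terminator sorts after every other symbol.
    """
    def key(start: int):
        # (1, "") for the terminator ranks above (0, ch) for any other symbol.
        return [(1, "") if ch == terminator else (0, ch) for ch in s[start:]]
    return sorted(range(1, len(s) + 1), key=lambda pos: key(pos - 1))


def rows_matching(s: str, sa: list[int], pattern: str) -> list[int]:
    """0-based rows of the suffix array whose suffix starts with pattern."""
    return [row for row, pos in enumerate(sa) if s[pos - 1:].startswith(pattern)]


S = "acaaacg$"
SA = suffix_array(S)
print(SA)
print(rows_matching(S, SA, "ac"))  # the rows form one contiguous block
```

Because the matching rows are always adjacent, a pattern's occurrences can be described by just two numbers: the first and last row of its range.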
BWT and FM-index
The Burrows-Wheeler Transform (BWT) is a permutation of the string such that BWT[i] = S[SA[i] - 1]. It is obtained by extracting the last column of the sorted rows:
SA  row        BWT  rank
3   aaacg$ac   c    0
4   aacg$aca   a    0
1   acaaacg$   $    0
5   acg$acaa   a    1
2   caaacg$a   a    2
6   cg$acaaa   a    3
7   g$acaaac   c    1
8   $acaaacg   g    0
rank_K(i): return the number of K's in S[1,i]. The rank column lists, for each row, how many times that row's BWT symbol has already occurred in earlier rows; for example, rank_a[5] = 2.
FM-index is the compressed version of the BWT and rank.
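The BWT and the rank column of this example can be reproduced with a few lines. This is a sketch under the assumption that the terminator is `$` and that the suffix array is the one given on the slide; `occurrence_ranks` is a hypothetical helper that counts earlier occurrences of each row's symbol, which is how the slide's rank column appears to be defined.

```python
def bwt_from_sa(s: str, sa: list[int]) -> str:
    """BWT[i] = S[SA[i] - 1] with 1-based SA; the suffix at position 1 wraps to the last symbol."""
    # With a 1-based position p, the preceding symbol is s[p - 2] in Python's
    # 0-based indexing; for p = 1 this is s[-1], i.e. the terminator.
    return "".join(s[p - 2] for p in sa)


def occurrence_ranks(bwt: str) -> list[int]:
    """For each row, the number of earlier rows holding the same BWT symbol."""
    seen: dict[str, int] = {}
    ranks = []
    for ch in bwt:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return ranks


S = "acaaacg$"
SA = [3, 4, 1, 5, 2, 6, 7, 8]   # suffix array from the slide
BWT = bwt_from_sa(S, SA)
RANKS = occurrence_ranks(BWT)
print(BWT)
print(RANKS)
```

A real FM-index stores this information in compressed form rather than as plain Python lists.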
Twin Algorithm 1. In silico digest contigs into optical maps. 2. Build FM-index and auxiliary data structures on the genome-wide optical map. 3. Using the FM-index we find all alignments between the optical map and the in silico digested contigs. - Modified FM-index Backward Search Algorithm
FM-Index Backward Search
A recursive algorithm for finding substrings using rank and BWT.
[Figure: successive search steps labeled rank[a], rank[a], rank[c].]
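A minimal sketch of exact backward search. It is written iteratively rather than recursively, uses the common convention that `$` sorts before all letters, and computes rank by scanning the BWT (a real FM-index answers rank from a compressed structure). All function names are illustrative.

```python
def build_fm_index(text: str):
    """Return (suffix array, BWT, C) for text + '$', using naive construction."""
    s = text + "$"                      # '$' sorts before letters in ASCII
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # s[-1] is '$' for the suffix at 0
    # C[c] = number of symbols in s that are strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(set(s)):
        C[ch] = total
        total += s.count(ch)
    return sa, bwt, C


def rank(bwt: str, ch: str, i: int) -> int:
    """Number of occurrences of ch in bwt[0:i] (linear scan, for illustration)."""
    return bwt[:i].count(ch)


def backward_search(pattern: str, sa, bwt, C) -> list[int]:
    """Return sorted 0-based start positions of pattern in the indexed text."""
    lo, hi = 0, len(bwt)                # current suffix array interval [lo, hi)
    for ch in reversed(pattern):        # extend the match one symbol to the left
        if ch not in C:
            return []
        lo = C[ch] + rank(bwt, ch, lo)
        hi = C[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, bwt, C = build_fm_index("acaaacg")
print(backward_search("ac", sa, bwt, C))
print(backward_search("aa", sa, bwt, C))
```

Each step needs only two rank queries, so the search time depends on the pattern length rather than on the size of the indexed text.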
Modified FM-Index Backward Search • Sizing error and alphabet size are challenges to overcome • We cannot afford a brute force enumeration of the alphabet at each step in the backward search • Novelty for optical maps: Wavelet Tree
Wavelet Tree A Wavelet Tree converts a string into a balanced binary tree of bit vectors, where a 0 replaces half of the symbols, and a 1 replaces the other half. This definition is applied recursively.
Wavelet Tree
Root: {A,C,G,T} is encoded as {0,0,1,1}
  ACGTATATAGGAAGA
  001101010110010
0-child: {A,C} is encoded as {0,1}
  ACAAAAAA
  01000000
1-child: {G,T} is encoded as {0,1}
  GTTTGGG
  0111000
No ambiguity!
Which symbols in {A, G} exist in the input string?
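The construction on this slide can be sketched as a small recursive function. The dictionary layout and the name `build_wavelet_tree` are assumptions made for illustration; the split rule (lower half of the sorted alphabet gets 0) matches the {A,C} / {G,T} example.

```python
def build_wavelet_tree(seq: str, alphabet=None) -> dict:
    """Recursively split the alphabet in half and record one bit per symbol."""
    if alphabet is None:
        alphabet = sorted(set(seq))
    node = {"symbols": alphabet, "seq": seq, "bits": None, "left": None, "right": None}
    if len(alphabet) <= 1:
        return node                       # leaf: a single symbol needs no bits
    mid = (len(alphabet) + 1) // 2        # lower half of the alphabet maps to 0
    low = set(alphabet[:mid])
    node["bits"] = "".join("0" if ch in low else "1" for ch in seq)
    node["left"] = build_wavelet_tree(
        "".join(ch for ch in seq if ch in low), alphabet[:mid])
    node["right"] = build_wavelet_tree(
        "".join(ch for ch in seq if ch not in low), alphabet[mid:])
    return node


tree = build_wavelet_tree("ACGTATATAGGAAGA")
print(tree["bits"])
print(tree["left"]["seq"], tree["left"]["bits"])
print(tree["right"]["seq"], tree["right"]["bits"])
```

Only the bit vectors need to be stored in practice; the `seq` fields are kept here so the output can be compared with the slide.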
Modified FM-Index Backward Search To match a fragment size x, we need to find all the fragment sizes within the range [x - y, x + y], for tolerance y.
Modified FM-Index Backward Search
To match 9 we need to find all the fragment sizes within the range [6, 12], for tolerance 3.
Genome-wide optical map: 2,11,10,23,53,3,5,10,14,9,11
Wavelet tree (each node lists its symbols, then its bit vector; children are labeled by the bit that selects them):
2,11,10,23,53,3,5,10,14,9,11 : 0,1,0,1,1,0,0,0,1,0,1
  0: 2,10,3,5,10,9 : 0,1,0,0,1,1
    0: 2,3,5 : 0,0,1
      0: 2,3 : 0,1
      1: 5
    1: 10,10,9 : 1,1,0
  1: 11,23,53,14,11 : 0,1,1,0,0
    0: 11,14,11 : 0,1,0
    1: 23,53 : 0,1
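The range question on this slide ("which fragment sizes in [6, 12] occur?") can be answered by walking the wavelet tree and pruning subtrees whose symbols fall outside the range. The sketch below is one way to do that; the `Node` class, the prefix-count array `zeros`, and the function name are illustrative assumptions, not Twin's implementation.

```python
class Node:
    """Wavelet tree node over a sequence of integers."""

    def __init__(self, seq: list[int], alphabet: list[int]):
        self.alphabet = alphabet          # sorted distinct symbols under this node
        self.left = self.right = None
        if len(alphabet) > 1:
            mid = (len(alphabet) + 1) // 2
            low = set(alphabet[:mid])     # lower half of the alphabet maps to 0
            # zeros[i] = number of 0 bits among the first i positions.
            self.zeros = [0]
            for x in seq:
                self.zeros.append(self.zeros[-1] + (1 if x in low else 0))
            self.left = Node([x for x in seq if x in low], alphabet[:mid])
            self.right = Node([x for x in seq if x not in low], alphabet[mid:])


def symbols_in_range(node: Node, l: int, r: int, lo: int, hi: int) -> list[int]:
    """Distinct symbols v with lo <= v <= hi occurring at positions [l, r)."""
    if l >= r or node.alphabet[-1] < lo or node.alphabet[0] > hi:
        return []                         # empty interval or subtree out of range
    if node.left is None:
        return [node.alphabet[0]]         # leaf: its single symbol is present
    zl, zr = node.zeros[l], node.zeros[r]
    # Positions map to the 0-child via the zero counts, to the 1-child via the rest.
    return (symbols_in_range(node.left, zl, zr, lo, hi)
            + symbols_in_range(node.right, l - zl, r - zr, lo, hi))


omap = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
root = Node(omap, sorted(set(omap)))
print(symbols_in_range(root, 0, len(omap), 6, 12))
print(symbols_in_range(root, 0, 3, 6, 12))
```

The pruning is what avoids enumerating the whole alphabet: subtrees such as {2,3,5} or {23,53} are discarded after a single comparison.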
Modified FM-Index Backward Search
A recursive algorithm for finding substrings using rank and BWT.
[Figure: search steps labeled rank[a], rank[a], rank[c], each driven by a Wavelet Tree Query.]
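A hedged sketch of how a tolerance-aware backward search might look on a sequence of fragment sizes. At each step it branches over every symbol within the tolerance that actually occurs in the current BWT interval. For brevity the candidates are found by scanning the interval, where the slides use a wavelet tree query; the sentinel value, function names, and example query are assumptions for illustration only.

```python
SENTINEL = -1  # assumed end marker, smaller than any real fragment size


def build_index(omap: list[int]):
    """Naive suffix array, BWT and C array over an integer sequence."""
    s = list(omap) + [SENTINEL]
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = [s[i - 1] for i in sa]          # s[-1] is the sentinel for suffix 0
    C, total = {}, 0
    for sym in sorted(set(s)):
        C[sym] = total
        total += s.count(sym)
    return sa, bwt, C


def tolerant_search(query: list[int], tol: int, sa, bwt, C) -> list[int]:
    """Start positions where every query fragment matches within +/- tol."""
    hits: list[int] = []

    def extend(k: int, lo: int, hi: int) -> None:
        if k < 0:                         # whole query matched
            hits.extend(sa[lo:hi])
            return
        # Symbols preceding the current matches that fit the tolerance window.
        # (A wavelet tree range query would replace this linear scan.)
        candidates = {x for x in bwt[lo:hi]
                      if x != SENTINEL and abs(x - query[k]) <= tol}
        for sym in candidates:
            new_lo = C[sym] + bwt[:lo].count(sym)
            new_hi = C[sym] + bwt[:hi].count(sym)
            extend(k - 1, new_lo, new_hi)

    extend(len(query) - 1, 0, len(bwt))
    return sorted(hits)


omap = [2, 11, 10, 23, 53, 3, 5, 10, 14, 9, 11]
sa, bwt, C = build_index(omap)
print(tolerant_search([9, 22], 3, sa, bwt, C))
print(tolerant_search([10, 10], 1, sa, bwt, C))
```

Because candidates are drawn only from symbols present in the current interval, every branch leads to a non-empty interval, which is the point of avoiding a brute-force sweep over the alphabet.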
Twin Algorithm 1. In silico digest contigs into optical maps. 2. Build FM-index and auxiliary data structures on the genome-wide optical map. 3. Using the FM-index we find all alignments between the optical map and the in silico digested contigs. 4. Output the alignments in PSL format.
TWIN Test Datasets
TWIN Results
TWIN: Optical Map Aligner
Twin is the first alignment method that is capable of handling large genome sizes.
• The only index-based tool, and orders of magnitude faster than existing approaches (patent pending)
• Pine tree (20 Gb) would take ~84 machine years with SOMA but a couple of hours with Twin
CORRECTING ERRORS IN GENOMES
Mis-assembly in Genomes
Mis-assembly: a significantly large insertion, deletion, inversion, or rearrangement that is the result of decisions made by the assembly program.
Correct assembly: A R R B
Rearrangement: B A R R
Deletion: A R B
Insertion: A R R R B
Extensive vs. Local Mis-assemblies Extensive Mis-assembly: 1 kbp in size and regions align to different strands or different chromosomes. Local Mis-assembly: smaller in size and on the same strand and same chromosome.
De Bruijn Graph of a Genome
Example Genome: ABCDEFGHICDEFGKL
Nodes (3-mers): ABC, BCD, CDE, DEF, EFG, FGH, GHI, HIC, ICD, FGK, GKL
[Figure: de Bruijn graph over these nodes, with three parts of the graph labeled 1, 2 and 3.]
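A small sketch that rebuilds this graph from the example genome, assuming the node-centric convention the slide appears to use: each distinct 3-mer is a node, and an edge joins two 3-mers that are adjacent in the genome. The function name and the choice k = 3 are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn(genome: str, k: int = 3):
    """Return (sorted distinct k-mers, adjacency sets between consecutive k-mers)."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    edges: dict[str, set[str]] = defaultdict(set)
    for a, b in zip(kmers, kmers[1:]):
        edges[a].add(b)                  # b overlaps a by k - 1 symbols
    return sorted(set(kmers)), edges


nodes, edges = de_bruijn("ABCDEFGHICDEFGKL")
print(nodes)
# Nodes with more than one successor are where an assembler must choose a path.
print({n: sorted(edges[n]) for n in nodes if len(edges[n]) > 1})
```

The repeated segment CDEFG collapses into a single path, so EFG has two successors and CDE is reached from two different nodes.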