National Taiwan University
Designing small universal k-mer hitting sets for improved analysis of high throughput sequencing
Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C. PLOS Computational Biology. 2017 October; 13(10): e1005777
Hung-Yu Chen, R06945024
Vincent Hwang, B05902122
1 Outline
· Background
· Methods and results
· Conclusion
2 Background
· Sequencing datasets keep growing larger.
· New computational ideas are essential to manage and analyze the data.
3 Minimizer
· Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, James A. Yorke; Reducing storage requirements for biological sequence comparison, Bioinformatics, Volume 20, Issue 18, 12 December 2004, Pages 3363–3369
· Given a sequence of length L, the minimizer is the lexicographically smallest k-mer in it.
· Given a sequence S of any length, the minimizer set is the set of minimizers of every L-long subsequence in S.
  ⇒ Every L-long subsequence in S is represented in the set.
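The two definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `minimizer` and `minimizer_set` are made up for this example, and "subsequence" is read as a contiguous substring.

```python
def minimizer(window: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of one window."""
    return min(window[i:i + k] for i in range(len(window) - k + 1))


def minimizer_set(s: str, k: int, L: int) -> set:
    """Collect the minimizer of every L-long substring of s."""
    return {minimizer(s[i:i + L], k) for i in range(len(s) - L + 1)}


if __name__ == "__main__":
    print(minimizer("ACGTTGCA", 3))                   # ACG
    print(sorted(minimizer_set("ACGTTGCATG", 3, 5)))
```

Every L-long window contributes one k-mer, so the set always represents every window, but nothing bounds how many distinct k-mers end up in it.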
4 Application of Minimizers
· Hashing for read overlapping
· Sparse suffix arrays
· Bloom filters to speed up sequence search
5 Hashing for read overlapping
6 Sparse suffix arrays
7 Bloom filters to speed up sequence search
8 Universal hitting set (UHS)
· For integers k, L, a set U_{k,L} is called a UHS of k-mers if every possible sequence of length L must contain at least one k-mer in U_{k,L}.
· For example, the set of all k-mers is a trivial UHS.
· Problem 1. Given k and L, find a smallest UHS of k-mers.
9 Hits
· A k-mer w hits string S, denoted w ⊆ S, if w is a substring of S.
· A k-mer set X hits string S if there exists w ∈ X such that w ⊆ S.
· The UHS in Problem 1 is a set of k-mers U_{k,L} which hits every possible sequence of length L.
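For tiny parameters, the definition of a UHS can be checked by brute force. The sketch below is not from the paper; the names `set_hits` and `is_uhs` are hypothetical, and it enumerates all |Σ|^L sequences, so it is only usable for very small k and L.

```python
from itertools import product


def set_hits(X, s: str) -> bool:
    """A k-mer set X hits s if some w in X is a substring of s."""
    return any(w in s for w in X)


def is_uhs(X, k: int, L: int, alphabet: str = "ACGT") -> bool:
    """Check by exhaustive enumeration that X hits every L-long sequence."""
    if any(len(w) != k for w in X):
        raise ValueError("every element of X must be a k-mer")
    return all(set_hits(X, "".join(t)) for t in product(alphabet, repeat=L))


if __name__ == "__main__":
    # Over the binary alphabet with k = 2 and L = 3:
    print(is_uhs({"00", "01", "10", "11"}, 2, 3, "01"))  # trivial UHS
    print(is_uhs({"00", "01", "11"}, 2, 3, "01"))        # smaller UHS
    print(is_uhs({"00", "11"}, 2, 3, "01"))              # misses 010 and 101
```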
10 Advantages of UHS over minimizers
· The set of minimizers may be as large as the complete set of k-mers. The method in this paper can often generate UHSs smaller by a factor of nearly k.
· UHS is universal.
  ⇒ For any k and L, a UHS needs to be computed only once and can be reused for every dataset.
  ⇒ The data structures created for different datasets will contain a comparable set of k-mers.
11 Using de Bruijn graphs to find UHSs
· Problem 2. Given a complete de Bruijn graph D_k of order k and an integer L, find a smallest set of vertices U_{k,L} such that any path in D_k of length l = L − k passes through at least one vertex of U_{k,L}.
12 Complete de Bruijn graph
· A complete de Bruijn graph of order k over alphabet Σ:
  V: |Σ|^k vertices, each labelled with a unique k-mer.
  E: If there is an edge (u, v) with a (k+1)-mer label l, then the label of vertex u is the k-prefix of l and the label of vertex v is the k-suffix of l. A complete de Bruijn graph contains all |Σ|^{k+1} possible edges of this type.
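A small sketch of this construction, assuming the usual convention that an edge (u, v) points from the k-prefix of its (k+1)-mer label to the k-suffix. The function name `complete_de_bruijn` and the dictionary layout are choices made for this example only.

```python
from itertools import product


def complete_de_bruijn(k: int, alphabet: str = "ACGT"):
    """Return (vertices, edges) of the complete de Bruijn graph of order k.

    vertices: list of all k-mers.
    edges: dict mapping each (k+1)-mer label to its (u, v) pair,
           where u is the k-prefix and v is the k-suffix of the label.
    """
    vertices = ["".join(t) for t in product(alphabet, repeat=k)]
    edges = {}
    for t in product(alphabet, repeat=k + 1):
        label = "".join(t)
        edges[label] = (label[:k], label[1:])
    return vertices, edges


if __name__ == "__main__":
    V, E = complete_de_bruijn(3, "01")
    print(len(V), len(E))   # 8 16
    print(E["0011"])        # ('001', '011')
```

With this encoding an L-long sequence corresponds to a path of l = L − k edges, which is the link between Problem 1 and Problem 2.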
13 How to find the UHS?
· NP-hard in general (see the supporting information in the paper).
· Heuristic approaches (DOCKS, DOCKSany, DOCKSanyX).
14 How to find UHS?
1. Generate a complete de Bruijn graph of order k, set l = L − k.
2. Find the decycling vertex set (V set), X.
3. Remove X from the graph, resulting in G′.
4. Remove vertices from G′ and add them to X to hit the remaining L-long sequences.
   (i) DOCKS
   (ii) DOCKSany
   (iii) DOCKSanyX
5. X is the universal hitting set we're searching for.
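The greedy idea behind step 4 can be illustrated with a brute-force sketch. This is not the DOCKS implementation: it enumerates all L-long sequences explicitly instead of counting paths in the graph, and it does not compute a decycling set. A starting set can be passed in through the hypothetical `seed` parameter to play the role of X from step 2.

```python
from itertools import product


def greedy_uhs(k: int, L: int, alphabet: str = "01", seed=()):
    """Greedily extend `seed` until every L-long sequence is hit.

    Assumption for illustration: exhaustive enumeration, so only
    tiny k and L are feasible.
    """
    if L < k:
        raise ValueError("L must be at least k")
    kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    X = set(seed)
    # L-long sequences that X does not hit yet.
    remaining = {"".join(t) for t in product(alphabet, repeat=L)}
    remaining = {s for s in remaining if not any(w in s for w in X)}
    while remaining:
        # Pick the k-mer that hits the most remaining sequences
        # (ties are broken by lexicographic order of the k-mers).
        best = max(kmers, key=lambda w: sum(w in s for s in remaining))
        X.add(best)
        remaining = {s for s in remaining if best not in s}
    return X


if __name__ == "__main__":
    print(sorted(greedy_uhs(2, 3)))  # ['00', '01', '11']
```

The loop always terminates because each remaining sequence contains at least one k-mer, so the chosen k-mer removes at least one sequence per round.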
15 Decycling de Bruijn graph
· Vertex labeling
· Factor
· Pure cycling register (PCR_k)
· V-set
16 Decycling de Bruijn graph
[Figure: de Bruijn graph of order 3 over the binary alphabet, with vertices 000, 001, 010, 011, 100, 101, 110, 111]
17 Vertex labeling
For a vertex v = (s_0, s_1, ..., s_{k−1}), calculate the center of mass. According to the position of the center of mass in the coordinate system, label the vertex
I if x = 0,
L if x < 0,
R if x > 0.
18 Vertex labeling example
v = 010111: the x value of the center of mass is > 0. ⇒ R.
[Figure: weight placement for v]
19 Factor
· A factor is a set of cycles such that all vertices in the graph are in exactly one of the cycles.
· Each cycle has a unique feedback function f(s_0, s_1, ..., s_{k−1}) = s_k.
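As an illustration of a factor, the sketch below enumerates the cycles of the pure cycling register. It assumes the standard PCR feedback function f(s_0, ..., s_{k−1}) = s_0, which the slides do not spell out; the function name `pcr_cycles` is made up for this example.

```python
from itertools import product


def pcr_cycles(k: int, alphabet: str = "01"):
    """Partition all k-mers into the cycles of the pure cycling register.

    Assumed feedback: s_k = s_0, so the successor of s_0 s_1 ... s_{k-1}
    is its left rotation s_1 ... s_{k-1} s_0.
    """
    seen = set()
    cycles = []
    for t in product(alphabet, repeat=k):
        start = "".join(t)
        if start in seen:
            continue
        cycle = []
        u = start
        while u not in seen:
            seen.add(u)
            cycle.append(u)
            u = u[1:] + u[0]  # apply the feedback function s_k = s_0
        cycles.append(cycle)
    return cycles


if __name__ == "__main__":
    for c in pcr_cycles(3):
        print(c)
```

For k = 3 over the binary alphabet this yields four cycles, and every one of the eight vertices appears in exactly one of them. A decycling set must contain at least one vertex from each cycle.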