Large Scale Sequencing By Hybridization

  1. Large Scale Sequencing By Hybridization. Ron Shamir and Dekel Tsur, Tel Aviv University.

  2. Outline: • Background: SBH • Shotgun SBH • Analysis of the errorless case • Analysis of the error-prone case

  3. Sequencing By Hybridization (SBH): Hybridize the target to an array containing a spot for each possible k-mer. [Figure: array of 3-mer spots, including TGT, TGG, TGA, CTT, CTG, CTA, GAC, GAA, GAT.]

  4. Sequencing By Hybridization (SBH): Hybridize the target to an array containing a spot for each possible k-mer. [Figure: copies of the target ACTGAC hybridizing to spots of the array.]

  6. Sequencing By Hybridization: The spectrum of a sequence is the multiset of all its k-long substrings (k-mers). Goal: reconstruct the sequence from its spectrum. Example: the 3-spectrum of ACTGAC is {ACT, CTG, TGA, GAC}. Pevzner 89: reconstruction is polynomial. But...
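
As a small illustration of the definition on this slide, the following Python sketch computes a k-spectrum as a multiset. The function name `spectrum` is an assumption made for this example and does not come from the slides.

```python
from collections import Counter

def spectrum(seq: str, k: int) -> Counter:
    """Return the multiset of all k-long substrings (k-mers) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# The slide's example: the 3-spectrum of ACTGAC.
print(sorted(spectrum("ACTGAC", 3).elements()))
# → ['ACT', 'CTG', 'GAC', 'TGA']
```

A `Counter` is used so that a k-mer occurring several times in the sequence keeps its multiplicity.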

  7. Reconstruction May Be Non-unique: Different sequences can have the same spectrum. For example, ACTAC and TACTA both have the spectrum {ACT, CTA, TAC}.
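
The non-uniqueness on this slide can be checked by brute force on such a tiny example. The sketch below enumerates every DNA string of the same length and keeps those with an identical 3-spectrum. The helper names are assumptions made for illustration, and the exhaustive search is only feasible for toy inputs.

```python
from collections import Counter
from itertools import product

def spectrum(seq, k):
    """Multiset of the k-mers of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def same_spectrum_sequences(seq, k, alphabet="ACGT"):
    """Brute force: all strings of len(seq) whose k-spectrum equals that of seq."""
    target = spectrum(seq, k)
    candidates = ("".join(p) for p in product(alphabet, repeat=len(seq)))
    return sorted(s for s in candidates if spectrum(s, k) == target)

print(same_spectrum_sequences("ACTAC", 3))
# → ['ACTAC', 'CTACT', 'TACTA']
```

Besides the two sequences named on the slide, the enumeration also returns CTACT, which has the same three 3-mers.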

  8. Non-uniqueness Probability: P(N, k) is the probability that, for a random sequence of length N, there exists another sequence with the same k-spectrum (failure probability). Arratia et al. (97): asymptotically tight bounds for P(N, k). [Figure: plot of P(N, 8) and P(N, 9) for N from 0 to 400, with the vertical axis running from 0 to 0.7.]

  9. Resuscitating SBH: ⇒ SBH is currently not competitive for sequencing. How can one make it competitive?

  10. Shotgun SBH (Drmanac, Labat, Brukner, Crkvenjakov 89): 1. Fragment the target S into overlapping clones; obtain the spectrum of each clone. [Figure: the target ACTAGTTACTCTG with the 3-mer spectra of its overlapping clones.]

  11. Shotgun SBH: 2. Find the correct clone map (e.g., Mayraz and Shamir, 98).

  12. Shotgun SBH: 2. Find the correct clone map (e.g., Mayraz and Shamir, 98). 3. The clone endpoints form a partition of the sequence S into subsequences called information fragments (IFs). For each IF, compute its spectrum. [Figure: the target with occurrences of ACT and CTG, and the clones containing them.]

  13. Shotgun SBH: 4. Reconstruct the sequence of each IF.

  14. Shotgun SBH: 4. Reconstruct the sequence of each IF. 5. Combine the sequences of the IFs.
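
Steps 1 and 3 of the procedure can be sketched in Python as follows. The clone coordinates are hypothetical (the slides do not give them), the helper names are assumptions, and spectra are stored as plain sets for brevity. Unlike the later analysis, this toy example does not produce equal-size IFs.

```python
def spectrum(seq, k):
    """Set of k-mers of seq (multiplicities are ignored in this sketch)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def information_fragments(clones):
    """Split the covered region at every clone endpoint.

    clones: list of half-open (start, end) intervals on the target.
    Returns the consecutive half-open intervals between sorted endpoints.
    """
    points = sorted({p for clone in clones for p in clone})
    return list(zip(points, points[1:]))

target = "ACTAGTTACTCTG"              # target shown on the slides
clones = [(0, 8), (3, 11), (6, 13)]   # hypothetical clone positions

# Step 1: the spectrum of each clone.
for start, end in clones:
    print(target[start:end], sorted(spectrum(target[start:end], 3)))

# Step 3: the clone endpoints partition the target into IFs.
print(information_fragments(clones))
# → [(0, 3), (3, 6), (6, 8), (8, 11), (11, 13)]
```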

  15. Hybridization Errors: Hybridization experiments are error prone. A false negative error: a k-mer appears in a clone but does not appear in its measured spectrum. [Figure: the target with occurrences of ACT and CTG, and the clones containing them.]

  17. Goal: Drmanac et al.: simulation evidence that shotgun SBH works in the absence of errors. Our goal: a rigorous analysis that also considers the impact of errors.

  18. Assumptions: • Clone positions are known. • Equal-size IFs (size = d). • Each k-mer of the target appears in at least one clone spectrum. • Random sequence: equiprobable bases, independent positions. • False negatives occur with probability p, independently for each k-mer and for each clone.

  19. Hybridization Errors (2): For each k-tuple P in the spectrum, we attribute P to the i-th IF, where i is the maximum index of a clone in which P appears. [Figure: example with the k-mer CTG and indices 1 to 5.]

  21. Hybridization Errors (2): For each k-tuple P in the spectrum, we attribute P to the i-th IF, where i is the maximum index of a clone in which P appears. [Figure: example with the k-mer CTG and indices 1 to 5.] The computed index is always ≤ the true index.
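
The attribution rule can be sketched as follows. The data layout (a dictionary from clone index to measured spectrum) and the example values are assumptions made for illustration; they are not taken from the slides.

```python
def attribute_to_if(measured_spectra):
    """Map each k-mer to the maximum index of a clone whose measured spectrum contains it.

    measured_spectra: dict {clone_index: set of k-mers}, indices starting at 1.
    """
    computed = {}
    for index, kmers in measured_spectra.items():
        for kmer in kmers:
            computed[kmer] = max(index, computed.get(kmer, index))
    return computed

# Hypothetical data: CTG truly occurs in clones 2, 3 and 4, but a false
# negative removed it from the measured spectrum of clone 4.
measured = {1: {"ACT"}, 2: {"ACT", "CTG"}, 3: {"CTG"}, 4: set(), 5: set()}
print(attribute_to_if(measured)["CTG"])
# → 3
```

In this made-up example the computed index 3 is below the true index 4, in line with the slide's remark that false negatives can only lower the computed index.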

  22. Main Result: N = sequence length; k = probe length; d = length of IFs; p = false negative probability; P(N, k, d, p) = failure probability. Theorem: $P(N, k, d, p) \le \left(1 + \frac{c_p}{d}\right) P(N, k, d, 0)$.

  23. Overview of the Proof: Will show: • $P(N, k, d, 0) = \Omega\left(\frac{d^3 N}{4^{2k}}\right)$. • $P(N, k, d, p) - P(N, k, d, 0) = O\left(\frac{d^2 N}{4^{2k}}\right)$.

  24. The de-Bruijn Graph (Pevzner 89): $A = a_1 \cdots a_{n+k-1}$: the sequence. $A_i$: the (k−1)-mer $a_i a_{i+1} \cdots a_{i+k-2}$. The de-Bruijn graph of A: $G_A = (V, E)$ where • $V = \{A_i : i = 1, \ldots, n+1\}$ • $E = \{e_i : i = 1, \ldots, n\}$, $e_i = (A_i, A_{i+1})$. [Figure: de-Bruijn graph of ACTGCTGCC with edges ACT, CTG, TGC, GCT, GCC.]

  25. The de-Bruijn Graph: [Figure: de-Bruijn graph of ACTGCTGCC with edges ACT, CTG, TGC, GCT, GCC.] Classical SBH: Any solution corresponds to an Euler path in $G_A$.

  26. The de-Bruijn Graph: [Figure: de-Bruijn graph of ACTGCTGCC with edge labels.] Shotgun SBH w/o errors: Each edge $e_i$ has a label $l_i = \lceil i/d \rceil$ = the index of the IF containing $e_i$. A solution corresponds to an Euler path in which each $e_i$ is in the $l_i$-th IF (i.e. in $[(l_i - 1)d + 1,\ l_i d]$).
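
A minimal sketch of the labeled graph construction, using the slides' example sequence. The value d = 4 is an assumption chosen for illustration (the slides do not state d for this example, so the labels need not match the figure), and the function name is hypothetical.

```python
from math import ceil

def labeled_de_bruijn_edges(seq, k, d):
    """Return edges e_i = (A_i, A_{i+1}) with labels l_i = ceil(i / d), for i = 1..n."""
    n = len(seq) - k + 1
    edges = []
    for i in range(1, n + 1):
        kmer = seq[i - 1:i - 1 + k]   # the k-mer spelled by edge e_i
        edges.append((kmer[:-1], kmer[1:], ceil(i / d)))
    return edges

for tail, head, label in labeled_de_bruijn_edges("ACTGCTGCC", k=3, d=4):
    print(f"{tail} -> {head}  (IF {label})")
```

Each edge joins the (k−1)-mer prefix of a k-mer to its (k−1)-mer suffix, so repeated k-mers such as CTG appear as parallel edges that may carry different IF labels.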

  27. The de-Bruijn Graph: [Figure: de-Bruijn graph of ACTGCTGCC with edge labels.] Shotgun SBH with errors: $l_i$ = index of the IF containing $e_i$'s sequence. $l'_i$ = maximum index of a clone whose measured spectrum contains $e_i$'s sequence. $l'_i \le l_i$.

  28. The de-Bruijn Graph: [Figure: de-Bruijn graph of ACTGCTGCC with edge labels.] Shotgun SBH with errors: $l_i$ = index of the IF containing $e_i$'s sequence. $l'_i$ = maximum index of a clone whose measured spectrum contains $e_i$'s sequence. $l'_i \le l_i$. • The distribution of $l_i - l'_i$ is geometric with parameter p.

  29. The de-Bruijn Graph: [Figure: de-Bruijn graph of ACTGCTGCC with edge labels.] Shotgun SBH with errors: $l_i$ = index of the IF containing $e_i$'s sequence. $l'_i$ = maximum index of a clone whose measured spectrum contains $e_i$'s sequence. $l'_i \le l_i$. • The distribution of $l_i - l'_i$ is geometric with parameter p. • A solution corresponds to an Euler path in which each $e_i$ is in an IF with index $\ge l'_i$.
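
One way to see why the gap is geometric, as a sketch under stated assumptions: false negatives are independent with probability p in each clone, the true index $l_i$ is the highest index of a clone containing the k-mer, and the truncation caused by the finite number of such clones is ignored. The gap equals j exactly when the k-mer is missed in the j highest-indexed clones that contain it and detected in the next one:

```latex
\Pr\left[\, l_i - l'_i = j \,\right] = p^{\,j}\,(1 - p), \qquad j = 0, 1, 2, \ldots
```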

  30. Definitions: Recall: $A_i$ is the (k−1)-mer $a_i a_{i+1} \cdots a_{i+k-2}$. A pair (i, j) is a repeat if $A_i = A_j$. [Figure: de-Bruijn graph of ACTGCTGCC with edges ACT, CTG, TGC, GCT, GCC.]

  31. Definitions: Recall: $A_i$ is the (k−1)-mer $a_i a_{i+1} \cdots a_{i+k-2}$. A pair (i, j) is a repeat if $A_i = A_j$. (i, j) is a rightmost repeat if (i + 1, j + 1) is not a repeat. [Figure: de-Bruijn graph of ACTGCTGCC with edges ACT, CTG, TGC, GCT, GCC.]
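
The two definitions can be made concrete with a short Python sketch on the slides' example sequence. The function names are assumptions, and pairs are reported with i < j and 1-based indices to match the notation $A_i$.

```python
def k_minus_1_mers(seq, k):
    """List A_1, ..., A_{n+1} (stored 0-based) of the (k-1)-mers of seq."""
    return [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]

def repeats(seq, k):
    """All pairs (i, j), 1-based with i < j, such that A_i = A_j."""
    mers = k_minus_1_mers(seq, k)
    return [(i + 1, j + 1)
            for i in range(len(mers))
            for j in range(i + 1, len(mers))
            if mers[i] == mers[j]]

def rightmost_repeats(seq, k):
    """Repeats (i, j) for which (i + 1, j + 1) is not a repeat."""
    mers = k_minus_1_mers(seq, k)
    # The 1-based pair (i + 1, j + 1) corresponds to 0-based indices (i, j).
    return [(i, j) for i, j in repeats(seq, k)
            if j >= len(mers) or mers[i] != mers[j]]

print(repeats("ACTGCTGCC", 3))            # → [(2, 5), (3, 6), (4, 7)]
print(rightmost_repeats("ACTGCTGCC", 3))  # → [(4, 7)]
```

Here the repeated segment CTGC yields three consecutive repeats, and only the last of them is rightmost.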

  32. Failure Conditions: Interleaved pair of repeats: a rightmost repeat (i, j) and a repeat (i′, j′) with i ≤ i′ < j < j′. Theorem (Pevzner 95): A sequence A is not uniquely recoverable iff either 1. A contains an interleaved pair of repeats, or 2. $A_1 = A_{n+1}$. [Figure: de-Bruijn graph fragments illustrating cases (1) and (2).]
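
The two failure conditions can be tested directly, as in the sketch below. It applies the theorem exactly as stated on this slide. The function name and the third test sequence are assumptions made for illustration.

```python
def is_uniquely_recoverable(seq, k):
    """Check the two failure conditions of the theorem as stated on the slide."""
    # A_1, ..., A_{n+1}, stored 0-based.
    mers = [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]
    m = len(mers)
    if m > 1 and mers[0] == mers[-1]:          # condition 2: A_1 = A_{n+1}
        return False
    reps = [(i, j) for i in range(m) for j in range(i + 1, m)
            if mers[i] == mers[j]]
    for i, j in reps:
        # (i, j) is rightmost if (i + 1, j + 1) is not a repeat.
        if j + 1 < m and mers[i + 1] == mers[j + 1]:
            continue
        # Condition 1: some repeat (i2, j2) with i <= i2 < j < j2.
        if any(i <= i2 < j < j2 for i2, j2 in reps):
            return False
    return True

print(is_uniquely_recoverable("ACTAC", 3))         # → False
print(is_uniquely_recoverable("ACTGCTGCC", 3))     # → True
print(is_uniquely_recoverable("ACGGTTACCGTA", 3))  # → False
```

ACTAC fails through condition 2, which agrees with the earlier example of two sequences sharing one spectrum. ACGGTTACCGTA is a made-up sequence in which the repeats of AC and CG interleave.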

  35. Failure Conditions - Shotgun SBH: Theorem: A sequence A is not uniquely recoverable iff either 1. A contains an interleaved pair of repeats (i, j), (i′, j′) with $l_i = l_{j'} - 1$, or 2. $A_1 = A_{d+1} = \cdots = A_{cd+1}$ and $A_{i_1} = A_{i_2} = \cdots = A_{i_c} \ne A_1$ for indices $i_1, i_2, \ldots, i_c$ with $l_{i_j} = j$ and $i_j \ne (j-1)d + 1$. [Figure: de-Bruijn graph fragments with edge labels 1 and 2 illustrating the two cases.]

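  36. Failure Probability: The Errorless Case: Using the theorem we show that $P(N, k, d, 0) = \Theta\left(\frac{n}{d} \cdot \binom{d}{4} \cdot \frac{1}{4^{2k-2}}\right) = \Theta\left(\frac{d^3 n}{4^{2k}}\right)$.

      | n | k | Arratia et al. lower | Arratia et al. upper | Our bounds lower | Our bounds upper | Simulation |
      |---|---|---|---|---|---|---|
      | 193 | 8 | 0 | 0.5923 | 0.0051 | 0.1233 | 0.0907 |
      | 791 | 10 | 0 | 0.2648 | 0.0083 | 0.1341 | 0.0996 |
      | 3175 | 12 | 0.0502 | 0.1500 | 0.0094 | 0.1356 | 0.1009 |
      | 12195 | 14 | 0.0742 | 0.1000 | 0.0084 | 0.1152 | 0.0875 |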

  37. Error-prone Spectra: Define event X: the solution is not unique when there are errors, but is unique in the errorless case. If event X happens, then A contains a rightmost repeat (i, j) and a repeat (i′, j′) with i < j < j′, i′ ∉ [j, j′], and $l_{i'} \ge l_i$, and either $l_i < l_{j'} - 1$ or $j' - 1 = d\,l_i$. [Figure: de-Bruijn graph fragments with edges $e_i$ and $e_j$.]
