Large Scale Sequencing By Hybridization Ron Shamir Dekel Tsur Tel - PowerPoint PPT Presentation

Large Scale Sequencing By Hybridization Ron Shamir Dekel Tsur Tel Aviv University

Outline
• Background: SBH
• Shotgun SBH
• Analysis of the errorless case
• Analysis of error-prone

Sequencing By Hybridization (SBH) Hybridize target to array containing a spot for each possible k-mer. [Figure: array of probe spots, e.g. TGT, TGG, TGA, CTT, CTG, CTA, GAC, GAA, GAT.]

Sequencing By Hybridization (SBH) [Figure: the same array with copies of the target ACTGAC hybridized to spots.]

Sequencing By Hybridization The spectrum of a sequence: multi-set of all its k-long substrings (k-mers). Goal: reconstruct the sequence from its spectrum. Example: the 3-spectrum of ACTGAC is {ACT, CTG, TGA, GAC}. Pevzner 89: reconstruction is polynomial. But...
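The definition of the spectrum can be illustrated with a minimal Python sketch. The function name `spectrum` is an illustrative choice, not something from the slides.

```python
from collections import Counter

def spectrum(seq, k):
    """Multi-set (as a Counter) of all k-long substrings of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# The slide's example: the 3-mers of ACTGAC.
print(sorted(spectrum("ACTGAC", 3).elements()))
```

A Counter is used rather than a set because the slide defines the spectrum as a multi-set, so repeated k-mers are counted.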

Reconstruction May Be Non-unique Different sequences can have the same spectrum: ACTAC and TACTA both have the 3-spectrum {ACT, CTA, TAC}.
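Non-uniqueness can be checked by brute force on small inputs. The sketch below is an exhaustive search, not the polynomial reconstruction attributed to Pevzner; the function name `reconstructions` is hypothetical, and it assumes k ≥ 2 and the alphabet ACGT.

```python
from collections import Counter

def reconstructions(kmers):
    """Return every sequence whose k-mer multi-set equals `kmers`."""
    kmers = list(kmers)
    k = len(kmers[0])
    remaining = Counter(kmers)
    found = set()

    def extend(seq, left):
        if left == 0:
            found.add(seq)
            return
        suffix = seq[-(k - 1):]          # last k-1 letters must overlap the next k-mer
        for base in "ACGT":
            nxt = suffix + base
            if remaining[nxt] > 0:
                remaining[nxt] -= 1
                extend(seq + base, left - 1)
                remaining[nxt] += 1       # backtrack

    for start in set(kmers):
        remaining[start] -= 1
        extend(start, len(kmers) - 1)
        remaining[start] += 1
    return sorted(found)

print(reconstructions(["ACT", "CTA", "TAC"]))
```

For the spectrum {ACT, CTA, TAC} the search returns ACTAC and TACTA, as on the slide, and also the third rotation CTACT.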

Non-uniqueness Probability $P(N,k)$: prob. that for a random sequence of length $N$, ∃ another sequence with same k-spectrum (failure probability). Arratia et al (97): asymptotically tight bounds for $P(N,k)$. [Plot: $P(N,8)$ and $P(N,9)$ as functions of $N$, for $N$ from 0 to 400; vertical axis from 0 to 0.7.]

Resuscitating SBH ⇒ SBH is currently not competitive for sequencing. How can one make it competitive?

Shotgun SBH (Drmanac, Labat, Brukner, Crkvenjakov 89) 1. Fragment target S into overlapping clones; obtain the spectrum of each clone. [Figure: target ACTAGTTACTCTG split into overlapping clones, each shown with its 3-mer spectrum.]

Shotgun SBH 2. Find the correct clone map (e.g., Mayraz and Shamir, 98). 3. The clone endpoints form a partition of the sequence S into subsequences called information fragments (IF). For each IF, compute its spectrum. [Figure: target with the k-mers ACT and CTG marked, and the clones containing them.]

Shotgun SBH 4. Reconstruct the sequence of each IF. 5. Combine the sequences of the IFs.
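Step 3 of the procedure above, cutting the target at every clone endpoint, can be sketched as follows. The clone coordinates, the half-open interval convention, and both function names are assumptions made for illustration; the slides do not give coordinates.

```python
def information_fragments(clones, length):
    """Cut [0, length) at every clone endpoint; clones are half-open (start, end) pairs."""
    cuts = {0, length}
    for start, end in clones:
        cuts.update(p for p in (start, end) if 0 <= p <= length)
    cuts = sorted(cuts)
    return list(zip(cuts, cuts[1:]))

def covering_clones(fragment, clones):
    """Indices (0-based) of the clones that contain the whole fragment."""
    s, e = fragment
    return [i for i, (cs, ce) in enumerate(clones) if cs <= s and e <= ce]

# Hypothetical clone map over a target of length 13 (the length of ACTAGTTACTCTG).
clones = [(0, 6), (3, 10), (7, 13)]
for frag in information_fragments(clones, 13):
    print(frag, covering_clones(frag, clones))
```

Each IF is characterised by the set of clones covering it, which is what lets k-mers be assigned to IFs from the clone spectra.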

Hybridization Errors Hybridization experiments are error prone. A false negative error: a k-mer appears in a clone but does not appear in its measured spectrum. [Figure: target with the k-mers ACT and CTG marked; CTG is highlighted as an example.]

Goal Drmanac et al.: simulation evidence that shotgun SBH works in the absence of errors. Our Goal: Rigorous analysis, also considering the impact of errors.

Assumptions
• Clone positions are known.
• Equal size IFs ($= d$).
• Each k-mer of target appears in at least one clone spectrum.
• Random sequence: equiprobable bases, independent positions.
• False negative probability $p$ independently for each k-mer and for each clone.
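The random-sequence and false-negative assumptions above can be simulated with a short sketch. The function names are hypothetical, the spectrum is treated as a set per clone, and the sketch does not enforce the assumption that every k-mer survives in at least one clone.

```python
import random

def random_sequence(length, rng):
    """Equiprobable bases, independent positions."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def measured_spectrum(clone, k, p, rng):
    """k-mers of `clone`, each dropped independently with probability p."""
    true_kmers = {clone[i:i + k] for i in range(len(clone) - k + 1)}
    # sorted() only makes the order of random draws reproducible for a fixed seed
    return {kmer for kmer in sorted(true_kmers) if rng.random() >= p}

rng = random.Random(0)
target = random_sequence(40, rng)
print(len(measured_spectrum(target, 5, 0.1, rng)))
```

Calling `measured_spectrum` once per clone gives independent errors across clones, matching the last assumption.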

Hybridization Errors (2) For each k-tuple $P$ in the spectrum, we attribute $P$ to the $i$-th IF, where $i$ is the maximum index of a clone in which $P$ appears. The computed index is always ≤ the true index. [Figure: clones numbered 1 to 5 along the target, with the k-mer CTG marked.]
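The attribution rule can be sketched in a few lines, assuming the measured clone spectra are given as a list of sets ordered by clone index. The function name and the example data are hypothetical.

```python
def attribute_to_fragments(clone_spectra):
    """Map each k-mer to the largest (1-based) clone index whose spectrum contains it."""
    index = {}
    for i, spec in enumerate(clone_spectra, start=1):
        for kmer in spec:
            index[kmer] = i          # later clones overwrite earlier ones
    return index

# CTG is truly present in clones 1..3, but clone 3 suffers a false negative.
measured = [{"CTG", "ACT"}, {"CTG"}, set()]
print(attribute_to_fragments(measured)["CTG"])
```

Here CTG is attributed to index 2 instead of 3, illustrating why a false negative can only lower the computed index.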

Main Result
$N$ = sequence length
$k$ = probe length
$d$ = length of IFs
$p$ = false negative probability
$P(N,k,d,p)$: failure probability
Theorem $P(N,k,d,p) \le \left(1 + \frac{c_p}{d}\right) P(N,k,d,0)$.

Overview of the Proof Will show:
• $P(N,k,d,0) = \Omega\left(\frac{d^3 N}{4^{2k}}\right)$.
• $P(N,k,d,p) - P(N,k,d,0) = O\left(\frac{d^2 N}{4^{2k}}\right)$.
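A short worked step shows why these two bounds suffice for the main result. This is a sketch: $c_p$ denotes a constant that may depend on $p$ but not on $N$, $k$ or $d$, and the hidden constants in the two bounds are assumed to be of that kind.

```latex
\[
\frac{P(N,k,d,p) - P(N,k,d,0)}{P(N,k,d,0)}
  = \frac{O\!\left(d^{2} N / 4^{2k}\right)}{\Omega\!\left(d^{3} N / 4^{2k}\right)}
  = O\!\left(\frac{1}{d}\right)
\quad\Longrightarrow\quad
P(N,k,d,p) \le \left(1 + \frac{c_p}{d}\right) P(N,k,d,0).
\]
```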

The de-Bruijn Graph (Pevzner 89) $A = a_1 \cdots a_{n+k-1}$: the sequence. $A_i$: the $(k-1)$-mer $a_i a_{i+1} \cdots a_{i+k-2}$. The de-Bruijn graph of $A$: $G_A = (V, E)$ where
• $V = \{A_i : i = 1, \ldots, n+1\}$
• $E = \{e_i : i = 1, \ldots, n\}$, $e_i = (A_i, A_{i+1})$
[Figure: de-Bruijn graph of ACTGCTGCC with vertices ACT, CTG, TGC, GCT, GCC.]
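The construction can be written down directly from the definition. A minimal sketch, with the function name `de_bruijn` chosen for illustration; edges are kept as a list so that parallel edges are preserved.

```python
def de_bruijn(seq, k):
    """Return (V, E): V is the set of distinct (k-1)-mers, E the list e_1..e_n of (A_i, A_{i+1})."""
    nodes = [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]   # A_1 .. A_{n+1}
    edges = list(zip(nodes, nodes[1:]))
    return set(nodes), edges

V, E = de_bruijn("ACTGCTGCC", 4)
print(sorted(V))
print(E)
```

For ACTGCTGCC with k = 4 this gives the five vertices of the slide's figure and n = 6 edges, with the edge (CTG, TGC) appearing twice.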

The de-Bruijn Graph Classical SBH: Any solution corresponds to an Euler path in $G_A$.

The de-Bruijn Graph Shotgun SBH w/o errors: Each edge $e_i$ has a label $l_i = \lceil i/d \rceil$ = the index of the IF containing $e_i$. A solution corresponds to an Euler path in which each $e_i$ is in the $l_i$-th IF (i.e. in $[(l_i - 1)d + 1,\; l_i d]$). [Figure: the graph of ACTGCTGCC with each edge labelled 1 or 2.]
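The edge labels and the position range of each IF follow from the formula by integer arithmetic. In this sketch the value d = 3 is an assumed example and the function names are illustrative.

```python
def label(i, d):
    """l_i = ceil(i / d) for a 1-based edge index i."""
    return (i + d - 1) // d

def fragment_range(l, d):
    """Inclusive range of edge positions belonging to the l-th IF."""
    return ((l - 1) * d + 1, l * d)

d = 3   # assumed IF length, for illustration only
print([label(i, d) for i in range(1, 7)])
print(fragment_range(2, d))
```

With n = 6 edges and d = 3, the first three edges get label 1 and the last three get label 2.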

The de-Bruijn Graph Shotgun SBH with errors: $l_i$ = index of the IF containing $e_i$'s sequence. $l'_i$ = max index of a clone containing $e_i$'s sequence. $l'_i \le l_i$.
• The distribution of $l_i - l'_i$ is geometric with parameter $p$.
• A solution corresponds to an Euler path in which each $e_i$ is in an IF with index $\ge l'_i$.
[Figure: the graph of ACTGCTGCC with each edge labelled 1 or 2; one label is lowered by an error.]
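One way to read the geometric claim, under the stated independence assumption: the difference equals $t$ when the $t$ highest-indexed clones that should contain the k-mer all miss it and the next one detects it. The parameterisation below is an assumption about the slide's convention, and truncation at the first clone is ignored.

```latex
\[
\Pr\left[\, l_i - l'_i = t \,\right] = p^{t}\,(1 - p), \qquad t = 0, 1, 2, \ldots,
\qquad
\mathbb{E}\left[\, l_i - l'_i \,\right] = \frac{p}{1 - p}.
\]
```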

Definitions Recall: $A_i$ is the $(k-1)$-mer $a_i a_{i+1} \cdots a_{i+k-2}$. A pair $(i, j)$ is a repeat if $A_i = A_j$. $(i, j)$ is a rightmost repeat if $(i+1, j+1)$ is not a repeat. [Figure: de-Bruijn graph of ACTGCTGCC.]
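Both definitions can be checked on the running example. A minimal sketch, assuming pairs are reported with i < j and 1-based indices as on the slide; the function names are illustrative.

```python
def repeats(seq, k):
    """All pairs (i, j), i < j, 1-based, with A_i == A_j for the (k-1)-mers A_i."""
    A = [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]
    return [(i + 1, j + 1)
            for i in range(len(A))
            for j in range(i + 1, len(A))
            if A[i] == A[j]]

def rightmost_repeats(seq, k):
    """Repeats (i, j) for which (i + 1, j + 1) is not a repeat."""
    reps = set(repeats(seq, k))
    return sorted((i, j) for (i, j) in reps if (i + 1, j + 1) not in reps)

print(repeats("ACTGCTGCC", 4))
print(rightmost_repeats("ACTGCTGCC", 4))
```

For ACTGCTGCC with k = 4 the repeats are (2, 5) for CTG and (3, 6) for TGC; only (3, 6) is rightmost, because (2, 5) can be extended one position to the right.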

Failure Conditions Interleaved pair of repeats: a rightmost repeat $(i, j)$ and a repeat $(i', j')$ with $i \le i' < j < j'$. Theorem (Pevzner 95) A sequence $A$ is not uniquely recoverable iff either
1. $A$ contains an interleaved pair of repeats, or
2. $A_1 = A_{n+1}$.
[Figure: Euler path through edges $e_1$, $e_i$, $e_j$ illustrating the two conditions.]
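The two conditions can be tested mechanically. The sketch below implements the theorem exactly as stated on the slide, with a quadratic scan over repeats; the function name is hypothetical and it assumes the sequence is at least k letters long.

```python
def not_uniquely_recoverable(seq, k):
    """Check the two failure conditions of the theorem as stated above."""
    A = [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]   # A[0] is A_1
    m = len(A)                                                # m = n + 1
    if A[0] == A[-1]:                                         # condition 2
        return True
    reps = {(i, j) for i in range(m) for j in range(i + 1, m) if A[i] == A[j]}
    rightmost = [(i, j) for (i, j) in reps if (i + 1, j + 1) not in reps]
    # condition 1: rightmost repeat (i, j) interleaved with a repeat (i2, j2)
    return any(i <= i2 < j < j2 for (i, j) in rightmost for (i2, j2) in reps)

print(not_uniquely_recoverable("ACTAC", 3))
print(not_uniquely_recoverable("ACTGCTGCC", 4))
print(not_uniquely_recoverable("GATCAGCT", 2))
```

ACTAC fails by condition 2, ACTGCTGCC is uniquely recoverable, and GATCAGCT fails by condition 1: it shares its 2-spectrum with GAGCATCT.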

Failure Conditions - Shotgun SBH Theorem A sequence $A$ is not uniquely recoverable iff either
1. $A$ contains an interleaved pair of repeats $(i, j)$, $(i', j')$ with $l_i = l_{j'-1}$, or
2. $A_1 = A_{d+1} = \cdots = A_{cd+1}$ and $A_{i_1} = A_{i_2} = \cdots = A_{i_c} \ne A_1$ for indices $i_1, i_2, \ldots, i_c$ with $l_{i_j} = j$ and $i_j \ne (j-1)d + 1$.
[Figure: two example graphs with edges labelled 1 and 2, one for each condition.]

Failure Probability: The Errorless Case Using the theorem we show that
$P(N,k,d,0) = \Theta\left(\frac{n}{d} \cdot d^4 \cdot \frac{1}{4^{2k-2}}\right) = \Theta\left(\frac{d^3 n}{4^{2k}}\right)$

                Arratia et al.      Our bounds
n       k       lower    upper      lower    upper     Simulation
193     8       0        0.5923     0.0051   0.1233    0.0907
791     10      0        0.2648     0.0083   0.1341    0.0996
3175    12      0.0502   0.1500     0.0094   0.1356    0.1009
12195   14      0.0742   0.1000     0.0084   0.1152    0.0875

Error-prone Spectra Define event $X$: the solution is not unique when there are errors, but is unique in the errorless case. If event $X$ happens, then $A$ contains a rightmost repeat $(i, j)$ and a repeat $(i', j')$ with $i < j < j'$, $i' \notin [j, j']$, and $l_{i'} \ge l_i$, and either $l_i < l_{j'-1}$, or $j' - 1 = d\,l_i$. [Figure: paths through edges $e_i$ and $e_j$ illustrating event $X$.]
