Discovery of Genomic Structural Variations with Next-Generation Sequencing Data Marcel H. Schulz Advanced Topics in Computational Genomics Oct 2011 with slides from Tobias Rausch (EMBL) and Kai Ye (Leiden University)
Genomic Rearrangements/ Structural Variations • 1 Kb to several Mb in size • Copy number variants – Deletion – Duplication • Insertion, Inversion, Translocation • More abundant than SNPs …ACGATACG… …ACGAGACG… courtesy of Tobias Rausch (EMBL)
Genomic Rearrangements/ Structural Variations • 1 Kb to several Mb in size • Copy number variants – Deletion – Duplication • Insertion, Inversion, Translocation • More abundant than SNPs • Either neutral or non-neutral in function • Non-neutral mechanisms – Disrupting genes – Creating fusion genes – Copy number changes of dosage-sensitive genes courtesy of Tobias Rausch (EMBL)
Why Structural Variation Discovery? • Find disease-causing genes • Trace the evolutionary history of genomes • Analyze the mechanisms by which SVs arise • Understand the spread of repetitive elements (LINEs, Alu elements, etc.)
Technologies to Discover Structural Variations
Technologies • Fluorescence in situ hybridization (FISH) – Fluorescent probes (≈ 100 kb) detect and localize the presence or absence of specific DNA sequences Perry et al. (2007) courtesy of Tobias Rausch (EMBL)
Technologies • Fluorescence in situ hybridization (FISH) • Comparative Genomic Hybridization (CGH) – Test vs. reference sample – 2.1 million probes – Different types • Whole-Genome Tiling Arrays • Whole-Genome Exon-Focused Arrays • CNV Arrays courtesy of Tobias Rausch (EMBL)
Technologies • Fluorescence in situ hybridization (FISH) • Comparative Genomic Hybridization (CGH) • Genome-Wide Human SNP Array 6.0 – 1.8 million genetic markers • 906,600 SNPs • 946,000 probes for CNVs courtesy of Tobias Rausch (EMBL)
Technologies • Fluorescence in situ hybridization (FISH) • Comparative Genomic Hybridization (CGH) • Genome-Wide Human SNP Array 6.0 • Human 1M-Duo DNA Analysis BeadChip – 1.2 million genetic markers • Markers for SNPs and CNV regions – Targeted studies • 60,800 additional custom SNPs • 60,000 custom CNV targets courtesy of Tobias Rausch (EMBL)
Technologies • Fluorescence in situ hybridization (FISH) • Comparative Genomic Hybridization (CGH) • Genome-Wide Human SNP Array 6.0 • Human 1M-Duo DNA Analysis BeadChip • Next-Generation Sequencing (NGS) – Whole-genome sequencing – Targeted, e.g. RNA-Seq courtesy of Tobias Rausch (EMBL)
Focus on NGS • Limitations of Arrays – Lower resolution for genomic rearrangements – Balanced events (e.g., inversions) cannot be detected using signal intensity differences – No breakpoint information courtesy of Tobias Rausch (EMBL)
Paired-end data • Two protocols for paired-end data – mate-pair sequencing by circularization (traditional Sanger sequencing) – paired-end NGS (figure: protocol overview)
Paired-end data – paired-end NGS (insert distribution known due to fragment size selection)
Computational Methods
Experiment
Detecting Genomic Rearrangements • Mate-pair or paired-end mapping abnormalities • Split-read alignments • Read depth signals • Unmapped or single-anchored reads • Local assembly (figure label: Reference) courtesy of Tobias Rausch (EMBL)
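As a rough illustration of the paired-end mapping abnormalities listed above, the sketch below labels a single read pair by comparing its mapped span and mate orientation with what the library leads us to expect. It assumes a forward/reverse paired-end library; the `ReadPair` fields, the function name, and the 3-standard-deviation cutoff are hypothetical choices for this example, not taken from the slides or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal description of one mapped read pair."""
    same_chromosome: bool
    left_start: int        # leftmost reference coordinate of the left mate
    right_end: int         # rightmost reference coordinate of the right mate
    left_is_forward: bool  # left mate maps to the forward strand
    right_is_forward: bool  # right mate maps to the forward strand


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Return a candidate SV label for one pair (assumed cutoff: n_sd deviations)."""
    if not pair.same_chromosome:
        return "translocation"
    if pair.left_is_forward == pair.right_is_forward:
        # Both mates on the same strand: one end lies in inverted sequence.
        return "inversion"
    if not pair.left_is_forward:
        # Mates point away from each other (everted pair).
        return "tandem duplication"
    span = pair.right_end - pair.left_start
    if span > mean_insert + n_sd * sd_insert:
        # Mates map farther apart than the fragment is long.
        return "deletion"
    if span < mean_insert - n_sd * sd_insert:
        # Mates map closer together than the fragment is long.
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    pair = ReadPair(True, 1000, 2500, True, False)
    print(classify_pair(pair, mean_insert=300, sd_insert=30))
```

In practice a single discordant pair is weak evidence; callers typically require clusters of pairs that support the same event.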
Insertions Deletions courtesy of Tobias Rausch (EMBL)
Lee et al. (2009) Korbel et al. (2007) courtesy of Tobias Rausch (EMBL)
[Figure: read-depth profile with segments at 1, 1, 0, 2 and 2 copies] Chiang et al. (2009) courtesy of Tobias Rausch (EMBL)
• Down syndrome – Partial trisomy 21 Xie et al. (2009) courtesy of Tobias Rausch (EMBL)
Human cancer cell lines compared to normal cell lines (SegSeq algorithm, no fixed window size, multiple change points method) Chiang et al. (2009)
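To make the read-depth idea concrete, here is a deliberately simplified sketch that turns per-window read counts from a test and a reference sample into copy-number estimates. Unlike the method cited on the slide, it uses fixed windows and no change-point detection; the function name, the median normalization, and the assumption that most windows sit at the reference copy number are all choices made for this example.

```python
import statistics


def estimate_copy_number(test_counts, ref_counts, ref_copies=2):
    """Estimate copy number per fixed window from test/reference read counts.

    Assumption: the median test/reference ratio corresponds to `ref_copies`,
    i.e. most windows are unaffected by copy-number changes.
    """
    # Raw depth ratio per window; windows without reference reads are skipped.
    ratios = [t / r if r > 0 else None for t, r in zip(test_counts, ref_counts)]
    usable = [x for x in ratios if x is not None]
    if not usable:
        raise ValueError("no window has reference coverage")
    baseline = statistics.median(usable)
    if baseline == 0:
        raise ValueError("median ratio is zero; cannot normalize")
    # Normalize by the baseline so that differences in sequencing depth cancel.
    return [None if x is None else round(ref_copies * x / baseline) for x in ratios]


if __name__ == "__main__":
    test = [200, 200, 100, 200, 300, 200]
    normal = [100, 100, 100, 100, 100, 100]
    print(estimate_copy_number(test, normal))
```

Real data are noisy, so a practical caller would smooth or segment the ratios before rounding them to integer copy numbers.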
With reads of length 40-100 bp, can we find the exact breakpoint of a structural variation? Yes – using split-read mapping (figure labels: Donor, Reference). Example for a read of length 40: expected random matches for a 12 bp read prefix in the human genome? $\frac{3 \cdot 10^{9}}{4^{12}} \approx 179$
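The estimate above can be reproduced with a few lines of arithmetic. The sketch assumes a genome of about 3 billion positions, uniformly random bases, and a single strand; the function name and default values are illustrative only.

```python
def expected_random_matches(k, genome_size=3e9, alphabet_size=4):
    """Expected number of chance occurrences of a fixed k-mer.

    Assumes independent, uniformly distributed bases, so each position
    matches with probability alphabet_size ** -k.
    """
    return genome_size / alphabet_size ** k


if __name__ == "__main__":
    for k in (12, 16, 20):
        print(k, expected_random_matches(k))
```

Under these assumptions a 12 bp prefix is far from unique, while the expected count drops below one at around 16 bp, which is why short split-read fragments need an anchor to restrict the search space.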
With reads of length 40-100 bp, can we find the exact breakpoint of a structural variation? Yes – using anchored split-read mapping: the mappable mate of the read provides an anchor that narrows down the search space (figure labels: Donor, Reference) Medvedev et al. (2009)
The Pindel algorithm (Deletions) How to do that? Ye et al. (2009)
The Pindel algorithm (Deletions) ① Use the 3′ end of the left (mapped) read as the anchor point ② Use pattern growth to search for the minimum and maximum unique substrings from the 3′ end of the unmapped read (within 2× the insert size of the anchor) Ye et al. (2009)
String matching by pattern growth • Pattern: ATGCA • Text: ATCAAGTATGCTTAGC • Minimum unique substring: ATG • Maximum unique substring: ATGC courtesy of Kai Ye (Leiden U.)
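The example on this slide can be traced with a small sketch of pattern growth: start from all occurrences of the first pattern character and extend the match one character at a time, discarding occurrences that stop matching. This is a simplified illustration, not Pindel's implementation; it grows a prefix left to right only, and the function name and return format are assumptions.

```python
def pattern_growth(pattern, text):
    """Return (min_unique, max_unique, position), or None if no prefix is unique.

    min_unique: shortest prefix of `pattern` occurring exactly once in `text`
    max_unique: longest prefix that still matches at that single position
    """
    if not pattern:
        return None
    # All start positions that match the first character.
    positions = [i for i, base in enumerate(text) if base == pattern[0]]
    length = 1
    min_len = None
    while positions:
        if len(positions) == 1 and min_len is None:
            min_len = length  # first time the match becomes unique
        if length == len(pattern):
            break
        # Keep only occurrences that also match the next pattern character.
        extended = [p for p in positions
                    if p + length < len(text) and text[p + length] == pattern[length]]
        if not extended:
            break  # the current length is the longest match
        positions = extended
        length += 1
    if min_len is None:
        return None
    return pattern[:min_len], pattern[:length], positions[0]


if __name__ == "__main__":
    print(pattern_growth("ATGCA", "ATCAAGTATGCTTAGC"))
```

For the slide's input, the candidate positions shrink from five (all A's) to two (AT) to one (ATG), and the match can be extended to ATGC before the next character fails.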
The Pindel algorithm (Deletions) ① Use the 3′ end of the left (mapped) read as the anchor point ② Use pattern growth to search for the minimum and maximum unique substrings from the 3′ end of the unmapped read (within 2× the insert size of the anchor) ③ Use pattern growth to search for the minimum and maximum unique substrings from the 5′ end of the unmapped read (within read length + Max_D), starting from the end mapped in step 2 ④ Check whether the complete unmapped read can be reconstructed by combining the substring matches from the 3′ and 5′ ends Ye et al. (2009)
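The four steps can be illustrated with a toy split-read search for a deletion. This is a simplified sketch, not Pindel itself: it works on one strand only, uses plain substring search instead of pattern growth, and treats the anchor as a given search window. The function name and the parameters `window_start`, `window_end`, and `max_del` are hypothetical.

```python
def find_deletion(reference, read, window_start, window_end, max_del):
    """Return (del_start, del_end) such that reference[del_start:del_end] is
    missing from the read, or None if no split explains the read.

    The search window stands in for the region near the anchoring mate.
    """
    window = reference[window_start:window_end]
    # Try the longest read prefix first, then progressively shorter ones.
    for split in range(len(read) - 1, 0, -1):
        prefix, suffix = read[:split], read[split:]
        first = window.find(prefix)
        if first == -1 or window.find(prefix, first + 1) != -1:
            continue  # prefix is absent or not unique inside the window
        left_end = window_start + first + split
        # The rest of the read must map downstream, skipping 1..max_del bases.
        pos = reference.find(suffix, left_end + 1, left_end + max_del + len(suffix))
        if pos != -1:
            return left_end, pos
    return None


if __name__ == "__main__":
    ref = "GATTACAGGCTTCAGTCCATGGAACTGTTAGCCGATACGT"
    # Simulated donor read spanning a 10 bp deletion of ref[14:24].
    read = ref[6:14] + ref[24:32]
    print(find_deletion(ref, read, 0, 20, max_del=20))
```

Because the longest prefix is tried first, a breakpoint that could be placed at several equivalent positions (shared bases on both sides of the deletion) is reported at its rightmost placement in this sketch.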
The Pindel algorithm (Insertions) ① Use the 3′ end of the left (mapped) read as the anchor point ② Use pattern growth to search for the minimum and maximum unique substrings from the 3′ end of the unmapped read (within 2× the insert size of the anchor) ③ Use pattern growth to search for the minimum and maximum unique substrings from the 5′ end of the unmapped read (within read length − 1), starting from the end mapped in step 2 ④ Check whether the complete unmapped read can be reconstructed by combining the substring matches from the 3′ and 5′ ends Ye et al. (2009)
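For insertions, the two mapped parts of the read are adjacent on the reference and the unmatched middle of the read is the inserted sequence. The sketch below shows that idea under the same simplifications as before: one strand, plain substring search rather than pattern growth, and a given search window. The function name and parameters are hypothetical, and only insertions shorter than the read can be found this way.

```python
def find_insertion(reference, read, window_start, window_end):
    """Return (breakpoint, inserted_sequence), or None if nothing fits.

    `breakpoint` is the reference coordinate before which the extra
    bases appear in the read.
    """
    window = reference[window_start:window_end]
    # Leave room for at least one inserted base and one suffix base.
    for split in range(len(read) - 2, 0, -1):
        prefix = read[:split]
        first = window.find(prefix)
        if first == -1 or window.find(prefix, first + 1) != -1:
            continue  # prefix is absent or not unique inside the window
        breakpoint_pos = window_start + first + split
        # Longest read suffix that continues the reference right at the breakpoint.
        for k in range(len(read) - split - 1, 0, -1):
            if reference.startswith(read[-k:], breakpoint_pos):
                return breakpoint_pos, read[split:len(read) - k]
    return None


if __name__ == "__main__":
    ref = "GATTACAGGCTTCAGTCCATGGAACTGTTAGCCGATACGT"
    # Simulated donor read with TTT inserted before reference position 14.
    read = ref[6:14] + "TTT" + ref[14:22]
    print(find_insertion(ref, read, 0, 20))
```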