DepthOfCoverage
Genetics for Dummies 2017
NGS II – Illumina Sequencing
Robert Kraaij, Department of Internal Medicine, r.kraaij@erasmusmc.nl
Overview • Data Analysis • Applications • Example: Exome Sequencing
Things to be addressed • NGS: many short reads that might contain errors • data analysis will handle these reads and errors
Illumina Sequencing • bridge PCR • cBot • flowcell • HiSeq2000
Per Cycle Imaging G A T C
Per Cycle Base Calling • G: good quality • G: poor quality
Quality Scoring
Phred Score   Incorrect base call   Accuracy
10            1 in 10               90 %
20            1 in 100              99 %
30            1 in 1000             99.9 %
40            1 in 10000            99.99 %
50            1 in 100000           99.999 %
Phred scores 0 to 93 map to ASCII 33 to 126, so each score is stored as a single character.
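The table follows the standard Phred relation Q = -10 * log10(p), where p is the probability that the base call is wrong. A minimal Python sketch of that relation and of the ASCII encoding is given here; the function names are illustrative and the offset of 33 is taken from the "ASCII 33 to 126" range on the slide.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def phred_to_ascii(q, offset=33):
    """Encode a Phred score (0 to 93) as one printable ASCII character."""
    if not 0 <= q <= 93:
        raise ValueError("Phred score must be between 0 and 93")
    return chr(q + offset)


def ascii_to_phred(ch, offset=33):
    """Decode one quality character back to its Phred score."""
    return ord(ch) - offset


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(q, f"1 in {round(1 / p)}", f"{100 * (1 - p):.3f} %")
```

The loop prints one row per score so the output can be compared with the table on the slide.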
FASTQ File
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC
+SEQ_ID
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>
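Each FASTQ record has four lines: an "@" header, the sequence, a "+" separator line, and one quality character per base. The sketch below reads the record from the slide and decodes the quality string; the parser is a simplified illustration (it assumes unwrapped four-line records and a quality offset of 33), not a full FASTQ reader.

```python
EXAMPLE = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC
+SEQ_ID
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>
"""


def parse_fastq(text, offset=33):
    """Parse four-line FASTQ records from a string into a list of dicts."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = (ln.strip() for ln in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append({
            "id": header[1:],
            "seq": seq,
            # one Phred score per base, decoded from the ASCII character
            "phred": [ord(c) - offset for c in qual],
        })
    return records


records = parse_fastq(EXAMPLE)
first = records[0]
print(first["id"], len(first["seq"]), "bases")
print("lowest Phred:", min(first["phred"]), "highest Phred:", max(first["phred"]))
```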
Alignment or Mapping of Reads
ref    GATTACGGTACTTGCATAGCT   reference genome (hg19)
read      TACGGTACTTGCATA
per read: chromosome + position + strand
output: sample.bam
Run QC and filtering sample.bam
Sorted BAM file (sample.bam) • both reads • quality scores • chromosome • position • quality flag • duplicate flag • off target flag
Coverage
read     TTACGGTACTTGCAT
read         GGTACTTGCATAGCT
read   GATTACGGTACTTGCA
read        CGGTACTTGCATAG
read      TACGGTACTTGCATA
ref    GATTACGGTACTTGCATAGCT
5x coverage
Mean Coverage = bases on target / size of target
% of Bases Above a Certain Threshold
read     TTACGGTACTTGCAT
read         GGTACTTGCATAGCT
read   GATTACGGTACTTGCA
read        CGGTACTTGCATAG
read      TACGGTACTTGCATA
ref    GATTACGGTACTTGCATAGCT
coverage at the marked positions: 1x, 5x, 5x, 4x
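Both coverage metrics can be computed from a per-base depth count. The sketch below uses the five reads and the 21-base reference from the slide; the start offsets are an assumption, obtained by matching each read against the reference, and the function names are illustrative.

```python
REFERENCE = "GATTACGGTACTTGCATAGCT"

# (0-based start on the reference, read sequence); offsets are inferred
READS = [
    (2, "TTACGGTACTTGCAT"),
    (6, "GGTACTTGCATAGCT"),
    (0, "GATTACGGTACTTGCA"),
    (5, "CGGTACTTGCATAG"),
    (3, "TACGGTACTTGCATA"),
]


def depth_per_base(reads, target_len):
    """Count how many reads cover each position of the target."""
    depth = [0] * target_len
    for start, seq in reads:
        for pos in range(start, min(start + len(seq), target_len)):
            depth[pos] += 1
    return depth


def mean_coverage(depth):
    """Bases on target divided by size of target."""
    return sum(depth) / len(depth)


def percent_at_least(depth, threshold):
    """Percentage of target bases covered at least `threshold` times."""
    covered = sum(1 for d in depth if d >= threshold)
    return 100.0 * covered / len(depth)


depth = depth_per_base(READS, len(REFERENCE))
print(depth)
print(f"mean coverage: {mean_coverage(depth):.2f}x")
print(f"bases at 5x or better: {percent_at_least(depth, 5):.1f} %")
```

With these reads, 75 bases fall on a 21-base target, so the mean coverage is about 3.6x even though the middle of the target reaches 5x.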
Variant Calling
read    ATTACGGTGCTTGCA
read        CGGTGCTTGCATAGC
read   GATTACGGTGCT-GCATAGCT
read     TTACGGTGCTTGCAT
read         GGTGCTTGCATAGCT
read   GATTACGGTGCTTGCA
read        CGGTGCTTGCATAG
read      TACGGTGCTTGCATA
ref    GATTACGGTACTTGCATAGCT
G = homozygous alternative
Variant Calling
read    ATTACGGTGCTTGCA
read        CGGTGCTTGCATAGC
read   GATTACGGTACT-GCATAGCT
read     TTACGGTACTTGCAT
read         GGTGCTTGCATAGCT
read   GATTACGGTACTTGCA
read        CGGTGCTTGCATAG
read      TACGGTGCTTGCATA
ref    GATTACGGTACTTGCATAGCT
A/G = heterozygous
Variant Calling
read   GATTACGGTACTTGCA
read        CGGTGCTTGCATAG
read      TACGGTGCTTGCATA
ref    GATTACGGTACTTGCATAGCT
A/G = heterozygous?
Variant Calling
read   GATTACGGTACTTGCA
read        CGGTGCTTGCATAG
read      TACGGTGCTTGCATA
ref    GATTACGGTACTTGCATAGCT
sequencing quality (poor vs. good) taken into account: G
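The idea behind these slides can be sketched as a simple count of the bases seen at one position, ignoring bases with poor quality. This is a toy illustration only: the thresholds `min_qual` and `min_fraction` and the quality values in the examples are assumptions, and production callers such as GATK HaplotypeCaller use probabilistic models rather than fixed cutoffs.

```python
from collections import Counter


def call_genotype(bases, quals, min_qual=20, min_fraction=0.2):
    """Naive genotype call at one position from a pileup column.

    bases: the base each read shows at this position
    quals: the Phred score of each of those bases
    Bases below min_qual are ignored; every allele supported by at least
    min_fraction of the remaining bases is reported.
    """
    counts = Counter(b for b, q in zip(bases, quals) if q >= min_qual)
    total = sum(counts.values())
    if total == 0:
        return None  # nothing reliable to call from
    alleles = sorted(b for b, n in counts.items() if n / total >= min_fraction)
    return "/".join(alleles)


# all eight reads show G, the reference has A
print(call_genotype("GGGGGGGG", [30] * 8))
# five reads show G and three show A
print(call_genotype("GGAAGAGG", [30] * 8))
# three reads only: one A and two G, all treated as good quality
print(call_genotype("AGG", [30, 30, 30]))
# same three reads, but the A is a poor-quality base (hypothetical scores)
print(call_genotype("AGG", [10, 35, 35]))
```

The last two calls show why low coverage is risky: with only three reads, a single poor-quality base decides between a heterozygous and a homozygous call.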
VCF File (sample.vcf) • chromosome • position • quality • annotations
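A VCF data line is tab-separated, with the fixed columns CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below pulls out the items named on the slide; the example line and its values are made up for illustration, and the parser ignores the header and any per-sample columns.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its fixed fields (simplified)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    annotations = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            # INFO flags have no value; store them as True
            annotations[key] = value if value else True
    return {
        "chromosome": chrom,
        "position": int(pos),
        "ref": ref,
        "alt": alt,
        "quality": None if qual == "." else float(qual),
        "filter": flt,
        "annotations": annotations,
    }


# hypothetical variant line, not taken from the slides
example = "chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=8;AF=0.5;DB"
variant = parse_vcf_line(example)
print(variant["chromosome"], variant["position"], variant["quality"])
print(variant["annotations"])
```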
Variant Calling [pileup: some reads carry a one-base deletion, shown as "-", relative to the reference GATTACGGTACTTGCATAGCT] deletion = heterozygous
Paired-End Sequencing 2 x 100 bp
Variant Calling: Mate Pairs (distance between the mapped mates) • 400 bp: normal • 800 bp: deletion • 200 bp: insertion
Variant Calling: Mate Pairs • 400 bp: normal • translocation
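The mate-pair logic can be sketched as a comparison of the mapped distance with the expected distance. The 400 bp expectation comes from the slides; the 100 bp tolerance, the function name and the example coordinates are assumptions, and a real structural variant caller would require several supporting pairs and look at read orientation as well.

```python
def classify_mate_pair(chrom1, pos1, chrom2, pos2, expected=400, tolerance=100):
    """Classify one mate pair by where its two reads map on the reference."""
    if chrom1 != chrom2:
        # mates map to different chromosomes
        return "translocation"
    distance = abs(pos2 - pos1)
    if distance > expected + tolerance:
        # mates map further apart than expected: sample lacks sequence
        return "deletion"
    if distance < expected - tolerance:
        # mates map closer than expected: sample has extra sequence
        return "insertion"
    return "normal"


print(classify_mate_pair("chr1", 1000, "chr1", 1400))
print(classify_mate_pair("chr1", 1000, "chr1", 1800))
print(classify_mate_pair("chr1", 1000, "chr1", 1200))
print(classify_mate_pair("chr1", 1000, "chr5", 1400))
```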
Variant Calling: Split Reads • 800 bp • genome • mRNA (cDNA)
Applications • Re-sequencing, full genome: SNPs and indels • Re-sequencing, mate pairs: structural variations • Re-sequencing, regional: SNPs and indels • Sequencing: de novo assembly • RNAseq • ChIPseq • …seq
www.illumina.com
Example: Exome Sequencing
Exome Sequencing • funding by NGI-NCHA, NWO, BBMRI • n > 3,000 samples of random set from RS-I • start May 2011; Nimblegen • part of “CHARGE-S” effort: >5,000 exomes across 4 CHARGE cohorts (Framingham, CHS, ARIC, Rotterdam Study) • Expand with exome variants array?
Exome vs Full Genome • genome: 3 Gb • exome (exons only): ~30 Mb
Exome Sequencing Workflow • DNA isolation • Library preparation • Exome capture • Sequencing • Data analysis
Exome capture
Nimblegen SeqCap EZ v2 Capture • CCDS (Sept 2009) • miRBase (v14, Sept 2009) • RefSeq (Jan 2010) • 2,100,000 probes • 30,246 coding genes • 329,028 exons • 710 miRNAs • 36.5 Mb primary target • 44.1 Mb capture target
Illumina TruSeq V3 2x100 PE Sequencing
Data analysis: BWA-GATK pipeline
Demultiplexing • BclToFastQ (CASAVA) • Chastity Filter
Alignment • BWA (paired) • SortSam, MarkDuplicates (picard)
Processing • BaseQualityScoreRecalibration, IndelRealignment (GATK)
Variant-Calling • HaplotypeCaller • VQSR • VarEval
Analysis • ANNOVAR, VCFtools • PlinkSeq, SKAT, R • Spotfire
Sample QC and Variant QC
RSX-2 Samples were sequenced to ~54x Mean Coverage [Plot axes: percentage of 44 Mb covered 10x or better; average mean depth of coverage across the 44 Mb SeqCap exome]
Mean Depth of Coverage by Flowcell [Plot axes: mean depth of coverage; flowcell number (roughly chronological order)]
Freemix Values by Flowcell [Plot axes: estimated Freemix values; flowcell number (roughly chronological order)]
Determining Heterozygous Concordance versus 550k genotyping arrays [Plot axes: heterozygous concordance; flowcell number (roughly chronological order)]
Comparing Concordance versus Freemix reveals cutoff around 13% correction [Plot axes: heterozygous concordance; estimated Freemix values]
Sample QC and Variant QC
Number of Detected SNPs per Sample by Flowcell [Plot axis: flowcell number (roughly chronological order)]
Heterozygous to Homozygous ratio per Sample by Flowcell [Plot axis: flowcell number (roughly chronological order)]
Transition to Transversion Ratio • transition: purine to purine (A<->G) or pyrimidine to pyrimidine (C<->T) • transversion: purine to pyrimidine or pyrimidine to purine
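As a small illustration of the definition, the ratio can be computed from a list of (reference, alternative) base pairs. The SNP list below is made up for the example, and the function names are illustrative; the sketch assumes every entry is a single-base substitution with different reference and alternative bases.

```python
def is_transition(ref, alt):
    """True for A<->G (purines) or C<->T (pyrimidines)."""
    pair = {ref.upper(), alt.upper()}
    return pair == {"A", "G"} or pair == {"C", "T"}


def titv_ratio(snps):
    """Transition to transversion ratio for a list of (ref, alt) SNPs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions


# hypothetical SNPs: four transitions and two transversions
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("T", "G")]
print(titv_ratio(snps))
```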
Transition to Transversion Ratio per Sample by Flowcell [Plot axis: flowcell number (roughly chronological order)]
QC and filtering results
Things to Remember • NGS: many short reads that might contain errors • coverage: indicates the number of independent reads that cover a base; needed to analyse a genome • FASTQ file: sequence + quality scores • BAM file: aligned reads • VCF file: called variants + annotation