Genotype Imputation in Genome-wide Association Studies
Fernando Rivadeneira 1,2
1 Department of Internal Medicine
2 Department of Epidemiology
Course “SNP’s and Human Diseases”
Rotterdam, November 12th, 2018
Imputation facilitates meta-analysis and has been key to the success of GWAS
• Large sample sizes are needed to detect moderate effects
• One convenient approach: meta-analysis
• However, different GWAS may use different genotyping chips (examining a different set of SNPs)
  – E.g., 3-way meta-analysis on lipid concentrations
    • FUSION: Illumina 300K
    • DGI and SardiNIA: Affymetrix 500K
  – Illumina 300K and Affymetrix 500K have <10% (45K) SNPs in common
Why is imputation important?
[Figure: GWAS data with missing genotypes → association study of genotyped data; imputation to a reference panel → GWAS imputed data → association study of imputed data]
Adapted from Marchini & Howie, Nat. Rev. 2010
Imputation allows integration of results across different platforms (example: HMGA region)
LDL-C association illustrates how sparse the SNP density is when limited to shared markers
[Figure panels: SNPs typed by all 3 groups (~45K); Affy panel SNPs (~321K); imputed SNPs (~2.25 million)]
Willer et al, Nat Genet 40: 161-9, 2008
Increase in genome coverage is also facilitated by imputation
• HapMap estimated ~10M common SNPs in the human genome (>3M genotyped)
• GWAS examine 100K – 1M SNPs
• These SNPs can be proxies for many others
  – E.g. Illumina 300K covers ~78% of HapMap
• What do we achieve with imputation?
With imputation we fill in the gaps (missing information) using a reference panel
• Start with …
  – 100K – 1M SNPs from a GWAS
  – ~3M SNPs genotyped in HapMap…
• ~40-60 million SNPs sequenced in the reference (HRC)
• Then impute genotypes in the study samples for ALL markers present in the reference set but absent from the study
Imputation works by determining “shared” regions of the genome and copying over
• Rationale: even unrelated individuals typically share short stretches of chromosome
• Heuristic procedure: identify the shared stretches of chromosome and impute by copying over
Using the LD relationships between SNPs, genotypes can be predicted (imputed) for SNPs not measured experimentally
Observed genotypes (study sample):
. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .
Observed genotypes are phased and compared to the set of phased haplotypes of the reference
Observed genotypes (study sample):
. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .
Reference haplotypes (HapMap / 1000 Genomes Project / HRC):
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T C C T G T G C
A probabilistic search for mosaics among the reference haplotypes is done, looking for similar stretches of flanking haplotypes
Observed genotypes (study sample):
. . . . A . . . . . . . A . . . . A . . .
. . . . G . . . . . . . C . . . . A . . .
Reference haplotypes (HapMap / 1000 Genomes Project / HRC):
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T C C T G T G C
The next step is to impute the missing genotypes from the reference, using an algorithm that models each haplotype conditional on all others
Observed genotypes (study sample; uppercase = observed, lowercase = imputed):
c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t c t c A t g c
Reference haplotypes (HapMap / 1000 Genomes Project / HRC):
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T C C T G T G C
Imputation increases power by increasing sample size (including individuals with missing genotypes), allowing higher LD, and decreasing error
Observed genotypes (uppercase = observed, lowercase = imputed):
c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t c t c A t g c
Reference haplotypes:
C G A G A T C T C C T T C T T C T G T G C
C G A G A T C T C C C G A C C T C A T G G
C C A A G C T C T T T T C T T C T G T G C
C G A A G C T C T T T T C T T C T G T G C
C G A G A C T C T C C G A C C T T A T G C
T G G G A T C T C C C G A C C T C A T G C
C G A G A T C T C C C G A C C T T G T G C
C G A G A C T C T T T T C T T T T G T A C
C G A G A C T C T C C G A C C T C G T G C
C G A A G C T C T T T T C T C C T G T G C
Example: best guess GG / GA / AA coded 0 / 1 / 2; allele dose 0.68
• Best-guess information is never used with low imputation quality scores!
• Little information is always better than NO information!
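The difference between a best-guess genotype and an allele dose can be sketched in a few lines of Python. The three genotype probabilities below are hypothetical values, chosen only so that the dose matches the 0.68 shown in the example; the function names are illustrative and not taken from any imputation package.

```python
def allele_dose(p_gg, p_ga, p_aa):
    """Expected count of the A allele given genotype probabilities.

    Uses the 0 / 1 / 2 coding for GG / GA / AA shown on the slide.
    """
    assert abs(p_gg + p_ga + p_aa - 1.0) < 1e-9, "probabilities must sum to 1"
    return 0 * p_gg + 1 * p_ga + 2 * p_aa


def best_guess(p_gg, p_ga, p_aa):
    """Most probable genotype, coded 0 / 1 / 2."""
    probs = [p_gg, p_ga, p_aa]
    return probs.index(max(probs))


# Hypothetical probabilities (not from the slides).
probs = (0.40, 0.52, 0.08)
dose = allele_dose(*probs)    # fractional count, keeps the uncertainty
guess = best_guess(*probs)    # hard call, discards the uncertainty
print(f"allele dose = {dose:.2f}, best guess = {guess}")
```

Here the best guess is the heterozygote (1), while the dose of 0.68 still reflects the substantial probability of GG, which is why the dose is the safer quantity to analyze when imputation quality is low.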
Imputation works, providing reliable test statistics
• Michigan Age-related Macular Degeneration Study
• Dense genotyping in a 123Kb region overlapping CFH
• Used 11 tagSNPs to predict 84 SNPs
• Imputed genotypes differ from the experimental ones <1% of the time
[Figure: Comparison of Test Statistics, experimental vs. imputed, both axes 0 to 150]
Li et al, Nat Genet 38: 1049-54, 2006
Imputation works, providing reliable test statistics and effect estimates
Allele frequency      P-value                     Odds ratio
Imputed  Genotyped    Imputed      Genotyped      Imputed  Genotyped
.024     .021         2.5 x 10^-6  6.3 x 10^-6    2.57     2.20
.543     .540         5.3 x 10^-6  1.1 x 10^-5    1.33     1.31
.114     .136         2.0 x 10^-5  4.1 x 10^-5    1.47     1.41
.494     .490         6.6 x 10^-5  5.5 x 10^-5    1.28     1.28
.927     .924         7.5 x 10^-5  9.0 x 10^-5    1.72     1.65
.744     .753         1.4 x 10^-4  3.9 x 10^-4    1.33     1.30
.289     .291         1.7 x 10^-4  1.2 x 10^-4    1.27     1.28
.970     .973         1.9 x 10^-4  3.6 x 10^-5    2.47     2.58
.401     .361         6.3 x 10^-4  1.6 x 10^-3    1.26     1.22
.817     .816         9.5 x 10^-4  1.0 x 10^-3    1.31     1.30
.605     .605         9.9 x 10^-4  1.2 x 10^-3    1.23     1.22
Scott et al, Science 316: 1341-5, 2007
Imputation algorithm uses a hidden Markov model (MACH)
• Hidden state S_m: the pair of contributing reference haplotypes at marker m
• Data G_m: observed genotypes at marker m
• Goal: infer S_m
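To make the hidden-state idea concrete, here is a minimal Python sketch of a copying HMM. It is deliberately simplified and is not MACH's implementation: it assumes a haploid target that copies from a single reference haplotype at each marker (MACH's state is a pair of haplotypes, as stated above), and the constant `switch` and `err` probabilities, the function names, and the toy data are all hypothetical.

```python
def copy_posteriors(ref, obs, switch=0.05, err=0.01):
    """Forward-backward posteriors P(S_m = k | obs) for a haploid copying HMM.

    ref: equal-length reference haplotype strings.
    obs: target haplotype, with '.' marking an untyped marker.
    switch, err: illustrative switch and mismatch probabilities.
    """
    n_ref, n_markers = len(ref), len(obs)

    def emit(k, m):
        if obs[m] == ".":              # untyped marker carries no information
            return 1.0
        return 1.0 - err if ref[k][m] == obs[m] else err

    def normalise(row):
        total = sum(row)
        return [v / total for v in row]

    # Forward pass: stay on the same haplotype, or jump to a random one.
    fwd = [normalise([emit(k, 0) / n_ref for k in range(n_ref)])]
    for m in range(1, n_markers):
        prev = fwd[-1]
        jump = switch * sum(prev) / n_ref
        fwd.append(normalise([emit(k, m) * ((1 - switch) * prev[k] + jump)
                              for k in range(n_ref)]))

    # Backward pass with the same transition model.
    bwd = [[1.0] * n_ref for _ in range(n_markers)]
    for m in range(n_markers - 2, -1, -1):
        w = [emit(k, m + 1) * bwd[m + 1][k] for k in range(n_ref)]
        jump = switch * sum(w) / n_ref
        bwd[m] = normalise([(1 - switch) * w[k] + jump for k in range(n_ref)])

    return [normalise([f * b for f, b in zip(fwd[m], bwd[m])])
            for m in range(n_markers)]


def impute_allele(ref, post, m):
    """Allele probabilities at marker m, summed over the copied haplotypes."""
    probs = {}
    for k, hap in enumerate(ref):
        probs[hap[m]] = probs.get(hap[m], 0.0) + post[m][k]
    return probs


# Toy data: the target matches the first reference haplotype at both flanks.
ref = ["ACA", "CGC"]
post = copy_posteriors(ref, "A.A")
middle = impute_allele(ref, post, 1)
print(middle)
```

Because the typed flanking markers agree with the first reference haplotype, nearly all posterior weight at the untyped middle marker goes to that haplotype's allele.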
Output from the imputation (MACH)
Iteration 1        g A c c t c A t g g
                   t C t t t c A t g g
Iteration 2        g A c c t c A t g g
                   t C c c t c A t g c
Iteration 3        g A c c t c A t g g
                   t C c c t c A t g c
Iteration 4        g A c c t c A t g g
                   t C c c t c A t g c
Consensus          g A c c t c A t g g
                   t C c c t c A t g c
Quality Score      1 1 3/4 3/4 1 1 1 1 1 3/4
Reference Allele   g A c c t c A t g g
Dosage             1 1 7/4 7/4 2 2 2 2 2 5/4
• Consensus: the best guess, i.e. the most frequently occurring genotype guess across all iterations
• Quality Score: proportion of iterations where the guessed genotype agrees with the consensus
• Dosage: estimated fractional count of the reference allele
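The three summaries on this slide can be reproduced from the four sampled haplotype pairs. The Python sketch below is an illustration of the definitions given above, not MACH code; the function name `summarise` is hypothetical, and genotypes are treated as unordered allele pairs.

```python
from collections import Counter

# The four sampled haplotype pairs and the reference alleles from the slide.
iterations = [
    ("gAcctcAtgg", "tCtttcAtgg"),
    ("gAcctcAtgg", "tCcctcAtgc"),
    ("gAcctcAtgg", "tCcctcAtgc"),
    ("gAcctcAtgg", "tCcctcAtgc"),
]
reference_allele = "gAcctcAtgg"


def summarise(iterations, reference_allele):
    """Per-marker consensus genotype, quality score and reference-allele dosage."""
    consensus, quality, dosage = [], [], []
    n_iter = len(iterations)
    for m, ref in enumerate(reference_allele):
        # Unordered genotype sampled at marker m in each iteration.
        genotypes = [tuple(sorted((h1[m], h2[m]))) for h1, h2 in iterations]
        best, count = Counter(genotypes).most_common(1)[0]
        consensus.append(best)
        # Share of iterations agreeing with the consensus genotype.
        quality.append(count / n_iter)
        # Average number of copies of the reference allele.
        dosage.append(sum(g.count(ref) for g in genotypes) / n_iter)
    return consensus, quality, dosage


consensus, quality, dosage = summarise(iterations, reference_allele)
print("quality:", quality)
print("dosage: ", dosage)
```

The markers where iteration 1 disagrees with the later iterations are exactly the ones with a quality score of 3/4 and a fractional dosage.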
With the advent of large sequenced reference sets the imputation pipeline has been redefined (pre-phasing and imputation)
• For large projects (n > 1,000 individuals) phasing each chromosome can take a while
• Split data into chunks of 2000 markers (+50 in flanking regions)
• (MACH) Phase each chunk independently
• Ligate chunks to reconstruct the chromosome (using the flanking regions)
• (Minimac) Impute pre-phased chunks with a particular reference (e.g. different versions of 1000G)
⇒ Prefer splitting by markers rather than by individuals
⇒ http://genome.sph.umich.edu/wiki/Minimac:_1000_Genomes_Imputation_Cookbook
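The chunking step above can be sketched as follows. This Python snippet only computes marker index ranges; it is not part of MACH or Minimac. It assumes the 50 flanking markers are added on both sides of each 2000-marker core, which the slide does not state explicitly, and the name `make_chunks` is hypothetical.

```python
def make_chunks(n_markers, core=2000, flank=50):
    """Split marker indices 0..n_markers-1 into overlapping chunks.

    Each chunk has a core of up to `core` markers plus up to `flank`
    markers on each side (an assumption), used later to ligate
    neighbouring chunks. Ranges are half-open: [start, end).
    """
    chunks = []
    for core_start in range(0, n_markers, core):
        core_end = min(core_start + core, n_markers)
        chunks.append({
            "start": max(0, core_start - flank),
            "end": min(n_markers, core_end + flank),
            "core_start": core_start,
            "core_end": core_end,
        })
    return chunks


# Example: a chromosome with 4500 genotyped markers.
chunks = make_chunks(4500)
for c in chunks:
    print(c)
```

The cores tile the chromosome without gaps or overlap, while the flanks give neighbouring chunks shared markers from which the phase can be matched during ligation.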
[Figure: combined pipeline: Genotyping → Data QC-ing → Phasing & Imputing → Analyzing. Pre-phasing pipeline: Genotyping → Data QC-ing → Phasing → Imputing → Analyzing, with the Imputing → Analyzing steps repeated three times]
Output files (MACH)
Dosage file
ID          TYPE      SNP1    SNP2   SNP3   SNP4    SNP5    .. .. .. n
RS3->232    ML_DOSE   2       2      2      2       2
RS3->2921   ML_DOSE   2       1      2      2       2
RS3->3370   ML_DOSE   1.999   1      2      2       2
RS3->3542   ML_DOSE   2       1      2      1.968   1.998
Info file
SNP          Al1   Al2   Freq1    MAF      Quality   Rsq
rs12828708   A     G     0.9603   0.0397   0.9707    0.7232
rs10880855   T     C     0.5149   0.4851   0.9991    0.9985
rs7979218    G     A     0.9673   0.0327   0.9826    0.7903
rs7315793    C     T     0.9537   0.0463   0.9554    0.6538
rs4768098    A     G     0.6954   0.3046   0.9984    0.9971
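A typical use of the info file is to drop poorly imputed SNPs before analysis. The Python sketch below parses the five rows shown above from an in-memory string. The threshold of 0.7 is an arbitrary value picked for this example, not a recommendation from the slides, and `read_info` and `well_imputed` are hypothetical helper names.

```python
INFO_TEXT = """\
SNP Al1 Al2 Freq1 MAF Quality Rsq
rs12828708 A G 0.9603 0.0397 0.9707 0.7232
rs10880855 T C 0.5149 0.4851 0.9991 0.9985
rs7979218 G A 0.9673 0.0327 0.9826 0.7903
rs7315793 C T 0.9537 0.0463 0.9554 0.6538
rs4768098 A G 0.6954 0.3046 0.9984 0.9971
"""


def read_info(text):
    """Parse a whitespace-delimited info table into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        record = dict(zip(header, line.split()))
        for key in ("Freq1", "MAF", "Quality", "Rsq"):
            record[key] = float(record[key])
        rows.append(record)
    return rows


def well_imputed(rows, min_rsq):
    """Names of SNPs whose Rsq is at least min_rsq."""
    return [r["SNP"] for r in rows if r["Rsq"] >= min_rsq]


rows = read_info(INFO_TEXT)
kept = well_imputed(rows, min_rsq=0.7)   # 0.7 is an illustrative cut-off
print(kept)
```

With this cut-off only rs7315793 (Rsq 0.6538) is excluded; note that the low-frequency SNPs have the lowest Rsq even though their Quality values are all above 0.95.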
Accuracy of the imputation process needs to be assessed
• Imputation accuracy: concordance rate between imputed genotypes and experimental genotypes
• Measures of accuracy
  – Estimated accuracy: proportion of rounds where the imputed genotype agrees with the consensus (best-guess genotype) across all rounds
  – Estimated r²
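When experimental genotypes are available for a SNP, accuracy can be checked directly. The Python sketch below computes the concordance rate and an empirical r² (squared Pearson correlation between dosage and experimental genotype). This empirical r² is a stand-in for illustration: the estimated r² reported by the software is computed without experimental genotypes. The data are made up and the function names are hypothetical.

```python
def concordance(best_guess, experimental):
    """Fraction of individuals whose best-guess genotype equals the experimental one."""
    pairs = list(zip(best_guess, experimental))
    return sum(b == e for b, e in pairs) / len(pairs)


def r_squared(dosages, experimental):
    """Squared Pearson correlation between dosages and experimental genotypes."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(experimental) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, experimental))
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in experimental)
    return sxy * sxy / (sxx * syy)


# Made-up data for one SNP in eight individuals (0 / 1 / 2 allele counts).
experimental = [0, 1, 2, 1, 2, 0, 1, 2]
dosages = [0.1, 0.9, 1.8, 1.4, 2.0, 0.0, 1.6, 1.9]
best_guess = [round(d) for d in dosages]

rate = concordance(best_guess, experimental)
r2 = r_squared(dosages, experimental)
print(f"concordance = {rate:.3f}, empirical r2 = {r2:.3f}")
```

In this toy example one of the eight best-guess calls is wrong, so concordance penalizes it fully, whereas r² on the dosage scale gives partial credit for a dose that is close to the true genotype.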