genotype imputation in genome wide association studies

Genotype Imputation in Genome-wide Association Studies Fernando - PowerPoint PPT Presentation

Genotype Imputation in Genome-wide Association Studies. Fernando Rivadeneira (1,2); (1) Department of Internal Medicine, (2) Department of Epidemiology. Course "SNPs and Human Diseases", Rotterdam, November 12th, 2018


  1. Genotype Imputation in Genome-wide Association Studies. Fernando Rivadeneira (1,2); (1) Department of Internal Medicine, (2) Department of Epidemiology. Course “SNPs and Human Diseases”, Rotterdam, November 12th, 2018

  2. Imputation facilitates meta-analysis and has been key to the success of GWAS
     • Large sample sizes are needed to detect moderate effects
     • One convenient approach: meta-analysis
     • However, different GWAS may use different genotyping chips (examining different sets of SNPs)
       – E.g., 3-way meta-analysis on lipid concentrations
         • FUSION: Illumina 300K
         • DGI and SardiNIA: Affymetrix 500K
       – Illumina 300K and Affymetrix 500K have <10% (45K) SNPs in common

  3. Why is imputation important?
     [Figure: GWAS data with missing genotypes allow an association study of genotyped data only; imputation to a reference panel yields GWAS imputed data, which allow an association study of imputed data.]
     Adapted from Marchini & Howie, Nat. Rev. 2010

  4. Imputation allows integration of results across different platforms (example: HMGA region)

  5. LDL-C association illustrates how sparse the SNP density will be when limited to shared markers
     [Figure: three panels of LDL-C association results]
     • SNPs typed by all 3 groups (~45K)
     • Affy panel SNPs (~321K)
     • Imputed SNPs (~2.25 million)
     Willer et al, Nat Genet 40: 161-9, 2008

  6. Increase in genome coverage is also facilitated by imputation
     • HapMap estimated ~10M common SNPs in the human genome (>3M genotyped)
     • GWAS examine 100K – 1M SNPs
     • These SNPs can be proxies for many others
       – E.g. Illumina 300K covers ~78% of HapMap
     • What do we achieve with imputation?

  7. With imputation we fill in the gaps (missing information) using a reference panel
     • Start with …
       – 100K – 1M SNPs from a GWAS
       – ~3M SNPs genotyped in HapMap…
     • ~40-60 million SNPs sequenced in the reference (HRC)
     • Then impute genotypes in the study samples for ALL the markers present in the reference set but absent from the study

  8. Imputation works by determining “shared” regions of the genome and copying over
     • Rationale: even unrelated individuals typically share short stretches of chromosome
     • Heuristic procedure: identify the shared stretches of chromosome and impute by copying over

  9. Using the LD relationships between SNPs, genotypes can be predicted (imputed) for SNPs not measured experimentally
     Observed genotypes (study sample):
       . . . . A . . . . . . . A . . . . A . . .
       . . . . G . . . . . . . C . . . . A . . .

  10. Observed genotypes are phased and compared to the set of phased haplotypes of the reference
     Observed genotypes (study sample):
       . . . . A . . . . . . . A . . . . A . . .
       . . . . G . . . . . . . C . . . . A . . .
     Reference haplotypes (HapMap / 1K Genomes Project / HRC):
       C G A G A T C T C C T T C T T C T G T G C
       C G A G A T C T C C C G A C C T C A T G G
       C C A A G C T C T T T T C T T C T G T G C
       C G A A G C T C T T T T C T T C T G T G C
       C G A G A C T C T C C G A C C T T A T G C
       T G G G A T C T C C C G A C C T C A T G C
       C G A G A T C T C C C G A C C T T G T G C
       C G A G A C T C T T T T C T T T T G T A C
       C G A G A C T C T C C G A C C T C G T G C
       C G A A G C T C T T T T C T C C T G T G C

  11. A probabilistic search for mosaics among the reference is done, looking for similar stretches of flanking haplotypes
     [Figure: the observed genotypes of the study sample shown above the ten phased reference haplotypes (HapMap / 1K Genomes Project / HRC).]
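
The comparison against the reference can be sketched in a few lines of Python. This is only an illustration: the panel is transcribed from the reference haplotypes printed on the slide, the two observed rows are treated as already-phased haplotypes, and the names `REFERENCE`, `STUDY`, and `best_matches` are hypothetical. It counts agreements at the three typed markers and does not perform the probabilistic search itself.

```python
# Ten reference haplotypes, transcribed from the slide figure (21 markers each).
REFERENCE = [
    "CGAGATCTCCTTCTTCTGTGC",
    "CGAGATCTCCCGACCTCATGG",
    "CCAAGCTCTTTTCTTCTGTGC",
    "CGAAGCTCTTTTCTTCTGTGC",
    "CGAGACTCTCCGACCTTATGC",
    "TGGGATCTCCCGACCTCATGC",
    "CGAGATCTCCCGACCTTGTGC",
    "CGAGACTCTTTTCTTTTGTAC",
    "CGAGACTCTCCGACCTCGTGC",
    "CGAAGCTCTTTTCTCCTGTGC",
]

# Typed markers of the study sample as {0-based position: allele}.
# Assumption: each observed row is treated as one phased haplotype.
STUDY = {
    "hap1": {4: "A", 12: "A", 17: "A"},
    "hap2": {4: "G", 12: "C", 17: "A"},
}


def best_matches(observed, reference):
    """Return (best score, indices of reference haplotypes reaching it)."""
    counts = [
        sum(hap[pos] == allele for pos, allele in observed.items())
        for hap in reference
    ]
    top = max(counts)
    return top, [i for i, c in enumerate(counts) if c == top]


for name, observed in STUDY.items():
    score, indices = best_matches(observed, REFERENCE)
    print(name, "best score", score, "reference rows", indices)
```

For this panel, three reference haplotypes agree with the first study haplotype at all three typed markers, and none agrees with the second at all three. A single best match is therefore either ambiguous or unavailable, which is the motivation for searching over mosaics probabilistically.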

  12. The next step is to impute the missing genotypes from the reference, using an algorithm that models each haplotype conditional on all others
     Study sample after imputation (upper case = observed, lower case = imputed):
       c g a g A t c t c c c g A c c t c A t g g
       c g a a G c t c t t t t C t c t c A t g c
     [Figure: the same ten phased reference haplotypes (HapMap / 1K Genomes Project / HRC) shown below the study sample.]

  13. Imputation increases power by increasing sample size (individuals with missing genotypes), allowing higher LD and decreasing error
     [Figure: the imputed study haplotypes shown above the reference haplotypes.]
     • Example: genotypes GG, GA, AA are coded 0, 1, 2 as the best guess; the allele dose can be fractional (e.g. 0.68)
     • Best-guess information is never used with low imputation quality scores!
     • Little information is always better than NO information!
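
The difference between a best guess and an allele dose can be made concrete with a small sketch. It assumes the imputation step provides posterior probabilities for the three genotypes; the function names and the example probabilities are hypothetical and were chosen only so that the dose comes out near the 0.68 shown on the slide.

```python
def allele_dosage(p_gg, p_ga, p_aa):
    """Expected count of the A allele given genotype probabilities."""
    total = p_gg + p_ga + p_aa
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    # GG, GA, AA carry 0, 1, 2 copies of the A allele respectively.
    return 0 * p_gg + 1 * p_ga + 2 * p_aa


def best_guess(p_gg, p_ga, p_aa):
    """Most probable genotype; discards the uncertainty kept by the dosage."""
    probs = {"GG": p_gg, "GA": p_ga, "AA": p_aa}
    return max(probs, key=probs.get)


# Hypothetical posterior probabilities for one individual at one SNP.
example = (0.40, 0.52, 0.08)
print(best_guess(*example))
print(round(allele_dosage(*example), 2))
```

Here the best guess is GA (coded 1), while the dose keeps the information that GG is almost as likely.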

  14. Imputation works, providing reliable test statistics
     • Michigan Age-related Macular Degeneration Study
     • Dense genotyping in a 123 kb region overlapping CFH
     • Used 11 tagSNPs to predict 84 SNPs
     • Imputed genotypes differ from the experimental ones only <1% of the time
     [Figure: "Comparison of Test Statistics", experimental versus imputed test statistics, both axes 0 to 150.]
     Li et al, Nat Genet 38: 1049-54, 2006

  15. Imputation works, providing reliable test statistics and effect estimates

     Allele frequency       P-value                     Odds ratio
     Imputed  Genotyped     Imputed      Genotyped      Imputed  Genotyped
     .024     .021          2.5 × 10⁻⁶   6.3 × 10⁻⁶     2.57     2.20
     .543     .540          5.3 × 10⁻⁶   1.1 × 10⁻⁵     1.33     1.31
     .114     .136          2.0 × 10⁻⁵   4.1 × 10⁻⁵     1.47     1.41
     .494     .490          6.6 × 10⁻⁵   5.5 × 10⁻⁵     1.28     1.28
     .927     .924          7.5 × 10⁻⁵   9.0 × 10⁻⁵     1.72     1.65
     .744     .753          1.4 × 10⁻⁴   3.9 × 10⁻⁴     1.33     1.30
     .289     .291          1.7 × 10⁻⁴   1.2 × 10⁻⁴     1.27     1.28
     .970     .973          1.9 × 10⁻⁴   3.6 × 10⁻⁵     2.47     2.58
     .401     .361          6.3 × 10⁻⁴   1.6 × 10⁻³     1.26     1.22
     .817     .816          9.5 × 10⁻⁴   1.0 × 10⁻³     1.31     1.30
     .605     .605          9.9 × 10⁻⁴   1.2 × 10⁻³     1.23     1.22

     Scott et al, Science 316: 1341-5, 2007

  16. The imputation algorithm uses a hidden Markov model (MACH)
     • Hidden state S_m: the pair of contributing reference haplotypes at marker m
     • Data G_m: observed genotypes at marker m
     • Goal: infer S_m
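
A minimal sketch of such a hidden Markov model follows. It is not MACH's implementation: to stay short it imputes a single, already-phased haplotype, so the hidden state is one reference haplotype rather than the pair described on the slide. The function name, the toy panel, and the `switch` and `error` parameters are assumptions made for illustration.

```python
def impute_haplotype(observed, reference, switch=0.05, error=0.01):
    """Posterior allele probabilities per marker via forward-backward.

    observed  : string, '.' marks an untyped marker
    reference : list of equal-length haplotype strings (the hidden states)
    switch    : assumed probability of re-drawing the copied haplotype
    error     : assumed probability that a typed allele mismatches its template
    """
    n_ref, n_mark = len(reference), len(observed)

    def emission(state, m):
        if observed[m] == ".":
            return 1.0  # untyped markers carry no information
        return 1.0 - error if reference[state][m] == observed[m] else error

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: probability of each state given data up to marker m.
    fwd = [normalise([emission(s, 0) for s in range(n_ref)])]
    for m in range(1, n_mark):
        prev_total = sum(fwd[-1])
        fwd.append(normalise([
            ((1 - switch) * fwd[-1][s] + switch * prev_total / n_ref)
            * emission(s, m)
            for s in range(n_ref)
        ]))

    # Backward pass: probability of data after marker m given each state.
    bwd = [[1.0] * n_ref for _ in range(n_mark)]
    for m in range(n_mark - 2, -1, -1):
        weighted = [bwd[m + 1][s] * emission(s, m + 1) for s in range(n_ref)]
        total = sum(weighted)
        bwd[m] = normalise([
            (1 - switch) * weighted[s] + switch * total / n_ref
            for s in range(n_ref)
        ])

    # Combine both passes and sum state posteriors by the allele they carry.
    result = []
    for m in range(n_mark):
        post = normalise([fwd[m][s] * bwd[m][s] for s in range(n_ref)])
        alleles = {}
        for s in range(n_ref):
            allele = reference[s][m]
            alleles[allele] = alleles.get(allele, 0.0) + post[s]
        result.append(alleles)
    return result


# Toy panel (hypothetical): only the first haplotype matches both typed ends.
panel = ["ACGTA", "CCGAT", "CTAAT"]
posterior = impute_haplotype("A...A", panel)
print("".join(max(p, key=p.get) for p in posterior))
```

The per-marker dictionaries hold allele probabilities, so either a best guess or a fractional dose can be derived from the same output.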

  17. Output from the imputation (MACH)

                        Haplotype 1            Haplotype 2
     Iteration 1        g A c c t c A t g g    t C t t t c A t g g
     Iteration 2        g A c c t c A t g g    t C c c t c A t g c
     Iteration 3        g A c c t c A t g g    t C c c t c A t g c
     Iteration 4        g A c c t c A t g g    t C c c t c A t g c
     Consensus          g A c c t c A t g g    t C c c t c A t g c

     Quality score      1 1 3/4 3/4 1 1 1 1 1 3/4
     Reference allele   g A c c t c A t g g
     Dosage             1 1 7/4 7/4 2 2 2 2 2 5/4

     • Consensus: the best-guess, or most frequently occurring, genotype guess across all iterations
     • Quality score: proportion of iterations where the guessed genotype agrees with the consensus
     • Dosage: estimated fractional count of the reference allele
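
The three summaries can be recomputed from the four sampled iterations shown on the slide. The sketch below is a plain re-derivation under the slide's definitions, not MACH code; the variable names are hypothetical and exact fractions are used so the values can be compared directly with the slide.

```python
from collections import Counter
from fractions import Fraction

# Haplotype pairs sampled in each iteration, transcribed from the slide.
iterations = [
    ("gAcctcAtgg", "tCtttcAtgg"),
    ("gAcctcAtgg", "tCcctcAtgc"),
    ("gAcctcAtgg", "tCcctcAtgc"),
    ("gAcctcAtgg", "tCcctcAtgc"),
]
reference_allele = "gAcctcAtgg"

n_iter = len(iterations)
consensus, quality, dosage = [], [], []
for m, ref in enumerate(reference_allele):
    # Unordered genotype at marker m in each iteration.
    genotypes = [tuple(sorted((h1[m], h2[m]))) for h1, h2 in iterations]
    top_genotype, top_count = Counter(genotypes).most_common(1)[0]
    consensus.append(top_genotype)
    # Share of iterations agreeing with the consensus genotype.
    quality.append(Fraction(top_count, n_iter))
    # Average number of reference-allele copies across iterations.
    copies = sum((h1[m] == ref) + (h2[m] == ref) for h1, h2 in iterations)
    dosage.append(Fraction(copies, n_iter))

print("quality:", " ".join(str(q) for q in quality))
print("dosage: ", " ".join(str(d) for d in dosage))
```

Only the markers where iteration 1 disagrees with the later iterations (the third, fourth, and tenth) receive a quality score below 1 and a fractional dosage.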

  18. With the advent of large sequenced reference sets, the imputation pipeline has been redefined (pre-phasing and imputation)
     • For large projects (n > 1,000 individuals), phasing each chromosome can take a while
     • Split the data into chunks of 2000 markers (+50 in flanking regions)
     • (MACH) Phase each chunk independently
     • Ligate the chunks to reconstruct the chromosome (using the flanking regions)
     • (Minimac) Impute the pre-phased chunks with a particular reference (e.g. different versions of 1000G)
     ⇒ Prefer splitting by markers rather than by individuals
     ⇒ http://genome.sph.umich.edu/wiki/Minimac:_1000_Genomes_Imputation_Cookbook
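
The chunking step can be sketched as follows. This is an illustration of the idea only, not the cookbook's own script: the function name and the returned layout are hypothetical, and the defaults simply reuse the 2000-marker core and 50-marker flank quoted above.

```python
def split_into_chunks(n_markers, core=2000, flank=50):
    """Split marker indices 0..n_markers-1 into cores with overlapping flanks.

    Each chunk is described by half-open index ranges: the core markers it is
    responsible for, and the wider range (core plus flanks) that gets phased.
    """
    chunks = []
    for core_start in range(0, n_markers, core):
        core_end = min(core_start + core, n_markers)
        chunks.append({
            "core": (core_start, core_end),
            "with_flanks": (
                max(0, core_start - flank),
                min(n_markers, core_end + flank),
            ),
        })
    return chunks


# Example: a chromosome with 4500 genotyped markers.
for chunk in split_into_chunks(4500):
    print(chunk)
```

Neighbouring chunks share their flanking markers, which is what allows the phased chunks to be ligated back into one chromosome.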

  19. With the advent of large sequenced reference sets, the imputation pipeline has been redefined (pre-phasing and imputation)
     [Diagram: Genotyping → Data QC-ing → Phasing & Imputing → Analyzing]

  20. With the advent of large sequenced reference sets, the imputation pipeline has been redefined (pre-phasing and imputation)
     [Diagram: the single pipeline (Genotyping → Data QC-ing → Phasing & Imputing → Analyzing) shown next to the redefined pipeline, in which Genotyping → Data QC-ing → Phasing is followed by repeated Imputing → Analyzing steps.]

  21. Output files (MACH)
     • Dosage file

       ID          TYPE      SNP1    SNP2   SNP3   SNP4    SNP5    ..  n
       RS3->232    ML_DOSE   2       2      2      2       2
       RS3->2921   ML_DOSE   2       1      2      2       2
       RS3->3370   ML_DOSE   1.999   1      2      2       2
       RS3->3542   ML_DOSE   2       1      2      1.968   1.998

     • Info file

       SNP          Al1   Al2   Freq1    MAF      Quality   Rsq
       rs12828708   A     G     0.9603   0.0397   0.9707    0.7232
       rs10880855   T     C     0.5149   0.4851   0.9991    0.9985
       rs7979218    G     A     0.9673   0.0327   0.9826    0.7903
       rs7315793    C     T     0.9537   0.0463   0.9554    0.6538
       rs4768098    A     G     0.6954   0.3046   0.9984    0.9971
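
An info file of this shape is typically used to decide which imputed SNPs to carry into the analysis. The sketch below parses the five rows shown on the slide from an in-memory string; the function name is hypothetical and the `min_rsq` cut-off of 0.7 is an arbitrary value chosen for demonstration, not a recommendation from the slides.

```python
INFO_TEXT = """\
SNP Al1 Al2 Freq1 MAF Quality Rsq
rs12828708 A G 0.9603 0.0397 0.9707 0.7232
rs10880855 T C 0.5149 0.4851 0.9991 0.9985
rs7979218 G A 0.9673 0.0327 0.9826 0.7903
rs7315793 C T 0.9537 0.0463 0.9554 0.6538
rs4768098 A G 0.6954 0.3046 0.9984 0.9971
"""


def well_imputed_snps(info_text, min_rsq):
    """Return the SNP names whose Rsq is at least min_rsq."""
    lines = info_text.strip().splitlines()
    header = lines[0].split()
    kept = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        if float(row["Rsq"]) >= min_rsq:
            kept.append(row["SNP"])
    return kept


# min_rsq=0.7 is an assumed, illustrative threshold.
print(well_imputed_snps(INFO_TEXT, min_rsq=0.7))
```

With this cut-off only rs7315793 (Rsq 0.6538) is dropped. Note that it still has a Quality value above 0.95, so the two columns do not rank SNPs identically.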

  22. Accuracy of the imputation process needs to be assessed
     • Imputation accuracy: concordance rate between imputed genotypes and experimental genotypes
     • Measures of accuracy
       – Estimated accuracy: proportion of rounds where the imputed genotype agrees with the consensus (best-guess genotype) across all rounds
       – Estimated r²
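
Both kinds of measure can be sketched briefly. The concordance rate follows the definition above directly. For the estimated r², the slide gives no formula; the version below uses the ratio of the observed dosage variance to the variance expected under Hardy-Weinberg proportions, which is the form commonly attributed to MACH's Rsq, and should be read as an assumption. Function names and example data are hypothetical.

```python
def concordance_rate(imputed, experimental):
    """Share of genotypes (coded 0/1/2) where imputed equals experimental.

    Positions where the experimental genotype is None are skipped.
    """
    pairs = [(i, e) for i, e in zip(imputed, experimental) if e is not None]
    return sum(i == e for i, e in pairs) / len(pairs)


def estimated_r2(dosages):
    """Observed dosage variance divided by 2p(1-p) (assumed Rsq definition)."""
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2  # estimated allele frequency
    if p <= 0 or p >= 1:
        return 0.0  # monomorphic: no variance to explain
    variance = sum((d - mean) ** 2 for d in dosages) / n
    return variance / (2 * p * (1 - p))


# Hypothetical best-guess genotypes versus experimental genotypes.
print(concordance_rate([0, 1, 2, 1], [0, 1, 2, 2]))
# Hypothetical dosages: confident calls versus completely uninformative ones.
print(estimated_r2([0, 1, 2, 1]))
print(estimated_r2([1, 1, 1, 1]))
```

Concordance needs experimental genotypes for comparison, whereas the estimated measures can be computed from the imputation output alone.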
