Genome-wide association studies
Fernando Rivadeneira MD PhD 1,2
1 Department of Internal Medicine 2 Department of Epidemiology
SNPs and Diseases Molecular School of Medicine
Monday, November 12th, 2018
Topic outline - Rationale GWAS Approach - Technology and QC - Study design - Study populations - Test for association - Population Stratification - Imputation (next talk) - Power - Phenotype definition - Follow-up GWAS signals
What is linkage disequilibrium (LD)?
• Co-occurrence of alleles at distinct/adjacent loci more frequently than expected from the allele frequencies and recombination rate
• Allelic association depends on:
1) physical distance (debate?)
2) population history of the sample
3) age of the mutation/allele
[Figure labels: SNP1 (G → A); SNP2 (C → T); SNP3 (G → A)]
Identifying common variants associated with common traits and diseases is often approached using the principles of linkage disequilibrium mapping
[Figure labels: M1 (SNP1); D (SNP, DIP, CNV); M2 (SNP2); common, complex (association); CD/CV; common]
Linkage disequilibrium (LD) is the basis of the haplotype block structure
What is a haplotype?
• Linear, ordered arrangement of alleles on a chromosome
• Combination of alleles of different polymorphisms on a single chromosome
[Figure labels: Ancestor; Present-day; Region in LD]
Genetic variation is structured into blocks of high LD:
LD statistics in practice
• r² is inversely related to the sample size of genetic association studies (required sample size scales with 1/r²): 1,000 cases and 1,000 controls at r² = 1.0 correspond to 1,250 cases and 1,250 controls at r² = 0.80
• D′ is related to recombination history: D′ ~ 1, no recombination; D′ < 1 (0.8), historical recombination
• D′ and r² are complementary: D′ can equal 1 even when r² is low (e.g. 0.02)
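The relationship between D′, r² and sample size can be made concrete with a small calculation. The sketch below uses the standard two-locus definitions of D, D′ and r²; the function names and the example frequencies are hypothetical and chosen only to reproduce the situation on the slide (D′ = 1 with r² around 0.02, and the 1/r² sample size scaling).

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele a AND allele b
    p_a  : frequency of allele a at locus 1
    p_b  : frequency of allele b at locus 2
    """
    d = p_ab - p_a * p_b
    # D' rescales D by its maximum possible magnitude given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def sample_size_for_tag(n_direct, r2):
    """Sample size needed when testing a tag in LD (r2) with the causal variant."""
    return n_direct / r2


# Hypothetical example: a rare allele (2%) that only ever occurs on the
# background of a common allele (50%), so no recombinant haplotype is seen.
d, d_prime, r2 = ld_stats(p_ab=0.02, p_a=0.5, p_b=0.02)

# 1,000 cases at r2 = 1.0 become 1,250 cases at r2 = 0.80
n_needed = sample_size_for_tag(1000, 0.80)
```

Here `d_prime` is 1 while `r2` stays close to 0.02, which is why the two statistics are read together rather than interchangeably.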
Haplotype structure in the absence of recombination
• In the absence of recombination, the shape of the tree and where mutations fall on it determine patterns of haplotype structure
• Two mutations on the same branch will be in complete association; mutations on different branches will have lower and often low association
[Figure labels: r² = 1; r² = 0.04]
LD information allows the selection of variants that “tag” variation in haplotypes
[Figure: six SNPs across haplotypes; tags: SNP 1, SNP 3 and SNP 6 (3 in total)]
Test for association: SNP 1, SNP 3 and SNP 6, each in high r² with the SNPs it tags
After Carlson et al. (2004) AJHG 74:106
LD information allows the selection of variants that “tag” variation in haplotypes
[Figure: six SNPs across haplotypes; tags: SNP 1 and SNP 3 (2 in total)]
Test for association:
• SNP 1 captures SNPs 1+2
• SNP 3 captures SNPs 3+5
• The “AG” haplotype captures SNPs 4+6
• Tags in a multi-marker test should be in high LD in order to avoid overfitting
Properties underlying the haplotype-block structure
• Regions of extensive linkage disequilibrium and reduced haplotype diversity
• Within a block, SNPs are not independent
• Haplotype-tag SNPs (htSNPs) are the subset of SNPs that can capture most of the haplotype diversity
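The idea of picking a subset of SNPs that captures the rest can be illustrated with a simple greedy selection on pairwise r². This is a minimal sketch under stated assumptions, not the procedure used by the author or by any specific tagging software: haplotypes are coded 0/1, the r² threshold of 0.8 is an assumed default, and the SNP names and data are hypothetical.

```python
def pairwise_r2(x, y):
    """r2 between two SNPs given 0/1 alleles on the same set of haplotypes."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n
    denom = px * (1 - px) * py * (1 - py)
    if denom == 0:
        return 0.0  # monomorphic SNP: r2 undefined, treated as 0 here
    d = pxy - px * py
    return d * d / denom


def greedy_tags(snps, threshold=0.8):
    """Pick tag SNPs until every SNP is captured at r2 >= threshold.

    snps: dict mapping SNP name -> list of 0/1 alleles per haplotype.
    At each step, the SNP capturing the most still-untagged SNPs is chosen.
    """
    names = list(snps)
    captures = {
        s: {t for t in names if t == s or pairwise_r2(snps[s], snps[t]) >= threshold}
        for s in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        best = max(names, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical block of four SNPs on six haplotypes
haplotypes = {
    "snp1": [0, 0, 0, 1, 1, 1],
    "snp2": [0, 0, 0, 1, 1, 1],  # identical to snp1 (r2 = 1)
    "snp3": [0, 1, 0, 1, 0, 1],
    "snp4": [1, 0, 1, 0, 1, 0],  # mirror image of snp3 (r2 = 1)
}
tags = greedy_tags(haplotypes)
```

With these data two tags suffice for four SNPs, because each pair of SNPs carries the same information.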
Genetic architecture is fully determined by allele frequency and penetrance (effect size) of variants
Rivadeneira & Makitie, TEM 2016
Genome-wide association (GWA) combines the strongest properties of linkage (hypothesis-free) and association (power) designs
[Figure: Genetic architecture of traits. x-axis: frequency of genetic variant (rare to common); y-axis: effect size (small to big). Regions: rare, monogenic (linkage); few examples; common, complex (association); probably real (impossible to identify with current methods). GWA: hypothesis-free approach]
Modified from McCarthy et al., Nat Rev Genet 2008
Genome-wide association (GWA) has been facilitated by the advent of:
• Of 3,000,000,000 bases in the human genome, ~10,000,000 positions show variation
• ~4,000,000 are catalogued as common variation
• ~2,200,000 in CEU
• ~80-90% are captured by typing 500K markers
*from Mark McCarthy
Microarray technology allows genotyping hundreds of thousands of SNPs per individual in a single effort…
[Figure: genotype calls (AA, AB, BB) for SNP 1 through SNP 500,000]
… which, in the setting of large epidemiological studies, allows the simultaneous testing of 2.5 million (imputed) markers for association with traits
[Figure: genotype calls (AA, AB, BB) for SNP 1 through SNP 500,000]
This first step of the GWA approach is merely a hypothesis-generating phase (with very few exceptions)
[Figure: genotype calls (AA, AB, BB) for SNP 1 through SNP 500,000; plot with x-axis: Chromosomes (1 to X)]
The crucial step is replication, which allows building up evidence for association (genome-wide significance)
• A p<0.05 threshold results in ~20,000 hypotheses
• Follow-up set: top SNPs
• Meta-analysis: full datasets
[Figure: genotype calls (AA, AB, BB) for SNP 1 through SNP 500,000; plot with x-axis: Chromosomes (1 to X)]
Only a select number of SNPs are expected to achieve REPLICATION, reaching a genome-wide significant level (i.e. 5 × 10⁻⁸)
Population stratification
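The gap between a nominal threshold and the genome-wide significant level is a matter of simple arithmetic. The sketch below assumes 500,000 typed markers (the array size used in these slides) and, as a separate assumption not stated here, roughly 1,000,000 independent common-variant tests for the Bonferroni correction; the function names are hypothetical.

```python
def expected_null_hits(n_tests, alpha):
    """Number of tests expected to pass the threshold by chance alone."""
    return n_tests * alpha


def bonferroni_threshold(alpha, n_independent_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_independent_tests


# Nominal p < 0.05 across 500,000 typed markers
nominal_hits = expected_null_hits(500_000, 0.05)

# Assumed ~1,000,000 independent tests across the genome
gws = bonferroni_threshold(0.05, 1_000_000)
```

With exactly 500,000 markers the nominal threshold gives 25,000 chance findings, the same order of magnitude as the ~20,000 hypotheses quoted for the discovery phase, while the corrected threshold comes out at 5 × 10⁻⁸.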
Quality Control Genotyping
Rotterdam Study datasets: QC methods description
SNP QC filters:
• MAF > 1%
• Call rate > 98%
• pHWE > 1 × 10⁻⁶
Genotyped (GT) SNPs:
• RS-I: 512,849
• RS-II: 466,389
• RS-III: 514,073
Imputed SNPs: 2,543,887
Sample exclusions:
• Sample call rate < 98%
• Missing DNA
• Gender mismatch
• Excess autosomal heterozygosity
• Duplicates or family relations (IBS > 97%)
• Ethnic outliers (IBS distances > 4 SD)
• Missing traits
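The SNP-level filters above (MAF > 1%, call rate > 98%, HWE p > 1 × 10⁻⁶) can be sketched as a per-SNP check on genotype counts. This is an illustration only: the function name and the example counts are hypothetical, and the Hardy-Weinberg test used here is the simple 1-df chi-square goodness-of-fit test, which is an assumption (the slides do not say which HWE test was applied; exact tests are common in practice).

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing,
           maf_min=0.01, call_rate_min=0.98, hwe_p_min=1e-6):
    """Apply MAF, call-rate and HWE filters to one SNP's genotype counts."""
    n_called = n_aa + n_ab + n_bb
    total = n_called + n_missing
    if n_called == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 0.0, "passed": False}

    call_rate = n_called / total
    p = (2 * n_aa + n_ab) / (2 * n_called)  # frequency of allele A
    maf = min(p, 1 - p)

    # Expected genotype counts under Hardy-Weinberg equilibrium
    expected = (n_called * p * p, 2 * n_called * p * (1 - p), n_called * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    hwe_p = math.erfc(math.sqrt(chi2 / 2))

    passed = maf > maf_min and call_rate > call_rate_min and hwe_p > hwe_p_min
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "passed": passed}


# Hypothetical SNPs
in_equilibrium = snp_qc(360, 480, 160, 0)      # counts match HWE for p = 0.6
no_heterozygotes = snp_qc(500, 0, 500, 0)      # strong HWE deviation
poorly_called = snp_qc(342, 456, 152, 50)      # 95% call rate
```

A SNP with no heterozygotes at all fails on HWE even though its MAF and call rate look fine, which is the typical signature of a genotyping artefact.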
Types of study designs (common variants)
- Relatedness base
  - Family (extended pedigrees, pedigrees, trios, sibs)
  - Unrelated individuals
- Sampling base
  - Population-based
  - Disease oriented (case/control, proband families)
- Epidemiological base
  - Case/control
  - Cross-sectional
  - Cohort (follow-up)
- Phenotype base
  - Case enrichment
  - Extreme truncates
  - Super/shared controls
- Genetics base
  - Genetic load enrichment
  - Isolates (extended LD)
  - Ethnicity
  - Admixture
- Genotype platform base
  - Staged approach (Gen)
  - Joint analysis (Imp)
Example types of GWA studies
• Disease-oriented case/control studies
  - WTCCC, FUSION
• Disease-oriented population-based studies
  - FRAMINGHAM HEART STUDY
• Population-based studies
  - ROTTERDAM STUDY
  - Generation R STUDY
• Mega-GWAS
  - UKBIOBANK
  - MVP
Most (if not all) GWA activities occur within CONSORTIA comprising tens to hundreds of thousands of participants
[Logos: CHARGE; Rotterdam Study; GEnetic Factors of OSteoporosis; GENETIC INVESTIGATIONS OF ANTHROPOMETRIC TRAITS]
Different genetic models do influence the power of the analysis but are difficult to determine a priori => To avoid multiple-testing problems, the first genetic analyses are usually run using additive models, which preserve power across different scenarios
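Under an additive model each genotype is coded as the number of copies of the effect allele (0, 1 or 2) and the trait is regressed on that count. The sketch below is a minimal ordinary least-squares version for a quantitative trait without covariates; the function name and the toy data are hypothetical, and real analyses would use dedicated software and adjust for covariates.

```python
import math


def additive_test(dosages, trait):
    """Regress a quantitative trait on allele count (0/1/2).

    Returns (beta, se): the per-allele effect and its standard error.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))

    beta = sxy / sxx                    # change in trait per extra allele copy
    alpha = mean_y - beta * mean_x      # intercept
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(dosages, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se


# Hypothetical data: six individuals, two per genotype class
dosages = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 1.5, 1.7, 2.0, 2.2]
beta, se = additive_test(dosages, trait)
z = beta / se  # test statistic for the per-allele effect
```

A single coefficient per SNP means a single test per SNP, which is the multiple-testing advantage of starting with the additive model.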
Statistical Methods
Traits: Disease state, or QT in natural units
  QT -> standardized age-adjusted residuals from gender-stratified regression: Trait = α + β₁·Age + β₂·Age²
Imputation: MACH, IMPUTE, BIM-BAM, PLINK
  r² > 0.3, ratio Obs/Exp variance > 0.01, MAF > 0.01, HWE?
  Minor allele from HapMap CEU (+) strand => Reference
Analysis: performed by each cohort: MACH2QTL/BIN, SNPTEST, ProbABEL, PLINK
  Adjustment for population stratification => genomic control, λ < 1.05, corrected SE = SE × √λ
Meta-analysis: METAL, PLINK, MetABEL
  Standard: inverse-variance weighted fixed effects
  Heterogeneity: random effects for variants with I² > 50%
Significance: GWS α < 5 × 10⁻⁸ after double GC correction
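The genomic control and meta-analysis steps listed above can be sketched in a few lines. The SE correction (SE × √λ) and the inverse-variance weighted fixed-effects model are taken from the slide; computing λ as the median chi-square statistic divided by 0.4549 (the median of a 1-df chi-square) and I² from Cochran's Q are the standard definitions, assumed here since the slide does not spell them out. Function names and cohort values are hypothetical, and this is not how METAL or the other listed tools are invoked.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_lambda(chi2_stats):
    """Genomic inflation factor from genome-wide chi-square statistics."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def gc_correct_se(se, lam):
    """Genomic control as on the slide: corrected SE = SE * sqrt(lambda)."""
    return se * math.sqrt(lam)


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis.

    Returns (pooled_beta, pooled_se, i2_percent).
    """
    weights = [1 / se ** 2 for se in ses]
    total_w = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    pooled_se = math.sqrt(1 / total_w)

    # Cochran's Q and I2 (share of variation due to heterogeneity, in %)
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled_beta, pooled_se, i2


# Hypothetical per-cohort results for one SNP
cohort_betas = [0.10, 0.20]
cohort_ses = [0.05, 0.05]
cohort_lambdas = [1.02, 1.04]

corrected_ses = [gc_correct_se(se, lam) for se, lam in zip(cohort_ses, cohort_lambdas)]
pooled_beta, pooled_se, i2 = fixed_effects_meta(cohort_betas, cohort_ses)
```

"Double GC correction" on the slide corresponds to applying the SE correction once per cohort before pooling and once more to the pooled results.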