
Genome-Wide Association Studies, Caitlin Collins and Thibaut Jombart



  1. Genome-Wide Association Studies Caitlin Collins, Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis using 30-10-2014

  2. Outline • Introduction to GWAS • Study design o GWAS design o Issues and considerations in GWAS • Testing for association o Univariate methods o Multivariate methods • Penalized regression methods • Factorial methods

  3. Genomics & GWAS

  4. The genomics revolution • Sequencing technology o 1977 – Sanger o 1995 – 1st bacterial genomes • < 10,000 bases per day per machine o 2003 – 1st human genome • > 10,000,000,000,000 bases per day per machine • GWAS publications o 2005 – 1st GWAS o Age-related macular degeneration o 2014 – 1,991 publications o 14,342 associations

  5. A few GWAS discoveries…

  6. So what is GWAS? • Genome-Wide Association Study o Looking for SNPs associated with a phenotype. • Purpose: o Explain • Understanding • Mechanisms • Therapeutics o Predict • Intervention • Prevention • Understanding not required

  7. Association • Definition o Any relationship between two measured quantities that renders them statistically dependent. • Heritability o The proportion of variance explained by genetics o P = G + E + G*E • Heritability > 0 [Figure: data matrix of n individuals (cases and controls) by p SNPs]

  8. Genomics & GWAS 8

  9. Why? • Environment, gene-environment interactions • Complex traits, small effects, rare variants • Gene expression levels • GWAS methodology?

  10. Study Design

  11. GWAS design • Case-Control o Well-defined “case” o Known heritability • Variations o Quantitative phenotypic data • E.g. height, biomarker concentrations o Explicit models • E.g. dominant or recessive

  12. Issues & Considerations • Data quality o 1% rule • Controlling for confounding o Sex, age, health profile o Correlation with other variables • Population stratification* • Linkage disequilibrium*

  13. Population stratification • Definition o “Population stratification” = population structure o Systematic difference in allele frequencies between sub-populations, possibly due to different ancestry • Problem o Violates assumed population homogeneity, independent observations → Confounding, spurious associations o Case population more likely to be related than Control population → Over-estimation of significance of associations

  14. Population stratification II • Solutions o Visualise • Phylogenetics • PCA o Correct • Genomic Control • Regression on principal components (PCA)
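The "Genomic Control" correction listed above can be illustrated with a minimal sketch. It assumes the usual formulation in which the inflation factor lambda is the median of the observed 1-df chi-squared statistics divided by the theoretical median (about 0.455), and every statistic is then divided by lambda. The function name and the test statistics are hypothetical and invented for illustration; the slides do not show an implementation.

```python
from statistics import median

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (lambda_gc, corrected statistics) for 1-df chi-squared tests."""
    lambda_gc = median(chi2_stats) / CHI2_1DF_MEDIAN
    # By convention the statistics are only deflated, never inflated,
    # so lambda values below 1 leave the statistics unchanged.
    scale = max(lambda_gc, 1.0)
    return lambda_gc, [s / scale for s in chi2_stats]


# Hypothetical per-SNP association statistics (not from the slides).
stats = [0.2, 0.5, 0.91, 1.3, 12.0]
lam, corrected = genomic_control(stats)
print(f"lambda = {lam:.2f}")
print([round(s, 2) for s in corrected])
```

A lambda well above 1 suggests that the bulk of the test statistics is inflated, which is what population stratification is expected to produce.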

  15. Linkage disequilibrium (LD) • Definition o Alleles at separate loci are NOT independent of each other • Problem? o Too much LD is a problem → noise >> signal o Some (predictable) LD can be beneficial → enables use of “marker” SNPs
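One common way to quantify the non-independence described above is the pair of statistics D and r². The sketch below assumes phased, biallelic haplotypes coded 0/1 and uses the standard definitions D = p_AB - p_A p_B and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). The function name and the haplotype data are hypothetical; the slides do not specify which LD measure is meant.

```python
def ld_stats(locus_a, locus_b):
    """Return (D, r_squared) for two biallelic loci given 0/1 haplotypes.

    Assumes both loci are polymorphic; a monomorphic locus would make
    the denominator zero.
    """
    n = len(locus_a)
    p_a = sum(locus_a) / n          # frequency of allele 1 at locus A
    p_b = sum(locus_b) / n          # frequency of allele 1 at locus B
    p_ab = sum(a * b for a, b in zip(locus_a, locus_b)) / n  # both alleles 1
    d = p_ab - p_a * p_b
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


# Hypothetical haplotypes for eight chromosomes (not from the slides).
snp_a = [1, 1, 1, 0, 0, 0, 1, 0]
snp_b = [1, 1, 1, 0, 0, 0, 0, 1]
d, r2 = ld_stats(snp_a, snp_b)
print(f"D = {d:.3f}, r^2 = {r2:.2f}")
```

An r² near 1 means one SNP is a good "marker" for the other; an r² near 0 means the two loci carry nearly independent information.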

  16. Testing for Association

  17. Methods for association testing • Standard GWAS o Univariate methods • Incorporating interactions o Multivariate methods • Penalized regression methods (LASSO) • Factorial methods (DAPC-based FS)

  18. Univariate methods • Approach o Individual test statistics o Correction for multiple testing • Variations o Testing • Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA • Gold Standard: Fisher’s exact test o Correcting • Bonferroni • Gold Standard: FDR [Figure: data matrix of n individuals (cases and controls) by p SNPs]
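The two-step univariate approach (one test per SNP, then a multiple-testing correction) can be sketched as follows. This is an illustration under stated assumptions, not the presenters' code: the two-sided Fisher's exact test is computed directly from the hypergeometric distribution, and "FDR" is taken to mean the Benjamini-Hochberg procedure, which the slide does not name. All function names and p-values are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def bonferroni(pvals, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, k = max rank with p_(k) <= k/m * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# One SNP: allele counts in cases (row 1) and controls (row 2).
p_single = fisher_exact_two_sided(3, 1, 1, 3)

# Hypothetical p-values for five SNPs (not from the slides).
pvals = [0.001, 0.012, 0.028, 0.045, 0.5]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print(f"Fisher p = {p_single:.4f}")
print("Bonferroni:", bonf)
print("BH (FDR):  ", bh)
```

On these invented p-values Bonferroni keeps one SNP and Benjamini-Hochberg keeps three, which illustrates why the slide lists "conservative" among the properties of the univariate approach.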

  19. Univariate – Strengths & weaknesses • Strengths o Straightforward o Computationally fast o Conservative o Easy to interpret • Weaknesses o Multivariate system, univariate framework o Effect size of individual SNPs may be too small o Marginal effects of individual SNPs ≠ combined effects

  20. What about interactions?

  21. Interactions • Epistasis o “Deviation from linearity under a general linear model” $Y_i = w_0 + w_1 A_i + w_2 B_i + w_3 A_i B_i$ • With p predictors, there are $\binom{p}{k} = \frac{p!}{k!\,(p-k)!}$ k-way interactions • p = 1,000,000 → about $5 \times 10^{11}$ pairs o That’s 500 BILLION possible pairwise interactions! • Need some way to limit the number of pairwise interactions considered…
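The count of k-way interactions is the binomial coefficient "p choose k", which is easy to evaluate exactly. The sketch below uses only the standard library; the function name is hypothetical. For reference, about 5 × 10^11 (500 billion) pairs corresponds to p = 1,000,000 predictors, while p = 10,000,000 gives about 5 × 10^13.

```python
from math import comb


def n_interactions(p, k):
    """Number of distinct k-way interactions among p predictors."""
    return comb(p, k)


for p in (1_000, 1_000_000, 10_000_000):
    print(f"p = {p:>10,}: {n_interactions(p, 2):,} pairwise interactions")
```

The count grows roughly as p² / 2 for pairs, and faster still for higher-order interactions, which is why an exhaustive search quickly becomes impractical.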

  22. Multivariate methods • Neural Networks o Genetic programming optimized neural networks o Parametric decreasing method • Penalized Regression o LASSO penalized regression o The elastic net o Ridge regression • Logic Trees o Logic feature selection o Monte Carlo Logic Regression o Logic regression o Modified Logic Regression-Gene Expression Programming o Genetic Programming for Association Studies • Bayesian Approaches o Bayesian partitioning o Bayesian Logistic Regression with Stochastic Search Variable Selection o Bayesian Epistasis Association Mapping • Set association approach • Factorial Methods o Sparse-PCA o Supervised-PCA o DAPC-based FS (snpzip) • Non-parametric Methods o Multi-factor dimensionality reduction method o Random forests o Restricted partitioning method o Combinatorial partitioning method o Odds-ratio-based MDR

  23. Multivariate methods (ii) • Penalized regression methods o LASSO penalized regression • Factorial methods o DAPC-based feature selection

  24. Penalized regression methods • Approach o Regression models multivariate association o Shrinkage estimation → feature selection • Variations o LASSO, Ridge, Elastic net, Logic regression • Gold Standard: LASSO penalized regression

  25. LASSO penalized regression • Regression o Generalized linear model (“glm”) • Penalization o L1 norm o Coefficients → 0 o Feature selection!
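How an L1 penalty drives coefficients to exactly zero can be seen in a small coordinate-descent sketch. This is a simplification and an assumption on several counts: it fits an ordinary linear model rather than the generalized linear model named on the slide, it omits the intercept (predictors and response are assumed centred), and the data, function names, and penalty value are all invented for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimise (1 / 2n) * ||y - X b||^2 + penalty * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Fit from all predictors except j (partial residual).
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / z
    return beta


# Hypothetical centred design with three orthogonal predictors.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# Response built as 2 * x1 + 0.1 * x2, so x3 has no effect at all.
y = [2.1, 1.9, -1.9, -2.1]
beta = lasso_coordinate_descent(X, y, penalty=0.5)
print([round(b, 3) for b in beta])
```

With this penalty the strong predictor is shrunk from 2 to 1.5, and the weak and null predictors are set exactly to zero. That is the feature-selection behaviour the slide refers to, and it also shows why the result depends on the user-chosen penalty.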

  26. LASSO – Strengths & weaknesses • Strengths o Stability o Interpretability o Likely to accurately select the most influential predictors o Sparsity • Weaknesses o Multicollinearity o Not designed for high-p o Computationally intensive o Calibration of penalty parameters (user-defined → variability) o Sparsity o NO p-values!

  27. Factorial methods • Approach o Place all variables (SNPs) in a multivariate space o Identify discriminant axis → best separation o Select variables with the highest contributions to that axis • Variations o Supervised-PCA, Sparse-PCA, DA, DAPC-based FS o Our focus: DAPC with feature selection (snpzip)

  28. DAPC-based feature selection [Figure: individuals (diseased “cases” and healthy “controls”) scored at alleles a–e; density of individuals along the discriminant axis; bar chart of each allele's contribution to the discriminant axis]

  29. DAPC-based feature selection • Where should we draw the line? o → Hierarchical clustering [Figure: density of individuals along the discriminant axis; contribution to the discriminant axis for alleles a–e, with the cut-off marked “?”]

  30. Hierarchical clustering (FS) [Figure: contribution to the discriminant axis for alleles a–e] Hooray!
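The idea of letting clustering decide where to "draw the line" can be sketched in one dimension. The example below is an assumption-laden illustration, not the snpzip implementation: it uses single-linkage clustering, for which cutting the tree into two groups is equivalent to splitting the sorted contributions at their largest gap. The linkage choice, the function name, and the contribution values for alleles a to e are all hypothetical.

```python
def select_by_largest_gap(contributions):
    """Return the names in the high-contribution cluster.

    contributions maps a variable name to its contribution to the
    discriminant axis. Splitting at the largest gap between sorted
    values matches a two-group single-linkage clustering in 1-D.
    """
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))
    return [name for name, _ in ranked[: cut + 1]]


# Hypothetical contributions for the five alleles shown on the slides.
loadings = {"a": 0.05, "b": 0.40, "c": 0.10, "d": 0.08, "e": 0.37}
selected = select_by_largest_gap(loadings)
print(selected)
```

Because the cut is driven by the data rather than by a fixed threshold, the number of selected variables changes from one data set to another.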

  31. DAPC – Strengths & weaknesses • Strengths o More likely to catch all relevant SNPs (signal) o Computationally quick o Good exploratory tool o Redundancy > sparsity • Weaknesses o Sensitive to n.pca o N.snps.selected varies o No “p-value” o Redundancy > sparsity

  32. Conclusions • Study design o GWAS design o Issues and considerations in GWAS • Testing for association o Univariate methods o Multivariate methods • Penalized regression methods • Factorial methods

  33. Thanks for listening!

  34. Questions?
