
Efficient algorithms for ascertaining markers for controlling for population substructure



  1. Efficient algorithms for ascertaining markers for controlling for population substructure. Oscar Lao, Department of Forensic Molecular Biology, Erasmus Medical Center (Rotterdam). New Jersey 2009, 4/27/2009.

  2. Workflow. 1. Human population substructure: How to detect it? How much? Where does it come from? 2. Why does it matter? 3. Ancestry Sensitive Markers (ASMs) / Ancestry Informative Markers (AIMs): hypothesis driven; particular individual clusters are preferred. ASMs; PhenoASMs.

  3. Population substructure: how much there is and how much can be detected. Two sides of the same coin (Plato’s cave myth).

  4. Population substructure: DETECTION – STRUCTURE – BAPS – FRAPPE – GENELAND – PCA/MDS + K-means – Neural Networks – … Sometimes results are NOT reproducible.

  5. Population substructure: HOW MUCH? Which type? Phenotype; Genotype (Y chromosome, mtDNA, autosomal markers). Where? Worldwide; Regional (I will focus on Europe).

  6. Phenotypic substructure

  7. Y chromosome

  8. mtDNA

  9. Classical markers (Cavalli-Sforza et al. 1994)

  10. CEPH-HGDP panel: 1064 samples, 51 human populations of global distribution.

  11. Autosomal STRs. 377 autosomal markers (Rosenberg et al., Science 2002). 993 autosomal markers (Rosenberg et al., PLoS Genetics 2005). [Figure: clusters labelled Europe, M. East, C/S-Asia, E-Asia, Oceania, America, Africa.]

  12. Autosomal SNPs. 650,000 SNPs (FRAPPE); 550,000 SNPs (STRUCTURE); haplotypes. Li et al., Science 2008; Jakobsson et al., Nature 2008.

  13. A set of European populations: 23 populations, 500 Affy Array.

  14. Autosomal SNPs in Europe: 300,000 SNPs, r = 0.7 (Lao and Lu et al., Current Biology 2008).

  15. Autosomal SNPs in Europe (Novembre et al., Nature 2008).

  16. Autosomal SNPs in Europe. K = 2; Admixture. Correlation with latitude: R² = 0.86. Correlation with longitude: R² = 0.01.

  17. Autosomal SNPs in Europe. World vs. Europe: Anayet peak (2574 m), Pyrenees vs. Keukenhof garden (-2 m), Netherlands.

  18. Non-random distribution of population substructure: EDAR (positive selection in Asians).

  19. Non-random distribution of population substructure: LCT (positive selection in North European populations). Lao and Lu et al., Current Biology 2008.

  20. Demography shapes the population substructure. Cavalli-Sforza & Feldman, Nature Genetics 2003; Simoni et al., AJHG 2000.

  21. Selection shapes the population substructure • Selective pressures within the species (locus specific): lactose tolerance, malaria resistance, human pigmentation, …

  22. Selection shapes the population substructure • Population substructure & pigmentation (5 SNPs). [Figure: MDS plot with groups Europe, Africa, Middle East, Oceania, America, C/S Asia, E Asia, N Africa; legend in Biasutti skin pigmentation units: 5 to 12, 12 to 16, 16 to 20, 20 to 24, 24 to 30.] Lao et al., Ann Hum Genet 2007.

  23. Why population substructure: confounding factor. [Figure: CASES vs. CONTROLS, alleles A and G.]

  24. Population substructure: improving the detection (Plato’s cave myth). CHANGE THE ALGORITHM FOR DETECTING POPULATION SUBSTRUCTURE.

  25. Population substructure: improving the detection (Plato’s cave myth). INCREASE THE RESOLUTION TO SEE THE OBJECTS.

  26. AIMs/ASMs • Markers that capture most of the genetic ancestry – Estimate ancestry – Reduce the number of markers needed to test for genetic homogeneity • Time cost (clustering algorithms can be extremely computationally intensive) • Economic cost (e.g. exclude individuals BEFORE doing the GWA)

  27. Strategies to ascertain ASMs • Based on the existing diversity between individuals (e.g. Paschou et al. 2008) • Based on predefined groups of individuals – Not phenotype linked • Large genetic distances • Signals of positive selection – Phenotype linked • Covariates with the phenotype of interest

  28. A basic algorithm to ascertain ASMs • Use a statistic to quantify the amount of differentiation between populations • Compute the OVERALL non-redundant amount of I_n for a set of SNPs • Take the best combination of markers from all the possible combinations • Repeat the process until the information of the set of markers is maximal

  29. A statistic to ascertain ASMs: informativeness for assignment, $I_n(Q;J) = \sum_{j=1}^{N}\left(-p_j \log p_j + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}\right)$. Am J Hum Genet. 2003 Dec;73(6):1402-22

  30. A statistic to ascertain ASMs • How much information a marker contains about the ancestry of one individual (measured in nats) • Ranges from 0 to the natural logarithm of the number of clusters and it is proportional to the number of differentiated clusters
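A minimal Python sketch of the single-marker statistic on slide 29, assuming the definitions used by Rosenberg et al. (2003): p_ij is the frequency of allele j in population i, and p_j is its mean over the K populations. The function name `informativeness` and the toy frequencies are illustrative and do not come from the slides.

```python
import math

def informativeness(freqs):
    """I_n for one marker. freqs[i][j] = frequency of allele j in population i."""
    K = len(freqs)          # number of populations (clusters)
    N = len(freqs[0])       # number of alleles at the marker
    total = 0.0
    for j in range(N):
        # Mean frequency of allele j across the K populations.
        p_j = sum(freqs[i][j] for i in range(K)) / K
        if p_j > 0:
            total -= p_j * math.log(p_j)
        for i in range(K):
            p_ij = freqs[i][j]
            if p_ij > 0:    # treat 0 * log(0) as 0
                total += (p_ij / K) * math.log(p_ij)
    return total            # in nats, because math.log is the natural log

# Toy check of the range described on slide 30, with K = 2 populations.
fixed = informativeness([[1.0, 0.0], [0.0, 1.0]])          # fixed difference
uninformative = informativeness([[0.5, 0.5], [0.5, 0.5]])  # identical frequencies
```

With two populations the fixed-difference marker reaches the upper bound ln 2, and the marker with identical frequencies gives 0.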

  31. A statistic to ascertain ASMs • Computes the non-redundant amount of information when considering more than one marker • Requires computing the frequency of ALL the allelic combinations when considering more than 1 locus

  32. A way to compute I_n • Problem: The number of combinations increases exponentially with the number of markers. – Number of allelic combinations considering 50 SNPs: $2^{50} = 1{,}125{,}899{,}906{,}842{,}624$

  33. A way to compute I_n: $I_n(Q;J) = \sum_{j=1}^{N}\left(-p_j \log p_j + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}\right)$, which can be written with entropies as $I_n(Q;J) = \sum_{j=1}^{N}\left(H_j - \sum_{i=1}^{K}\frac{H_{ij}}{K}\right)$. By applying the Asymptotic Equipartition Property of entropy: $H \approx -\frac{1}{N}\sum_{i=1}^{N}\ln(p_i)$
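The sampling idea on slide 33 can be illustrated with a short Python sketch. It is not the author's implementation. It assumes biallelic SNPs that are independent within each population, so the within-population entropy is an exact sum over loci and only the pooled entropy is estimated as the sample mean of -ln p over randomly drawn haplotypes. All names (`pooled_prob`, `within_entropy`, `estimate_in`) are hypothetical.

```python
import math
import random

def pooled_prob(hap, freqs):
    """Probability of a haplotype in the equal-weight mixture of populations."""
    total = 0.0
    for pop in freqs:
        p = 1.0
        for allele, f in zip(hap, pop):
            p *= f if allele == 1 else 1.0 - f
        total += p
    return total / len(freqs)

def within_entropy(pop):
    """Exact entropy (nats) of one population, assuming independent loci."""
    h = 0.0
    for f in pop:
        for q in (f, 1.0 - f):
            if q > 0:
                h -= q * math.log(q)
    return h

def estimate_in(freqs, n_samples=20000, seed=0):
    """freqs[i][l] = frequency of allele 1 at SNP l in population i."""
    rng = random.Random(seed)
    K = len(freqs)
    acc = 0.0
    for _ in range(n_samples):
        pop = freqs[rng.randrange(K)]                       # pick a population
        hap = [1 if rng.random() < f else 0 for f in pop]   # draw a haplotype
        acc -= math.log(pooled_prob(hap, freqs))            # accumulate -ln p
    h_pooled = acc / n_samples                              # sample-mean entropy
    h_within = sum(within_entropy(pop) for pop in freqs) / K
    return h_pooled - h_within

# Two markers, each fixed between two populations, are fully redundant:
# together they still carry ln 2 nats, not 2 * ln 2.
redundant = estimate_in([[1.0, 1.0], [0.0, 0.0]], n_samples=100)
```

Each sample costs time proportional to the number of populations and loci, so the 2^L allelic combinations are never enumerated.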

  34. A method to ascertain ASMs • Problem: Considering 8,000 markers, ascertaining the best set of 50 markers requires computing: $N_{\text{combinations}} = \frac{8{,}000!}{50!\,(8{,}000-50)!} \approx 4 \times 10^{130}$
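Both counts quoted on slides 32 and 34 can be checked with exact integer arithmetic. The snippet below is only a verification aid; the variable names are illustrative, and `math.comb` needs Python 3.8 or later.

```python
import math

# Allelic combinations for 50 biallelic SNPs (slide 32).
n_haplotypes = 2 ** 50

# Ways to choose a set of 50 markers out of 8,000 (slide 34).
n_subsets = math.comb(8000, 50)

print(n_haplotypes)          # 1125899906842624
print(f"{n_subsets:.2e}")    # order of magnitude of the search space
```

An exhaustive search over a space of this size is not feasible, which motivates a heuristic search.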

  35. A method to ascertain ASMs. [Flowchart: population of answers; select best answers (> I_n); random mating; recombination/mutation; next generation.]
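The cycle on slide 35 can be sketched as a small genetic algorithm in Python. This is a generic illustration, not the implementation behind the published results. The function `genetic_search`, its parameters, and the toy fitness (a sum of made-up per-marker scores standing in for the multilocus I_n) are all assumptions.

```python
import random

def genetic_search(n_markers, k, fitness, pop_size=40, generations=60,
                   mutation_rate=0.2, seed=0):
    """Search for a k-marker subset with high fitness (hypothetical sketch)."""
    rng = random.Random(seed)
    # Population of answers: random k-marker subsets.
    population = [tuple(sorted(rng.sample(range(n_markers), k)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Select best answers: keep the fitter half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)              # random mating
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))           # recombination
            if rng.random() < mutation_rate:           # mutation: swap a marker
                child.remove(rng.choice(sorted(child)))
                outside = [m for m in range(n_markers) if m not in child]
                child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))
        population = parents + children                # next generation
    return max(population, key=fitness)

# Toy fitness: made-up per-marker scores, NOT the real I_n of a marker set.
score_rng = random.Random(1)
scores = [score_rng.random() for _ in range(200)]

def toy_fitness(subset):
    return sum(scores[m] for m in subset)

best = genetic_search(200, 10, toy_fitness)
```

Because the fitter half survives each generation, the best fitness never decreases. With a real multilocus I_n as the fitness, redundant markers would be penalised, which the additive toy score cannot show.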

  36. ASMs for continental differentiation using Affy 10k. SNP ascertainment (10 SNPs): YCC panel (76 human individuals, 21 sampling localities), 10k Affymetrix Array (~9000 SNPs after excluding X-SNPs & missing SNPs). Reproducibility of geographic structure in a different dataset (10 SNPs): CEPH-HGDP panel (1064 samples, 51 human populations of global distribution). Test for signatures of positive selection (EHH test; most informative 5 SNPs): Perlegen database (3 human populations, ~1,500,000 SNPs). Lao et al. Am J Hum Genet. 2006 Apr;78(4):680-90

  37. ASMs for continental differentiation using Affy 10k. The genetic algorithm was applied increasing every time the number of selected SNPs. [Plot: percentage of information (0 to 100) against number of SNPs (1 to 10).] Selected SNPs in the final 10 SNPs run:

      | Marker name | Chromosome | Gene name | I_n (%) from 4 groups, YCC panel | I_n (%) from 7 groups, CEPH-HGDP |
      |---|---|---|---|---|
      | rs722869 | 14 | VRK1 | 29.066 | 7.960 |
      | rs1858465 | 17 | | 25.637 | 9.228 |
      | rs1876482 | 2 | LOC442008 | 24.589 | 10.290 |
      | rs1344870 | 3 | | 22.810 | 11.074 |
      | rs1363448 | 5 | PCDHGB1 | 19.418 | 4.552 |
      | rs952718 | 2 | ABCA12 | 18.739 | 9.472 |
      | rs2352476 | 7 | | 18.317 | 5.603 |
      | rs714857 | 11 | | 18.083 | 6.157 |
      | rs1823718 | 15 | | 17.845 | 5.451 |
      | rs735612 | 15 | RYR3 | 14.315 | 5.530 |

      Lao et al. Am J Hum Genet. 2006 Apr;78(4):680-90

  38. ASMs for continental differentiation using Affy 10k. [Figure: 993 autosomal markers compared with 10 SNPs (no admixture; admixture); groups: America, Oceania, E-Asia, C/S-Asia, Europe, M. East, Africa.] Lao et al. Am J Hum Genet. 2006 Apr;78(4):680-90

  39. ASMs for continental differentiation using HapMap III. K = 6 (1000 randomly ascertained markers, Admixture, 10,000 burn-in, 10,000 retained simulations). K = 5 (50 markers, Admixture, 500,000 burn-in, 500,000 retained simulations). K = 6 (100 markers, Admixture, 100,000 burn-in, 100,000 retained simulations). [Figure labels: Africa Sub-Saharan, Europe, India, East Asia, Am.]

  40. ASMs for continental differentiation using Illumina 650k. 25 ascertained markers. [Figure: PCA with groups E Asia, Oceania, Africa, Europe, Middle East, Central Asia, Amerindians.]

  41. ASMs for continental differentiation using Illumina 650k • CEPH 550,000 SNPs. K = 5 (50 ascertained markers, Admixture, 500,000 burn-in, 500,000 retained simulations)
