Efficient algorithms for ascertaining markers for controlling for population substructure
Oscar Lao
Department of Forensic Molecular Biology, Erasmus Medical Center (Rotterdam)
New Jersey 2009, 4/27/2009
Workflow
1. Human population substructure
   - How to detect it?
   - How much?
   - Where does it come from?
2. Why does it matter?
3. Ancestry Sensitive Markers (ASMs) / Ancestry Informative Markers (AIMs)
   - Hypothesis driven. Particular individual clusters are preferred
   - ASMs
   - PhenoASMs
Population substructure
How much there is and how much can be detected: the two sides of the same coin (Plato’s cave myth)
Population substructure: DETECTION
– STRUCTURE
– BAPS
– FRAPPE
– GENELAND
– PCA/MDS + K-means
– Neural Networks
– …
Sometimes results are NOT reproducible
Population substructure: HOW MUCH?
Which type?
- Phenotype
- Genotype: Y chromosome, mtDNA, autosomal markers
Where?
- Worldwide
- Regional (I will focus on Europe)
Phenotypic substructure
Y chromosome
mtDNA
Classical markers (Cavalli-Sforza et al. 1994)
CEPH-HGDP panel: 1064 samples, 51 human populations of global distribution
Autosomal STRs
377 autosomal markers (Rosenberg et al., Science 2002)
993 autosomal markers (Rosenberg et al., PLoS Genetics 2005)
[Figure: cluster plots; groups labeled Europe, M. East, C/S-Asia, E-Asia, O, Ameri, Africa]
Autosomal SNPs
650,000 SNPs (FRAPPE): Li et al., Science 2008
550,000 SNPs (STRUCTURE), haplotypes: Jakobsson et al., Nature 2008
A set of European populations: 23 populations, Affy 500K Array
Autosomal SNPs in Europe
300,000 SNPs, r = 0.7
Lao and Lu et al., Current Biology 2008
Autosomal SNPs in Europe (Novembre et al., Nature 2008)
Autosomal SNPs in Europe
K = 2; Admixture
Correlation with latitude: R^2 = 0.86
Correlation with longitude: R^2 = 0.01
Autosomal SNPs in Europe
World: Anayet peak (2574 m), Pyrenees
Europe: Keukenhof garden (-2 m), Netherlands
Non-random distribution of population substructure: EDAR (positive selection in Asians)
Non-random distribution of population substructure: LCT (positive selection in North European populations)
Lao and Lu et al., Current Biology 2008
Demography shapes the population substructure
Cavalli-Sforza & Feldman, Nature Genetics 2003
Simoni et al., AJHG 2000
Selection shapes the population substructure
• Selective pressures within the species (locus specific): lactose tolerance, malaria resistance, human pigmentation, …
Selection shapes the population substructure
• Population substructure & pigmentation (5 SNPs)
[Figure: MDS plot; groups labeled Europe, Africa, Middle East, Oceania, America, C/S Asia, E Asia, N Africa; legend in Biasutti skin pigmentation units: 5 to 12, 12 to 16, 16 to 20, 20 to 24, 24 to 30]
Lao et al., Ann Hum Genet 2007
Why population substructure: confounding factor
[Figure: CASES and CONTROLS, with alleles A and G]
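The confounding effect can be illustrated with a small worked calculation. This is only a sketch: the two subpopulations, their allele frequencies, and the case/control counts below are hypothetical numbers chosen for illustration, not data from the talk. Within each subpopulation the frequency of allele A is the same in cases and controls (no true association), yet the pooled frequencies differ because the subpopulations contribute unequally to the two groups.

```python
def pooled_allele_freq(strata, group):
    """Frequency of allele A in the pooled cases or controls.

    strata: list of dicts with keys 'freq_A', 'cases', 'controls'.
    Within a stratum, freq_A is assumed identical for cases and controls.
    """
    total = sum(s[group] for s in strata)
    return sum(s["freq_A"] * s[group] for s in strata) / total

# Hypothetical subpopulations (illustrative values only).
strata = [
    {"freq_A": 0.8, "cases": 800, "controls": 200},
    {"freq_A": 0.2, "cases": 200, "controls": 800},
]

freq_cases = pooled_allele_freq(strata, "cases")
freq_controls = pooled_allele_freq(strata, "controls")
print(f"A in cases: {freq_cases:.2f}, A in controls: {freq_controls:.2f}")
```

Under these assumed numbers, allele A looks associated with disease status even though the marker has no effect within either subpopulation.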
Population substructure: improving the detection
Plato’s cave myth: CHANGE THE ALGORITHM FOR DETECTING POPULATION SUBSTRUCTURE
Population substructure: improving the detection
Plato’s cave myth: INCREASE THE RESOLUTION TO SEE THE OBJECTS
AIMs/ASMs
• Markers that capture most of the genetic ancestry
  – Estimate ancestry
  – Reduce the number of markers to test for genetic homogeneity
    • Time cost (clustering algorithms can be extremely computationally intensive)
    • Economic cost (i.e., exclude individuals BEFORE doing the GWA)
Strategies to ascertain ASMs
• Based on the existing diversity between individuals (e.g., Paschou et al. 2008)
• Based on predefined groups of individuals
  – No phenotype linked
    • Large genetic distances
    • Signals of positive selection
  – Phenotype linked
    • Covariation with the phenotype of interest
A basic algorithm to ascertain ASMs
• Use a statistic to quantify the amount of differentiation between populations
• Compute the OVERALL non-redundant amount of I_n across a set of SNPs
• Take the best combination of markers from all the possible combinations
• Repeat the process until the information of the set of markers is maximal
A statistic to ascertain ASMs: informativeness for assignment
\[ I_n(Q;J) = \sum_{j=1}^{N} \left( -p_j \log p_j + \sum_{i=1}^{K} \frac{p_{ij}}{K} \log p_{ij} \right) \]
Am J Hum Genet. 2003 Dec;73(6):1402-22
A statistic to ascertain ASMs
• How much information a marker contains about the ancestry of one individual (measured in nats)
• Ranges from 0 to the natural logarithm of the number of clusters, and increases with the number of differentiated clusters
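A minimal sketch of the single-marker statistic, assuming the allele frequencies per population are already known. The function name and the input layout (one row of allele frequencies per population) are choices made here for illustration. Natural logarithms are used, so the result is in nats.

```python
import math

def informativeness(freqs):
    """I_n for one marker.

    freqs[i][j] is the frequency of allele j in population i
    (K populations, N alleles; each row sums to 1).
    """
    K = len(freqs)
    N = len(freqs[0])
    total = 0.0
    for j in range(N):
        # p_j: mean frequency of allele j across the K populations
        p_j = sum(freqs[i][j] for i in range(K)) / K
        if p_j > 0:
            total -= p_j * math.log(p_j)
        for i in range(K):
            p_ij = freqs[i][j]
            if p_ij > 0:  # 0 * log(0) is taken as 0
                total += (p_ij / K) * math.log(p_ij)
    return total

# A fixed difference between two populations reaches the maximum, ln(2).
print(informativeness([[1.0, 0.0], [0.0, 1.0]]), math.log(2))
# Identical frequencies carry no ancestry information.
print(informativeness([[0.3, 0.7], [0.3, 0.7]]))
```

The two calls illustrate the stated range: 0 when the populations are indistinguishable at the marker, and ln(K) when every population carries its own private, fixed allele.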
A statistic to ascertain ASMs
• Computes the non-redundant amount of information when considering more than one marker
• Requires computing the frequency of ALL the allelic combinations when considering more than 1 locus
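One way to sketch the multi-marker case is to treat every observed allelic combination as a single "allele" and apply the same statistic. The code below assumes phased haplotypes are available per population and estimates combination frequencies by simple counting; the function name and data layout are hypothetical, and this direct counting is only practical for a handful of markers.

```python
import math
from collections import Counter

def multilocus_informativeness(haplotypes_by_pop, loci):
    """I_n of a marker set, treating each allelic combination as one allele.

    haplotypes_by_pop[i] is a list of haplotypes (tuples of alleles)
    sampled from population i; loci lists the marker positions to combine.
    """
    K = len(haplotypes_by_pop)
    pop_freqs = []
    for haps in haplotypes_by_pop:
        counts = Counter(tuple(h[l] for l in loci) for h in haps)
        pop_freqs.append({combo: c / len(haps) for combo, c in counts.items()})
    combos = set().union(*pop_freqs)  # every combination seen in any population
    total = 0.0
    for combo in combos:
        p_j = sum(f.get(combo, 0.0) for f in pop_freqs) / K
        total -= p_j * math.log(p_j)
        for f in pop_freqs:
            p_ij = f.get(combo, 0.0)
            if p_ij > 0:
                total += (p_ij / K) * math.log(p_ij)
    return total

# Toy data: loci 0 and 1 are fully redundant, locus 2 is uninformative.
pops = [
    [(0, 0, 0), (0, 0, 1)],
    [(1, 1, 0), (1, 1, 1)],
]
print(multilocus_informativeness(pops, [0]))
print(multilocus_informativeness(pops, [0, 1]))
```

In the toy data, adding the redundant second locus does not raise the information above that of the first locus alone, which is the non-redundancy property the slide refers to.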
A way to compute I_n
• Problem: The number of combinations increases exponentially with the number of markers.
  – Number of allelic combinations considering 50 SNPs: 2^50 = 1,125,899,906,842,624
A way to compute I_n
\[ I_n(Q;J) = \sum_{j=1}^{N} \left( -p_j \log p_j + \sum_{i=1}^{K} \frac{p_{ij}}{K} \log p_{ij} \right) \]
\[ I_n(Q;J) = \sum_{j=1}^{N} \left( H_j - \sum_{i=1}^{K} \frac{H_{ij}}{K} \right) \]
with H_j = -p_j \log p_j and H_{ij} = -p_{ij} \log p_{ij}.
By applying the Asymptotic Equipartition Property of entropy:
\[ H \approx -\frac{1}{N} \sum_{i=1}^{N} \ln(p_i) \]
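The Asymptotic Equipartition Property says that the average of -ln p over sampled outcomes approaches the entropy, so the entropy can be estimated without enumerating all allelic combinations. The sketch below checks this numerically under a strong simplifying assumption made only for illustration: independent biallelic loci with known allele frequencies, for which the exact entropy is the sum of per-locus entropies. How the talk obtains the sampled probabilities is not stated, so none of this should be read as the author's implementation.

```python
import math
import random

def exact_entropy(freqs):
    """Joint entropy (nats) of independent biallelic loci."""
    return sum(-(p * math.log(p) + (1 - p) * math.log(1 - p)) for p in freqs)

def aep_entropy(freqs, n_samples, rng):
    """Estimate the joint entropy as the mean of -ln p(haplotype)
    over randomly sampled haplotypes (assumes independent loci)."""
    total = 0.0
    for _ in range(n_samples):
        log_p = 0.0
        for p in freqs:
            # draw one allele per locus and accumulate its log-probability
            log_p += math.log(p) if rng.random() < p else math.log(1 - p)
        total += log_p
    return -total / n_samples

freqs = [0.3] * 50  # 50 loci: 2**50 combinations, never enumerated here
rng = random.Random(0)
print(exact_entropy(freqs), aep_entropy(freqs, 2000, rng))
```

The estimate needs only a few thousand sampled haplotypes, whereas the direct sum would run over 2^50 combinations.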
A method to ascertain ASMs
• Problem: Considering 8,000 markers, ascertaining the best set of 50 markers requires computing:
\[ N_{\text{combinations}} = \frac{8{,}000!}{50!\,(8{,}000 - 50)!} \approx 4 \times 10^{130} \]
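Both counts quoted in the talk can be checked with exact integer arithmetic. The snippet below is just that arithmetic; `math.comb` requires Python 3.8 or later.

```python
import math

# Allelic combinations for 50 biallelic SNPs.
n_allelic_combinations = 2 ** 50

# Ways to choose a set of 50 markers out of 8,000.
n_marker_sets = math.comb(8000, 50)

print(f"{n_allelic_combinations:,}")
print(f"{n_marker_sets:.1e}")
```

Exhaustively scoring every candidate set is therefore out of reach, which motivates a heuristic search.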
A method to ascertain ASMs
[Diagram: genetic algorithm cycle. Population of answers → Select best answers (> I_n) → Random mating → Recombination/Mutation → Next generation]
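The cycle in the diagram can be sketched as a small genetic algorithm over marker sets. This is a generic illustration, not the implementation used in the cited work: the function name, parameter defaults, truncation selection, elitism, and the mutation scheme are all assumptions made here. The `fitness` callable stands in for the I_n of a marker set.

```python
import random

def genetic_search(n_markers, set_size, fitness, generations=30,
                   pop_size=20, mutation_rate=0.2, seed=0):
    """Search for a high-fitness set of `set_size` markers out of `n_markers`."""
    rng = random.Random(seed)
    # Population of answers: random marker sets.
    population = [tuple(sorted(rng.sample(range(n_markers), set_size)))
                  for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[:pop_size // 2]       # select best answers
        children = [ranked[0]]                 # keep the current best (elitism)
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)      # random mating
            pool = sorted(set(a) | set(b))     # recombination: mix both parents
            child = rng.sample(pool, set_size)
            if rng.random() < mutation_rate:   # mutation: swap in an unused marker
                unused = [m for m in range(n_markers) if m not in child]
                child[rng.randrange(set_size)] = rng.choice(unused)
            children.append(tuple(sorted(child)))
        population = children                  # next generation
    return max(population, key=fitness), history

# Toy fitness (hypothetical): sum of per-marker scores, ignoring redundancy.
scores = list(range(30))
best, history = genetic_search(30, 5, lambda s: sum(scores[m] for m in s))
print(best, history[0], history[-1])
```

Because the best set is always carried over, the best fitness never decreases between generations; in a real application the toy fitness would be replaced by the multi-marker I_n.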
ASMs for continental differentiation using Affy 10k
[Diagram: study design, with the following boxes]
• YCC panel: 76 human individuals, 21 sampling localities
• CEPH-HGDP panel: 1064 samples, 51 human populations of global distribution
• Perlegen database: 3 human populations, ~1,500,000 SNPs
• 10k Affymetrix Array (~9000 SNPs after excluding X-SNPs & missing SNPs)
• SNP ascertainment (10 SNPs)
• Reproducibility of geographic structure in a different dataset (10 SNPs)
• Test for signatures of positive selection (EHH test) (most informative 5 SNPs)
Lao et al. Am J Hum Genet. 2006 Apr;78(4):680-90
ASMs for continental differentiation using Affy 10k
The genetic algorithm was applied repeatedly, increasing the number of selected SNPs each time
[Figure: percentage of information (0 to 100) vs. number of SNPs (1 to 10)]
Selected SNPs in the final 10 SNPs run:

Marker name  Chromosome  Gene name  I_N (%) from 4 groups, YCC panel  I_N (%) from 7 groups, CEPH-HGDP
rs722869     14          VRK1       29.066                            7.960
rs1858465    17          -          25.637                            9.228
rs1876482    2           LOC442008  24.589                            10.290
rs1344870    3           -          22.810                            11.074
rs1363448    5           PCDHGB1    19.418                            4.552
rs952718     2           ABCA12     18.739                            9.472
rs2352476    7           -          18.317                            5.603
rs714857     11          -          18.083                            6.157
rs1823718    15          -          17.845                            5.451
rs735612     15          RYR3       14.315                            5.530

Lao et al. Am J Hum Genet. 2006 Apr;78(4):680-90
ASMs for continental differentiation using Affy 10k
[Figure: clustering with 993 autosomal markers vs. 10 SNPs; panels labeled No admixture and Admixture; groups labeled Ameri, O, E-Asia, C/S-Asia, Europe, M. East, Africa]
Lao et al. Am J Hum Genet. 2006 Apr;78(4):680-90
ASMs for continental differentiation using HapMap III
K = 6 (1000 randomly ascertained markers, Admixture, 10,000 burn-in, 10,000 retained simulations)
K = 5 (50 markers, Admixture, 500,000 burn-in, 500,000 retained simulations)
K = 6 (100 markers, Admixture, 100,000 burn-in, 100,000 retained simulations)
[Figure: groups labeled Africa Sub-Saharan, Europe, India, East Asia, Am]
ASMs for continental differentiation using Illumina 650k
25 ascertained markers, PCA
[Figure: groups labeled E Asia, Oceania, Africa, Europe, Middle East, Central Asia, Amerindians]
ASMs for continental differentiation using Illumina 650k
• CEPH 550,000 SNPs
K = 5 (50 ascertained markers, Admixture, 500,000 burn-in, 500,000 retained simulations)