

slide-1
SLIDE 1

Efficient algorithms for ascertaining markers for controlling for population substructure

Oscar Lao Department of Forensic Molecular Biology Erasmus Medical Center (Rotterdam) New Jersey 2009

4/27/2009 1 New Jersey 2009

slide-2
SLIDE 2

Workflow

  • 1. Human population substructure
    ‐ How to detect it?
    ‐ How much?
    ‐ Where does it come from?
  • 2. Why does it matter?
  • 3. Ancestry Sensitive Markers (ASMs) / Ancestry Informative Markers (AIMs)
    ‐ Hypothesis driven. Particular individual clusters are preferred
    ‐ ASMs
    ‐ PhenoASMs


slide-3
SLIDE 3

How much there is and how much can be detected. The two sides of the same coin


Population substructure

Plato’s cave myth

slide-4
SLIDE 4

– STRUCTURE
– BAPS
– FRAPPE
– GENELAND
– PCA/MDS + K means
– Neural Networks
– …

Sometimes results are NOT reproducible


Population substructure

DETECTION

slide-5
SLIDE 5

Which type?

‐ Phenotype
‐ Genotype
    Y chromosome
    mtDNA
    Autosomal markers

Where?

‐ Worldwide
‐ Regional (I will focus on Europe)


Population substructure

HOW MUCH?

slide-6
SLIDE 6


Phenotypic substructure

slide-7
SLIDE 7


Y chromosome

slide-8
SLIDE 8


mtDNA

slide-9
SLIDE 9

Cavalli‐Sforza et al 1994


Classical markers

slide-10
SLIDE 10

1064 samples 51 human populations of global distribution


CEPH-HGDP panel

slide-11
SLIDE 11

Figure: clusters inferred from 377 autosomal markers and from 993 autosomal markers. Groups: Africa, Europe, M. East, C/S-Asia, E-Asia, America, Oceania


Autosomal STRs

Rosenberg et al Science 2002
Rosenberg et al PLoS Genetics 2005

slide-12
SLIDE 12

Li et al Science 2008: 650,000 SNPs (FRAPPE)
Jakobsson et al Nature 2008: 550,000 SNPs (STRUCTURE)
Haplotypes


Autosomal SNPs

slide-13
SLIDE 13

23 populations, Affymetrix 500k array


A set of European populations

slide-14
SLIDE 14

Lao and Lu et al Current Biology 2008
300,000 SNPs
r = 0.7


Autosomal SNPs in Europe

slide-15
SLIDE 15

Novembre et al Nature 2008


Autosomal SNPs in Europe

slide-16
SLIDE 16

K = 2; Admixture
Correlation with latitude: R² = 0.86
Correlation with longitude: R² = 0.01


Autosomal SNPs in Europe

slide-17
SLIDE 17

World Europe

Anayet peak (2574 m), Pyrenees
Keukenhof garden (‐2 m), Netherlands


Autosomal SNPs in Europe

slide-18
SLIDE 18

EDAR (positive selection in Asians)


Non random distribution of population substructure

slide-19
SLIDE 19

LCT (positive selection in North European populations)
Lao and Lu et al Current Biology 2008


Non random distribution of population substructure

slide-20
SLIDE 20

Cavalli‐Sforza & Feldman Nature Genetics 2003
Simoni et al AJHG 2000


Demography shapes the population substructure

slide-21
SLIDE 21
  • Selective pressures within the species (locus specific)
    – Lactose tolerance
    – Malaria resistance
    – Human pigmentation
    – …


Selection shapes the population substructure

slide-22
SLIDE 22
  • Population substructure & pigmentation (5 SNPs)

Figure: MDS plot. Groups: Europe, Africa, Middle East, Oceania, America, C/S Asia, E Asia, N Africa. Colour scale in Biasutti skin pigmentation units: 5 to 12, 12 to 16, 16 to 20, 20 to 24, 24 to 30

Lao et al Ann Hum Genet 2007


Selection shapes the population substructure

slide-23
SLIDE 23

CASES CONTROLS A G


Why population substructure: Confounding factor
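
A minimal numeric sketch of the confounding effect named in this slide title. All values are hypothetical and chosen only for illustration: two equally sized subpopulations differ both in the frequency of allele A and in disease prevalence, while inside each subpopulation the allele has no effect on disease.

```python
def allele_freq(pops, group):
    """Frequency of allele A among pooled cases or pooled controls.

    pops: list of dicts with keys 'weight' (share of the total sample),
    'freq_A' (allele frequency) and 'prevalence' (disease prevalence).
    Within each subpopulation, disease status is independent of the allele.
    """
    num = den = 0.0
    for pop in pops:
        share = pop["prevalence"] if group == "cases" else 1.0 - pop["prevalence"]
        mass = pop["weight"] * share      # fraction of all individuals in this cell
        num += mass * pop["freq_A"]
        den += mass
    return num / den


# Hypothetical subpopulations (illustrative numbers, not from the talk)
pops = [
    {"weight": 0.5, "freq_A": 0.8, "prevalence": 0.3},
    {"weight": 0.5, "freq_A": 0.2, "prevalence": 0.1},
]

freq_cases = allele_freq(pops, "cases")
freq_controls = allele_freq(pops, "controls")
print(f"A in cases: {freq_cases:.4f}, A in controls: {freq_controls:.4f}")
```

With these made-up inputs allele A is about 0.65 in pooled cases and about 0.46 in pooled controls, so a naive test would report an association that comes only from the substructure.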

slide-24
SLIDE 24

Plato’s cave myth

CHANGE THE ALGORITHM FOR DETECTING POPULATION SUBSTRUCTURE


Population substructure: improving the detection

slide-25
SLIDE 25

Plato’s cave myth


Population substructure: improving the detection

INCREASE THE RESOLUTION TO SEE THE OBJECTS

slide-26
SLIDE 26
  • Markers that capture most of the genetic ancestry
    – Estimate ancestry
    – Reduce the number of markers to test for genetic homogeneity
      • Time cost (clustering algorithms can be extremely computationally intensive)
      • Economic cost (i.e. exclude individuals BEFORE doing the GWA)


AIMs/ASMs

slide-27
SLIDE 27
  • Based on the existing diversity between individuals (e.g. Paschou et al 2008)
  • Based on predefined groups of individuals
    – No phenotype linked
      • Large genetic distances
      • Signals of positive selection
    – Phenotype linked
      • Covary with the phenotype of interest


Strategies to ascertain ASMs

slide-28
SLIDE 28
  • Use a statistic to quantify the amount of differentiation between populations
  • Compute the OVERALL non‐redundant amount of In between sets of SNPs
  • Take the best combination of markers from all the possible combinations
  • Repeat the process until the information of the set of markers is maximum


A basic algorithm to ascertain ASMs

slide-29
SLIDE 29

$$I_n(Q;J) = \sum_{j=1}^{N} \left( -p_j \log p_j + \sum_{i=1}^{K} \frac{p_{ij}}{K} \log p_{ij} \right)$$

Am J Hum Genet. 2003 Dec;73(6):1402-22

informativeness for assignment


A statistic to ascertain ASMs

slide-30
SLIDE 30
  • How much information a marker contains about the ancestry of one individual (measured in nats)
  • Ranges from 0 to the natural logarithm of the number of clusters and it is proportional to the number of differentiated clusters


A statistic to ascertain ASMs
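
A minimal sketch of the single-marker statistic, assuming the informativeness-for-assignment form cited on the previous slide (Am J Hum Genet. 2003 Dec;73(6):1402-22), with p_ij the frequency of allele j in population i and p_j its average over the K populations. The function name and the input layout (one list of allele frequencies per population) are illustrative assumptions.

```python
import math


def informativeness_for_assignment(freqs):
    """I_n in nats for one marker.

    freqs[i][j] is the frequency of allele j in population i;
    each row is assumed to sum to 1.
    """
    K = len(freqs)
    n_alleles = len(freqs[0])
    total = 0.0
    for j in range(n_alleles):
        # Average frequency of allele j across the K populations
        p_j = sum(freqs[i][j] for i in range(K)) / K
        if p_j > 0:
            total -= p_j * math.log(p_j)
        for i in range(K):
            p_ij = freqs[i][j]
            if p_ij > 0:  # 0 * log(0) is taken as 0
                total += (p_ij / K) * math.log(p_ij)
    return total


# Identical frequencies in both populations: no information
print(informativeness_for_assignment([[0.5, 0.5], [0.5, 0.5]]))
# Fixed difference between two populations: the maximum, ln(2)
print(informativeness_for_assignment([[1.0, 0.0], [0.0, 1.0]]))
```

The two calls illustrate the range stated on the slide: zero when the populations share the same allele frequencies, and the natural logarithm of the number of clusters when each cluster carries its own private allele.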

slide-31
SLIDE 31
  • Computes the non‐redundant amount of information when considering more than one marker
  • Requires computing the frequency of ALL the allelic combinations when considering more than 1 locus


A statistic to ascertain ASMs

slide-32
SLIDE 32
  • Problem: The number of combinations increases exponentially with the number of markers.

– Number of allelic combinations considering 50 SNPs: $2^{50}$ = 1,125,899,906,842,624


A way to compute In

slide-33
SLIDE 33

$$I_n(Q;J) = \sum_{j=1}^{N} \left( -p_j \log p_j + \sum_{i=1}^{K} \frac{p_{ij}}{K} \log p_{ij} \right)$$

By applying the Asymptotic Equipartition Property of Entropy

$$I_n(Q;J) = \sum_{j=1}^{N} \left( H_j - \sum_{i=1}^{K} \frac{H_{ij}}{K} \right)$$

$$H = -\frac{1}{N} \sum_{i=1}^{N} \ln(p_i)$$


A way to compute In
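
A small sketch of the idea behind the asymptotic equipartition shortcut: the entropy of a multi-locus distribution can be approximated by averaging -ln p(x) over sampled haplotypes, without enumerating every allelic combination. The setting here is an assumption made for illustration only: 50 independent biallelic loci with hypothetical allele frequencies, so the exact entropy is available as a check. The talk's own estimator may differ in detail.

```python
import math
import random


def exact_entropy(freqs):
    """Exact entropy (nats) of independent biallelic loci: sum of per-locus entropies."""
    return -sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in freqs)


def aep_entropy_estimate(freqs, n_samples=5000, seed=0):
    """Estimate the same entropy as the sample mean of -ln p(x) over sampled haplotypes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        log_p = 0.0
        for p in freqs:
            # Draw allele 1 with probability p, allele 0 otherwise,
            # and accumulate the log-probability of the sampled haplotype
            log_p += math.log(p) if rng.random() < p else math.log(1 - p)
        total -= log_p
    return total / n_samples


# Hypothetical allele frequencies for 50 SNPs (illustrative, not from the talk)
freq_rng = random.Random(42)
freqs = [freq_rng.uniform(0.05, 0.95) for _ in range(50)]

print("exact   :", exact_entropy(freqs))
print("estimate:", aep_entropy_estimate(freqs))
```

The estimate needs only a few thousand sampled haplotypes, whereas the exact multi-locus calculation would in general range over all 2^50 allelic combinations.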

slide-34
SLIDE 34
  • Problem: Considering 8,000 markers, ascertaining the best set of 50 markers requires computing:

$$N_{\text{combinations}} = \frac{8{,}000!}{50!\,(8{,}000 - 50)!} \approx 4 \times 10^{130}$$


A method to ascertain ASMs
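
The two counts quoted in the talk can be checked with exact integer arithmetic. This is only a verification sketch; the variable names are illustrative.

```python
import math

# Allelic combinations for 50 biallelic SNPs
n_allelic_combinations = 2 ** 50

# Ways to choose 50 markers out of 8,000
n_marker_sets = math.comb(8000, 50)

print(f"{n_allelic_combinations:,}")
print(f"{float(n_marker_sets):.1e}")
```

Both numbers make exhaustive enumeration impractical, which motivates a heuristic search over marker sets.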

slide-35
SLIDE 35

Population of answers → Select best answers (>In) → Random mating → Recombination/Mutation → Next generation


A method to ascertain ASMs
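
The cycle on this slide can be sketched as a small genetic algorithm over marker subsets. This is a sketch under stated assumptions, not the implementation used in the talk: the population size, mutation rate, elitism and the recombination rule are illustrative choices, and the toy fitness (a sum of hypothetical per-marker scores) stands in for the In of the marker set.

```python
import random


def genetic_subset_search(n_markers, k, fitness, pop_size=40,
                          generations=60, mutation_rate=0.1, seed=0):
    """Search for a k-marker subset with high fitness.

    Returns the best subset found and the best fitness seen at the
    start of each generation.
    """
    rng = random.Random(seed)
    # Population of answers: each answer is a sorted tuple of k distinct marker indices
    population = [tuple(sorted(rng.sample(range(n_markers), k)))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Select best answers: keep the fitter half as parents
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]
        children = [ranked[0]]  # elitism: the current best survives unchanged
        while len(children) < pop_size:
            # Random mating
            mother, father = rng.sample(parents, 2)
            # Recombination: draw k distinct markers from the parents' union
            pool = sorted(set(mother) | set(father))
            child = set(rng.sample(pool, k))
            # Mutation: swap one marker for a marker outside the child
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                candidates = [m for m in range(n_markers) if m not in child]
                child.add(rng.choice(candidates))
            children.append(tuple(sorted(child)))
        population = children  # next generation
    return max(population, key=fitness), history


# Hypothetical per-marker scores; a real run would score the whole set with In
score_rng = random.Random(1)
toy_scores = [score_rng.random() for _ in range(200)]


def toy_fitness(subset):
    return sum(toy_scores[m] for m in subset)


best, history = genetic_subset_search(n_markers=200, k=10, fitness=toy_fitness)
print(best, toy_fitness(best))
```

Because the best answer is always copied into the next generation, the best fitness never decreases from one generation to the next, although nothing guarantees that the global optimum is reached.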

slide-36
SLIDE 36

10k Affymetrix Array (~9000 SNPs after excluding X-SNPs & missing SNPs)

SNP ascertainment (10 SNPs)

(10 SNPs)

YCC-panel

76 human individuals 21 sampling localities

CEPH-HGDP panel

1064 samples 51 human populations of global distribution

Perlegen Database

3 Human populations ~1,500,000 SNPs

(most informative 5 SNPs)

Reproducibility of geographic structure in a different dataset

Test for signatures of positive selection (EHH test)

Lao et al. Am J Hum Genet. 2006 Apr;78(4):680‐90


ASMs for continental differentiation using Affy 10k

slide-37
SLIDE 37

Figure: percentage of information (10 to 100) against number of SNPs (1 to 10)

The genetic algorithm was applied increasing every time the number of selected SNPs

Marker name   Chromosome   Gene name    IN (%) from 4 groups, YCC panel   IN (%) from 7 groups, CEPH-HGDP
rs722869      14           VRK1         29.066                            7.960
rs1858465     17           -            25.637                            9.228
rs1876482     2            LOC442008    24.589                            10.290
rs1344870     3            -            22.810                            11.074
rs1363448     5            PCDHGB1      19.418                            4.552
rs952718      2            ABCA12       18.739                            9.472
rs2352476     7            -            18.317                            5.603
rs714857      11           -            18.083                            6.157
rs1823718     15           -            17.845                            5.451
rs735612      15           RYR3         14.315                            5.530

Selected SNPs in the final 10 SNPs run

Lao et al. Am J Hum Genet. 2006 Apr;78(4):680‐90


ASMs for continental differentiation using Affy 10k

slide-38
SLIDE 38

Figure: clustering with the 10 SNPs (No admixture and Admixture models) compared with 993 autosomal markers. Groups: Europe, Africa, M. East, C/S-Asia, E-Asia, America, Oceania

Lao et al. Am J Hum Genet. 2006 Apr;78(4):680‐90


ASMs for continental differentiation

ASMs for continental differentiation using Affy 10k

slide-39
SLIDE 39

K = 6 (1000 randomly ascertained markers, Admixture, 10,000 burn-in, 10,000 retained simulations)
Africa Sub‐Saharan Europe India East Asia Am

K = 5 (50 markers, Admixture, 500,000 burn-in, 500,000 retained simulations)
K = 6 (100 markers, Admixture, 100,000 burn-in, 100,000 retained simulations)


ASMs for continental differentiation using HapMap III

slide-40
SLIDE 40

E Asia Oceania Africa Europe Middle East Central Asia Amerindians

25 ascertained markers. PCA


ASMs for continental differentiation using Illumina 650k

slide-41
SLIDE 41
  • CEPH

550,000 SNPs
K = 5 (50 ascertained markers, Admixture, 500,000 burn-in, 500,000 retained simulations)


ASMs for continental differentiation using Illumina 650k

slide-42
SLIDE 42

Figure: stacked bars (0% to 100%) of inferred geographic location (K=Oceania, K=Europe‐Meast‐Casia, K=Amerindian, K=Africa, K=Easia) for each real geographic location (Africa, Europe‐Meast‐Casia, Easia, Oceania, Amerindian)


Specificity of the ascertained markers

slide-43
SLIDE 43

K = 2 (5000 random markers, Admixture, 10,000 burn-in, 10,000 retained simulations)
K = 3 (500 ascertained markers, Admixture, 10,000 burn-in, 10,000 retained simulations)


ASMs for population differentiation in the European continent using Affymetrix 500k

slide-44
SLIDE 44

Figure: Association between OCA_HERC2 region and iris color adjusted for ancestry sensitive markers. Axes: -log10(P-value) against physical position (Mb). Series: 500K SNPs; 498 ancestry sensitive SNPs; random 498 SNPs; no adjustment (original, AJHG)


Use of the 500 ASMs for correcting the effect of population substructure

slide-45
SLIDE 45
  • Recall
    – Population substructure is only a problem when PHENOTYPIC and GENOTYPIC variation covary
    – Why not ascertain markers that are associated with the particular spatial pattern of the phenotype?

$$I_n(Q;P \mid J) = I_n(Q;P;J) - I_n(Q;J)$$

“Amount of information of the phenotype (P) conditional on the genotype (J): How well could we correctly classify one individual given that we know his phenotype if we already know his genotype in a particular locus”


PhenoASMs
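
A minimal sketch of the conditional quantity on this slide, written with plain mutual information over a discrete joint distribution of ancestry cluster Q, phenotype P and genotype J. It assumes the identity I(Q;P|J) = I(Q;(P,J)) - I(Q;J); the function names and the dictionary layout are illustrative, and the toy distribution is hypothetical.

```python
import math


def mutual_information(joint):
    """I(X;Y) in nats from a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)


def conditional_information(joint_qpj):
    """I(Q;P|J) = I(Q;(P,J)) - I(Q;J) from a dict {(q, p, j): probability}."""
    q_vs_pj, q_vs_j = {}, {}
    for (q, p, j), pr in joint_qpj.items():
        q_vs_pj[(q, (p, j))] = q_vs_pj.get((q, (p, j)), 0.0) + pr
        q_vs_j[(q, j)] = q_vs_j.get((q, j), 0.0) + pr
    return mutual_information(q_vs_pj) - mutual_information(q_vs_j)


# Toy case: the phenotype tracks ancestry exactly, the genotype is unrelated to it
phenotype_tracks_ancestry = {(q, q, j): 0.25 for q in (0, 1) for j in (0, 1)}
print(conditional_information(phenotype_tracks_ancestry))

# Toy case: the phenotype is fully determined by a genotype that already tracks ancestry
phenotype_copies_genotype = {(0, 0, 0): 0.4, (0, 1, 1): 0.1,
                             (1, 0, 0): 0.1, (1, 1, 1): 0.4}
print(conditional_information(phenotype_copies_genotype))
```

In the first toy case the genotype says nothing about ancestry, so the phenotype still carries the full ln(2) nats. In the second the genotype already explains everything the phenotype says about ancestry, and the conditional information drops to zero.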

slide-46
SLIDE 46

LCT HERC2


PhenoASMs for lactose tolerance

slide-47
SLIDE 47


PhenoASMs for Crohn disease

slide-48
SLIDE 48

Phenotype   AA              AB              BB              Marginal phenotype
C           P(AA)P(C|AA)    P(AB)P(C|AB)    P(BB)P(C|BB)    ∑P(g)P(C|g)
D           P(AA)P(D|AA)    P(AB)P(D|AB)    P(BB)P(D|BB)    ∑P(g)P(D|g)


PhenoASMs: a little bit further
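
The genotype by phenotype table on this slide can be filled in numerically. This is a sketch with hypothetical inputs: the genotype frequencies and the conditional probabilities P(phenotype | genotype) are made up for illustration, and the function name is an assumption.

```python
def joint_table(genotype_freqs, conditional):
    """Build P(g) * P(phenotype | g) for every cell, plus the marginal per phenotype.

    genotype_freqs: {genotype: P(g)}
    conditional:    {phenotype: {genotype: P(phenotype | g)}}
    """
    table = {}
    for phenotype, cond in conditional.items():
        row = {g: genotype_freqs[g] * cond[g] for g in genotype_freqs}
        row["marginal"] = sum(row.values())  # sum over genotypes of P(g) P(phenotype | g)
        table[phenotype] = row
    return table


# Hypothetical values (illustrative only)
genotype_freqs = {"AA": 0.09, "AB": 0.42, "BB": 0.49}
conditional = {
    "C": {"AA": 0.9, "AB": 0.5, "BB": 0.1},
    "D": {"AA": 0.1, "AB": 0.5, "BB": 0.9},
}

table = joint_table(genotype_freqs, conditional)
for phenotype, row in table.items():
    print(phenotype, row)
```

The marginal column is the sum across genotypes in each row, and the marginals of the two phenotype classes add up to one.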

slide-49
SLIDE 49
  • Update θ with a Metropolis algorithm
  • Update the covariance matrix of the proposal distribution by means of a “quasi‐perfect adaptive MCMC” (Andrieu and Atchade)
  • Compute the harmonic mean of the likelihood in order to obtain a rough estimate of P(M|D)


PhenoASMs: a Bayesian approach
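
A minimal sketch of two of the steps listed on this slide: a random-walk Metropolis update of θ and the harmonic mean of the likelihood. The model is an assumption chosen to keep the example small (a binomial likelihood with a uniform prior on θ, fixed proposal width), and the adaptive update of the proposal covariance is left out. The harmonic mean estimates the marginal likelihood P(D|M), which under equal prior model weights is proportional to P(M|D); it is known to be a rough, high-variance estimator.

```python
import math
import random


def log_likelihood(theta, k, n):
    """Binomial log-likelihood of k successes in n trials (constant term omitted)."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)


def metropolis(k, n, n_iter=20000, step=0.1, seed=1):
    """Random-walk Metropolis for theta under a uniform prior on (0, 1)."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_likelihood(theta, k, n)
    samples, logliks = [], []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        if 0.0 < proposal < 1.0:  # prior density is zero outside (0, 1)
            candidate = log_likelihood(proposal, k, n)
            # Accept with probability min(1, likelihood ratio)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                theta, current = proposal, candidate
        samples.append(theta)
        logliks.append(current)
    return samples, logliks


def log_harmonic_mean(logliks):
    """Log of the harmonic mean of the likelihood values, computed stably."""
    neg = [-v for v in logliks]
    m = max(neg)
    log_mean_inverse = m + math.log(sum(math.exp(v - m) for v in neg) / len(neg))
    return -log_mean_inverse


# Hypothetical data: 30 successes in 100 trials
samples, logliks = metropolis(k=30, n=100)
burn_in = 1000
kept = samples[burn_in:]
posterior_mean = sum(kept) / len(kept)
log_evidence = log_harmonic_mean(logliks[burn_in:])
print(posterior_mean, log_evidence)
```

For this toy model the exact posterior mean is (k + 1) / (n + 2), which gives a simple check on the sampler before trusting the evidence estimate.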

slide-50
SLIDE 50


Phenotype-genotype association for eye color

slide-51
SLIDE 51

TAS2R38


Phenotype-genotype association for bitter taste

slide-52
SLIDE 52

TAS2R38


Phenotype-genotype association for bitter taste

slide-53
SLIDE 53


Phenotype-genotype association for bitter taste

slide-54
SLIDE 54
  • Low to moderate human population differentiation
  • Mainly associated with geography
  • No sharp discontinuities, except in particular genomic regions (selection?)
  • Results depend on the clustering algorithm
  • ASMs can improve the detection of population substructure


Conclusions

slide-55
SLIDE 55
  • In is a good statistic for ascertaining markers to differentiate predefined populations
  • If a prior definition of a population is used, ASMs will tend to differentiate that population, independently of the biological meaning
  • PhenoASMs as the next level of ASMs?


Conclusions

slide-56
SLIDE 56
  • M. Balascakova, C. Becker, J. Bertranpetit, L.A. Bindoff, D. Comas, U. Gether, C. Gieger, G. Holmlund, A. Kouvatski, M. Macek, I. Mollet, M. Nelson, P. Nuernberg, W. Parson, R. Ploski, A. Ruether, A. Sajantila, S. Schreiber, A. Tagliabracci, A. Uiterlinden, T. Werge, and E. Wichmann.

In collaboration with


slide-57
SLIDE 57

Acknowledgements

Tim Lu Michael Krawczak Manfred Kayser


Petros Drineas Andreas Wollstein Peristeia Paschou

slide-58
SLIDE 58

Thank you very much!
