pca and admixture proportions for ngs data
play

PCA and admixture proportions for NGS data Anders Albrechtsen - PowerPoint PPT Presentation

PCA and admixture proportions for NGS data Anders Albrechtsen Admixture model NGSadmix Introduction to PCA PCA for NGS Ancient Eskimo a a Rasmussen et. al. , Nature 2010 Figure: First principal components of selected populations. Admixture


  1. PCA and admixture proportions for NGS data Anders Albrechtsen

  2. Admixture model NGSadmix Introduction to PCA PCA for NGS Ancient Eskimo a a Rasmussen et. al. , Nature 2010 Figure: First principal components of selected populations.

  3. Admixture model NGSadmix Introduction to PCA PCA for NGS Admixture: Inuit vs. European

  4. Admixture model NGSadmix Introduction to PCA PCA for NGS 26.5% European admixture in Greenland

  5. Admixture model NGSadmix Introduction to PCA PCA for NGS Admixture: PCA analysis

  6. Admixture model NGSadmix Introduction to PCA PCA for NGS What hope you get out of this session Learning objectives • Be able to understand the underlying model for PCA and admixture analysis • Understand the problem with genotype uncertainty for NGS data • Learn some tricks and tools to overcome the problem • Be able to do PCA and to infer ancestry proportions from NGS data

  7. Admixture model NGSadmix Introduction to PCA PCA for NGS Outline Admixture model 1 Intro to the problem Maximum likelihood (ML) solution based on called genotypes NGSadmix 2 ML inference based on genotype likelihoods Introduction to PCA 3 population structure and PCA Problems with PCA analysis NGS data PCA for NGS 4 The expectation of the covariance

  8. Admixture model NGSadmix Introduction to PCA PCA for NGS Basic problem • Underlying assumption : All analyzed individuals have DNA that originates from one or more of K source populations. • Consequence : they will have some proportion of DNA from each of the K populations; these proportions are called admixture proportions • Examples with K=2: • A dataset consisting of samples from two different populations • Two populations that have mixed • Goal: to infer overall admixture proportions (+ identify clusters)

  9. Admixture model NGSadmix Introduction to PCA PCA for NGS Overview of well known solutions • Several methods: • Bayesian: e.g. Structure (Pritchard et al. 2000, BABS - Mikko J Sillanp • Maximum Likelihood: e.g. Admixture (Alexander et al. 2009) • They all base their inference on called genotypes and infer Admixture proportions, Q 1 Allele frequencies for all loci for all K populations, F 2

  10. Admixture model NGSadmix Introduction to PCA PCA for NGS ML solution • To find an ML solution we have to • Define a model/likelihood function p ( data | Q , F ) • Find an efficient way to find argmax p ( data | Q , F ) ( Q , F ) • The latter is usually solved using EM which I will no focus on • I will spend time describing the model/likelihood function

  11. Admixture model NGSadmix Introduction to PCA PCA for NGS Likelihood function (1 individual i , 1 diallelic locus j ) Assume K source populations and let • Q i = ( q i 1 , q i 2 , ..., q i K ) be i ’s genomewide admixture proportions • G ij be the genotype of i in j (measured in counts of allele A) • F j = ( f j 1 , f j 2 , ..., f j K ) denote the allele frequencies of allele A Then • for one of i ’s alleles: p ( allele | Q i , F j ) = q i 1 f j 1 + q i 2 f j 2 + ... q i K f j K = h ij • therefore assuming HWE we get the likelihood function: ( h ij ) 2  if G ij = 2 ,  p ( G ij | Q i , F j ) = 2 h ij (1 − h ij ) if G ij = 1 , (1 − h ij ) 2 if G ij = 0 . 

  12. Admixture model NGSadmix Introduction to PCA PCA for NGS Likelihood function (N individuals, M diallelic loci) • If we assume: • the individuals are unrelated and thus independent • loci are independent we can write the likelihood as N M � � p ( G ij | Q i , F j ) p ( G | Q , F ) = i j with G = ( G ij ), Q = ( Q 1 , Q 2 , ... Q N ) and F = ( F 1 , F 2 , ... F M ). • Based on this we can find ML estimates: ( ˆ Q , ˆ F ) = argmax p ( G | Q , F ). ( Q , F ) Very large number of parameters M × K + N × (K-1)

  13. Admixture model NGSadmix Introduction to PCA PCA for NGS reformulation of the likelhood the likelihood - G ij = g 1 ij + g 2 ij N M � � p ( G ij | Q i , F j ) p ( G | Q , F ) = i j N M 2 � � � p ( g d ij | Q i , F j ) ∝ i j d 2 N M � � � � p ( g d ij | Q i , F j , A ) p ( A | Q i , F j ) = i j d A N M 2 � � � � p ( g d ij | F j , A ) p ( A | Q i ) = i j d A ij | F j , A ) = f j ij ) + (1 − f j p ( A | Q i ) = q i A and p ( g d A I 0 ( g d A ) I 1 ( g d ij )

  14. Admixture model NGSadmix Introduction to PCA PCA for NGS Some problems (NGS data, variable depth)

  15. Admixture model NGSadmix Introduction to PCA PCA for NGS Some problems (NGS data, variable depth)

  16. Admixture model NGSadmix Introduction to PCA PCA for NGS Genotype likelihoods The genotype likelihood p ( data | geno ) Summarise the data in 10 genotype likelihoods A C G T bases: A * * * * TTTCCTTTTTTTTTTTTT C * * * ֌ quality score: G * * BBGHSSBBTTTTGHRSBB T *

  17. Admixture model NGSadmix Introduction to PCA PCA for NGS Genotype likelihoods The genotype likelihood p ( data | geno ) Summarise data for diallelic site bases: 0 p ( data | geno = 0) TTTCCTTTTTTTTTTTTT 1 p ( data | geno = 1) ֌ quality score: 2 p ( data | geno = 2) BBGHSSBBTTTTGHRSBB

  18. Admixture model NGSadmix Introduction to PCA PCA for NGS Solution: NGSadmix • Works on genotype likelihoods instead of called genotypes • I.e. input is p ( X ij | G ij ) for all 3 possible values of G ij , where X ij is NGS data for individual i at locus j • The previous likelihood is extended from N M p ( G ij | Q i , F j ) � � p ( G | Q , F ) = i j to N M N M � � p ( X ij | Q i , F j ) = � � � p ( X ij | G ij ) p ( G ij | Q i , F j ) p ( X | Q , F ) = i j i j Gij ∈{ 0 , 1 , 2 } • Note that for known genotypes the two are equivalent • A solution is found using an EM-algorithm

  19. Admixture model NGSadmix Introduction to PCA PCA for NGS Solution: NGSadmix • Does well even for low depth and variable depth data:

  20. Admixture model NGSadmix Introduction to PCA PCA for NGS Principal component analysis SVD X = UDV T PCA for a covariance matrix MM T = C = VDV T • The first principal component/eigenvector accounts for as much of the variability in the data as possible • Can be use to reduce the dimension of the data Goal Capture the population structure in a low dimensional space

  21. Admixture model NGSadmix Introduction to PCA PCA for NGS Genotype data

  22. Admixture model NGSadmix Introduction to PCA PCA for NGS The first principal component

  23. Admixture model NGSadmix Introduction to PCA PCA for NGS Measure for pairwise differences Identical by descent (IBS) matrix - used in MDS The average genotype distance between individuals pros fast cons Ignores allele frequency cons Problems with some kinds of missingness Covariance / correlation matrix - used in PCA pros good weighting scheme for each site cons Slower and cannot easily deal with any kind of missing data

  24. Admixture model NGSadmix Introduction to PCA PCA for NGS Approximation of the genotype covariance M number of sites G genotypes G j genotypes for individual j G j k genotypes for site k in individual j f k allele frequency for site k Known genotypes - covariance between individuals i and j a a Patterson N, Price AL, Reich D, plos genet. 2006 M k − 2 f k )( G j ( G i cov ( G i , G j ) = 1 k − 2 f k ) = 1 � G ˜ ˜ G T 2 f k (1 − f k ) M M k =1 G i k − 2 f k ˜ G i k = var ( G k ) = 2 f k (1 − f k ) � 2 f k (1 − f k )

  25. Admixture model NGSadmix Introduction to PCA PCA for NGS IBS matrix genotypes Assuming the genotypes are coded as { 0 , 1 , 2 } Commonly used distance measure k = G j G i k ∧ G i  0 if k � = 1  k = G j  G i k ∧ G i 1 if k = 1  k , G j d ( G i k ) = k − G j | G i 1 if k | = 1   k − G j  | G i 2 if k | = 2 IBS distance between individual i and j M d ( G i , G j ) = 1 � k , G j d ( G i k ) M k =1

  26. Admixture model NGSadmix Introduction to PCA PCA for NGS The two first principal component 0.1 Pop 1 0.0 ● PC 2 Pop 2 ● Pop 3 ● −0.1 −0.2 −0.20 −0.15 −0.10 −0.05 0.00 0.05 0.10 0.15 PC 1

  27. Admixture model NGSadmix Introduction to PCA PCA for NGS Early use of PCA in genetics Shown 1 PC at 400 locations Science 1978 Menozzi P, Piazza A, Cavalli-Sforza L. Data 38 loci

  28. Admixture model NGSadmix Introduction to PCA PCA for NGS PCA mania Eigenstrat a Genetic map from PCA a a Price et. al, nat genet. 2008 a Novembre et. al, nat genet. 2008

  29. Admixture model NGSadmix Introduction to PCA PCA for NGS Sample size/information bias Uneven sample sizes will bias both the distance and the pattern a a McVean G PLoS Genet. (2009)

  30. Admixture model NGSadmix Introduction to PCA PCA for NGS Dealing with Missingness Covariance matrix - Eigensoft a a Patterson N, Price AL, Reich D, plos genet. 2006 If a genotype is missing then ˜ G k i is set to zero • E [˜ G i k ] = 0 for a random individual • E [ cov ( G i , G j )] = 0 without relatedness or admixture. or a site is discarded • Not possible for large samples • Will likely cause ascertainment bias IBS matrix The site is skipped for the pair of individuals • Missingness must be random

Recommend


More recommend