
Computations with Markers, by Paulino Pérez and José Crossa



  1. Computations with Markers. Paulino Pérez (1), José Crossa (1). Affiliations: (1) ColPos-México; (2) CIMMyT-México. June, 2015. CIMMYT, México-SAGPDB Computations with Markers 1/20

  2. Contents
     1. Genomic relationship matrix
     2. Examples
     3. Big Data!

  3. Genomic relationship matrix. The genomic relationship matrix ($G$) appears naturally in several models used routinely in genomic selection. VanRaden (2008) studied efficient methods to compute genomic predictions using this matrix. There are several ways of computing the $G$ matrix:

  4. Genomic relationship matrix
     1. $G = XX'$, where $X$ is the $n \times p$ matrix of marker genotypes. For SNPs, $x_{ij} \in \{0, 1, 2\}$.
     2. $G = \dfrac{(X - E)(X - E)'}{2 \sum_{j=1}^{p} p_j (1 - p_j)}$, where $p_j$ is the minor allele frequency of SNP $j = 1, \dots, p$, and $E$ is the matrix of expected values of $x_{ij}$ under Hardy-Weinberg equilibrium, computed from estimates of the allelic frequencies (so that $e_{ij} = 2 p_j$).
     3. $G = \dfrac{ZZ'}{p}$, where $Z$ is the matrix of centered and standardized SNP codes and $p$ is the number of SNPs, that is, $z_{ij} = (x_{ij} - 2 p_j) / \sqrt{2 p_j (1 - p_j)}$.
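
The three definitions can be tried on a small matrix. The sketch below uses a made-up 4 x 3 genotype matrix (hypothetical data, not from the slides) and assumes that `p_hat` is the frequency of the allele counted in `X`, so that `2 * p_hat` equals the column mean.

```r
# Hypothetical toy data: 4 individuals x 3 SNPs, coded 0/1/2
X <- matrix(c(0, 1, 2,
              1, 0, 2,
              2, 1, 1,
              0, 0, 2), nrow = 4, byrow = TRUE)

# Estimated frequency of the counted allele for each SNP
p_hat <- colMeans(X) / 2

# E holds the expected genotype 2 * p_j in every row of column j
E <- matrix(2 * p_hat, nrow = nrow(X), ncol = ncol(X), byrow = TRUE)

# Definition 1: G = XX'
G1 <- tcrossprod(X)

# Definition 2: centered cross-product scaled by 2 * sum(p_j * (1 - p_j))
G2 <- tcrossprod(X - E) / (2 * sum(p_hat * (1 - p_hat)))

# Definition 3: centered and standardized codes, divided by the number of SNPs
Z <- sweep(X - E, 2, sqrt(2 * p_hat * (1 - p_hat)), "/")
G3 <- tcrossprod(Z) / ncol(X)
```

Because the columns of `X - E` sum to zero, every row of `G2` and `G3` also sums to zero, which is a quick sanity check on the centering.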

  5. Genomic relationship matrix (continued). $G = XX'$ appears naturally when we assume that we can predict the phenotypes using the linear model $y = 1\mu + X\beta + e$, where $e \sim N(0, \sigma^2_e I)$ and $\beta \sim N(0, \sigma^2_\beta I)$. Let $u = X\beta$; by the properties of the multivariate normal distribution, it can be shown that $u \sim N(0, \sigma^2_\beta XX')$, and the model is equivalent to $y = 1\mu + u + e$, which is usually known as G-BLUP. We will talk about this model later on.
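
The step from $\beta$ to $u$ uses only the rule for linear transformations of a normal vector. Written out, with the variance component $\sigma^2_\beta$ kept explicit as the scale factor of $XX'$:

```latex
\begin{aligned}
E[u] &= X\,E[\beta] = 0, \\
\operatorname{Var}(u) &= X\,\operatorname{Var}(\beta)\,X'
                       = X\,(\sigma^2_\beta I)\,X'
                       = \sigma^2_\beta\, XX'.
\end{aligned}
```

Since $u$ is a linear function of a multivariate normal vector, it is itself multivariate normal with this mean and covariance.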

  6. Examples. Figure 1: Toy example for markers.

  7. Examples: SNP coding
     1. Additive effects: $x = \begin{cases} -1 & \text{if the SNP is homozygous for the major allele} \\ 0 & \text{if the SNP is heterozygous} \\ 1 & \text{if the SNP is homozygous for the other allele} \end{cases}$
     2. Dominant effects: $x = \begin{cases} -1 & \text{if the SNP is heterozygous} \\ 0 & \text{if the SNP is homozygous} \end{cases}$
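
As a rough illustration of the two codings, the sketch below maps allele counts to the codes on the slide. The function names and the input format (0, 1 or 2 copies of the non-major allele) are assumptions made for this example; they are not the `recode()` helper used later in the deck. The heterozygote value of -1 for the dominance code follows the slide as transcribed and is exposed as a parameter, since other conventions exist.

```r
# x_count: allele counts (0, 1 or 2 copies of the non-major allele); assumed input format

# Additive code: -1 (homozygous major), 0 (heterozygous), 1 (homozygous other)
code_additive <- function(x_count) {
  x_count - 1
}

# Dominance code: het_value for heterozygotes, 0 for both homozygotes
# het_value = -1 follows the slide as transcribed (hypothetical parameter name)
code_dominance <- function(x_count, het_value = -1) {
  ifelse(x_count == 1, het_value, 0)
}

x <- c(0, 1, 2)
additive  <- code_additive(x)
dominance <- code_dominance(x)
```

Both functions are vectorized, so they can be applied directly to a whole genotype matrix of counts.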

  8. Examples (continued)

         #Clear workspace
         rm(list=ls())

         #Set working directory
         setwd("C:/Users/P.P.RODRIGUEZ/Desktop/Slides Paulino/2. Gmatrix/examples/")

         source("Recode.R")
         source("Impute.R")

         Genotype_info=read.csv(file="TC-10-Genotypes-ACGT.csv",
                                header=TRUE,na.strings="?_?",stringsAsFactors=FALSE)
         entry_Genotype_info=Genotype_info$Entry
         Genotype_info=Genotype_info[,-c(1,2)]
         X=recode(Genotype_info)$X

         #Impute missing genotypes
         set.seed(123)
         out=Impute(X)

  9. Examples (continued)

         #Note that markers 167 and 179 are
         #monomorphic and should be excluded from analysis
         out$monomorphic

         #Remove monomorphic markers,
         #At this point no more missing values are present
         X=out$X[,-out$monomorphic]

         #compute p
         phat=colMeans(X)/2
         MAF=ifelse(phat<0.5,phat,1-phat)
         phat=MAF
         hist(MAF,main="")
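
One reason monomorphic markers have to go before standardizing: a column with zero variance turns into 0/0 when it is centered and divided by its standard deviation. The sketch below shows this on made-up data (not the course dataset) and drops such columns with a logical index, which also behaves correctly when no marker is monomorphic.

```r
# Hypothetical toy data: the second marker is monomorphic (all individuals coded 2)
X_toy <- matrix(c(0, 2,
                  1, 2,
                  2, 2), nrow = 3, byrow = TRUE)

# Standardizing a zero-variance column produces NaN values
Z_toy <- scale(X_toy, center = TRUE, scale = TRUE)
has_nan <- anyNA(Z_toy)

# Keep only polymorphic markers; logical indexing is safe even if all markers pass
keep <- apply(X_toy, 2, var) > 0
X_ok <- X_toy[, keep, drop = FALSE]
Z_ok <- scale(X_ok, center = TRUE, scale = TRUE)
```

Negative indexing such as `X[, -idx]` returns a matrix with zero columns when `idx` is empty, so the logical form is a slightly more defensive choice.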

  10. Examples (continued). Figure 2: Distribution of allele frequencies (histogram; x-axis: MAF, from 0.0 to 0.5; y-axis: Frequency).

  11. Examples: Computations, three ways

          #Computing the genomic relationship matrix
          G1=tcrossprod(X)

          X2=scale(X,center=TRUE,scale=FALSE)
          k=2*sum(phat*(1-phat))
          G2=tcrossprod(X2)/k

          X3=scale(X,center=TRUE,scale=TRUE)
          G3=tcrossprod(X3)/ncol(X3)

          heatmap(G3)
          hist(diag(G3),main="")
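
Note that `scale(X, scale=TRUE)` divides each column by its sample standard deviation, whereas the third definition of $G$ divides by $\sqrt{2 p_j (1 - p_j)}$. The two usually give similar but not identical matrices. The sketch below compares them on simulated genotypes (hypothetical data drawn under Hardy-Weinberg proportions, not the course dataset).

```r
set.seed(1)

# Hypothetical genotypes: 20 individuals x 50 SNPs
p_true <- runif(50, 0.2, 0.8)
X <- sapply(p_true, function(p) rbinom(20, size = 2, prob = p))
X <- X[, apply(X, 2, var) > 0, drop = FALSE]   # drop any monomorphic marker

p_hat <- colMeans(X) / 2

# Standardize with the Hardy-Weinberg standard deviation sqrt(2 p (1 - p))
X_centered <- sweep(X, 2, 2 * p_hat, "-")
Z_hwe <- sweep(X_centered, 2, sqrt(2 * p_hat * (1 - p_hat)), "/")
G3_hwe <- tcrossprod(Z_hwe) / ncol(X)

# Standardize with the sample standard deviation, as scale() does
G3_scale <- tcrossprod(scale(X, center = TRUE, scale = TRUE)) / ncol(X)

# With scale(), the diagonal averages exactly (n - 1) / n
mean_diag_scale <- mean(diag(G3_scale))
mean_diag_hwe   <- mean(diag(G3_hwe))
```

With `scale()` the average diagonal element is fixed at (n - 1)/n by construction, while the Hardy-Weinberg version lets it vary with departures from the expected genotype variance.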

  12. Examples: Exercise
      1. Load the wheat dataset that we were using yesterday.
      2. Compute the genomic relationship matrix using equation 1.

  13. Examples (continued). Figure 3: Heatmap of G matrix (rows and columns labeled by individual index).

  14. Examples (continued). Figure 4: Histogram of the diagonal elements of the G matrix (x-axis: diag(G3); y-axis: Frequency).

  15. Examples: Distance matrix. The distance matrix also appears naturally in RKHS models, which we will review in the coming days: $d_{ij} = \|x_i - x_j\|^2 = \sum_k (x_{ik} - x_{jk})^2$. Example (dist() returns unsquared Euclidean distances, so the result is squared to match the formula):

          D=as.matrix(dist(X))^2
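
The squared distances and the cross-product matrix $XX'$ carry the same information, since $d_{ij} = G_{ii} + G_{jj} - 2 G_{ij}$ when $G = XX'$. A small check of this identity on made-up genotypes (hypothetical data) might look like this:

```r
set.seed(2)

# Hypothetical 5 x 8 genotype matrix coded 0/1/2
X <- matrix(sample(0:2, 5 * 8, replace = TRUE), nrow = 5)

# dist() returns Euclidean distances ||x_i - x_j||, so square them
D_euclid <- as.matrix(dist(X))
D_sq <- D_euclid^2

# Same squared distances recovered from G = XX'
G <- tcrossprod(X)
D_from_G <- outer(diag(G), diag(G), "+") - 2 * G

# as.matrix(dist()) adds dimnames, so strip them before comparing
same <- isTRUE(all.equal(unname(D_sq), D_from_G))
```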

  16. Big Data! The computation of the genomic relationship matrix is straightforward if the matrix $X$ is small. There are applications where the number of markers can be very large.

  17. Big Data! Ober's prediction problem. Ober et al. (2012) predict starvation stress resistance and startle response in Drosophila using $p = 2.5$ million SNPs and $n = 192$ D. melanogaster inbred lines derived by 20 generations of full-sib mating from wild-caught females from the Raleigh, North Carolina population.
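
To get a feel for the sizes involved, the back-of-the-envelope calculation below uses the values of n and p quoted on the slide and assumes the genotypes are held in memory as a dense R matrix. The marker matrix is large, while the resulting $G$ matrix is tiny.

```r
n <- 192      # inbred lines (from the slide)
p <- 2.5e6    # SNPs (from the slide)

# Dense storage: R numeric matrices use 8 bytes per entry, integer matrices 4 bytes
bytes_X_double  <- n * p * 8
bytes_X_integer <- n * p * 4
bytes_G         <- n * n * 8

sizes <- c(X_double_GB  = bytes_X_double / 1e9,
           X_integer_GB = bytes_X_integer / 1e9,
           G_MB         = bytes_G / 1e6)
sizes
```

Intermediate copies made by operations such as centering or `scale()` add to the memory needed for $X$, which is part of why processing the markers in pieces is attractive.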

  18. Big Data! (continued). Genomic relationship matrix for Ober's data.

  19. Big Data! Solution. Fortunately, the computation of the $G$ matrix can be fully parallelized on modern CPUs: $G_{ij} = \sum_k (x_{ik} - 2 p_k)(x_{jk} - 2 p_k) / c$, where $c$ is the scaling constant (for the second definition of $G$, $c = 2 \sum_k p_k (1 - p_k)$). When computing $G_{ij}$, only the genotypes of individuals $(i, j)$ are needed.
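
One way to exploit this independence is to compute $G$ in blocks of rows, each block needing only the genotypes of the individuals involved. The sketch below is an illustration under assumptions, not the authors' implementation: the function name `compute_G_blocks` and the `block_size` argument are hypothetical, $c$ is taken to be $2 \sum_k p_k (1 - p_k)$, and the blocks are processed with `lapply`, which could be swapped for `parallel::mclapply` on Unix-alikes to run them concurrently.

```r
# Hypothetical helper: row-block computation of G with c = 2 * sum(p * (1 - p))
compute_G_blocks <- function(X, block_size = 2) {
  p_hat <- colMeans(X) / 2
  cc <- 2 * sum(p_hat * (1 - p_hat))        # scaling constant c (assumed)
  W <- sweep(X, 2, 2 * p_hat, "-")          # centered genotypes x_ik - 2 * p_k

  # Split individuals into consecutive blocks of at most block_size rows
  idx <- seq_len(nrow(X))
  blocks <- unname(split(idx, ceiling(idx / block_size)))

  # Each block gives the rows G[block, ]; blocks are independent of each other
  rows <- lapply(blocks, function(b) tcrossprod(W[b, , drop = FALSE], W) / cc)
  do.call(rbind, rows)
}

set.seed(3)
X <- matrix(sample(0:2, 6 * 10, replace = TRUE), nrow = 6)  # hypothetical genotypes

G_blocks <- compute_G_blocks(X, block_size = 2)

# Direct computation for comparison
p_hat <- colMeans(X) / 2
G_direct <- tcrossprod(scale(X, center = TRUE, scale = FALSE)) /
  (2 * sum(p_hat * (1 - p_hat)))
```

The same idea applies along the marker dimension: $G$ can be accumulated as a sum of cross-products over chunks of columns, so that the full $X$ never has to be in memory at once.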

  20. Big Data! (continued)
