SLIDE 1

Computations with Markers

Paulino Pérez¹, José Crossa¹

¹ColPos-México, ²CIMMyT-México

June, 2015.

CIMMYT, México-SAGPDB Computations with Markers 1/20

SLIDE 2

Contents

1. Genomic relationship matrix
2. Examples
3. Big Data!

SLIDE 3

Genomic relationship matrix

The genomic relationship matrix (G) appears naturally in several models used routinely in genomic selection. VanRaden (2008) studied efficient methods to compute genomic predictions using this matrix. There are several ways of computing the G matrix; three common ones follow.

SLIDE 4

Genomic relationship matrix

1. $G = XX'$, where $X$ is the $n \times p$ matrix of marker genotypes. For SNPs, $x_{ij} \in \{0, 1, 2\}$.

2. $G = \dfrac{(X - E)(X - E)'}{2\sum_{j=1}^{p} p_j(1 - p_j)}$, where $p_j$ is the minor allele frequency of SNP $j = 1, \dots, p$, and $E$ is the matrix of expected values of $x_{ij}$ under Hardy-Weinberg equilibrium, computed from the estimated allele frequencies (so that $E_{ij} = 2p_j$).

3. $G = \dfrac{ZZ'}{p}$, where $Z$ is the matrix of centered and standardized SNP codes and $p$ is the number of SNPs; that is, $z_{ij} = (x_{ij} - 2p_j)/\sqrt{2p_j(1 - p_j)}$.

SLIDE 5

Genomic relationship matrix

Continue...

$G = XX'$ appears naturally when we assume that the phenotypes can be predicted using the linear model

$y = 1\mu + X\beta + e$,

where $e \sim N(0, \sigma^2_e I)$ and $\beta \sim N(0, \sigma^2_\beta I)$.

Let $u = X\beta$. By the properties of the multivariate normal distribution, it can be shown that $u \sim N(0, \sigma^2_\beta XX')$, and the model is equivalent to $y = 1\mu + u + e$, which is usually known as G-BLUP. We will talk about this model later on.
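The distribution of $u$ follows from the linearity of expectation and covariance. A short derivation in the notation of this slide, using only the assumption $\beta \sim N(0, \sigma^2_\beta I)$ stated above:

$$
\begin{aligned}
E(u) &= X\,E(\beta) = 0, \\
\mathrm{Var}(u) &= X\,\mathrm{Var}(\beta)\,X' = X(\sigma^2_\beta I)X' = \sigma^2_\beta\, XX'.
\end{aligned}
$$

Because a linear transformation of a multivariate normal vector is again multivariate normal, $u \sim N(0, \sigma^2_\beta XX')$; the variance component $\sigma^2_\beta$ multiplies the matrix $G = XX'$.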

SLIDE 6

Examples

Figure 1: Toy example for markers.

SLIDE 7

Examples

SNP coding

1. Additive effects:

$$x = \begin{cases} -1 & \text{if the SNP is homozygous for the major allele} \\ 0 & \text{if the SNP is heterozygous} \\ 1 & \text{if the SNP is homozygous for the other allele} \end{cases}$$

2. Dominant effects: $x = -1$ if the SNP is heterozygous, … if the SNP is homozygous.
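As an illustration of the additive coding, the sketch below maps genotypes stored as minor-allele counts (0, 1, 2) to the codes −1, 0, 1. The function `recode_additive` and the toy matrix are hypothetical; this is not the `recode()` function from `Recode.R` used later in these slides, whose internals are not shown.

```r
# Hypothetical helper (NOT the recode() from Recode.R):
# map minor-allele counts 0/1/2 to the additive codes -1/0/1.
recode_additive <- function(x012) {
  stopifnot(all(x012 %in% c(0, 1, 2, NA)))
  x012 - 1
}

# Toy genotypes (made-up values): 3 individuals (rows) x 4 SNPs (columns),
# coded as the number of copies of the minor allele.
x012 <- matrix(c(0, 1, 2, 0,
                 1, 1, 0, 2,
                 2, 0, 0, 1),
               nrow = 3, byrow = TRUE)

x_add <- recode_additive(x012)
print(x_add)
```

An individual with zero copies of the minor allele is homozygous for the major allele and therefore receives the code −1.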

SLIDE 8

Examples

Continue...

    # Clear workspace
    rm(list=ls())

    # Set working directory
    setwd("C:/Users/P.P.RODRIGUEZ/Desktop/Slides Paulino/2. Gmatrix/examples/")

    source("Recode.R")
    source("Impute.R")

    Genotype_info=read.csv(file="TC-10-Genotypes-ACGT.csv", header=TRUE,
                           na.strings="?_?", stringsAsFactors=FALSE)
    entry_Genotype_info=Genotype_info$Entry
    Genotype_info=Genotype_info[,-c(1,2)]
    X=recode(Genotype_info)$X

    # Impute missing genotypes
    set.seed(123)
    out=Impute(X)

SLIDE 9

Examples

Continue...

    # Note that markers 167 and 179 are
    # monomorphic and should be excluded from the analysis
    out$monomorphic

    # Remove monomorphic markers.
    # At this point no more missing values are present
    X=out$X[,-out$monomorphic]

    # Compute p
    phat=colMeans(X)/2
    MAF=ifelse(phat<0.5,phat,1-phat)
    phat=MAF
    hist(MAF,main="")
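The same steps can be tried on a small matrix without the course files. The sketch below uses made-up genotypes and assumes that "monomorphic" means every individual carries the same genotype; how `Impute()` actually flags such markers is not shown in the slides.

```r
# Toy marker matrix coded 0/1/2 (4 individuals x 4 SNPs); made-up values.
X_toy <- matrix(c(0, 2, 1, 2,
                  1, 2, 0, 2,
                  0, 2, 2, 1,
                  1, 2, 1, 2),
                nrow = 4, byrow = TRUE)

# Assumption: a marker is monomorphic when all individuals share one genotype.
monomorphic <- which(apply(X_toy, 2, function(x) length(unique(x)) == 1))

# Drop monomorphic columns only if there are any: X[, -integer(0)]
# would otherwise return a matrix with zero columns.
X_poly <- if (length(monomorphic) > 0) {
  X_toy[, -monomorphic, drop = FALSE]
} else {
  X_toy
}

# Frequency of the allele counted by the 0/1/2 code, then folded to the MAF.
phat_toy <- colMeans(X_poly) / 2
maf_toy  <- pmin(phat_toy, 1 - phat_toy)
print(maf_toy)
```

The guard on `length(monomorphic)` matters in practice: negative indexing with an empty index vector silently removes every column.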

SLIDE 10

Examples

Continue...

[Histogram: x-axis MAF, y-axis Frequency]

Figure 2: Distribution of allele frequencies.

SLIDE 11

Examples

Computations: three ways

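    # Computing the genomic relationship matrix

    # 1. G = XX'
    G1=tcrossprod(X)

    # 2. Centered markers, divided by 2*sum(p*(1-p))
    X2=scale(X,center=TRUE,scale=FALSE)
    k=2*sum(phat*(1-phat))
    G2=tcrossprod(X2)/k

    # 3. Centered and standardized markers, divided by the number of SNPs
    X3=scale(X,center=TRUE,scale=TRUE)
    G3=tcrossprod(X3)/ncol(X3)

    heatmap(G3)
    hist(diag(G3),main="")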
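One detail worth checking: `scale()` centers each column by its sample mean, which equals 2p̂ for 0/1/2 codes, but it standardizes by the sample standard deviation rather than by sqrt(2p(1−p)) as in definition 3. The sketch below compares both versions on simulated genotypes; the data and all object names are illustrative assumptions, not the course dataset.

```r
set.seed(1)
n <- 20; p <- 50
# Simulated 0/1/2 genotypes under Hardy-Weinberg proportions (illustrative only)
freq <- runif(p, 0.1, 0.9)
X_sim <- sapply(freq, function(f) rbinom(n, size = 2, prob = f))
# Keep polymorphic markers so that no column has zero variance
X_sim <- X_sim[, apply(X_sim, 2, var) > 0, drop = FALSE]

p_hat <- colMeans(X_sim) / 2

# Definition 2: center by 2 * p_hat, divide by 2 * sum(p * (1 - p))
W <- sweep(X_sim, 2, 2 * p_hat)
G2_formula <- tcrossprod(W) / (2 * sum(p_hat * (1 - p_hat)))

# Definition 3: additionally scale each column by sqrt(2 * p * (1 - p))
Z <- sweep(W, 2, sqrt(2 * p_hat * (1 - p_hat)), "/")
G3_formula <- tcrossprod(Z) / ncol(Z)

# Same centering as scale(), so G2 agrees with the scale()-based computation
G2_scale <- tcrossprod(scale(X_sim, center = TRUE, scale = FALSE)) /
  (2 * sum(p_hat * (1 - p_hat)))
print(all.equal(G2_formula, G2_scale, check.attributes = FALSE))

# The two versions of G3 use different column scalings; report the largest gap
G3_scale <- tcrossprod(scale(X_sim)) / ncol(X_sim)
print(max(abs(G3_formula - G3_scale)))
```

Note also that k = 2Σp(1−p) is unchanged when p is replaced by 1−p, so using the MAF in place of the raw allele frequency does not affect G2.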

SLIDE 12

Examples

Exercise

1. Load the wheat dataset that we were using yesterday.
2. Compute the genomic relationship matrix using equation 1.

SLIDE 13

Examples

Continue...

[Heatmap: rows and columns are labeled with the indices of the individuals]

Figure 3: Heatmap of G matrix.

SLIDE 14

Examples

Continue...

[Histogram: x-axis diag(G3), y-axis Frequency]

Figure 4: Histogram of the diagonal elements of the G matrix.

SLIDE 15

Examples

Distance matrix

The distance matrix also appears naturally in RKHS models, which we will review in the next days:

$d_{ij} = \|x_i - x_j\|^2 = \sum_k (x_{ik} - x_{jk})^2$

Example (dist() returns Euclidean distances, so they are squared to match the formula):

    D=as.matrix(dist(X))^2
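The squared distances are closely tied to the cross-product matrix XX′, since d_ij = (XX′)_ii + (XX′)_jj − 2(XX′)_ij. The sketch below checks this identity on a simulated marker matrix; the data and object names are assumptions for illustration. R's `dist()` returns Euclidean distances (square roots), so they are squared before the comparison.

```r
set.seed(2)
# Simulated 0/1/2 marker matrix: 6 individuals x 30 SNPs (illustrative only)
X_sim <- matrix(sample(0:2, 6 * 30, replace = TRUE), nrow = 6)

# dist() returns Euclidean distances, so square them to match d_ij
D2_dist <- as.matrix(dist(X_sim))^2

# Same quantity from the cross-product matrix XX':
# d_ij = (XX')_ii + (XX')_jj - 2 (XX')_ij
K <- tcrossprod(X_sim)
D2_cross <- outer(diag(K), diag(K), "+") - 2 * K

print(all.equal(D2_dist, D2_cross, check.attributes = FALSE))
```

Once XX′ has been computed, the squared-distance matrix therefore comes at almost no extra cost.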

SLIDE 16

Big Data!

The computation of the genomic relationship matrix is straightforward if the matrix X is small. There are applications, however, where the number of markers can be very large.

SLIDE 17

Big Data!

Ober’s prediction problem

Ober et al. (2012) predict starvation stress resistance and startle response in Drosophila using p = 2.5 million SNPs and n = 192 D. melanogaster inbred lines derived by 20 generations of full-sib mating from wild-caught females from the Raleigh, North Carolina population.
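A back-of-the-envelope calculation shows why this case is demanding. Assuming the markers are stored as a dense numeric matrix of 8-byte doubles (an assumption about storage, not something stated in the slides), X needs roughly 3.8 GB while G needs only about 0.3 MB:

```r
n <- 192        # inbred lines (from the slide)
p <- 2.5e6      # SNPs (from the slide)
bytes_per_double <- 8  # assumption: dense numeric storage, 8 bytes per entry

x_gb <- n * p * bytes_per_double / 1e9   # marker matrix X, in GB
g_mb <- n * n * bytes_per_double / 1e6   # relationship matrix G, in MB

cat(sprintf("X: about %.2f GB, G: about %.2f MB\n", x_gb, g_mb))
```

The result G is tiny; the cost lies entirely in reading and multiplying the marker matrix.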

SLIDE 18

Big Data!

Continue...

Genomic relationship matrix for Ober’s data.

SLIDE 19

Big Data!

Solution

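Fortunately, the computation of the G matrix can be fully parallelized on modern CPUs:

$G_{ij} = \sum_k (x_{ik} - 2p_k)(x_{jk} - 2p_k)/c$,

where $c$ is a scaling constant (for definition 2, $c = 2\sum_k p_k(1 - p_k)$). When computing $G_{ij}$, only the genotypes of individuals $i$ and $j$ are needed.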
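A minimal sketch of this idea, assuming the `parallel` package that ships with R and taking c = 2Σp(1−p): each task computes one row of the upper triangle of G, touching only the genotypes of the individuals involved. The function name `g_by_rows` and the simulated data are hypothetical, and this is not necessarily how the authors implemented it.

```r
library(parallel)  # distributed with base R

# Hypothetical helper: compute G one row at a time; rows are independent tasks.
g_by_rows <- function(X, cores = 1L) {
  p_hat <- colMeans(X) / 2
  c_const <- 2 * sum(p_hat * (1 - p_hat))
  n <- nrow(X)
  # Task i computes G[i, j] for j >= i from the genotypes of i and j only
  rows <- mclapply(seq_len(n), function(i) {
    xi <- X[i, ] - 2 * p_hat
    vapply(i:n, function(j) sum(xi * (X[j, ] - 2 * p_hat)), numeric(1)) / c_const
  }, mc.cores = cores)
  # Assemble the symmetric matrix from the upper-triangular pieces
  G <- matrix(0, n, n)
  for (i in seq_len(n)) {
    G[i, i:n] <- rows[[i]]
    G[i:n, i] <- rows[[i]]
  }
  G
}

# Check against the direct matrix computation on simulated data
set.seed(3)
X_sim <- matrix(sample(0:2, 10 * 200, replace = TRUE), nrow = 10)
p_hat <- colMeans(X_sim) / 2
G_direct <- tcrossprod(sweep(X_sim, 2, 2 * p_hat)) / (2 * sum(p_hat * (1 - p_hat)))
G_rows <- g_by_rows(X_sim, cores = 1L)
print(all.equal(G_rows, G_direct))
```

With `cores = 1L` the code runs serially on any platform; on Unix-like systems a larger value lets `mclapply` fork worker processes, whereas values above 1 are not supported on Windows.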

SLIDE 20

Big Data!

Continue...
