

  1. Statistical Analysis of Genetic and Phenotypic Data for Breeders: Hands-on Practical Sessions (GBLUP-RR). Paulino Pérez (ColPos, México) and José Crossa (CIMMYT, México). June 2015. CIMMYT, México-SAGPDB.

  2. Contents: (1) General comments; (2) GBLUP-Ridge Regression; (3) Application examples; (4) Biplot from marker effects; (5) Extension of BRR to include infinitesimal effect.

  3. General comments. Remember: (1) A simple model frequently used in plant breeding states that the phenotypic value of an individual ($P$) is the sum of the genetic value ($G$) and the residual environmental effect ($E$): $P = G + E$ (1), where $G$ includes additive, dominance and epistatic effects. (2) A model that includes only additive effects ($A$) can easily be derived from (1) and written as $P = A + E'$ (2), where $E'$ includes the effects that are not additive.

  4. General comments (continued). The breeding value ($BV$) of an individual can be computed from the narrow-sense heritability ($h^2$): $BV_i = \mu + h^2 (y_i - \mu)$, where $\mu$ is the mean phenotypic value of the population and $y_i$ is the phenotypic value of individual $i$. Obviously, information on parents and offspring is needed to compute this.
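
As a small worked illustration of the breeding-value formula, the sketch below uses made-up phenotypes and an assumed heritability of 0.4; neither comes from the slides, and the sample mean stands in for $\mu$.

```r
# Hypothetical phenotypes (e.g. grain yield) for five individuals
y  <- c(4.2, 5.1, 3.8, 6.0, 4.9)
h2 <- 0.4         # assumed narrow-sense heritability
mu <- mean(y)     # sample mean used as a stand-in for the population mean

# BV_i = mu + h2 * (y_i - mu): deviations from the mean are shrunk by h2
bv <- mu + h2 * (y - mu)
print(round(bv, 2))
```

Because every deviation is multiplied by the same $h^2$, the ranking of individuals is unchanged; only the spread around the mean shrinks.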

  5. General comments (continued). In Genomic Selection (GS), genetic values are approximated using linear regression (Meuwissen et al., 2001), that is:
     $$y_i = g_i + e_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + e_i \quad (3)$$
     Figure: Relationships between marker genotypes ($x_{1i}$: 0 and 1) and phenotypes ($y_i$) of the individuals (open circles) in a training population. If the marker genotype is correlated with the phenotype, segregation is modelled using the bold line (taken from Nakaya and Isobe, 2012).

  6. General comments (continued). In GS it is possible to obtain Genomic Estimated Breeding Values (GEBVs for short). This is done simply by adding up the marker effects estimated in a training population, weighted by each individual's marker genotypes, that is:
     $$GEBV_i = \sum_{j=1}^{p} x_{ij}\hat{\beta}_j \quad (4)$$
     Next we show how to obtain the predictions $\hat{y}_i$ (and in some cases $\hat{\beta}_j$) using several models.
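
Equation (4) is a matrix-vector product over all individuals at once. The sketch below uses a made-up 0/1 genotype matrix and made-up marker effects purely for illustration; the names `X`, `beta_hat` and `gebv` are not from the slides.

```r
# Hypothetical 0/1 marker genotypes: 3 individuals (rows) x 4 markers (columns)
X <- matrix(c(1, 0, 1, 1,
              0, 0, 1, 0,
              1, 1, 0, 1),
            nrow = 3, byrow = TRUE)

# Hypothetical marker effects, as if estimated in a training population
beta_hat <- c(0.5, -0.2, 0.1, 0.3)

# GEBV_i = sum_j x_ij * beta_hat_j, computed for every individual
gebv <- drop(X %*% beta_hat)
print(gebv)
```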

  7. General comments (continued). Figure 1: Graphical representation of parametric and non-parametric methods commonly used in whole-genome prediction. In this presentation we will focus on Ridge Regression.

  8. General comments (continued). Figure 2: Prior densities of the regression coefficients for markers. Curves shown: Gaussian, Double Exponential, Scaled-t (5 df) and BayesC ($\pi = 0.25$); horizontal axis $\beta_j$, vertical axis $p(\beta_j)$.

  9. GBLUP-Ridge Regression (GBLUP-RR). This is the most basic model used in GS. Let
     $$y_i = g_i + e_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + e_i .$$
     Marker effects are obtained by solving the following optimization problem:
     $$\min_{\boldsymbol{\beta}} \left\{ \Big(\mathbf{y} - \sum_j \mathbf{X}_j\beta_j\Big)'\Big(\mathbf{y} - \sum_j \mathbf{X}_j\beta_j\Big) + \lambda \sum_j \beta_j^2 \right\} \quad (5)$$
     where $\lambda > 0$ is a regularization parameter. Notes: (1) $\lambda$ is unknown and can be selected by cross-validation; (2) we need to minimize a "penalized sum of squares".

  10. GBLUP-Ridge Regression (continued). The optimization problem has a closed-form solution,
      $$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}'\tilde{\mathbf{y}},$$
      where $\tilde{\mathbf{y}} = \mathbf{y} - \mu\mathbf{1}$. Unfortunately, we need to know the value of $\lambda$ to use this solution. The problem can be solved easily within the Bayesian framework. Let $\boldsymbol{\beta} \sim N(\mathbf{0}, \sigma^2_\beta \mathbf{I})$, $\mathbf{e} \sim N(\mathbf{0}, \sigma^2_e \mathbf{I})$ and $\mathbf{u} = \mathbf{X}\boldsymbol{\beta}$; then model (3) can be written as
      $$\mathbf{y} = \mu\mathbf{1} + \mathbf{u} + \mathbf{e} \quad (6)$$
      Note that $\mathbf{u} \sim N(\mathbf{0}, \sigma^2_\beta \mathbf{X}\mathbf{X}')$. Model (6) is known as GBLUP.
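
The closed-form ridge solution can be checked numerically. The sketch below uses simulated 0/1 markers and an arbitrary, assumed value of $\lambda$ (in the Bayesian reading of the model, $\lambda$ corresponds to the ratio $\sigma^2_e / \sigma^2_\beta$); the sample mean of $\mathbf{y}$ stands in for $\mu$. It also computes the algebraically equivalent $n \times n$ form $\mathbf{X}'(\mathbf{X}\mathbf{X}' + \lambda\mathbf{I})^{-1}\tilde{\mathbf{y}}$, which works with $\mathbf{X}\mathbf{X}'$, the same matrix that appears in the covariance of $\mathbf{u}$.

```r
set.seed(1)
n <- 50; p <- 200
X <- matrix(rbinom(n * p, 1, 0.5), n, p)   # simulated 0/1 markers (not real data)
beta <- rnorm(p, sd = 0.1)                 # simulated marker effects
y <- 2 + drop(X %*% beta) + rnorm(n)

lambda  <- 10            # assumed value; in practice chosen by cross-validation
y_tilde <- y - mean(y)   # sample mean used as a stand-in for mu

# p x p form: (X'X + lambda I)^{-1} X' y_tilde
beta_hat <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y_tilde))

# Equivalent n x n form: X' (XX' + lambda I)^{-1} y_tilde
beta_hat2 <- crossprod(X, solve(tcrossprod(X) + lambda * diag(n), y_tilde))

# Largest absolute difference between the two solutions
print(max(abs(beta_hat - beta_hat2)))
```

When $p$ is much larger than $n$, as is typical with dense marker panels, the $n \times n$ form is the cheaper of the two to solve.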

  11. GBLUP-Ridge Regression: Training and testing sets. Note also that the covariance matrix of $\mathbf{u}$ involves the product $\mathbf{X}\mathbf{X}'$, which is proportional to the Genomic Relationship Matrix proposed by VanRaden (2008). We will assume that $\mathbf{u} \sim N(\mathbf{0}, \sigma^2_u \mathbf{G})$ with $\mathbf{G} = \mathbf{X}\mathbf{X}'/k$. The mixed-model equations for (6) are as follows:
      $$\begin{bmatrix} \mathbf{1}'\mathbf{1}\,\sigma_e^{-2} & \mathbf{1}'\sigma_e^{-2} \\ \mathbf{1}\,\sigma_e^{-2} & \mathbf{I}\sigma_e^{-2} + \mathbf{G}^{-1}\sigma_u^{-2} \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{\mathbf{u}} \end{bmatrix} = \begin{bmatrix} \mathbf{1}'\mathbf{y}\,\sigma_e^{-2} \\ \mathbf{y}\,\sigma_e^{-2} \end{bmatrix} \quad (7)$$
      $\hat{\mathbf{u}}$ and $\hat{\mu}$ are obtained by solving the mixed-model equations, assuming that the variance components $\sigma^2_e$ and $\sigma^2_u$ are known.

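  12. GBLUP-Ridge Regression (continued). If we have individuals for training and testing, we can partition $\mathbf{G}$, $\mathbf{u}$, $\mathbf{y}$ and $\mathbf{1}$ as follows:
      $$\mathbf{G} = \begin{bmatrix} \mathbf{G}_{11} & \mathbf{G}_{12} \\ \mathbf{G}_{21} & \mathbf{G}_{22} \end{bmatrix}, \quad \mathbf{u} = \begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix}, \quad \mathbf{1} = \begin{bmatrix} \mathbf{1}_1 \\ \mathbf{1}_2 \end{bmatrix},$$
      where 1 = individuals in the training set and 2 = individuals in the testing set. $\hat{\mu}$ and $\hat{\mathbf{u}}_1$ are obtained as the solution of the mixed-model equations
      $$\begin{bmatrix} \mathbf{1}_1'\mathbf{1}_1\,\sigma_e^{-2} & \mathbf{1}_1'\sigma_e^{-2} \\ \mathbf{1}_1\,\sigma_e^{-2} & \mathbf{I}_{11}\sigma_e^{-2} + \mathbf{G}_{11}^{-1}\sigma_u^{-2} \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{\mathbf{u}}_1 \end{bmatrix} = \begin{bmatrix} \mathbf{1}_1'\mathbf{y}_1\,\sigma_e^{-2} \\ \mathbf{y}_1\,\sigma_e^{-2} \end{bmatrix}$$
      The predictions for individuals in the testing set are given by
      $$\hat{\mathbf{y}}_2 = \hat{\mu}\mathbf{1}_2 + \mathbf{G}_{21}\mathbf{G}_{11}^{-1}\hat{\mathbf{u}}_1$$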
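
The training/testing computation can be sketched in a few lines of R. Everything below is simulated: the markers, the choice $k = p$, the split into 45 training and 15 testing individuals, and the variance components (both set to 1 and treated as known, as the slides assume). It is a minimal illustration of the equations, not the procedure used later with BGLR.

```r
set.seed(2)
n <- 60; p <- 300
X <- matrix(rbinom(n * p, 1, 0.5), n, p)   # simulated 0/1 markers
G <- tcrossprod(X) / p                     # G = XX'/k, with k = p assumed
varU <- 1; varE <- 1                       # variance components treated as known

# Simulate y = mu + u + e with u ~ N(0, varU * G)
u <- drop(X %*% rnorm(p)) * sqrt(varU / p)
y <- 10 + u + rnorm(n, sd = sqrt(varE))

trn <- 1:45; tst <- 46:60                  # hypothetical training/testing split
y1 <- y[trn]; n1 <- length(trn); one1 <- rep(1, n1)
G11 <- G[trn, trn]; G21 <- G[tst, trn]
G11inv <- solve(G11)

# Mixed-model equations for the training set
LHS <- rbind(c(n1 / varE, one1 / varE),
             cbind(one1 / varE, diag(n1) / varE + G11inv / varU))
RHS <- c(sum(y1) / varE, y1 / varE)
sol <- solve(LHS, RHS)
mu_hat <- sol[1]
u1_hat <- sol[-1]

# Predictions for the testing set: mu_hat * 1_2 + G21 G11^{-1} u1_hat
y2_hat <- mu_hat + drop(G21 %*% G11inv %*% u1_hat)
print(cor(y2_hat, y[tst]))
```

The last line prints the correlation between predicted and simulated phenotypes in the testing set, a common summary of predictive ability.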

  13. Application examples: Wheat dataset. Data for $n = 599$ wheat lines evaluated in 4 environments, from the wheat improvement program at CIMMYT. The dataset includes $p = 1279$ molecular markers ($x_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$), coded as 0/1. Pedigree information is also available. Let's load the dataset in R: (1) Load R; (2) Install the BGLR package (if not yet installed); (3) Load the package; (4) Load the data.

  14. Application examples (continued). Figure 3: Loading the BGLR package and the wheat dataset.
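
Figure 3 is not reproduced in this transcript. A minimal sketch of steps 2 to 4 might look like the following; the `requireNamespace` guard is an addition for illustration that simply skips the steps when BGLR is not installed.

```r
# install.packages("BGLR")   # step 2: run once if the package is not installed

if (requireNamespace("BGLR", quietly = TRUE)) {
  library(BGLR)        # step 3: load the package
  data(wheat)          # step 4: load the data (wheat.X, wheat.Y, wheat.A, ...)

  print(dim(wheat.X))  # marker matrix; the slides state 599 lines x 1279 markers
  print(dim(wheat.Y))  # phenotypes, one column per environment
  print(dim(wheat.A))  # pedigree-based relationship matrix
}
```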

  15. Application examples (continued). You can explore the molecular marker (MM) matrix and the pedigree matrix within R:
          fix(wheat.X)
          fix(wheat.A)

  16. Application examples (continued). Let's assume that we want to predict grain yield for environment 1 using ridge regression or, equivalently, GBLUP. We do not know the values of $\sigma^2_e$ and $\lambda$, so we estimate them from the data. We will use the function BGLR. The R code below fits the RR model using a Bayesian approach with non-informative priors for $\sigma^2_e$ and $\sigma^2_\beta$:
          rm(list=ls())
          library(BGLR)
          data(wheat)
          X=wheat.X   # marker matrix (lines x markers)
          Y=wheat.Y   # phenotypes (lines x environments)
          #Linear predictor: markers enter as Bayesian Ridge Regression (BRR)
          ETA=list(list(X=X,model="BRR"))
          #Or, with a named term:
          #ETA=list(Markers=list(X=X,model="BRR"))
          #Fit the model to grain yield in environment 1
          fmR<-BGLR(y=Y[,1],ETA=ETA,nIter=10000,burnIn=5000,thin=10)
          #Predicted versus observed phenotypes
          plot(fmR$yHat,Y[,1])
