Genomic Prediction and Selection for Multi-Environments with Big Data using the BGLR statistical package
Biometrics and Statistics Unit, Global Maize and Wheat programs, CIMMyT
June, 2015
Contents
1. BGLR
2. Prediction in multi-environments
3. Models
4. Cross validation
5. Application examples
BGLR
A novel software package for whole-genome regression and prediction of continuous (censored or uncensored) and discrete traits.
Suitable for large-p, small-n problems (many more predictors than observations).
Many non-parametric and parametric models implemented in a consistent manner.
Large collection of Bayesian models included:
- Bayesian ridge regression.
- Bayesian LASSO.
- BayesA, BayesB, BayesCπ.
- Reproducing Kernel Hilbert Spaces.
- Reproducing Kernel Hilbert Spaces with Kernel-Averaging.
BGLR in a nutshell
Data equation: $y = \eta + \varepsilon$, where $\eta = 1\mu + \sum_j X_j \beta_j + \sum_l u_l$.
Priors: different priors can be assigned to the regression coefficients $\beta_j$ and the random effects $u_l$, which leads to different models.
Model fitting: MCMC algorithms (Gibbs sampler and Metropolis-Hastings), implemented efficiently.
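As a rough illustration of how the linear predictor is assembled, the sketch below adds an intercept, two sets of regression terms, and a single random-effect vector. All numbers, the helper names (`matvec`, `linear_predictor`), and the choice of two predictor sets are made up for illustration; this is not BGLR code, and BGLR samples these quantities by MCMC rather than taking them as given.

```python
# Toy computation of eta = 1*mu + sum_j X_j beta_j + u.
# Every value below is hypothetical and chosen only so the arithmetic is easy to follow.

def matvec(X, b):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * bk for x, bk in zip(row, b)) for row in X]

def linear_predictor(mu, X_sets, beta_sets, u):
    """Return eta for an intercept, several predictor sets, and one random-effect vector."""
    eta = [mu + u_i for u_i in u]           # 1*mu + u
    for X, beta in zip(X_sets, beta_sets):  # add X_j beta_j for each predictor set j
        for i, value in enumerate(matvec(X, beta)):
            eta[i] += value
    return eta

mu = 5.0
X1 = [[0, 1], [1, 2], [2, 0]]    # e.g. marker codes for 3 individuals (assumed)
beta1 = [0.5, -0.25]
X2 = [[1.0], [0.0], [1.0]]       # e.g. a single covariate (assumed)
beta2 = [2.0]
u = [0.1, -0.1, 0.0]             # one random-effect vector

eta = linear_predictor(mu, [X1, X2], [beta1, beta2], u)
print(eta)
```

Changing the prior placed on each `beta` set or on `u` changes how those values are estimated, but not how they enter the predictor.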
Prediction in multi-environments
In most agronomic traits, the effects of genes are modulated by environmental conditions, generating G × E.
Researchers working in plant breeding have developed multiple methods for accounting for and exploiting G × E in multi-environment trials.
Genomic selection is gaining ground in plant breeding. Most applications so far are based on single-environment/single-trait models.
Preliminary evidence (e.g., Burgueño et al., 2012) suggests that there is great scope for improving prediction accuracy using multi-environment models.
The ideas can be taken one step further by incorporating information on environmental covariates.
Models
Here $y_{ij}$ is the phenotype of line $j$ in environment $i$, $\mu$ is the overall mean, and $e_{ij}$ is the residual.
Model 1 (EL, Environment + Line, no pedigree): $y_{ij} = \mu + E_i + L_j + e_{ij}$
Model 2 (EA, Environment + Line, with markers): $y_{ij} = \mu + E_i + g_j + e_{ij}$
Model 3 (Environment, Line, and marker × environment interaction): $y_{ij} = \mu + E_i + g_j + Eg_{ij} + e_{ij}$
Assumptions
It is assumed that $E_i \sim N(0, \sigma^2_E)$ and $g \sim N(0, \sigma^2_g G)$, with $G$ being the genomic relationship matrix. $Eg_{ij}$ is the interaction term between genotypes and environments, with $Eg \sim N(0, \sigma^2_{Eg} (Z_g G Z_g^T) \circ (Z_E Z_E^T))$. Here $Z_g$ connects genotypes with phenotypes, $Z_E$ connects phenotypes with environments, and $\circ$ stands for the Hadamard (element-wise) product between two matrices.
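The covariance structure of the interaction term can be checked on a toy example. The sketch below builds the incidence matrices for two lines observed in two environments and forms the Hadamard product of $Z_g G Z_g^T$ and $Z_E Z_E^T$. The relationship matrix `G`, the record layout, and all helper names are assumptions made for illustration; real data would need a linear-algebra library rather than nested lists.

```python
# Toy construction of the G x E covariance structure (Z_g G Z_g') o (Z_E Z_E').
# Hypothetical data: 2 lines, 2 environments, 4 records.

def matmul(A, B):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def incidence(labels, levels):
    """One row per record, one column per level; 1.0 marks the record's level."""
    return [[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels]

lines = ["L1", "L2", "L1", "L2"]   # line of each record (assumed)
envs = ["E1", "E1", "E2", "E2"]    # environment of each record (assumed)
G = [[1.0, 0.5],
     [0.5, 1.0]]                   # made-up genomic relationship matrix

Z_g = incidence(lines, ["L1", "L2"])
Z_E = incidence(envs, ["E1", "E2"])

K_g = matmul(matmul(Z_g, G), transpose(Z_g))  # genomic relationships between records
K_E = matmul(Z_E, transpose(Z_E))             # 1 if two records share an environment
K_gE = hadamard(K_g, K_E)                     # relationships kept only within environments

for row in K_gE:
    print(row)
```

The result is block diagonal: records from the same environment keep their genomic relationship, while records from different environments get zero covariance for the interaction term. In BGLR, a matrix of this kind can be supplied as a kernel through an `ETA` entry of the form `list(K = K, model = "RKHS")`.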
Cross validation
1. CV1: Prediction of performance of newly developed lines (i.e., lines that have not been evaluated in any field trials).
2. CV2: Prediction in incomplete field trials; here the aim is to predict the performance of lines that have been evaluated in some environments but not in others.
See Figure 1.
Figure 1: Two hypothetical cross-validation schemes (CV1 and CV2) for five lines (Lines 1-5) and five environments (E1-E5), source: Jarquín et al. (2014).
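One way to realise the two schemes is to differ only in the unit that is assigned to a fold. The sketch below is a minimal version under stated assumptions: records are (line, environment) pairs, CV1 assigns whole lines to folds, and CV2 spreads each line's records over different folds so that a held-out record always has the same line observed elsewhere. The function names and the five-by-five layout are hypothetical, and the fold-assignment details used in the cited studies may differ.

```python
import random

def cv1_folds(records, k, seed=0):
    """CV1: every record of a line gets the same fold, so held-out lines are never seen in training."""
    rng = random.Random(seed)
    lines = sorted({line for line, _env in records})
    rng.shuffle(lines)
    fold_of_line = {line: i % k for i, line in enumerate(lines)}
    return [fold_of_line[line] for line, _env in records]

def cv2_folds(records, k, seed=0):
    """CV2: each line's records are spread over consecutive folds, so a line held out
    in one environment is still observed in others."""
    rng = random.Random(seed)
    by_line = {}
    for idx, (line, _env) in enumerate(records):
        by_line.setdefault(line, []).append(idx)
    folds = [None] * len(records)
    for line in sorted(by_line):
        idxs = by_line[line][:]
        rng.shuffle(idxs)
        start = rng.randrange(k)           # random starting fold for this line
        for pos, idx in enumerate(idxs):
            folds[idx] = (start + pos) % k
    return folds

# Hypothetical balanced layout: five lines in five environments, as in Figure 1.
records = [(f"Line{j}", f"E{i}") for j in range(1, 6) for i in range(1, 6)]
f1 = cv1_folds(records, k=5)
f2 = cv2_folds(records, k=5)

def folds_per_line(records, folds):
    out = {}
    for (line, _env), f in zip(records, folds):
        out.setdefault(line, set()).add(f)
    return out

print(folds_per_line(records, f1))  # one fold per line
print(folds_per_line(records, f2))  # several folds per line
```

With fold `f` held out, the training set is every record whose fold differs from `f`; under CV1 the held-out lines are absent from training, while under CV2 they remain present in other environments.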
Application examples
Example 1: Wheat dataset (Ravi, Jessica et al.)
The phenotypic information consists of grain yield for wheat in 5 mega-environments.
Table 1. Number of lines evaluated in each environment.
The problem is to predict 9,000 unobserved individuals in all the environments.
Table 2. Phenotypic correlations between environments.
To fit the models we used COP and markers (GBS).
1. COP: We computed a relationship matrix ($A$). The matrix has about 50k × 50k = 2,500,000,000 entries. With BROWSE, the program took several days to finish. With an ad hoc version of the R package pedigreemm, we got the matrix in about 3 hours.
2. Markers: Information for about 21,000 individuals and 14,000 individuals was available.
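The size of a 50k × 50k relationship matrix is worth working out explicitly, since it drives the computing cost. The sketch below assumes dense double-precision storage (8 bytes per entry), which the slides do not state; the function names are hypothetical.

```python
# Back-of-the-envelope storage for an n x n relationship matrix.
# Assumption: dense storage in double precision, 8 bytes per entry.

def dense_entries(n):
    """Number of entries in a full n x n matrix."""
    return n * n

def packed_symmetric_entries(n):
    """Entries needed if only one triangle (with the diagonal) of a symmetric matrix is stored."""
    return n * (n + 1) // 2

def size_in_gib(entries, bytes_per_entry=8):
    """Storage in GiB (2**30 bytes) for the given number of entries."""
    return entries * bytes_per_entry / 2**30

n = 50_000
full = dense_entries(n)
packed = packed_symmetric_entries(n)

print(full, round(size_in_gib(full), 1))
print(packed, round(size_in_gib(packed), 1))
```

Under these assumptions the full matrix needs 2.5 billion entries, about 20 GB (roughly 18.6 GiB), and storing only one triangle roughly halves that.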
Benchmark: Predicting 2014 using previous records
Figure 2: Predictions in testing
The real problem
Example 2: Biparental tropical maize populations (Xuecai et al.)
Genotypic and phenotypic information for about 20 biparental populations.
Low-density (about 200) and high-density (about 60,000) markers.
Individuals evaluated in several environments.
Figure 3: Results from CV1
Figure 4: Results from CV2
Collaborators in this work
J. Crossa, Susan Dreisigacker, Juan Burgueño, Paulino Pérez, G. de los Campos, X. Zhang, Jessica Rutkoski, K. Semagn, Ravi Singh, Y. Beyene, Enrique Autrique, R. Babu, Jesse Poland, F. San Vicente, Juan Carlos Alarcón, M. Olsen, Newman Samayoua
References
Burgueño, J., G. de los Campos, K. Weigel, and J. Crossa. (2012). Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Science, 52: 707-719.
Jarquín, D., J. Crossa, X. Lacaze, P. Cheyron, J. Daucourt, J. Lorgeou, F. Piraux, et al. (2014). A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theoretical and Applied Genetics, 127 (3): 595-607.