Computations in Animal Breeding Ignacy Misztal and Romdhane Rekaya University of Georgia
Animal Breeding • Selection of more productive animals as parents • Effect: next generation is more productive • Tools: artificial insemination + embryo transfer – A dairy sire can have >100,000 daughters! – Selection of "good" sires is important!
Information for selection • DNA (active research) • Data collected on large populations of farm animals – Records contain combination of genetics, environment, and systematic effects – Need statistical methodology to obtain best prediction of genetic effects
Dairy • Mostly family farms (50-1000 cows) • Mostly Holsteins • Recording on: – Production (milk, fat, protein yields) – Conformation (size, legs, udder, ...) – Secondary traits (somatic cell count in milk, calving ease, reproduction) • Information on > 20 million Holsteins (kept by USDA and breed associations) • Semen market global
Molecular genetics
Poultry • Integrated system • Large companies • Hierarchical breeding structure • Final product - crossbreds • Useful records on up to 200k birds • Recording on: – Number and size of eggs – Growth (weights at certain ages) – Fertility – …
Swine • Becoming more integrated • Hierarchical breeding structure • Final product - crossbreds • Recording on: – Growth – Litter size – Meat quality – … • Populations > 200k animals
Beef cattle • Mostly family farms (5 - 50,000) • Many breeds • Final product – purebreds and crossbreds • Recording on: – growth – fertility – meat quality • Data size up to 2 million animals
Results of genetic selection • Milk yield 2 times higher • Chicken – Time to maturity over 2 times shorter – Feed efficiency 2 times higher • Swine • Beef • Fish
Genetic value of an individual • On a population level: g = a + "rest" • Var(a) = A σ²_a, where A is the matrix of relationships among animals and σ²_a is the additive variance • A is dense; A⁻¹ is sparse and easy to set up
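The sketch below illustrates why A⁻¹ is easy to set up: it can be assembled directly from the pedigree using Henderson's rules, here ignoring inbreeding (a real program would also account for inbreeding, e.g. with the Meuwissen-Luo algorithm). The pedigree and animal codes are made up for illustration.

```python
import numpy as np

# Hypothetical pedigree: animal -> (sire, dam); 0 means unknown parent.
# Animals are coded 1..n, parents coded before their offspring.
pedigree = {1: (0, 0), 2: (0, 0), 3: (1, 2), 4: (1, 3)}
n = len(pedigree)
Ainv = np.zeros((n, n))

for animal, (sire, dam) in pedigree.items():
    i = animal - 1
    known = [p - 1 for p in (sire, dam) if p != 0]
    # Henderson's rules (no inbreeding): the contribution of each animal
    # depends only on how many of its parents are known.
    alpha = {0: 1.0, 1: 4.0 / 3.0, 2: 2.0}[len(known)]
    Ainv[i, i] += alpha
    for p in known:
        Ainv[i, p] += -alpha / 2.0
        Ainv[p, i] += -alpha / 2.0
    for p in known:
        for q in known:
            Ainv[p, q] += alpha / 4.0

# Only a handful of entries are touched per animal, so Ainv stays sparse
# even for millions of animals (here stored dense only because n is tiny).
```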
Example of a model • Litter weight = contemporary group + age class + genetic group + animal + e • Var(e) = I σ²_e, Var(animal) = A σ²_a • σ²_e, σ²_a - variance components
Mixed model: y = Xβ + Zu + e • y - vector of records • β - vector of fixed effects • u - vector of random effects • e - vector of residuals • X, Z - design matrices • Fixed effects - usually few levels, lots of information • Random effects - usually many levels, little information
Mixed model equations: [X'R⁻¹X, X'R⁻¹Z; Z'R⁻¹X, Z'R⁻¹Z + G⁻¹] [β̂; û] = [X'R⁻¹y; Z'R⁻¹y] • R block diagonal with small blocks • G block diagonal with small or large blocks
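As a minimal illustration of the equations above, the sketch below builds the MME for a toy single-trait animal model with R = Iσ²_e and G = Aσ²_a; dividing through by σ²_e gives the familiar λ = σ²_e/σ²_a form used here. All data, design matrices, and variance values are made up, and dense algebra is used only because the example is tiny.

```python
import numpy as np

# Toy data: 4 records, 2 fixed-effect levels, 3 animals (made-up values).
y = np.array([10.0, 12.0, 11.0, 13.0])
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)              # fixed effects
Z = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)  # animal effect
Ainv = np.eye(3)             # in practice built from the pedigree as sketched earlier
sigma2_e, sigma2_a = 4.0, 2.0
lam = sigma2_e / sigma2_a    # variance ratio entering the single-trait MME

# Henderson's MME after dividing by sigma2_e:
# [X'X   X'Z            ] [b]   [X'y]
# [Z'X   Z'Z + Ainv*lam ] [u] = [Z'y]
LHS = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + Ainv * lam]])
RHS = np.concatenate([X.T @ y, Z.T @ y])

solution = np.linalg.lstsq(LHS, RHS, rcond=None)[0]  # lstsq tolerates rank deficiency
b_hat, u_hat = solution[:2], solution[2:]            # fixed-effect and breeding-value solutions
```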
Matrices in Mixed Model • Symmetric • Positive semi-definite • Sparse – 3-200 nonzeros per row (on average) • Can be constructed as a sum of outer products Σ Wᵢ' Qᵢ Wᵢ – Wᵢ: 1-20 x 20k-300 million, < 100 nonzeros – Qᵢ: small square matrix (less than 20 x 20)
Models • Sire model: y = cg + ... + sire + e; 1000k animals ≈ 50k equations • Animal model: y = cg + ... + animal + e; 1000k animals ≈ 1200k equations • Multiple-trait model: y_1 = cg_1 + ... + animal_1 + ... + e_1, ..., y_n = cg_n + ... + animal_n + ... + e_n; 1000k animals > 3000k equations • Random regression (longitudinal) model: y = cg + ... + Σ f_i(x) animal_i + ... + e; 1000k animals > 6000k equations
Tasks • Estimate variance components – Usually sample of 5-50k animals • Solve mixed model equations – Populations up to 20 million animals – Up to 60 unknowns per animal
Data structures for sparse matrices • Linked list • Triples (i, j, value) in a hash table • IJA (row pointers, column indices, values) • Matrix not stored
Data structures
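The IJA scheme from the previous slide corresponds to what scipy calls compressed sparse row (CSR) storage: one array of row pointers, one of column indices, one of nonzero values. A small illustration with a made-up matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small symmetric sparse matrix, stored in IJA / CSR form.
dense = np.array([[4.0, 1.0, 0.0],
                  [1.0, 5.0, 2.0],
                  [0.0, 2.0, 6.0]])
A = csr_matrix(dense)

print(A.indptr)   # row pointers   ("I"): [0 2 5 7]
print(A.indices)  # column indices ("J"): [0 1 0 1 2 1 2]
print(A.data)     # nonzero values ("A"): [4. 1. 1. 5. 2. 2. 6.]
```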
Solving strategies • Sparse factorization • Iteration – Gauss-Seidel – Gauss-Seidel + 2nd-order Jacobi – PCG • Preconditioners from diagonal to incomplete factorization
Gauss-Seidel Iteration & SOR: Ax = b - Simple - Stable and self-correcting - Converges for positive semi-definite A - For balanced cross-classified models converges in one round - Small memory requirements - Hard to implement matrix-free - Slow convergence for complicated models
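A minimal Gauss-Seidel/SOR sketch for Ax = b, with a dense A for clarity; omega = 1 gives plain Gauss-Seidel, omega > 1 gives SOR. The tolerance and iteration limit are illustrative choices, not values from the slides.

```python
import numpy as np

def gauss_seidel(A, b, omega=1.0, tol=1e-8, max_rounds=1000):
    """Gauss-Seidel (omega = 1) or SOR (omega > 1) iteration for A x = b."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_rounds):
        x_old = x.copy()
        for i in range(n):
            # Use already-updated components of x: this is what makes it Gauss-Seidel.
            off_diag = A[i, :] @ x - A[i, i] * x[i]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - off_diag) / A[i, i]
        if np.linalg.norm(x - x_old) < tol * (np.linalg.norm(x) + 1e-30):
            break
    return x

# Example use with a small positive definite system (made-up values):
A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
print(gauss_seidel(A, b, omega=1.2))
```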
Preconditioned Conjugate Gradient (PCG) - Large memory requirements - Tricky implementation - Easy to implement matrix-free - Usually converges a few times faster than SOR, even with a diagonal preconditioner! Matrix-free iteration: let A = Σ (Wᵢ'Wᵢ); nonzeros(W) << nonzeros(A); W easily generated from data; Ax = Σ Wᵢ'(Wᵢx)
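A sketch of matrix-free PCG along the lines above, using scipy: the coefficient matrix is never formed, only products Σ Wᵢ'(Wᵢx) are computed, and the diagonal (Jacobi) preconditioner is built without forming W'W. The data, the sizes, and the small ridge term that keeps this toy system positive definite are all made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

n = 1000
rng = np.random.default_rng(0)
# W: many short sparse rows, one per record (made-up data, 3 nonzeros per row).
rows = np.repeat(np.arange(n), 3)
cols = rng.integers(0, n, 3 * n)
W = csr_matrix((np.ones(3 * n), (rows, cols)), shape=(n, n))
b = rng.normal(size=n)

def matvec(x):
    # A x = W'(W x) + x; the "+ x" is a small ridge so the toy system is positive definite.
    return W.T @ (W @ x) + x

A = LinearOperator((n, n), matvec=matvec)

# Diagonal preconditioner: diag(W'W) + 1, computed column-wise without forming W'W.
diag = np.asarray(W.multiply(W).sum(axis=0)).ravel() + 1.0
M = LinearOperator((n, n), matvec=lambda r: r / diag)

x, info = cg(A, b, M=M)   # info == 0 indicates convergence
```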
Methodologies to estimate variance components • Restricted Maximum Likelihood (REML) • Markov chain Monte Carlo (MCMC)
REML • Φ - variance components (in R and G) • C* - LHS converted to full rank • Maximization methods: – Derivative-free (use sparse factorization) – First derivative (expectation-maximization; use sparse inverse) – Second derivative: D² and E(D²) hard to compute, but [D² + E(D²)]/2 simpler - Average Information REML
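A hedged sketch of one EM-type (first-derivative) REML round for a single-trait animal model, written with dense algebra for clarity; real programs obtain the needed inverse blocks with the sparse inversion discussed on the next slide. The inputs y, X, Z, Ainv are assumed to come from a setup like the earlier MME example, and the update formulas are the standard single-random-effect EM-REML expressions.

```python
import numpy as np

def em_reml_round(y, X, Z, Ainv, sigma2_e, sigma2_a):
    """One EM-REML update for a single-trait animal model (dense, illustrative)."""
    n, q, p = len(y), Z.shape[1], X.shape[1]
    lam = sigma2_e / sigma2_a
    LHS = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + Ainv * lam]])
    RHS = np.concatenate([X.T @ y, Z.T @ y])
    C = np.linalg.pinv(LHS)            # real programs use a sparse (Takahashi) inverse
    sol = C @ RHS
    b_hat, u_hat = sol[:p], sol[p:]
    Cuu = C[p:, p:]                    # block of the inverse for the animal effect
    # EM updates for the additive and residual variances:
    new_sigma2_a = (u_hat @ Ainv @ u_hat + sigma2_e * np.trace(Ainv @ Cuu)) / q
    new_sigma2_e = (y @ y - b_hat @ (X.T @ y) - u_hat @ (Z.T @ y)) / (n - np.linalg.matrix_rank(X))
    return new_sigma2_a, new_sigma2_e

# Iterating em_reml_round until the variances stop changing gives the REML estimates.
```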
Sparse matrix inversion • Takahashi method: can obtain inverse elements only for positions where L ≠ 0 • Inverses obtained for sparse matrices as large as 1000k x 1000k • Cost ≈ 2 x sparse factorization
REML, properties • Derivative-free methods reliable only for simple problems • Derivative methods: – Difficult formulas • Nearly impossible for nonstandard models – High computing cost (≈ quadratic) – Easy determination of convergence
Bayesian Methods and MCMC
[Figures: MCMC samples and approximate posterior densities p(σ²_e|y), p(σ²_p|y), p(σ²_a|y)]
MCMC, properties – Much simpler formulas – Can accommodate: • large models • complicated models – Can take months if not optimized – Details are important; hard to decide when to stop
Optimization in sampling methods • 10k-1 million samples • Slow if equations are regenerated each round • If the equations can be represented as [R ⊗ X1 + G ⊗ X2] x = R ⊗ y, with R, G - estimated small matrices and X1, X2, y - constant, then X1, X2 and y can be created and stored once • Requires tricks if models differ by trait or traits are missing
Software • SAS (fixed models or small mixed models) • Custom • Packages – PEST, VCE (Groeneveld et al.) – ASReml (Gilmour) – DMU (Jensen et al.) – MATVEC (Wang et al.) – BLUPF90 etc. (Misztal et al.)
Larger data, more complicated models, simpler computing?
Computing platforms • Past – Mainframe – Supercomputers • Sparse computations vectorize! • Current – PCs + workstations – Windows, Linux, Unix • Parallel and vector processing not important
Random regression model on a parallel processor (Madsen et al., 1999; Lidauer et al., 1999) • Goal: compute Σ (Wᵢ'Wᵢx); Wᵢ - large sparse vectors • Approaches: a) distribute computation of Σ (Wᵢ'Wᵢ)x to separate processors; b) optimize the scalar algorithm first, computing Σ Wᵢ'(Wᵢx) • If Wᵢ has 30 nonzeros: a) 900 multiplications, b) 60 multiplications • Scalar optimization more important than brute-force parallelization
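The 900-vs-60 count can be checked directly: applying a stored 30 x 30 block Wᵢ'Wᵢ to x costs roughly 900 multiplications per record per iteration, while computing Wᵢ'(Wᵢx) costs about 60. A toy check with made-up positions and values:

```python
import numpy as np

k = 30                        # nonzeros in one record's row w_i (illustrative)
rng = np.random.default_rng(1)
idx = rng.choice(5000, size=k, replace=False)    # positions of the nonzeros
vals = rng.normal(size=k)                        # the nonzero values
x = rng.normal(size=5000)

# a) apply the stored 30x30 block (w_i' w_i): ~900 multiplications per iteration
block = np.outer(vals, vals)
contrib_a = block @ x[idx]

# b) scalar-optimized order w_i' (w_i x): 30 + 30 = 60 multiplications per iteration
s = vals @ x[idx]             # dot product: 30 multiplications
contrib_b = vals * s          # scaling:     30 multiplications

assert np.allclose(contrib_a, contrib_b)   # both give the same contribution to A x
```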
Other Models • Censored • Survival • Threshold • …..
Issues • 0.99 problem • Sophistication of statistics vs. understanding of the problem vs. data editing • Undesired responses to selection – Reduced fitness, ... – Aggressiveness (swine, poultry) • Challenges of molecular genetics
Molecular Genetics • Attempts to identify effects of genes on individual traits • Simple statistical methodologies • Methodology for joint analyses with phenotypic and DNA data difficult • Active research area
Conclusions • Animal breeding is compute intensive – Large systems of equations – Matrices sparse • Research has large economic value