  1. Computations in Animal Breeding Ignacy Misztal and Romdhane Rekaya University of Georgia

  2. Animal Breeding • Selection of more productive animals as parents • Effect => the next generation is more productive • Tools: artificial insemination + embryo transfer – A dairy sire can have >100,000 daughters! – Selection of “good” sires is important!

  3. Information for selection • DNA (active research) • Data collected on large populations of farm animals – Records contain combination of genetics, environment, and systematic effects – Need statistical methodology to obtain best prediction of genetic effects

  4. Dairy • Mostly family farms (50-1000 cows) • Mostly Holsteins • Recording on: – Production (milk, fat, protein yields) – Conformation (size, legs, udder..) – Secondary traits (somatic cell count in milk, calving ease, reproduction) • Information on > 20 million Holsteins (kept by USDA and breed associations) • Semen market is global

  5. Molecular genetics

  6. Poultry • Integrated system • Large companies • Hierarchical breeding structure • Final product - crossbreds • Useful records on up to 200k birds • Recording on: – Number and size of eggs – Growth (weights at certain ages) – Fertility – …

  7. Swine • Becoming more integrated • Hierarchical breeding structure • Final product - crossbreds • Recording on: – Growth – Litter size – Meat quality – … • Populations > 200k animals

  8. Beef cattle • Mostly family farms (5 - 50,000) • Many breeds • Final product – purebreds and crossbreds • Recording on: – growth – fertility – meat quality • Data size up to 2 million animals

  9. Results of genetic selection • Milk yield 2 times higher • Chicken – Time to maturity over 2 times shorter – Feed efficiency 2 times higher • Swine • Beef • Fish

  10. Genetic value of individual • On a population level: g = a + “rest” • Var(a) = A σ²a; A - matrix of relationships among animals, σ²a - additive variance • A is dense; A⁻¹ is sparse and easy to set up
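
A minimal Python sketch (not on the slide) of why A⁻¹ is easy to set up: Henderson's rules for the non-inbred case add at most nine coefficients per animal, directly from the pedigree. The four-animal pedigree below is made up.

```python
# Sketch: building A-inverse directly from a pedigree with Henderson's rules
# (ignoring inbreeding). Each animal contributes at most 9 coefficients,
# so A-inverse stays sparse even though A itself is dense.
import numpy as np

# hypothetical pedigree: animal -> (sire, dam), 0 = unknown parent
pedigree = {1: (0, 0), 2: (0, 0), 3: (1, 2), 4: (1, 0)}

n = len(pedigree)
Ainv = np.zeros((n, n))

for animal, (sire, dam) in pedigree.items():
    known = [p - 1 for p in (sire, dam) if p != 0]
    # relative Mendelian sampling variance: 1, 3/4, 1/2 for 0, 1, 2 known parents
    alpha = 1.0 / (1.0 - 0.25 * len(known))      # 1, 4/3, or 2
    i = animal - 1
    Ainv[i, i] += alpha
    for p in known:
        Ainv[i, p] -= alpha / 2
        Ainv[p, i] -= alpha / 2
    for p in known:
        for q in known:
            Ainv[p, q] += alpha / 4

print(np.round(Ainv, 3))
```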

  11. Example of a model • Litter weight = contemporary group + age class + genetic group + animal + e • Var(e) = I σ²e, Var(animal) = A σ²a • σ²e, σ²a - variance components

  12. Mixed model • y = Xβ + Zu + e • y - vector of records • β - vector of fixed effects • u - vector of random effects • e - vector of residuals • X, Z - design matrices • Fixed effects - usually few levels, lots of information • Random effects - usually many levels, little information

  13. Mixed model equations • [ X'R⁻¹X  X'R⁻¹Z ; Z'R⁻¹X  Z'R⁻¹Z + G⁻¹ ] [ β̂ ; û ] = [ X'R⁻¹y ; Z'R⁻¹y ] • R block diagonal with small blocks • G block diagonal with small or large blocks
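
Not from the slide, but a dense toy sketch of assembling and solving these equations for a single-trait model with R = Iσ²e and G = Aσ²a (so they reduce to the familiar λ = σ²e/σ²a form). Records, design matrices, relationship matrix and variances are all hypothetical; real programs keep everything sparse.

```python
# Sketch: Henderson's mixed model equations for a toy single-trait model
# y = Xb + Zu + e with R = I*s2e and G = A*s2a, solved densely for illustration.
import numpy as np

y = np.array([4.5, 2.9, 3.9, 3.5])                          # hypothetical records
X = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])      # 2 fixed-effect levels
Z = np.eye(4)                                               # one record per animal
A = np.array([[1., 0., .5, .5],                             # relationship matrix:
              [0., 1., .5, 0.],                             # 1, 2 founders,
              [.5, .5, 1., .25],                            # 3 = 1 x 2,
              [.5, 0., .25, 1.]])                           # 4 = 1 x unknown
s2e, s2a = 40.0, 20.0
lam = s2e / s2a                                             # variance ratio

Ainv = np.linalg.inv(A)
LHS = np.block([[X.T @ X,            X.T @ Z],
                [Z.T @ X, Z.T @ Z + Ainv * lam]])
RHS = np.concatenate([X.T @ y, Z.T @ y])

sol = np.linalg.solve(LHS, RHS)
print("fixed effects:   ", sol[:2])
print("breeding values: ", sol[2:])
```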

  14. Matrices in Mixed Model • Symmetric • Semi-positive definite • Sparse – 3-200 nonzeros per row (on average) • Can be constructed as a sum of outer products Σ Wi' Qi Wi • Wi - 1-20 × 20k-300 million, < 100 nonzeros • Qi - small square matrix (less than 20 × 20)

  15. Models • Sire model: y = cg + … + sire + e; 1000k animals ≈ 50k equations • Animal model: y = cg + … + animal + e; 1000k animals ≈ 1200k equations • Multiple trait model: y1 = cg1 + … + animal1 + … + e1, …, yn = cgn + … + animaln + … + en; 1000k animals > 3000k equations • Random regression (longitudinal) model: y = cg + … + Σ fi(x) animali + … + e; 1000k animals > 6000k equations

  16. Tasks • Estimate variance components – Usually sample of 5-50k animals • Solve mixed model equations – Populations up to 20 million animals – Up to 60 unknowns per animal

  17. Data structures for sparse matrices • Linked list • (i, j, value) triples in a hash table • IJA - row pointers into column-index and value arrays (see the sketch below) • Matrix not stored (matrix-free)
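
A small sketch of the IJA (compressed sparse row) layout: a row-pointer array into parallel column-index and value arrays. The 3×3 matrix is hypothetical; the same loops work for millions of equations.

```python
# Sketch: IJA (compressed sparse row) storage -- row pointers into parallel
# column-index and value arrays. Matrix values are hypothetical.
import numpy as np

dense = np.array([[4., 0., 1.],
                  [0., 3., 0.],
                  [1., 0., 5.]])

ia = [0]           # row pointers: row i occupies positions ia[i]:ia[i+1]
ja, a = [], []     # column indices and values of the nonzeros
for row in dense:
    for j, v in enumerate(row):
        if v != 0.0:
            ja.append(j)
            a.append(v)
    ia.append(len(ja))

print("IA:", ia)   # [0, 2, 3, 5]
print("JA:", ja)   # [0, 2, 1, 0, 2]
print("A :", a)    # [4.0, 1.0, 3.0, 1.0, 5.0]

# matrix-vector product straight from the IJA arrays
x = np.array([1., 2., 3.])
y = np.zeros(3)
for i in range(3):
    for k in range(ia[i], ia[i + 1]):
        y[i] += a[k] * x[ja[k]]
print(y)           # same as dense @ x
```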

  18. Data structures

  19. Solving strategies • Sparse factorization • Iteration – Gauss-Seidel – Gauss-Seidel + second-order Jacobi – PCG • Preconditioners from diagonal to incomplete factorization

  20. Gauss-Seidel iteration & SOR • Ax = b • Simple • Stable and self-correcting • Converges for semi-positive definite A • For balanced cross-classified models converges in one round • Small memory requirements • Hard to implement matrix-free • Slow convergence for complicated models
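
A minimal sketch of the iteration on a small hypothetical SPD system; omega = 1 gives plain Gauss-Seidel, omega > 1 gives SOR.

```python
# Sketch: Gauss-Seidel / SOR iteration for Ax = b on a small hypothetical
# symmetric positive definite system.
import numpy as np

A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 5.]])
b = np.array([1., 2., 3.])

def gauss_seidel(A, b, omega=1.0, tol=1e-10, max_rounds=1000):
    x = np.zeros_like(b)
    for it in range(max_rounds):
        x_old = x.copy()
        for i in range(len(b)):
            # update x[i] using the newest values of the other unknowns
            s = A[i, :] @ x - A[i, i] * x[i]
            x_new = (b[i] - s) / A[i, i]
            x[i] = (1 - omega) * x[i] + omega * x_new   # omega = 1 -> plain GS
        if np.linalg.norm(x - x_old) < tol:
            return x, it + 1
    return x, max_rounds

x, rounds = gauss_seidel(A, b, omega=1.0)
print(x, rounds)
print(np.linalg.solve(A, b))   # check against a direct solve
```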

  21. Preconditioned Conjugate Gradient • Large memory requirements • Tricky implementation • Easy to implement matrix-free • Usually converges a few times faster than SOR, even with a diagonal preconditioner! • Matrix-free iteration: let A = Σ Wi'Wi, with nonzeros(W) << nonzeros(A); W easily generated from data; then Ax = Σ Wi'(Wi x)
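
A sketch of matrix-free PCG with a diagonal (Jacobi) preconditioner: the product Ax is formed from the data matrix W as W'(Wx), so A = W'W is never stored. W, the right-hand side and the sparsity pattern are all made up.

```python
# Sketch: matrix-free preconditioned conjugate gradient. A = W'W is never
# formed; every matrix-vector product is computed as W'(W x).
import numpy as np

rng = np.random.default_rng(0)
n, nrec = 50, 200
W = (rng.random((nrec, n)) < 0.05) * 1.0       # sparse 0/1 "design" rows
W += np.eye(n).repeat(4, axis=0)               # make A = W'W nonsingular
b = rng.standard_normal(n)

def matvec(x):
    # A x computed as W'(W x), i.e. accumulated record by record in practice
    return W.T @ (W @ x)

diag = np.einsum('ij,ij->j', W, W)             # diag(A) for Jacobi preconditioner

def pcg(matvec, b, diag, tol=1e-10, max_iter=500):
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = r / diag
    p = z.copy()
    rz = r @ z
    for it in range(max_iter):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            return x, it + 1
        z = r / diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

x, iters = pcg(matvec, b, diag)
print(iters, np.linalg.norm(matvec(x) - b))
```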

  22. Methodologies to estimate variance components • Restricted Maximum Likelihood (REML) • Markov chain Monte Carlo (MCMC)

  23. REML • Φ - variance components (in R and G) • C* - LHS converted to full rank • Maximization approaches: – Derivative-free (use sparse factorization) – First derivative (expectation-maximization; use sparse inverse) – Second derivative: D² and E(D²) hard to compute, but [D² + E(D²)]/2 simpler - Average Information REML
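
Not from the slides: a dense toy sketch of one possible first-derivative (EM) REML iteration for the single-trait animal model, using standard textbook update formulas. Data and starting values are hypothetical; a real program would use sparse factorization and a sparse (Takahashi) inverse for the trace term.

```python
# Sketch: EM (first-derivative) REML for y = Xb + Zu + e,
# var(u) = A*s2a, var(e) = I*s2e. Dense linear algebra for illustration only.
import numpy as np

y = np.array([4.5, 2.9, 3.9, 3.5])
X = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])
Z = np.eye(4)
A = np.array([[1., 0., .5, .5],
              [0., 1., .5, 0.],
              [.5, .5, 1., .25],
              [.5, 0., .25, 1.]])
Ainv = np.linalg.inv(A)
n, p, q = len(y), X.shape[1], Z.shape[1]

s2a, s2e = 20.0, 40.0                       # starting values
for it in range(50):
    lam = s2e / s2a
    LHS = np.block([[X.T @ X,            X.T @ Z],
                    [Z.T @ X, Z.T @ Z + Ainv * lam]])
    RHS = np.concatenate([X.T @ y, Z.T @ y])
    C = np.linalg.inv(LHS)                  # sparse inverse in real programs
    sol = C @ RHS
    b_hat, u_hat = sol[:p], sol[p:]
    Cuu = C[p:, p:]
    # standard EM-REML updates (first derivatives set to zero)
    s2a = (u_hat @ Ainv @ u_hat + s2e * np.trace(Ainv @ Cuu)) / q
    s2e = (y @ y - b_hat @ (X.T @ y) - u_hat @ (Z.T @ y)) / (n - np.linalg.matrix_rank(X))

print(f"s2a = {s2a:.3f}  s2e = {s2e:.3f}")
```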

  24. Sparse matrix inversion • Takahashi method • Can obtain inverse elements only for elements where L ≠ 0 • Inverses obtained for sparse matrices as large as 1000k × 1000k • Cost ≈ 2 × sparse factorization

  25. REML, properties • Derivative-free methods reliable only for simple problems • Derivative methods: – Difficult formulas; nearly impossible for nonstandard models – High computing cost (≈ quadratic) – Convergence easy to determine

  26. Bayesian Methods and MCMC

  27. Samples

  28. Approximate p(σ²e | y)

  29. Approximate p(σ²p | y)

  30. Approximate p(σ²a | y)

  31. MCMC, properties – Much simpler formulas – Can accommodate large and complicated models – Can take months if not optimized – Details important; deciding when to stop is hard

  32. Optimization in sampling methods • 10k-1 million samples • Slow if equations regenerated each round • If the equations can be represented as [R ⊗ X1 + G ⊗ X2] x = R ⊗ y, with R, G - estimated small matrices and X1, X2, y - constant, then X1, X2 and y can be created and stored once • Requires tricks if models differ by trait or traits are missing
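
A toy Gibbs-sampling sketch of the caching idea: the constant cross-products (X'X, X'Z, Z'Z, A⁻¹, X'y, Z'y) are built once outside the chain and only combined with the current variance samples each round. The data and the naive improper priors are made up, so this illustrates the optimization, not the authors' implementation.

```python
# Sketch: Gibbs sampler for (b, u, s2a, s2e) in a toy single-trait animal model.
# The constant structure (X'X, X'Z, Z'Z, A-inverse, X'y, Z'y) is created and
# stored once; each round only re-scales it with the current variance samples.
# Naive improper priors; data are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
y = np.array([4.5, 2.9, 3.9, 3.5])
X = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])
Z = np.eye(4)
A = np.array([[1., 0., .5, .5],
              [0., 1., .5, 0.],
              [.5, .5, 1., .25],
              [.5, 0., .25, 1.]])
Ainv = np.linalg.inv(A)
n, p, q = len(y), X.shape[1], Z.shape[1]

# constant pieces, built once before the chain starts
XtX, XtZ, ZtZ = X.T @ X, X.T @ Z, Z.T @ Z
RHS = np.concatenate([X.T @ y, Z.T @ y])

s2a, s2e = 20.0, 40.0
samples = []
for it in range(2000):
    # 1. sample location parameters (b, u) | variances, y
    LHS = np.block([[XtX, XtZ], [XtZ.T, ZtZ + Ainv * (s2e / s2a)]])
    C = np.linalg.inv(LHS)
    theta = rng.multivariate_normal(C @ RHS, C * s2e)
    b, u = theta[:p], theta[p:]
    # 2. sample variances from scaled inverted chi-square conditionals
    e = y - X @ b - Z @ u
    s2a = (u @ Ainv @ u) / rng.chisquare(q)
    s2e = (e @ e) / rng.chisquare(n)
    samples.append((s2a, s2e))

post = np.array(samples[500:])               # discard burn-in
print("posterior means (s2a, s2e):", post.mean(axis=0))
```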

  33. Software • SAS (fixed models or small mixed models) • Custom • Packages – PEST, VCE (Groeneveld et al) – ASREML (Gilmour) – DMU (Jensen et al) – MATVEC (Wang et al) – Blupf90 etc. (Misztal et al)

  34. Larger data, more complicated models, simpler computing?

  35. Computing platforms • Past – Mainframe – Supercomputers • Sparse computations vectorize! • Current – PCs + workstations – Windows/Linux, Unix • Parallel and vector processing not important

  36. Random regression model on a parallel processor (Madsen et al., 1999; Lidauer et al., 1999) • Goal: compute Σ (Wi'Wi x); Wi - large sparse vectors • Approaches: a) distribute the collection of Σ (Wi'Wi)x to separate processors; b) optimize the scalar algorithm first, to Σ Wi'(Wi x) • If Wi has 30 nonzeros: a) 900 multiplications, b) 60 multiplications • Scalar optimization more important than brute-force parallelization
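
A tiny sketch of the operation counts behind a) and b), for one hypothetical row w with 30 nonzeros:

```python
# Sketch: why the order of operations matters for sum_i Wi'(Wi x).
# With 30 nonzeros per row, forming the outer product Wi'Wi already costs
# 30*30 = 900 multiplications, while Wi'(Wi x) needs only 30 + 30 = 60.
import numpy as np

n = 1000
idx = np.arange(0, 900, 30)                  # positions of 30 nonzeros in row w
vals = np.ones(30)                           # their (hypothetical) values
x = np.random.default_rng(2).standard_normal(n)

# a) (w' w) x : build the 30x30 dense block first -- 900 multiplications
block = np.outer(vals, vals)
contrib_a = np.zeros(n)
contrib_a[idx] = block @ x[idx]

# b) w' (w x) : one dot product, one scaling -- 30 + 30 multiplications
wx = vals @ x[idx]
contrib_b = np.zeros(n)
contrib_b[idx] = vals * wx

print(np.allclose(contrib_a, contrib_b))     # identical contribution to A x
```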

  37. Other Models • Censored • Survival • Threshold • …..

  38. Issues • 0.99 problem • Sophistication of statistics vs. understanding of the problem vs. data editing • Undesired responses to selection – Reduced fitness, … – Aggressiveness (swine, poultry) • Challenges of molecular genetics

  39. Molecular Genetics • Attempts to identify effects of genes on individual traits • Simple statistical methodologies • Methodology for joint analyses with phenotypic and DNA data difficult • Active research area

  40. Conclusions • Animal breeding is compute intensive – Large systems of equations – Matrices sparse • Research has large economic value
