DIOGENE: A Plant Breeding Software



  1. DIOGENE A Plant Breeding Software

  2. Users:
     - Students (Master, Thesis)
     - Confirmed researchers (INRA, CIRAD, Laval University)
     - Tree breeding managers, technicians and engineers (INRA, CIRAD, CEMAGREF…)
     Present state:
     - Integration of general biometry, quantitative genetics and population genetics
     - Modular structure
     - Original models (genotype x environment interaction, selection indices, spatial statistics: Papadakis++…)
     - Usable both in interactive mode and by building complex 'processing sequences' (automatic generation of scripts)
     - Multivariate and non-orthogonal (MANOVA, selection indices, data analysis…)
     - Simultaneous processing of quantitative and qualitative (0-1) traits
     - Very fast, standardized resampling (Jackknife and Bootstrap)
     Recent improvements (Ph. Baradat and Th. Perrier, 2003-2009):
     - Porting to Fortran 95 and Linux
     - Contextual input of parameters

  3. Specifications
     - Integrated software (several programs chained together)
     - Large number of parameters, but most of them are 'guessed' from context
     - Ability to process experiments even with strong non-orthogonality
     - High speed (mandatory for resampling)

  4. Structure and use of data file records (1)
     The original data file system is adapted to resampling. The file is binary, with each datum (identifier or observation) coded in single precision (4 bytes). An associated parameter file, suffixed '.p', gives all the information needed for data processing.
     A record (X vector), held in memory at processing time, is defined by three parameters:
     - Number of identifiers (k)
     - Maximum number of individuals (z)
     - Number of traits observed per individual (q)
     [Figure: X vector = Identifier 1 … Identifier k, X11 … X1q, …, Xzq; Y vector = Identifier 1 … Identifier k, Y11 … Y1q', …, Yzq']
     The traits are referenced by their relative rank within an individual. The parser (see next slide) generates a virtual record (Y vector) with the same structure, in which the q observed traits are replaced by q' functions of these traits and/or of already defined functions (recursion).
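
The record layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not DIOGENE's actual file format: the byte order (little-endian), the ordering of values inside a record (k identifiers followed by z blocks of q traits) and the function names `record_size`, `write_record` and `read_record` are all hypothetical.

```python
import io
import struct

def record_size(k, z, q):
    """Bytes per record: k identifiers plus z individuals of q traits, 4 bytes each."""
    return 4 * (k + z * q)

def write_record(f, index, identifiers, individuals):
    """Write one fixed-size record at 0-based position `index` (direct access)."""
    k, z, q = len(identifiers), len(individuals), len(individuals[0])
    flat = list(identifiers) + [v for ind in individuals for v in ind]
    f.seek(index * record_size(k, z, q))
    # Assumption: little-endian single-precision floats for every value.
    f.write(struct.pack(f"<{len(flat)}f", *flat))

def read_record(f, index, k, z, q):
    """Read record `index` and split it into identifiers and per-individual traits."""
    size = record_size(k, z, q)
    f.seek(index * size)
    values = struct.unpack(f"<{k + z * q}f", f.read(size))
    identifiers = values[:k]
    individuals = [values[k + i * q: k + (i + 1) * q] for i in range(z)]
    return identifiers, individuals

# Two records with k=2 identifiers, z=2 individuals, q=3 traits each.
f = io.BytesIO()
write_record(f, 0, [1.0, 7.0], [[10.5, 2.0, 3.0], [11.0, 2.5, 3.5]])
write_record(f, 1, [1.0, 8.0], [[9.0, 1.5, 2.0], [12.0, 3.0, 4.0]])
ids, inds = read_record(f, 1, k=2, z=2, q=3)
```

Because every record has the same size, any record can be reached with a single seek, which is one plausible reason a fixed binary layout suits repeated resampling.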

  5. Structure and use of data file records (2)
     Schematic conception of the parser
     [Figure: the expression is coded as a sequence of tetrads (Tetrad 1, Tetrad 2, …). Each tetrad holds an operator (or an 'end stack' code), the addresses of operand 1 and operand 2, and the result or '0'.]
     Generation of binary data (presence/absence) from the 'y' studied traits
     The incidence matrix (0-1) of the binary data is managed by a specialised language or by internal routines (e.g. for molecular markers involving thousands of traits). The number of the addressed line is the rank of the studied trait; the number of the addressed column is its y value.

            1  2  3  4  5  6  7
       y1   1  0  1  0  0  1  0
       y2   0  0  1  1  0  0  1
       y3   1  0  0  0  0  0  0
       y4   0  0  0  1  1  0  0
       y5   1  1  1  0  0  0  0
       y6   0  0  0  0  0  1  1
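
To make the tetrad idea concrete, here is a minimal Python sketch of evaluating an expression stored as a list of four-field entries. It assumes each tetrad is (operator, address of operand 1, address of operand 2, address of result) and that addresses index a flat memory of values; the operator names, the `"end"` marker and the function `run_tetrads` are hypothetical, and the meaning of the '0' result field in the slide is not modeled.

```python
import math
import operator

# Hypothetical operator table; DIOGENE's real operator codes are not given in the slides.
BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
              "/": operator.truediv, "**": operator.pow}
UNARY_OPS = {"log": math.log}
END = "end"  # stand-in for the 'end stack' code

def run_tetrads(tetrads, memory):
    """Execute tetrads in order, storing each result at its result address."""
    for op, a1, a2, res in tetrads:
        if op == END:
            break
        if op in UNARY_OPS:
            memory[res] = UNARY_OPS[op](memory[a1])  # second address unused
        else:
            memory[res] = BINARY_OPS[op](memory[a1], memory[a2])
    return memory

# Memory cells 0-3: x1..x4; cell 4: constant 2; cell 5: constant pi/3; cells 6-8: work space.
memory = [1.0, 3.0, 2.0, 6.0, 2.0, math.pi / 3, 0.0, 0.0, 0.0]
tetrads = [
    ("**", 2, 4, 6),   # x3**2
    ("*", 6, 3, 6),    # x3**2 * x4
    ("**", 0, 4, 7),   # x1**2
    ("*", 7, 1, 7),    # x1**2 * x2
    ("-", 6, 7, 6),    # difference of the two products
    ("*", 6, 5, 6),    # times pi/3
    ("log", 6, 0, 8),  # natural logarithm
    (END, 0, 0, 0),
]
value = run_tetrads(tetrads, memory)[8]
```

Storing the parsed expression this way means it is analysed once and then replayed cheaply for every individual.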

  6. Structure and use of data file records (3)
     The 'y' variables are defined in the form y(j) = F[x(1), x(2), … y(i), constants]. According to this principle, the logarithm of the volume increment of a cone may be written log((x3**2*x4-x1**2*x2)*pi/3) if the initial radius and height and the final radius and height are, in that order, the four 'x' variables.
     Missing data are coded '-9' or '-5', according to whether the individual is dead or the trait could not be observed for another reason. Any individual for which at least one of the 'x' variables required to define a 'y' variable has one of these two values is excluded from the processing.
     Lastly, if n is the number of individuals in a record and n < z, a 'logical end of record' signal is coded by '9999'.
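
The rules above can be illustrated with a short Python sketch. It assumes the natural logarithm, and it assumes that the '9999' signal sits in the first trait slot of the individual following the last real one; the slide does not state where the marker is stored. The names `log_volume_increment` and `compute_y` are hypothetical.

```python
import math

DEAD = -9.0           # individual is dead
NOT_OBSERVED = -5.0   # trait could not be observed for another reason
END_OF_RECORD = 9999.0

def log_volume_increment(x):
    """log((x3**2*x4 - x1**2*x2)*pi/3) for x = (x1, x2, x3, x4)."""
    x1, x2, x3, x4 = x
    return math.log((x3 ** 2 * x4 - x1 ** 2 * x2) * math.pi / 3)

def compute_y(individuals, func=log_volume_increment):
    """Apply `func` to each individual, skipping missing data and stopping at 9999."""
    results = []
    for x in individuals:
        if x[0] == END_OF_RECORD:  # assumed position of the logical end of record
            break
        if any(v in (DEAD, NOT_OBSERVED) for v in x):
            continue               # excluded: a required 'x' variable is missing
        results.append(func(x))
    return results

record = [
    (1.0, 3.0, 2.0, 6.0),      # complete individual
    (-9.0, -9.0, -9.0, -9.0),  # dead individual
    (1.0, 3.0, -5.0, 6.0),     # one trait not observed
    (9999.0, 0.0, 0.0, 0.0),   # logical end of record
    (1.0, 3.0, 2.0, 6.0),      # never reached
]
y = compute_y(record)
```

The sketch does not guard against a non-positive volume increment, for which the logarithm is undefined.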

  7. General flowchart of programs for creation/management of field trials (1)
     [Flowchart. Programs: LENA1, LENA2, LENOR, ORION, TIMBAL, POLY, REPLAN, DEBLOC. Decision nodes (yes/no): 'Relatedness control?' and 'Two ancestors?'. Node labels: A1, A2, A'1, A'2, D1, Σ D2. Files and outputs: trial file, trial status, trial layout, updated layout, compacted layout, labels, restructured file.]

  8. General flowchart of programs for creation/management of field trials (2)
     The programs create randomized incomplete block trials that take into account the environmental constraints met in the field, with each individual located by its coordinates. The geometry of blocks and plots can be parametrized.
     Relatedness between individuals of the same block may be controlled in the case of seedling seed orchards. In this case, the program checks, for every new individual (D1) randomly drawn, that no individual among those already drawn in the block (D2) shares one or two ancestors with it, using the constraint:
     (A1 ≠ A'1) ∩ (A1 ≠ A'2) ∩ (A2 ≠ A'1) ∩ (A2 ≠ A'2)
     The algorithm for randomly drawing individuals from each genetic unit for allocation to blocks is devised so that:
     Pr[Dij] = ni / N
     where Dij is an individual or a plot of the genetic unit Di of size ni, and N is the number of individuals or plots involved in the random drawing. This principle allows the generation of optimized trials even when genetic units have very different sizes.
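
A minimal Python sketch of the constrained drawing described above is given below. It is not DIOGENE's algorithm: it assumes that A1, A2 are the two ancestors of the newly drawn individual and A'1, A'2 those of an individual already in the block, and it uses simple rejection of conflicting candidates. Drawing uniformly from the pooled individuals makes each draw fall in a genetic unit with probability proportional to its size. The function `draw_block` and the tuple layout are hypothetical.

```python
import random

def draw_block(pool, block_size, rng):
    """Draw up to `block_size` individuals with no ancestor shared inside the block.

    `pool` is a list of (individual_id, genetic_unit, (A1, A2)) tuples.
    """
    candidates = list(pool)
    rng.shuffle(candidates)  # uniform order: unit i comes up with probability n_i / N
    block, used_ancestors = [], set()
    for ind_id, unit, ancestors in candidates:
        if len(block) == block_size:
            break
        # Equivalent to A1 != A'1, A1 != A'2, A2 != A'1, A2 != A'2
        # for every individual already in the block.
        if used_ancestors.isdisjoint(ancestors):
            block.append((ind_id, unit, ancestors))
            used_ancestors.update(ancestors)
    return block

# Three families of unequal size; F1 and F2 share the ancestor P1.
pool = (
    [(f"F1-{i}", "F1", ("P1", "P2")) for i in range(5)]
    + [(f"F2-{i}", "F2", ("P1", "P3")) for i in range(3)]
    + [(f"F3-{i}", "F3", ("P4", "P5")) for i in range(2)]
)
block = draw_block(pool, 2, random.Random(42))
```

With this pool, any valid block of two individuals must pair F3 with either F1 or F2, since F1 and F2 have P1 in common.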

  9. General flowchart of programs for Biometry and Genetics
     [Flowchart. Program and module names: OPEP (supervisor), ANTAR, DEFCAR (syntactic analysis), DISTRIB, INTERG-G, INTERG-E, AJUST, AFC, ANVARM, REGM, ACP, COVARM, CORAN, AFD, INDEX, CLASS (dendrogram). Other labels: options, menus, data file, distribution study, population genetics, fixed effects, REGM on individuals, ACP on individuals, comparison of effects, rank correlation, ACP on rank correlations, ACP on effects, REGM on effects.]

  10. Some characteristics which make DIOGENE original and useful (1)
      - Modular structure ('à la carte' models)
      - Complex adjustment to the environment, including multisite trials (Papadakis++)
      - MANOVA models including individual contributions to G x E interaction
      - MANOVA + discriminant analyses corresponding to the model (e.g. diallel)
      - Selection indices including choice of predictors and target traits, with easy weighting, etc.
      - Choice of a standardized data file ('ANTAR') allowing:
        - Selective processing of selected lines (records)
        - High processing speed (important for resampling)
      - 'ANTAR' integrates:
        - Data in a binary direct-access file
        - All the information on the data (associated parameter file)

  11. Some characteristics… (2)
      - Management of data processing by 'scripts'
        - Easy to create, correct and modify (usable in different contexts)
        - Allowing creation of scripts for complex computations
      - Generalized resampling applied to chains of programs
        - Jackknife
        - Bootstrap
        - Each method may be used at the individual or genetic entry level
      - Resampling is set up by choosing:
        - The first and the last programs of the sequence
        - Where the resampling is done ('Upstream' parameter)
        - The level: individual or genetic entries (family, provenance…)
      - Other kinds of reiterated computations (Papadakis++…)
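
The difference between the two resampling levels can be sketched in Python with a plain bootstrap. This is a generic illustration (resampling individuals versus resampling whole genetic entries with all their members, often called a cluster bootstrap), not DIOGENE's implementation; the function names and the (family, value) record layout are hypothetical.

```python
import random
from collections import defaultdict

def bootstrap_individuals(records, rng):
    """Individual level: draw len(records) records with replacement."""
    return [rng.choice(records) for _ in records]

def bootstrap_entries(records, rng, key):
    """Genetic entry level: draw whole entries (e.g. families) with replacement."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    names = list(groups)
    sample = []
    for _ in names:
        # Every member of a drawn entry is kept together.
        sample.extend(groups[rng.choice(names)])
    return sample

# Records are (family, trait value) pairs; families have 3, 2 and 1 members.
records = [("A", 1.0), ("A", 2.0), ("A", 3.0), ("B", 4.0), ("B", 5.0), ("C", 6.0)]
rng = random.Random(0)
by_individual = bootstrap_individuals(records, rng)
by_family = bootstrap_entries(records, rng, key=lambda r: r[0])
```

At the entry level the sample size varies from one replicate to the next when entries have unequal sizes, because entries are drawn as whole units.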

  12. Resampling (1) - The Jackknife method
      Individuals of ranks 1 to u, u+1 to 2u, …, (k-1)u+1 to ku are discarded in turn. It is possible to discard only one individual per subsample: k = N, u = 1. If u > 1, each subsample must be representative of the total population (all levels of factors). This may be achieved by a random permutation of the initial ranks of the individuals.
      Each individual is associated with n variables y1, y2, … yn, and a general function of these variables, F(y1, y2, … yn), is computed on the population.

  13. Resampling (2) - The Jackknife method
      This function of the observations is recomputed from each subsample. The positive autocorrelation between the subsamples, which have (k-2)u individuals in common, would lead to underestimating the error variance of the parameter estimate. An unbiased estimate of this error variance (Quenouille-Tukey's estimate) is given by:

      $$\hat{S}^2 = \frac{\sum_{i=1}^{k} F_i^{*2} - \frac{1}{k}\left(\sum_{i=1}^{k} F_i^{*}\right)^2}{k(k-1)}$$

      where:
      - $F_i^{*} = kF - (k-1)F_i$ (Tukey's pseudo-value);
      - $F_i$ is the value of the parameter computed on the subsample of rank $i$, from which the individuals of ranks $u(i-1)+1$ to $ui$ are removed;
      - $F$ is the parameter's value computed on the total sample ($ku$ individuals).
      These pseudo-values are independent variables and the statistic

      $$\frac{\hat{F} - E(F)}{\hat{S}}$$

      follows Student's t distribution with k-1 degrees of freedom.
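
The grouped jackknife of the last two slides can be sketched in Python as follows. It is a minimal illustration, not DIOGENE's code: the jackknife estimate is taken here as the mean of the pseudo-values (the usual convention, not stated on the slide), any individuals beyond ku are dropped, and the name `jackknife` is hypothetical.

```python
import math
import random
import statistics

def jackknife(data, func, u=1, rng=None):
    """Delete-u jackknife; returns (estimate, standard error, degrees of freedom)."""
    data = list(data)
    if rng is not None:
        rng.shuffle(data)            # random permutation of the initial ranks
    k = len(data) // u
    data = data[: k * u]             # assumption: drop leftovers so that N = k*u
    full = func(data)                # F, computed on the total sample
    pseudo = []
    for i in range(k):
        subsample = data[: i * u] + data[(i + 1) * u:]  # ranks u*i+1 .. u*(i+1) removed
        pseudo.append(k * full - (k - 1) * func(subsample))  # Tukey's pseudo-value
    estimate = sum(pseudo) / k       # assumed: mean of the pseudo-values
    s2 = (sum(p * p for p in pseudo) - sum(pseudo) ** 2 / k) / (k * (k - 1))
    return estimate, math.sqrt(max(s2, 0.0)), k - 1

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
est, se, df = jackknife(values, statistics.mean, u=1)
est2, se2, df2 = jackknife(values, statistics.mean, u=2, rng=random.Random(1))
```

For the sample mean with u = 1 the pseudo-values are the observations themselves, so the jackknife standard error reduces to the usual standard error of the mean, which gives a quick sanity check.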
