Modelling and Designing a Database Bela Tiwari btiwari@ceh.ac.uk Environmental Genomics Thematic Programme Data Centre http://envgen.nox.ac.uk
Modelling and designing a database A well thought out database will consist of a co-ordinated set of tables that satisfy a set of requirements specified by the intended users .
Modelling and designing a database • Define clearly the purpose of the database • Define the data requirements • Prepare a data model • Translate the data model into a database design • Implement the design
Modelling and designing a database • Define clearly the purpose of the database • Define the data requirements • Prepare a data model • Translate the data model into a database design • Implement the design
Modelling and designing a database PEN and PAPER • Define the purpose of the database • Define the data requirements • Prepare a data model • Translate the data model into a database design • Implement the design
Modelling and designing a database • Define clearly the purpose of the database • Define the data requirements • Prepare a data model • Translate the data model into a database design • Implement the design COMPUTER
What’s wrong with a spreadsheet? • Redundant data • Inconsistent data • Quality control • Control over effects of changing the data • Inflexible • Limited sharing of data
My made-up data table Database Sequence Type Name ID Organism Name Swissprot Protein Phosphorylase KPB1_Rabit Rabbit B kinase alpha regulatory chain UniProt Protein Troponin T Trt3_rabit Swissprot Proten Glycogen PHS2_RABIT rabbit phosphorylase Swissprot protein Troponin I TRIC_RABIT EMBL Nucleotide rabbit muscle OCPHOS2 rabbit phosphorylase mrna dbEST Nucleotide CK829726 Rabbit TrEMBL protein pol protein Q8MJF7 Rabbit EMBL Nucleotide Rabbit OXPKA Rabbit phosphorylase PDB protein_structure Glycogen 1ABB rabit Phosphorylase
Redundant data Database Sequence Type Name ID Organism Name Swissprot Protein Phosphorylase KPB1_Rabit Rabbit B kinase alpha regulatory chain UniProt Protein Troponin T Trt3_rabit Swissprot Peptide Glycogen PHS2_RABIT rabbit phosphorylase Swissprot protein Troponin I TRIC_RABIT EMBL Nucleotide rabbit muscle OCPHOS2 rabbit phosphorylase mrna dbEST Nucleotide CK829726 Rabbit TrEMBL protein pol protein Q8MJF7 Rabbit EMBL Nucleotide Rabbit OXPKA Rabbit phosphorylase PDB protein_structure Glycogen 1ABB rabit Phosphorylase
Inconsistent data Database Sequence Type Name ID Organism Name Swissprot Protein Phosphorylase KPB1_Rabit Rabbit B kinase alpha regulatory chain UniProt Protein Troponin T Trt3_rabit Swissprot Peptide Glycogen PHS2_RABIT rabbit phosphorylase Swissprot protein Troponin I TRIC_RABIT EMBL Nucleotide rabbit muscle OCPHOS2 rabbit phosphorylase mrna dbEST Nucleotide CK829726 Rabbit TrEMBL protein pol protein Q8MJF7 Rabbit EMBL Nucleotide Rabbit OXPKA Rabbit phosphorylase PDB protein_structure Glycogen 1ABB rabit Phosphorylase
Quality control Database Sequence Type Name ID Organism Name Swissprot Protein Phosphorylase KPB1_Rabit Rabbit B kinase alpha regulatory chain UniProt Protein Troponin T Trt3_rabit Swissprot Peptide Glycogen PHS2_RABIT rabbit phosphorylase Swissprot protein Troponin I TRIC_RABIT EMBL Nucleotide rabbit muscle OCPHOS2 rabbit phosphorylase mrna dbEST Nucleotide CK829726 Rabbit TrEMBL protein pol protein Q8MJF7 Rabbit EMBL Nucleotide Rabbit OXPKA Rabbit phosphorylase PDB protein_structure Glycogen 1ABB rabit Phosphorylase
Control over the effects of changing data Database Sequence Type Name ID Organism Name Swissprot Protein Phosphorylase KPB1_Rabit Rabbit B kinase alpha regulatory chain UniProt Protein Troponin T Trt3_rabit Swissprot Peptide Glycogen PHS2_RABIT rabbit phosphorylase Swissprot protein Troponin I TRIC_RABIT EMBL Nucleotide rabbit muscle OCPHOS2 rabbit phosphorylase mrna dbEST Nucleotide CK829726 Rabbit TrEMBL protein pol protein Q8MJF7 Rabbit EMBL Nucleotide Rabbit OXPKA Rabbit phosphorylase PDB protein_structure Glycogen 1ABB rabit Phosphorylase
Modelling and designing a database • Define clearly the purpose of the database • Define the data requirements • Analyse the requirements and prepare a data model • Translate the data model into a database design • Implement the design
Modelling and designing a database establish requirements Data Requirements data analysis Conceptual Data Model database design Logical Schema Specification implement schema and database
Modelling and designing a database establish requirements Data Requirements data analysis Conceptual Data Model database design Logical Schema Specification implement PEN and PAPER schema and database
A systematic approach will lead to a database that meets your needs! poor design == tears later
Establishing requirements Consult with the users of your system and agree upon • what the users need the database to do • what data needs to be stored to accomplish this Include in your considerations the following: • Data that can be derived should not be stored e.g. if you are storing the birth year of a person, you should not also store their age; it can easily be derived • The way you choose to organise the data can be affected by issues such as whether the data may be needed to answer a number of different questions, and whether you anticipate that additional types of data may be added in the future. In the end, you want your database to be useful for what it needs to do!
An artificial example Mouse Feed A Feed B Feed C m1 20g 20g m2 40g m3 10g 10g 20g Mouse FeedA Mouse FeedB Mouse FeedC m1 20g m3 10g m1 20g m3 10g m2 40g m3 10g Mouse FeedType Amount m1 FeedA 20g m1 FeedB 20g m2 FeedB 40g m3 FeedA 10g m3 FeedB 10g m3 FeedC 20g
Define the data requirements What persistent data is to be stored in the database? Write a description of what is required of the database, including the kinds of data that should be stored and what questions users would like to be able to ask. Information about the limits on the data and any issues like whether a certain type of data is required or not should also be included.
Define the data requirements Example: Initial Statement of Database Requirements A database is required that will hold information about entries from public databases of interest along with the type of features in that sequence. The users should be able to search for a database accession number, organism, or public database name and retrieve a listing of all relevant sequences. They should also be able to retrieve any sequences containing particular feature types. For each public database sequence listed, the users would like to see the URL for that database. Questions: • Does a sequence from a particular public database have to be stored in order for an entry for that public database to be added? • Do they want to store the sequence data? (implications for size and keeping up to date – can you set up an automated retrieval system from a local or remote copy of the public database?) • Do the users want to search for just organism names, or do they wish to search for strains as well? • Do they wish to store only formal names, or common names also? • Is there additional information the users would like to retrieve about the organisms, databases, features, etc that are not stated in this initial description?
Does a sequence from a particular public database have to be stored in order for an entry for that public database to be added? No Do they want to store the sequence data? No – automated retrieval preferred Do the users want to search for just organism names, or do they wish to search for strains as well? Strains as well Do they wish to store only formal names, or common names also? Common names as well Is there additional information the users would like to retrieve about the organisms, databases, features, etc that are not stated in this initial description? Yes – they would like to be able to see retrieve a list of public database names stored, along with the type of data that is stored in that database. They would also like to see if the genome sequence of an organism is available. Revised Statement of Database Requirements A database is required that will hold information about entries from public databases of interest along with the type of features in that sequence. The users should be able to search for a database accession number, organism name, formal or common , and/or strain , or public database name and retrieve a listing of all relevant sequences as well as whether the full genome sequence of the organism in question is publicly available . They should also be able to retrieve any sequences containing particular feature types. For each public database sequence listed, the users would like to see the URL for that database and the type of data it stores . The sequence data itself does not need to be stored in the database.
Recommend
More recommend