Introduction to Databases Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de
Structure ■ Motivation ■ Types of Databases ■ Introduction to Relational Databases ■ Important Concepts of MySQL ■ Usage and Example Queries
Motivation
Motivation ■ Interaction with frontends / GUIs ■ manual submission of queries, ■ accessing information in the web browser or via tools ■ includes frequent copy&paste of intermediate results, IDs, … ■ switiching between websites, tools ■ Advantages ■ easy to retrieve information if website/tool is well-designed ■ no programming necessary ■ Disadvantages ■ no batch processing possible in many cases ■ queries limited by interface provided by web page / tool ■ difficult/impossible to integrate information from different sites How can we directly access databases?
What actually happens when you retrieve data online? Client Message Web Server 1 à Get home page à Get it and send it 2 ß Send home page ß 3 * Display home page à Send query à Start program that * enter query evaluates query by * press submit accessing database… 4 Display result Send result … Compose result web ß ß page and send it
The flood of biomedical data… ■ Since 1980, the number and size of biomedical databases has been growing exponentially. ■ How can you find sources of information you are seeking? ■ Nucleic Acids Research Database Issue in January of every year (oxfordjournals.org/nar) ■ Wikipedia Article: List of Biological Databases Susan B. Davidson, Biol537/CIS636, Fall 2003
Early Databank Format of PDB Early Databank Format of PDB ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70 ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70 Exemplary PDB ATOM record Exemplary PDB ATOM record http://ibiblio.org/comphist/files/images/CardReaderPunch.jpg http://ibiblio.org/comphist/files/images/CardReaderPunch.jpg
Types of Databases
What databases are about ■ Logical organization of data (focused on application needs) ■ data models, schema design, dictionaries ■ Physical organization of data (focused on database needs) ■ Fast retrieval, indexing, compact storage of data ■ Other requirements: ■ Logging (important to know who did what to the data) ■ Security and access control (important to know who can do what) ■ Transactions and concurrency control (important when more than one person is working on database) ■ Integrity (important to ensure that only valid entries in the database) ■ Recovery (important as hardware and software can sometimes fail
Different types of databases ■ Flat files ■ XML ■ Relational database ■ Object databases ■ Object relational databases ■ Graph databases
Flat files ID BTBPTIG standard; genomic DNA; MAM; 3998 BP. XX ■ We can store any data in a flat AC X03365; K00966; XX file, e.g. EMBL SV X03365.1 XX DT 18-NOV-1986 (Rel. 10, Created) ■ But is this a database? DT 20-MAY-1992 (Rel. 31, Last updated, Version 3) XX ■ Logical data organisation: DE Bovine pancreatic trypsin inhibitor (BPTI) gene XX None, unless we define one KW Alu-like repetitive sequence; protease inhibitor; trypsin inhibitor. (as done for EMBL) and adhere XX OS Bos taurus (cow) to it, which is not enforced OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; ■ Physical data organisation: OC Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; None , we cannot optimise OC Bos. XX retrieval for common queries RN [1] RP 1-3998 ■ Logging: No RX MEDLINE; 86158754. RX PUBMED; 2420326. ■ Access control: Implicit RA Kingston I.B., Anderson S.; RT "Sequences encoding two trypsin inhibitors occur in through Unix strikingly similar RT genomic environments"; ■ Transaction and concurrency RL Biochem. J. 233(2):443-450(1986). XX control: None RN [2] RX MEDLINE; 84070725. ■ Integrity: None RX PUBMED; 6580617. RA Anderson S., Kingston I.B.; ■ Recovery: If files are backed- RT "Isolation of a genomic clone for bovine pancreatic trypsin inhibitor by up they can be recovered. RT using a unique-sequence synthetic DNA probe."; However, not on the fly
XML files <Article> <Journal> <ISSN> 0270-7306 </ISSN> <JournalIssue> <Volume> 19 </Volume> <Issue> 11 </Issue> ■ We can store any data in XML, <PubDate> the eXtensible Mark-up <Year> 1999 </Year> Language, e.g. Medline <Month> Nov </Month> </PubDate> ■ But is this a database? </JournalIssue> ■ Logical data organisation: </Journal> yes, XML schema, which is <ArticleTitle> Differential regulation of the cell wall enforced integrity mitogen-activated protein kinase pathway in budding yeast by the protein tyrosine phosphatases ■ Physical data organisation: Ptp2 and Ptp3. None , we cannot optimise </ArticleTitle> retrieval for common queries <Pagination> ■ Logging: No <MedlinePgn> 7651-60 </MedlinePgn> ■ Access control: Implicit </Pagination> <Abstract> through Unix <AbstractText> Mitogen-activated protein kinases ■ Transaction and concurrency (MAPKs) are inactivated by dual-specificity and control: None protein tyrosine phosphatases (PTPs) in yeasts. In ■ Integrity: None Saccharomyces cerevisiae, two PTPs, Ptp2 and ■ Recovery: If files are backed- Ptp3, inactivate the MAPKs, Hog1 and Fus3, with different specificities... </AbstractText> up they can be recovered. </Abstract> However, not on the fly <Affiliation> Department of Chemistry, University of Colorado, Boulder, Colorado 80309-0215, USA. </Affiliation> …
Introduction to Relational Databases +-------+------+---------+ | id | prof | subject | +-------+------+---------+ +-------+------+---------+ | id | name | subject | | 51221 | bert | bio | +-------+------+---------+ | 55435 | anne | math | | 46458 | rick | bio | +-------+------+---------+ | 46459 | gerd | bio | | 46460 | mary | bio | | 46461 | ella | math | | 14982 | anne | math | +---------+------+-----+------+ | 46462 | paul | math | | subject | room | day | time | +-------+------+---------+ +---------+------+-----+------+ | bio | A | mo | 3pm | | math | B | tue | 1pm | +---------+------+-----+------+
Relational Database ■ Central Idea: Data as relations in a table ■ E.g. SCOP, Structural Classification of Proteins +-------+------+---------+---------+--------------------------------------+ | id | type | sccs | sid | description | +-------+------+---------+---------+--------------------------------------+ | 46457 | cf | a.1 | - | Globin-like | | 46458 | sf | a.1.1 | - | Globin-like | | 46459 | fa | a.1.1.1 | - | Truncated hemoglobin | | 46460 | dm | a.1.1.1 | - | Truncated hemoglobin | | 46461 | sp | a.1.1.1 | - | Ciliate (Paramecium caudatum) | | 14982 | px | a.1.1.1 | d1dlwa_ | 1dlw A: | | 46462 | sp | a.1.1.1 | - | Green alga (Chlamydomonas eugametos) | | 14983 | px | a.1.1.1 | d1dlya_ | 1dly A: | | 63437 | sp | a.1.1.1 | - | Mycobacterium tuberculosis | | 62301 | px | a.1.1.1 | d1idra_ | 1idr A: | +-------+------+---------+---------+--------------------------------------+
Relational Database ■ Central Idea: Data as relations in a table ■ E.g. Employee +-------+------+---------+---------+ | id | name | salary | role | +-------+------+---------+---------+ | 46457 | pete | 50.000 | director| | 46458 | jane | 60.000 | nurse | | 46459 | asif | 70.000 | driver | +-------+------+---------+---------+
Relational Database ■ Central Idea: Data as relations in a table ■ E.g. pets +----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+
Relational Database ■ Central Idea: Data as relations in a table ■ E.g. school +-------+------+---------+ | id | prof | subject | +-------+------+---------+ +-------+------+---------+ | id | name | subject | | 51221 | bert | bio | +-------+------+---------+ | 55435 | anne | math | | 46458 | rick | bio | +-------+------+---------+ | 46459 | gerd | bio | | 46460 | mary | bio | | 46461 | ella | math | | 14982 | anne | math | +---------+------+-----+------+ | 46462 | paul | math | | subject | room | day | time | +-------+------+---------+ +---------+------+-----+------+ | bio | A | mo | 3pm | | math | B | tue | 1pm | +---------+------+-----+------+
Recommend
More recommend