COMP60411: Modelling Data on the Web Tree Data Models Week 2 Tim Morris & Uli Sattler University of Manchester � 1
Reminder: Plagiarism & Academic Malpractice • We assume that you have all by now successfully completed the Plagiarism and Malpractice Test • ...if you haven’t: do so before you submit any coursework (assignment or assessment) • ...because we work under the assumption that – you know what you do – you take pride in your own thoughts & your own writing – you don’t steal thoughts or words from others • ...and if you don’t, and submit coursework where you have copied other people’s work without correct attribution it costs you at least marks or more, e.g., your MSc � 2
Reminder We maintain 3 sources of information: • syllabus … /pgt/COMP60411/syllabus/ • materials … /pgt/COMP60411/ – growing continuously – with slides, reading material, etc – with TA lab times • Blackboard via myManchester – growing continuously Subscribe – Forums Read • General Contribue • Week 1, Week 2, … – Coursework � 3
Coursework - Week 1 • Q1: looks good, will look better next week, BUT … • SE1: looks mostly good • use a good spell & grammar checker! • answer the question! • We know what ER diagrams are, no need to explain them • No need to explain logical/physical model in detail! • avoid non sequiturs • M1: • … • CW1: • … • For all: • check our feedback in the rubrics • if you can’t find them, ask us in labs • start in time � 4
Today We will encounter many things: Tree data models: 1. Data Structure formalisms: JSON 2. Schema Language: JSON Schema 3. Data Manipulation: Python, JSON package General concepts: • Semi-structured data • Self-Describing • Trees • Regular Expressions • Internal & External Representation, Parsing, Serialisation • Validation, valid, … • Format � 5
Extending Last Week’s Running Example � 6
Extended Running Example • Remember last week’s example: – per person 1-3 data records, with address, phone, email, … • now combine this with management information : – who supervises/line manages whom? Management Employees Manager ID Managee ID Employee ID Postcode City … 1234123 M16 0P2 Manchester … 1234124 1234123 1234567 1234124 1234124 M2 3OZ Manchester … 1234124 1234567 1234567 SW1 A London … ... ... ... ... ... ... • … what could go wrong? • … what did go wrong? � 7
Running Example (2) • Take a few minutes and sketch this SQL query: Q1: all postcodes of 4th-level managers Employees Management Manager ID ManageeID Employee ID Postcode City … 1234123 M16 0P2 Manchester … 1234124 1234123 1234124 M2 3OZ Manchester … 1234567 1234124 1234567 SW1 A London … 1234123 1234567 ... ... ... ... ... ... � 8
Management Q1: Tricky.. ManagerID ManageeID 1234124 1234123 1234567 1234124 … … Q1’: Postcodes of all managers: SELECT Postcode FROM Employees E, Management M WHERE E.EmployeeID = M.ManagerID Q1’’: Postcode of 2nd level managers: SELECT Postcode FROM Employees E INNER JOIN (SELECT ManagerID FROM Management M1, Management M2 WHERE M1.ManageeID = M2.ManagerID) M ON E.EmployeeID = M.ManagerID … more and more joins! � 9
Running Example (2) • Take a few minutes and sketch this SQL query: Q2: “error” if we have a cyclic management structure Employees Management Manager ID ManageeID Employee ID Postcode City … 1234123 M16 0P2 Manchester … 1234124 1234123 1234124 M2 3OZ Manchester … 1234567 1234124 1234567 SW1 A London … 1234123 1234567 ... ... ... ... ... ... � 10
Management Q2: Tricky … ManagerID ManageeID 1234124 1234123 1234567 1234124 … … – Detecting management cycles of length 1: SELECT EmployeeID FROM Management M WHERE M.ManageeID = M.ManagerID – Detecting management cycles of length 2: SELECT EmployeeID FROM Employees E1 INNER JOIN (SELECT EmployeeID FROM Management M1, Management M2 WHERE M1.ManageeID = M2.ManagerID) M ON E1.EmployeeID = M.ManagerID – … where do we stop? � 11
A new example: UniProt, a Protein Database • A research community based & curated knowledge base of – 550K protein sequences, – comprising 192M amino acids – abstracted from 220K references. • Proteins largely determine how (parts of) living things work and interact – how/where diseases work • Used for a variety of research into – (causes of) diseases – genetics – (personalized) drugs – … � 12
� 13
Protein data from UniProt UniProt • provides a web query interface to Uniprot DB, – manual – programmatic • e.g., query http://www.uniprot.org/uniprot/ for ‘BRCA’ • … biologists need to integrate, share, query, analyse, and search this data • ...so what format is/should it be in? • ...or what format should it be made available in to be integrated with other data? � 14
Protein data from UniProt in as text � 15
Protein data from UniProt in a table (1) ... Protein Short Alterna Altern Altern Gene Gene Gene ... Organi Taxon 1 Taxon 2 Full Nam tive ative ative 1 2 3 sm Name e Name Name Name 1 2 3 ... Fancon BRCA BRCA BRIP1 BACH FANC Haloru Viruses dsDNA FACJ ATP- i 1- 1- 1 J brum viruses, depend anemia interac interac phage no RNA ent group J ting ting HF2 stage RNA protei protei helicase n C- n 1 BRIP1 termin ... N/A N/A N/A N/A helica N/A N/A Gallus Eukary Metazoa ATP- se gallus / ota depend Chicke ent n helicas e ... ... ... ... ... ... ... ... ... ... ... ... ... � 16
Protein data from UniProt in many tables (2) Proteins Protein Full Short Organism ... Protein-names ID Name Name Protein Alternative Name ID 1234123 Fanco FACJ Halorubru ... ni m phage anemi HF2 1234123 ATP-dependent a RNA helicase 1234567 ATP- N/A Gallus ... BRIP1 depen gallus / dent Chicken 1234123 BRCA1-interacting helicas protein C-terminal ... ... ... ... helicase 1 Protein-genes 1234123 BRCA1-interacting Protein Genes protein 1 ID 1234123 BRIP1 ... 1234123 BACH1 ... 1234567 helicas e ... too many joins! � 17
Protein data from UniProt in JSON [{"accession": "Q9NXB0-3", "id": "MKS1-3_HUMAN", "proteinExistence": "Evidence at protein level", "info": { "type": "Swiss-Prot", "created": "2006-03-07", "modified": "2019-09-18", "version": 135 }, "organism": { "taxonomy": 9606, "names": [{ "type": "scientific", "value": "Homo sapiens" }, { "type": "common", "value": "Human" }], "lineage": ["Eukaryota", "Metazoa", "Chordata", "Craniata", "Vertebrata", "Euteleostomi", "Mammalia", "Eutheria", "Euarchontoglires", "Primates", "Haplorrhini", "Catarrhini", "Hominidae", “Homo”] … � 18
Protein data from UniProt in an XML doc (1) <?xml version="1.0" encoding="UTF-8"?> <uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd"> <entry dataset="Swiss-Prot" created="2005-01-04" modified="2010-08-10" version="80"> <accession>Q9BX63</accession> <accession>Q3MJE2</accession> <accession>Q8NCI5</accession> <name>FANCJ_HUMAN</name> <protein> <recommendedName ref="1"> <fullName>Fanconi anemia group J protein</fullName> <shortName>Protein FACJ</shortName> </recommendedName> <alternativeName> <fullName>ATP-dependent RNA helicase BRIP1</fullName> </alternativeName> <alternativeName> <fullName>BRCA1-interacting protein C-terminal helicase 1</fullName> <shortName>BRCA1-interacting protein 1</shortName> </alternativeName> <alternativeName> <fullName>BRCA1-associated C-terminal helicase 1</fullName> </alternativeName> </protein> <gene> <name type="primary">BRIP1</name> <name type="synonym">BACH1</name> <name type="synonym">FANCJ</name> </gene> � 19
Recommend
More recommend