comp60411 modelling data on the web tree data models week
play

COMP60411: Modelling Data on the Web Tree Data Models Week 2 Tim - PowerPoint PPT Presentation

COMP60411: Modelling Data on the Web Tree Data Models Week 2 Tim Morris & Uli Sattler University of Manchester 1 Reminder: Plagiarism & Academic Malpractice We assume that you have all by now successfully completed the


  1. COMP60411: Modelling Data on the Web 
 Tree Data Models Week 2 Tim Morris & Uli Sattler University of Manchester � 1

  2. Reminder: Plagiarism & Academic Malpractice • We assume that you have all by now successfully completed the 
 Plagiarism and Malpractice Test • ...if you haven’t: 
 do so before you submit any coursework (assignment or assessment) • ...because we work under the assumption that – you know what you do – you take pride in your own thoughts & your own writing – you don’t steal thoughts or words from others • ...and if you don’t, and submit coursework where you have 
 copied other people’s work without correct attribution 
 it costs you at least marks or more, e.g., your MSc � 2

  3. Reminder We maintain 3 sources of information: • syllabus … /pgt/COMP60411/syllabus/ • materials … /pgt/COMP60411/ – growing continuously – with slides, reading material, etc – with TA lab times • Blackboard via myManchester – growing continuously Subscribe – Forums Read • General Contribue • Week 1, Week 2, … – Coursework � 3

  4. Coursework - Week 1 • Q1: looks good, will look better next week, BUT … • SE1: looks mostly good • use a good spell & grammar checker! • answer the question! • We know what ER diagrams are, no need to explain them • No need to explain logical/physical model in detail! • avoid non sequiturs • M1: • … • CW1: • … • For all: • check our feedback in the rubrics • if you can’t find them, ask us in labs • start in time � 4

  5. Today We will encounter many things: Tree data models: 1. Data Structure formalisms: JSON 2. Schema Language: JSON Schema 3. Data Manipulation: Python, JSON package General concepts: • Semi-structured data • Self-Describing • Trees • Regular Expressions • Internal & External Representation, Parsing, Serialisation • Validation, valid, … • Format � 5

  6. Extending Last Week’s Running Example � 6

  7. Extended Running Example • Remember last week’s example: – per person 1-3 data records, with address, phone, email, … • now combine this with management information : – who supervises/line manages whom? Management Employees Manager ID Managee ID Employee ID Postcode City … 1234123 M16 0P2 Manchester … 1234124 1234123 1234567 1234124 1234124 M2 3OZ Manchester … 1234124 1234567 1234567 SW1 A London … ... ... ... ... ... ... • … what could go wrong? • … what did go wrong? � 7

  8. Running Example (2) • Take a few minutes and sketch this SQL query: Q1: all postcodes of 4th-level managers 
 Employees Management Manager ID ManageeID Employee ID Postcode City … 1234123 M16 0P2 Manchester … 1234124 1234123 1234124 M2 3OZ Manchester … 1234567 1234124 1234567 SW1 A London … 1234123 1234567 ... ... ... ... ... ... � 8

  9. Management Q1: Tricky.. ManagerID ManageeID 1234124 1234123 1234567 1234124 … … Q1’: Postcodes of all managers: SELECT Postcode FROM Employees E, Management M WHERE E.EmployeeID = 
 M.ManagerID Q1’’: Postcode of 2nd level managers: SELECT Postcode FROM Employees E INNER JOIN (SELECT ManagerID 
 FROM Management M1, Management M2 WHERE M1.ManageeID = M2.ManagerID) M ON E.EmployeeID = M.ManagerID … more and more joins! � 9

  10. Running Example (2) • Take a few minutes and sketch this SQL query: Q2: “error” if we have a cyclic management structure Employees Management Manager ID ManageeID Employee ID Postcode City … 1234123 M16 0P2 Manchester … 1234124 1234123 1234124 M2 3OZ Manchester … 1234567 1234124 1234567 SW1 A London … 1234123 1234567 ... ... ... ... ... ... � 10

  11. Management Q2: Tricky … ManagerID ManageeID 1234124 1234123 1234567 1234124 … … – Detecting management cycles of length 1: SELECT EmployeeID FROM Management M WHERE M.ManageeID = 
 M.ManagerID – Detecting management cycles of length 2: SELECT EmployeeID FROM Employees E1 INNER JOIN (SELECT EmployeeID 
 FROM Management M1, Management M2 WHERE M1.ManageeID = M2.ManagerID) M ON E1.EmployeeID = M.ManagerID – … where do we stop? � 11

  12. A new example: UniProt, a Protein Database • A research community based & curated knowledge base of – 550K protein sequences, – comprising 192M amino acids – abstracted from 220K references. • Proteins largely determine how (parts of) living things 
 work and interact – how/where diseases work • Used for a variety of research into – (causes of) diseases – genetics – (personalized) drugs – … � 12

  13. � 13

  14. Protein data from UniProt UniProt • provides a web query interface to Uniprot DB, – manual – programmatic • e.g., query http://www.uniprot.org/uniprot/ for ‘BRCA’ • … biologists need to integrate, share, query, analyse, and search this data • ...so what format is/should it be in? • ...or what format should it be made available in to be integrated with other data? � 14

  15. Protein data from UniProt in as text � 15

  16. Protein data from UniProt in a table (1) ... Protein Short Alterna Altern Altern Gene Gene Gene ... Organi Taxon 1 Taxon 2 Full Nam tive ative ative 1 2 3 sm Name e Name Name Name 1 2 3 ... Fancon BRCA BRCA BRIP1 BACH FANC Haloru Viruses dsDNA FACJ ATP- i 1- 1- 1 J brum viruses, depend anemia interac interac phage no RNA ent group J ting ting HF2 stage RNA protei protei helicase n C- n 1 BRIP1 termin ... N/A N/A N/A N/A helica N/A N/A Gallus Eukary Metazoa ATP- se gallus / ota depend Chicke ent n helicas e ... ... ... ... ... ... ... ... ... ... ... ... ... � 16

  17. Protein data from UniProt in many tables (2) Proteins Protein Full Short Organism ... Protein-names ID Name Name Protein Alternative Name ID 1234123 Fanco FACJ Halorubru ... ni m phage anemi HF2 1234123 ATP-dependent a RNA helicase 1234567 ATP- N/A Gallus ... BRIP1 depen gallus / dent Chicken 1234123 BRCA1-interacting helicas protein C-terminal ... ... ... ... helicase 1 Protein-genes 1234123 BRCA1-interacting Protein Genes protein 1 ID 1234123 BRIP1 ... 1234123 BACH1 ... 1234567 helicas e ... too many joins! � 17

  18. Protein data from UniProt in JSON [{"accession": "Q9NXB0-3", "id": "MKS1-3_HUMAN", "proteinExistence": "Evidence at protein level", "info": { "type": "Swiss-Prot", "created": "2006-03-07", "modified": "2019-09-18", "version": 135 }, "organism": { "taxonomy": 9606, "names": [{ "type": "scientific", "value": "Homo sapiens" }, { "type": "common", "value": "Human" }], "lineage": ["Eukaryota", "Metazoa", "Chordata", "Craniata", "Vertebrata", "Euteleostomi", "Mammalia", "Eutheria", "Euarchontoglires", "Primates", "Haplorrhini", "Catarrhini", "Hominidae", “Homo”] … � 18

  19. Protein data from UniProt in an XML doc (1) <?xml version="1.0" encoding="UTF-8"?> 
 <uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd"> 
 <entry dataset="Swiss-Prot" created="2005-01-04" modified="2010-08-10" version="80"> 
 <accession>Q9BX63</accession> 
 <accession>Q3MJE2</accession> 
 <accession>Q8NCI5</accession> 
 <name>FANCJ_HUMAN</name> 
 <protein> 
 <recommendedName ref="1"> 
 <fullName>Fanconi anemia group J protein</fullName> 
 <shortName>Protein FACJ</shortName> 
 </recommendedName> 
 <alternativeName> 
 <fullName>ATP-dependent RNA helicase BRIP1</fullName> 
 </alternativeName> 
 <alternativeName> 
 <fullName>BRCA1-interacting protein C-terminal helicase 1</fullName> 
 <shortName>BRCA1-interacting protein 1</shortName> 
 </alternativeName> 
 <alternativeName> 
 <fullName>BRCA1-associated C-terminal helicase 1</fullName> 
 </alternativeName> 
 </protein> 
 <gene> 
 <name type="primary">BRIP1</name> 
 <name type="synonym">BACH1</name> 
 <name type="synonym">FANCJ</name> 
 </gene> 
 � 19

Recommend


More recommend