Programming in Python
Lecture 8: Python Online
Michael Schroeder, Melissa Adasme
Motivation: Access to Web Resources
Are wildcards possible? Can I filter somewhere? Can I combine two different searches?
In most cases NO, since web GUIs are simplified access points to the data!
Solution: Programmatic Access
Use programmatic access via power user gateways.
Example query (URL):
https://www.ebi.ac.uk/chembl/api/data/molecule?molecule_properties__mw_freebase__lte=300&pref_name__iendswith=nib
1 Filtering by selected properties (here: molecule_properties__mw_freebase__lte=300)
2 Combination of different criteria (here: joined with &)
3 Wildcards / search for substrings (here: pref_name__iendswith=nib)
Schema of ChEMBL data: https://www.ebi.ac.uk/chembl/api/data/molecule/schema
HTTP/REST
• HTTP (Hypertext Transfer Protocol) is a protocol/architecture for the internet
• It specifies how data can be transferred between machines in a network
• It defines several methods, e.g. GET, POST and DELETE
• REST (Representational State Transfer) describes how the architecture of HTTP can/should be used as a uniform interface
• REST or REST-like structures are available in many web service APIs
• A request is usually defined by a URL (web address) and an HTTP method (action on that address)
http://biowebsitexyz.com/pug/proteins
  GET: List all proteins
  POST: Create a new protein entry (with data sent to the server). The data is sent separately here, and the server creates a new URL.
http://biowebsitexyz.com/pug/proteins/p21
  GET: Get the data for protein 21
  DELETE: Delete the entry for protein 21 on the server
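The pairing of URL and HTTP method can be sketched with the standard-library class urllib.request.Request. This is only an illustration: the URLs are the placeholder service from the slide, the POST payload is hypothetical, and nothing is actually sent over the network.

```python
from urllib.request import Request

BASE = "http://biowebsitexyz.com/pug/proteins"  # placeholder service from the slide

# Same URL, different methods: the method decides the action.
list_all = Request(BASE, method="GET")
create = Request(BASE, data=b"name=example", method="POST")  # hypothetical payload

# A more specific URL addresses a single entry.
get_one = Request(BASE + "/p21", method="GET")
delete_one = Request(BASE + "/p21", method="DELETE")

# Only inspect the requests; passing one to urlopen() would send it.
for req in (list_all, create, get_one, delete_one):
    print(req.get_method(), req.full_url)
```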
Where can I use it?
Non-biological databases and services etc.
Biological databases and services
• UniProt (Sequences)
• Enrichr (Ontology Enrichment)
• PubMed (Literature)
• PubChem, ChEMBL (Chemical structures)
• PDB (Structures)
• etc.
Constructing Queries
Just the base URL for the service:
http://biowebsitexyz.com/pug/proteins
  GET: List all proteins
Simple filter:
http://biowebsitexyz.com/pug/proteins?num_aa_gte=100
  GET: List all proteins with at least 100 amino acids
Multiple criteria:
http://biowebsitexyz.com/pug/proteins?num_aa_gte=100&organism=homo_sapiens
  GET: List all human proteins with at least 100 amino acids
We will focus on GET queries, since you will mostly just need to read data from servers.
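The three query forms above can be assembled in Python rather than typed by hand. The following is a minimal sketch using urllib.parse.urlencode; build_query is a hypothetical helper name, and the base URL is still the slide's placeholder service.

```python
from urllib.parse import urlencode

BASE = "http://biowebsitexyz.com/pug/proteins"  # placeholder service from the slide


def build_query(base, **filters):
    """Return the base URL, followed by '?key=value&...' if filters are given."""
    if not filters:
        return base
    return base + "?" + urlencode(filters)


print(build_query(BASE))
print(build_query(BASE, num_aa_gte=100))
print(build_query(BASE, num_aa_gte=100, organism="homo_sapiens"))
```

urlencode also escapes characters that are not allowed in a URL, which becomes relevant once search terms contain spaces or brackets.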
Revision: XML files
■ We can store any data in XML, the eXtensible Mark-up Language, e.g. Medline
■ Logical data organisation: yes, XML schema, which is enforced
■ Physical data organisation: none, we cannot optimise retrieval for common queries
■ Hierarchical organisation
■ Commonly used as an exchange format for data
Example (Medline record):
<Article>
  <Journal>
    <ISSN> 0270-7306 </ISSN>
    <JournalIssue>
      <Volume> 19 </Volume>
      <Issue> 11 </Issue>
      <PubDate>
        <Year> 1999 </Year>
        <Month> Nov </Month>
      </PubDate>
    </JournalIssue>
  </Journal>
  <ArticleTitle> Differential regulation of the cell wall integrity mitogen-activated protein kinase pathway in budding yeast by the protein tyrosine phosphatases Ptp2 and Ptp3. </ArticleTitle>
  <Pagination>
    <MedlinePgn> 7651-60 </MedlinePgn>
  </Pagination>
  <Abstract>
    <AbstractText> Mitogen-activated protein kinases (MAPKs) are inactivated by dual-specificity and protein tyrosine phosphatases (PTPs) in yeasts. In Saccharomyces cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate the MAPKs, Hog1 and Fus3, with different specificities... </AbstractText>
  </Abstract>
  <Affiliation> Department of Chemistry, University of Colorado, Boulder, Colorado 80309-0215, USA. </Affiliation>
  …
See also lecture 2
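To illustrate the hierarchical organisation, the sketch below walks down a shortened copy of the record with Python's standard-library xml.etree.ElementTree. The shortened record and the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Medline record shown on the slide.
record = """
<Article>
  <Journal>
    <ISSN>0270-7306</ISSN>
    <JournalIssue>
      <Volume>19</Volume>
      <Issue>11</Issue>
      <PubDate><Year>1999</Year><Month>Nov</Month></PubDate>
    </JournalIssue>
  </Journal>
  <Pagination><MedlinePgn>7651-60</MedlinePgn></Pagination>
</Article>
"""

article = ET.fromstring(record.strip())

# Each path follows the nesting of the tags, one level per "/".
volume = article.findtext("Journal/JournalIssue/Volume")
year = article.findtext("Journal/JournalIssue/PubDate/Year")
pages = article.findtext("Pagination/MedlinePgn")
print(volume, year, pages)
```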
Application I: What's the most recent article from the Schroeder group?
PubMed: https://www.ncbi.nlm.nih.gov/pubmed
API overview: https://www.ncbi.nlm.nih.gov/home/develop/api/
Step 1: First we run the main query to obtain all articles from the group (with the author name Michael Schroeder):
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Michael+Schroeder%5Bauthor%5D
The result contains the ID of the last article published!
Documentation at https://www.ncbi.nlm.nih.gov/pmc/tools/developers/
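Step 1 could be sketched in Python as follows. The helper names are hypothetical, the standard-library ElementTree parser is used, and the sample response is a hand-written, heavily abbreviated stand-in for a real esearch result (only the Id element matters here). search_ids is the part that would contact the server and is not called in the demo.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="pubmed"):
    """Build the esearch URL; urlencode escapes spaces and brackets."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


def extract_ids(result_xml):
    """Return the text of every <Id> element, in document order."""
    root = ET.fromstring(result_xml)
    return [element.text for element in root.iter("Id")]


def search_ids(term):
    """Run the query against the server (network access required)."""
    with urlopen(esearch_url(term)) as response:
        return extract_ids(response.read())


# Hand-written, abbreviated sample; the ID is the one used in step 2.
sample = "<eSearchResult><IdList><Id>31811259</Id></IdList></eSearchResult>"
print(esearch_url("Michael Schroeder[author]"))
print(extract_ids(sample))
```

Following the slide, the first ID in the list is the one treated as the most recent article.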
Step 2: Then, using the article ID, we get the details for it:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31811259&format=xml
The XML result contains the title of the article.
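Step 2 might look like the sketch below. It assumes the title sits in an ArticleTitle element, as in the Medline record from the XML revision slide. The helper names and the sample document with its placeholder title are made up for illustration, and fetch_title is not called because it needs network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid):
    """Build the efetch URL with the parameters used on the slide."""
    return EFETCH + "?" + urlencode({"db": "pubmed", "id": pmid, "format": "xml"})


def extract_title(article_xml):
    """Return the text of the first <ArticleTitle>, or None if absent."""
    root = ET.fromstring(article_xml)
    node = root.find(".//ArticleTitle")
    if node is None:
        return None
    # itertext() also collects text inside nested markup such as <i>.
    return "".join(node.itertext()).strip()


def fetch_title(pmid):
    """Download the record and extract its title (network access required)."""
    with urlopen(efetch_url(pmid)) as response:
        return extract_title(response.read())


# Hand-written sample with a placeholder title, not real PubMed data.
sample = "<Article><ArticleTitle>Placeholder <i>title</i></ArticleTitle></Article>"
print(efetch_url(31811259))
print(extract_title(sample))
```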
Application II: ChEMBL
Find compounds with desired properties
Web interface: https://www.ebi.ac.uk/chembl
Web services documentation: https://chembl.gitbook.io/chembl-interface-documentation/web-services/chembl-data-web-services
Not the same for all web services!
Step 1: Let's find compounds with a name ending in "rin" and a MW between 150 and 200:
https://www.ebi.ac.uk/chembl/api/data/molecule?molecule_properties__mw_freebase__gte=150&molecule_properties__mw_freebase__lte=200&pref_name__iendswith=rin
Result: Aspirin!
Canonical SMILES: CC(=O)Oc1ccccc1C(=O)O
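A possible Python version of this query is sketched below. The filter names are copied from the URL above. The helper names are hypothetical, and the sample response is a hand-written, abbreviated stand-in that assumes each hit is a molecule element containing pref_name and canonical_smiles somewhere beneath it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

CHEMBL_MOLECULE = "https://www.ebi.ac.uk/chembl/api/data/molecule"


def molecule_query(name_suffix, mw_min, mw_max):
    """Build the filter URL: name ends with name_suffix, mw_min <= MW <= mw_max."""
    params = {
        "molecule_properties__mw_freebase__gte": mw_min,
        "molecule_properties__mw_freebase__lte": mw_max,
        "pref_name__iendswith": name_suffix,
    }
    return CHEMBL_MOLECULE + "?" + urlencode(params)


def names_and_smiles(result_xml):
    """Return (name, SMILES) pairs for every <molecule> element found."""
    root = ET.fromstring(result_xml)
    return [
        (mol.findtext(".//pref_name"), mol.findtext(".//canonical_smiles"))
        for mol in root.iter("molecule")
    ]


# Hand-written, abbreviated sample; the real result holds many more fields.
sample = (
    "<response><molecules><molecule>"
    "<pref_name>ASPIRIN</pref_name>"
    "<molecule_structures>"
    "<canonical_smiles>CC(=O)Oc1ccccc1C(=O)O</canonical_smiles>"
    "</molecule_structures>"
    "</molecule></molecules></response>"
)
print(molecule_query("rin", 150, 200))
print(names_and_smiles(sample))
```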
Step 2: Let's find another molecule with aspirin as a substructure:
https://www.ebi.ac.uk/chembl/api/data/substructure/CC(=O)Oc1ccccc1C(=O)O
(XML result data not shown)
Query: Aspirin, CC(=O)Oc1ccccc1C(=O)O
Second hit: CHEMBL7666
Documentation at https://www.ebi.ac.uk/chembl/ws
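SMILES strings can contain characters such as "#" or "/" that have a special meaning inside a URL. One cautious way to build the substructure URL in Python is to percent-encode the SMILES first, as sketched below. This assumes the server decodes percent-encoded paths in the usual way, and substructure_url is a hypothetical helper name.

```python
from urllib.parse import quote

SUBSTRUCTURE = "https://www.ebi.ac.uk/chembl/api/data/substructure/"


def substructure_url(smiles):
    """Append the SMILES to the endpoint, escaping every reserved character."""
    return SUBSTRUCTURE + quote(smiles, safe="")


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(substructure_url(aspirin))

# "#" (triple bond) would otherwise be read as the start of a URL fragment.
print(quote("C#N", safe=""))
```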
Important Information
With great power comes great responsibility!
• Read the documentation of each service you are using
• Sometimes you will need keys to get access
• Don't send too many requests to the server (you could crash it or be blocked)
• Some services don't allow parallel requests
Example, the PubChem usage policy:
"USAGE POLICY: Please note that PUG REST is not designed for very large volumes (millions) of requests. We ask that any script or application not make more than 5 requests per second, in order to avoid overloading the PubChem servers. If you have a large data set that you need to compute with, please contact us for help on optimizing your task, as there are likely more efficient ways to approach such bulk queries."
https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html
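One simple way to stay below a limit such as 5 requests per second is to enforce a minimum pause between consecutive requests. The class below is a minimal, single-threaded sketch of that idea. Throttle is a hypothetical helper and not part of any library, and the actual request is left as a comment.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
                now = time.monotonic()
        self._last = now


throttle = Throttle(max_per_second=5)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # call this right before each request
    # urlopen(url) would go here
elapsed = time.monotonic() - start
print(f"3 requests took at least {elapsed:.2f} s")
```

The first call passes immediately and each later call waits until 0.2 s have passed since the previous one.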
Web Resources in Python
Part I: Choosing your tools
• urllib library for fetching web resources
• lxml for parsing XML result files
Simple example: Extract all authors for a paper

# Import the libraries
from urllib.request import urlopen  # function to open the URL
from lxml import etree              # module to parse XML

baseurl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
query = "db=pubmed&id=27626687&format=xml"
url = baseurl + query
f = urlopen(url)                          # opens the URL
resultxml = f.read()                      # reads the content of the response
xml = etree.XML(resultxml)                # parses the content into an XML tree
resultelements = xml.xpath("//LastName")  # finds all LastName elements via XPath
for element in resultelements:
    print([element.text])