MySQL – A Little Science Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de
Structure § SCOP tables in detail § SCOP entity relation diagram § Complex queries to SCOP § How many nodes for class, fold, superfamily and family? § Which superfamily has the most families? § Which families does the DNA binding-domain superfamily have? § Retrieve PDB ids for a given family/superfamily? § What is the percentage of superfamilies with just one family? § What is the PDB structure with the biggest number of distinct superfamilies? § What is the percentage of PDBs with just one superfamily? § What are the most popular superfamilies (biggest number of PDB entries) ? § Selections of selections § Creting temporary tables
A Little Science § When working with SCOP through the web interface we are limited in what we can ask § What can we get out of SCOP when it is available as a relational table? § A reminder § Classes: all alpha, all beta, alpha/beta, alpha+beta § SCOP family: >30% sequence similarity § SCOP superfamily: good structural similarity (possibly <30%)
Sequence vs structure
SCOPE tables § Three tables: § cla , PDB entry and reference to its class, fold, superfamily, family, domain § des , description of each node in the SCOP hierarchy § astral , sequence for a domain
SCOPE tables mysql> SELECT * FROM cla LIMIT 1; +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ mysql> SELECT * FROM des LIMIT 1; +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+ mysql> SELECT * FROM astral LIMIT 1; +---------+---------+-----------------------------------------------------------+ | sid | sccs | seq | +---------+---------+-----------------------------------------------------------+ | d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...| +---------+---------+-----------------------------------------------------------+
Entity relationship diagram for SCOP
Complex queries to SCOP
A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family?
A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family? Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+ § Let us first find out how these types are called: +------+ | type | SELECT DISTINCT type +------+ | cl | FROM des; | cf | | sf | | fa | | dm | | sp | | px | +------+
A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family?
A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family? +------+-------+ SELECT type, COUNT(*) AS num | type | num | FROM des +------+-------+ | cl | 11 | GROUP BY type | cf | 1067| ORDER BY num; | sf | 1678 | | fa | 2886 | | dm | 6475 | | sp | 9526 | | px | 65122 | +------+-------+
A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family? +------+-------+ SELECT type, COUNT(*) AS num | type | num | FROM des +------+-------+ | cl | 11 | GROUP BY type | cf | 1067 | ORDER BY num; | sf | 1678 | | fa | 2886 | § There are not that many more | dm | 6475 | | sp | 9526 | families than superfamilies. | px | 65122 | +------+-------+ § Which superfamily has the most families?
A Little Science § Which superfamily has the most families? Cla (domain table) +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+
A Little Science § Which superfamily has the most families? SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num DESC; +---------+---------------------------------------------------------+-----+ | sccs | description | num | +---------+---------------------------------------------------------+-----+ | a.4.5 | "Winged helix" DNA-binding domain | 38 | | c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 35 | | c.69.1 | alpha/beta-Hydrolases | 30 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 22 | | c.52.1 | Restriction endonuclease-like | 22 | | b.18.1 | Galactose-binding domain-like | 21 |
A Little Science § Which families does the DNA binding-domain superfamily have? Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+ § The sccs of the superfamily is a.4.5. § Its families have sccs ,… a.4.5.1, a.4.5.2 § How can we list them?
A Little Science Which families does the DNA binding-domain superfamily have? SELECT DISTINCT sccs, description FROM des WHERE sccs LIKE “a.4.5.%” AND type=“fa“ ORDER BY sccs; | sccs | description +----------+------------------------------------------------ | a.4.5.1 | Biotin repressor-like | a.4.5.10 | Replication initiation protein | a.4.5.11 | Helicase DNA-binding domain | a.4.5.12 | Restriction endonuclease FokI, N-terminal (recognition) domain | a.4.5.13 | Linker histone H1/H5 | a.4.5.14 | Forkhead DNA-binding domain | a.4.5.15 | DNA-binding domain from rap30 | a.4.5.16 | C-terminal domain of RPA32 | a.4.5.17 | Cell cycle transcription factor e2f-dp | a.4.5.18 | The central core domain of TFIIE beta | a.4.5.19 | Z-DNA binding domain | a.4.5.2 | LexA repressor, N-terminal DNA-binding domain | a.4.5.20 | P4 origin-binding domain-like | a.4.5.21 | ets domain | a.4.5.22 | Heat-shock transcription factor | a.4.5.23 | Interferon regulatory factor [...] => 45 rows
A Little Science § Find example PDB structure for the families of the DNA binding-domain superfamily ... mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.1"; +--------+ | pdb_id | +--------+ | 1bia | | 1hxd | | 1bib | | 1j5y | +--------+ mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.2"; +--------+ | pdb_id | +--------+ | 1jhf | | 1jhh | | 1lea | | 1leb | +--------+
A Little Science Structure very similar! Biotin represor-like LexA represor 1jhf (a.4.5.2) 1bia (a.4.5.1)
Sequence vs structure
A Little Science 1cgp ( a.4.5.4)
A Little Science Some more… 1b9n (a.4.5.8) 1smt (a.4.5.5) 1f4k (a.4.5.7) 1bm9 (a.4.5.7) 1hw1 (a.4.5.6)
A Little Science How many percent of superfamilies have only § 1 family, how many 2,… ?
A Little Science § How many percent of superfamilies have only 1, 2, 3, … families? 1. First let’s create the result of the query that found the number of families for each superfamily. SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num desc +---------+---------------------------------------------------------+-----+ | sccs | description | num | +---------+---------------------------------------------------------+-----+ | a.4.5 | "Winged helix" DNA-binding domain | 38 | | c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 35 | | c.69.1 | alpha/beta-Hydrolases | 30 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 22 | | c.52.1 | Restriction endonuclease-like | 22 | | b.18.1 | Galactose-binding domain-like | 21 |
Recommend
More recommend