ENGINEDB: A repository of functional analogue gene products
Giulia De Sario, Angelica Tulipano, Andreas Gisel
Istituto di Tecnologie Biomediche, Sede Bari, CNR, Via Amendola 122/D, Bari, Italy
BITS 2009, Genova Andreas Gisel ITB-Bari CNR
Functional Analogues
• PAI1_HUMAN: Plasminogen activator inhibitor, 402 AA
  – InterPro: Protease_inhib_I4_serpin
• UROK_HUMAN: Urokinase-type plasminogen activator, 431 AA
  – InterPro: EGF, EGF_3, EGF_like_reg_CS, Kringle, Peptidase_S1_S6, Peptidase_S1A
• Sequence identity: 0.0510441
• Gene Ontology (term, p-value):
  – serine-type endopeptidase activity, 0.00139635
  – blood coagulation, 5.07353e-05
  – fibrinolysis, 2.5489e-06
How do we find these functional analogues?
Gene Ontology
• GO is an international standard for annotating genes:
  – www.geneontology.org
• It is structured as a directed acyclic graph with three independent branches, whose top-level terms are
  – ‘molecular function’,
  – ‘biological process’ and
  – ‘cellular component’
• The data are available in a public database, GODB: www.godatabase.org/dev
• More than 4,800,000 gene products are described by GO terms
• More than 27,800 GO terms, resulting in more than 24,700,000 associations
• Updated about every two months
Gene Ontology: semantic similarity measurement
[Figure: GO graph highlighting a path, a directly associated term and the indirectly associated terms]

$$P(\text{term}) = \frac{\#\ \text{gene products associated to the term or any of its children}}{\#\ \text{total associations between all GO terms and gene products}}$$

Resnik P, J Artif Intelligence Res 1999, 11: 95-130.
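The definition of P(term) can be illustrated with a minimal Python sketch. The three-term graph, the gene product names and the helper functions are hypothetical, and the denominator is assumed to count direct associations only; the slide does not say how the authors count them.

```python
from collections import defaultdict

# Hypothetical toy graph: term -> list of parent terms (not real GO data).
PARENTS = {
    "leaf": ["mid"],
    "mid": ["root"],
    "root": [],
}

# Hypothetical direct associations: gene product -> directly annotated terms.
DIRECT = {
    "geneA": ["leaf"],
    "geneB": ["mid"],
    "geneC": ["root"],
}

def with_ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

def term_probabilities(direct):
    """P(term) = gene products on the term or its children / total associations.

    Assumption: the denominator is the number of direct associations.
    """
    products = defaultdict(set)
    total = 0
    for gene, terms in direct.items():
        total += len(terms)
        for term in terms:
            # A direct annotation also counts for every ancestor term.
            for t in with_ancestors(term):
                products[t].add(gene)
    return {t: len(genes) / total for t, genes in products.items()}

P = term_probabilities(DIRECT)
```

In this toy setup the root term receives the highest probability and the most specific term the lowest, which is why 1 - P(term) can serve as a specificity weight.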
Algorithm
• Through a χ² statistical test we compare two gene products A and B:
  – we count the GO terms, directly or indirectly associated, that are common and uncommon to the two gene products;
  – we weight each term with 1 - P(term), giving more importance to specific terms.

+----------------------+------------------+----------------------+
|                      | # GO terms in A  | # GO terms not in A  |
+----------------------+------------------+----------------------+
| # GO terms in B      | O11              | O12                  |
| # GO terms not in B  | O21              | O22                  |
+----------------------+------------------+----------------------+

• The higher the χ² value, the higher the probability of functional dependence between the two gene products A and B.
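A minimal sketch of such a weighted comparison follows. It assumes the standard Pearson χ² formula for a 2x2 table and a known universe of GO terms for the O22 cell; the slide states neither, so both are assumptions, and the function name, term names and P(term) values are hypothetical.

```python
def weighted_chi_square(terms_a, terms_b, p, universe):
    """Pearson chi-square on a 2x2 table whose cells are sums of 1 - P(term)."""
    def weight(terms):
        return sum(1.0 - p[t] for t in terms)

    o11 = weight(terms_a & terms_b)             # in A and in B
    o12 = weight(terms_b - terms_a)             # in B, not in A
    o21 = weight(terms_a - terms_b)             # in A, not in B
    o22 = weight(universe - terms_a - terms_b)  # in neither (assumed universe)

    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (o11 * o22 - o12 * o21) ** 2 / denom

# Hypothetical toy data: six terms, each with P(term) = 0.5.
universe = {"t1", "t2", "t3", "t4", "t5", "t6"}
p = {t: 0.5 for t in universe}

same = weighted_chi_square({"t1", "t2", "t3"}, {"t1", "t2", "t3"}, p, universe)
partial = weighted_chi_square({"t1", "t2", "t3"}, {"t1", "t2", "t4"}, p, universe)
```

Identical term sets give the larger value here, matching the statement that a higher χ² indicates stronger functional dependence.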
Data Analysis
• Gene product (e.g. BCL2_HUMAN) → non-redundant list of GO terms → description of the gene product
• 3.7 million gene products (UniProt) are described by 170,925 descriptions
• 27,000 CPU hours
Gene analogue finder: a GRID solution for finding functionally analogous gene products. Tulipano A, Donvito G, Licciulli F, Maggi G, Gisel A. BMC Bioinformatics. 2007 Sep 3;8:329.
Results
• Analogues

+-------------------+---------------+----------------+-------------+--------------+----------------+
| Analogue          | Genus         | Species        | Chi-square  | Common terms | Uncommon terms |
+-------------------+---------------+----------------+-------------+--------------+----------------+
| PAI1_HUMAN        | Homo          | sapiens        | 26350.67188 |           47 |              0 |
| FIBR_EISFO        | Eisenia       | fetida         | 17928.62500 |           32 |              0 |
| PLMN_CAPHI        | Capra         | hircus         | 17407.75586 |           33 |              2 |
| CEKI_CAEEC        | Caesalpinia   | echinata       | 16403.94922 |           30 |              1 |
| Serpinf2          | Rattus        | norvegicus     | 15237.88184 |           33 |              7 |
| Tmprss6_predicted | Rattus        | norvegicus     | 15092.10352 |           36 |             12 |
| UROK_HUMAN        | Homo          | sapiens        | 14343.85547 |           38 |             18 |
| NVSP_NERVI        | Nereis        | virens         | 14327.84277 |           33 |             10 |
| PLMN_STRCA        | Struthio      | camelus        | 14327.76074 |           33 |             10 |
| FA12_PIG          | Sus           | scrofa         | 14327.75977 |           33 |             10 |
| FA12_BOVIN        | Bos           | taurus         | 14327.75977 |           33 |             10 |
| FA12_CAVPO        | Cavia         | porcellus      | 14327.75977 |           33 |             10 |
| FIBC_LUMRU        | Lumbricus     | rubellus       | 14129.44141 |           32 |              9 |
| PLMN_PIG          | Sus           | scrofa         | 13983.45605 |           33 |             11 |
| PLMN_PETMA        | Petromyzon    | marinus        | 13983.45605 |           33 |             11 |
| FA12_HUMAN        | Homo          | sapiens        | 13983.43848 |           33 |             11 |
| TFPI1_HUMAN       | Homo          | sapiens        | 13799.77637 |           26 |              1 |
| ANTA_HYDMA        | Hydra         | magnipapillata | 13774.76465 |           29 |              5 |
| ANTA_HAEOF        | Haementeria   | officinalis    | 13774.76465 |           29 |              5 |
| KLKB1_HUMAN       | Homo          | sapiens        | 13655.81934 |           33 |             12 |
| Klkb1             | Mus           | musculus       | 13655.81934 |           33 |             12 |
| KLKB1_BOVIN       | Bos           | taurus         | 13655.81934 |           33 |             12 |
| Klkb1             | Mus           | musculus       | 13655.81934 |           33 |             12 |
| DISA_AGKCO        | Agkistrodon   | contortrix     | 13589.35840 |           27 |              3 |
| DISB_VIPLE        | Macrovipera   | lebetina       | 13589.35840 |           27 |              3 |
| VSP2_TRIEL        | Protobothrops | elegans        | 13346.19141 |           33 |             13 |
| VSP1_TRIEL        | Protobothrops | elegans        | 13346.19141 |           33 |             13 |
| f7i               | Danio         | rerio          | 13272.62012 |           33 |             13 |
| PLMN_ERIEU        | Erinaceus     | europaeus      | 13218.69531 |           34 |             15 |
| PLMN_MACEU        | Macropus      | eugenii        | 13218.69531 |           34 |             15 |
+-------------------+---------------+----------------+-------------+--------------+----------------+

• GO term annotations (first table on the slide)

+----------------------------------------------+-------------+------+
| name                                         | p_value     | code |
+----------------------------------------------+-------------+------+
| protease binding                             | 1.8611e-06  | IPI  |
| protease binding                             | 1.8611e-06  | IPI  |
| serine-type endopeptidase activity           | 0.00139635  | IEA  |
| serine-type endopeptidase inhibitor activity | 0.000221876 | IEA  |
| serine-type endopeptidase inhibitor activity | 0.000221876 | EXP  |
| serine-type endopeptidase inhibitor activity | 0.000221876 | EXP  |
| protein binding                              | 0.0116756   | IPI  |
| protein binding                              | 0.0116756   | IPI  |
| blood coagulation                            | 5.07353e-05 | TAS  |
| fibrinolysis                                 | 2.5489e-06  | TAS  |
| regulation of angiogenesis                   | 8.17267e-06 | IEA  |
| extracellular region                         | 0.00467537  | NAS  |
| extracellular region                         | 0.00467537  | EXP  |
| extracellular region                         | 0.00467537  | EXP  |
| plasma membrane                              | 0.0123278   | EXP  |
+----------------------------------------------+-------------+------+

• GO term annotations (second table on the slide)

+------------------------------------+-------------+------+
| name                               | p_value     | code |
+------------------------------------+-------------+------+
| serine-type endopeptidase activity | 0.00139635  | IEA  |
| peptidase activity                 | 0.0169393   | IEA  |
| response to hypoxia                | 1.81255e-05 | IEA  |
| proteolysis                        | 0.00855986  | TAS  |
| chemotaxis                         | 0.0009295   | TAS  |
| signal transduction                | 0.0121321   | TAS  |
| blood coagulation                  | 5.07353e-05 | IEA  |
| smooth muscle cell migration       | 1.3756e-06  | IEA  |
| fibrinolysis                       | 2.5489e-06  | IEA  |
| extracellular region               | 0.00467537  | IEA  |
| plasma membrane                    | 0.0123278   | EXP  |
+------------------------------------+-------------+------+
Data filtering
Comparing 170,925 descriptions would produce 14.6 × 10^9 results. We introduced two thresholds to limit the data to significant results:
a) On the χ² value
[Figure: chart with a y-axis from 0 to 25000 and an x-axis from 1 to 97; the value 77 is marked on the plot]
b) On the average p-value of the common terms
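The size of the comparison and the two-threshold filter can be sketched as follows. The pair count assumes every unordered pair of descriptions is compared once. The threshold values, the direction of the p-value cut-off (lower average p-value taken as more specific) and the record layout are all hypothetical, since the slide gives none of them.

```python
# Unordered pairs among the 170925 descriptions quoted on the slide.
n_descriptions = 170_925
n_pairs = n_descriptions * (n_descriptions - 1) // 2

# Hypothetical cut-offs; the actual thresholds are not stated on the slide.
CHI2_MIN = 10_000.0
AVG_P_MAX = 0.01

def is_significant(result):
    """Keep a pair only if it passes both thresholds (assumed directions)."""
    return result["chi2"] >= CHI2_MIN and result["avg_p_common"] <= AVG_P_MAX

# Hypothetical comparison results, one record per pair of descriptions.
results = [
    {"pair": ("desc1", "desc2"), "chi2": 14000.0, "avg_p_common": 0.002},
    {"pair": ("desc1", "desc3"), "chi2": 300.0, "avg_p_common": 0.002},
    {"pair": ("desc2", "desc3"), "chi2": 12000.0, "avg_p_common": 0.2},
]
kept = [r for r in results if is_significant(r)]
```

Only the first record passes both cut-offs; the second fails on χ² and the third on the average p-value.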
The Access
http://spank.ba.itb.cnr.it/engine/
The Access: Webservice
http://spank.ba.itb.cnr.it/docs/engineDB.wsdl

Request:

  <soapenv:Envelope
      xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
      xmlns:fun="http://cathdb.info/FuncNet_1_0/">
    <soapenv:Header/>
    <soapenv:Body>
      <fun:ScorePairwiseRelations>
        <proteins1>
          <p>A3EXL0</p>
          <p>Q8NFN7</p>
          <p>O75865</p>
          <p>Q5SRD3</p>
          <p>Q9Y5G3</p>
          <p>O60486</p>
          <p>P19012</p>
          <p>Q9NWG8</p>
          <p>P30273</p>
          <p>Q92817</p>
        </proteins1>
        <proteins2>
          <p>Q5SR05</p>
          <p>Q9H8H3</p>
          <p>P22676</p>
          <p>O00241</p>
          <p>O14498</p>
          <p>P78552</p>
          <p>Q8NF37</p>
          <p>Q8NGM6</p>
          <p>Q0ZAJ7</p>
          <p>Q6PIM1</p>
        </proteins2>
      </fun:ScorePairwiseRelations>
    </soapenv:Body>
  </soapenv:Envelope>

Response entry:

  <s>
    <p1>P30273</p1>
    <p2>O00241</p2>
    <rs>14321.45508</rs>
    <pv>0.18345472940256</pv>
  </s>

FuncNet is an open platform for the prediction and comparison of protein function, funded by the European Union’s EMBRACE Network of Excellence, and developed in partnership with the ENFIN project.
It is designed to answer questions like: Given one set of proteins which are known to share a particular biological function… which of these other proteins also share that function?
http://funcnet.eu/
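A client could assemble such a request and read a response entry roughly as in the sketch below, which uses only Python's standard library and makes no network call. The function names are hypothetical. The element names come from the slide, and reading `rs` and `pv` as numeric values is an assumption, since the slide does not define them.

```python
import xml.etree.ElementTree as ET

SOAPENV = "http://schemas.xmlsoap.org/soap/envelope/"
FUN = "http://cathdb.info/FuncNet_1_0/"

def build_request(proteins1, proteins2):
    """Serialize a ScorePairwiseRelations envelope for two accession lists."""
    ET.register_namespace("soapenv", SOAPENV)
    ET.register_namespace("fun", FUN)
    envelope = ET.Element(f"{{{SOAPENV}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAPENV}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAPENV}}}Body")
    call = ET.SubElement(body, f"{{{FUN}}}ScorePairwiseRelations")
    for tag, accessions in (("proteins1", proteins1), ("proteins2", proteins2)):
        group = ET.SubElement(call, tag)
        for accession in accessions:
            ET.SubElement(group, "p").text = accession
    return ET.tostring(envelope, encoding="unicode")

def parse_score(fragment):
    """Parse one <s> entry; rs and pv are assumed to be numeric."""
    entry = ET.fromstring(fragment)
    return {
        "p1": entry.findtext("p1"),
        "p2": entry.findtext("p2"),
        "rs": float(entry.findtext("rs")),
        "pv": float(entry.findtext("pv")),
    }

request_xml = build_request(["P30273"], ["O00241"])
score = parse_score(
    "<s><p1>P30273</p1><p2>O00241</p2>"
    "<rs>14321.45508</rs><pv>0.18345472940256</pv></s>"
)
```

Sending `request_xml` to the service would additionally require the endpoint address from the WSDL, which is not reproduced here.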
In Future
• Compatible with different gene product identifiers
• Sequence comparison
• Domain comparison
• Select specific organisms
• Search with user-defined keywords