CODATA 2002
A Proposition of XML Format for Proteomics Database
Ken’ichi KAMIJO, Toshimasa YAMAZAKI, and Akira TSUGITA
Proteomics Research Center, Fundamental Research Labs., NEC Corp.

Data Format Standardization
• Entries are downloaded from public DBs as flat files
  - easy for a person to read
  - a different format for every DB
  - sometimes require special access methods and special applications for each format
• Software tools need machine-readable formats
• Exchanging data among researchers boosts studies
• Together, these needs drive standardization

XML format
• XML (eXtensible Markup Language)
  - Highly readable for both machines and people
  - Can represent information hierarchy and relationships
  - New details can be added readily
  - Convenient for exchanging data
  - Easy to translate to other formats
  - Logical checking against a Document Type Definition (DTD)
Example: <tag_source element_growth="8 weeks"> rice leaf </tag_source>

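As a minimal sketch of the machine-readability point, the slide's example element can be read with Python's standard library. The curly quotes from the slide are replaced with straight quotes here, since XML requires them. Note that `xml.etree.ElementTree` only checks well-formedness; checking a document against a DTD would need a validating parser, which is not shown.

```python
import xml.etree.ElementTree as ET

# The slide's example element, written with straight quotes so it is well-formed.
snippet = '<tag_source element_growth="8 weeks"> rice leaf </tag_source>'

element = ET.fromstring(snippet)

tag = element.tag                        # element name
growth = element.get("element_growth")   # attribute value
text = element.text.strip()              # element content without padding

print(tag, growth, text)
```

The attribute carries a detail about the sample (its growth period) while the element content carries the sample itself, which is the kind of hierarchy-plus-detail structure the bullets describe.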
XML in Bioinformatics
"The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web." -- W3C XML Web site, 2000-07-06.
[Diagram: public DBs (GenBank, EMBL, DDBJ, PIR, PDB, etc.) and private DBs, each behind a wrapper (converter), exchange XML over the Internet with XML applications, an XML DB, and users (researchers), including local access. A security gate with item selection and priority levels sits in front of the private DBs. Callouts: XML is easy to handle, easy to distribute, easy to re-use, and easy to control.]

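The wrapper (converter) idea can be sketched roughly as follows. This is an illustration only: the `KEY: value` flat-file layout, the `flat_to_xml` helper, and the `entry` root tag are all hypothetical, and real public databases each use their own flat-file formats that would need dedicated parsing.

```python
import xml.etree.ElementTree as ET


def flat_to_xml(flat_text, root_tag="entry"):
    """Convert hypothetical 'KEY: value' lines into a flat XML element.

    Assumes every key is a valid XML name once lowercased.
    """
    root = ET.Element(root_tag)
    for line in flat_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(":")
        child = ET.SubElement(root, key.strip().lower())
        child.text = value.strip()
    return root


# Hypothetical flat-file record; the field names are made up for this sketch.
flat_record = "SOURCE: Homo sapiens\nTISSUE: Kidney Glomerulus\n"

entry = flat_to_xml(flat_record)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Once every source is wrapped this way, downstream applications only need to understand one XML vocabulary instead of one parser per database.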
Analysis flow in Life Science
[Diagram: workflow stages: Experiment Design, Sample preparation, Experiment (Analysis), Data Acquisition, Result Analysis, Data Mining, Knowledge Discovery, Report.
Proteome analysis steps: tissue disruption, extraction, concentration (LC), 2DE, spot picking, mass spectrometer (detector), N-/C-terminal seq.
Downstream: protein identification (PMF, PST), chromosome, genome, related proteins, bindings, functions/structure.]

Conventional XMLs in Life Science
• DNA array data: MAGE-ML
• Gene/protein sequence and features: AGAVE, BSML, PSDML, BioML, ProML
[Diagram: these XML formats placed on the workflow stages (Experiment Design, Sample preparation, Experiment (Analysis), Data Acquisition, Result Analysis, Data Mining, Knowledge Discovery, Report).]

Our XML-based data model
Our XML
• Proteome-analysis oriented
• Describes
  - Sample preparation
  - Methodology
  - 2D gel image / LC results
  - Spot information
  - Sequence and feature
  - 3D structure
• Includes other open XMLs used in life science
Now available: HUP-ML (Human Proteome Markup Language) DTD and Editor
http://www.jhupo.org/
[Diagram: workflow stages (Experiment Design, Sample preparation, Experiment (Analysis), Data Acquisition, Result Analysis, Data Mining, Knowledge Discovery, Report) shown alongside.]

XML for Proteomics
• Information structure:
  Proteome
    Gel info.
      Source info.
      Sample preparation info.
      Methodology info.
      Gel Image / LC info.
      Spot info.
XML outline (opening tags only, as shown on the slide):
  <proteome>
    <gel id="1">
      <source_info>
      <gel_img>
      <sample_preparation>
      <gel_conditions>
      <marker>
      <detection>
      <gel_image>
      <spot id="1">
      <spot id="2">
    <gel id="2">

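The information structure above can be sketched as a small element tree. The element names come from the slide, but the exact nesting (all per-gel elements as direct children of `<gel>`) is an assumption made for this illustration, not something the slide or the HUP-ML DTD is quoted as confirming.

```python
import xml.etree.ElementTree as ET

# Build a skeleton document: one proteome holding two gels.
proteome = ET.Element("proteome")

gel = ET.SubElement(proteome, "gel", id="1")
# Assumed nesting: each information block is a direct child of <gel>.
for name in ("source_info", "sample_preparation", "gel_conditions",
             "marker", "detection", "gel_image"):
    ET.SubElement(gel, name)
for spot_id in ("1", "2"):
    ET.SubElement(gel, "spot", id=spot_id)

ET.SubElement(proteome, "gel", id="2")  # second gel, left empty here

# Walk the hierarchy: count the spots recorded under each gel.
spots_per_gel = {g.get("id"): len(g.findall("spot"))
                 for g in proteome.findall("gel")}
print(spots_per_gel)
```

Grouping spots under their gel keeps each spot tied to the sample, conditions, and image it came from, which is what makes the record self-describing.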
Example: Human Kidney Glomerulus Proteome
By A. Tsugita et al. (2002)
[Figure: nephron and glomerulus anatomy. Labels: extraglomerular mesangial cell, macula densa cell, granule cell, afferent arteriole, efferent arteriole, glomerular epithelial cells (podocyte), mesangial matrix, mesangial cell, Bowman’s capsule epithelial cell, glomerular basement membrane, glomerular endothelial cell, proximal tubule epithelial cell.]

Sample of ProteomeXML (1)
Source information
  <source_info source_info_ID="HKG-1"
               creDate="2002-07-20T12:00:00"
               modDate="2002-08-10T17:20:00">
    <source>Homo sapiens</source>
    <common_name>Human</common_name>
    <strain />
    <cultiva />
    <cell_line />
    <tissue>Kidney Glomerulus</tissue>
    <plasmid />
    <growth_phase unit="year">48</growth_phase>
    <induction />
    <host />
    <description>Normal</description>
  </source_info>

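A short sketch of how software might consume such a source-information record, using only Python's standard library. The record below is abbreviated from the slide (several empty elements are omitted), and treating empty elements as "not recorded" is an assumption of this sketch rather than a rule stated in the presentation.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Abbreviated copy of the slide's record, with straight quotes throughout.
record = """<source_info source_info_ID="HKG-1"
    creDate="2002-07-20T12:00:00" modDate="2002-08-10T17:20:00">
  <source> Homo sapiens </source>
  <common_name> Human </common_name>
  <strain />
  <tissue> Kidney Glomerulus </tissue>
  <growth_phase unit="year"> 48 </growth_phase>
  <description>Normal</description>
</source_info>"""

info = ET.fromstring(record)

# Map each child element to its trimmed text; empty elements become None.
fields = {child.tag: (child.text or "").strip() or None for child in info}

# Timestamps are ISO 8601, so they parse directly.
created = datetime.fromisoformat(info.get("creDate"))
modified = datetime.fromisoformat(info.get("modDate"))

# The unit travels with the value as an attribute.
age_unit = info.find("growth_phase").get("unit")

print(fields["source"], fields["tissue"], fields["growth_phase"], age_unit)
```

Because units and timestamps are explicit attributes, a program can compare or convert them without guessing from free text.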