xml




  1. SEMISTRUCTURED DATA AND XML

     HOW THE WEB IS TODAY
     - HTML documents
       - often generated by applications
       - consumed by humans only
       - easy access: across platforms, across organizations
       - only layout, no semantic information
     - No application interoperability:
       - HTML not understood by applications
         - screen scraping is brittle
       - Database technology: client-server
         - still vendor specific

     XML DATA EXCHANGE FORMAT
     - A standard from the W3C (World Wide Web Consortium, http://www.w3.org).
     - The mission of the W3C: "... developing common protocols that promote its evolution and ensure its interoperability ...".
     - Basic ideas
       - XML = data
       - XML generated by applications
       - XML consumed by applications
       - Easy access: across platforms, organizations.

  2. PARADIGM SHIFT ON THE WEB
     - For web search engines:
       - From documents (HTML) to data (XML)
       - From document management to document understanding (e.g., question answering)
       - From information retrieval to data management
     - For database systems:
       - From relational (structured) model to semistructured data
       - From data processing to data/query translation
       - From storage to transport

     THE SEMISTRUCTURED DATA MODEL
     [Figure: Object Exchange Model (OEM) graph of a bibliography. The root Bib (&o1) is a complex object with children labeled paper, paper, and book (&o12, &o24, &o29), connected among each other by references arcs. Further arcs are labeled author, title, year, publisher, page, http, firstname, lastname, first, and last. Leaves such as "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122, and 133 are atomic objects.]

     THE SEMISTRUCTURED DATA MODEL
     - Data is self-describing, i.e. the data description is integrated with the data itself rather than kept in a separate schema.
     - The database is a collection of nodes and arcs (a directed graph).
     - Leaf nodes represent data of some atomic type (atomic objects, such as numbers or strings).
     - Interior nodes represent complex objects consisting of components (child nodes), which are connected to the node by arcs.
     - Arcs are directed and connect two nodes.

  3. THE SEMISTRUCTURED DATA MODEL
     - Arc labels indicate the relationship between the two nodes they connect.
     - The root node is the only interior node without incoming arcs; it represents the entire database.
     - All database objects are children of the root node.
     - Every node must be reachable from the root.
     - A general graph structure is possible, i.e. the graph need not be a tree.

     SYNTAX FOR SEMISTRUCTURED DATA

     Bib: &o1 { paper: &o12 { … },
                book: &o24 { … },
                paper: &o29
                  { author: &o52 "Abiteboul",
                    author: &o96 { firstname: &243 "Victor",
                                   lastname: &o206 "Vianu" },
                    title: &o93 "Regular path queries with constraints",
                    references: &o12,
                    references: &o24,
                    pages: &o25 { first: &o64 122, last: &o92 133 }
                  }
              }

     Observe: nested tuples, set values, oids!

     SYNTAX FOR SEMISTRUCTURED DATA

     Oids may be omitted:

     { paper: { author: "Abiteboul",
                author: { firstname: "Victor",
                          lastname: "Vianu" },
                title: "Regular path queries …",
                page: { first: 122, last: 133 }
              }
     }
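The graph conditions above (a single root without incoming arcs, every node reachable from the root, sharing allowed) can be checked mechanically. The following is a minimal sketch, not part of the original slides. It assumes a hypothetical encoding in which each oid maps either to an atomic value or to a list of (label, child oid) arcs. The objects &o12 and &o24 are elided in the source, so they appear here as empty complex objects.

```python
from collections import deque

# Hypothetical encoding: oid -> atomic value, or list of (label, child_oid) arcs.
# Oids and values are copied from the OEM syntax example; &o12 and &o24 are
# elided in the source ("{ … }") and are left as empty complex objects.
db = {
    "&o1": [("paper", "&o12"), ("book", "&o24"), ("paper", "&o29")],
    "&o12": [],
    "&o24": [],
    "&o29": [
        ("author", "&o52"),
        ("author", "&o96"),
        ("title", "&o93"),
        ("references", "&o12"),
        ("references", "&o24"),
        ("pages", "&o25"),
    ],
    "&o52": "Abiteboul",
    "&o96": [("firstname", "&243"), ("lastname", "&o206")],
    "&243": "Victor",
    "&o206": "Vianu",
    "&o93": "Regular path queries with constraints",
    "&o25": [("first", "&o64"), ("last", "&o92")],
    "&o64": 122,
    "&o92": 133,
}


def is_atomic(db, oid):
    """An object is atomic if it stores a value rather than a list of arcs."""
    return not isinstance(db[oid], list)


def reachable(db, root):
    """Breadth-first traversal: the set of oids reachable from root."""
    seen = {root}
    queue = deque([root])
    while queue:
        oid = queue.popleft()
        if is_atomic(db, oid):
            continue
        for _label, child in db[oid]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def in_degree(db):
    """Number of incoming arcs per oid."""
    counts = {oid: 0 for oid in db}
    for value in db.values():
        if isinstance(value, list):
            for _label, child in value:
                counts[child] += 1
    return counts
```

In this encoding, &o12 and &o24 each have two incoming arcs (one from the root, one `references` arc from &o29), which is exactly what makes the structure a graph rather than a tree.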

  4. VS. RELATIONAL MODEL
     - Missing attributes
     - Additional attributes
     - Multiple attribute values (set-valued attributes)
     - Objects as attribute values
     - No global schema
     - Only the first characteristic (missing attributes) is supported by the relational model; the others are not.

     VS. RELATIONAL MODEL
     - Semistructured data
       - Self-describing,
       - Irregular data,
       - No a priori structure.
     - Relational DB
       - Separate schema,
       - Regular data,
       - A priori structure.

     XML
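To make the irregularities listed above concrete, here is a small sketch that profiles a collection of semistructured objects and reports which labels are missing, set-valued, or object-valued. The encoding (a list of (label, value) pairs per object), the function name `profile`, and the two deliberately incomplete sample objects are assumptions made for illustration. They are not taken from the slides.

```python
# Hypothetical sample: each object is a list of (label, value) pairs, so a
# label may be absent, repeated, or carry a nested object as its value.
objects = [
    [("title", "Foundations of Databases"),
     ("author", "Abiteboul"), ("author", "Hull"), ("author", "Vianu"),
     ("year", 1995)],
    [("title", "Data on the Web"),
     ("author", {"lastname": "Suciu"}),   # object as attribute value
     ("publisher", "Morgan Kaufmann")],   # no year, additional publisher
]


def profile(objects):
    """Report, per label, where it is missing and whether it is irregular."""
    all_labels = sorted({label for obj in objects for label, _ in obj})
    report = {}
    for label in all_labels:
        # How often the label occurs in each object.
        counts = [sum(1 for l, _ in obj if l == label) for obj in objects]
        nested = any(
            isinstance(v, dict) for obj in objects for l, v in obj if l == label
        )
        report[label] = {
            "missing_in": [i for i, c in enumerate(counts) if c == 0],
            "set_valued": any(c > 1 for c in counts),
            "nested": nested,
        }
    return report
```

A fixed relational schema could absorb the `missing_in` cases with NULLs, but the `set_valued` and `nested` cases would need extra tables or a schema change.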

  5. IMPORTANT XML STANDARDS
     - XSL/XSLT: presentation and transformation standards
     - RDF: Resource Description Framework (meta-information such as ratings, categorizations, etc.)
     - XPath/XPointer/XLink: standards for linking to documents and to elements within them
     - Namespaces: for resolving name clashes
     - DOM: Document Object Model for manipulating XML documents
     - SAX: Simple API for XML parsing
     - XQuery: query language

     XML
     - A W3C standard to complement HTML
     - Origins: structured text, SGML
       - Large-scale electronic publishing
       - Data exchange on the web
     - Motivation:
       - HTML describes presentation
       - XML describes content
     - http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)
     [Figure: diagram relating HTML 4.0, XML, and SGML]

     FROM HTML TO XML
     HTML describes the presentation

  6. HTML

     <h1> Bibliography </h1>
     <p> <i> Foundations of Databases </i>
         Abiteboul, Hull, Vianu
         <br> Addison Wesley, 1995
     <p> <i> Data on the Web </i>
         Abiteboul, Buneman, Suciu
         <br> Morgan Kaufmann, 1999

     HTML describes the presentation

     XML

     <bibliography>
       <book>
         <title> Foundations… </title>
         <author> Abiteboul </author>
         <author> Hull </author>
         <author> Vianu </author>
         <publisher> Addison Wesley </publisher>
         <year> 1995 </year>
       </book>
       …
     </bibliography>

     XML describes the content

     WHY ARE WE DB'ERS INTERESTED?
     - It's data. That's us.
     - Database issues:
       - How are we going to model XML? (graphs)
       - How are we going to query XML? (XQuery)
       - How are we going to store XML? (in a relational database? object-oriented? native?)
       - How are we going to process XML efficiently? (many interesting research questions!)
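Because the XML version names its content, a program can query it directly instead of scraping layout. The sketch below uses Python's standard `xml.etree.ElementTree` module. The second `<book>` is elided in the slide, so it is filled in here from the HTML example as an assumption, and the helper `books_by_author` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The first book follows the XML slide; the second is reconstructed from the
# HTML slide (an assumption, since the XML slide elides it with "…").
doc = """
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author>Abiteboul</author>
    <author>Buneman</author>
    <author>Suciu</author>
    <publisher>Morgan Kaufmann</publisher>
    <year>1999</year>
  </book>
</bibliography>
"""

root = ET.fromstring(doc)


def books_by_author(root, name):
    """Titles of all books that list the given author."""
    return [
        book.findtext("title")
        for book in root.findall("book")
        if any(author.text == name for author in book.findall("author"))
    ]


# Element content is text; convert explicitly where a number is wanted.
years = [int(book.findtext("year")) for book in root.findall("book")]
```

The same question asked of the HTML version would have to rely on the position of `<i>` and `<br>` tags, which is the brittle screen scraping the slides warn about.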

  7. ELEMENTS
     - Tags: book, title, author, …
       - start tag: <book>, end tag: </book>
       - defined by the user / programmer (different from HTML!)
     - Elements: <book>…</book>, <author>…</author>
       - An element consists of a matching start and end tag and the enclosed content.
       - Elements can be nested, i.e. the content of one element can consist of a sequence of other elements.

     ATTRIBUTES
     - Attributes can be associated with any element.
     - They provide additional information about elements.
     - Attributes can have only one value.
     - Example

       <book price="55" currency="USD">
         <title> Foundations of Databases </title>
         <author> Abiteboul </author>
         …
         <year> 1995 </year>
       </book>

     - Attributes can also be used to connect elements.

     NON-TREE-LIKE XML
     - So far: only tree-like XML documents, i.e. each element is nested within at most one other element.
     - Attributes can also be used to create non-tree XML documents.
     - Attributes with a domain of ID serve as primary keys of elements.
     - Attributes with a domain of IDREF serve as foreign keys referencing the ID of another element.
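The ID/IDREF idea can be sketched as a key lookup. `xml.etree.ElementTree` does not read DTDs, so the code below simply treats one attribute as the ID and another as a whitespace-separated list of references. The attribute names `id` and `references` and the helper names are assumptions for illustration. The oids reuse the paper/book objects from the semistructured example earlier in the deck.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "id" plays the role of an ID attribute and
# "references" the role of an IDREFS attribute.
doc = """
<bib>
  <paper id="o12"/>
  <book id="o24"/>
  <paper id="o29" references="o12 o24">
    <title>Regular path queries with constraints</title>
  </paper>
</bib>
"""

root = ET.fromstring(doc)


def build_id_index(root, id_attr="id"):
    """Map each ID value to its element; IDs must be unique, like primary keys."""
    index = {}
    for elem in root.iter():
        key = elem.get(id_attr)
        if key is not None:
            if key in index:
                raise ValueError(f"duplicate ID {key!r}")
            index[key] = elem
    return index


def resolve(elem, attr, index):
    """Follow the references in attr, like foreign keys; KeyError if dangling."""
    return [index[ref] for ref in elem.get(attr, "").split()]


index = build_id_index(root)
cited = resolve(index["o29"], "references", index)
```

Here `cited` holds the `<paper>` and `<book>` elements that o29 refers to, although neither is nested inside it, which is how attributes turn a tree into a graph.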

  8. NON-TREE-LIKE XML
     - Example of a non-tree structure

       <persons>
         <person personid="o555">
           <name> Jane </name>
         </person>
         <person personid="o456">
           <name> Mary </name>
           <children refs="o123 o555"/>
         </person>
         <person personid="o123" mother="o456">
           <name> John </name>
         </person>
       </persons>

     NAMESPACES
     - An XML document can involve tags that come from multiple sources.
     - One and the same tag can appear in more than one source.

       <table>
         <tr> <td>Apples</td> <td>Bananas</td> </tr>
       </table>

       <table>
         <name>African Coffee Table</name>
         <width>80</width>
         <length>120</length>
       </table>

     NAMESPACES
     - Name conflicts can be resolved by prefixing tag names according to their source.

       <h:table>
         <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr>
       </h:table>

       <f:table>
         <f:name>African Coffee Table</f:name>
         <f:width>80</f:width>
         <f:length>120</f:length>
       </f:table>

     - When using prefixes in XML, a namespace for the prefix must be defined.
     - The namespace must be referenced (via a URI) in the start tag of an enclosing element.
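A short sketch of how a parser sees prefixed names may help here. The slides do not give namespace URIs, so the two `http://example.org/...` URIs and the enclosing `<root>` element are placeholders invented for this example. The parsing calls are from Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# The URIs are hypothetical placeholders; they are declared with xmlns:prefix
# in the start tag of an enclosing element.
doc = """
<root xmlns:h="http://example.org/html"
      xmlns:f="http://example.org/furniture">
  <h:table>
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>
"""

# Prefix-to-URI map used for lookups; the prefixes need not match the document.
ns = {"h": "http://example.org/html", "f": "http://example.org/furniture"}

root = ET.fromstring(doc)

# The two kinds of table are now distinct names.
cells = [td.text for td in root.findall("h:table/h:tr/h:td", ns)]
furniture = root.find("f:table", ns)
furniture_name = furniture.findtext("f:name", namespaces=ns)
```

Internally the parser expands each prefix to its URI, so `furniture.tag` is `{http://example.org/furniture}table`, and a lookup for a plain `table` matches neither element.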

  9. WELL-FORMED XML
     - A well-formed XML document satisfies the following conditions:
       - It begins with a declaration that it is XML.
       - It has a single root element that encloses the whole document.
       - It consists of properly nested elements, i.e. the start and end tag of an element are within the same enclosing element.
     - standalone="yes" states that the document does not rely on an external DTD.
     - In this mode, you can invent your own tags, as in the semistructured data model.

     WELL-FORMED XML

     <?xml version="1.0" standalone="yes"?>
     <bibliography>
       <book>
         <title> Foundations… </title>
         <author> Abiteboul </author>
         <author> Hull </author>
         <author> Vianu </author>
         <publisher> Addison Wesley </publisher>
         <year> 1995 </year>
       </book>
       <book>
         <title> … </title>
         . . .
       </book>
       …
     </bibliography>

     WELL-FORMED XML
     - HTML browsers will display documents with errors (like missing end tags).
     - The W3C XML specification states that a program should stop processing an XML document if it finds an error.
     - The main reason is that XML is consumed by programs rather than by humans (unlike HTML).
     - W3C provides a validator that checks whether an XML document is well-formed.
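The stop-on-error behaviour can be observed with any conforming parser. As a minimal sketch, the helper below (the name `is_well_formed` is hypothetical) uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` on the first well-formedness violation instead of trying to recover the way an HTML browser would.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser stops at the first error; no partial tree is returned.
        return False
    return True


good = (
    '<?xml version="1.0" standalone="yes"?>'
    "<bibliography><book><title>Foundations</title></book></bibliography>"
)
missing_end_tag = "<bibliography><book><title>Foundations</title></bibliography>"
badly_nested = "<a><b></a></b>"
two_roots = "<a/><b/>"
```

Each of the three faulty strings violates one of the conditions listed above: a missing end tag, improperly nested elements, and more than one root element.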
