SEMISTRUCTURED DATA AND XML

HOW THE WEB IS TODAY
HTML documents
- often generated by applications
- consumed by humans only
- easy access: across platforms, across organizations
- only layout, no semantic information
No application interoperability
- HTML not understood by applications
- screen scraping is brittle
- database technology: client-server, still vendor specific

XML DATA EXCHANGE FORMAT
A standard from the W3C (World Wide Web Consortium, http://www.w3.org).
The mission of the W3C: ". . . developing common protocols that promote its evolution and ensure its interoperability . . ."
Basic ideas
- XML = data
- XML generated by applications
- XML consumed by applications
- Easy access: across platforms, organizations.
PARADIGM SHIFT ON THE WEB
For web search engines:
- From documents (HTML) to data (XML)
- From document management to document understanding (e.g., question answering)
- From information retrieval to data management
For database systems:
- From relational (structured) model to semistructured data
- From data processing to data/query translation
- From storage to transport

THE SEMISTRUCTURED DATA MODEL
[Figure: Object Exchange Model (OEM) graph. The root Bib (&o1, a complex object) has children &o12, &o24, and &o29, reached via arcs labeled paper and book. Arcs labeled references connect these objects to one another. Further arcs (author, title, year, page, publisher, http, firstname, lastname, first, last) lead down to atomic objects such as "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122, and 133.]

THE SEMISTRUCTURED DATA MODEL
- Data is self-describing, i.e. the data description is integrated with the data itself rather than kept in a separate schema.
- The database is a collection of nodes and arcs (a directed graph).
- Leaf nodes represent data of some atomic type (atomic objects, such as numbers or strings).
- Interior nodes represent complex objects consisting of components (child nodes), connected by arcs to this node.
- Arcs are directed and connect two nodes.
THE SEMISTRUCTURED DATA MODEL
- Arc labels indicate the relationship between the two corresponding nodes.
- The root node is the only interior node without in-arcs; it represents the entire database.
- All database objects are descendants of the root node: every node must be reachable from the root.
- A general graph structure is possible, i.e. the graph need not be a tree.

SYNTAX FOR SEMISTRUCTURED DATA
    Bib: &o1 { paper: &o12 { … },
               book: &o24 { … },
               paper: &o29
                 { author: &o52 "Abiteboul",
                   author: &o96 { firstname: &o243 "Victor",
                                  lastname: &o206 "Vianu" },
                   title: &o93 "Regular path queries with constraints",
                   references: &o12,
                   references: &o24,
                   page: &o25 { first: &o64 122, last: &o92 133 }
                 }
             }
Observe: nested tuples, set values, oids!

SYNTAX FOR SEMISTRUCTURED DATA
May omit oids:
    { paper: { author: "Abiteboul",
               author: { firstname: "Victor",
                         lastname: "Vianu" },
               title: "Regular path queries …",
               page: { first: 122, last: 133 } } }
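The OEM syntax above can be mirrored in a small in-memory sketch. The encoding below is an assumption made for illustration, not part of the slides: a Python dict maps each oid either to an atomic value or to a list of (label, child oid) arcs, and the helper names `reachable`, `follow`, and `in_degree` are hypothetical. Objects whose contents the slide elides with "…" are kept empty here.

```python
# Hypothetical encoding of the OEM graph from the syntax slide:
# oid -> atomic value, or oid -> list of (arc label, child oid).
db = {
    "&o1": [("paper", "&o12"), ("book", "&o24"), ("paper", "&o29")],
    "&o12": [],  # contents elided in the slide
    "&o24": [],  # contents elided in the slide
    "&o29": [
        ("author", "&o52"), ("author", "&o96"), ("title", "&o93"),
        ("references", "&o12"), ("references", "&o24"), ("page", "&o25"),
    ],
    "&o52": "Abiteboul",
    "&o96": [("firstname", "&o243"), ("lastname", "&o206")],
    "&o243": "Victor",
    "&o206": "Vianu",
    "&o93": "Regular path queries with constraints",
    "&o25": [("first", "&o64"), ("last", "&o92")],
    "&o64": 122,
    "&o92": 133,
}


def reachable(graph, root):
    """Return the set of oids reachable from root by following arcs."""
    seen, stack = set(), [root]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if isinstance(graph[oid], list):  # complex object: follow its arcs
            stack.extend(child for _, child in graph[oid])
    return seen


def follow(graph, oid, label):
    """Return the children of a complex object reached via arcs with this label."""
    return [child for lbl, child in graph[oid] if lbl == label]


def in_degree(graph, target):
    """Count the arcs pointing at target (more than 1 means the graph is not a tree)."""
    return sum(
        1
        for value in graph.values() if isinstance(value, list)
        for _, child in value if child == target
    )
```

In this sketch the set-valued author attribute is simply two arcs with the same label, and &o12 has two in-arcs (one paper arc from the root, one references arc from &o29), which is what makes the structure a general graph rather than a tree.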
VS. RELATIONAL MODEL
- Missing attributes
- Additional attributes
- Multiple attribute values (set-valued attributes)
- Objects as attribute values
- No global schema
Only the first characteristic (missing attributes, e.g. via null values) is supported by the relational model; the others are not.

VS. RELATIONAL MODEL
Semistructured data: self-describing, irregular data, no a priori structure.
Relational DB: separate schema, regular data, a priori structure.

XML
IMPORTANT XML STANDARDS
- XSL/XSLT: presentation and transformation standards
- RDF: Resource Description Framework (meta-information such as ratings, categorizations, etc.)
- XPath/XPointer/XLink: standards for linking to documents and to elements within them
- Namespaces: for resolving name clashes
- DOM: Document Object Model for manipulating XML documents
- SAX: Simple API for XML parsing
- XQuery: query language

XML
A W3C standard to complement HTML
Origins: structured text (SGML)
- Large-scale electronic publishing
- Data exchange on the web
Motivation:
- HTML describes presentation
- XML describes content
http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)
[Figure: diagram relating HTML 4.0, XML, and SGML]

FROM HTML TO XML
HTML describes the presentation
HTML
    <h1> Bibliography </h1>
    <p> <i> Foundations of Databases </i>
        Abiteboul, Hull, Vianu
        <br> Addison Wesley, 1995
    <p> <i> Data on the Web </i>
        Abiteboul, Buneman, Suciu
        <br> Morgan Kaufmann, 1999
HTML describes the presentation

XML
    <bibliography>
      <book> <title> Foundations… </title>
             <author> Abiteboul </author>
             <author> Hull </author>
             <author> Vianu </author>
             <publisher> Addison Wesley </publisher>
             <year> 1995 </year>
      </book>
      …
    </bibliography>
XML describes the content

WHY ARE WE DB'ERS INTERESTED?
It's data. That's us.
Database issues:
- How are we going to model XML? (graphs)
- How are we going to query XML? (XQuery)
- How are we going to store XML? (in a relational database? object-oriented? native?)
- How are we going to process XML efficiently? (many interesting research questions!)
ELEMENTS
Tags: book, title, author, …
- start tag: <book>, end tag: </book>
- defined by the user / programmer (different from HTML!)
Elements: <book>…</book>, <author>…</author>
- An element consists of a matching start and end tag and the enclosed content.
- Elements can be nested, i.e. the content of one element can consist of a sequence of other elements.

ATTRIBUTES
- Attributes can be associated with any element.
- They provide additional information about elements.
- An attribute can have only one value.
Example
    <book price="55" currency="USD">
      <title> Foundations of Databases </title>
      <author> Abiteboul </author>
      …
      <year> 1995 </year>
    </book>
Attributes can also be used to connect elements.

NON-TREE-LIKE XML
- So far: only tree-like XML documents, i.e. each element is nested within at most one other element.
- Attributes can also be used to create non-tree XML documents.
- Attributes of type ID serve as primary keys of elements.
- Attributes of type IDREF serve as foreign keys referencing the ID of another element.
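The key / foreign-key reading of ID and IDREF can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the tiny document, the oids p1 and p2, and the helper names `build_id_index` and `resolve` are all hypothetical, and because Python's xml.etree.ElementTree does not read DTD attribute types, the sketch is told by hand which attribute plays the ID role.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: personid plays the role of an ID attribute,
# mother plays the role of an IDREF attribute.
DOC = """
<persons>
  <person personid="p1"><name>Ann</name></person>
  <person personid="p2" mother="p1"><name>Bob</name></person>
</persons>
"""


def build_id_index(root, id_attr):
    """Map each ID value to its element (the 'primary key' lookup)."""
    return {
        el.get(id_attr): el
        for el in root.iter()
        if el.get(id_attr) is not None
    }


def resolve(index, idrefs):
    """Follow a whitespace-separated list of references (the 'foreign keys')."""
    return [index[ref] for ref in idrefs.split()]


root = ET.fromstring(DOC)
index = build_id_index(root, "personid")

bob = index["p2"]
mother = resolve(index, bob.get("mother"))[0]
mother_name = mother.findtext("name")
```

Following the reference leads from Bob's element to Ann's element, which is not his parent in the element tree; this is the sense in which ID/IDREF attributes turn a tree-shaped document into a graph.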
NON-TREE-LIKE XML
Example of a non-tree structure
    <persons>
      <person personid="o555"> <name> Jane </name> </person>
      <person personid="o456"> <name> Mary </name>
        <children refs="o123 o555"></children>
      </person>
      <person personid="o123" mother="o456"> <name>John</name> </person>
    </persons>

NAMESPACES
- An XML document can involve tags that come from multiple sources.
- One and the same tag can appear in more than one source.
    <table>
      <tr> <td>Apples</td> <td>Bananas</td> </tr>
    </table>

    <table>
      <name>African Coffee Table</name>
      <width>80</width>
      <length>120</length>
    </table>

NAMESPACES
Name conflicts can be resolved by prefixing tag names according to their source.
    <h:table>
      <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr>
    </h:table>

    <f:table>
      <f:name>African Coffee Table</f:name>
      <f:width>80</f:width>
      <f:length>120</f:length>
    </f:table>
- When using prefixes in XML, a namespace for the prefix must be defined.
- The namespace must be referenced (via a URI) in the start tag of an enclosing element.
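A minimal sketch of how a parser treats the two prefixed tables, using Python's xml.etree.ElementTree. The slides do not give the namespace URIs, so the two example.org URIs and the enclosing root element below are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: both URIs and the <root> wrapper are hypothetical; the slide
# only says that each prefix must be bound to a URI in an enclosing start tag.
DOC = """
<root xmlns:h="http://example.org/html"
      xmlns:f="http://example.org/furniture">
  <h:table>
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>
"""

# Prefix-to-URI map used for queries; the prefixes here need not match
# the ones in the document, only the URIs matter.
NS = {"h": "http://example.org/html", "f": "http://example.org/furniture"}

root = ET.fromstring(DOC)

# After parsing, each tag is expanded to {namespace URI}local-name,
# so the two table elements are no longer confusable.
tags = [child.tag for child in root]

html_cells = [td.text for td in root.findall("h:table/h:tr/h:td", NS)]
furniture_name = root.findtext("f:table/f:name", namespaces=NS)
```

The expanded tag names show the design point: the prefix is only shorthand, and an application distinguishes the two table elements by the URI each prefix is bound to.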
WELL-FORMED XML
A well-formed XML document satisfies the following conditions:
- Begins with a declaration that it is XML (recommended; XML 1.0 does not strictly require it).
- Has a single root element that encloses the whole document.
- Consists of properly nested elements, i.e. the start and end tag of an element are within the same enclosing element.
standalone="yes" states that the document does not depend on external markup declarations (informally: it can be processed without a DTD). In this mode, you can invent your own tags, as in the semistructured data model.

WELL-FORMED XML
    <?xml version="1.0" standalone="yes"?>
    <bibliography>
      <book> <title> Foundations… </title>
             <author>Abiteboul</author>
             <author>Hull</author>
             <author>Vianu</author>
             <publisher> Addison Wesley </publisher>
             <year> 1995 </year>
      </book>
      <book> <title> … </title> . . . </book>
      …
    </bibliography>

WELL-FORMED XML
- HTML browsers will display documents with errors (like missing end tags).
- The W3C XML specification states that a program should stop processing an XML document if it finds an error.
- The main reason is that XML is consumed by programs rather than by humans (as HTML is).
- W3C provides a validator that checks whether an XML document is well-formed.
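The stop-on-error behavior can be observed with any conforming parser. The sketch below uses Python's xml.etree.ElementTree, which checks well-formedness only and performs no DTD validation; the helper name `is_well_formed` and the four sample strings are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text, False if it stops with an error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Single root element, properly nested.
GOOD = (
    '<?xml version="1.0" standalone="yes"?>'
    "<bibliography><book><title>Foundations of Databases</title>"
    "<author>Abiteboul</author></book></bibliography>"
)

# End tags in the wrong order: title is closed outside its enclosing element.
BAD_NESTING = "<book><title>Foundations of Databases</book></title>"

# Missing end tag for book (an HTML browser would tolerate the analogous error).
MISSING_END = "<book><title>Foundations of Databases</title>"

# Two top-level elements: there is no single root.
TWO_ROOTS = "<book/><book/>"

results = {
    "good": is_well_formed(GOOD),
    "bad_nesting": is_well_formed(BAD_NESTING),
    "missing_end": is_well_formed(MISSING_END),
    "two_roots": is_well_formed(TWO_ROOTS),
}
```

Only the first document is accepted; for the other three the parser raises an error instead of guessing at the intended structure, in contrast to the lenient handling of HTML described above.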