XML and Web Data
Data in HTML
• HyperText Markup Language
  – Different data elements are set out using tags
• No schema?
  – Based on the data itself, we can make a reasonable guess about the structure
  – “Self-describing”
CMPT 354: Database I -- XML
Object and Schema
Semi-structured Data
• Object-like: it can be represented as a collection of objects
• Schemaless: it is not guaranteed to conform to any type structure
• Self-describing
  – Often carries only the names of the attributes and has a lower degree of organization than the data in the database
• Semi-structured data: data with the above characteristics
Schemaless But Self-Describing
(#12345,
  [ListName: "Students",
   Contents: {
     [Name: "John Doe", Id: "111111111",
      Address: [Number: 123, Street: "Main St"] ],
     [Name: "Joe Public", Id: "666666666",
      Address: [Number: 666, Street: "Hollow Rd"] ] } ] )
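The "reasonable guess about the structure" can be sketched in a few lines. The snippet below is only an illustration: the object is transcribed as nested Python dictionaries and lists (an assumed encoding, with the identifier field spelled Id throughout), and guess_structure is a hypothetical helper that reads attribute names and value types off the data itself.

```python
# The object from the slide, transcribed as nested Python data (an assumed encoding).
students = {
    "ListName": "Students",
    "Contents": [
        {"Name": "John Doe", "Id": "111111111",
         "Address": {"Number": 123, "Street": "Main St"}},
        {"Name": "Joe Public", "Id": "666666666",
         "Address": {"Number": 666, "Street": "Hollow Rd"}},
    ],
}


def guess_structure(value):
    """Infer a rough structure description from the data alone (hypothetical helper)."""
    if isinstance(value, dict):
        return {name: guess_structure(v) for name, v in value.items()}
    if isinstance(value, list):
        # Simplification: describe a collection by its first member only.
        return [guess_structure(value[0])] if value else []
    return type(value).__name__


structure = guess_structure(students)
print(structure)
```

Nothing guarantees that the second member of Contents has the same shape as the first, which is exactly what "schemaless" means here.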
XML
• Extensible Markup Language
  – A standard adopted in 1998 by the W3C (World Wide Web Consortium)
• Optional mechanisms for specifying document structure
  – DTD: the Document Type Definition Language, part of the XML standard
  – XML Schema: a more recent specification built on top of XML
• Query languages for XML
  – XPath: lightweight
  – XSLT: document transformation language
  – XQuery: a full-blown language
From HTML to XML
HTML and XML
• HTML
  – A fixed number of tags
  – Each tag has its own well-defined meaning
    • E.g., <table> … </table>
• XML: HTML-like language
  – An arbitrary number of user-defined tags
  – No a priori semantics
  – Mainly for data exchange
  – Display using stylesheet
Important Differences
• XML contains a large assortment of tags chosen by the document author
  – The only valid tags in HTML are those sanctioned by the official specification of the language; other tags are ignored
• Every opening tag must have a matching closing tag, and the tags must be properly nested
  – E.g., <a><b></a></b> is not allowed
  – Some HTML tags are not required to be closed, e.g., <p>
• The document has a root element
  – The element that contains all other elements
Example
[Figure: an XML document annotated with the labels mandatory statement, root element, XML elements, element names, and element contents]
Hierarchical Structure
PersonList (Student)
  Title
  Contents
    Person
      Name: John Doe
      Id: 111111111
      Address
        Number: 123
        Street: Main St
    Person
      Name: Joe Public
      Id: 666666666
      Address
        Number: 666
        Street: Hollow Rd
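The hierarchy can be recovered mechanically from a document. The sketch below uses Python's standard xml.etree.ElementTree module. The document text in DOC is an assumption assembled from the element and attribute names that appear on these slides, and outline is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Assumed document text, assembled from the names used on the slides.
DOC = """<?xml version="1.0" ?>
<PersonList Type="Student">
  <Title Value="Student List"/>
  <Contents>
    <Person>
      <Name>John Doe</Name>
      <Id>111111111</Id>
      <Address><Number>123</Number><Street>Main St</Street></Address>
    </Person>
    <Person>
      <Name>Joe Public</Name>
      <Id>666666666</Id>
      <Address><Number>666</Number><Street>Hollow Rd</Street></Address>
    </Person>
  </Contents>
</PersonList>"""


def outline(element, depth=0):
    """Return one indented line per element, with its text content if any."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)  # the root element contains all other elements
tree_lines = outline(root)
print("\n".join(tree_lines))
```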
Attributes
• <PersonList Type="Student">
  – Type is the name of an attribute that belongs to the element PersonList
  – Student is the attribute value
  – All attribute values must be quoted
  – Text strings between tags do not need to be quoted
• Empty element
  – <Title Value="Student List"/>
  – The element has one attribute and no content
  – A shorthand for <Title Value="Student List"></Title>
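As a quick check of these points, the following sketch reads an attribute value and compares the empty-element shorthand with its long form, using Python's standard xml.etree.ElementTree. The helper name summary is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<PersonList Type="Student"><Title Value="Student List"/></PersonList>'
)
list_type = root.get("Type")              # attribute value of the root element
title_value = root.find("Title").get("Value")


def summary(element):
    """Tag, attributes, text, and child count of an element (hypothetical helper)."""
    return (element.tag, dict(element.attrib), element.text or "", len(element))


# The shorthand and the explicit open/close pair parse to the same element.
short_form = ET.fromstring('<Title Value="Student List"/>')
long_form = ET.fromstring('<Title Value="Student List"></Title>')
same = summary(short_form) == summary(long_form)
print(list_type, title_value, same)
```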
Processing Instructions & Comments
• Processing instructions
  – <?xml version="1.0" ?>
  – Contain anything the author might want to communicate to the XML processor, e.g., <?my-command go bring coffee?>
  – Rarely used
• Comment
  – <!-- A comment -->
  – Can occur anywhere except inside markup, i.e., between the symbols < and >
  – An integral part of the document
  – May be used by a receiver (e.g., a browser)
CDATA Construct
• Includes strings of characters that contain markup elements that might make the document ill formed
• <![CDATA[ This is an example of markup in HTML: <b><i> Example </b></i> ]]>
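The effect of CDATA can be seen by parsing the same improperly nested fragment with and without the wrapper. This is a minimal sketch using Python's standard xml.etree.ElementTree; the enclosing element name Note is an assumption added so the fragment has a root.

```python
import xml.etree.ElementTree as ET

# Improperly nested HTML markup, placed directly inside an element.
raw = "<Note>Markup in HTML: <b><i> Example </b></i></Note>"
# The same characters protected by a CDATA section.
wrapped = "<Note><![CDATA[Markup in HTML: <b><i> Example </b></i>]]></Note>"

try:
    ET.fromstring(raw)
    raw_ok = True
except ET.ParseError:
    raw_ok = False  # the parser treats <b>, <i> as real tags and rejects the nesting

# Inside CDATA the characters are plain text, not markup.
note_text = ET.fromstring(wrapped).text
print(raw_ok, note_text)
```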
XML Elements and Data Objects
• XML allows mixed data/text structure
• XML elements are ordered
• XML has only one primitive type, string, and very weak facilities for specifying constraints

<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>
is different from
<Address>
  <Street> Main St </Street>
  <Number> 123 </Number>
</Address>

A legal XML document:
<Address>
  Sally lives on
  <Street> Main St </Street>
  house number
  <Number> 123 </Number>
  in the beautiful Anytown, Canada.
</Address>
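Both properties, ordering and mixed content, are visible to a parser. The sketch below uses Python's standard xml.etree.ElementTree on the slide's Address examples (whitespace trimmed for readability, which is an editorial simplification).

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<Address><Number>123</Number><Street>Main St</Street></Address>")
b = ET.fromstring("<Address><Street>Main St</Street><Number>123</Number></Address>")

# Child order is part of the document, so the two Address elements differ.
order_a = [child.tag for child in a]
order_b = [child.tag for child in b]

# Mixed content: free text interleaved with subelements.
mixed = ET.fromstring(
    "<Address>Sally lives on <Street>Main St</Street> house number "
    "<Number>123</Number> in the beautiful Anytown, Canada.</Address>"
)
# itertext() yields text and element content in document order.
sentence = "".join(mixed.itertext())
print(order_a, order_b, sentence)
```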
Use of Attributes
• An element can have any number of user-defined attributes
• What attributes can do can also be achieved with elements
  – An attribute may occur only once within a tag, while subelements with the same tag may be repeated
• Attributes introduce ambiguity as to whether to represent information as attributes or elements
  – Sometimes convenient for representing data, but the same can be done with elements
  – The use of attributes is expected to decline

<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>

<Address Number="123" Street="Main St"/>
Attributes in Markup
<Act Number="5">
  <Scene Number="1" Place="Mantua. A street">
    …
    <Apothecary Voice="scared">
      Such mortal drugs I have; but Mantua’s law
      Is death to any he that utters them.
    </Apothecary>
    <Romeo Voice="persistent">
      Art thou so bare and full of wretchedness,
      And fear’st to die? …
    </Romeo>
    …
  </Scene>
</Act>
Advantages of Attributes
• Attributes in an element are not ordered
  – <Address Number="123" Street="Main St"/>
  – <Address Street="Main St" Number="123"/>
• Attributes are more succinct
• Attributes can be declared to have unique values and can be used to enforce a limited kind of referential integrity

<Address>
  <Number> 123 </Number>
  <Street> Main St </Street>
</Address>
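The first two advantages are easy to confirm with a parser. A small sketch, again with Python's standard xml.etree.ElementTree; the variable names are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

attr_form_1 = '<Address Number="123" Street="Main St"/>'
attr_form_2 = '<Address Street="Main St" Number="123"/>'
element_form = "<Address><Number>123</Number><Street>Main St</Street></Address>"

# Attribute order carries no meaning: both parse to the same name/value mapping.
attributes_equal = (
    ET.fromstring(attr_form_1).attrib == ET.fromstring(attr_form_2).attrib
)

# Succinctness: the attribute form needs fewer characters for the same data.
attribute_form_shorter = len(attr_form_1) < len(element_form)
print(attributes_equal, attribute_form_shorter)
```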
ID and IDREF – Cross-References
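The idea behind ID and IDREF attributes is that ID values are unique within a document and every IDREF value matches some ID. The sketch below only imitates that check by hand: Python's xml.etree.ElementTree does not validate against a DTD, and the document, the attribute names StudId and Who, and the helper check_references are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: StudId plays the role of an ID, Who the role of an IDREF.
DOC = """<Report>
  <Student StudId="s1"/>
  <Student StudId="s2"/>
  <Enrolled Who="s1"/>
  <Enrolled Who="s9"/>
</Report>"""


def check_references(root, id_attr, ref_attr):
    """Return (duplicate ID values, reference values that match no ID)."""
    ids = Counter(
        e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None
    )
    duplicates = sorted(value for value, count in ids.items() if count > 1)
    dangling = sorted(
        e.get(ref_attr)
        for e in root.iter()
        if e.get(ref_attr) is not None and e.get(ref_attr) not in ids
    )
    return duplicates, dangling


duplicates, dangling = check_references(ET.fromstring(DOC), "StudId", "Who")
print(duplicates, dangling)
```

A validating parser would perform this kind of check from declarations in the DTD rather than from attribute names passed in by hand.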
Well Formed XML Document
• It has a root element
• Every opening tag is followed by a matching closing tag, and the elements are properly nested inside each other
• Any attribute can occur at most once in a given opening tag, its value must be provided, and this value must be quoted
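Each rule can be exercised with a non-validating parser, which rejects any document that is not well formed. A minimal sketch using Python's standard xml.etree.ElementTree; is_well_formed is a hypothetical helper and the test strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the parser accepts the text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


checks = {
    "properly nested": is_well_formed('<a><b x="1"></b></a>'),
    "bad nesting": is_well_formed("<a><b></a></b>"),
    "two roots": is_well_formed("<a></a><b></b>"),
    "duplicate attribute": is_well_formed('<a x="1" x="2"/>'),
    "unquoted value": is_well_formed("<a x=1/>"),
    "missing value": is_well_formed("<a x/>"),
}
for name, ok in checks.items():
    print(name, ok)
```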
Namespaces
• A term (tag) might have different meanings in different contexts
  – <Name><First>John</First> <Last>Doe</Last></Name>
  – <Name>Simon Fraser University</Name>
• Every XML tag must have two parts: namespace and local name
  – General structure: namespace:local-name
  – Namespace represented by URI (uniform resource identifier)
    • An abstract identifier (a general unique string)
    • URL (uniform resource locator)
Example – Namespace
• Namespaces are defined using the attribute xmlns
  – All names xml* should be considered reserved
• Default namespace xmlns="…"
  – Only one default namespace
• Other namespace xmlns:toy="…"
  – Prefixes (e.g., toy) must be distinct

<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
</item>
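A namespace-aware parser resolves each tag to a (namespace, local name) pair. The sketch below shows this for the example document with Python's standard xml.etree.ElementTree, which writes the pair as {namespace}local-name. The prefix sup in the lookup table is an assumption introduced here because the default namespace has no prefix of its own in the document.

```python
import xml.etree.ElementTree as ET

DOC = """<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
</item>"""

root = ET.fromstring(DOC)

# Prefixes used for searching; "sup" is our own label for the default namespace.
ns = {
    "sup": "http://www.acmeinc.com/jp#supplies",
    "toy": "http://www.acmeinc.com/jp#toys",
}
supply_name = root.find("sup:name", ns).text   # unprefixed <name> in the document
toy_name = root.find(".//toy:name", ns).text   # <toy:name> in the document
print(root.tag, supply_name, toy_name)
```

The two name elements share a local name but are distinct tags because their namespaces differ.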
Namespace Declarations
• Namespace as prefix
  – E.g., toy:item, toy:name
  – Tags without prefix belong to the default namespace
• Namespace declarations have scope
  – Can be nested like a program block
Example – Scopes of Namespaces
<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
  <item xmlns="http://www.acmeinc.com/jp#supplies2"
        xmlns:toy="http://www.acmeinc.com/jp#toys2">
    <name>notebook</name>
    <toy:name>sticker</toy:name>
  </item>
</item>
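Scoping can be checked by listing the namespace each name element resolves to. This is a sketch with Python's standard xml.etree.ElementTree; the dictionary name_namespaces is built only for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<item xmlns="http://www.acmeinc.com/jp#supplies"
      xmlns:toy="http://www.acmeinc.com/jp#toys">
  <name>backpack</name>
  <feature>
    <toy:item>
      <toy:name>cyberpet</toy:name>
    </toy:item>
  </feature>
  <item xmlns="http://www.acmeinc.com/jp#supplies2"
        xmlns:toy="http://www.acmeinc.com/jp#toys2">
    <name>notebook</name>
    <toy:name>sticker</toy:name>
  </item>
</item>"""

root = ET.fromstring(DOC)

# Map the text of every name element to the namespace it was resolved in.
# Resolved tags look like "{namespace}local-name".
name_namespaces = {
    el.text: el.tag[1:].split("}")[0]
    for el in root.iter()
    if el.tag.endswith("}name")
}
for text, namespace in name_namespaces.items():
    print(text, namespace)
```

The inner item redeclares both the default namespace and the toy prefix, so notebook and sticker resolve differently from backpack and cyberpet even though the tags are written the same way.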
More About Namespace
• The name of a namespace is just a string that happens to be a URL
• It is not necessarily a real address that contains a schema describing the corresponding set of names
• Don’t be misled by the URL!
Summary
• HTML and XML: differences and applications
• Structure of XML
  – Elements
  – Attributes
  – Well formed XML documents
• Namespace
To-Do List
• Can every relational table be represented in XML? Can every XML document be represented in a relational table?
• RSS is an application of XML. Try to understand the two RSS segments at http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html