Up to this point in the course, we’ve looked at highly structured data. Every tuple has a known schema, which allows the database engine to store data very efficiently. Unfortunately, all information about the database schema (relations, attributes, types, etc.) is stored separately from the data, and the data are virtually impossible to interpret in isolation. Try looking at a sqlite3 database in a text editor some time, for example. At the beginning of the course, we briefly touched on a different type of data: text-based and fairly self-describing. A person could load the dinesafe dataset in a simple text editor and quickly see what kind of data it contains.
To highlight just how opaque structured data can be, consider the C-style struct on the left. The schema, or metadata, is given as code, and the compiler applies that information anywhere a shape object is used. In other parts of the code, the compiler can infer from context that the object in curly braces on the right denotes a shape struct, but there is little information available at the use site. Looking at data stored in memory is even worse (see sample data dump on the right). There is no way to tell what those bytes represent. Even looking at the generated machine code, there is no easy way to reconstruct the original type definition, because the compiler has no further use for that metadata once it finishes generating code. The practice of embedding part or all of the schema in the application logic is a serious threat to data independence, and makes data interchange much more difficult.
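The same opacity can be sketched in Python. The slide's struct is not reproduced here, so the example assumes a hypothetical shape record with one 32-bit integer and two 32-bit floats. The `struct` format string plays the role of the schema: it lives in the code, not in the bytes.

```python
import struct

# Hypothetical layout standing in for the slide's struct:
# one little-endian int32 (sides) followed by two float32 fields (width, height).
SHAPE_FORMAT = "<iff"

def encode_shape(sides, width, height):
    """Pack a shape into raw bytes; no field names or types are stored."""
    return struct.pack(SHAPE_FORMAT, sides, width, height)

raw = encode_shape(4, 2.5, 1.0)
print(raw.hex())  # 12 anonymous bytes

# With the schema, the values come back intact.
right = struct.unpack(SHAPE_FORMAT, raw)
print(right)

# With a wrong guess at the schema, the same bytes still decode without
# any error, but the resulting numbers are meaningless.
wrong = struct.unpack("<3i", raw)
print(wrong)
```

Nothing in `raw` indicates which of the two decodings is the intended one; that knowledge exists only in the application.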
We just saw that a stream of highly structured bytes is essentially impossible to interpret unless that structure is known or can be inferred; at the other extreme, plain text is very difficult to interpret because it has no official structure that could be known or inferred. Fortunately, even plain text has some structure, and most forms of written information have a surprising amount of structure embedded in them. Writing consists of words, sentences and paragraphs, for example; books contain chapters; web pages are written in HTML, which can expose a surprising amount of structure with constructs such as headings, tables, etc. In spite of this structure, however, there is no real “schema” because every piece of data could be structured and formatted differently. This leads to a notion of “semistructured” data: data without a well-known schema, but which has structure that can be extracted by examining it. Where a database enforces “well-structured” data (all tuples conform to the schema), semistructured data requires only “well-formatted” data that can at least be parsed mechanically and its structure extracted. Such data is also called “self-describing” because each datum embeds information about its structure: field names, types, etc. Self-describing data are highly portable, but also tend to be verbose and redundant.
In this chapter we will introduce XML (eXtensible Markup Language), one of the most common self-describing data formats. Before moving to XML, however, we will briefly look at two other types of data that are self-describing to some degree. HTML, as noted previously, is highly structured and can often be used to infer a surprising amount of metadata about the data it presents. HTML is also extremely widespread and well-known. Unfortunately, HTML suffers from two major deficiencies as a data format. First, it was originally designed for data presentation, and most of its structure deals with showing data in a particular way. Any description of the data itself is accidental and incomplete at best. Second, multiple constructs can be used to represent the same formatting (e.g. div vs. span vs. table), and constructs are often misused, meaning that even the structure we do have is inconsistent, buggy, and difficult to parse correctly (a major source of rendering bugs in various web browsers, especially in past years).
One major competitor to XML is JavaScript Object Notation (JSON), a data exchange format that emerged organically as interactive web pages, driven by JavaScript, gained popularity. Practitioners found it simple and intuitive to send snippets of .js code that could initialize objects, relying on the language’s built-in parsing to convert that code into objects. JSON syntax is extremely clear and lightweight, in contrast to the complexity and ambiguity of HTML, but suffers from two main drawbacks. First, it has no support for any kind of metadata, other than field names, and is underspecified in the sense that it cannot constrain types or the ranges of values fields can take. This means that a large fraction of metadata will always be embedded in application logic when using JSON, making it more suitable for well-defined exchanges between cooperating parties than for storing data that might be seen or used by outsiders. The second drawback is that JSON lacks significant, mature data processing tools. Applications using JSON must load the data directly, then process it in whatever way they can; query functionality in JSON-based tools such as the various JavaScript frameworks and MongoDB is cursory at best.
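The first drawback can be sketched with Python's standard `json` module. The records and the `check_score` helper below are hypothetical, chosen only to show that the format carries field names but cannot say what type a field must have.

```python
import json

# Two hypothetical records; nothing in JSON says "score" must be a number.
text = '[{"name": "ann", "score": 12}, {"name": "bob", "score": "twelve"}]'

records = json.loads(text)  # both records parse without complaint

def check_score(record):
    """Application-side type check that JSON itself cannot express."""
    return isinstance(record.get("score"), int)

for record in records:
    print(record["name"], check_score(record))
```

The type rule lives in `check_score`, that is, in application logic, which is exactly the kind of embedded metadata the paragraph above warns about.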
Here you can see a small sample of XML data. It (intentionally) takes a similar form to HTML, as we will see on the next page.
XML is a hierarchical data format, meaning that everything is arranged as a tree. Pieces of data are surrounded by “tags”, which are themselves names surrounded by angle brackets. The tags serve as metadata, identifying the name and often the type of the data they surround. Within an opening tag, “attributes” can provide even more metadata (types, flags, etc.). There are several key differences between HTML and XML. First, XML is always strictly parsed: shortcuts and other forms of laziness are not allowed in a properly formatted XML document, and XML parsers are encouraged to reject invalid XML rather than attempting to work around syntax errors the way HTML parsers usually do. Second, tags have no fixed meaning in plain XML, and certainly no implied presentation format: <ul> does not necessarily mean “unordered list” and probably not even “underline.” In any case, an XML parser is not required to attach any meaning to tags; it only needs to parse them and present them to the application. This simplifies XML parsing drastically.
This diagram highlights the basic characteristics of XML. Pieces of data are called “elements” and can contain text, other elements, or both, or may be empty. Each element is surrounded by tags that give it a name, and zero or more attributes supply additional information. Each XML document consists of a single element (the “root”) that contains all other elements.
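These characteristics can be inspected with Python's standard `xml.etree.ElementTree` module. The document below is a made-up example, not the one on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a single root element that contains everything else.
doc = """<library>
  <book year="1999" lang="en"><title>Example</title>Some mixed text</book>
  <empty/>
</library>"""

root = ET.fromstring(doc)
print(root.tag)  # name of the root element

for child in root:
    # Each element has a name, zero or more attributes,
    # and zero or more child elements.
    print(child.tag, child.attrib, len(child))

book = root[0]
title = book[0]
# ElementTree stores text that follows a child element in that child's .tail.
print(repr(title.text), repr(title.tail))
```

Here `book` mixes a child element with text, and `empty` has neither text nor children nor attributes.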
The hierarchy in XML is “well-nested” in the sense that every opening tag has a corresponding closing tag, and closing tags arrive in the opposite order of opening tags. In other words, <a><b></a></b> is not well-formed XML because the element <a> closes before all of its children have closed (<b> in this case). In keeping with its hierarchical nature, we refer to parents, children, and siblings of elements; we will cover element relationships in more detail later.
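A strict parser enforces this nesting rule for us. The following sketch uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is an illustrative choice, not part of the library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as one XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b></b></a>"))  # properly nested
print(is_well_formed("<a><b></a></b>"))  # <a> closes before its child <b>
print(is_well_formed("<a><b></a>"))      # <b> is never closed
```

Only the first document is accepted; the parser raises an error for the other two instead of guessing what was meant.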
Here’s another example XML document. Note the XML comment syntax.
A graphical representation of an XML document (a subset of which was shown on the previous page).
This pretty much speaks for itself. XML has basic rules for correctness, but can also include further specifications of what content is allowed in a given document (element names, what goes inside what, etc.).
We’ll have to pay attention to whitespace later; XQuery and XSLT can both leak whitespace in user-visible ways if you’re not careful.
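The root cause is that whitespace between tags is part of the document's text. The sketch below uses Python's standard `xml.etree.ElementTree` rather than XQuery or XSLT, and the two small documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<list><item>a</item><item>b</item></list>")
indented = ET.fromstring("<list>\n  <item>a</item>\n  <item>b</item>\n</list>")

# Indentation is kept as text: .text before the first child,
# .tail after each child.
print(repr(compact.text), repr(indented.text))
print([item.tail for item in indented])

# Concatenating all text shows the whitespace leaking into the output.
compact_text = "".join(compact.itertext())
indented_text = "".join(indented.itertext())
print(repr(compact_text), repr(indented_text))
```

Both documents hold the same two items, yet their text content differs purely because of indentation.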
Another example...
xmllint is part of the libxml2 package, available on Cygwin and most other Unix-like platforms.
DTD exists to improve interoperability between entities sharing data, by making explicit some of the schema information embedded in application logic.
A DTD specifies the form each of these building blocks can take and how they can be combined.
CDATA is character data that the parser treats as opaque. All text inside it is ignored, other than to check for the character sequence that marks its end. The parser looks inside #PCDATA (parsed character data), on the other hand: it interprets anything there that looks like markup, so a validating parser rejects tags that appear where only #PCDATA is allowed.
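The difference is easiest to see with a CDATA section inside a document, which is one place where the opaque treatment applies. This sketch uses Python's standard `xml.etree.ElementTree` on two made-up documents; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, "<b>" and "&" are plain characters, not markup.
opaque = ET.fromstring("<note><![CDATA[<b>bold?</b> & more]]></note>")
print(len(opaque), repr(opaque.text))

# In ordinary parsed character data the same characters are markup:
# <b> becomes a child element and the entity reference &amp; is expanded.
parsed = ET.fromstring("<note><b>bold?</b> &amp; more</note>")
print(len(parsed), repr(parsed[0].text), repr(parsed[0].tail))
```

The first `note` has no children and its text contains the literal tag characters; the second has one child element named `b`.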
DTD is fairly flexible for specifying what children an element can have, but it’s not perfect. There’s no easy way to specify unordered sequences, for example. The closest you can come is something like <!ELEMENT foo (a*, b*, c*, a*, b*, c*)>, which would allow some reordering of a/b/c. However, that trick doesn’t scale (it wouldn’t allow c b a, for example) and doesn’t work with + and ?.
This should be pretty readable. The last element shows another way to allow unordered elements. It scales (award, award, honor, award is allowed), but still doesn’t work with + or ?.
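A DTD content model behaves much like a regular expression over the sequence of child element names, which makes both tricks easy to try out. The sketch below is a simplified stand-in for what a validating parser does, and its helper names are hypothetical. The model `(award | honor)*` is an assumption about what the slide's last element looks like.

```python
import re

def content_model_to_regex(model):
    """Translate a simple DTD content model into a regex over child names.

    Handles only element names, ',', '|', parentheses and the '*', '+', '?'
    suffixes. Each child name is encoded as the name followed by ';'.
    """
    compact = re.sub(r"\s+", "", model)  # whitespace is not significant
    named = re.sub(r"[A-Za-z_][\w.-]*",
                   lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                   compact)
    return re.compile(named.replace(",", ""))  # ',' is plain sequencing

def allows(model, children):
    """Check whether a list of child element names fits the content model."""
    encoded = "".join(name + ";" for name in children)
    return content_model_to_regex(model).fullmatch(encoded) is not None

sequence_trick = "(a*, b*, c*, a*, b*, c*)"
print(allows(sequence_trick, ["b", "c", "a"]))  # some reordering works
print(allows(sequence_trick, ["c", "b", "a"]))  # but not every order

choice_trick = "(award | honor)*"  # assumed form of the slide's example
print(allows(choice_trick, ["award", "award", "honor", "award"]))
print(allows(choice_trick, ["award"]))  # cannot require at least one honor
```

The last call shows why the choice form does not combine with + or ?: once any order is allowed, the model can no longer say how many of each child must appear.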
Here is a full DTD specification for a hypothetical gamer database.
Moving on to attributes, we see quite a few options for constraining the values attributes can take (far more than we have for the text inside elements).