The XML Typechecking Problem
Dan Suciu, University of Washington
Presented by T.J. Green∗, University of Pennsylvania
February 19, 2004
∗ with LaTeX slides!
XML Data Model Subset of the XQuery data model: XML documents are ordered trees with labels at nodes. More precisely, fix an alphabet Σ of tag names, attribute names, and atomic type names. Denote by T_Σ the set of ordered trees in which each node is labeled with an element of Σ. 1
XML Types A type is a subset of T_Σ that is a regular tree language. Formally, a type is defined by a set of type identifiers T together with a mapping that associates to each identifier a regular expression over Σ × T. 2
XML Types - an example Here is an example, using XQuery’s syntax.
    TYPE Catalog = ELEMENT catalog(Products)
    TYPE Products = (ELEMENT product(Product))*
    TYPE Product = (ATTRIBUTE name(STRING)?,
                    (ELEMENT mfr-price(INTEGER) | ELEMENT sale-price(INTEGER))*,
                    (ELEMENT color(STRING))*)
3
Expressiveness of type formalism Of course, this formalism does not capture all the details of a real XML type system. But it is actually more powerful than XML Schema or DTD’s in one respect. 4
Expressiveness of type formalism Consider the set of pairs (σ, t) ∈ Σ × T that occur in the regular expression for some type identifier. XML Schema requires that σ be a key in this collection: within one regular expression, a tag determines its type. DTD requires that σ be a key in the entire collection of pairs in all regular expressions: a tag determines its type everywhere. 5
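The two key conditions can be sketched in Python. This is only an illustration: each type identifier is reduced to the list of (tag, type) pairs occurring in its regular expression, and the names `CATALOG`, `CONTEXTUAL`, and `AMBIGUOUS` (and the types inside the last two) are hypothetical examples, not part of the talk.

```python
def tag_is_key(pairs):
    """True if no tag in `pairs` is associated with two different type ids."""
    seen = {}
    for tag, type_id in pairs:
        if seen.setdefault(tag, type_id) != type_id:
            return False
    return True

def satisfies_xml_schema_rule(types):
    # The tag must be a key within the pairs of each single regular expression.
    return all(tag_is_key(pairs) for pairs in types.values())

def satisfies_dtd_rule(types):
    # The tag must be a key across the pairs of all regular expressions.
    all_pairs = [pair for pairs in types.values() for pair in pairs]
    return tag_is_key(all_pairs)

# The catalog type from the earlier example: every tag has one type everywhere.
CATALOG = {
    "Catalog": [("catalog", "Products")],
    "Products": [("product", "Product")],
    "Product": [("name", "STRING"), ("mfr-price", "INTEGER"),
                ("sale-price", "INTEGER"), ("color", "STRING")],
}

# Hypothetical type: `item` has a different type depending on its parent.
CONTEXTUAL = {
    "Doc": [("new", "A"), ("used", "B")],
    "A": [("item", "Full")],
    "B": [("item", "Brief")],
}

# Hypothetical type: `item` has two types inside one regular expression.
AMBIGUOUS = {"Doc": [("item", "Full"), ("item", "Brief")]}
```

Under these assumptions, `CATALOG` passes both checks, `CONTEXTUAL` passes only the XML Schema check, and `AMBIGUOUS` passes neither, although all three are legal in the formalism used here.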
Regular tree languages Regular tree languages have been extensively studied for ranked trees (i.e., where the number of children of a node is fixed). But the XML data model is unranked. 6
Modified regular tree languages Various equivalent modifications can handle this (extending tree automata to unranked trees; using specialized DTD’s; mapping unranked trees into ranked binary trees; defining types as in XDuce or XQuery). Here we use the XQuery style regular types. 7
Containment of regular tree languages Key property of regular tree languages: given two types τ₁, τ₂, one can check whether τ₁ ⊆ τ₂. The complexity is high in general (EXPTIME-complete), but the check is in PTIME if τ₂ corresponds to a deterministic tree automaton. 8
The Validation Problem Given a tree t ∈ T_Σ and a type τ, decide whether t ∈ τ. But what if instead of a document, we are given a program whose output is an XML document? 9
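A minimal validation sketch in Python, under simplifying assumptions that are not from the talk: each tag determines its type (the DTD-style restriction), attributes are modeled as children whose tag starts with `@`, a node is a `(tag, content)` pair, and each regular expression is written over space-terminated tag names so that Python's `re` module can match it.

```python
import re

# type id -> (regex over space-terminated tags, {tag: type id of its content})
TYPES = {
    "Catalog": (r"(catalog )", {"catalog": "Products"}),
    "Products": (r"(product )*", {"product": "Product"}),
    "Product": (r"(@name )?((mfr-price )|(sale-price ))*(color )*",
                {"@name": "STRING", "mfr-price": "INTEGER",
                 "sale-price": "INTEGER", "color": "STRING"}),
}

def validate(content, type_id):
    """Check that `content` (an atomic value or a list of nodes) has type `type_id`."""
    if type_id == "STRING":
        return isinstance(content, str)
    if type_id == "INTEGER":
        return isinstance(content, int) and not isinstance(content, bool)
    if not isinstance(content, list):
        return False
    regex, child_types = TYPES[type_id]
    # The sequence of tags must match the regular expression of the type...
    word = "".join(tag + " " for tag, _ in content)
    if re.fullmatch(regex, word) is None:
        return False
    # ...and the content of every node must have the type paired with its tag.
    return all(validate(sub, child_types[tag]) for tag, sub in content)

good = [("catalog", [("product", [("@name", "widget"), ("mfr-price", 10),
                                  ("sale-price", 8), ("color", "red")])])]
bad = [("catalog", [("product", [("color", "red"), ("mfr-price", 10)])])]
```

Here `validate(good, "Catalog")` succeeds, while `bad` fails because a `color` element precedes a price element.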
The Typechecking Problem Given a program P defining a function P : D → T_Σ, where D is the program’s input domain, and a type τ ⊆ T_Σ, decide whether ∀x ∈ D, P(x) ∈ τ. 10
The Typechecking Problem So the typechecker analyzes the program, decides whether all documents produced by the program are valid, and returns yes or no. If no, we would also like to know where in the program typechecking failed. (May be hard though.) 11
The Typechecking Problem Typechecking may not even be possible, in which case we may need to settle for an incomplete typechecker, which may reject some programs that in fact do typecheck. 12
The Type Inference Problem A kind of dual of the typechecking problem. Given a program P, compute the type P(D) = { P(x) | x ∈ D }. Again, perfect type inference may not be possible, and we may need to settle for incomplete type inference. 13
What kind of programs? We consider two different kinds of programs, depending on the application . 14
Application 1 - XML Publishing Here the XML document is a view over a relational database. The program’s domain is D = Inst(S), the set of all database instances of some schema S. S may contain key and foreign key constraints. P may perform only simple select-project-join queries on the database, nest the results, and add appropriate XML tags. 15
Application 1 - XML Publishing Consider some database whose schema S is defined as follows.
    product(pid:STRING, name:STRING, mfrprice:INTEGER)
    colors(cid:STRING, pid:STRING, color:STRING)
    sale(sid:STRING, pid:STRING, price:INTEGER)
The first attribute of each relation is a key. Foreign key constraints are suggested by the attribute names. 16
Application 1 - XML Publishing Now, here is an example of an XQuery program that produces an XML view of this database.
    <catalog>
    { FOR $p in $db/product/tuple
      RETURN
        <product name = { data($p/name) }>
          <mfr-price> { data($p/mfrprice) } </mfr-price>
          { FOR $s in $db/sale/tuple
            WHERE $p/pid = $s/pid
            RETURN <sale-price> { data($s/price) } </sale-price> }
          { FOR $c in $db/colors/tuple
            WHERE $p/pid = $c/pid
            RETURN <color> { data($c/color) } </color> }
        </product> }
    </catalog>
17
Application 2 - XML Transformations The other class of applications we consider is those that require XML transformations. Here, the program’s input is an XML document, that is, the program’s domain D is either T_Σ or some XML type τ. The output is another XML document. 18
Application 2 - XML Transformations We take as our programming language a restricted fragment of XSLT that includes:
• recursive templates
• modes
• apply-templates, which can be called along any XPath axis
• variables, which can be bound to nodes in the input tree and then passed as parameters
• an equality test, which can be performed between node IDs but not between node values
19
Application 2 - XML Transformations We can formalize this language in terms of k-pebble tree transducers. That formalism is beyond the scope of this talk. 20
Type Checking or Type Inference? One way to perform typechecking is by using type inference: infer the output type τ₁ of the program, and check for containment in the desired output type τ₂, i.e., τ₁ ⊆ τ₂. We’ll first consider type inference. 21
Type Inference Consider the XQuery program shown a few slides back. We humans can infer its output type as
    TYPE T1 = ELEMENT catalog(T2)
    TYPE T2 = (ELEMENT product(T3))*
    TYPE T3 = ATTRIBUTE name(STRING),
              ELEMENT mfr-price(INTEGER),
              (ELEMENT sale-price(INTEGER))*,
              (ELEMENT color(STRING))*
How? The catalog tag at the root is obvious (T1). Several product children (T2). Analyze the RETURN clause: product has exactly one name attribute, one mfr-price child, and several sale-price and color children. 22
Type Inference More programmatically, the general idea is that one infers the type of a RETURN expression from the types of its components. The XQuery formal semantics applies this to the entire language by providing type inference rules for each language construct. Type inference is used to perform typechecking in XQuery. 23
Type Inference For the XML publishing application, we actually need an enhancement to make use of key and foreign key constraints in order to infer the correct output type. For example, knowing that pid is also a key for sale (each product has at most one sale price) narrows T3 by replacing (ELEMENT sale-price(INTEGER))* with (ELEMENT sale-price(INTEGER))?. 24
Limitations of Type Inference Suppose the relational schema has a single table, R(x,y), and the XQuery program is:
    <result>
    { FOR $x in $db/R/tuple RETURN <a/>,
      FOR $x in $db/R/tuple RETURN <b/> }
    </result>
25
Limitations of Type Inference XQuery infers its output type as
    TYPE T = ELEMENT result((ELEMENT a)*, (ELEMENT b)*)
but the real output type is
    P(D) = { ELEMENT result((ELEMENT a)^n, (ELEMENT b)^n) | n ≥ 0 }
since we have the same number of a’s and b’s. But obviously this is not a regular tree language, so we cannot hope to infer it, and must settle for T instead. 26
Limitations of Type Inference But T is an ad-hoc choice, and now we incorrectly fail to typecheck with respect to the output type
    T1 = ELEMENT result() | ELEMENT result(ELEMENT a, (ELEMENT a)*, ELEMENT b, (ELEMENT b)*)
In reality the program typechecks against this type, because T1 just rules out the cases of (0 a’s, 1+ b’s) or (1+ a’s, 0 b’s), which the program never produces. Yet the typechecker rejects it, because the inferred type is not contained in the desired one: T ⊈ T1. 27
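The gap can be checked on a small scale in Python. As a simplifying assumption, the children of `result` are encoded as a string over the letters `a` and `b`, so both types become ordinary regular expressions; the names `INFERRED`, `DESIRED`, and `program_output` are illustrative only.

```python
import re

INFERRED = r"a*b*"          # T: children of result as inferred by XQuery
DESIRED = r"(?:aa*bb*)?"    # T1: either no children, or 1+ a's then 1+ b's

def program_output(n):
    """Children produced by the program when table R holds n tuples."""
    return "a" * n + "b" * n

# Every actual output (checked here up to 50 tuples) conforms to T1 ...
outputs_conform = all(
    re.fullmatch(DESIRED, program_output(n)) is not None for n in range(50)
)

# ... but T contains words outside T1, so the check T ⊆ T1 fails.
witness = "a"
witness_in_inferred = re.fullmatch(INFERRED, witness) is not None
witness_in_desired = re.fullmatch(DESIRED, witness) is not None
```

The witness `a` (one a, no b's) belongs to T but not to T1, and the program never produces it, which is exactly why inference-based typechecking rejects a correct program here.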
Typechecking Given these limitations, maybe we can do better trying to do typechecking without type inference? Indeed, given certain restrictions on the programming language and output type, it is possible. 28
Typechecking for XML Publishing Here is an algorithm for typechecking P against τ : enumerate all “small” input databases (up to a size which depends only on P and τ ); run P on each; check that the output conforms to τ . Not the most efficient algorithm, but it works*! 29
Typechecking for XML Publishing *Actually, two restrictions on the output type τ are required:
• τ must be a DTD type
• all regular expressions in τ must be “star-free”
30
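The enumeration idea can be sketched in Python for the single-table program from the type inference slides. The talk only states that a size bound depending on P and τ exists; the bound `max_rows` and the value domain used here are hypothetical parameters chosen for illustration, and the output type is again encoded as a regular expression over the letters `a` and `b`.

```python
import itertools
import re

def program(rows):
    """The view over R(x, y): one <a/> per tuple, then one <b/> per tuple."""
    return "a" * len(rows) + "b" * len(rows)

def typecheck_by_enumeration(program, output_type, domain, max_rows):
    """Run `program` on every instance of R with at most `max_rows` tuples.

    Returns (True, None) if all outputs match `output_type`, otherwise
    (False, counterexample). `max_rows` stands in for the bound mentioned
    in the talk; its actual value is not given there (assumption).
    """
    tuples = list(itertools.product(domain, repeat=2))
    for n in range(max_rows + 1):
        for rows in itertools.combinations(tuples, n):
            if re.fullmatch(output_type, program(rows)) is None:
                return False, rows
    return True, None

accepted = typecheck_by_enumeration(program, r"(?:aa*bb*)?", [0, 1], 3)
rejected = typecheck_by_enumeration(program, r"aa*bb*", [0, 1], 3)
```

With these assumptions the program is accepted against the type that allows an empty result, and rejected against the type that demands at least one `a`, with the empty database as the counterexample.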
Aside: star-free regular expressions Star-free means no Kleene closure, but complement (compl) and the empty set (∅) may be used. This gives something Kleene closure-like, which in fact can express all examples given so far in this talk. For example, if Σ = {a, b, c}, then compl(∅) denotes Σ*, and compl(Σ*.b.Σ* | Σ*.c.Σ*) denotes a*. But not all Kleene closure expressions can be expressed this way. An example that cannot: (a.a)*. 31
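The two star-free identities can be checked by brute force in Python. This is a bounded check, not a proof: languages are represented as finite sets of words of length at most `MAX_LEN` (an arbitrary cutoff chosen here), and complement is taken relative to that bounded universe.

```python
from itertools import product

SIGMA = "abc"
MAX_LEN = 4
# All words over SIGMA of length at most MAX_LEN.
UNIVERSE = {"".join(w) for n in range(MAX_LEN + 1)
            for w in product(SIGMA, repeat=n)}

def compl(language):
    """Complement, restricted to words of length at most MAX_LEN."""
    return UNIVERSE - language

def concat(left, right):
    """Concatenation, keeping only words of length at most MAX_LEN."""
    return {u + v for u in left for v in right if len(u + v) <= MAX_LEN}

sigma_star = compl(set())                                  # compl(∅)
has_b = concat(concat(sigma_star, {"b"}), sigma_star)      # Σ*.b.Σ*
has_c = concat(concat(sigma_star, {"c"}), sigma_star)      # Σ*.c.Σ*
only_as = compl(has_b | has_c)                             # compl(Σ*.b.Σ* | Σ*.c.Σ*)
```

Up to the cutoff, `sigma_star` is every word over Σ and `only_as` is exactly the set of words consisting solely of a's, matching the claims on the slide.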
Limitations of Typechecking Unfortunately, the restrictions we have given are critical. Allowing output types that are not DTD’s or increasing the expressive power of the language leads to undecidability. 32