COMP6037 Semi-structured Data and the Web Tree Grammars and Relax - PowerPoint PPT Presentation

COMP6037 Semi-structured Data and the Web
Tree Grammars and Relax NG, week 3
Uli Sattler, University of Manchester

General Stuff
• Read Blackboard’s Announcements
• Read Blackboard’s Discussions
• Forward your Blackboard email to your email account
• Start early with your coursework: trying to figure out what to do on Sunday night might be difficult!

Schema languages...they are sooo different!
• I hope you understand that
  – there are various issues involved in the validation of a document
  – validation and data quality are closely related
  – structure and content are different aspects of a document
  – people have built numerous formalisms & tools to help you validate your documents
    • with various applications in mind
    • with various goals re. generality/complexity/simplicity
    • with various expressive means/restrictions to describe
      – datatypes
      – structure
    • with various object-oriented mechanisms such as inheritance for ease of authoring & maintaining
  – which of them to choose (first/how to mix) depends on your application
• I invite you to compare this with other areas, e.g., parser generators

Schema languages...they are sooo different!
• So far, you know 2 XML schema languages:
  – DTDs
  – WXS
• There are many more for XML
  – SOX, SGML, RelaxNG, Schematron, …
• They differ in
  1. their style
  2. usability
  3. computability
     • how complex is it to validate documents?
  4. expressive power (which is closely related to (3))
     • data type support
     • structural expressiveness
       – what kind of trees can I describe?
     • uniqueness constraints
       – ...

A comparison: who can describe structure better?
• Obvious question:
  ➡ which of DTD, WXS, and RelaxNG has more powerful means to describe structure?
  – how to answer this?
  – perhaps they are orthogonal?
  – how to compare DTD (no types) with WXS, and RelaxNG (later)?
  – how do we know how costly validation is?
    • e.g., is validating a document D against a DTD more expensive than validating a document against an XML Schema?
    • “Ah, let’s build a DTD-validator and a WXS-validator and compare run-time and memory requirements for similar (?) inputs on the same document!”
    • ...how can we make sure that we have been equally clever for both validators?
    • ...what are “similar inputs”?
    • ...how many do we need to test for?

Interesting example where DTDs and WXS differ
• assume we want to define an element NameType in a DTD….

    <xs:complexType name="NameType">
      <xs:all>
        <xs:element name="firstname" type="xs:string"/>
        <xs:element name="secondname" type="xs:string"/>
        <xs:element name="lastname" type="xs:string"/>
      </xs:all>
    </xs:complexType>

• how?
• ….it’s unmanageable and long (how?), but possible
• are there other examples where something is not possible in DTD? Or in WXS?
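
One way to see why the DTD version gets long: a DTD has no counterpart of xs:all, so every permitted order of the children has to be spelled out. The sketch below generates such a content model. The element name `name` and the helper names `all_model` and `element_decl` are assumptions made for illustration; they do not come from the slides. The alternatives are factored by their first child, so a validator never has to look ahead to decide which branch it is in.

```python
def all_model(names):
    """Return a DTD content model that allows each of `names`
    exactly once, in any order (the effect of xs:all)."""
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, first in enumerate(names):
        # Pick one element to come first, then recurse on the rest.
        rest = names[:i] + names[i + 1:]
        alternatives.append(f"({first}, {all_model(rest)})")
    return "(" + " | ".join(alternatives) + ")"


def element_decl(element, names):
    """Wrap the content model in an <!ELEMENT ...> declaration."""
    model = all_model(names)
    if not model.startswith("("):
        model = f"({model})"  # a DTD content model must be parenthesised
    return f"<!ELEMENT {element} {model}>"


# `name` is a hypothetical element that would carry the type NameType.
print(element_decl("name", ["firstname", "secondname", "lastname"]))
```

With n children the model covers n! orderings, which is why this is possible for three elements but quickly becomes unmanageable.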

A comparison: who can describe structure better?
• Use formal methods:
  1. view XML document D as a simplified DOM tree T_D
     • capture a schema S in a grammar/automaton G_S
     • such that T_D ∈ L(G_S) if and only if D validates against S
     • i.e., validation corresponds to acceptance by automaton
  2. schema languages of type X ⇋ grammars/automata of type C_X
  3. check: for each G_S1 of type C_X capturing a schema S1 of type X, can we build G_S2 of type C_Y capturing a schema S2 of type Y, such that L(G_S1) = L(G_S2)? If yes, then Y is at least as expressive as X.
• ...so, let’s do this: we will see definitions and lemmas/theorems
  ๏ definitions fix the meaning of terms in an unambiguous way (and you need to understand them to follow the rest)
  ★ lemmas and theorems state properties of the concepts introduced
• we will see and work on examples

Basics - regular expressions
๏ Given a set of symbols N, the set of regular expressions regexp(N) over N is the smallest set containing
  – the empty string ε and all symbols in N, and
  – if e_1 and e_2 ∈ regexp(N), then so are
    • e_1, e_2 (concatenation)
    • e_1 | e_2 (choice)
    • e_1* (repetition)
๏ Given a regular expression e, a string w matches e
  – if w = ε = e or w = n = e for some n in N, or
  – if w = w_1 w_2 and e = (e_1, e_2) and w_1 matches e_1 and w_2 matches e_2, or
  – if e = (e_1 | e_2) and w matches e_1 or w matches e_2, or
  – if w = ε and e = e_1*, or
  – if w = w_1 w_2 ... w_n and e = e_1* and each w_i matches e_1

Basics - regular expressions
• Hence we can use
  – e+ as abbreviation for (e, e*)
  – e? as abbreviation for (e | ε)
• Test:
  – does ababa match (a, b)*?
  – does ababa match (a | b)*?
  – does abababa match (a, b)*, a?, b, a, b*?
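
The definition of "matches" can be turned into a small checker, which is handy for trying the test questions. This is only a sketch: the tuple encoding of expressions and the names `ends`, `matches`, `seq`, `alt`, `star`, `opt` and `sym` are assumptions chosen for this illustration, not notation from the lecture.

```python
EPS = ("eps",)

def sym(n):     return ("sym", n)
def seq(*es):   # e1, e2, ... (concatenation), nested to the right
    return es[0] if len(es) == 1 else ("seq", es[0], seq(*es[1:]))
def alt(e1, e2): return ("alt", e1, e2)   # e1 | e2 (choice)
def star(e):     return ("star", e)       # e* (repetition)
def opt(e):      return alt(e, EPS)       # e? abbreviates (e | ε)


def ends(e, w, i):
    """Set of positions j such that w[i:j] matches e."""
    kind = e[0]
    if kind == "eps":
        return {i}
    if kind == "sym":
        return {i + 1} if i < len(w) and w[i] == e[1] else set()
    if kind == "seq":
        return {k for j in ends(e[1], w, i) for k in ends(e[2], w, j)}
    if kind == "alt":
        return ends(e[1], w, i) | ends(e[2], w, i)
    if kind == "star":
        # Zero or more repetitions: keep extending until nothing new appears.
        reached, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(e[1], w, j)} - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown expression kind: {kind}")


def matches(w, e):
    """True if the whole string w matches e."""
    return len(w) in ends(e, w, 0)


a, b = sym("a"), sym("b")
print(matches("ababa", star(seq(a, b))))
print(matches("ababa", star(alt(a, b))))
print(matches("abababa", seq(star(seq(a, b)), opt(a), b, a, star(b))))
```

Tracking sets of end positions avoids guessing how to split w into w_1 w_2, which the definition leaves open.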

Towards trees...
• A regular expression e describes a set of strings L(e):
  – L(e) := { w | w matches e }
  – L(e) can be finite, infinite,...can it be empty?
• A schema S describes a set of trees L(S):
  – L(S) := { t | t validates against S }
  – L(S) can be finite, infinite, empty,….can it be empty?
• As a simplification, we will look into Tree Grammars:
• A tree grammar G describes a set of trees L(G):
  – L(G) := { t | G accepts t }

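Trees: nodes as strings!
[Figure: a tree over {A,B,C}, shown with its labels and with its nodes named by strings. The root ε is labelled B; its children 0 and 1 are both labelled A; node 0 has children 0,0, 0,1 and 0,2 labelled B, A and B; node 1 has a single child 1,0 labelled B.]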

Trees: nodes as strings!
[Figure: the example tree with root ε labelled B, children 0 and 1 labelled A, leaves 0,0, 0,1 and 0,2 labelled B, A and B, and leaf 1,0 labelled B.]
๏ We use ℕ for the non-negative integers (including 0)
๏ we use ℕ* for the set of all (finite) strings over ℕ
  • ε is used for the empty string
  • 0,1,0 is a string of length 3
  • each string stands for a node
๏ An alphabet is a finite set of symbols
๏ A tree T over an alphabet Σ is a mapping T: ℕ* → Σ with a domain that
  ๏ is finite (i.e., T(w) is defined for only finitely many strings w over ℕ),
  ๏ contains ε (i.e., T(ε) is defined), and
  ๏ is prefix-closed (i.e., if T(w,n) is defined, then T(w) is as well)
• Explanation:
  • the strings in the domain of T represent T’s nodes
  • (w,n) is a successor (child) of w,
  • T(w) is the label of w (as shown in picture)
  • we use nodes(T) for the (finite) domain of/nodes in T
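
As a concrete reading of this definition, a tree can be stored as a finite mapping from node strings to labels. The sketch below assumes Python tuples for the strings over ℕ (so ε is `()` and 0,1 is `(0, 1)`); the names `T`, `is_tree_domain` and `children` are chosen for this illustration only. The example tree is the one from the picture.

```python
# The example tree: each key is a node (a string over ℕ), each value its label.
T = {
    (): "B",
    (0,): "A", (1,): "A",
    (0, 0): "B", (0, 1): "A", (0, 2): "B",
    (1, 0): "B",
}


def is_tree_domain(nodes):
    """Check the conditions on the domain: it contains ε and is prefix-closed.
    Finiteness holds automatically for a Python dict or set."""
    nodes = set(nodes)
    if () not in nodes:
        return False
    # Dropping the last number of a non-empty node must give a node again.
    return all(w[:-1] in nodes for w in nodes if w)


def children(tree, w):
    """The successors (w,n) of w that are nodes of the tree, in order."""
    return sorted(v for v in tree if len(v) == len(w) + 1 and v[:len(w)] == w)


print(is_tree_domain(T))
print(children(T, (0,)))
print([T[c] for c in children(T, (0,))])
```

Checking only the immediate prefix is enough, because applying the check to every node covers all shorter prefixes as well.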

Tree Grammars: definition
๏ A tree grammar is a structure G = (N, Σ, S, P) where
  – N is a finite set of nonterminal symbols,
  – Σ is an alphabet,
  – S ⊆ N is a set of start symbols, and
  – P is a set of production rules, i.e., each R ∈ P is of the form
    • X → a e
    • where X ∈ N, a ∈ Σ, and e ∈ regexp(N)
• Example: G = (N, Σ, S, P) with
  N = {Book, Author, Editor, Affiliation, Paper, Country}
  Σ = {B, Name, A, P, C}
  S = {Book, Paper}
  P = { Book → B Editor,
        Paper → P Author,
        Editor → Name Country,
        Author → Name Affiliation,
        Country → C ε,
        Affiliation → A ε }

Tree Grammars: what do they do?
  N = {Book, Author, Editor, Affiliation, Paper, Country}
  Σ = {B, Name, A, P, C}
  S = {Book, Paper}
  P = { Book → B Editor,
        Paper → P Author,
        Editor → Name Country,
        Author → Name Affiliation,
        Country → C ε,
        Affiliation → A ε }
  (only super simple regexps)
• a grammar runs on a tree...
• given a tree T over Σ, if G can run on T, then
  – G is said to accept T,
  – written T ∈ L(G),
  – i.e., T is in the language accepted by G
• remember: grammar G corresponds to a schema S_G, and we want to build G such that T ∈ L(G) holds iff T validates against S_G
• ...let’s see a grammar run...

Tree Grammars: definition of runs
๏ A run of G = (N, Σ, S, P) on a tree T is a mapping r: nodes(T) → N such that:
  – r(ε) ∈ S    % r labels the root node of T with a start symbol
  – for each w ∈ nodes(T) with children w_1 w_2 ... w_n, there exists a rule X → a e ∈ P such that
    • r(w) = X,
    • T(w) = a, and
    • r(w_1) r(w_2) ... r(w_n) matches e.
• let’s see an example run of
  N = {Book, Author, Editor, Affilia, Paper, F, L}
  Σ = {B, P, Name, F, L, A}
  S = {Book, Paper}
  P = { Book → B Editor|Author,
        Paper → P Author,
        Editor → Name F,L,
        Author → Name L,Affilia,
        F → F ε,
        L → L ε,
        Affilia → A ε }
[Figure: an example run on the tree with root ε labelled B, child 0 labelled Name, and leaves 0,0 and 0,1 labelled F and L. The run assigns Book to ε (Paper is crossed out ✘), Editor to 0, F to 0,0, and L to 0,1.]
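
The two conditions on a run can be checked mechanically. The sketch below encodes the grammar from this slide and tests a candidate mapping r against a tree. The tuple encoding of rules and expressions and the names `ends`, `children` and `is_run` are assumptions made for this illustration, and the example tree and run are written down as far as they can be read off the figure.

```python
EPS = ("eps",)
def sym(x): return ("sym", x)

# Rules X → a e as triples (X, a, e); e is a regular expression over N.
RULES = [
    ("Book",    "B",    ("alt", sym("Editor"), sym("Author"))),
    ("Paper",   "P",    sym("Author")),
    ("Editor",  "Name", ("seq", sym("F"), sym("L"))),
    ("Author",  "Name", ("seq", sym("L"), sym("Affilia"))),
    ("F", "F", EPS), ("L", "L", EPS), ("Affilia", "A", EPS),
]
START = {"Book", "Paper"}


def ends(e, w, i):
    """Positions j such that the slice w[i:j] matches e."""
    if e[0] == "eps":
        return {i}
    if e[0] == "sym":
        return {i + 1} if i < len(w) and w[i] == e[1] else set()
    if e[0] == "seq":
        return {k for j in ends(e[1], w, i) for k in ends(e[2], w, j)}
    if e[0] == "alt":
        return ends(e[1], w, i) | ends(e[2], w, i)
    raise ValueError(e[0])  # repetition is not needed for this grammar


def children(tree, w):
    return sorted(v for v in tree if len(v) == len(w) + 1 and v[:len(w)] == w)


def is_run(r, tree):
    """Check that r: nodes(T) → N is a run of the grammar on tree."""
    if set(r) != set(tree) or r[()] not in START:
        return False
    for w in tree:
        kids = [r[c] for c in children(tree, w)]  # r(w_1) ... r(w_n)
        if not any(x == r[w] and a == tree[w] and len(kids) in ends(e, kids, 0)
                   for x, a, e in RULES):
            return False
    return True


tree = {(): "B", (0,): "Name", (0, 0): "F", (0, 1): "L"}
run = {(): "Book", (0,): "Editor", (0, 0): "F", (0, 1): "L"}
print(is_run(run, tree))
print(is_run({**run, (): "Paper"}, tree))  # Paper needs a root labelled P
```

Here the children's nonterminals are passed to the matcher as a list, so the "string" r(w_1) ... r(w_n) is a sequence of names rather than of characters.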

Tree Grammars: trees accepted
๏ A tree T is accepted by a grammar G if there is a run of G on T
  – we write T ∈ L(G).
• which of these trees is accepted by our grammar?
[Figure: three candidate trees with roots labelled P, B, and P. In each, the root ε has a single child 0 labelled Name, which has two leaf children 0,0 and 0,1; the leaf labels are drawn from A, L, and F.]
  N = {Book, Author, Editor, Affilia, Paper, F, L}
  Σ = {B, Name, F, L, A, P}
  S = {Book, Paper}
  P = { Book → B Editor|Author,
        Paper → P Author,
        Editor → Name F,L,
        Author → Name L,Affilia,
        F → F ε,
        L → L ε,
        Affilia → A ε }
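
Acceptance asks whether some run exists, so one option is to compute, bottom-up, the set of nonterminals that could label each node. The sketch below does this for the grammar above. The encoding and the names `ends`, `possible` and `accepts` are assumptions made for this illustration, and the three trees `t1`, `t2`, `t3` are hypothetical examples in the spirit of the figure, since its exact leaf labels are not certain here.

```python
EPS = ("eps",)
def sym(x): return ("sym", x)

RULES = [  # X → a e as triples (X, a, e)
    ("Book",    "B",    ("alt", sym("Editor"), sym("Author"))),
    ("Paper",   "P",    sym("Author")),
    ("Editor",  "Name", ("seq", sym("F"), sym("L"))),
    ("Author",  "Name", ("seq", sym("L"), sym("Affilia"))),
    ("F", "F", EPS), ("L", "L", EPS), ("Affilia", "A", EPS),
]
START = {"Book", "Paper"}


def ends(e, sets, i):
    """Like string matching, but position i holds a SET of candidate
    nonterminals; a symbol matches if it is one of the candidates."""
    if e[0] == "eps":
        return {i}
    if e[0] == "sym":
        return {i + 1} if i < len(sets) and e[1] in sets[i] else set()
    if e[0] == "seq":
        return {k for j in ends(e[1], sets, i) for k in ends(e[2], sets, j)}
    if e[0] == "alt":
        return ends(e[1], sets, i) | ends(e[2], sets, i)
    raise ValueError(e[0])  # repetition is not needed for this grammar


def children(tree, w):
    return sorted(v for v in tree if len(v) == len(w) + 1 and v[:len(w)] == w)


def possible(tree, w=()):
    """Nonterminals X such that the subtree at w can be labelled with X at w."""
    kids = [possible(tree, c) for c in children(tree, w)]
    return {x for x, a, e in RULES
            if a == tree[w] and len(kids) in ends(e, kids, 0)}


def accepts(tree):
    """T ∈ L(G) iff some start symbol can label the root."""
    return bool(possible(tree) & START)


# Hypothetical candidate trees (leaf labels are an assumption).
t1 = {(): "P", (0,): "Name", (0, 0): "L", (0, 1): "A"}
t2 = {(): "B", (0,): "Name", (0, 0): "L", (0, 1): "A"}
t3 = {(): "P", (0,): "Name", (0, 0): "F", (0, 1): "L"}
for t in (t1, t2, t3):
    print(accepts(t))
```

Choosing a nonterminal for one child never constrains the choice for a sibling's subtree, which is why it is enough to carry a set of candidates per node instead of enumerating all mappings r.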
