CS490W
Semi-Structured Data: XML Data and Retrieval
Luo Si
Department of Computer Science
Purdue University

XML and Retrieval: Outline
• Semi-Structured Data
  - XML, Examples, Application
• XML Search
  - XQuery
  - XIRQL
• Text-Based XML Retrieval
  - Vector-space model
  - INEX

Semi-Structured Data
XML has been used as the standard representation of semi-structured data.
• eXtensible Markup Language (XML) is a W3C-recommended general-purpose markup language that supports a wide variety of applications.
• A framework for defining markup languages
• Open vocabulary for tags
• Each XML-based language corresponds to a different application
• Facilitates the sharing of data across different information systems, particularly systems connected via the Internet
• Examples: RSS, XHTML, MathML

Structure of XML
• XML data is organized by documents, like unstructured data
• There are structures (nodes/tags) within the documents
• Each XML document is an ordered, labeled tree
• Element nodes are labeled with
  - Node name (e.g., chapter)
  - Node attributes and their values (e.g., size=1000; time=01/01/2007)
  - May have child nodes or data
• Data (e.g., text strings) exist within leaf nodes

XML Example
    <book id="ML_Tom">
      <title> Machine Learning </title>
      <author>
        <firstname> Tom </firstname>
        <surname> Mitchell </surname>
      </author>
      ...
      <p> Machine Learning Applications. ..</p>
      ...
    </book>
Elements, Attributes/Values, Data (Text String)

XML Example (tree view)
[Figure: the same document drawn as a tree. The root node book has the children title, author (with children firstname and surname), and chapter nodes whose children are title and para nodes.]
Elements, Attributes/Values, Data (Text String)
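The "ordered, labeled tree" view of the book example can be made concrete with a short sketch. It uses Python's standard xml.etree.ElementTree parser; the helper name walk is an illustrative assumption, and the parts the slide elides with "..." are simply left out of the sample document.

```python
import xml.etree.ElementTree as ET

# The book example from the slides; elided parts ("...") are left out.
BOOK_XML = """
<book id="ML_Tom">
  <title> Machine Learning </title>
  <author>
    <firstname> Tom </firstname>
    <surname> Mitchell </surname>
  </author>
  <p> Machine Learning Applications. </p>
</book>
"""


def walk(node, depth=0):
    """Yield (depth, label, attributes, text) for each element in document order."""
    text = (node.text or "").strip()
    yield depth, node.tag, dict(node.attrib), text
    for child in node:  # children are stored in document order
        yield from walk(child, depth + 1)


root = ET.fromstring(BOOK_XML)
rows = list(walk(root))
for depth, label, attrs, text in rows:
    # Indentation shows the depth of each node in the tree.
    print("  " * depth + label, attrs, repr(text))
```

Each printed row carries the three things the slides name: the node label, its attribute/value pairs, and the text data, which appears only at the leaf nodes.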
Elements
• Elements are defined by markup tags
• Element syntax: <TagName attr_a="value"…>text</TagName>
  - Element name (label): TagName
  - Attribute: attr_a; value: "value"
  - Data/text: "text"
  - End tag: </TagName>

XML, HTML, SGML
1986: SGML (ISO 8879-1986)
Nov 1995: HTML 2.0
Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
Jan 1997: HTML 3.2
Aug 1997: XML working draft
Dec 1997: XML 1.0 proposed recommendation
Jan 1998: XML
Feb 1999: XHTML

XML and HTML
• Both are derived from SGML
• HTML is a markup language mainly for display in browsers
• XML is a framework for markup languages
• HTML defines display
• XML defines the data structure; display is separated from the content
• HTML can be formalized as XML (XHTML)

Why XML?
• Unlike a relational database, XML data does not require a separate relational schema, because the data itself carries this structural information.
• Unlike HTML, the widely used Web format that only specifies how formatted data is presented, XML also describes the structure of the data, so the data can be reused by other applications.

XML Applications
• CML – chemical markup language
• WML – wireless markup language
• ThML – theological markup language

XML Applications
• CML – chemical markup language
CML (Chemical Markup Language) is a new approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML.
    <molecule convention="MDLMol" id="baclofen" title="BACLOFEN">
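The element anatomy from the Elements slide (name, attributes and values, text, end tag) can be checked on both the generic TagName element and the CML start tag. This is a minimal sketch with Python's standard xml.etree.ElementTree; the helper describe is a hypothetical name, and the molecule element is self-closed here only as an assumption, because the slide shows just its start tag.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Split a single element into its name, attribute/value pairs, and text."""
    element = ET.fromstring(xml_text)
    return {
        "name": element.tag,
        "attributes": dict(element.attrib),
        "text": element.text,  # None when the element has no text content
    }


# Generic element from the Elements slide (without the elided "…").
generic = describe('<TagName attr_a="value">text</TagName>')

# CML start tag from the slides; self-closed here (assumption) so it parses.
molecule = describe('<molecule convention="MDLMol" id="baclofen" title="BACLOFEN"/>')

print(generic)
print(molecule)
```

The parser only accepts well-formed input, which is why the unclosed start tag from the slide cannot be parsed as-is.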
XML Applications
• WML – wireless markup language
Wireless Markup Language is a content format for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones.
    <?xml version="1.0"?>
    <!DOCTYPE wml PUBLIC "-//PHONE.COM//DTD WML 1.1//EN" "http://www.phone.com/dtd/wml11.dtd" >
    <wml>
      <card id="main" title="First Card">
        <p mode="wrap">This is a sample WML page.</p>
      </card>
    </wml>

XML Applications
• ThML – theological markup language
    <ThML>
      <ThML.body>
        <div1>
          <div2 title="Genesis" id="Gen">
            <div3 title="Chapter 1">
              <p>
                <scripture/>
                In the beginning God created the heaven and the earth.
                <scripture/>
                And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
              </p>
            </div3>
          </div2>
        </div1>
      </ThML.body>
    </ThML>

XML Files
• Schema/DTD: the syntax definition of an XML language; a DTD is a Document Type Definition (DTD file)
XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.

XML Files
DTD example:
    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>
XML document:
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

XML Files
• XML Schema: recommended by the W3C as the successor of DTDs, and informally referred to as XSD (XML Schema Definition), the initialism used for XML Schema instances. XSDs are far more powerful than DTDs in describing XML languages.
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="country" type="Country"/>
      <xs:complexType name="Country">
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="population" type="xs:decimal"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>

XML Search
• Most XML search protocols use a database-based approach
  - Non-text data match
  - Exact keyword (text) match
  - Evaluate XML path expressions
  - No concept of relevance
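The database-style matching described for XML search (non-text value match, exact keyword match, path evaluation, no ranking) can be sketched as follows. The three-book collection, the element names, and the helper names keyword_match and value_match are hypothetical and chosen only for illustration; the path syntax is the limited XPath subset supported by Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection, used only for illustration.
LIBRARY_XML = """
<library>
  <book id="b1"><title>Information Retrieval</title><price>50</price></book>
  <book id="b2"><title>Introduction to Information Retrieval Systems</title><price>35</price></book>
  <book id="b3"><title>Machine Learning</title><price>60</price></book>
</library>
"""


def keyword_match(root, path, keyword):
    """Ids of books having an element at `path` whose text contains `keyword`."""
    hits = []
    for book in root.findall("book"):
        if any(keyword in (node.text or "") for node in book.findall(path)):
            hits.append(book.get("id"))
    return hits


def value_match(root, path, predicate):
    """Ids of books whose numeric value at `path` satisfies `predicate`."""
    hits = []
    for book in root.findall("book"):
        node = book.find(path)
        if node is not None and node.text is not None and predicate(float(node.text)):
            hits.append(book.get("id"))
    return hits


library = ET.fromstring(LIBRARY_XML)
by_keyword = keyword_match(library, ".//title", "Information Retrieval")
by_price = value_match(library, "price", lambda value: value <= 40)
print(by_keyword, by_price)
```

Both functions return an unranked set of matches in document order: a book either satisfies the condition or it does not, which is the "no concept of relevance" limitation that the information retrieval approach addresses.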
XML Search
• Traditional XML search: database-based approach
  - XQuery
  - Searches multiple types of data: value-based (e.g., price of a book); ids (ISBN of a book); keyword match (text)
• XML text search: information retrieval approach
  - XIRQL
  - Vector-space based
  - Searches text data: estimates the relevance of XML elements with respect to the query
  - Query may contain path expressions

XML Search
• XQuery
  - SQL for XML
  - Used for text-rich documents, data-oriented (non-text) documents, and mixed documents
  - Considers path expressions (XPath) and XML Schema datatypes
  - It is still a working draft; details are being improved

XML Search
• XQuery considers some principal forms
  - Path expression
  - Conditional expressions
  - Datatype expressions
  - List expression
  - etc.
  - Programming language: Flowers (FLWOR) expression
• Principal forms are evaluated with respect to a context

Principal Forms
• Path query
    /book//title contains "Information Retrieval"
  The title of the book contains the keywords "Information Retrieval".
• Conditional expressions
    $h/title,
    IF $h/@type = "Journal" THEN ….
  Applies if the type of an article is journal.

Flowers (FLWR)
• Programming language: Flowers (FLWR) expression
The programming language XQuery defines FLWOR or FLWR (often pronounced 'flower') as an expression that supports iteration and binding of variables to intermediate results.
  - for and let create a sequence of tuples
  - where filters the tuples on a boolean expression
  - order by sorts the tuples, using any comparable data
  - return gets evaluated once for every tuple

Flowers (FLWR)
    for $d in doc("depts.xml")//deptno
    let $e := doc("emps.xml")//employee[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return
      <big-dept>
        { $d,
          <headcount>{count($e)}</headcount>,
          <avgsal>{avg($e/salary)}</avgsal>
        }
      </big-dept>
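To make the four FLWOR clauses easier to follow, the big-dept query can be imitated step by step in Python. This is only a sketch of the evaluation order, not an XQuery engine: the inline department and employee data are hypothetical stand-ins for depts.xml and emps.xml, the function name big_departments is an assumption, and the headcount threshold is a parameter so that the tiny sample can use 2 instead of 10.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical stand-ins for depts.xml and emps.xml.
DEPTS_XML = "<depts><deptno>D1</deptno><deptno>D2</deptno><deptno>D3</deptno></depts>"
EMPS_XML = """
<emps>
  <employee><deptno>D1</deptno><salary>100</salary></employee>
  <employee><deptno>D1</deptno><salary>200</salary></employee>
  <employee><deptno>D2</deptno><salary>400</salary></employee>
  <employee><deptno>D2</deptno><salary>600</salary></employee>
  <employee><deptno>D3</deptno><salary>900</salary></employee>
</emps>
"""


def big_departments(depts_root, emps_root, min_headcount):
    tuples = []
    # for $d ... let $e ...: bind each deptno to its matching employees.
    for d in depts_root.iter("deptno"):
        e = [emp for emp in emps_root.iter("employee")
             if emp.findtext("deptno") == d.text]
        # where count($e) >= min_headcount: drop tuples failing the test.
        if len(e) >= min_headcount:
            avgsal = mean(float(emp.findtext("salary")) for emp in e)
            tuples.append((d.text, len(e), avgsal))
    # order by avg($e/salary) descending.
    tuples.sort(key=lambda t: t[2], reverse=True)
    # return: evaluated once per remaining tuple, building one <big-dept>.
    results = []
    for deptno, headcount, avgsal in tuples:
        big = ET.Element("big-dept")
        ET.SubElement(big, "deptno").text = deptno
        ET.SubElement(big, "headcount").text = str(headcount)
        ET.SubElement(big, "avgsal").text = str(avgsal)
        results.append(big)
    return results


results = big_departments(ET.fromstring(DEPTS_XML), ET.fromstring(EMPS_XML),
                          min_headcount=2)
for element in results:
    print(ET.tostring(element, encoding="unicode"))
```

With the sample data, department D3 is filtered out by the where step (one employee), and the remaining departments are returned in descending order of average salary.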