XML Parsers Asst. Prof. Dr. Kanda Runapongsa Saikaew (krunapon@kku.ac.th) Dept. of Computer Engineering Khon Kaen University 1
Overview What are XML Parsers? Programming Interfaces of XML Parsers DOM: Document Object Model SAX: Simple API for XML StAX: Streaming API for XML 2
What are XML Parsers? (1/2) The most common XML processing task is parsing an XML document Parsing involves reading an XML document to determine its structure and contents It is essential for the automatic processing of XML documents 3
What are XML Parsers? (2/2) Parsers also check whether documents conform to the XML standard and have a correct structure There are two types of XML parsers Validating: check documents against a DTD or an XML schema Non-validating: do not check documents against a DTD or an XML schema 4
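A minimal sketch of the validating versus non-validating distinction, assuming the JAXP DOM parser that ships with the JDK. The sample document and the class and method names are hypothetical. The document is well-formed, but it breaks its own DTD, so only the validating parser should report errors.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidationSketch {
    // Hypothetical sample: the DTD allows only <to> inside <note>, but the document uses <from>.
    static final String SAMPLE =
        "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>"
        + "<note><from>Ann</from></note>";

    // Parses the document and returns the validity errors reported by the parser.
    static List<String> validityErrors(String xml, boolean validating) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating); // true: check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validity errors are recoverable, so collect them
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Document is not well-formed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("validating:     " + validityErrors(SAMPLE, true).size() + " error(s)");
        System.out.println("non-validating: " + validityErrors(SAMPLE, false).size() + " error(s)");
    }
}
```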
Available Java XML Parser APIs Sun Integrated in JDK 1.4 and later Package javax.xml.parsers Apache Xerces: XML parsers in Java, C++, and Perl http://xerces.apache.org/ SAX http://www.saxproject.org/ XP – an XML Parser in Java http://www.jclark.com/xml/xp/index.html 5
Programming Interfaces (1/2) PHP and Java Document Object Model (DOM) Model a document as a tree Java Simple API for XML (SAX) The user needs to create the model Streaming API for XML (StAX) Use a pull model for event processing Provide user-friendly APIs for read-in and write-out 6
Programming Interfaces (2/2) PHP SimpleXML extension Provides a very simple and easily usable toolset to convert XML to an object XMLReader extension The reader acts as a cursor going forward on the document stream and stopping at each node XMLWriter extension The writer provides a non-cached, forward-only means of generating streams or files containing XML data 7
How to Use a Parser In general, here’s how you use a parser: Create a parser object Point the parser object at your XML document Process the results The common XML parsing tools can make the task much simpler 8
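The three steps above can be sketched with the JDK's javax.xml.parsers package. This is only one possible illustration; the class name, method name, and sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ParserUsageSketch {
    // Parses an XML string and returns the name of its root element.
    static String rootElementName(String xml) {
        try {
            // Step 1: create a parser object (here a non-validating DOM parser).
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Step 2: point the parser object at the XML document.
            Document doc = parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Step 3: process the results.
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints "course"
        System.out.println(rootElementName("<course><title>XML</title></course>"));
    }
}
```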
What is DOM? (1/2) DOM is an official recommendation of the W3C It defines an interface that enables programs to access and update the structure of XML documents When an XML parser claims to support the DOM, that means it implements the interfaces defined in the standard 9
What is DOM? (2/2) When you parse an XML document with a DOM parser, you get back a tree of nodes that represent the structure and contents of the XML document You can access your information by interacting with this tree of nodes 10
DOM Data Modeling Each element node contains a list of other nodes as its children These children might contain text values or other nodes DOM preserves the sequence of the elements that it reads from XML documents 11
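As a rough illustration of the child list and of order preservation, the sketch below walks the children of the root element and collects the element names in the order the parser read them. The class name, method name, and sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomChildrenSketch {
    // Returns the names of the root's child elements in document order.
    static List<String> childElementNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Children can also be text nodes; keep only the element nodes here.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(child.getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints [title, author, year]
        System.out.println(childElementNames(
            "<book><title>T</title><author>A</author><year>2008</year></book>"));
    }
}
```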
DOM Processing Model (1/2) The DOM Processing Model consists of reading the entire XML document into memory and building a tree representation of the structured data This process can require a substantial amount of memory when the XML document is large 12
DOM Processing Model (2/2) By having the data in memory, DOM introduces the capability of manipulating the XML data by Inserting, editing, or deleting tree elements It supports random access to any node in the tree 13
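A small sketch of inserting, editing, and deleting tree elements through the standard org.w3c.dom interfaces. The element names (book, title, draft, year) and the class name are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomEditSketch {
    // Parses the XML, then inserts, edits, and deletes elements in the in-memory tree.
    static Document edit(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            // Insert: append a new <year> element under the root.
            Element year = doc.createElement("year");
            year.setTextContent("2008");
            root.appendChild(year);

            // Edit: random access to the first <title> element, then change its text.
            root.getElementsByTagName("title").item(0).setTextContent("XML Parsers");

            // Delete: remove the first <draft> element from its parent.
            Node draft = root.getElementsByTagName("draft").item(0);
            draft.getParentNode().removeChild(draft);
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = edit("<book><title>Old</title><draft>x</draft></book>");
        // prints "XML Parsers2008" (the concatenated text of <title> and <year>)
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```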
What is SAX? (1/2) SAX is an alternative way of working with the information in your XML document It was designed to have a smaller memory footprint, but it puts more of the work on the programmer SAX does not create a default object model on top of your XML document SAX was originally developed by David Megginson 14
What is SAX? (2/2) When you parse an XML document with a SAX parser, the parser generates a series of events as it reads the document These events are pushed to event handlers You need to decide what to do with the events when you parse an XML document 15
Sample SAX Events The startDocument event For each element, a startElement event at the start of the element, and an endElement event at the end of the element If an element contains content, there will be events such as characters for additional text The endDocument event 16
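The events listed above can be observed with a handler that simply logs each callback. This is a sketch using the JDK's SAX parser; the class name and log format are hypothetical. Note that a parser may deliver the text of one element in more than one characters call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {
    // Parses the XML and returns a log of the SAX events in the order they were pushed.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() { log.add("startDocument"); }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("startElement:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                log.add("characters:" + new String(ch, start, length));
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("endElement:" + qName);
            }

            @Override
            public void endDocument() { log.add("endDocument"); }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<greeting>Hi</greeting>").forEach(System.out::println);
    }
}
```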
What is StAX? StAX is an exciting new parsing technique Like SAX, it uses an event-driven model However, instead of using SAX’s push model, StAX uses a pull model for event processing Instead of using a callback mechanism, a StAX parser returns events as requested by the application 17
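A minimal sketch of the pull model, assuming the javax.xml.stream API bundled with Java 6 and later. The application owns the loop and asks for the next event itself, with no callback involved. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxPullSketch {
    // Pulls events one at a time and collects the element names in document order.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next(); // the application requests the next event
                if (event == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [a, b, c]
        System.out.println(elementNames("<a><b/><c>text</c></a>"));
    }
}
```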
SAX vs. StAX SAX returns different types of event to the ContentHandler StAX returns its events to the application and can even provide the events as objects StAX includes factories for creating the StAX reader and writer Applications can use the StAX interfaces without reference to the details of a particular implementation 18
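The reader and writer factories can be sketched as follows. The code touches only the javax.xml.stream interfaces and never names a concrete implementation class. The greeting element and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class StaxFactorySketch {
    // Writes <greeting>text</greeting> through a writer obtained from the output factory.
    static String write(String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("greeting");
            writer.writeCharacters(text); // special characters such as & are escaped
            writer.writeEndElement();
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    // Reads the text of the first element through a reader obtained from the input factory.
    static String read(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getElementText();
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = write("Hi & bye");
        System.out.println(xml);
        System.out.println(read(xml));
    }
}
```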
StAX vs. DOM and SAX StAX specifies two parsing models The cursor model The iterator model Like SAX, the cursor model simply returns events The iterator model returns events as objects Provides a more natural interface but has the additional overhead of object creation 19
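A sketch of the iterator model, assuming javax.xml.stream.XMLEventReader. Each event is a separate object, so it can be stored and inspected after the reader has moved on, at the cost of creating those objects. The class name, method name, and sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class StaxIteratorSketch {
    // Collects every start-element event as an object that outlives the parsing loop.
    static List<StartElement> startElements(String xml) {
        List<StartElement> result = new ArrayList<>();
        try {
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent(); // one object per event
                if (event.isStartElement()) {
                    result.add(event.asStartElement());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        for (StartElement element : startElements("<a><b id='1'/></a>")) {
            System.out.println(element.getName().getLocalPart());
        }
    }
}
```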
DOM vs. SAX (1/3) In the case of DOM, the parser does almost everything Read the XML document in Create an object model on top of it Give you a reference to this object model (a document object) so that you can manipulate it SAX does not expect the parser to do much 20
DOM vs. SAX (2/3) For SAX, the parser should Read in the XML document Fire a bunch of events depending on what tags it encounters in the XML document Then, the programmer needs to make sense of all the tag events and create objects in their own object model 21
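To illustrate the programmer's side of the work, the sketch below turns SAX tag events into a small object model of its own. The Book class and the books/book/title/author vocabulary are hypothetical and exist only for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxModelSketch {
    // Hypothetical application object; SAX itself provides no object model.
    static class Book {
        String title;
        String author;
    }

    static List<Book> readBooks(String xml) {
        List<Book> books = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Book current;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for this element
                if (qName.equals("book")) {
                    current = new Book();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // text may arrive in several chunks
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current == null) {
                    return;
                }
                switch (qName) {
                    case "title": current.title = text.toString().trim(); break;
                    case "author": current.author = text.toString().trim(); break;
                    case "book": books.add(current); current = null; break;
                    default: break;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return books;
    }

    public static void main(String[] args) {
        String xml = "<books><book><title>XML</title><author>Ann</author></book>"
            + "<book><title>SAX</title><author>Bob</author></book></books>";
        for (Book book : readBooks(xml)) {
            System.out.println(book.title + " by " + book.author);
        }
    }
}
```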
DOM vs. SAX (3/3) SAX can be really fast at runtime if your object model is simple SAX is faster than DOM because it bypasses the creation of a tree-based object model of your information On the other hand, you have to write a SAX document handler to interpret all the SAX events 22
Drawbacks of DOM Partial parsing is not possible Loading the whole document and building the entire tree structure in memory can be expensive The DOM tree can be an order of magnitude larger than the document The generic DOM node type is an interoperability advantage but may not be the best choice when you do object type binding 23
When to Use DOM When the development needs to be done quickly DOM is quite easy to implement When you need to have random access to the XML document Example: An XSL Processor When you need to modify an XML document Example: An XML Editor 24
Drawbacks of SAX You have to implement the event handlers to handle all incoming events Must maintain event states in your code Must keep track of where the parser is in the document It does not have built-in document navigation support No random access support 25
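Keeping track of where the parser is in the document usually means maintaining your own stack. The sketch below records the path of every element as it starts. The class name and the path format are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPathSketch {
    // Returns the path of every element, for example /a/b/c, in document order.
    static List<String> elementPaths(String xml) {
        List<String> paths = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // SAX has no navigation support, so the handler maintains the open elements.
            private final Deque<String> open = new ArrayDeque<>();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                open.addLast(qName);
                paths.add("/" + String.join("/", open));
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.removeLast();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return paths;
    }

    public static void main(String[] args) {
        // prints [/a, /a/b, /a/b/c, /a/d]
        System.out.println(elementPaths("<a><b><c/></b><d/></a>"));
    }
}
```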
When to Use SAX When you have a small amount of memory SAX requires little memory because it does not construct an internal representation of the XML data When you need to only read the content in a single pass Example: Many B2B and EAI applications use XML just as an encapsulation format in which the receiving end simply retrieves all the data 26
Drawbacks of StAX It does not have built-in document navigation support No random access support Document modification is still quite difficult if you want to do anything beyond simple one-pass transformations 27
When to Use StAX When applications need to take advantage of the streaming model for performance while maintaining full support of namespaces For an application that can easily request events from multiple StAX parsers and put them into a single context Example: Web services 28
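Because the application decides when each parser advances, it can interleave events from several StAX parsers in one loop. The sketch below alternates between two readers. The A: and B: labels and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxMergeSketch {
    // Advances one reader to its next start element; returns null when the document ends.
    static String nextStartElement(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                return reader.getLocalName();
            }
        }
        return null;
    }

    // Alternates between two parsers and merges their element names into a single list.
    static List<String> interleave(String xmlA, String xmlB) {
        List<String> merged = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader a = factory.createXMLStreamReader(new StringReader(xmlA));
            XMLStreamReader b = factory.createXMLStreamReader(new StringReader(xmlB));
            String nameA = nextStartElement(a);
            String nameB = nextStartElement(b);
            while (nameA != null || nameB != null) {
                if (nameA != null) {
                    merged.add("A:" + nameA);
                    nameA = nextStartElement(a);
                }
                if (nameB != null) {
                    merged.add("B:" + nameB);
                    nameB = nextStartElement(b);
                }
            }
            a.close();
            b.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return merged;
    }

    public static void main(String[] args) {
        // prints [A:x, B:p, A:y]
        System.out.println(interleave("<x><y/></x>", "<p/>"));
    }
}
```

A push parser such as SAX would run each document to completion inside its own parse call, so this kind of interleaving would need extra threads or buffering.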
Summary of Java Parser APIs XML parsers are programs to read, manipulate, and create XML documents To automate XML processing, XML developers need to use XML parsers XML parser APIs DOM + Easy for developers to develop + Random access - Requires lots of memory SAX, StAX + Fast processing - Developers need to create their own data model 29
Streaming APIs in PHP ext/xmlreader and ext/xmlwriter Allow for XML to be read or written to/from PHP streams Resulting in very low memory usage But providing very focused and uni-directional XML support (can write or read only) To manipulate XML data tree Using DOM or SimpleXML 30
PHP DOM vs. SimpleXML (1/2) DOM allows a developer to access and manipulate XML in any way needed, but it comes at a price DOM is a large and complex API, requiring a developer to really understand all details SimpleXML aims to break through all the XML complexities and provide an intuitive and simple 31
PHP DOM vs. SimpleXML (2/2) The vast majority of people working with XML are really only concerned with elements having simple content DOM models an XML document as a tree SimpleXML takes an easier approach and views a document as an object Elements are represented as properties and attributes as accessors 32