XML Basics
Chapter 2 XML Basics Objectives
• Introduce XML concepts • Introduce the technologies for describing XML – DTD and XML Schema • Discuss how to parse XML in Java using SAX, DOM, and JAXP • Introduce two alternative APIs to SAX and DOM – JDOM and dom4j • Introduce XSL Transformations and how to process XSLT in Java • Give some step by step examples of parsing and manipulating XML in Java
XML Overview
• eXtensible Markup Language (XML) is a language for defining markup languages
• HTML is an example of a well known markup language
• Tags in XML are defined by the author, whereas tags in HTML are predefined by the W3C standard
• XML provides a portable (cross-platform) method for encapsulating and describing data
• An XML document is composed of elements, each consisting of an opening and a closing tag with the element's data between them
Example XML Document
XML Overview, cont.
• The first line is the document prolog
• The document has a single root element; elements may carry attributes
XML Prolog
• Appears before the root element of the XML document and may contain an XML declaration and a Document Type Declaration (discussed later)
• The XML declaration consists of the properties version, encoding, and standalone. If the declaration is present, version is required; encoding and standalone are optional.
• Version describes the XML version used, encoding describes the character encoding used, and standalone provides a hint to the XML processor that no other files need to be loaded
XML Elements
• An XML element is an XML tag and the data it encapsulates (e.g., <tag>data</tag>). Element names may not contain spaces. An element may be empty (e.g., <tag></tag>) or self-closing (e.g., <tag/>).
XML Element Attributes
• The opening tag of an XML element may contain attributes, written as name="value" pairs (e.g., <tag name="value">).
• Attribute values are separated from the attribute name by an equal sign and must be enclosed in quotation marks, either the straight double quote ( " ) or the apostrophe ( ' ).
• For every attribute there must be a value, even if the value is an empty string • No duplicate attributes within a single element
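The element and attribute rules above can be observed from Java with the JDK's built-in parser. The following is a minimal sketch, not part of the original slides: the memo element and its date and note attributes are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class ElementAttributeSketch {
    // Hypothetical sample document: a prolog followed by a single root element.
    // One attribute uses double quotes, the other uses apostrophes and is empty.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<memo date=\"2004-01-15\" note=''>Lunch at noon</memo>";

    /** Parses the XML text and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(XML);
        System.out.println("root element: " + root.getTagName());
        System.out.println("date attribute: " + root.getAttribute("date"));
        System.out.println("note is empty: " + root.getAttribute("note").isEmpty());
        System.out.println("data: " + root.getTextContent());
    }
}
```

An attribute with an empty value, as with note here, is still a legal attribute; omitting the value entirely would make the document ill-formed.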
XML Syntax
• Comments start with <!-- and end with -->, cannot contain the string -- (two consecutive dashes), and may appear almost anywhere in the document outside of other markup because they are not XML elements.
• XML processing instructions begin with <?, immediately followed by a legal XML name called the target, and end with ?>. XML processors are designed to recognize certain targets and execute specific logic.
• Some characters in XML data must be replaced with their character entities because they otherwise interfere with a parser’s ability to recognize which parts of the document are elements and attributes and which parts are data. The five characters and their predefined character entities are: < (&lt;), > (&gt;), & (&amp;), " (&quot;), ' (&apos;)
XML Syntax, cont.
• Character entities represent a single character for which, possibly, no keyboard combination exists (such as à). They can be used only in text, not in element or attribute names. They can be numbered (e.g., &#224;) or named (e.g., &agrave;, provided the name is declared in a DTD). The number in numbered entities represents a code point in the Unicode set.
• Enclosing text and possibly markup in a CDATA section instructs the XML parser not to attempt to parse it. A CDATA section begins with the markup <![CDATA[ and ends with ]]>. A CDATA section may contain any characters except the CDATA ending sequence.
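The two ways of protecting special characters, predefined entities and CDATA sections, can be compared with a small sketch. The escape helper and the code element below are illustrative assumptions, not from the slides; the parser calls are the standard JAXP ones.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EscapeSketch {
    /** Replaces the five special characters with their predefined entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Parses a document and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("document is not well formed", e);
        }
    }

    public static void main(String[] args) {
        String raw = "if (a < b && c > d)";
        // Same data, once escaped with entities and once wrapped in CDATA.
        String viaEntities = "<code>" + escape(raw) + "</code>";
        String viaCdata = "<code><![CDATA[" + raw + "]]></code>";
        System.out.println(viaEntities);
        System.out.println(rootText(viaEntities).equals(raw));
        System.out.println(rootText(viaCdata).equals(raw));
    }
}
```

Both forms hand the same character data back to the application; the choice between them is mostly a matter of readability for the person editing the XML.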
Well Formed XML Documents
• A well formed XML document conforms to the syntax rules of XML • Unlike HTML parsers, XML parsers must report errors and may not replace missing quotes, close unclosed tags, or silently rearrange overlapping tags based on an assumption about the intended meaning.
• Some commonly abused XML syntax rules are: 1) Element and attribute names must be legal XML names; 2) Characters < and & must be escaped as character entities when used in text; 3) Every element must be closed; 4) Attributes must have values and values must be delimited with quotation marks; 5) Every element except the root element must be the child of exactly one element; 6) Comments must be properly formed, in particular, a comment may not contain the string “--”
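A parser's refusal to repair broken markup can be turned into a simple well-formedness check. The sketch below uses the JDK's SAX parser; the class and method names are hypothetical, and the sample strings are invented to exercise the rules listed above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedSketch {
    /** Returns true if the parser accepts the text as well formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser reported a fatal error instead of guessing the intent.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b>text</b></a>",              // well formed
            "<a><b>text</a></b>",              // overlapping tags
            "<a x=1/>",                        // attribute value not quoted
            "<a>1 < 2</a>",                    // unescaped < in text
            "<a><!-- bad -- comment --></a>",  // -- inside a comment
            "<a/><b/>"                         // two root elements
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```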
Namespaces
• Use XML Namespaces to prevent name collisions among element and attribute names, which can be caused by designers choosing their own element names that conflict with imported elements defined in other XML documents.
• Namespaces are declared by adding an xmlns attribute to an element, where the value of the xmlns attribute is a unique URI (not necessarily a valid URL).
• The element with the xmlns attribute and all of its children (nested elements) inherit the namespace; others that are not nested are not affected.
• xmlns can also be used repeatedly with different qualifiers (prefixes), using the form xmlns:prefix="URI". Then use the prefix to associate a namespace with a specific element (qualify it), as in prefix:element.
Namespaces, cont.
• A default namespace can be defined by using an unqualified xmlns attribute on the root element of the XML document. Unqualified elements (names without prefixes) fall under the default namespace; attributes without a prefix do not belong to any namespace.
• Valid XML requires the root element of an XML document to be qualified, but other elements need not be. Best practice is to make sure that all of the elements in an XML document are qualified, either by the default namespace or explicitly by a prefix.
• Support for namespaces has to be built into the application that processes the XML. It is up to the application processing the XML to recognize namespaces, map the namespace URI to the identifying prefix, and process elements correctly depending upon their namespace.
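Because namespace support lives in the processing application, a Java program has to ask for it explicitly. The following sketch shows a namespace-aware DOM parse that tells apart two title elements with the same local name. The example.com URIs and the catalog, title, and lang names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceSketch {
    static final String CATALOG_NS = "http://example.com/catalog"; // hypothetical URI
    static final String INVOICE_NS = "http://example.com/invoice"; // hypothetical URI

    // A default namespace plus a second namespace bound to the prefix inv.
    static final String XML =
        "<catalog xmlns=\"" + CATALOG_NS + "\" xmlns:inv=\"" + INVOICE_NS + "\">"
        + "<title lang=\"en\">XML Basics</title>"
        + "<inv:title>Invoice 42</inv:title>"
        + "</catalog>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    /** Text of the first title element in the given namespace, or null. */
    static String titleIn(String namespaceUri) {
        NodeList hits = parse(XML).getElementsByTagNameNS(namespaceUri, "title");
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
    }

    /** Namespace URI of the unprefixed lang attribute (null: no namespace). */
    static String langAttributeNamespace() {
        Element title = (Element)
            parse(XML).getElementsByTagNameNS(CATALOG_NS, "title").item(0);
        return title.getAttributeNode("lang").getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println(titleIn(CATALOG_NS));
        System.out.println(titleIn(INVOICE_NS));
        System.out.println(langAttributeNamespace());
    }
}
```

Matching on the namespace URI rather than on the prefix is the usual design choice, since the prefix is only a local abbreviation that another document is free to change.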
Validating XML Documents
• A well formed document conforms to the syntax rules of XML, but it is not necessarily valid in the context of a particular application. For instance, a well formed XML document describing an invoice is probably not valid in the context of an application dealing with a catalog of books. • If no formal document model is defined for an XML document, the document must still be well formed, but there are no limits on the element names used, the structure or contents of the elements, or the use of attributes. For complex documents or documents that will be used across organizational boundaries, a more formal definition of validity is needed.
• Two popular solutions are Document Type Definition (DTD) and XML Schema
Document Type Definition (DTD)
• A formal, machine-readable specification that defines the structure of an XML document and provides some information about the required content
• DTDs provide syntax for declaring elements, attribute lists, entities, and notations
• DTD declarations begin with the opening delimiter sequence <!, followed by one of the four keywords ELEMENT, ATTLIST, ENTITY, or NOTATION, then a case sensitive name, a content description, and end with >
• A DTD is a sequence of these declarations enclosed in a DOCTYPE declaration or stored separately and referred to from a DOCTYPE
DTD in XML Prolog (Internal Subset)
External Subset DTD
• An external subset DTD is specified in the DOCTYPE declaration using the SYSTEM keyword
• The DTD definition is stored in its own file, and the XML document references that file from its DOCTYPE declaration
Document Type Definition (DTD), cont.
• DTDs can be declared as PUBLIC rather than SYSTEM, in which case a unique name (the public identifier) is specified and, in typical practice, a URI is supplied following the unique name
• Both external subset and internal subset DTDs may be used in the same document
• Operators and keywords available for declaring content descriptions:
, (comma) - ordered list (and operator)
| - or operator
( ) - content grouping
? - preceding item may occur zero or one time
+ - preceding item may occur one or more times
* - preceding item may occur zero or more times
#PCDATA - parsed character data
EMPTY - element may not contain content
ANY - element may contain any content
Document Type Definition (DTD), cont.
• Attributes are declared with an <!ATTLIST declaration, which contains the element name for which the attributes are being declared followed by a list of attribute declarations. Each declaration includes the attribute name, a data type specification, and a default definition that tells whether the attribute is required and, if not, what action the parser should take.
• For example, a date attribute declared with type CDATA (character data) and default #REQUIRED must be present. A priority attribute can be declared as an enumerated type: the values for the enumeration must be legal XML names, enclosed in parentheses, and separated by | operators. Following the enumeration with the quoted string "medium" makes the priority attribute optional with a default value of medium.
Document Type Definition (DTD), cont.
• Data type definitions and their meanings:
CDATA - Character data
(enumerated list) - List of permitted values
ID - Unique legal XML name
IDREF, IDREFS - Element ID or list of IDs
NMTOKEN, NMTOKENS - One or list of name tokens
ENTITY, ENTITIES - One or list of unparsed entities
NOTATION - Previously declared notation name
• Attribute default definitions and their meanings:
#REQUIRED - Required to be present
#IMPLIED - Optional
(quoted string) - Optional and defaults to given value
#FIXED (quoted string) - Always the given value
• Entities can also be defined. For example, the declaration <!ENTITY boss "Harry S Truman"> will cause &boss; to be replaced by “Harry S Truman”.
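The DTD features described above (element content models, a required attribute, an enumerated attribute with a default, and an entity) can be exercised together from Java. This is a sketch under stated assumptions: the memo, to, and body element names and the low/medium/high enumeration are hypothetical, while the date and priority attributes and the boss entity follow the slide text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DtdSketch {
    // Internal subset DTD, prepended to each document body below.
    static final String DTD =
        "<!DOCTYPE memo [\n"
        + "<!ELEMENT memo (to, body)>\n"
        + "<!ELEMENT to (#PCDATA)>\n"
        + "<!ELEMENT body (#PCDATA)>\n"
        + "<!ATTLIST memo date CDATA #REQUIRED\n"
        + "               priority (low|medium|high) \"medium\">\n"
        + "<!ENTITY boss \"Harry S Truman\">\n"
        + "]>\n";

    /** Parses with DTD validation; returns the root element, or null if invalid. */
    static Element parseValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validation errors are only reported to the handler; rethrow them.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            return builder.parse(new InputSource(new StringReader(DTD + body)))
                          .getDocumentElement();
        } catch (SAXException e) {
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up", e);
        }
    }

    public static void main(String[] args) {
        Element memo = parseValid(
            "<memo date=\"2004-01-15\"><to>&boss;</to><body>Budget review</body></memo>");
        System.out.println(memo.getAttribute("priority"));
        System.out.println(memo.getElementsByTagName("to").item(0).getTextContent());
        // Missing the required date attribute, so validation fails.
        System.out.println(parseValid("<memo><to>x</to><body>y</body></memo>") == null);
    }
}
```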
XML Schema
• Created to solve DTD shortcomings.
• DTDs have a very weak typing system that can only restrict XML elements to contain no data, other XML elements, or text data • DTDs do not support data types like integers, decimals, booleans, or dates • DTDs do not allow one to specify that the data appear in a specific format.
• DTDs do not support namespaces • DTDs use a different syntax than the XML documents they describe • An XML schema is an XML document that conforms to the XML Schema specification
XML Schema, cont.
• Binding an XML schema to an XML document is done via attributes in the root XML element.
• xmlns:xsi attribute declares the XML Schema instance namespace (http://www.w3.org/2001/XMLSchema-instance) • xsi:schemaLocation attribute declares the location of the XML schema document being used • xmlns attribute declares the default namespace being used, which is defined in the schema document
XML Schema, cont.
• The XML schema definition defines the XML elements and attributes including their structure and the data types they support.
XML Schema, cont.
• The root XML element for the XML schema definition is <xsd:schema>.
• The xmlns:xsd attribute of the schema definition binds the namespace prefix xsd to the version of XML Schema being used, in this case http://www.w3.org/2001/XMLSchema .
• The targetNamespace for the XML schema is the namespace for the elements and attributes defined by the schema definition. When this schema is referenced by another XML document, the targetNamespace will be used to qualify the elements defined by this schema.
• The elementFormDefault attribute set to "qualified" indicates that nested elements in the XML document instance must be namespace qualified; default is unqualified.
XML Schema, cont.
• XML elements are defined using the <xsd:element> tag and attributes using the <xsd:attribute> tag • The name and type attributes are used to define the element/attribute name and data type, respectively.
• Elements can be defined as either complexType or simpleType , attributes can only be simpleType . Simple types can have neither attributes nor child elements. Complex types can have either.
• XML Schema defines many built-in atomic types including strings, numbers, dates, and times.
• The built-in atomic types can be further constrained by a derived simple type specifying facets using the <xsd:restriction> element.
XML Schema, cont.
• Example complexType and simpleType :
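As a rough illustration of a complexType, a derived simpleType with facets, and schema validation from Java, the sketch below uses the standard javax.xml.validation API. The schema itself is an assumption made for this example: the book, title, rating, and isbn names and the ratingType type do not come from the slides, and no targetNamespace is used to keep the instance documents short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class SchemaSketch {
    // Hypothetical schema: ratingType restricts xsd:integer to 1..5 with facets;
    // book is a complex type with two child elements and a required attribute.
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xsd:simpleType name=\"ratingType\">"
        + "<xsd:restriction base=\"xsd:integer\">"
        + "<xsd:minInclusive value=\"1\"/><xsd:maxInclusive value=\"5\"/>"
        + "</xsd:restriction></xsd:simpleType>"
        + "<xsd:element name=\"book\"><xsd:complexType>"
        + "<xsd:sequence>"
        + "<xsd:element name=\"title\" type=\"xsd:string\"/>"
        + "<xsd:element name=\"rating\" type=\"ratingType\"/>"
        + "</xsd:sequence>"
        + "<xsd:attribute name=\"isbn\" type=\"xsd:string\" use=\"required\"/>"
        + "</xsd:complexType></xsd:element>"
        + "</xsd:schema>";

    /** Returns true if the instance document is valid against XSD. */
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation (or well-formedness) error
        } catch (IOException e) {
            throw new IllegalStateException("could not read input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<book isbn=\"123\"><title>XML Basics</title><rating>4</rating></book>"));
        System.out.println(isValid(
            "<book isbn=\"123\"><title>XML Basics</title><rating>9</rating></book>"));
        System.out.println(isValid(
            "<book><title>XML Basics</title><rating>4</rating></book>"));
    }
}
```

The second document fails because 9 violates the maxInclusive facet, and the third because the required isbn attribute is missing; a DTD could express the latter constraint but not the former.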
Parsing XML
• XML parsers are software programs that know how to read and manipulate XML documents.
• The most popular XML parser APIs today are the Simple API for XML (SAX) and the Document Object Model (DOM).
• DOM caches the parsed XML in memory, SAX does not.
• SAX sends events to registered listeners when it encounters data while parsing an XML document, but does not store the parsed data in memory. Listeners must cache data if they want to keep it around.
• DOM parses the entire XML document into memory as a hierarchical object model – a tree structure of objects called nodes.
• SAX parsers tend to consume less memory and be faster than DOM parsers. However, DOM provides the benefit of being able to randomly access the parsed document.
Simple API for XML (SAX)
• SAX is an event-driven model where you provide objects with callback methods (sometimes called listeners) that the parser invokes as it reads data from an XML document. Access to the XML data is provided by a SAX parser in a serial fashion, meaning that once a piece of data is read and provided to a callback method, that data is not read again.
• SAX interfaces with callback methods that have to be implemented by the developer:
org.xml.sax.ContentHandler - The primary listener that you will implement for almost all applications that use SAX. Called by the parser to notify the listener that data (content) was encountered.
org.xml.sax.ErrorHandler - Called when an error is encountered. Warnings and recoverable errors are reported to this handler rather than thrown as exceptions.
org.xml.sax.DTDHandler - Provides information about a DTD being used
org.xml.sax.EntityResolver - Used to get external data defined by an external entity
• org.xml.sax.helpers.DefaultHandler implements the methods of all four interfaces so one only has to extend this class
SAX Example
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.*;

public class SAXTest {
    public static void main(String[] args) throws Exception {
        String xhtmlFileName = args[0];
        String contentHandlerClass = args[1];

        // Instantiate the ContentHandler implementation named on the command line.
        ContentHandler contentHandler = (ContentHandler)
            Class.forName(contentHandlerClass).getDeclaredConstructor().newInstance();

        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setContentHandler(contentHandler);

        // Standard SAX features: report prefixes and validate the document.
        reader.setFeature(
            "http://xml.org/sax/features/namespace-prefixes", true);
        reader.setFeature(
            "http://xml.org/sax/features/validation", true);

        // Xerces-specific feature and property: validate against an XML schema.
        reader.setFeature(
            "http://apache.org/xml/features/validation/schema", true);
        reader.setProperty(
            "http://apache.org/xml/properties/schema/external-schemaLocation",
            "http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd");

        // Parse the file; the parser calls back into contentHandler as it reads.
        String uri = new File(xhtmlFileName).toURI().toString();
        InputSource input = new InputSource(uri);
        reader.parse(input);
    }
}
SAX ContentHandler Implementation
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

public class ContentHandlerExample extends DefaultHandler {
    // Collects the character data of the first <p> element.
    StringBuilder buffer = new StringBuilder();
    boolean foundTag = false;    // true once the first <p> has been seen
    boolean processTag = false;  // true while inside the first <p>

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) throws SAXException {
        System.out.println("startElement() called for tag: " + localName);
        if (!foundTag && localName.equals("p")) {
            foundTag = true;
            processTag = true;
        }
    }

    @Override
    public void characters(char[] chars, int start, int length)
            throws SAXException {
        System.out.println("characters() called");
        if (processTag) {
            buffer.append(chars, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        System.out.println("endElement() called for tag: " + localName);
        if (processTag) {
            processTag = false;
            System.out.println("Content of first paragraph: " + buffer.toString());
        }
    }
}
SAX Example: Input File and Output
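<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>XHTML Test</title></head>
  <body>
    <h1>Heading Content</h1>
    <p>First Paragraph Content</p>
    <p id="p2">Second Paragraph Content</p>
  </body>
</html>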
startElement() called for tag: html
characters() called
startElement() called for tag: head
startElement() called for tag: title
characters() called
endElement() called for tag: title
endElement() called for tag: head
characters() called
startElement() called for tag: body
characters() called
startElement() called for tag: h1
characters() called
endElement() called for tag: h1
characters() called
startElement() called for tag: p
characters() called
endElement() called for tag: p
Content of first paragraph: First Paragraph Content
characters() called
startElement() called for tag: p
characters() called
endElement() called for tag: p
characters() called
endElement() called for tag: body
characters() called
endElement() called for tag: html
Document Object Model (DOM)
• After parsing an XML document, DOM parsers generate a Document object, which represents an entire XML document instance containing references to all of the other objects generated by the DOM parser. XML elements are parsed into Element objects, XML attributes into Attr objects, text into Text objects, and so on. The common supertype for all of the XML artifacts is the Node type.
[Figure: class diagram of package org.w3c.dom. DOM 2 interfaces: Node, CharacterData, Document, Element, Attr, DocumentFragment, Entity, Notation, Text, Comment, EntityReference, DocumentType, ProcessingInstruction, CDATASection, NodeList, NamedNodeMap, DOMImplementation. DOM 2 class: DOMException, which extends the java.lang class RuntimeException.]
Document Object Model (DOM), cont.
• DOM is a tree-based model where the entire XML document is parsed and cached in memory as a tree structure of objects called nodes. For example:
Document
  Element html
    Attr xmlns="http://www.w3.org/1999/xhtml"
    Element head
      Element title
        Text XHTML Test
    Element body
      Element h1
        Text Heading Content
      Element p
        Text First Paragraph Content
      Element p
        Attr id="p2"
        Text Second Paragraph Content
DOM Node Methods
• Methods to obtain/set type-specific information: getNodeType(), getNodeName(), getNodeValue(), setNodeValue()
• Methods to obtain/set the XML namespace information: getLocalName(), getNamespaceURI(), getPrefix(), setPrefix()
• Methods to reference the attributes: hasAttributes(), getAttributes()
• Methods to get references to the Node’s parent, siblings, and children: getParentNode(), hasChildNodes(), getFirstChild(), getLastChild(), getChildNodes(), getPreviousSibling(), getNextSibling()
• Methods to add or remove children: appendChild(), replaceChild(), removeChild(), insertBefore()
• If you call getFirstChild() on the Document Node, you will get a reference to the top level XML element of the document, which is Element html in the previous example. If you call getFirstChild() on that Element (html), you will get a reference to Element head, provided no whitespace Text node precedes it.
DOM Example
import org.w3c.dom.*;
import org.apache.xerces.parsers.DOMParser;

public class DOMTest {
    public static void main(String[] args) throws Exception {
        String xhtmlFileName = args[0];

        // Parse the whole document into an in-memory tree (Xerces parser).
        DOMParser parser = new DOMParser();
        parser.parse(xhtmlFileName);
        Document document = parser.getDocument();

        // Assumes no DOCTYPE or comment precedes the root <html> element.
        Node rootNode = document.getFirstChild();
        Element htmlElement = (Element) rootNode;

        // Find the <body> child of <html>.
        NodeList childNodes = htmlElement.getChildNodes();
        Element bodyElement = null;
        for (int i = 0; i < childNodes.getLength(); i++) {
            if (childNodes.item(i).getNodeName().equals("body")) {
                bodyElement = (Element) childNodes.item(i);
                break;
            }
        }

        // Find the second <p> child of <body>.
        childNodes = bodyElement.getChildNodes();
        Element secondParagraphElement = null;
        int count = 0;
        for (int i = 0; i < childNodes.getLength(); i++) {
            if (childNodes.item(i).getNodeName().equals("p") && (++count == 2)) {
                secondParagraphElement = (Element) childNodes.item(i);
                break;
            }
        }

        // The paragraph's first child is the Text node holding its content.
        Text secondParagraphContent =
            (Text) secondParagraphElement.getFirstChild();
        System.out.println(secondParagraphContent.getNodeValue());
    }
}
Java API for XML Processing (JAXP)
• API that provides an abstraction layer to XML parser implementations (specifically implementations of DOM and SAX), and applications that process Extensible Stylesheet Language Transformations (XSLT)
• JAXP is a layer above the parser APIs that makes it easier to perform some vendor-specific tasks in a vendor-neutral fashion.
• JAXP employs the Abstract Factory design pattern to provide a pluggability layer, which allows you to plug in an implementation of DOM or SAX, or an application that processes XSLT
• The primary classes of the JAXP pluggability layer are javax.xml.parsers.DocumentBuilderFactory, javax.xml.parsers.SAXParserFactory, and javax.xml.transform.TransformerFactory.
• These classes are abstract, so you must ask the specific factory to create an instance of itself, and then use that instance to create a javax.xml.parsers.DocumentBuilder, javax.xml.parsers.SAXParser, or javax.xml.transform.Transformer, respectively.
• DocumentBuilder abstracts the underlying DOM parser implementation, SAXParser the SAX parser implementation, and Transformer the underlying XSLT processor. DocumentBuilder, SAXParser, and Transformer are also abstract classes, so instances of them can only be obtained through their respective factory.
JAXP Example
import java.io.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import org.w3c.dom.Document;
import javawebbook.sax.ContentHandlerExample;

public class JAXPTest {
    public static void main(String[] args) throws Exception {
        File xmlFile = new File(args[0]);
        File xslFile = new File(args[1]);
        File xsltResultFile = new File(args[2]);

        // DOM: obtain a DocumentBuilder from its factory and parse into a tree.
        DocumentBuilderFactory docBuilderFactory =
            DocumentBuilderFactory.newInstance();
        docBuilderFactory.setNamespaceAware(true);
        docBuilderFactory.setValidating(true);
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        Document doc = docBuilder.parse(xmlFile);

        // SAX: obtain a SAXParser from its factory and parse with a handler.
        SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
        saxParserFactory.setNamespaceAware(true);
        saxParserFactory.setValidating(true);
        SAXParser saxParser = saxParserFactory.newSAXParser();
        saxParser.parse(xmlFile, new ContentHandlerExample());

        // XSLT: obtain a Transformer for the stylesheet and apply it.
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Source xslSource = new StreamSource(xslFile);
        Transformer transformer = transformerFactory.newTransformer(xslSource);
        Source xmlSource = new StreamSource(xmlFile);
        Result xsltResult = new StreamResult(xsltResultFile);
        transformer.transform(xmlSource, xsltResult);
    }
}
JDOM and dom4j
• DOM is useful but can be awkward to use because it was designed to be independent of any programming language.
• Implementations that take advantage of the strengths of Java can be easier to use. Examples are JDOM (http://jdom.org) and dom4j (http://dom4j.org).
• Both JDOM and dom4j are open source and can be used with JAXP • Both APIs take advantage of built-in Java classes, provide an object model to represent an XML tree, are intuitive and easy to use, integrate well with SAX and DOM, support XPath, and are more efficient than DOM.
• JDOM is built on concrete classes and dom4j on interfaces.
• dom4j is more flexible, yet more complex.
• dom4j offers additional features over JDOM, such as event-based processing for handling very large documents or streamed documents.
• dom4j also aims to be a more complete solution than JDOM, whose goal is to solve only about 80% of the Java/XML problems.
Transforming XML Using XSLT
• Extensible Stylesheet Language Transformations (XSLT) are part of the Extensible Stylesheet Language (XSL).
• An XSLT stylesheet, which is simply an XML document, contains instructions on how an XML document should be transformed by an XSLT processor. XSLT is a full programming language, expressed as XML, designed specifically for reformatting XML documents. There are more than 50 XSLT elements and more than 200 attributes.
• XSL Transformations provide a way to translate the semantic descriptions of an XML document to presentational descriptions, e.g., translate XML to HTML.
• XSL Transformations allow XML data to be reordered, permit the display of attributes, and allow elements to be displayed in an order other than that in which they are given in the XML document. XSL Transformations can also add static data to the output, such as XHTML tags and CSS style specifications.
XSLT, cont.
• Writing an XSLT stylesheet simply involves writing templates for those elements that are to be a part of the output. The XSLT processor traverses the supplied XML document tree looking for elements that match these templates. Templates may include XML element and attribute contents, other markup, such as XHTML tags, and other literal and computed values.
XSLT, cont.
• An XSLT processor that was supplied such a stylesheet and a valid XML document would, for each element that matches a template, write the contents of that template to the output.
XSLT, cont.
• The default behavior of XSLT is to copy the element values and whitespace outside of elements to the output document. A template at the outermost level can be used to specify which inner elements are to be used, and in what order.
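The template mechanism described above can be sketched with a small stylesheet applied through JAXP. Both the stylesheet and the memo document below are hypothetical stand-ins written for this example, since they are not taken from the slides; the stylesheet has a single template that matches the root memo element and mixes literal XHTML tags with computed values.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {
    // Hypothetical input document.
    static final String XML =
        "<memo date=\"2004-01-15\"><to>Alice</to><body>Lunch at noon</body></memo>";

    // Hypothetical stylesheet: one template that reorders the data and pulls
    // the date attribute into the output.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\" "
        + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/memo\">"
        + "<p><xsl:value-of select=\"@date\"/>: "
        + "<xsl:value-of select=\"body\"/> "
        + "<b><xsl:value-of select=\"to\"/></b></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet text to the XML text and returns the result. */
    static String transform(String xsl, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XSL, XML));
    }
}
```

Note that the output lists the date first and the recipient last, an order different from the source document, which is the kind of reordering XSLT is meant for.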
XPath
• XPath expressions look a lot like directory path expressions for operating systems; both describe a path through a tree structure. Absolute paths begin with a / and start at the root element of the document. The XSLT processor traverses the input document in preorder fashion and keeps track of its current position. The current position is called the context node, and it is referred to with a period. Relative path specifications are relative to the context node and do not begin with a slash. Possible XPath specifications and their meanings:
/ - The root of the document.
. - Contents of the current context node.
to/text() - Contents of the text node of the to child element of the context node.
/memo/to/name - All name elements that are a child element of /memo/to.
//surname - All surname elements, even if at different levels.
/memo/to/name[1] - First name child element of /memo/to.
/memo/to/name[last()] - Last name child element of /memo/to.
@date - Contents of the date attribute of the context node.
/memo/@date - Contents of the date attribute of the memo element.
• XPath also includes functions and operators, which are not discussed here.
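XPath expressions can also be evaluated directly from Java, outside of a stylesheet, with the standard javax.xml.xpath API. The sketch below evaluates a few of the path forms listed above against a hypothetical memo document whose content was invented for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Hypothetical memo document matching the element names used above.
    static final String XML =
        "<memo date=\"2004-01-15\">"
        + "<to><name>Alice</name><name>Bob</name></to>"
        + "<from><name><surname>Truman</surname></name></from>"
        + "</memo>";

    /** Evaluates the expression against XML and returns its string value. */
    static String eval(String expression) {
        try {
            // A fresh InputSource is needed per call: the reader is consumed.
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "/memo/to/name[1]",
            "/memo/to/name[last()]",
            "//surname",
            "/memo/@date",
            "count(/memo/to/name)"
        };
        for (String expression : expressions) {
            System.out.println(expression + " = " + eval(expression));
        }
    }
}
```

When an expression selects several nodes, as /memo/to/name would, the string form of evaluate returns the value of the first node in document order; asking for a node list instead requires the overload that takes an XPathConstants return type.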