Developer: XML
by Jinyu Wang Published April 2006
Learn about the concept of XML whitespace, and get tips for avoiding problems associated with it
Many times, you may not notice that the changes you've made in XML affect how you can access the data in XML documents. For example:
<Author><FirstName>John</FirstName><LastName>Smith</LastName></Author>
is not the same thing as
<Author> <FirstName>John</FirstName> <LastName>Smith</LastName> </Author>
Here's a more complete example (see Example 1 in sample code): Let's assume that you want to get the first child element of <Author> using the DOM APIs as follows:
XMLDocument doc = parser.getDocument();
Element elem = doc.getDocumentElement();
Node node = elem.getFirstChild();
With the default setting of the Oracle XDK DOM parser, the first document returns <FirstName> while the second document returns a text node that is a whitespace node!
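The same effect can be reproduced without the XDK. The following is a minimal sketch that assumes the standard JAXP parser shipped with the JDK in place of the Oracle XDK DOM parser; the class and method names (FirstChildDemo, describeFirstChild) are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Assumption: standard JAXP stands in for the Oracle XDK DOM parser.
class FirstChildDemo {

    // Describes the first child of the document element.
    static String describeFirstChild(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node first = doc.getDocumentElement().getFirstChild();
            if (first.getNodeType() == Node.TEXT_NODE) {
                // A text node made up only of whitespace trims down to an empty string.
                return first.getNodeValue().trim().isEmpty()
                        ? "whitespace text node" : "text node";
            }
            return "element " + first.getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String compact = "<Author><FirstName>John</FirstName><LastName>Smith</LastName></Author>";
        String spaced = "<Author> <FirstName>John</FirstName> <LastName>Smith</LastName> </Author>";
        System.out.println(describeFirstChild(compact)); // element FirstName
        System.out.println(describeFirstChild(spaced));  // whitespace text node
    }
}
```

The two documents differ only in the spaces between tags, yet getFirstChild() lands on different node types.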
Similarly, sometimes the XSLT transformation doesn't generate the results you expected. (See Example 2.) An XML document needs to be transformed using XSLT. The XSL stylesheet uses the position() function to create the ordering information for <Chapter> and <Section> elements:
<?xml version="1.0"?>
<Book>
  <Chapter>
    <Section/>
    <Section/>
    <Section/>
  </Chapter>
  <Chapter>
    <Section/>
    <Section/>
    <Section/>
  </Chapter>
</Book>
However, the following XSL stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:attribute name="Position">
        <xsl:value-of select="position()"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
doesn't work that well; it creates the following result:
<?xml version = '1.0' encoding = 'UTF-8'?>
<Book Position="1">
  <Chapter Position="2">
    <Section Position="2"/>
    <Section Position="4"/>
    <Section Position="6"/>
  </Chapter>
  <Chapter Position="4">
    <Section Position="2"/>
    <Section Position="4"/>
    <Section Position="6"/>
  </Chapter>
</Book>
The positions are incorrect because position() also counts the whitespace text nodes that sit between the elements. If you remove the whitespace before calling the position() function in the XSLT transformation, using the following stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:attribute name="Position">
        <xsl:value-of select="position()"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
The result would be what you expect:
<?xml version = '1.0' encoding = 'UTF-8'?>
<Book Position="1">
  <Chapter Position="1">
    <Section Position="1"/>
    <Section Position="2"/>
    <Section Position="3"/>
  </Chapter>
  <Chapter Position="2">
    <Section Position="1"/>
    <Section Position="2"/>
    <Section Position="3"/>
  </Chapter>
</Book>
For this example, if you don't want to strip out the whitespace for all XML elements, you can instead list only the elements involved, separated by spaces: <xsl:strip-space elements="Book Chapter Section"/>.
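To try the position() behavior yourself, the sketch below runs the Book document through the stylesheet twice, once without and once with an xsl:strip-space declaration. It assumes the XSLT 1.0 processor bundled with the JDK (reached through javax.xml.transform) in place of the Oracle XDK processor. The class PositionDemo and its methods are hypothetical names, and indent="yes" is replaced with omit-xml-declaration="yes" to keep the output short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Assumption: the JDK's built-in XSLT processor stands in for the XDK one.
class PositionDemo {

    // The Book document, with whitespace between the elements.
    static final String BOOK =
          "<Book>\n <Chapter>\n  <Section/>\n  <Section/>\n  <Section/>\n </Chapter>\n"
        + " <Chapter>\n  <Section/>\n  <Section/>\n  <Section/>\n </Chapter>\n</Book>";

    // Builds the stylesheet; stripSpace is "" or an xsl:strip-space declaration.
    static String stylesheet(String stripSpace) {
        return "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
             + stripSpace
             + "<xsl:template match='*'>"
             + "<xsl:element name='{local-name()}'>"
             + "<xsl:attribute name='Position'><xsl:value-of select='position()'/></xsl:attribute>"
             + "<xsl:apply-templates select='@*|node()'/>"
             + "</xsl:element>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Transforms BOOK with the stylesheet and returns the serialized result.
    static String transform(String stripSpace) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet(stripSpace))));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(BOOK)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(""));
        System.out.println(transform("<xsl:strip-space elements='Book Chapter Section'/>"));
    }
}
```

On a conforming XSLT 1.0 processor, the first call should number the sections 2, 4, 6 and the second should number them 1, 2, 3, matching the two results shown earlier.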
In the following sections, you will learn about the concept of XML whitespace along with tips to avoid such problems.
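XML considers four characters to be whitespace: the carriage return (\r or ch(13)), the linefeed (\n or ch(10)), the tab (\t), and the space (' '). In XML documents, there are two types of whitespace: significant whitespace, which is part of the document content and should be preserved, and insignificant whitespace, which is added only to make the document easier to read while editing.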
Usually, without a DTD or XML schema definition, all whitespace is significant and should be preserved. However, with a DTD or XML schema definition, only the whitespace in the content is significant, as in the following example:
<sig>
------------------
John Smith
Product Manager
Example.com
--------------------
</sig>
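The role of the DTD can be seen in a small experiment. The sketch below is an illustration under stated assumptions: it uses the standard JAXP parser rather than the XDK, the inline DTD for <Author> is invented for this example, and ElementContentWhitespaceDemo is a hypothetical class name. When the DTD says that <Author> contains only elements, a validating parser can tell that the whitespace between those elements is insignificant, while the whitespace inside <LastName> remains content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Assumption: standard JAXP; the DTD below is made up for this illustration.
class ElementContentWhitespaceDemo {

    static final String XML =
          "<!DOCTYPE Author [\n"
        + "  <!ELEMENT Author (FirstName, LastName)>\n"
        + "  <!ELEMENT FirstName (#PCDATA)>\n"
        + "  <!ELEMENT LastName (#PCDATA)>\n"
        + "]>\n"
        + "<Author> <FirstName>John</FirstName> <LastName> Smith </LastName> </Author>";

    // Parses XML with validation on, optionally dropping element-content whitespace.
    static Document parse(boolean ignoreElementContentWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // the parser needs the DTD's content models
            factory.setIgnoringElementContentWhitespace(ignoreElementContentWhitespace);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Whitespace kept: text, FirstName, text, LastName, text
        System.out.println(parse(false).getDocumentElement().getChildNodes().getLength());
        // Whitespace between child elements dropped: FirstName, LastName
        System.out.println(parse(true).getDocumentElement().getChildNodes().getLength());
        // Whitespace inside #PCDATA content is kept in both cases
        System.out.println("[" + parse(true).getDocumentElement().getLastChild().getTextContent() + "]");
    }
}
```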
XML standards specify how XML processors should handle the whitespace.
XML Parsing. The XML spec provides a built-in attribute, xml:space, to signal whether the whitespace inside an element should be preserved. The setting applies to the element on which it is specified and is inherited by that element's descendants unless one of them overrides it. When declared, it must be given as an enumerated type whose only possible values are "default" and "preserve". If "preserve" is specified, the whitespace within the defined element must be preserved.
Based on the W3C XML specification, the Oracle XML Developer's Kit (XDK) XML parsers, by default, preserve all whitespace. Therefore, xml:space="default" and xml:space="preserve" have the same result: The whitespace is preserved. To avoid preserving whitespace, you need to set Oracle XDK parsers as follows:
XDK DOM Parser:
DOMParser parser = new DOMParser();
parser.setPreserveWhitespace(false);
SAX Parser:
SAXParser parser = new SAXParser();
parser.setPreserveWhitespace(false);
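If you parse with standard JAXP instead of the XDK, one way to approximate the effect is to remove whitespace-only text nodes from the DOM after parsing. The sketch below is only an approximation under that assumption: it is not the XDK's implementation, it ignores xml:space, and the names StripWhitespaceDemo and stripWhitespace are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Assumption: a hand-written post-processing step, not an XDK or JAXP feature.
class StripWhitespaceDemo {

    // Recursively removes text nodes that contain nothing but whitespace.
    // Note: this simple version does not honor xml:space="preserve".
    static void stripWhitespace(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespace(child);
            }
            child = next;
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<Author> <FirstName>John</FirstName> <LastName>Smith</LastName> </Author>");
        stripWhitespace(doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getFirstChild().getNodeName()); // FirstName
    }
}
```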
XSLT Transformation. The W3C XSLT specification provides two elements, xsl:strip-space and xsl:preserve-space, for handling whitespace. xsl:strip-space specifies the XML elements whose whitespace text nodes (that is, text nodes composed entirely of whitespace characters) should be stripped. Note that xsl:strip-space affects only nodes that are pure whitespace. Its elements attribute takes a list of element names separated by whitespace, and it accepts wildcards such as *. xsl:preserve-space has the same syntax but does the opposite of xsl:strip-space.
The following example (see Example 3) applies an XSL stylesheet that copies the source document, except that the whitespace text nodes are stripped:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
For an XML document as follows:
<rootElement> <childElement test="true"> Value </childElement>This is the test <childElement test="true" xml:space="preserve"> Value </childElement> <childElement xml:space="preserve"> </childElement> <childElement> </childElement> </rootElement>
The XSLT transformation will give the following result:
<rootElement><childElement test="true"> Value </childElement>This is the test <childElement test="true" xml:space="preserve"> Value </childElement><childElement xml:space="preserve"> </childElement><childElement/></rootElement>
You may notice that the whitespace is not stripped if an XML element has xml:space="preserve". This behavior is based on the XSLT specification, which defines that the whitespace will be preserved if:
An ancestor element of the text node has an xml:space attribute with a value of preserve, and no closer ancestor element has xml:space with a value of default.
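The rule amounts to "the nearest xml:space setting wins." A minimal DOM sketch of that lookup follows. It assumes standard JAXP, the names XmlSpaceDemo and isPreserved are hypothetical, and it illustrates the rule rather than how any XSLT processor implements it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Assumption: an illustration of the rule, not a processor's actual code.
class XmlSpaceDemo {

    // Walks up from the node; the closest element carrying xml:space decides.
    static boolean isPreserved(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:space")) {
                    return "preserve".equals(e.getAttribute("xml:space"));
                }
            }
        }
        return false; // no xml:space on any ancestor
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<root xml:space=\"preserve\"><a> </a>"
          + "<b xml:space=\"default\"><c> </c></b></root>");
        // Text inside <a>: nearest setting is preserve on <root>
        System.out.println(isPreserved(doc.getElementsByTagName("a").item(0).getFirstChild()));
        // Text inside <c>: the closer ancestor <b> resets it to default
        System.out.println(isPreserved(doc.getElementsByTagName("c").item(0).getFirstChild()));
    }
}
```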
XSLT also provides the normalize-space() function, which collapses each run of whitespace characters into a single space and deletes any leading and trailing whitespace from the string passed to it as an argument.
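Because normalize-space() is an XPath function, you can also call it outside a stylesheet. The sketch below assumes the javax.xml.xpath API from the JDK; NormalizeSpaceDemo and normalize are hypothetical names, and the text is placed in a throwaway element so that it does not need to be quoted inside the expression.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Assumption: standard javax.xml.xpath, used here outside of any stylesheet.
class NormalizeSpaceDemo {

    // Applies XPath normalize-space() to the given text.
    static String normalize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element holder = doc.createElement("holder"); // throwaway context node
            holder.setTextContent(text);
            doc.appendChild(holder);
            return XPathFactory.newInstance().newXPath()
                    .evaluate("normalize-space(.)", holder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Leading and trailing whitespace removed, inner runs collapsed
        System.out.println("[" + normalize("  John \n\t Smith  ") + "]"); // [John Smith]
    }
}
```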
DOM Serialization. When serializing an XML document, the output indentation adds insignificant whitespace. By default, the Oracle XDK DOM parser prints the XML DOM document with indentation.
To avoid indentation, using XDK 9i, you can sub-class the oracle.xml.parser.v2.XMLPrintDriver class as follows (see Example 4):
import oracle.xml.parser.v2.XMLPrintDriver;
import oracle.xml.parser.v2.XMLOutputStream;

class MyXMLPrintDriver extends XMLPrintDriver {
    public MyXMLPrintDriver(java.io.OutputStream A) {
        super(A);
        out.setOutputStyle(XMLOutputStream.COMPACT);
    }
}
In Oracle XDK 10g, a new method, oracle.xml.parser.v2.XMLPrintDriver.setFormatPrettyPrint(), was added to avoid the subclassing. Using Oracle XDK 10g, you can print the XML DOM document without indentation as follows (see Example 5):
XMLPrintDriver myprint = new XMLPrintDriver(System.out);
myprint.setFormatPrettyPrint(false);
Xml_doc.print(myprint);
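For comparison, here is how the same choice looks with the standard JAXP serializer. This is a sketch under the assumption that you serialize through javax.xml.transform rather than XMLPrintDriver; SerializeDemo and its methods are hypothetical names. The exact indentation produced when indenting is turned on varies between JDK versions, so the example only relies on line breaks being added.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Assumption: JAXP identity transformer used as the serializer, not XMLPrintDriver.
class SerializeDemo {

    // Serializes the document, with or without added indentation whitespace.
    static String serialize(Document doc, boolean indent) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<Author><FirstName>John</FirstName><LastName>Smith</LastName></Author>");
        System.out.println(serialize(doc, false)); // one line, no added whitespace
        System.out.println(serialize(doc, true));  // line breaks added between elements
    }
}
```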
With this basic knowledge, you can now successfully avoid problems caused by whitespace in your XML document.
Jinyu Wang (jinyu.wang@oracle.com) is a senior product manager for Oracle XML and an XML evangelist. She is a co-author of Oracle Database 10g XML & SQL (Oracle Press).