[jdom-interest] Preserving whitespace

Rosen, Alex arosen at silverstream.com
Wed Apr 4 07:50:38 PDT 2001


I made a couple of changes to preserve the whitespace in an XML document, so
that it can be round-tripped better. Here's an illustration of the problem. I'm
using the latest JDOM sources.

------------------------------------

INPUT FILE:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application
2.2//EN" "http://java.sun.com/j2ee/dtds/web-app_2_2.dtd">
<web-app>
	<servlet-mapping>
		<servlet-name>TimeEarJsp</servlet-name>
		<url-pattern>/TimeEarJsp.jsp</url-pattern>
	</servlet-mapping>
</web-app>



OUTPUT:

Validating: false
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application
2.2//EN"
"http://java.sun.com/j2ee/dtds/web-app_2_2.dtd"><web-app><servlet-mapping>
                <servlet-name>TimeEarJsp</servlet-name>
                <url-pattern>/TimeEarJsp.jsp</url-pattern>
        </servlet-mapping></web-app>

Validating: true
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application
2.2//EN"
"http://java.sun.com/j2ee/dtds/web-app_2_2.dtd"><web-app><servlet-mapping><serv
let-name>TimeEarJsp</servlet-name><url-pattern>/TimeEarJsp.jsp</url-pattern></s
ervlet-mapping></web-app>

------------------------------------

There are three problems: (1) in the validating case, all whitespace is lost;
(2) in the non-validating case, the whitespace between the DOCTYPE declaration
and the first element is lost; and (3) the whitespace between the first and
second elements is lost (both at the start tags and the end tags).

To fix problems 1 and 3, I added this method to SAXHandler so that ignorable
whitespace is handled identically to other character data. This is necessary
because the parser (Xerces at least) acts differently depending on whether
validation is on or off: it returns the same character data via
ignorableWhitespace() in one case and via characters() in the other. And, for
some reason, the whitespace between the first and second tags is reported as
ignorable even when not validating.

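    // Treat ignorable whitespace like any other character data, so it
    // ends up in the document whether or not the parser is validating.
    public void ignorableWhitespace(char[] ch,
                                    int start,
                                    int length)
                             throws SAXException
    {
        characters(ch, start, length);
    }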

(Do we need to allow this to be turned off?)
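
For anyone who wants to see the effect outside of JDOM, here is a minimal
sketch using plain JAXP/SAX. It is not the SAXHandler code itself: the class
name, the inline document (with an internal DTD, so nothing is fetched over the
network) and the collectText() helper are all made up for illustration. The
handler forwards ignorableWhitespace() to characters() in the same way as the
method above, so the text it collects should not depend on which callback the
parser happens to choose.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of JDOM.
class IgnorableWhitespaceDemo
{
    // Small document with an internal DTD subset; <a> has element-only
    // content, so the whitespace around <b> is "ignorable" per the DTD.
    static final String XML =
        "<?xml version=\"1.0\"?>\n" +
        "<!DOCTYPE a [\n" +
        "<!ELEMENT a (b)>\n" +
        "<!ELEMENT b (#PCDATA)>\n" +
        "]>\n" +
        "<a>\n" +
        "  <b>x</b>\n" +
        "</a>\n";

    // Parses XML and returns all character data seen inside the root
    // element, whichever SAX callback it was delivered through.
    static String collectText(boolean validate)
    {
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler()
        {
            public void characters(char[] ch, int start, int length)
            {
                text.append(ch, start, length);
            }

            // Same idea as the SAXHandler change: forward to characters().
            public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException
            {
                characters(ch, start, length);
            }
        };

        try
        {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validate);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(XML)), handler);
        }
        catch (Exception ex)
        {
            throw new RuntimeException(ex);
        }
        return text.toString();
    }

    public static void main(String[] args)
    {
        String nonValidating = collectText(false);
        String validating = collectText(true);
        System.out.println("Same text in both modes: "
            + nonValidating.equals(validating));
    }
}
```

If the ignorableWhitespace() override is removed from the handler, the
collected text may differ between the two modes, depending on the parser. That
is the inconsistency described above.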

To fix problem 2, I changed the DOCTYPE outputter in
XMLOutputter.output(Document, Writer) to:

        if (doc.getDocType() != null) {
            printDocType(doc.getDocType(), writer);
            // Print new line after doctype always - same reason as above
            writer.write(lineSeparator);
        }

This always prints a newline after the DOCTYPE, for the same reason as given
for printing a newline after the XML declaration: "Helps the output look better
and is semantically inconsequential".

--Alex


P.S. Here's the test code:

import org.jdom.*;
import org.jdom.output.*;
import org.jdom.input.*;
import java.io.*;
import org.xml.sax.*;

public class JDOMTest
{
	public static void main(String[] args)
	{
		if (args.length != 1)
		{
			System.err.println("Usage: java JDOMTest <xml-file>");
			return;
		}
		parse(args[0], false);
		parse(args[0], true);
	}

	private static void parse(String file, boolean validate)
	{
		try
		{
			System.out.println("Validating: " + validate);

			SAXBuilder builder = new SAXBuilder();
			builder.setValidation(validate);
			Document d = builder.build(new File(file));

			XMLOutputter outputter = new XMLOutputter();
			outputter.output(d, System.out);

			System.out.println();
		}
		catch(Exception ex)
		{
			ex.printStackTrace();
		}
	}
}


