[jdom-interest] SAXBuilder enhancement request /2
Dennis Sosnoski
dms at sosnoski.com
Fri Mar 29 19:19:51 PST 2002
I've advocated this approach for document models and I'm glad to see it
present in dom4j, but I absolutely agree that this should not be the
default!
Most XML usage in Java programs is data-centric, though, with whitespace
between elements used only for convenient formatting. A compliant XML
parser gives you the whitespace as content (unless you're validating, in
which case whitespace separating elements may be reported as ignorable
whitespace), resulting in a lot of extra components in the document
tree. This adds substantial overhead without contributing anything
useful as far as the application is concerned.
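To make the overhead concrete, here is a minimal sketch that counts the whitespace-only text nodes a non-validating parse leaves in the tree. It uses the W3C DOM parser bundled with the JDK rather than JDOM, purely so it runs without extra jars; the class name, method name, and sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WhitespaceNodes {
    // Returns {element children, whitespace-only text children} of the root.
    static int[] countChildren(String xml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        NodeList kids = root.getChildNodes();
        int elements = 0;
        int whitespaceText = 0;
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                elements++;
            } else if (n.getNodeType() == Node.TEXT_NODE
                    && n.getNodeValue().trim().isEmpty()) {
                // Indentation and line breaks between elements land here.
                whitespaceText++;
            }
        }
        return new int[] {elements, whitespaceText};
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order>\n  <id>42</id>\n  <qty>3</qty>\n</order>";
        int[] c = countChildren(xml);
        // prints "2 elements, 3 whitespace-only text nodes"
        System.out.println(c[0] + " elements, " + c[1]
                + " whitespace-only text nodes");
    }
}
```

In this small document the formatting-only nodes outnumber the elements that carry data, which is the kind of extra tree content described above.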
Many applications also want to ignore leading and trailing whitespace in
character data content. An example of this type of usage is the web.xml
file used by servlet applications. Recent versions of the servlet spec
require implementations to strip all leading and trailing whitespace from
the content of elements.
I'd personally recommend two options - one to discard character data
sequences consisting only of whitespace, the other to strip leading and
trailing whitespace from character data content. It could also be done
using a filter, as ERH suggests, though this might be a little more
complicated - for stripping trailing whitespace you'd need to make sure
you have the entire character data sequence available, rather than just
a portion.
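As a rough sketch of the filter route, the class below buffers characters() calls until the next markup event, so that the complete character data run is in hand before anything is dropped or trimmed. The names WhitespaceFilter, dropWhitespaceOnly, trim, and FilterDemo are hypothetical and not part of JDOM or dom4j. It assumes trimming is applied per run of text between markup events, which may not be what you want for mixed content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class WhitespaceFilter extends XMLFilterImpl {
    private final boolean dropWhitespaceOnly; // option 1: discard all-whitespace runs
    private final boolean trim;               // option 2: strip leading/trailing space
    private final StringBuilder pending = new StringBuilder();

    WhitespaceFilter(XMLReader parent, boolean dropWhitespaceOnly, boolean trim) {
        super(parent);
        this.dropWhitespaceOnly = dropWhitespaceOnly;
        this.trim = trim;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several chunks; hold them all.
        pending.append(ch, start, length);
    }

    // Called at each markup boundary, when the buffered run is complete.
    private void flush() throws SAXException {
        if (pending.length() == 0) return;
        String text = pending.toString();
        pending.setLength(0);
        if (trim) {
            text = text.trim(); // also empties whitespace-only runs
        } else if (dropWhitespaceOnly && text.trim().isEmpty()) {
            text = "";
        }
        if (!text.isEmpty()) {
            super.characters(text.toCharArray(), 0, text.length());
        }
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        flush();
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        flush();
        super.endElement(uri, local, qName);
    }

    @Override
    public void processingInstruction(String target, String data) throws SAXException {
        flush();
        super.processingInstruction(target, data);
    }
}

class FilterDemo {
    // Collects the character runs that survive the filter.
    static List<String> textRuns(String xml, boolean drop, boolean trim) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        WhitespaceFilter filter = new WhitespaceFilter(reader, drop, trim);
        List<String> runs = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                runs.add(new String(ch, start, length));
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return runs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a>\n  <b>  hi  </b>\n</a>";
        System.out.println(textRuns(xml, true, false)); // only the padded "hi" run survives
        System.out.println(textRuns(xml, false, true)); // the same run, trimmed
    }
}
```

With both flags off the filter passes every run through unchanged (apart from merging chunks), which matches the requirement that keeping all whitespace stays the default.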
It's worth noting that EXML was silently deleting whitespace between
elements for a very long time without any of its users complaining, as
far as I know. I finally started pointing this out to people in my
performance comparisons because it makes EXML's results look much better
than the code justifies.
- Dennis
Elliotte Rusty Harold wrote:
> At 8:41 AM +0100 3/29/02, phil at triloggroup.com wrote:
>
>> After looking at DOM4J, it appears that these guys added this
>> capability recently ("stripWhitespaceText"). This is
>> indeed very convenient when dealing with data-centric documents.
>> Can we add it to JDOM?
>>
>
> This makes me very nervous. It's a common misconception that white
> space is insignificant in XML. It's not.
>
> As long as the default is to keep all space, and throwing it away
> requires an explicit client choice, I can live with this, but please
> put big warnings about it in the JavaDoc.
>
> And you'd have to define very carefully what space is kept and what is
> not and document your choice. For instance, do you want to throw away
> all white space? All white-space only text nodes? All ignorable white
> space? These are three different things.
>
> Another thought: maybe what's needed is a more generic builder filter
> operation that could do this and a lot more? SAX filters could
> certainly handle it.