[jdom-interest] Deprecating some XMLOutputter constructors
Alex Rosen
arosen at silverstream.com
Wed Jun 20 10:13:15 PDT 2001
Jason said:
> <foo>
> This is text and it continues
> onto the next line
> </foo>
>
> Trimming and adding new lines will not pretty print this.
You're assuming that this particular element wants to be pretty-printed in that
way. It's probably good to make this work, but we need to realize that this is
a particular, stylized usage of XML, where you're allowing people to add
internal whitespace that doesn't affect the meaning. We must be able to also
pretty-print XML where text normalization would lose important information.
(The output might not be quite as pretty, but it won't lose any important
info.)
Alex C. said:
>> (2) I guess I'm surprised that we try to do anything to character data
>> that's not just whitespace.
>
> We don't, unless told to via setTextNormalize(true).
We do add whitespace around character data - in your example, "\n my honey\n"
will get output as "\n \n my honey\n\n " or something. For many applications,
that's changing the meaning.
I think that we all have different requirements that we're not really making
clear. I think there are two occasions when you'd want to use pretty-printing:
(1) Dumping to System.out for debugging purposes. Here, the primary goal is
readable output. It would be good if the output XML was as close to the "real"
in-memory XML as possible, but you might make some trade-offs of accuracy for
readability. For example, even if whitespace is significant for your app, you
may still want to normalize whitespace (or even word wrap), so it's easier to
read, even though this changes the meaning of your data.
(2) Writing to a file that is user-readable. You want to make the file easy to
read and edit, but you do NOT want the meaning to change. For example, I might
read in a config file, modify it in some way, and write it back out. If I
programmatically add some nodes, I want to pretty-print it so that it doesn't
look like this when I write it:
<root>
<existing-node1>yellow</existing-node1>
<existing-node2>red</existing-node2><new-node1>green</new-node1><new-node2>blue</new-node2>
</root>
Of course, for case 2, what's needed to preserve meaning of the XML depends on
your app. Text normalization is fine for XHTML, but not other formats. I think
this comes down to two variables: does whitespace normalization lose data, and
is mixed content used? For data-oriented applications, usually mixed content is
not used, and whitespace normalization would be bad. For document-oriented
applications, mixed content is often used, but I don't know if there's a common
convention for whitespace normalization. For XHTML, whitespace normalization is
fine, but is that the case for DocBook or whatever?
Here's the algorithm I'd use for pretty-printing: For any element that has
child elements, replace any text nodes that consist only of whitespace (or that
are empty) with the necessary newline+indent. (In other words, we only modify
whitespace-only mixed content nodes.) There would also be a separate option to
normalize whitespace.
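
To make the rule above concrete, here is a rough sketch in Java. It is not XMLOutputter code and it does not use the JDOM API: the Node class, the prettyPrint method and the indent parameter are all made up for illustration, and escaping of special characters is left out. It assumes that "empty" also covers the case where two child elements are adjacent with no text node between them, and that text containing any non-whitespace character is written exactly as it is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal node model for illustration only; not the JDOM API.
final class PrettyPrintSketch {

    static final class Node {
        final String name;            // element name, or null for a text node
        final String text;            // text content, or null for an element
        final List<Node> children;

        private Node(String name, String text, List<Node> children) {
            this.name = name;
            this.text = text;
            this.children = children;
        }
        static Node text(String s) {
            return new Node(null, s, new ArrayList<>());
        }
        static Node element(String name, Node... kids) {
            return new Node(name, null, new ArrayList<>(Arrays.asList(kids)));
        }
        boolean isElement() {
            return name != null;
        }
    }

    // True for empty strings and for strings made only of XML whitespace.
    static boolean isWhitespaceOnly(String s) {
        return s.chars().allMatch(c -> c == ' ' || c == '\t' || c == '\n' || c == '\r');
    }

    static String prettyPrint(Node root, String indent) {
        StringBuilder out = new StringBuilder();
        write(root, 0, indent, out);
        return out.toString();
    }

    private static void write(Node e, int depth, String indent, StringBuilder out) {
        out.append('<').append(e.name).append('>');
        boolean hasChildElements = e.children.stream().anyMatch(Node::isElement);
        boolean afterText = false;    // was the last thing written significant text?
        for (Node child : e.children) {
            if (child.isElement()) {
                // No significant text before this element: generate the layout whitespace.
                if (!afterText) newlineIndent(out, indent, depth + 1);
                write(child, depth + 1, indent, out);
                afterText = false;
            } else if (!hasChildElements || !isWhitespaceOnly(child.text)) {
                // Text-only content, or text with real characters: written verbatim.
                out.append(child.text);
                afterText = true;
            }
            // Otherwise: whitespace-only text next to child elements is dropped here
            // and regenerated as newline+indent at the next element or end tag.
        }
        if (hasChildElements && !afterText) newlineIndent(out, indent, depth);
        out.append("</").append(e.name).append('>');
    }

    private static void newlineIndent(StringBuilder out, String indent, int levels) {
        out.append('\n');
        for (int i = 0; i < levels; i++) out.append(indent);
    }
}
```

Under these assumptions, an element that holds only text (even whitespace-only text) is never touched, which is the property that matters for the config-file case.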
For data-oriented applications (w/o mixed content), this will guarantee that
the meaning is unchanged, since only mixed-content text nodes would be
modified. If you're just dumping out the text for debugging, you might turn on
whitespace normalization, to make it easier to read.
For document-oriented applications where whitespace is unimportant, it would be
fine too, and the normalization option would solve Jason's case. For
document-oriented applications where whitespace is important (i.e.
whitespace-only text nodes need to be preserved), you shouldn't be trying to
pretty-print anyway, because pretty-printing has to change your whitespace
somehow.
I think that this would be no worse than the current scheme in all situations,
and better in some. Anything I've missed?
Alex
P.S. To combine several previous examples into one, this new scheme would cause
this XML:
<hello>
my honey
<hello>my baby</hello><hello>my ragtime gal...</hello>
<foo>
This is text and it continues
onto the next line
</foo>
</hello>
to be pretty-printed like this, with whitespace normalization turned off:
<hello>
my honey
<hello>my baby</hello>
<hello>my ragtime gal...</hello>
<foo>
This is text and it continues
onto the next line
</foo>
</hello>
and like this, with whitespace normalization turned on:
<hello>my honey
<hello>my baby</hello>
<hello>my ragtime gal...</hello>
<foo>This is text and it continues onto the next line</foo>
</hello>
[In order to get "my honey" to work for that last one, I think I have to modify
the algorithm a bit. If text normalization is on, we should add the
newline+indent for ALL mixed-content text nodes, not just whitespace-only
ones.]
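
As a rough sketch of that modified rule, here is the decision for a single text node written in Java. Everything in it is hypothetical (the TextNodeRule class, the render and collapse methods, and their parameters are not part of JDOM). It assumes that normalization means trimming the text and collapsing each run of whitespace to one space, and that the newline+indent goes after the normalized text, which is how the "my honey" line comes out in the example above.

```java
// Hypothetical helper for illustration only; not the JDOM API.
final class TextNodeRule {

    // Assumed meaning of "normalize": trim, then collapse whitespace runs to one space.
    static String collapse(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    /**
     * Returns what to write for one text node.
     *
     * parentHasChildElements: true when the text sits next to child elements.
     * normalizeOn:            the separate whitespace-normalization option.
     * newlineIndent:          the newline plus indentation for the current depth.
     */
    static String render(String text, boolean parentHasChildElements,
                         boolean normalizeOn, String newlineIndent) {
        if (!parentHasChildElements) {
            // Text-only element: no layout whitespace is ever added.
            return normalizeOn ? collapse(text) : text;
        }
        if (text.trim().isEmpty()) {
            // Whitespace-only (or empty) text next to elements becomes layout.
            return newlineIndent;
        }
        // Significant text next to elements: left verbatim when normalization
        // is off, normalized and followed by newline+indent when it is on.
        return normalizeOn ? collapse(text) + newlineIndent : text;
    }
}
```

This only covers text nodes that exist. Two adjacent elements with no text node between them would still need the newline+indent added by whatever code walks the children.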