[jdom-interest] BUG: XMLOutputter inserts extra empty lines
Andriy Palamarchuk
apa3a at yahoo.com
Fri Nov 30 09:43:23 PST 2001
Bradley S. Huffman wrote:
> > This just goes to prove the adage that all whitespace handling in XML is a
> > pain.
>
> Yes it is!
Agreed, but in this case the whitespace handling is inconsistent. If we
print with newlines OFF, everything is compressed into one line. I'd expect
that with newlines ON the only difference would be one newline before each
indentation, but instead we get these extra newlines. Additionally, there
are no extra newlines between start tags.
> This post goes to the question of what does newlines really imply? It's
> sounds easy at first. We just use newlines, normalize, and indent to take:
>
> <payroll>
> <employee><firstname>
Brad</firstname><lastname>Huffman</lastname>
> </employee>
> <employee><firstname>John </firstname>
> <lastname>Doe</lastname> </employee>
> </payroll>
The XSLT specification has a description of whitespace stripping that is
relevant to this discussion:
http://www.w3.org/TR/1999/REC-xslt-19991116
Basically, XSLT removes a whole text node based on two decisions:
1) whether an xml:space attribute is in effect (BTW, does JDOM check for
this attribute?)
2) whether the node contains anything except whitespace
(there are also other XSLT-specific considerations)
For details see section 3.4, "Whitespace Stripping".
I assume here that a node in this specification is the text between any two
tags. Please correct me if my assumption is wrong.
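To make the two decisions concrete, here is a minimal sketch in plain Java.
It does not use any JDOM classes; the class and method names are
hypothetical, and the caller is assumed to have already worked out whether
xml:space="preserve" is in effect for the text node. The other
XSLT-specific considerations are left out.

```java
final class WhitespaceStripping {

    /** True if every character is XML whitespace (space, tab, CR, LF). */
    static boolean isWhitespaceOnly(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                return false;
            }
        }
        return true;
    }

    /**
     * Decision 1: never strip while xml:space="preserve" is in effect.
     * Decision 2: strip only if the node holds nothing but whitespace.
     * (preserveSpace is a hypothetical flag supplied by the caller.)
     */
    static boolean shouldStrip(String text, boolean preserveSpace) {
        return !preserveSpace && isWhitespaceOnly(text);
    }

    public static void main(String[] args) {
        System.out.println(shouldStrip("\n  ", false));  // true
        System.out.println(shouldStrip("\n  ", true));   // false
        System.out.println(shouldStrip(" Brad", false)); // false
    }
}
```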
I suggest changing the concept of whitespace handling in this direction:
1) decide where and which whitespace can be removed, and remove it
2) insert indentation and newlines *only* in nodes where whitespace was
removed
Decision (1) can be governed by the xml:space attribute and a "whitespace
trimming level". Or is it better to call it a "normalization level"?
The levels would be:
1) no trimming
2) trim whitespace-only text nodes (as described above for the XSLT
specification)
3) (2) + trim leading/trailing whitespace (risky, but can work for some XML
languages)
4) (3) + trim whitespace inside the text (*very* risky, not sure if we need
this level)
After trimming whitespace you are free to insert indentation and newlines in
the places of the trimmed whitespace. As you see, this depends on the
trimming level.
I think this approach is simple and predictable.
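The four levels could be sketched roughly as follows. This is only an
illustration in plain Java, not existing XMLOutputter code: the TrimLevel
enum and the trim() method are hypothetical names, level 4 is assumed to
mean "collapse each internal run of whitespace into one space", and
xml:space handling is left out to keep it short.

```java
final class WhitespaceTrimming {

    /** Hypothetical names for levels 1 to 4 described above. */
    enum TrimLevel { NONE, EMPTY_NODES, LEADING_TRAILING, INTERNAL }

    static boolean isXmlSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    static boolean isWhitespaceOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isXmlSpace(s.charAt(i))) return false;
        }
        return true;
    }

    /** Returns the trimmed text, or null if the whole text node is removed. */
    static String trim(String text, TrimLevel level) {
        if (level == TrimLevel.NONE) return text;          // level 1
        if (isWhitespaceOnly(text)) return null;           // levels 2 to 4
        if (level == TrimLevel.EMPTY_NODES) return text;   // level 2 stops here

        // Level 3: cut leading and trailing whitespace.
        int start = 0;
        int end = text.length();
        while (start < end && isXmlSpace(text.charAt(start))) start++;
        while (end > start && isXmlSpace(text.charAt(end - 1))) end--;
        String trimmed = text.substring(start, end);
        if (level == TrimLevel.LEADING_TRAILING) return trimmed;

        // Level 4 (assumed meaning): collapse internal whitespace runs.
        StringBuilder sb = new StringBuilder();
        boolean inRun = false;
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (isXmlSpace(c)) {
                inRun = true;
            } else {
                if (inRun) sb.append(' ');
                inRun = false;
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "  Some   randomly\n spaced text ";
        for (TrimLevel level : TrimLevel.values()) {
            System.out.println(level + ": [" + trim(text, level) + "]");
        }
    }
}
```

A null result marks a place where the outputter would later be free to put
a newline and indentation.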
> To make it aesthetically pleasing as in:
>
> <payroll>
> <employee>
> <firstname>Brad</firstname>
> <lastname>Huffman</lastname>
> </employee>
> <employee><firstname>John</firstname>
> <lastname>Doe</lastname>
> </employee>
> </payroll>
Yes, you'd get this at trimming level 3 or 4, if that suits your XML
language, but by default I'd go with level 2.
> But there are 4 situations where can have text content, between
> <start><start>, <start></end>, </end><start>, and </end></end> tags.
I suggest treating text the same way, no matter which tags it sits between.
> For the most common case of short text content between a start and end tag
> a single line is what we want case it looks best.
>
> <firstname>Brad</firstname>
> <lastname>Huffman</lastname>
>
> But then in cases like:
>
> </employee>
> Some randomly spaced text
<employee>
This will be saved AS IS at trimming levels (1) and (2), and reformatted
accordingly at levels (3) and (4). Newlines and indentation do not affect
whitespace trimming at all; they are added later.
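A rough sketch of that second pass, at level 2: a newline and indentation
are written only where a whitespace-only text node was dropped, and all
other text is copied unchanged. It works on a flat list of tokens (tags
and text) rather than on JDOM objects; the token list, the class name and
the format() method are all assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.List;

final class IndentWhereTrimmed {

    static boolean isWhitespaceOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') return false;
        }
        return true;
    }

    /** tokens: tags (starting with '<') and text, in document order. */
    static String format(List<String> tokens, String indent) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.startsWith("</")) {            // end tag
                depth--;
                out.append(t);
            } else if (t.startsWith("<")) {      // start or empty-element tag
                out.append(t);
                if (!t.endsWith("/>")) depth++;
            } else if (isWhitespaceOnly(t)) {
                // The whitespace-only node is dropped; a newline plus
                // indentation takes its place. An end tag that follows
                // sits one level shallower than the current depth.
                boolean beforeEndTag = i + 1 < tokens.size()
                        && tokens.get(i + 1).startsWith("</");
                int level = beforeEndTag ? depth - 1 : depth;
                out.append('\n');
                for (int k = 0; k < level; k++) out.append(indent);
            } else {
                out.append(t);                   // other text is kept AS IS
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<payroll>", "\n",
                "<employee>", "<firstname>", "John ", "</firstname>", "  \n ",
                "<lastname>", "Doe", "</lastname>", " ", "</employee>", "\n",
                "</payroll>");
        System.out.println(format(tokens, "  "));
    }
}
```

With this input, "John " keeps its trailing space, and <employee> and
<firstname> stay on one line because no whitespace was removed between
them.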
<skipped>
> Some other possible modes might be:
>
> canonical:
> See http://www.w3.org/TR/xml-c14n. Even though I think it would
> be better to have a converter to transform the Document itself
> XMLOutputter is already close to outputting in canonical form it
> might be worth it to have both.
If you are referring to section 3.2, "Whitespace in Document Content", of
that document, then I don't understand how it affects us: it says to retain
all the whitespace.
> line wrap or text wrap?
> wrap a line after so many chars, or maybe just wrap text.
> Might help with some HTML/XHTML, or might this functionality be
> better left to something like HTML Tidy.
Are you suggesting inserting newlines into the text to wrap it? Newlines and
spaces in text are significant at the corresponding whitespace trimming
levels. In general I think this functionality is unnecessary. I agree with
your suggestion to use an appropriate tool, like HTML Tidy or a stylesheet,
for XML languages where whitespace is not significant (HTML/XHTML).
BTW, even in HTML/XHTML spaces are significant in some cases, e.g. in
<pre></pre> blocks.
> alignText:
> Treat all text content like tags and align them. Example
> <name>Bradley S. Huffman</name> could become:
>
> <name>
> Bradley S. Huffman
> </name>
I prefer keeping tags and their text on the same line even for long text:
the document structure stays compact and is still clearly visible.
JDOM needs simple and predictable formatting behaviour. If somebody wants to
do something more complex, they can use tools created for that purpose
(XSLT).
What do you think about this?
Andriy Palamarchuk