<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
<META NAME="GENERATOR" CONTENT="GtkHTML/3.0.9">
</HEAD>
<BODY>
On Thu, 2004-12-09 at 11:59, Bradley S. Huffman wrote:
<BLOCKQUOTE TYPE=CITE>
<PRE><FONT COLOR="#737373"><I>Ken Roberts writes:
> On Thu, 2004-12-09 at 06:38, Elliotte Harold wrote:
>
> > setIgnoringAllWhitespace() is the wrong name for this functionality. Do
> > you really want to throw away all white space?
> > Eveninrecordlikedocumentsthiscouldbeveryhardtoread. I think what you
> > really want to do is throw away all text nodes that consist of white
> > space exclusively, but retain all white space in text nodes that contain
> > any non-whitespace characters. The correct name for this method would
> > be setIgnoringBoundaryWhitespace(). The functionality proposed is fine.
> > I just want to make sure we get the name right.
>
>
> What something like this should do is convert an infinite amount of
> whitespace in a single instance into a single space. Not sure about
> "middle" text, but an equivalent of String.trim() would probably be OK
> anywhere if you choose this option. Keep in mind that it's an OPTION
> rather than a change in default behavior.
You have to be careful when trimming whitespace or something like
<p>This is a
<i> test</i>
sentence. </p>
could end up as
<p>This is a<i>test</i>sentence.</p>
which may or may not be what is really desired.
Brad
</I></FONT></PRE>
</BLOCKQUOTE>
<BR>
That's true. I'm not sure how parsing works in JDOM, but if I were writing a C or Java parser, the tokenizer would still separate the tokens correctly even for the shortened string.<BR>
<BR>
What I was getting at is that if I chose a method with the name being discussed, my intent would be to minimize whitespace. In other words, I would care that there was whitespace between two tokens, just not how much.<BR>
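<BR>
As a rough sketch of that intent in plain Java (TokenDemo and tokens are made-up names for illustration, not anything in JDOM), splitting on runs of whitespace gives the same tokens no matter how much whitespace separates them:<BR>
<BR>
<PRE>
```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of JDOM.
final class TokenDemo {
    // Split on runs of whitespace, so only the presence of whitespace
    // between tokens matters, not the amount.
    static String[] tokens(String text) {
        String trimmed = text.trim();
        // Splitting an empty string would yield one empty token, so handle it here.
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("This is a\n   test")));
        System.out.println(Arrays.toString(tokens("This is a test")));
    }
}
```
</PRE>
<BR>
Both calls in main should print the same four tokens, which is all I mean by "minimizing" whitespace.<BR>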
<BR>
One could convert every sequence of whitespace into a single space, but then parsing your example above would give:<BR>
<BR>
<p>This is a <i> test</i> sentence.</p><BR>
<BR>
Once the italics markup is removed, there would still be two spaces between "a" and "test". When converting to HTML this would not matter in the least, but if you are parsing a document and expect every run of whitespace to be exactly one space, you would not get the correct result.<BR>
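<BR>
Here is a minimal sketch of the problem in plain Java, assuming nothing about how JDOM does it. CollapseDemo, collapse and stripTags are hypothetical names, and the regex tag stripping is only a stand-in for a real parser:<BR>
<BR>
<PRE>
```java
// Hypothetical helper for illustration only; not part of JDOM.
final class CollapseDemo {
    // Replace every run of whitespace with a single space.
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ");
    }

    // Crude tag removal with a regex; a real parser would build a tree instead.
    static String stripTags(String markup) {
        return markup.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String source = "<p>This is a\n<i> test</i>\nsentence. </p>";

        // Collapsing the markup leaves one space on each side of the <i> tag.
        String collapsed = collapse(source);
        System.out.println(collapsed);

        // Removing the tags joins those two spaces into a double space.
        String textOnly = stripTags(collapsed);
        System.out.println(textOnly);

        // Collapsing again after the tags are gone gives single spaces throughout.
        System.out.println(collapse(textOnly).trim());
    }
}
```
</PRE>
<BR>
The point is the order of operations: collapsing the markup first leaves "a  test" with two spaces once the tags are gone, while collapsing the extracted text afterwards gives "This is a test sentence." with single spaces.<BR>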
<BR>
</BODY>
</HTML>