<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">

<HTML>

<HEAD>

  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">

  <META NAME="GENERATOR" CONTENT="GtkHTML/3.0.9">

</HEAD>

<BODY>

On Thu, 2004-12-09 at 11:59, Bradley S. Huffman wrote:

<BLOCKQUOTE TYPE=CITE>

<PRE><FONT COLOR="#737373"><I>Ken Roberts writes:

&gt; On Thu, 2004-12-09 at 06:38, Elliotte Harold wrote:

&gt; 

&gt; &gt; setIgnoringAllWhitespace()  is the wrong name for this functionality. Do 

&gt; &gt; you really want to throw away all white space? 

&gt; &gt; Eveninrecordlikedocumentsthiscouldbeveryhardtoread. I think what you 

&gt; &gt; really want to do is throw away all text nodes that consist of white 

&gt; &gt; space exclusively, but retain all white space in text nodes that contain 

&gt; &gt;   any non-whitespace characters. The correct name for this method would 

&gt; &gt; be setIgnoringBoundaryWhitespace(). The functionality proposed is fine. 

&gt; &gt; I just want to make sure we get the name right.

&gt; 

&gt; 

&gt; What something like this should do is convert an infinite amount of

&gt; whitespace in a single instance into a single space.  Not sure about

&gt; &quot;middle&quot; text, but an equivalent of String.trim() would probably be OK

&gt; anywhere if you choose this option. Keep in mind that it's an OPTION

&gt; rather than a change in default behavior.

You have to be careful when trimming whitespace or something like

    &lt;p&gt;This is a 

              &lt;i&gt;   test&lt;/i&gt;

       sentence.   &lt;/p&gt;

could end up as

    &lt;p&gt;This is a&lt;i&gt;test&lt;/i&gt;sentence.&lt;/p&gt;

which may or may not be what is really desired.

Brad

</I></FONT></PRE>

</BLOCKQUOTE>

<BR>

That's true.&nbsp; I'm not sure how the parsing works in JDOM, but if I were writing a C or Java parser, the tokenizer would still separate the tokens correctly even for the short string.<BR>

<BR>

What I was getting at is that if I were to choose a method named as the one being discussed, my intent would be to minimize whitespace.&nbsp; In other words, I would care that there was whitespace between two tokens, just not how much.<BR>

<BR>

One could convert every sequence of whitespace into a single space, but then parsing your example above would give:<BR>

<BR>

&lt;p&gt;This is a &lt;i&gt; test&lt;/i&gt; sentence. &lt;/p&gt;<BR>

<BR>

Once the italics markup is handled, there would still be two spaces between &quot;a&quot; and &quot;test&quot;: one at the end of the text before the &lt;i&gt; and one at the start of the text inside it.&nbsp; If one were converting to HTML this would not matter in the least, but if you're parsing a document and expect exactly one space wherever there is whitespace, you would not get the correct result.<BR>

<BR>
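A minimal sketch of the effect in plain Java, assuming the three text nodes of the example are available as strings.&nbsp; The class and method names are made up for illustration and are not part of the JDOM API:<BR>
<PRE>
```java
class WhitespaceDemo {
    // Replace every run of whitespace with a single space (no trimming).
    public static String collapse(String text) {
        return text.replaceAll("\\s+", " ");
    }

    // Collapse each text node on its own, then concatenate the results,
    // which is what remains once the markup between the nodes is dropped.
    public static String collapsePerNode(String[] textNodes) {
        StringBuilder out = new StringBuilder();
        for (String node : textNodes) {
            out.append(collapse(node));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Text before, inside, and after the i element in the example.
        String[] nodes = {
            "This is a \n              ",
            "   test",
            "\n       sentence.   "
        };
        String perNode = collapsePerNode(nodes);
        // Brackets make the leading and trailing spaces visible.
        System.out.println("[" + perNode + "]");           // [This is a  test sentence. ]
        System.out.println("[" + collapse(perNode) + "]"); // [This is a test sentence. ]
    }
}
```
</PRE>
Collapsing node by node leaves the doubled space at the element boundary; only a second pass over the combined text removes it, and by then the information about where the italics began is gone.<BR>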

</BODY>

</HTML>