[jdom-interest] Internal DTD subset verification
Elliotte Rusty Harold
elharo at metalab.unc.edu
Wed May 8 18:02:19 PDT 2002
At 11:38 AM -0700 5/8/02, Dennis Sosnoski wrote:
>Not to get involved in the main point of this discussion, but...
>
>Elliotte Rusty Harold wrote:
>
>> ...Keep in mind that in many scenarios I/O concerns are likely to
>>swamp any issues with verification, and when they don't the speed
>>of the underlying SAX parser is probably the second biggest factor.
>
>Not even close. The build time for a document model is much larger
>than the parsing time for fast parsers. My current published test
>round, at http://www.sosnoski.com/opensrc/xmlbench/results.html,
>shows JDOM beta 7 taking about 4 times as long as the SAX2 parse
>alone for medium to large documents. The SAX2 parsers I was working
>with had high overhead for small documents, so there the total JDOM
>build time was only about twice the SAX2 parse time - that should
>change with Piccolo in the next set of tests.
>
Your tests haven't convinced me. There are a lot of problems with
them, but most importantly for this case, please allow me to quote
from your site:
All tests involving I/O use memory buffers to avoid any external
timing variables. Input and output uses streams (specifically
ByteArrayInputStream and ByteArrayOutputStream) to most closely
simulate the normal usage. Some of the models support direct input
from character arrays or Strings with higher performance than stream
input, but using this type of input for testing gives misleading
results; in real world applications, text documents are rarely
resident in memory to be passed directly to parsers. Validation is
turned off in all tests, and the documents used for the test do not
specify DTDs.
In other words, your tests deliberately exclude the cost of I/O,
which makes sense for what you're doing, because I/O would indeed
swamp what you're trying to measure. However, the flip side is that
there's not much point in us optimizing input that's going to be
swamped by I/O in any real-world scenario.
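To make the methodology concrete: the quoted setup times the parse alone by holding the document bytes in memory and feeding them to the parser through a ByteArrayInputStream, so disk and network never enter the measured interval. A minimal sketch of that kind of harness, using only the JDK's JAXP SAX API (not Sosnoski's actual benchmark code, and the document size and run counts here are arbitrary):

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;

public class ParseTiming {
    public static void main(String[] args) throws Exception {
        // Build a test document entirely in memory, as the quoted
        // methodology describes, so I/O never enters the timed loop.
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < 1000; i++) {
            sb.append("<item n=\"").append(i).append("\">text</item>");
        }
        sb.append("</root>");
        byte[] doc = sb.toString().getBytes("UTF-8");

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler(); // no-op event sink

        // Warm up the parser once before timing.
        parser.parse(new ByteArrayInputStream(doc), handler);

        int runs = 50;
        long start = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            parser.parse(new ByteArrayInputStream(doc), handler);
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("SAX2 parse: " + (elapsed / (double) runs)
                + " ms per pass (no I/O in the timed loop)");
    }
}
```

A JDOM build time would be measured the same way, substituting SAXBuilder.build() for the raw parse; the difference between the two figures is the model-construction overhead under dispute, and it disappears into the noise once a real stream source replaces the byte array.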
>Measured memory usage across a variety of documents shows Xerces
>about on a par with JDOM if you turn off the "deferred node
>expansion" feature of Xerces. If you *don't* turn this off (it's on
>by default) both time and memory performance is abysmal for Xerces
>on small documents.
How are you actually measuring memory usage? I did not find any
details on your site. Based on the following, it's not obvious to me
that you're getting accurate counts:
Testing the memory usage of the representations works a little
differently, in that the program keeps all the constructed copies of
the document and pauses between relevant tests to encourage garbage
collection. Memory usage per copy of the representation is found by
dividing the total memory used by the number of copies.
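For reference, the usual JVM-level version of that technique samples Runtime heap counters around the allocations and divides by the copy count. A minimal sketch, with a byte array standing in for a built document tree (the real benchmark would hold JDOM Document instances), which also shows why the numbers are hard to trust: System.gc() is only a hint, so the "used memory" samples can include uncollected garbage:

```java
import java.util.ArrayList;
import java.util.List;

public class MemoryPerCopy {
    // Hypothetical stand-in for a built document representation.
    static Object buildCopy() {
        return new byte[64 * 1024];
    }

    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        // Encourage (but cannot force) garbage collection before sampling.
        for (int i = 0; i < 5; i++) {
            rt.gc();
            try { Thread.sleep(50); } catch (InterruptedException e) { }
        }
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedMemory();
        int copies = 20;
        List<Object> held = new ArrayList<Object>(copies);
        for (int i = 0; i < copies; i++) {
            held.add(buildCopy());
        }
        long after = usedMemory();
        // Per-copy figure as the quoted methodology computes it.
        System.out.println("approx bytes per copy: "
                + (after - before) / copies);
        // "held" keeps every copy reachable until here, so none can be
        // collected between the two samples.
        if (held.size() != copies) throw new AssertionError();
    }
}
```

Whether the published numbers used something like this or a more precise tool (heap profiler, instrumented allocator) is exactly the detail the site doesn't state.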
--
+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
| The XML Bible, 2nd Edition (Hungry Minds, 2001) |
| http://www.cafeconleche.org/books/bible2/ |
| http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/ |
+----------------------------------+---------------------------------+
| Read Cafe au Lait for Java News: http://www.cafeaulait.org/ |
| Read Cafe con Leche for XML News: http://www.cafeconleche.org/ |
+----------------------------------+---------------------------------+