[jdom-interest] Internal DTD subset verification

Elliotte Rusty Harold elharo at metalab.unc.edu
Wed May 8 05:44:34 PDT 2002


At 8:36 PM -0700 5/7/02, Philip Nelson wrote:


>No, you are saying that in no case should a programmer have any say in how to
>enforce the "deliberately and rightfully draconian rules of XML". It doesn't
>matter that the characters may have been thoroughly screened elsewhere.

Yes, it does. I'm on record as saying that it's OK to skip 
verification on building through a parser.

>Never
>mind that the parser at the other end would do exactly what it is supposed to
>do and reject a document that isn't well formed. In fact, the XML application
>model seems to be one where primary responsibility for checking documents is
>given to the parser.  The client's system is protected by their parser, the
>DTD, the schema and this is done when the document is loaded.

Many applications pass information through DOM Document objects, SAX 
event streams, abstract XSLT source trees, serialized objects, TrAX 
Source and Result objects, and other forms that aren't serialized 
bytes that become parsed. We cannot assume there's a parser on the 
other side.

>Producing
>documents has been an afterthought in many apis, often left as string
>manipulation exercises for the programmer.

We can do better than that.

>It simply doesn't make sense to
>define an architecture where you have to verify every single character going in
>and out.

I think it makes as much sense as verifying that every hour field in 
a Time object contains a number between 1 and 12 or that an area code 
field in a PhoneNumber object contains a three-digit number. Verifying 
constraints automatically is precisely why we do object-oriented 
programming. This is one big reason why classes use access 
protection and getters and setters rather than direct access to 
fields: so programmers can be assured that objects are always in a 
consistent state and write their code based on that assumption 
without constantly checking to make sure that the time isn't -72 
before using a Time object. The XML constraints are more complex, but 
not fundamentally different.
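
To make the Time analogy concrete, here is a minimal sketch of a class that 
checks its constraint in the setter. ClockTime and ClockTimeDemo are 
hypothetical names invented for this illustration; they are not part of JDOM 
or the JDK.

```java
// Hypothetical illustration only; not a JDOM or JDK class.
final class ClockTime {
    private int hour; // invariant: always between 1 and 12

    ClockTime(int hour) {
        setHour(hour); // the constructor goes through the same check
    }

    int getHour() {
        return hour;
    }

    void setHour(int hour) {
        // Reject bad values at the point of entry so that no caller
        // ever observes an inconsistent object.
        if (hour < 1 || hour > 12) {
            throw new IllegalArgumentException(
                "hour must be between 1 and 12, got " + hour);
        }
        this.hour = hour;
    }
}

class ClockTimeDemo {
    public static void main(String[] args) {
        ClockTime t = new ClockTime(9);
        try {
            t.setHour(-72); // fails here, not later in unrelated code
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("hour is still " + t.getHour());
    }
}
```

Because the field is private and every write goes through setHour, code that 
receives a ClockTime never has to re-check the hour before using it.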


>Here is the jdom mission statement ...
>
>There is no compelling reason for a Java API to manipulate XML to be complex,
>tricky, unintuitive, or a pain in the neck. JDOM is both Java-centric and
>Java-optimized. It behaves like Java, it uses Java collections, it is a
>completely natural API for current Java developers, and it provides a low-cost
>entry point for using XML.
>
>While JDOM interoperates well with existing standards such as the Simple API
>for XML (SAX) and the Document Object Model (DOM), it is not an abstraction
>layer or enhancement to those APIs. Rather, it seeks to provide a robust,
>light-weight means of reading and writing XML data without the complex and
>memory-consumptive options that current API offerings provide.
>
>...
>
>Not much doubt about where the original goals were. Light weight and memory
>consumption are mentioned prominently.

I can't think of how verification adds a single byte to memory 
consumption. Maybe the .class files are a few K bigger? Honestly, 
that's not worth worrying about. Is there any per-object overhead 
anywhere? I can't think of any.

As to speed, a false idol if ever there was one, the worst case 
scenario I've seen so far is a 20% slowdown. In general, I suspect the 
difference would be a lot less. Keep in mind that in many scenarios 
I/O concerns are likely to swamp any issues with verification, and 
when they don't the speed of the underlying SAX parser is probably 
the second biggest factor.

>Java centric. These goals are what first
>attracted me to JDOM along with the hope for a simpler api to XML, something I
>didn't understand well at all.

Simplicity is far and away the most important concern here, and on 
that score verification's impact is very close to zero. In fact, you can 
argue it's positive because it makes a lot of common mistakes 
fail-fast instead of fail-slow. But in their code, programmers can 
ignore the verifier class completely. We might even be able to make 
it package private. A typical JDOM programmer doesn't need to use it 
at all.
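
As a rough sketch of what fail-fast means here, assume a package-private 
checker that the element constructor calls on the programmer's behalf. 
NameVerifier and SketchElement are hypothetical names, and the name rule is 
deliberately simplified to ASCII; real XML names allow a much larger set of 
characters, so treat this as an illustration rather than JDOM's actual code.

```java
// Hypothetical sketch; simplified rules, not JDOM's real classes.
final class NameVerifier {          // package-private: callers never see it
    private NameVerifier() {}

    // Returns null if the name is acceptable, otherwise the reason it is not.
    static String checkName(String name) {
        if (name == null || name.isEmpty()) {
            return "name is empty";
        }
        char first = name.charAt(0);
        if (!(isAsciiLetter(first) || first == '_')) {
            return "name cannot begin with '" + first + "'";
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean ok = isAsciiLetter(c) || (c >= '0' && c <= '9')
                    || c == '_' || c == '-' || c == '.';
            if (!ok) {
                return "name cannot contain '" + c + "'";
            }
        }
        return null;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}

final class SketchElement {
    private final String name;

    SketchElement(String name) {
        // The check runs at construction, so a bad name fails here
        // instead of surfacing later when the document is written out.
        String reason = NameVerifier.checkName(name);
        if (reason != null) {
            throw new IllegalArgumentException(
                "illegal element name \"" + name + "\": " + reason);
        }
        this.name = name;
    }

    String getName() {
        return name;
    }
}
```

A programmer using this would write only new SketchElement("title") and 
never touch the checker directly; a mistake such as 
new SketchElement("has space") throws at the line that caused it.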

>  XML seemed at the time to offer real help in
>the kinds of applications I have made my career doing.  DOM was clunky AND
>untenably slow.

Have you checked out DOM lately? Several implementations have gotten a 
lot better in the last couple of years.

>For those kinds of applications where you type the command and
>can wait while your cpu pegs at 100% for 5-10-30 seconds and you get the
>desired result, the performance is no big deal.  For the kinds of things I
>typically have to deal with, this is not even close to good enough.  If JDOM
>can't live up to the light-weight goal, nothing else will matter.  While the
>DOM api may be clunky to use, most developers can manage it, more tools support
>it, there are more books about it, etc etc etc..  If JDOM doesn't beat DOM in
>memory use and performance, it will have failed in its most important goals.

No, these are not JDOM's most important goals, nor should they be. 
They are important, yes, but far less so than ease of use and 
correctness. JDOM's biggest selling feature has always been ease of 
use. I've never seen anybody pick JDOM for performance or memory 
reasons. For one thing, it's not at all clear that JDOM is faster or 
uses less memory than modern DOMs like Xerces-2. The benchmarks in 
this area range from abominable to non-existent, and are typically 
written to prove that the author's pet API is better than the 
alternatives.

>From what I can tell, we don't have a lot of rocks left to uncover to improve
>performance, so a 10-30% hit for verification is really pretty significant.

Would you accept a Date class that gave you the wrong time 
occasionally but was 30% faster? Would you accept an Account class 
that reported the wrong amount of money in the bank account, but was 
300% faster? Why would you want a Document class that allowed 
malformed documents?

>So where is everybody else at here?  I would define a performance hit I could
>live with for verification at 10% or less.

Can you demonstrate that the performance hit is worse than that on average?
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|             http://www.cafeconleche.org/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
+----------------------------------+---------------------------------+


