[jdom-interest] Internal DTD subset verification

Tue Apr 30 19:20:01 PDT 2002

At 7:24 PM -0400 4/30/02, Alex Rosen wrote:

>Currently we're verifying the data in every JDOM object that gets created.
>It's obvious that this is wasteful. The parser is required to check every
>character of the document for validity, and we're checking everything again
>afterwards. Ugh.

Not quite so obvious. There are parsers that fail to test everything 
they're supposed to test. For instance, the JDOM character checks you 
dislike are substantially more accurate than what at least one major 
parser does. And it's not obvious that turning off the redundant
checks would really help parse time, which is going to be dominated by
I/O in any case. However, I could live with turning them off if
profiling proved it helped.

>If we ignore that, and assume that we come up with some mechanism to turn
>off verification for parsing, then we're left with verification when you're
>building JDOM objects programmatically. How often is this verification
>useful? First let's think about names. Very commonly, the names of elements
>and attributes are fixed - e.g. stored as String constants in a program. If
>this program creates 1000 documents, these names will currently be checked
>1000 times, even though they never change. Worse, each name may get checked
>dozens or hundreds of times in each document, if there are repeating
>elements or attributes. (We're talking about the exact same String object,
>getting checked over and over again.) The document type you're building may
>have an internal DTD subset that you've stored as a constant String. Now
>we're saying that in an "ideal" world, this String would actually be parsed
>every time? Ugh.

It might be possible to add a table of previously checked strings to
Verifier that checkXMLName() could consult. I'm not sure it would
really help. I suggest leaving it for 1.1. In the tests I've seen,
turning off verification has not sped anything up greatly. Maybe 20%
at most, and probably less than that.
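
To make the table idea concrete, here is a minimal sketch, not JDOM's
actual Verifier. The class CachingNameVerifier, the FULL_CHECKS counter,
and the fullCheck() helper are hypothetical names, and the name check
itself is an ASCII-only approximation rather than the real XML Name
production. The return convention (null for an acceptable name, a reason
string otherwise) is likewise an assumption made for the sketch.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: a table of already-verified names placed in
 * front of a name check. Not JDOM's Verifier.
 */
final class CachingNameVerifier {
    // Names that have already passed the full check (thread-safe set).
    private static final Set<String> VERIFIED = ConcurrentHashMap.newKeySet();

    // Counts how often the full check runs; for demonstration only.
    static final AtomicInteger FULL_CHECKS = new AtomicInteger();

    /** Returns null if the name is acceptable, otherwise a reason. */
    static String checkXMLName(String name) {
        if (name == null) {
            return "Name is null";
        }
        if (VERIFIED.contains(name)) {
            return null; // seen before and known to be fine
        }
        String reason = fullCheck(name);
        if (reason == null) {
            VERIFIED.add(name); // only successful checks are remembered
        }
        return reason;
    }

    // ASSUMPTION: ASCII-only stand-in for the real XML Name rules.
    private static String fullCheck(String name) {
        FULL_CHECKS.incrementAndGet();
        if (name.isEmpty()) {
            return "Name is empty";
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean startOk = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || c == '_' || c == ':';
            boolean restOk = startOk || (c >= '0' && c <= '9')
                    || c == '-' || c == '.';
            if (i == 0 ? !startOk : !restOk) {
                return "Illegal character at index " + i;
            }
        }
        return null;
    }
}
```

Two caveats on this sketch. The table is unbounded, so it would grow
without limit if names came from user input. And the lookup still has to
hash and compare the string, so for short names the saving over simply
rechecking may be small, which is why profiling should come first.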

>Now let's consider text content, in attribute values and element content.
>This content is variable much more frequently - it may even come from user
>input. But the check we do on this content is just that it contains legit
>XML characters - that it doesn't contain any nulls, or vertical tabs, or
>invalid Unicode characters. I can't even imagine how you'd write an app that
>accidentally allowed any of these characters to sneak in.  The vast majority
>of the time, this'll just be a waste.

I totally disagree. I see users trying to sneak this content into XML 
documents all the time. Just last week I met somebody who was 
complaining because they couldn't fit the null bytes from their 
database into XML. If we let people do it, some people will do it. 
This would be bad for the XML community at large. We do not want JDOM 
to be responsible for polluting the XML environment. If we allow it, 
the accidental pollution will arise almost immediately because
programmers will write code that reads existing data containing
illegal characters and stuffs it into JDOM text. This can occur either
because the original, non-XML data contains control characters or
because the programmers screwed up the encoding of the data they're
reading, or (more likely) didn't pay attention to the encoding at all.
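
For reference, the kind of check under discussion can be sketched in a
few lines. This is an illustration, not JDOM's code: the class XmlChars
and its method names are hypothetical. The ranges are those of the Char
production in XML 1.0, which excludes nulls, vertical tabs, most other
C0 controls, unpaired surrogates, and U+FFFE/U+FFFF.

```java
/** Hypothetical sketch of an XML 1.0 legal-character check. */
final class XmlChars {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index of the first illegal character in text,
     * or -1 if every character is legal.
     */
    static int firstIllegalIndex(String text) {
        int i = 0;
        while (i < text.length()) {
            // codePointAt returns a lone surrogate unchanged, and
            // isXmlChar rejects it because it falls in D800-DFFF.
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

A database value containing a null byte, for example, fails this check
at the index of the null, so the problem surfaces when the text is set
rather than when someone else tries to parse the output.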

>Two of the philosophies of the design of C++ were "you don't pay for stuff
>you don't use", and "trust the programmer". These aren't quite as central to
>the philosophy of Java, but I think they're still useful to consider, and it
>seems like we're almost going out of our way here to break these rules.
>

They're even less part of the design of XML. XML is deliberately 
draconian. There are very good reasons for XML to be inflexible about 
what it allows, even at the cost of convenience, even at the cost of 
performance.

>The other solution would be to make the verifier optional, so you can run it
>on your whole document before you output it, if you want. True, many people
>wouldn't run it, but at some point we've got to trust the programmer.
>Besides, usually the worst that happens is that the programmer will discover
>the error as soon as the document is parsed, which almost always isn't too
>much later. It's only the uncommon case where element and attribute names
>are not fixed (e.g. they come from user input), that this might actually
>catch runtime bugs. In that case, I think we have to trust the programmer to
>verify the input themselves, or to run the JDOM verifier on the document
>before outputting it. Otherwise you're making everyone pay for something
>that will only benefit a minority of programmers.
>

I think verification is a big help for the developer and an even 
bigger help for the person who has to consume what the JDOM developer 
produces. The whole spirit of XML is that it simply does not allow 
malformedness at any time. And developers can't be trusted to
maintain well-formedness in their code any more than they can be
trusted to properly allocate and free memory (i.e., not at all). I
constantly see developers trying to break well-formedness. In another 
thread on this very list today, someone's trying to figure out how to 
work with a document that's missing its end-tag. I myself, who really 
should know better, have published public XML applications that ran 
for months while generating malformed documents before I noticed the 
problem. That was because I was producing the XML manually rather 
than using a clean API like JDOM that would have caught my problem 
immediately. Verification is essential for an API that claims to be 
able to generate XML.
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|             http://www.cafeconleche.org/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
+----------------------------------+---------------------------------+