[jdom-interest] Internal DTD subset verification

Elliotte Rusty Harold elharo at metalab.unc.edu
Tue May 7 08:38:41 PDT 2002


At 11:04 AM -0400 5/7/02, Alex Rosen wrote:
>>  >It would mean *gasp* that there exists the possibility of a jdom
>>  >document living in memory that could not be produced as an xml
>>  >document.
>>
>>  This possibility is a major flaw in XInclude and a few other
>>  technologies now. It is causing implementors and users problems
>>  *today*.
>
>Can you give some examples?
>

The namespace mappings in scope get thoroughly mucked up when pieces 
of one document are included into another, sometimes in ways that no 
possible collection of xmlns attributes can produce.
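
To make that concrete, here is a minimal sketch of one way it can
happen. The scenario is an assumption chosen for illustration: an
included element whose in-scope namespaces do not contain a prefix that
its new parent declares. Namespaces in XML 1.0 lets a child rebind a
prefix but gives no syntax for unbinding one, so the only attribute
that could express this, xmlns:p="", is rejected by a namespace-aware
parser. The class and method names are hypothetical; only the JDK's
built-in JAXP parser is used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class UndeclarePrefixDemo {
    /** Returns true if a namespace-aware parser accepts the text. */
    static boolean parsesWithNamespaces(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O trouble
        }
    }

    public static void main(String[] args) {
        // Rebinding prefix p to a different URI is legal.
        String rebound = "<a xmlns:p='urn:one'><b xmlns:p='urn:two'/></a>";
        // Unbinding prefix p is not expressible in Namespaces in XML 1.0.
        String unbound = "<a xmlns:p='urn:one'><b xmlns:p=''/></a>";
        System.out.println("rebound parses: " + parsesWithNamespaces(rebound));
        System.out.println("unbound parses: " + parsesWithNamespaces(unbound));
    }
}
```

The parser may also print its own error message to stderr for the
second document; that is just its default error reporting.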

After inclusion, the same entity or notation name can be mapped to
different things in different parts of the merged document. However,
a DTD can define each name only once.
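
A small sketch of the "once per name" rule, using the JDK's built-in
JAXP parser. XML 1.0 says that when an entity is declared more than
once, the first declaration is binding, so a second definition of the
same name cannot take effect anywhere in the document. The class name,
method name, and sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DuplicateEntityDemo {
    /** Parses the text and returns the root element's expanded text content. */
    static String expand(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Entity e is declared twice; only the first declaration counts.
        String xml = "<!DOCTYPE r [<!ENTITY e 'one'><!ENTITY e 'two'>]><r>&e;</r>";
        System.out.println(expand(xml)); // prints "one"
    }
}
```

So if two included fragments each expect their own meaning for &e;,
no single DTD can give both of them what they want.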

Consequently it can be impossible to serialize these constructed 
infosets to save them or pass them to other processes and APIs. For 
instance, what comes out of a DOM XInclude process may break when fed 
into a SAX or JDOM or XSLT process and vice versa. The serialized 
form of a document is the only thing the different APIs and 
environments have in common. When there is no longer a possible 
serialized form, you lose interoperability between systems.

>>  If we let it into JDOM, it will cause JDOM problems too. We
>>  really, really do not want to allow this.
>
>I've never figured out what real-world problems would occur more than
>rarely, if we were less than 100% perfect in our well-formedness checking.
>As I mentioned, that doesn't seem to have slowed down DOM (or Xerces or
>Crimson). The user of JDOM must take some responsibility for writing a
>correct program. You want to protect them from their own XML ignorance, as
>well as protecting you and me from having to deal with the result of that
>ignorance. But even if we stamp out malformed documents, the user can still
>create invalid documents, or even valid but semantically nonsensical
>documents. We can't protect everyone from everything. Why draw this
>particular line in the sand? The line with most APIs is, protect users from
>themselves when it's cheap but not when it's expensive, because otherwise
>they'll use a different API. That's the line I would draw. If character
>verification slows JDOM down by 10% or more in a benchmark, that'll make
>some percentage of people not use it, which is just counterproductive.
>

We draw the line in the sand at well-formedness because that's the 
minimum requirement for interoperability.

The failure to do 100% character checking has caused problems for
users of DOM. About once a month there's a post on xml-dev from
somebody trying to figure out why the binary data embedded in their
XML document fails when passed to a tool that's stricter than the
ones they're accustomed to.
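
For reference, the check itself is small. Here is a minimal sketch of
the XML 1.0 Char production as a standalone method. This is not JDOM's
actual verifier, and the class and method names are hypothetical.

```java
public class XmlCharCheck {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first illegal character, or -1 if none. */
    static int firstIllegalChar(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // an unpaired surrogate comes back as itself
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String clean = "plain text\n";
        String binary = "ab" + (char) 0x01 + "cd"; // raw control character
        System.out.println(firstIllegalChar(clean));  // prints -1
        System.out.println(firstIllegalChar(binary)); // prints 2
    }
}
```

Raw binary data almost always contains control characters such as
0x01, which is why it has to be encoded (Base64, for example) before
it goes into a document.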

FYI, XML 1.1, if adopted (which I hope it isn't), would make character
checking somewhat more efficient because the rules are a lot simpler.

>Well, the XML declaration is not modeled by a JDOM Document, and the
>whitespace between attributes is not modeled by JDOM at all. And this isn't
>our fault even, it's SAX's fault. (I'm not sure if this is actually relevant
>to the discussion or if I'm just being nit-picky).
>

You're being nit-picky. :-)  Those are mostly artifacts of
serialization rather than part of the document's information content.
For example, you can change a document from Latin-1 to UTF-8 without
changing its information content; the conversion is lossless and
reversible. The XML spec is explicit that white space between
attributes and inside tags doesn't matter either.
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|             http://www.cafeconleche.org/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
+----------------------------------+---------------------------------+


