[jdom-interest] Substituting a different <!DOCTYPE ...> when parsing an XML file

Geoff Rimmer geoff.rimmer at sillyfish.com
Thu May 30 09:36:05 PDT 2002


When parsing XML files with validation switched on, I think that in
95% of cases, it should be the *application* rather than the XML file
that specifies which DTD file to validate against.

For example, suppose we have an application that reads in a list of
countries from a file (countries.xml):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE countries SYSTEM "http://www.sillyfish.com/countries.dtd">

    <countries>
      <country name="France" />
      <country name="Belgium" />
      <country name="Italy" />
      <country name="Germany" />
    </countries>

where countries.dtd looks like this:

    <!ELEMENT countries (country*)>
    <!ELEMENT country EMPTY>
    <!ATTLIST country
        name CDATA #REQUIRED>

then our application can read in the data (with validation) as follows:

    // NOTE: validation switched on by using "true"
    for ( Iterator iter = new SAXBuilder( true )
            .build( new FileInputStream( "countries.xml" ) )
            .getRootElement().getChildren( "country" ).iterator();
          iter.hasNext(); )
    {
        Element e = (Element)iter.next();
        System.out.println(
            "Found country: " + e.getAttributeValue( "name" ) );
    }

and we would get the following output:

    Found country: France
    Found country: Belgium
    Found country: Italy
    Found country: Germany

No problem here.  But there's nothing to stop us sneaking in an
alternate "countries.xml" file which is in a completely different
format to that expected; for example suppose the contents of
countries.xml were as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE shops SYSTEM "http://www.sillyfish.com/shops.dtd">

    <shops>
      <shop>
        <name>McDonalds</name>
        <sells>Burgers</sells>
      </shop>
      <shop>
        <name>KFC</name>
        <sells>Chicken</sells>
      </shop>
    </shops>

where shops.dtd looks like this:

    <!ELEMENT shops (shop*)>
    <!ELEMENT shop (name,sells)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT sells (#PCDATA)>

Now, when we read in *this* version of countries.xml (which now
contains details of shops, not countries) using JDOM, validation
still succeeds.  This is because the XML file names its own DTD,
against which it validates successfully (even though that DTD
describes a document about shops, not countries).

The Java code above will simply determine that the file contains no
<country> elements, and will therefore output nothing.  In fact it
would be totally unaware that a junk XML file had just been parsed.
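
Short of changing the parser, the application can at least notice the
mismatch itself by checking the DOCTYPE and root element after parsing.
The following is only a minimal sketch of that idea.  It uses the JDK's
own DOM API rather than JDOM so that it is self-contained; the class
name DoctypeCheck and the method looksLikeCountries() are hypothetical,
and the check does not validate anything (the external DTD is
deliberately replaced by an empty one so no network access is needed).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

// Hypothetical helper: rejects documents that do not claim to be "countries".
public class DoctypeCheck {
    static final String EXPECTED_ROOT = "countries";
    static final String EXPECTED_SYSTEM_ID = "http://www.sillyfish.com/countries.dtd";

    static boolean looksLikeCountries(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve every external DTD to an empty one: we only want to
            // inspect the DOCTYPE here, not fetch or validate against it.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            DocumentType docType = doc.getDoctype();
            return docType != null
                && EXPECTED_ROOT.equals(docType.getName())
                && EXPECTED_SYSTEM_ID.equals(docType.getSystemId())
                && EXPECTED_ROOT.equals(doc.getDocumentElement().getTagName());
        } catch (Exception e) {
            // Unparseable input is treated as "not a countries document".
            return false;
        }
    }
}
```

A check like this only confirms what the file *claims* to be; it would
still have to be combined with a validating parse to be of real use.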

What would be nice is if the Java code could specify which DTD to
validate against.  For example if SAXBuilder had a new build() method
with an extra parameter of type DocType:

    /**
     * This builds a document from the supplied input stream,
     * replacing any <!DOCTYPE...> with the specified docType.
     *
     * If validation has been switched on, the document will
     * be validated against the specified docType irrespective
     * of what DOCTYPE is specified in the file.
     **/
    public Document build( InputStream in, DocType docType );

then we could write:

    DocType docType = new DocType(
        "countries", "http://www.sillyfish.com/countries.dtd" );

    for ( Iterator iter = new SAXBuilder( true )
            .build( new FileInputStream( "countries.xml" ), docType )
            .getRootElement().getChildren( "country" ).iterator();
          iter.hasNext(); )
    {
        Element e = (Element)iter.next();
        System.out.println(
            "Found country: " + e.getAttributeValue( "name" ) );
    }

and any bogus files would be rejected, as they would be forced to
validate against "countries.dtd" - whether they liked it or not.

Of course this would require SAXBuilder to be a bit clever when
reading in the XML file, as it would need to remove any DOCTYPE it
finds and insert a new one generated from the supplied DocType object.

My question is: what is the best way of doing this?  Is it just to
provide a utility filter class which replaces a file's DOCTYPE with
another:

    class DocTypeReplacerInputStream extends FilterInputStream
    {
        public DocTypeReplacerInputStream( InputStream is, DocType docType )
        {
            ...
        }

        public int read() throws IOException
        {
            ...
        }
    }

which you would use as follows:

    DocType docType = new DocType(
        "countries", "http://www.sillyfish.com/countries.dtd" );

    Document doc = new SAXBuilder( true ).build(
        new DocTypeReplacerInputStream(
            new FileInputStream( "countries.xml" ), docType ) );

    (I've written a very basic DocTypeReplacerInputStream class, but
    it is not exactly what you'd call sophisticated and it probably
    doesn't work with many of the XML files that could be thrown at
    it.  Has anyone else written a similar class?)

or is it to do as suggested above, namely to have extra build()
methods in SAXBuilder so that a replacement DocType can be specified
(and then make the DocTypeReplacerInputStream functionality part of
SAXBuilder)?
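
To make the DOCTYPE replacement idea concrete, here is a deliberately
naive sketch that works on the whole document as a String rather than
as a FilterInputStream.  The class name DoctypeReplacer and the method
replaceDoctype() are hypothetical, and the regular expression is an
assumption that covers only simple cases: it will be fooled by a
"<!DOCTYPE" inside a comment or CDATA section, and by "]" or ">"
inside quoted literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: swaps a document's DOCTYPE for one chosen by the caller.
public class DoctypeReplacer {
    // <!DOCTYPE ...> with an optional [internal subset]; simple cases only.
    private static final Pattern DOCTYPE =
        Pattern.compile("<!DOCTYPE\\s[^\\[>]*(\\[[^\\]]*\\])?\\s*>");
    // The XML declaration, which must stay at the very start of the document.
    private static final Pattern XML_DECL =
        Pattern.compile("^<\\?xml[^>]*\\?>");

    static String replaceDoctype(String xml, String rootName, String systemId) {
        String replacement =
            "<!DOCTYPE " + rootName + " SYSTEM \"" + systemId + "\">";

        // Case 1: the document has a DOCTYPE, so overwrite it in place.
        Matcher doctype = DOCTYPE.matcher(xml);
        if (doctype.find()) {
            return xml.substring(0, doctype.start())
                + replacement
                + xml.substring(doctype.end());
        }

        // Case 2: no DOCTYPE, so insert one just after the XML declaration.
        Matcher decl = XML_DECL.matcher(xml);
        if (decl.find()) {
            return xml.substring(0, decl.end())
                + "\n" + replacement
                + xml.substring(decl.end());
        }

        // Case 3: neither is present, so put the DOCTYPE first.
        return replacement + "\n" + xml;
    }
}
```

Feeding the result to a validating parser should then reject the
"shops" file, since its root element no longer matches the DOCTYPE
name and none of its elements are declared in countries.dtd.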

Personally I think it would be such a useful thing to have that it
*should* be part of SAXBuilder (as I have described above), but I
would be interested to hear what other people think.
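
One further possibility, not covered by either option above, is to
leave the DOCTYPE alone and instead redirect whatever DTD it names to
the one the application wants, using a SAX EntityResolver.  The sketch
below uses the JDK's SAX parser directly so that it is self-contained;
the class name ForcedDtdDemo and the method validate() are
hypothetical, and the DTD is inlined as a string purely for
illustration.  SAXBuilder appears to accept a resolver through
setEntityResolver(), but that should be checked against the JDOM
version in use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: every external DTD reference resolves to countries.dtd.
public class ForcedDtdDemo {
    // The DTD the application insists on (inlined for the sketch).
    static final String COUNTRIES_DTD =
          "<!ELEMENT countries (country*)>\n"
        + "<!ELEMENT country EMPTY>\n"
        + "<!ATTLIST country name CDATA #REQUIRED>\n";

    /** Returns the problems found; an empty list means the document is valid. */
    static List<String> validate(String xml) {
        final List<String> problems = new ArrayList<String>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            DefaultHandler handler = new DefaultHandler() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Ignore the DTD the file asked for and hand back ours.
                    return new InputSource(new StringReader(COUNTRIES_DTD));
                }

                @Override
                public void error(SAXParseException e) {
                    // Validation errors are recoverable, so collect them all.
                    problems.add(e.getMessage());
                }
            };
            reader.setEntityResolver(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Fatal errors (e.g. malformed XML) also count as problems.
            problems.add(e.toString());
        }
        return problems;
    }
}
```

This is not watertight: a file could still declare its own elements in
an internal DTD subset, or use <country> as its root element, so the
application would probably still want to check the root element name.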

-- 
Geoff Rimmer <> geoff.rimmer at sillyfish.com <> www.sillyfish.com


