[jdom-interest] Resolving Entities...when no DTD is assigned (no DOCTYPE declaration) in XML

Per Norrman per.norrman at austers.se
Thu Sep 1 15:44:24 PDT 2005


Hi,

Funny, I know I had something around deep down in the tool chest ....

Anyways, a little trick with entities:

Get this file: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
and store it in a convenient location, call it <ENTITYFILE>.

When you want to parse a file that you know contains undeclared
entities from this set (could be others, but you should get
the general idea), the trick is to wrap that file in another
XML file that actually declares these entities and pulls the
original in as an external entity. This is how I solved it,
assuming that the stuff I wanted to parse was accessible as a
File. But that is certainly doable in other cases, too.

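     import java.io.File;
     import java.io.IOException;
     import java.io.StringReader;
     import java.text.MessageFormat;
     import org.jdom.Document;
     import org.jdom.Element;
     import org.jdom.JDOMException;
     import org.jdom.input.SAXBuilder;

     // Wrapper document: {0} is the entity file, {1} is the file to parse.
     // Both end up as system identifiers, so they should be URIs.
     static String template =
               "<!DOCTYPE x [ "
             + "<!ENTITY % entities SYSTEM \"{0}\"> "
             + "<!ENTITY file SYSTEM \"{1}\"> "
             + "%entities; ]>"
             + "<x>&file;</x>";

     private static String createFromTemplate(File entityFile, File xmlFile) {
         // toURI() yields file: URIs, which also cope with Windows paths
         // and with paths that contain spaces.
         return MessageFormat.format(template, entityFile.toURI().toString(),
                 xmlFile.toURI().toString());
     }

     private static File entityFile = new File(<ENTITYFILE>);

     public Document parse(File f) throws JDOMException, IOException {
         String xml = createFromTemplate(entityFile, f);
         SAXBuilder builder = new SAXBuilder();
         // Parsing the wrapper expands &file; and with it the real content.
         Document doc = builder.build(new StringReader(xml));
         // The first child of <x> is the root element of the original file.
         Element e = (Element) doc.getRootElement().getChildren().get(0);
         // Detach it from the wrapper so it can become the root of its own document.
         e.detach();
         return new Document(e);
     }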

So you just call

     Document doc = xxx.parse(new File(<whateverpathyouhave>));
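
For the "other cases", where the XML is not sitting in a File, one
possible variant is to keep the same wrapper but serve both external
entities from memory through a SAX EntityResolver. The sketch below is
only an illustration under some assumptions: it uses the JDK's own
DocumentBuilder instead of JDOM so that it runs without extra jars, and
the class name and the two urn: system identifiers are made up for the
example (they are never fetched, only matched by the resolver). JDOM's
SAXBuilder has a setEntityResolver method as well, so the same resolver
idea should carry over.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JDOM or of the original post.
class InMemoryEntityWrapper {

    // Made-up system identifiers; the resolver below maps them to byte arrays.
    static final String ENTITIES_ID = "urn:example:entities";
    static final String CONTENT_ID = "urn:example:content";

    // Same wrapper idea as the template above, with fixed identifiers.
    static final String WRAPPER =
          "<!DOCTYPE x [ "
        + "<!ENTITY % entities SYSTEM \"" + ENTITIES_ID + "\"> "
        + "<!ENTITY file SYSTEM \"" + CONTENT_ID + "\"> "
        + "%entities; ]>"
        + "<x>&file;</x>";

    // entityDecls: bytes of an entity declaration file (e.g. xhtml-lat1.ent)
    // xml: bytes of the document that uses the undeclared entities
    static Document parse(byte[] entityDecls, byte[] xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Hand the parser in-memory streams instead of letting it open URLs.
        builder.setEntityResolver((publicId, systemId) -> {
            if (ENTITIES_ID.equals(systemId)) {
                return new InputSource(new ByteArrayInputStream(entityDecls));
            }
            if (CONTENT_ID.equals(systemId)) {
                return new InputSource(new ByteArrayInputStream(xml));
            }
            return null; // anything else: default parser behaviour
        });

        Document wrapped =
                builder.parse(new InputSource(new StringReader(WRAPPER)));

        // The first element child of <x> is the root of the real document;
        // copy it into a fresh document so the wrapper disappears.
        Element x = wrapped.getDocumentElement();
        for (Node n = x.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                Document result = builder.newDocument();
                result.appendChild(result.importNode(n, true));
                return result;
            }
        }
        throw new IllegalStateException("wrapped content had no root element");
    }
}
```

Usage would be something like InMemoryEntityWrapper.parse(entityBytes, xmlBytes),
with entityBytes read once from <ENTITYFILE> and kept around. The wrapped
content may still start with its own <?xml ...?> line, since that is
allowed at the start of an external entity.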

Enjoy!

/pmn



Vish D. skrev:
> Hello all,
> 
> I am having some trouble figuring out how to go about resolving entities 
> when an XML file doesn't have a DOCTYPE declaration (no DTD attached to 
> it), but contains entities that are 'non-standard' (such as '&nbsp;', 
> etc...). I need to do this in such a way that I don't change the XML 
> file (without adding a DOCTYPE declaration, etc..).
> 
> My need for the above is as follows:
> 
> SAXBuilder builder = new SAXBuilder();
> ....
> fulltextXML = builder.build(new FileInputStream(filename));
> 
> -- fails with an exception ---
> 
> C:\HTMLs\00063185_200_1_67\00063185_200_1_67_Document.xml is not 
> well-formed.
> org.jdom.input.JDOMParseException: Error on line 5: The entity "nbsp" 
> was referenced, but not declared.
> Error on line 5: The entity "nbsp" was referenced, but not declared.
> 
> 
> Is there a way to resolve such entities, without having to declare the 
> DOCTYPE in the XML file?
> 
> 
> 
> Thanks in advance!
> 
> Vish
> 
> 
> Sample XML file:
> 
> XML FILE
> --------------
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <object_document>
>     <art_title>        Muscular Alteration of Gill Geometry in vitro: 
> Implications for Bivalve Pumping Processes -- Medler and Silverman 200 
> (1): 77 -- The Biological Bulletin</art_title>
>     <converted_from type='HTML'>BiolBull V 200 I 1 P 77 Fulltext 
> 00063185.htm</converted_from>
>     <fulltext>&nbsp;Biol. Bull.  200: 77-86. (February 2001)&#169; 2001 
> Marine Biological LaboratoryMuscular Alteration of Gill Geometry in 
> vitro: Implications for Bivalve Pumping ProcessesScott Medler* and 
> Harold SilvermanLouisiana State University, Baton Rouge, Louisiana 
> 70803* Author to whom correspondence should be addressed. Current 
> address: Department of Biology, Colorado State University, Ft. Collins, 
> CO 80523. E-mail: Skmedler{at}aol.com<!-- var u = "Skmedler", d = 
> "aol.com <http://aol.com>"; document.getElementById("em0").innerHTML = 
> "" + u + "@" + d + ""//-->
> &nbsp;Received 23 March 2000; accepted 19 October 2000.
> </fulltext>
>     <jrnl_title>BiolBull</jrnl_title>
>     <issn>00063185</issn>
>     <volume>200</volume>
>     <issue>1</issue>
>     <fpage>77</fpage>
> </object_document>