[jdom-interest] Entity expansion - override or pass-through?

philip.nelson at omniresources.com philip.nelson at omniresources.com
Thu Oct 11 11:40:58 PDT 2001


This is a known problem, for which I posted a possible but somewhat painful
solution in the last couple of weeks.  The parser (not JDOM) expands
character entities to their Unicode values regardless of the setting of the
DTDHandler.
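
To illustrate that the expansion happens inside the parser, before any
handler (and therefore before JDOM) sees the text, here is a minimal sketch
using only the SAX parser bundled with the JDK.  The class and method names
(ExpansionDemo, textSeenByHandler) are made up for this example, and the
one-line internal DTD is a stand-in for the real entity files.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of JDOM or of the original application.
class ExpansionDemo {

    /** Parses the XML with the JDK's default SAX parser and returns all text passed to characters(). */
    static String textSeenByHandler(String xml) {
        StringBuilder seen = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver text in several chunks; collect them all.
                seen.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        // Same style of declaration as in the ISO entity files: a name mapped to a character reference.
        String xml = "<!DOCTYPE p [<!ENTITY Delta \"&#916;\">]><p>&Delta;T</p>";
        String text = textSeenByHandler(xml);
        // Print the code point of the first character the handler received.
        System.out.println((int) text.charAt(0));
    }
}
```

The handler never sees the string "&Delta;", only the replacement
character, which is why a setting on the JDOM side alone cannot bring the
reference back.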

You should be able to get around this, however, by simply using an external
DTD.  Since your parameter entities are already using external files, I
would suggest that it should be pretty painless to move the whole DTD to an
external file.  Then SAX won't have to parse the character entities in the
DTDHandler and you should be OK.

Of course, I can't say that I've tried it....
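
If the external DTD route does not pan out, another option (also untried
against the application described below) would be to let the parser expand
everything and then re-escape non-ASCII characters when writing the output.
The following is only a sketch under that assumption: EntityEscaper and its
lookup table are hypothetical, and the two table entries are taken from the
entity declarations quoted further down.  Since epsi and epsis both map to
#949, a reverse table has to pick one name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of JDOM: re-escapes non-ASCII characters on output.
class EntityEscaper {

    // Code point -> entity name. Only two sample entries; a real table would
    // have to be built from the ISO entity files.
    private static final Map<Integer, String> NAMES = new HashMap<>();
    static {
        NAMES.put(0x0394, "Delta"); // &#916;
        NAMES.put(0x03B5, "epsi");  // &#949; (epsis shares this value)
    }

    /** Replaces every non-ASCII code point with a named or numeric reference. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);                 // plain ASCII passes through
            } else if (NAMES.containsKey(cp)) {
                out.append('&').append(NAMES.get(cp)).append(';');
            } else {
                out.append("&#").append(cp).append(';'); // decimal fallback
            }
        });
        return out.toString();
    }
}
```

This would be applied to text that has already been serialized, so that
markup characters such as "<" and "&" are escaped by the serializer first
and the references added here are left alone.  The result is pure ASCII,
which sidesteps the problem of not being able to declare the encoding.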


-----Original Message-----
From: Steven Sodt [mailto:steve.sodt at rowe.com]
Sent: Wednesday, October 10, 2001 4:08 PM
To: jdom-interest at jdom.org
Subject: [jdom-interest] Entity expansion - override or pass-through?
Importance: High


Perhaps a newbie question, but here goes.  Is there a means of passing
entity reference text (e.g. "&beta;", "&delta;") unresolved into non-Unicode
(e.g. ISO-8859-1, ASCII, or Cp1250) text output that will ultimately be
incorporated into HTML?  The intent on the output side is to give the
browser a chance on its own to render the corresponding character.

The application involved currently handles parsing of UTF-8 encoded xml
files using a PubMed.dtd for validation and utilizes or variously references
19 external ISO... character entity files in the process.  It has no problem
writing out browser-renderable UTF-8 files or text, but because we're not
able to specify the encoding in the HTTP header or in HTML META tags in the
destination application, passing the entity references through unaltered
seems the best option.  

The problem appears to be that simply setting .setExpandEntities() to false
results in the references being stripped from the output.  Altering the
replacement character reference(s) in the external entity reference files to
reflect the entity name (replace "&#916;" with "&Delta;") results in the
parser generating a "recursive reference" error.

Any and all suggestions are welcome.  And thanks in advance for any
assistance rendered.

    //Load XML into JDOM document
    SAXBuilder builder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
    builder.setValidation(true);
    builder.setExpandEntities(false);

    Document doc = builder.build(new FileInputStream(parseFile), docType);

    <!ENTITY % ISOlat1  PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN" "ISOlat1"> %ISOlat1;
    <!ENTITY % ISOlat2  PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN" "ISOlat2"> %ISOlat2;
    ......
    <!ENTITY Delta      "&#916;"   ><!--U0394 =capital Delta, Greek -->
    <!ENTITY epsi       "&#949;"   ><!--U03B5 =small epsilon, Greek -->
    <!ENTITY epsis      "&#949;"   ><!--U03B5 /straightepsilon -->
    ......

Steven Sodt
RoweCom, Inc. / Information Quest
steve.sodt at rowe.com
781-329-3350 x 3503
