[jdom-interest] Entity expansion - override or pass-through?

Steven Sodt steve.sodt at rowe.com
Fri Oct 12 06:32:22 PDT 2001


Philip, et al. -

Thank you; I will seek out the solution you posted.  I don't know whether
explicitly invoking Xerces 1.4.3 made any difference in this:

    SAXBuilder builder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");

What I realized belatedly is that, with expansion shut off, the entity
references perhaps weren't disappearing; I simply hadn't done the work to
capture them.  Invoking .getContent() on the respective element and then
assembling a string from the resulting list of objects seems to work.  I
have to think, though, that there is a better way of tackling this than
what follows, which checks the class name of each object in the list
returned by .getContent() in order to obtain the corresponding string:

            ...
            Object item = ci.next();
            Class next = item.getClass();
            String whatami = next.getName();
            if (whatami.equals("org.jdom.EntityRef")) {
                EntityRef er = (EntityRef) item;
                item = er.getName();
            ...
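
One possible alternative to comparing class names is an instanceof test.
The sketch below is deliberately self-contained so it can run without JDOM
on the classpath: EntityRefStandIn is a hypothetical stand-in for
org.jdom.EntityRef (only getName() is mirrored), and ContentJoiner and join
are illustrative names, not part of any library.  Unlike the fragment above,
it also wraps the name in "&" and ";" so the reference is written back out
in full; that is an assumption about the desired output.

```java
import java.util.List;

// Hypothetical stand-in for org.jdom.EntityRef; only getName() is mirrored.
class EntityRefStandIn {
    private final String name;
    EntityRefStandIn(String name) { this.name = name; }
    String getName() { return name; }
}

class ContentJoiner {
    // Walks a mixed content list and rebuilds the text, writing each
    // entity reference back out as &name; instead of dropping it.
    static String join(List<?> content) {
        StringBuilder out = new StringBuilder();
        for (Object item : content) {
            if (item instanceof EntityRefStandIn) {
                String name = ((EntityRefStandIn) item).getName();
                out.append('&').append(name).append(';');
            } else {
                // Assumption: every other item is text-like, so its
                // toString() value is the text to emit.
                out.append(item);
            }
        }
        return out.toString();
    }
}
```

For a list holding "a ", an EntityRefStandIn named "beta", and " b", join
returns "a &beta; b".  With the real JDOM classes the same test would
presumably be written against EntityRef and the list returned by
.getContent(); child elements would need their own handling.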

Thank you again and best regards,

Steven Sodt

-----Original Message-----
From: jdom-interest-admin at jdom.org
[mailto:jdom-interest-admin at jdom.org] On Behalf Of philip.nelson at omniresources.com
Sent: Thursday, October 11, 2001 2:41 PM
To: steve.sodt at rowe.com; jdom-interest at jdom.org
Subject: RE: [jdom-interest] Entity expansion - override or pass-through?


This is a known problem, for which I posted a possible but somewhat painful
solution in the last couple of weeks.  The parser (not JDOM) expands
character entities to their Unicode values regardless of the setting of the
dtdhandler.

You should be able to get around this, however, simply by using an external
DTD.  Since your parameter entities already use external files, moving the
whole DTD to an external file should be fairly painless.  Then SAX won't
have to parse the character entities in the dtdhandler and you should be OK.

Of course, I can't say that I've tried it....


-----Original Message-----
From: Steven Sodt [mailto:steve.sodt at rowe.com]
Sent: Wednesday, October 10, 2001 4:08 PM
To: jdom-interest at jdom.org
Subject: [jdom-interest] Entity expansion - override or pass-through?
Importance: High


Perhaps a newbie question, but here goes.  Is there a means of passing
entity reference text (e.g. "&beta;", "&delta;") unresolved into non-Unicode
(e.g. ISO-8859-1, ASCII, or Cp1250) text output that will ultimately be
incorporated into HTML?  The intent on the output side is to give the
browser a chance on its own to render the corresponding character.

The application involved currently parses UTF-8 encoded XML files, using
PubMed.dtd for validation, and references 19 external ISO... character
entity files in the process.  It has no problem writing out
browser-renderable UTF-8 files or text, but because we're not able to
specify the encoding in the HTTP header or in HTML META tags in the
destination application, passing the entity references through unaltered
seems the best option.
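
One alternative, not taken from this thread, would be to leave entity
expansion on and write every non-ASCII character as a numeric character
reference, which browsers resolve regardless of the declared encoding.  The
sketch below uses only the standard library; NcrEscaper and toAscii are
illustrative names, and it assumes that markup characters such as "&" and
"<" are escaped elsewhere.

```java
class NcrEscaper {
    // Replaces every character outside 7-bit ASCII with a decimal numeric
    // character reference (&#N;), so the result can be written in any
    // ASCII-compatible encoding such as ISO-8859-1 or Cp1250.
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // reads surrogate pairs as one code point
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Called on a string holding the single character U+0394, toAscii returns
"&#916;", the same value the entity files use for Delta.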

The problem appears to be that simply setting .setExpandEntities() to false
results in the references being stripped from the output.  Altering the
replacement character reference(s) in the external entity files to reflect
the entity name (replacing "&#916;" with "&Delta;") results in the parser
generating a "recursive reference" error.

Any and all suggestions are welcome.  And thanks in advance for any
assistance rendered.

    // Load XML into JDOM document
    SAXBuilder builder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
    builder.setValidation(true);
    builder.setExpandEntities(false);

    Document doc = builder.build(new FileInputStream(parseFile), docType);

    <!ENTITY % ISOlat1  PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN"
"ISOlat1"> %ISOlat1;
    <!ENTITY % ISOlat2  PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN"
"ISOlat2"> %ISOlat2;
    ......
    <!ENTITY Delta      "&#916;"   ><!--U0394 =capital Delta, Greek -->
    <!ENTITY epsi       "&#949;"   ><!--U03B5 =small epsilon, Greek -->
    <!ENTITY epsis      "&#949;"   ><!--U03B5 /straightepsilon -->
    ......

Steven Sodt
RoweCom, Inc. / Information Quest
steve.sodt at rowe.com
781-329-3350 x 3503