[jdom-interest] Special Character woes

Thu Feb 8 10:25:29 PST 2001

I have a bunch of XML documents that all contain numeric character
references like &#x2022;. If I set the encoding to UTF-8, parse a document,
and then output it, all the character references get replaced by their
corresponding Unicode characters (for instance, the above reference gets
converted to a bullet).  If I then re-parse the file I just wrote,
SAXBuilder complains that an unrecognized Unicode character is being used.
So once I have output a document using XMLOutputter, I can no longer parse
it using JDOM.
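
For reference, the collapse itself is standard parser behaviour: an XML
parser resolves &#x2022; to the character U+2022 before the application
sees the text. The sketch below shows this with the JDK's built-in DOM
parser rather than JDOM, and shows that a second parse succeeds when the
bytes really are UTF-8. CharRefRoundTrip and rootText are hypothetical
names used only for illustration. It does not reproduce the SAXBuilder
error; one possible cause of that error (an assumption, not confirmed
here) is a file whose bytes do not match its declared encoding.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class for illustration; not part of JDOM.
final class CharRefRoundTrip {

    // Parses the bytes with the JDK's default parser and returns the
    // text content of the root element.
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String prolog = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

        // First pass: the input contains a numeric character reference.
        String first = rootText(
                (prolog + "<doc>&#x2022;</doc>").getBytes(StandardCharsets.UTF_8));
        // The parser has already resolved the reference to U+2022.
        System.out.println(first.equals("\u2022"));

        // Second pass: write the literal character, encoded as real UTF-8
        // bytes that match the declaration, and parse again.
        String second = rootText(
                (prolog + "<doc>" + first + "</doc>").getBytes(StandardCharsets.UTF_8));
        System.out.println(second.equals(first));
    }
}
```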

If I set the encoding to US-ASCII or ISO-8859-1, even stranger things
happen.  The first time I parse and output the XML file, the character
references are converted to actual characters, as before.  When I then parse
the output file I no longer get SAXBuilder errors, but when I output it again
with XMLOutputter, all those characters get converted to question marks!
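
The question marks are what Java's own character encoders produce for a
character the target charset cannot represent. The sketch below shows this
with plain JDK calls, not JDOM; whether XMLOutputter goes through the same
path internally is an assumption. LossyEncodingDemo and writeAndReadBack
are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of JDOM.
final class LossyEncodingDemo {

    // Encodes the text to bytes in the given charset, then decodes it
    // again, which is roughly what writing and re-reading a file does.
    static String writeAndReadBack(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String bullet = "\u2022";
        // U+2022 has no ISO-8859-1 or US-ASCII byte, so the encoder
        // substitutes its replacement byte, '?'.
        System.out.println(writeAndReadBack(bullet, StandardCharsets.ISO_8859_1));
        System.out.println(writeAndReadBack(bullet, StandardCharsets.US_ASCII));
        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(writeAndReadBack(bullet, StandardCharsets.UTF_8).equals(bullet));
    }
}
```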

So it seems that there is no way to parse and output a file more than once
using JDOM (a file that uses character references, anyway).

- Is there a way to keep &#xnnnn; from being collapsed to a character when
  using UTF-8?
- Is there a way to keep Unicode characters from being converted to question
  marks when using ISO-8859-1 or US-ASCII?
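
One workaround that does not depend on any JDOM setting is to post-process
the serialized text and turn every non-ASCII character back into a numeric
character reference before the bytes are written. The sketch below is one
way to do that, under the assumption that non-ASCII characters occur only
in text and attribute values (not in element or attribute names, where a
reference would be illegal). CharRefEscaper and escapeNonAscii are
hypothetical names, not JDOM API.

```java
// Hypothetical helper class for illustration; not part of JDOM.
final class CharRefEscaper {

    // Replaces every code point above U+007F with a hexadecimal numeric
    // character reference and leaves ASCII, including markup, untouched.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint > 0x7F) {
                out.append("&#x")
                   .append(Integer.toHexString(codePoint).toUpperCase())
                   .append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars so surrogate pairs stay together.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("<item>\u2022 first point</item>"));
    }
}
```

The result is pure ASCII, so it survives being written as US-ASCII,
ISO-8859-1, or UTF-8, and a later parse resolves the references again.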

Any other suggestions?

Thanks much,
Matt Bridges