[jdom-interest] Special Character woes
Matt Bridges
matt at kurzweiltech.com
Thu Feb 8 10:25:29 PST 2001
I have a bunch of XML documents that all contain special character
references like &#x2022;. If I set the encoding to UTF-8, and I parse it,
then subsequently output it, all the character references get replaced by
their corresponding Unicode characters (for instance, the above character
reference gets converted to a bullet). If I then re-parse the file that was
just written, SAXBuilder complains that an unrecognized Unicode character is
being used. So once I have output a document using XMLOutputter, I can no
longer parse it using JDOM.
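
The UTF-8 behaviour described above can be reproduced in miniature. This is only a sketch: it uses the JDK's built-in javax.xml parser rather than JDOM's SAXBuilder, and the class and method names are made up for illustration. It shows that a parser resolves the character reference into the character U+2022, that a literal bullet re-parses fine when the bytes really are UTF-8, and that parsing fails when the declaration says UTF-8 but the bullet was written in another encoding. Whether such a mismatch is what happened here is an assumption, not something the post confirms.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CharRefRoundTrip {
    // Parses the bytes with the JDK's built-in parser and returns the root
    // element's text, or null if the parser rejects the input.
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String decl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

        // The parser resolves the reference into the character U+2022.
        byte[] withRef = (decl + "<p>&#x2022;</p>").getBytes(StandardCharsets.UTF_8);
        System.out.println("\u2022".equals(rootText(withRef)));

        // Literal bullet, and the bytes really are UTF-8: re-parsing works.
        byte[] good = (decl + "<p>\u2022</p>").getBytes(StandardCharsets.UTF_8);
        System.out.println("\u2022".equals(rootText(good)));

        // Same declaration, but the bullet is the single byte 0x95 (its
        // windows-1252 encoding). That is not valid UTF-8, so parsing fails.
        byte[] bad = (decl + "<p>?</p>").getBytes(StandardCharsets.US_ASCII);
        bad[decl.length() + 3] = (byte) 0x95;
        System.out.println(rootText(bad) == null);
    }
}
```

If the third case matches the failure, the thing to check would be whether the stream or writer handed to the outputter actually encodes as UTF-8, rather than as the platform default.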
If I set the encoding to US-ASCII or ISO-8859-1, even stranger things
happen. The first time it parses and outputs the XML file, the character
references are converted to actual characters, as before. Now, when I parse
the output file I no longer get SAXBuilder errors, but when I output it again
with XMLOutputter, all those characters get converted to question marks!
So it seems that there is no way to parse and output a file more than once
using JDOM (a file that uses character references, anyway).
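
The question marks are consistent with how Java encodes text into a charset that lacks the character. The sketch below uses only the JDK (no JDOM), and the class name is made up for illustration. It checks whether each encoding can represent the bullet, then shows the substitution that String.getBytes performs when it cannot.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // True if the charset has a byte representation for the character.
    static boolean canEncode(Charset cs, char c) {
        return cs.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char bullet = '\u2022';
        System.out.println(canEncode(StandardCharsets.UTF_8, bullet));      // true
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, bullet)); // false
        System.out.println(canEncode(StandardCharsets.US_ASCII, bullet));   // false

        // String.getBytes substitutes the charset's replacement byte, '?',
        // for any character the charset cannot encode.
        byte[] bytes = "a\u2022b".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // a?b
    }
}
```

Once the character has been replaced by '?', the original is gone from the file, so a later parse has no way to recover it.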
- Is there a way to keep &#xnnnn; from being collapsed to a character using
  UTF-8?
- Is there a way to keep it from converting Unicode characters to question
  marks using ISO-8859-1 or US-ASCII?
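
One way to think about both questions is to write a character reference for any character the target encoding cannot represent. The helper below is hypothetical and is not part of JDOM; it uses only the JDK and shows the idea on plain text. It does not escape markup characters such as & or <, and it would have to be applied at the point where text is written to the output, since a serializer that sees the escaped string as ordinary text would escape the ampersand again.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharRefEscaper {
    // Hypothetical helper: replaces every code point the target charset
    // cannot encode with a hexadecimal character reference (&#xNNNN;).
    static String escapeUnencodable(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // ISO-8859-1 has no bullet, so it is written as a reference.
        System.out.println(escapeUnencodable("a\u2022b", StandardCharsets.ISO_8859_1)); // a&#x2022;b
        // UTF-8 can encode the bullet, so the text is left unchanged.
        System.out.println(escapeUnencodable("a\u2022b", StandardCharsets.UTF_8).equals("a\u2022b")); // true
    }
}
```

With this approach the output stays within the declared encoding, and a parser turns the references back into the same characters on the next read.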
Any other suggestions?
Thanks much,
Matt Bridges