[jdom-interest] Building from file with UTF-8 extended characters
Fred Clewis
clewisf at us.ibm.com
Thu Nov 15 07:34:31 PST 2001
I have an XML file declared with encoding="UTF-8". It is mostly one-byte
ASCII, but one element's text value is a two-byte UTF-8 character, X"C595".
I am using JDOM b7 (plus a recent CVS update) with Xerces 2.0 beta 2 to read
the file in and build a document with code like:
    SAXBuilder builder = new SAXBuilder();
    builder.setFeature("http://apache.org/xml/features/allow-java-encodings", true);
    builder.setValidation(false);
    Document doc = builder.build(new FileInputStream(xmlFile));
I would expect the parser to convert the two-byte UTF-8 character, X"C595", to
its Unicode equivalent, X"0155".
Is that right? How could I verify its Unicode value in Java? I need to
build an MQSeries message with it. At the moment, I am not sure whether I want
the MQSeries message to be in Unicode or UTF-8 form, but I think I am not
parsing it in correctly.
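One way to check the value, sketched here without JDOM so it runs on its own: decode the two bytes as UTF-8 and print each resulting char as hex. The class name CodePointCheck and the helper toHex are made up for this illustration, and StandardCharsets assumes a JDK much newer than the one in use when this was posted. The same helper could be applied to the string returned by Element.getText() on the parsed document.

```java
import java.nio.charset.StandardCharsets;

public class CodePointCheck {
    // Hypothetical helper: renders each UTF-16 code unit of the string
    // as four hex digits, separated by spaces.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two bytes from the file, X"C595".
        byte[] utf8 = { (byte) 0xC5, (byte) 0x95 };
        // Decoding them as UTF-8 yields a single char.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(toHex(decoded)); // prints 0155
    }
}
```

If the parsed text prints as 0155, the parser has decoded the character correctly; if it prints as 00C5 0095, the bytes were read as a single-byte encoding somewhere along the way.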
Attempts to write it out before sending with:
1. doc.toString()
or
2. ByteArrayOutputStream baos = new ByteArrayOutputStream();
   XMLOutputter xmlOut = new XMLOutputter();
   xmlOut.output(doc, baos);
   return baos.toString();
seem to indicate it is still treated as two separate bytes, "C5" and "95".
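A possible explanation for the second attempt, offered as a guess rather than a diagnosis: ByteArrayOutputStream.toString() with no argument decodes the bytes using the platform default charset, so UTF-8 output can come back as two chars on a single-byte default such as ISO-8859-1. The sketch below assumes the outputter really did write UTF-8 bytes into the stream, and simulates that by writing the two bytes directly. OutputDecoding and decode are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

public class OutputDecoding {
    // Hypothetical helper: puts the bytes into a ByteArrayOutputStream and
    // decodes them with an explicitly named charset.
    static String decode(byte[] bytes, String charsetName) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(bytes, 0, bytes.length);
        try {
            return baos.toString(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for what a UTF-8 outputter would have written for U+0155.
        byte[] written = { (byte) 0xC5, (byte) 0x95 };

        // Decoded as UTF-8: one char.
        System.out.println(decode(written, "UTF-8").length());      // prints 1
        // Decoded as a single-byte charset: two chars, one per byte.
        System.out.println(decode(written, "ISO-8859-1").length()); // prints 2
    }
}
```

If this is what is happening, the document itself may be fine, and passing the byte array (or a string decoded with an explicit charset) to the message would avoid depending on the platform default.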
Thanks for any ideas,