[jdom-interest] Building from file with UTF-8 extended characters
Jason Hunter
jhunter at acm.org
Mon Nov 19 21:04:46 PST 2001
Count the characters using getText().length() on that element. A length
of 1 tells you it's being stored as a single Unicode char.
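Here is a rough sketch of that check. To keep it self-contained it uses the JDK's built-in DOM parser rather than JDOM, so getTextContent() stands in for JDOM's getText(). The class name, the helper names, and the <name> element are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CharCountCheck {
    // Hypothetical sample document: the <name> element holds one character,
    // stored in the file as the two UTF-8 bytes 0xC5 0x95.
    static byte[] sampleXml() {
        byte[] head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "</name>".getBytes(StandardCharsets.US_ASCII);
        byte[] xml = new byte[head.length + 2 + tail.length];
        System.arraycopy(head, 0, xml, 0, head.length);
        xml[head.length] = (byte) 0xC5;
        xml[head.length + 1] = (byte) 0x95;
        System.arraycopy(tail, 0, xml, head.length + 2, tail.length);
        return xml;
    }

    // Parse the bytes and return the root element's text as a Java String.
    static String parseText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String text = parseText(sampleXml());
        // One char means the parser decoded the two bytes into one Unicode char.
        System.out.println("length = " + text.length());
        System.out.println("char   = U+" + String.format("%04X", (int) text.charAt(0)));
    }
}
```

If the parser decoded the input correctly, this should report a length of 1 and U+0155.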
After output, the document will be in UTF-8 again, so you'll see the char
converted back to the 2-byte sequence. Seeing two bytes there doesn't mean
it failed.
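To illustrate why two bytes on output are expected, here is a small JDK-only sketch (no JDOM involved; the class and method names are made up). It encodes the single char U+0155 as UTF-8 and then decodes those bytes two ways. Decoding with a single-byte charset such as ISO-8859-1 is only an assumption about what a platform default might be; it is what makes one char look like two.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Encode a String the way a UTF-8 output step would.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as upper-case hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String oneChar = "\u0155";          // a single Unicode char in memory
        byte[] out = encode(oneChar);       // two bytes once written as UTF-8
        System.out.println(hex(out));

        // Decoding with the right charset gives one char back.
        System.out.println(new String(out, StandardCharsets.UTF_8).length());

        // Decoding with a single-byte charset yields two chars instead.
        System.out.println(new String(out, StandardCharsets.ISO_8859_1).length());
    }
}
```

ByteArrayOutputStream.toString() with no argument decodes using the platform default charset, so if that default is not UTF-8 the result can look like two separate characters even though the in-memory document holds one.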
-jh-
Fred Clewis wrote:
>
> I have an XML file with encoding="UTF-8" that is mostly one-byte ASCII but
> has one element text value that is a two-byte UTF-8 character, X"C595".
> When I use "jdom b7 (+ recent CVS update), xerces 2.0 beta 2" to read the
> file in and build a document with code like:
>
> SAXBuilder builder = new SAXBuilder();
> builder.setFeature("http://apache.org/xml/features/allow-java-encodings", true);
> builder.setValidation(false);
> Document doc = builder.build(new FileInputStream(xmlFile));
>
> I would expect the parser to change the 2-byte UTF-8 character, X"C595", to
> its Unicode equivalent, X"0155".
> Is that right? How could I verify its Unicode value in Java? I need to
> build an MQSeries message with it. At the moment, I am not sure if I want
> the MQSeries message to be in Unicode or UTF-8 form, but I think I am not
> parsing it in correctly.
>
> Attempts to write it out before sending with:
>
> 1. doc.toString()
> or
> 2. ByteArrayOutputStream baos = new ByteArrayOutputStream();
> XMLOutputter xmlOut = new XMLOutputter();
> xmlOut.output(doc, baos);
> return baos.toString();
>
> seem to indicate it is still treated as two separate bytes, "C5" and "95".
>
> thanks for any ideas,
>