No subject
Fri Aug 6 17:04:17 PDT 2004
<record>
États-unis
</record>
The character É above is correctly encoded in UTF-8 as the bytes C3 89, which
decode to C9 (201 decimal), which is indeed the Unicode value for that character.
However, when I parse the document and then get the text from the element,
it appears that the 201 has turned into 131, which happens to be the code for
É in the MacOS Latin charset.
So it looks like the element data is converted to the platform charset
rather than kept as Unicode.
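The two numbers above can be reproduced without any XML parsing. This is a minimal sketch, not part of the original code: the class and method names are made up for illustration, and it assumes the JDK ships the Mac charset under the name x-MacRoman, which is not guaranteed on every runtime. It only shows where 201 and 131 come from. It does not show where a conversion to the platform charset actually happens.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class CharsetCheck {

    // Decode the bytes as UTF-8 and return the first Unicode code point.
    static int decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    // Encode the text in the named charset and return its first byte, unsigned.
    static int firstByte(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName))[0] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0x89};
        System.out.println(decodeUtf8(utf8)); // 201

        // Assumption: "x-MacRoman" is available in this JDK.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println(firstByte("\u00C9", "x-MacRoman")); // 131
        }
    }
}
```

If the text is ever turned back into bytes with the platform default charset (for example by String.getBytes() with no argument, or by printing to a console), a 131 on a Mac would be consistent with a correctly parsed 201.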
I hope I'm simply missing something. Here is how I parse the data:
byte[] data; // comes from elsewhere (at this point the bytes are C3 89)
InputStream is = new ByteArrayInputStream(data);
SAXBuilder sax = new SAXBuilder();
sax.setIgnoringElementContentWhitespace(true);
Document received = sax.build(is);
// when I get there the character is 131 instead of 201
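One way to narrow this down is to run the same bytes through the JDK's built-in DOM parser instead of JDOM and look at the numeric code point directly, so that no console or platform charset is involved. This is a sketch under stated assumptions: it swaps in javax.xml.parsers for SAXBuilder, the class and method names are hypothetical, and the sample document is written without the line breaks around the text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only (JDK parser, not JDOM).
class ParseCheck {

    // Parse the XML bytes and return the first code point of the root element's text.
    static int firstCodePoint(byte[] data) {
        try (InputStream is = new ByteArrayInputStream(data)) {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(is);
            return doc.getDocumentElement().getTextContent().codePointAt(0);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // No XML declaration, so the parser falls back to UTF-8.
        byte[] data = "<record>\u00C9tats-unis</record>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(firstCodePoint(data)); // 201
    }
}
```

Checking the value as an int, rather than printing the character or calling getBytes(), keeps the platform charset out of the measurement.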
Any advice will be appreciated,
Eric