[jdom-interest] Charset conversion problem
Eric VERGNAUD
eric.vergnaud at wanadoo.fr
Fri Oct 3 19:24:35 PDT 2003
On 4/10/03 3:43, Eric VERGNAUD (eric.vergnaud at wanadoo.fr) wrote:
OK, forget about this. After digging a little deeper, it turned out that the
bug occurs later on. I had assumed String.getBytes() returned UTF-8 data,
whereas it returns data in the platform default charset.
Sorry for the noise.
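
For the record, here is a minimal sketch of the difference, not taken from my actual code. The class name CharsetDemo and the helpers utf8Bytes and hex are made up for illustration. It uses StandardCharsets, which assumes Java 7 or later; on older JDKs getBytes("UTF-8") does the same thing. The output of the no-argument getBytes() depends on the platform default charset, so it is printed without any expected value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: encode explicitly as UTF-8,
    // independent of the platform default charset.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical helper: render bytes as space-separated
    // uppercase hex, e.g. "C3 89".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00C9"; // the character É, Unicode value 201

        // The in-memory char is 201 whatever the platform charset is.
        System.out.println("char value:      " + (int) text.charAt(0));

        // No-argument getBytes() uses the platform default charset,
        // so this line varies from machine to machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes():      " + hex(text.getBytes()));

        // Naming the charset explicitly always yields the UTF-8 bytes.
        System.out.println("getBytes(UTF-8): " + hex(utf8Bytes(text)));
    }
}
```

Naming the charset on every String-to-bytes conversion (and on the reverse, new String(bytes, charset)) keeps the result the same across platforms.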
> Hi JDOM,
>
> I don't know if this is a bug or a setting I cannot find. I'm running JDOM
> b-9 on MacOSX.
>
> From a byte stream, I receive an XML document with some accented characters
> in it. For example:
>
> <record>
> États-unis
> </record>
>
> The character É above is properly encoded in UTF-8 as the bytes C3 89, which
> decode to C9 (decimal 201), which is indeed the Unicode value for that character.
>
> However, when I parse the document and then get the text from the element,
> it appears that the 201 has turned into 131, which happens to be the code for
> É in the Mac OS Roman charset.
>
> So it looks like the element data is converted to the platform charset
> rather than kept as Unicode.
>
> I hope I'm simply missing something. Here is how I parse the data:
>
> byte[] data; // comes from elsewhere (at this point the bytes are C3 89)
> InputStream is = new ByteArrayInputStream(data);
> SAXBuilder sax = new SAXBuilder();
> sax.setIgnoringElementContentWhitespace(true);
> Document received = sax.build(is);
> // by the time I get here, the character is 131 instead of 201
>
> Any advice will be appreciated,
>
> Eric