[jdom-interest] Charset conversion problem

Fri Oct 3 19:24:35 PDT 2003

On 4/10/03 3:43, Eric VERGNAUD (eric.vergnaud at wanadoo.fr) wrote:

OK, forget about this. After digging a little deeper, it turned out that the
bug occurs later on. I had assumed that String.getBytes() returns UTF-8
data, whereas it actually returns data in the platform's default charset.
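
For anyone who hits the same thing, here is a minimal sketch of the difference.
The class and helper names (CharsetDemo, utf8Bytes, fromUtf8, hex) are made up
for illustration, and StandardCharsets assumes Java 7 or later; on older JVMs
getBytes("UTF-8") should do the same job. The bytes printed for the plain
getBytes() call depend on the JVM's default charset, so I give no expected
output for that line.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Encode with an explicit charset so the result does not depend on the platform.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a String.
    static String fromUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Format bytes as space-separated uppercase hex, e.g. "C3 89".
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00C9"; // É, Unicode code point 201

        System.out.println("default charset:    " + Charset.defaultCharset());
        // Platform-dependent: uses the default charset printed above.
        System.out.println("getBytes():         " + hex(s.getBytes()));
        // Always UTF-8, whatever the platform.
        System.out.println("getBytes(UTF-8):    " + hex(utf8Bytes(s)));
        // Round trip through UTF-8 gives back the original character.
        System.out.println("decoded code point: " + (int) fromUtf8(utf8Bytes(s)).charAt(0));
    }
}
```

So the parsed text was fine all along; passing the charset explicitly when
converting the String back to bytes avoids the surprise.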

Sorry for the noise.

> Hi JDOM,
> 
> I don't know if this is a bug or a setting I cannot find. I'm running JDOM
> b-9 on MacOSX.
> 
> From a byte stream, I receive an XML document with some accented characters
> in it. For example:
> 
> <record>
>   États-unis
> </record>
> 
> The above character É is properly encoded in UTF-8 as bytes C3 89, which
> decode to C9 (decimal 201), which is indeed the Unicode value for that character.
> 
> However, when I parse the document and then get the text from the element,
> it appears that the 201 has turned into 131, which happens to be the code for
> É in the MacOS Latin charset.
> 
> So it looks like the element data is converted to the platform charset
> rather than Unicode.
> 
> I hope I'm simply missing something. Here is how I parse the data:
> 
> byte[] data; // comes from elsewhere (at this point the bytes are C3 89)
> InputStream is = new ByteArrayInputStream(data);
> SAXBuilder sax = new SAXBuilder();
> sax.setIgnoringElementContentWhitespace(true);
> Document received = sax.build(is);
> // at this point the character is 131 instead of 201
> 
> Any advice will be appreciated,
> 
> Eric