[jdom-interest] UTF8 charset issues...
Patrick JUSSEAU
patrick at openbase.com
Fri Oct 10 11:08:17 PDT 2003
Alex,
Well, I am pretty sure it is not working, because if I save my XML
document and then try to read it back in my Java app, I get the
following exception:
java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
....
The scenario to get this exception is:
1 - Create a JDOM Document and call element.setText("Æ") to set an
element's text value
2 - Save this Document (i.e. create a local XML file) as test.xml
3 - Read this XML document back, which leads to the above exception.
Note: If I use an XML-aware tool like oXygen to look at test.xml, the
'Æ' character shows up as '�'
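The exception above is the kind of error a strict UTF-8 decoder reports when it meets bytes that were written in some other encoding. Below is a minimal sketch in plain Java (no JDOM or Xerces), assuming the file was written in a single-byte encoding such as ISO-8859-1. The thread does not say how test.xml was actually written, and the class and method names are made up for illustration. The sketch also relies on java.nio.charset.StandardCharsets, which is newer than the Java 1.4.1 mentioned in this message.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+00C6 is the character used in the thread.
        String xml = "<TEXT>\u00C6</TEXT>";

        // Assumption: ISO-8859-1 stands in for a non-UTF-8 platform default.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(latin1)); // prints "false"
        System.out.println(isValidUtf8(utf8));   // prints "true"
    }
}
```

If the saved file behaves like the first case, the XML declaration claims UTF-8 while the bytes on disk are in another encoding, which would be consistent with the parser error shown above.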
However, if I save my document using:
String text = "Æ";
byte[] bytes = text.getBytes("UTF8");
text = new String(bytes);
setText(text);
In that case, my document is saved properly and I am able to read it
back in my Java app.
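One possible reading of why this workaround appears to help, which the thread does not confirm: if the document is written through a Writer that uses the platform default encoding, then encoding the text as UTF-8 and decoding it with that same default produces a string whose default-encoded bytes happen to be the UTF-8 bytes of the original. The sketch below uses ISO-8859-1 as a stand-in for the platform default; that choice, and the class and method names, are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DoubleEncodingDemo {
    // Assumption: stand-in for the platform default charset.
    static final Charset ASSUMED_DEFAULT = StandardCharsets.ISO_8859_1;

    // Mirrors the workaround: encode as UTF-8, then decode with the default charset.
    static String workaround(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, ASSUMED_DEFAULT);
    }

    public static void main(String[] args) {
        String original = "\u00C6"; // the character used in the thread
        String mangled = workaround(original);

        // The in-memory string is no longer one character but two.
        System.out.println(mangled.length()); // prints "2"

        // Writing the mangled string in the default charset yields
        // exactly the UTF-8 bytes of the original character.
        byte[] written = mangled.getBytes(ASSUMED_DEFAULT);
        byte[] expected = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(written, expected)); // prints "true"
    }
}
```

If this is what is happening, the text stored in the Element is not really "Æ" any more, and the workaround only cancels out a mismatch on the writing side. Writing through an OutputStream, or through a Writer explicitly created for UTF-8, would likely remove the need for it.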
I am using Java 1.4.1 on Mac OS X.
Thanks again
Patrick
On 10 Oct 2003, at 6:34 PM, Alex Rosen wrote:
> "just calling Element.setText("Æ") does not generate a correct UTF-8
> encoded document."
>
> How did you determine this? I.e. what tool did you use to look at the
> document? What I'm getting at is, I think that the document was right,
> but the tool you used to look at it made it look "wrong". Realize that
> the *bytes* of the UTF-8 encoding of Æ are going to look like garbage
> characters. If you view the file using a tool that uses any encoding
> other than UTF-8, it'll look mangled, even though it's not. The viewer
> you used (e.g. maybe Notepad or another text editor) probably read it
> using your machine's default encoding (such as Latin 1), so it looked
> garbled even though it was really OK (i.e. if your viewer used UTF-8
> to show it to you, it would be fine.)
>
> Encoding issues are really confusing, unfortunately.
>
> Alex
>
>>>> Patrick JUSSEAU <patrick at openbase.com> 10/10/2003 8:35:20 AM >>>
> Hi all,
>
> I am trying to understand how JDOM handles character encodings. Here is
> what I am doing:
>
> I have a Java app which reads data from an XML file (UTF-8 encoded). I
> am able to get text just fine using
> String str = anElement.getText();
>
> The resulting str string (Unicode encoded) contains exactly what was
> defined in my XML file. The charset translation is transparent to me
> here. For example, if my XML document is:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE DOCUMENT SYSTEM "annonce.dtd">
> <DOCUMENT>
> <TEXT>Æ</TEXT>
> </DOCUMENT>
>
> I get Æ in my str string.
>
>
> However, when I try to generate an XML document with this exact
> same Æ value, just calling Element.setText("Æ") does not generate a
> correct UTF-8 encoded document. I first have to do this manually in my
> code:
> String text = "Æ";
> try{
> byte[] bytes = text.getBytes("UTF8");
> String newText = new String(bytes);
> setText(newText);
> }catch(UnsupportedEncodingException uee){
> uee.printStackTrace();
> }
>
> Why do I have to do this for the XML generation to work? Why isn't JDOM
> taking care of the charset translation for me, since the resulting file
> has UTF-8 encoding specified in it?
>
> Thanks for any help
>
> Patrick
>