[jdom-interest] UTF8 charset issues...

Patrick JUSSEAU patrick at openbase.com
Mon Oct 13 01:58:05 PDT 2003


Alex,

Thank you very much for your help. I followed your instructions and now 
it is working perfectly as expected. I specified the UTF-8 encoding 
using an OutputStreamWriter on a FileOutputStream.

Patrick

// Warning: When outputting to a Writer, make sure the writer's encoding
// matches the encoding setting in the XMLOutputter. This ensures the
// encoding in which the content is written (controlled by the Writer
// configuration) matches the encoding placed in the document's XML
// declaration (controlled by the XMLOutputter). Because a Writer cannot
// be queried for its encoding, the information must be passed to the
// XMLOutputter manually in its constructor or via the setEncoding()
// method. The default XMLOutputter encoding is UTF-8.
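
A minimal sketch of the approach described above, wrapping a
FileOutputStream in an OutputStreamWriter with an explicit UTF-8 charset.
It assumes a modern JDK (StandardCharsets needs Java 7 or later, not the
Java 1.4.1 used in this thread). The class and method names are
hypothetical, and the sample document is written by hand so that the
sketch runs without JDOM on the classpath.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf8WriterSketch {

    // Opens a Writer that always encodes as UTF-8,
    // regardless of the platform default encoding.
    public static Writer openUtf8Writer(File file) throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8));
    }

    // Writes a tiny hand-built document and returns the raw bytes on disk.
    // With JDOM, the same Writer would instead be handed to
    // XMLOutputter.output(document, writer).
    public static byte[] writeSample(File file) throws IOException {
        try (Writer writer = openUtf8Writer(file)) {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            writer.write("<DOCUMENT><TEXT>\u00C6</TEXT></DOCUMENT>\n");
        }
        return Files.readAllBytes(file.toPath());
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("utf8-sketch", ".xml");
        file.deleteOnExit();
        byte[] bytes = writeSample(file);
        // Decoding the file as UTF-8 should give back the original character.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.contains("\u00C6"));
    }
}
```

Because the charset is fixed in the Writer, the bytes on disk agree with
the encoding="UTF-8" declaration whatever the machine's default is.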


On 10 Oct 2003, at 10:54 PM, Alex Rosen wrote:

> JDOM doesn't do this itself; it just uses the standard Java Reader and
> Writer mechanism. If you gave XMLOutputter an OutputStream, it would
> use the correct Writer automatically, but since you're giving it a
> Writer yourself, it can't do this. The JavaDoc for XMLOutputter
> mentions this. So always use streams instead of readers or writers
> when working with XML, if possible.
>
> Alex
>
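
A small sketch of the mismatch described above: text encoded with a
single-byte default charset is not valid UTF-8 once it contains a
character such as Æ, which is why a parser that trusts the
encoding="UTF-8" declaration rejects the file. ISO-8859-1 is used here
only as a stand-in for the platform default, which the thread does not
state; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchSketch {

    // Strictly decodes the bytes as UTF-8; returns false on malformed input.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><TEXT>\u00C6</TEXT>";

        // Assumption: stand-in for whatever default a FileWriter would pick up.
        Charset assumedDefault = StandardCharsets.ISO_8859_1;

        // Bytes as a default-encoding Writer would produce them.
        System.out.println(isValidUtf8(xml.getBytes(assumedDefault)));
        // Bytes as an explicitly UTF-8 Writer would produce them.
        System.out.println(isValidUtf8(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```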
>>>> Patrick JUSSEAU <patrick at openbase.com> 10/10/2003 3:52:01 PM >>>
> Alex,
>
> Thanks again for your help.
> Here are the 2 methods I am using to save my Document to a file:
>
>
> public void writeDOM(Document document) throws IOException {
>     XMLOutputter xmlOutputter = new XMLOutputter();
>     xmlOutputter.setNewlines(true);
>     xmlOutputter.output(document, getWriter());
> }
>
> AND
>
> protected Writer getWriter() throws IOException {
>     File file = new File(.....);
>     FileWriter fileWriter = new FileWriter(file);
>     return new BufferedWriter(fileWriter);
> }
>
>
> So, if I understand correctly, what you are saying is that I should
> specify the encoding the Writer should use? I thought that jdom would
> perform the encoding translation according to the XML file's encoding
> transparently for me. Isn't that what jdom is doing when reading
> from an XML file? I mean, I didn't have to specify any encoding when
> reading from the XML. What are your thoughts on this?
>
> Patrick
>
>
> On 10 Oct 2003, at 8:54 PM, Alex Rosen wrote:
>
>> OK, next theory: how are you saving the file? Are you using a Writer
>> or an OutputStream? If you're using a Writer are you setting it to use
>> UTF-8?
>>
>> Alex
>>
>>>>> Patrick JUSSEAU <patrick at openbase.com> 10/10/2003 2:08:17 PM >>>
>> Alex,
>>
>> Well, I am pretty sure it is not working, because if I save my XML
>> document and then try to read it back in my Java app, I get the
>> following exception:
>>
>> java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
>>          at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
>>          at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
>>          at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
>> ....
>>
>> The scenario to get this exception is:
>>
>> 1 - Create a jdom Document and call element.setText("Æ") to set an
>> element's text value
>>
>> 2 - Save this Document (i.e. create a local XML file, test.xml)
>>
>> 3 - Read this XML document back, which leads to the above exception.
>>
>>
>> Note: If I use an XML-aware tool like oXygen to look at test.xml, the
>> 'Æ' character shows up as '*'.
>> However, if I save my document using:
>> String text = "Æ";
>> byte[] bytes = text.getBytes("UTF8");
>> text = new String(bytes);
>> setText(text);
>>
>>
>> In that case, my document is properly saved and I am able to read it
>> back in my Java app.
>>
>> I am using Java 1.4.1 on MacOSX
>>
>> Thanks again
>>
>> Patrick
>>
>>
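
One likely explanation for why the getBytes("UTF8") / new String(bytes)
workaround appears to help, sketched under an assumption: the platform
default is a single-byte charset that maps these bytes back unchanged.
ISO-8859-1 stands in for that default here, since the thread does not
say what it was. The workaround turns Æ into two placeholder characters,
and a default-encoding Writer then emits exactly the UTF-8 bytes of Æ.
The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DoubleEncodingSketch {

    // Mimics the workaround: encode as UTF-8, then decode the bytes
    // with the (assumed single-byte) platform default charset.
    public static String workaround(String text, Charset platformDefault) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, platformDefault);
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the machine's default encoding.
        Charset platformDefault = StandardCharsets.ISO_8859_1;
        String original = "\u00C6";

        // The in-memory string is now two characters, not one.
        String mangled = workaround(original, platformDefault);
        System.out.println(mangled.length());

        // A default-encoding Writer would emit these bytes to the file.
        byte[] written = mangled.getBytes(platformDefault);
        System.out.println(
                Arrays.equals(written, original.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The file ends up valid only because two mismatches cancel out; the
string held in memory is no longer the text that was intended.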
>>
>> On 10 Oct 2003, at 6:34 PM, Alex Rosen wrote:
>>
>>> "just calling Element.setText("Æ") does not generate a correct UTF-8
>>> encoded document."
>>>
>>> How did you determine this? I.e. what tool did you use to look at the
>>> document? What I'm getting at is, I think that the document was right,
>>> but the tool you used to look at it made it look "wrong". Realize that
>>> the *bytes* of the UTF-8 encoding of Æ are going to look like garbage
>>> characters. If you view the file using a tool that uses any encoding
>>> other than UTF-8, it'll look mangled, even though it's not. The viewer
>>> you used (e.g. maybe Notepad or another text editor) probably read it
>>> using your machine's default encoding (such as Latin 1), so it looked
>>> garbled even though it was really OK (i.e. if your viewer used UTF-8
>>> to show it to you, it would be fine.)
>>>
>>> Encoding issues are really confusing, unfortunately.
>>>
>>> Alex
>>>
>>>>>> Patrick JUSSEAU <patrick at openbase.com> 10/10/2003 8:35:20 AM >>>
>>> Hi all,
>>>
>>> I am trying to understand how jdom handles character encodings. Here is
>>> what I am doing:
>>>
>>> I have a Java app which reads data from an XML file (UTF-8 encoded). I
>>> am able to get text just fine using
>>> String str = anElement.getText();
>>>
>>> The resulting str string (Unicode encoded) contains exactly what was
>>> defined in my XML file. The charset translation is transparent for me
>>> here. For example, if my XML document is:
>>>
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <!DOCTYPE DOCUMENT SYSTEM "annonce.dtd">
>>> <DOCUMENT>
>>>      <TEXT>Æ</TEXT>
>>> </DOCUMENT>
>>>
>>> I get Æ in my str string.
>>>
>>>
>>> However, when I try to generate an XML document with this exact
>>> same Æ value, just calling Element.setText("Æ") does not generate a
>>> correct UTF-8 encoded document. I first have to do this manually in my
>>> code:
>>> 		String text = "Æ";
>>> 		try{
>>> 			byte[] bytes = text.getBytes("UTF8");
>>> 			String newText = new String(bytes);
>>> 			setText(newText);
>>> 		}catch(UnsupportedEncodingException uee){
>>> 			uee.printStackTrace();
>>> 		}
>>>
>>> Why do I have to do this for the XML generation to work? Why isn't jdom
>>> taking care of the charset translation for me, since the resulting file
>>> has UTF-8 encoding specified in it?
>>>
>>> Thanks for any help
>>>
>>> Patrick
>>>