[jdom-interest] Special characters not being encoded as UTF-8

Robert Herold rherold at xetus.com
Fri Mar 31 10:59:55 PST 2006


The solution was to use the proper Charset when sending out the xml.

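import java.io.BufferedWriter;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

String xmlAsString; // Holds the XML text to be sent
OutputStream outStream; // Stream to wherever the XML is being sent

// Encode characters as UTF-8 bytes instead of the platform default charset
BufferedWriter out =
   new BufferedWriter(
      new OutputStreamWriter(
         outStream,
         Charset.forName("UTF-8")));

out.write(xmlAsString);
out.flush(); // Push any buffered characters through to outStream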
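
One way to check that the bytes going out really are UTF-8 is to write into an in-memory stream and dump the result. The sketch below is not from the original thread: the names Utf8SendDemo, sendAsUtf8, encodeForWire and toHex are hypothetical, the ByteArrayOutputStream only stands in for the real network stream, and StandardCharsets assumes Java 7 or later (on older JVMs Charset.forName("UTF-8") should do the same job).

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8SendDemo {

    // Writes the text to the stream as UTF-8 and flushes; the stream stays open.
    static void sendAsUtf8(String xmlAsString, OutputStream outStream) throws IOException {
        Writer out = new BufferedWriter(
                new OutputStreamWriter(outStream, StandardCharsets.UTF_8));
        out.write(xmlAsString);
        out.flush(); // otherwise buffered characters may never reach the stream
    }

    // Convenience wrapper for the demo: returns the bytes that would be sent.
    static byte[] encodeForWire(String xmlAsString) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the network
        try {
            sendAsUtf8(xmlAsString, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00a7 is the section sign; the escape avoids depending on the source file's encoding.
        byte[] sent = encodeForWire("<elem>\u00a7</elem>");
        // prints: 3c 65 6c 65 6d 3e c2 a7 3c 2f 65 6c 65 6d 3e
        System.out.println(toHex(sent));
    }
}
```

The section sign shows up as the two bytes c2 a7 rather than the single byte a7, which is what a UTF-8 consumer expects.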

Thanks for leading me to what now seems obvious.

-- Robert Herold

-----Original Message-----
From: jdom-interest-bounces at jdom.org [mailto:jdom-interest-bounces at jdom.org]
On Behalf Of Robert Herold
Sent: Wednesday, March 29, 2006 2:53 PM
To: jdom-interest at jdom.org
Subject: RE: [jdom-interest] Special characters not being encoded as UTF-8

The problem showed up while composing and sending XML between applications
over the network, so System.out.println never figured into my real problem,
just into the test case demonstrating it.  

I understand now, however, that it is simply an output issue.  I'll
investigate how to send true UTF-8 from my ostensibly correct String
representation of the XML.  Thanks for setting me straight, and apologies
for the bother.

(It figures that it would be pilot error - JDOM has been stable for a while,
and I'm a relatively new user...)

-- Robert Herold

-----Original Message-----
From: Paul Libbrecht [mailto:paul at activemath.org]
Sent: Tuesday, March 28, 2006 11:45 PM
To: Jason Hunter
Cc: Robert Herold; jdom-interest at jdom.org
Subject: Re: [jdom-interest] Special characters not being encoded as UTF-8

System.out.println(string) is a complete killer for anything other than ASCII,
since it doesn't make the encoding explicit.

But System.out is a stream, so new
XMLOutputter().output(document, System.out) should work properly.
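
To illustrate why printing a String without an explicit encoding is risky, here is a small sketch using only the JDK (it does not use JDOM, and it is not from the original thread). The class PrintEncodingDemo and its helpers printWith and toHex are hypothetical names. It prints the same one-character String through two PrintStreams that differ only in their charset and shows that the bytes differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintEncodingDemo {

    // Prints the text through a PrintStream bound to the named charset
    // and returns the bytes that were produced.
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream ps = new PrintStream(sink, true, charsetName);
            ps.print(text); // print, not println, so no line separator is added
            ps.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sectionSign = "\u00a7"; // the section sign, written as an escape
        // prints: ISO-8859-1: a7
        System.out.println("ISO-8859-1: " + toHex(printWith(sectionSign, "ISO-8859-1")));
        // prints: UTF-8: c2 a7
        System.out.println("UTF-8: " + toHex(printWith(sectionSign, "UTF-8")));
    }
}
```

The String is identical in both calls; only the charset chosen at the moment of output decides which bytes appear.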

How you see it in the console is yet another challenge, btw!
First try piping the output of the process to a file, then view it with
various encodings.
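
Viewing the same bytes "with various encodings" can also be simulated in code. The following is a minimal sketch, not part of the original thread, with hypothetical names (DecodeDemo, decode, describe). It decodes fixed byte sequences under two charsets and prints the resulting code points, so the result does not depend on what the console can display.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // Interprets the bytes as text in the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Lists the UTF-16 code units of a string as "U+XXXX" values.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = {(byte) 0xc2, (byte) 0xa7}; // section sign encoded as UTF-8
        byte[] latin1Bytes = {(byte) 0xa7};            // section sign encoded as ISO-8859-1

        // Correct pairing: prints U+00A7
        System.out.println(describe(decode(utf8Bytes, StandardCharsets.UTF_8)));
        // UTF-8 bytes read as ISO-8859-1: prints U+00C2 U+00A7 (an extra character appears)
        System.out.println(describe(decode(utf8Bytes, StandardCharsets.ISO_8859_1)));
        // ISO-8859-1 byte read as UTF-8: prints U+FFFD (the replacement character)
        System.out.println(describe(decode(latin1Bytes, StandardCharsets.UTF_8)));
    }
}
```

The third case corresponds to the hex dump in the original question: a lone a7 byte is not valid UTF-8, so a UTF-8 reader cannot recover the section sign from it.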

paul

Jason Hunter wrote:
> XMLOutputter does output as UTF-8 unless you dictate otherwise, but 
> you're asking the outputter to return a String.  So it does, and 
> Strings in Java are just a sequence of characters (they have no 
> associated byte encoding).  Then when you print that string with 
> System.out you're dropping into your system's native charset which 
> probably isn't UTF-8.
>
> Bottom line, you're printing a String using System.out which isn't
> UTF-8 friendly.  XMLOutputter did the proper job returning an abstract 
> String representation of the chars.
>
> -jh-
>
> Robert Herold wrote:
>> I'm trying to produce XML with special characters (e.g. 0xA7 in
>> ISO-8859-1, which is the section-sign) in the text content of an
>> element.  I would expect XMLOutputter to encode these characters as
>> UTF-8, but it doesn't.  How do I get it to encode the special
>> characters as UTF-8?  Or do I have to encode them before adding them
>> to the document?
>>
>> Consider this test program:
>>
>> import org.jdom.Document;
>> import org.jdom.Element;
>> import org.jdom.input.SAXBuilder;
>> import org.jdom.output.XMLOutputter;
>>
>> public class OutputXML {
>>     private static String SECTION_SIGN = "§";
>>
>>     public static void main(String[] args) {
>>
>>         Document doc1 = new Document();
>>         Element elem = new Element("elem");
>>         doc1.setRootElement(elem);
>>         elem.addContent(SECTION_SIGN);
>>
>>         XMLOutputter outputter = new XMLOutputter();
>>         String text = outputter.outputString(doc1);
>>         System.out.println(text);
>>     }
>> }
>>
>> It produces the output:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <elem>§</elem>
>>
>> In a hex-dump of the output, one can see that the section-sign is
>> left as 0xA7 (at offset 0x2e in the output), instead of being UTF-8
>> encoded:
>>
>> 000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31  ><?xml version="1<
>> 000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54  >.0" encoding="UT<
>> 000020 46 2d 38 22 3f 3e 0d 0a 3c 65 6c 65 6d 3e a7 3c  >F-8"?>..<elem>.<<
>> 000030 2f 65 6c 65 6d 3e 0d 0a 0d 0a                    >/elem>....<
>>
>> Shouldn't XMLOutputter encode this character as UTF-8?
>>
>> Thanks for any insights, and forgive me if this is answered
>> elsewhere - I couldn't find it in a morning of searching!
>>
>> -- Robert Herold
>>
>>
>>
>> _______________________________________________
>> To control your jdom-interest membership:
>> http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com
>>










