[jdom-interest] Special characters not being encoded as UTF-8

Jason Hunter jhunter at xquery.com
Tue Mar 28 15:22:22 PST 2006


XMLOutputter does output as UTF-8 unless you dictate otherwise, but that 
only applies when it writes bytes, and you're asking the outputter to 
return a String.  So it does, and Strings in Java are just a sequence of 
characters (they have no associated byte encoding).  Then when you print 
that string with System.out, the characters are encoded using your 
system's native charset, which probably isn't UTF-8.
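
To illustrate the point that a String carries no byte encoding, here is a 
minimal sketch using only the standard library (no JDOM).  The class and 
method names (StringEncodingDemo, hex, encode) are made up for this 
example.  It encodes the same one-character String with two charsets and 
shows that the bytes differ:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringEncodingDemo {
    /** Formats bytes as space-separated lowercase hex, like a hex dump. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Encodes the text with the given charset and returns the bytes as hex. */
    public static String encode(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        // The section sign, written as a Unicode escape so the source file's
        // own encoding does not matter.
        String sectionSign = "\u00A7";

        // Same String, two encodings, two different byte sequences.
        System.out.println(encode(sectionSign, StandardCharsets.ISO_8859_1)); // a7
        System.out.println(encode(sectionSign, StandardCharsets.UTF_8));      // c2 a7
    }
}
```

The single byte a7 matches what the hex dump in the quoted message shows, 
which is what you'd expect if the platform charset is a Latin-1 style 
encoding rather than UTF-8.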

Bottom line, you're printing a String using System.out, which isn't UTF-8 
friendly.  XMLOutputter did the proper job of returning an abstract String 
representation of the chars.
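
One way to get UTF-8 bytes out is to choose the encoding explicitly at the 
point where characters become bytes.  The sketch below is an assumption 
about what a fix could look like, not the only way to do it: it uses only 
the standard library, and the names Utf8PrintDemo and encodeViaWriter are 
made up for the example.  With JDOM itself, XMLOutputter also has output() 
overloads that take an OutputStream, so the outputter can do the encoding 
instead of going through a String.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class Utf8PrintDemo {
    /** Writes text through a UTF-8 writer and returns the bytes produced. */
    public static byte[] encodeViaWriter(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(sink, StandardCharsets.UTF_8));
        out.print(text);
        out.flush(); // push buffered characters through the encoder
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "<elem>\u00A7</elem>";

        // Wrap System.out so characters are encoded as UTF-8 rather than
        // with the platform default charset.
        PrintWriter utf8Out = new PrintWriter(
                new OutputStreamWriter(System.out, StandardCharsets.UTF_8), true);
        utf8Out.println(text);

        // With JDOM (not used here), handing the stream to the outputter
        // skips the intermediate String entirely, for example:
        //   outputter.output(doc1, System.out);
    }
}
```

Note that a console which is not set to UTF-8 may still display the two 
bytes as garbage; a hex dump of the redirected output is the reliable check.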

-jh-

Robert Herold wrote:
> I'm trying to produce XML with special characters (e.g. ascii 0xA7, which is
> the section-sign) in the text content of an element.  I would expect
> XMLOutputter to encode these characters as UTF-8, but it doesn't.  How do I
> get it to encode the special characters as UTF-8?  Or do I have to encode
> them before adding to the document?
> 
> Consider this test program:
> 
> import org.jdom.Document;
> import org.jdom.Element;
> import org.jdom.input.SAXBuilder;
> import org.jdom.output.XMLOutputter;
> 
> public class OutputXML {
> 	private static String SECTION_SIGN = "§";
> 
> 	public static void main(String[] args) {
> 
> 		Document doc1 = new Document();
> 		Element elem = new Element("elem");
> 		doc1.setRootElement(elem);
> 		elem.addContent(SECTION_SIGN);
> 
> 		XMLOutputter outputter = new XMLOutputter();
> 		String text = outputter.outputString(doc1);
> 		System.out.println(text);
> 	}
> }
> 
> It produces the output:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <elem>§</elem>
> 
> In a hex-dump of the output, one can see that the section-sign is left as
> 0xA7 (at offset 0x2e in the output), instead of being UTF-8 encoded:
> 
> 000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31  ><?xml version="1<
> 000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54  >.0" encoding="UT<
> 000020 46 2d 38 22 3f 3e 0d 0a 3c 65 6c 65 6d 3e a7 3c  >F-8"?>..<elem>.<<
> 000030 2f 65 6c 65 6d 3e 0d 0a 0d 0a                    >/elem>....<
> 
> Shouldn't XMLOutputter encode this character as UTF-8?
> 
> Thanks for any insights, and forgive me if this is answered elsewhere - I
> couldn't find it in a morning of searching!
> 
> -- Robert Herold
> 
> 
> 

