[jdom-interest] Encoding not working as expected - Copyright Symbol

Jason Hunter jhunter at collab.net
Fri Jun 22 01:33:13 PDT 2001


Ah, found the problem.  You're using a standard FileWriter, which always
uses the platform's default encoding (apparently Latin-1 on your system).
So you're telling XMLOutputter you want to write UTF-8 but then passing it
a Writer that encodes characters as Latin-1.  The XML declaration still
says UTF-8, but the bytes on disk are Latin-1.  Changing your code to the
following works:

    SAXBuilder builder = new SAXBuilder();
    Document doc = builder.build(args[0]);
    XMLOutputter out = new XMLOutputter("  ", true, "UTF-8");
    Writer writer = new OutputStreamWriter(
                    new FileOutputStream("output.xml"), "UTF-8");
    out.output(doc, writer);
    writer.close();

Note the big "Warning" in the Javadoc for
XMLOutputter.output(Document, Writer), which explains this.
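
To see the mismatch at the byte level, here is a minimal standalone sketch
(not part of JDOM; the class and method names are made up for illustration,
and it assumes Java 7 or later for StandardCharsets).  It encodes the
copyright character both ways, prints the bytes in the same octal notation
used in this thread, and checks whether each result is valid UTF-8:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class CopyrightBytes {

    // Format bytes as octal escapes, e.g. \302\251.
    static String octal(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append('\\').append(Integer.toOctalString(b & 0xFF));
        }
        return sb.toString();
    }

    // Strictly decode the bytes as UTF-8; false if they are malformed.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String copyright = "\u00A9";  // the (C) symbol

        byte[] utf8 = copyright.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = copyright.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8:   " + octal(utf8) + " valid UTF-8? " + isValidUtf8(utf8));
        System.out.println("Latin-1: " + octal(latin1) + " valid UTF-8? " + isValidUtf8(latin1));
    }
}
```

The Latin-1 encoding is the single byte \251 (0xa9), which on its own is
not a legal UTF-8 sequence.  That is consistent with the "Unconvertible
UTF-8 character beginning with 0xa9" error when output.xml is read back in.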

This will make a good FAQ entry...

-jh-

Christian Cabanero wrote:
> 
> First off, let me just say that JDom is DA BOMB DIGITY!  Congrats on
> building such a great product.
> 
> But unfortunately, something's been confusing me.  I have an XML document
> that is UTF-8 encoded and contains the (C) symbol encoded with UTF-8 (at
> least I assume that it is).  It shows up in my XML as...
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <article section="ECONOMIC">
>         <copyright>\302\251 Copyright 2001 USA TODAY, a division of Gannett Co.
> Inc.</copyright>
> </article>
> 
> ...the "\302\251" being the character for (C).  I load this XML file using a
> SAXBuilder and then just spit it right out again into another file using an
> XMLOutputter like so...
> 
>   public static void main(String[] args) throws Exception {
>     SAXBuilder builder = new SAXBuilder();
>     // pass in the xml file containing the copyright symbol
>     Document doc = builder.build(args[0]);
>     XMLOutputter out = new XMLOutputter("  ", true, "UTF-8");
>     FileWriter writer = new FileWriter("output.xml");
>     out.output(doc, writer);
>     writer.close();
>   }
> 
> BUT, for some reason this modifies the XML data and messes up the copyright
> symbol...
> 
> output.xml:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <article section="ECONOMIC">
>         <copyright>\251 Copyright 2001 USA TODAY, a division of Gannett Co.
> Inc.</copyright>
> </article>
> 
> What happened to the copyright symbol?  Am I missing something?
> Subsequently, if I try to read in the resulting output.xml file I get a
> JDOMException which reports "Character conversion error: "Unconvertible
> UTF-8 character beginning with 0xa9" (line number may be too low)."
> 
> Any help would be very much appreciated.  I've been using JDom with a lot of
> success so far and just hit this snag, but otherwise have found it to be an
> exceptional product.
> 
> Thanks in advance!
> 
> -Christian Cabanero
> 


