[jdom-interest] Encoding not working as expected - Copyright Symbol
Jason Hunter
jhunter at collab.net
Fri Jun 22 01:33:13 PDT 2001
Ah, found the problem. You're using a standard FileWriter, which always
writes in the default system encoding; on your machine that's evidently
Latin-1. So you're telling XMLOutputter you want to write UTF-8, but then
you pass it a Writer which is only equipped to do Latin-1. Changing your
code to the following works:
  SAXBuilder builder = new SAXBuilder();
  Document doc = builder.build(args[0]);
  XMLOutputter out = new XMLOutputter(" ", true, "UTF-8");
  Writer writer = new OutputStreamWriter(
      new FileOutputStream("output.xml"), "UTF-8");
  out.output(doc, writer);
  writer.close();
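To see why the bytes change, here is a minimal standalone sketch (plain JDK, no JDOM) that encodes the copyright character both ways and prints the bytes in the same backslash-octal notation you used. The class name EncodingDemo and the helper toOctal are made up for illustration, and it assumes a JDK recent enough to have java.nio.charset.StandardCharsets:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Render each byte as a backslash-octal escape, e.g. 0xA9 -> \251.
    static String toOctal(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append('\\').append(Integer.toOctalString(b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9"; // the (C) symbol, U+00A9

        // UTF-8 needs two bytes for U+00A9.
        System.out.println("UTF-8:   "
                + toOctal(copyright.getBytes(StandardCharsets.UTF_8)));

        // Latin-1 (ISO-8859-1) writes it as a single byte.
        System.out.println("Latin-1: "
                + toOctal(copyright.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

This prints \302\251 for UTF-8 and \251 for Latin-1, which matches the difference between your input file and your output.xml.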
Note the big "Warning" in the Javadoc for the
XMLOutputter.output(Document, Writer) method, which explains this.
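As for the exception when you read output.xml back in: the file declares UTF-8, but a lone 0xA9 byte is not a valid UTF-8 sequence, so a strict decoder rejects it. The sketch below reproduces that with the JDK's own decoder rather than with your parser, so the exact error message will differ from the one you saw. The class name StrictUtf8 and the method isValidUtf8 are hypothetical names chosen for this example:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed input, e.g. a continuation byte with no lead byte.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Copyright = {(byte) 0xC2, (byte) 0xA9};   // \302\251
        byte[] latin1Copyright = {(byte) 0xA9};              // \251

        System.out.println("\\302\\251 valid UTF-8? " + isValidUtf8(utf8Copyright));
        System.out.println("\\251 valid UTF-8?     " + isValidUtf8(latin1Copyright));
    }
}
```

The two-byte form decodes fine; the single byte does not, which is the same class of failure as the "Unconvertible UTF-8 character beginning with 0xa9" you reported.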
This will make a good FAQ entry...
-jh-
Christian Cabanero wrote:
>
> First off, let me just say that JDom is DA BOMB DIGITY! Congrats on
> building such a great product.
>
> But unfortunately, something's been confusing me. I have an XML document
> that is UTF-8 encoded and contains the (C) symbol encoded with UTF-8 (at
> least I assume that it is). It shows up in my XML as...
>
> <?xml version="1.0" encoding="UTF-8"?>
> <article section="ECONOMIC">
> <copyright>\302\251 Copyright 2001 USA TODAY, a division of Gannett Co.
> Inc.</copyright>
> </article>
>
> ...the "\302\251" being the character for (C). I load this XML file using a
> SAXBuilder and then just spit it right out again into another file using an
> XMLOutputter like so...
>
> public static void main(String[] args) throws Exception {
>   SAXBuilder builder = new SAXBuilder();
>   // pass in the xml file containing the copyright symbol
>   Document doc = builder.build(args[0]);
>   XMLOutputter out = new XMLOutputter(" ", true, "UTF-8");
>   FileWriter writer = new FileWriter("output.xml");
>   out.output(doc, writer);
>   writer.close();
> }
>
> BUT, for some reason this modifies the XML data and messes up the copyright
> symbol...
>
> output.xml:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <article section="ECONOMIC">
> <copyright>\251 Copyright 2001 USA TODAY, a division of Gannett Co.
> Inc.</copyright>
> </article>
>
> What happened to the copyright symbol? Am I missing something?
> Subsequently, if I try to read in the resulting output.xml file I get a
> JDOMException which reports "Character conversion error: "Unconvertible
> UTF-8 character beginning with 0xa9" (line number may be too low)."
>
> Any help would be very much appreciated. I've been using JDom with a lot of
> success so far and just hit this snag, but otherwise have found it to be an
> exceptional product.
>
> Thanks in advance!
>
> -Christian Cabanero
>