[jdom-interest] Dealing with binary characters in-memory
->outputter, Sample Code, Findings
Jason Hunter
jhunter at collab.net
Tue Sep 25 21:08:06 PDT 2001
> Discovery # 1:
> In a "properly" output XML file using UTF-8 (the default) the
> odd single byte 0xA9 (MS copyright) is output as a TWO
> character sequence. If you have 0xA9 in memory you will get:
>
> 0xC2 0xA9
>
> When read back in this properly collapses back to just 0xA9.
Good, that's the behavior I expected. UTF-8 encodes any character above
127 decimal as a multibyte sequence.
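To see the expansion outside of JDOM, here is a minimal sketch that
encodes a string as UTF-8 and prints the resulting bytes. The class and
method names (Utf8EncodeDemo, utf8Hex) are made up for illustration, and
StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class Utf8EncodeDemo {
    // Encode a string as UTF-8 and render the bytes as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("A"));       // 0x41 (ASCII stays one byte)
        System.out.println(utf8Hex("\u00A9"));  // 0xC2 0xA9 (copyright sign)
    }
}
```

Decoding those two bytes as UTF-8 gives back the single character
U+00A9, which is the "collapse" described above.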
> Discovery # 2:
> If you edit the ASCII XML file and remove the 0xC2 prefix
> you will get an exception when you read the file back in.
Sure, because you broke the UTF-8 multibyte encoding sequence. That
char was two bytes long, not one, and you removed the first byte.
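The same failure can be reproduced without an XML parser. This sketch
uses a strict java.nio CharsetDecoder as a stand-in for the parser's
input decoding (an assumption; the parser's own decoder may differ in
detail). The names StrictUtf8Demo and isValidUtf8 are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Demo {
    // Returns true if the bytes decode cleanly as UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // A broken multibyte sequence ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] intact = { (byte) 0xC2, (byte) 0xA9 };  // full two-byte sequence
        byte[] broken = { (byte) 0xA9 };               // lead byte removed
        System.out.println(isValidUtf8(intact));       // true
        System.out.println(isValidUtf8(broken));       // false
    }
}
```

Note that new String(bytes, "UTF-8") would not throw here; it silently
substitutes a replacement character, which is why the sketch uses a
decoder set to REPORT.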
> So it seems UTF-8 will "allow" any character, IF it's escaped with
> a prefix - but a bare character is an exception.
Yep, you should read up on UTF-8; it's a slick encoding that lets ASCII
chars be one byte, but uses special prefixes to indicate the presence of
multibyte chars. I think my book talks about it
(http://www.servlets.com/jservlet2). At least I learned a lot about
charsets when writing the book.
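For the curious, the "prefix" is just the high bits of the first byte.
A rough sketch of how a reader could classify a byte by its bit pattern
follows; Utf8LeadByteDemo and sequenceLength are illustrative names, and
the check looks at the bit pattern only (it does not reject overlong or
out-of-range lead bytes such as 0xC0).

```java
class Utf8LeadByteDemo {
    // Number of bytes in the sequence that starts with this byte,
    // or -1 if the byte cannot start a sequence.
    static int sequenceLength(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: plain ASCII
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx: starts a 2-byte char
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx: starts a 3-byte char
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx: starts a 4-byte char
        return -1;                         // 10xxxxxx: continuation byte
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x41));  // 1
        System.out.println(sequenceLength(0xC2));  // 2
        System.out.println(sequenceLength(0xA9));  // -1
    }
}
```

So 0xA9 on its own is a continuation byte with nothing to continue,
while 0xC2 announces that one more byte follows.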
> Discovery # 3: (at least for me)
> It makes a difference whether you use a Java output stream or
> a Java "writer".
>
> If I send XMLOutputter xo.output() a plain FileOutputStream
> it works just fine! I do get the proper 2 character sequence.
>
> If I send xo.output() an OutputStreamWriter created from
> a BufferedOutputStream it breaks. This is what I was doing
> wrong; I had thought the buffering would be important to have
> and didn't set any code page stuff.
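There's a BIG warning in the Javadocs about this. Your writer wasn't a
UTF-8 Writer: an OutputStreamWriter created without an explicit encoding
uses the platform default, which evidently wasn't UTF-8 on your machine.
The methods accepting a Writer warn that you have to be DARN sure what
you're doing.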
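If you do want the buffering, one way to keep it is to name the encoding
when building the Writer. This is a sketch rather than tested JDOM code:
it writes a plain string into an in-memory stream instead of calling
xo.output(), WriterEncodingDemo and write are made-up names, and
ISO-8859-1 is only a guess at what a non-UTF-8 platform default might
look like.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Write text through a buffered Writer using an explicit charset
    // and return the raw bytes that reached the underlying stream.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(
                new BufferedOutputStream(sink), charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Explicit UTF-8: the copyright sign becomes two bytes.
        System.out.println(write("\u00A9", StandardCharsets.UTF_8).length);       // 2
        // A single-byte charset: one bare 0xA9, which is not valid UTF-8.
        System.out.println(write("\u00A9", StandardCharsets.ISO_8859_1).length);  // 1
    }
}
```

A Writer built like the UTF-8 one above could then be handed to
xo.output(), as long as the encoding matches what the XML declaration
says.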
So we're all good then.
-jh-