[jdom-interest] Dealing with binary characters in-memory ->outputter, Sample Code, Findings

Tue Sep 25 21:08:06 PDT 2001

> Discovery # 1:
> In a "properly" output XML file using UTF-8 (the default) the
> odd single byte 0xA9 (MS copyright) is output as a TWO
> byte sequence.  If you have 0xA9 in memory you will get:
> 
>         0xC2 0xA9
> 
> When read back in this properly collapses back to just 0xA9.

Good, that's the behavior I expected.  UTF-8 encodes any character
above 127 decimal as a multibyte sequence.
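
To see this outside of JDOM, here is a minimal sketch using only the
JDK.  The class and method names are made up for illustration, and it
assumes a JDK recent enough to have java.nio.charset.StandardCharsets:

```java
import java.nio.charset.StandardCharsets;

public class Utf8CopyrightDemo {
    // Returns the UTF-8 bytes of s as space-separated hex values.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes above 0x7F do not print as negatives.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00A9")); // prints 0xC2 0xA9
        System.out.println(utf8Hex("A"));      // prints 0x41
    }
}
```

The in-memory char 0xA9 becomes two bytes on the wire, while a plain
ASCII char stays one byte.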

> Discovery # 2:
> If you edit the ASCII XML file and remove the 0xC2 prefix
> you will get an exception when you read the file back in.

Sure, because you broke the UTF-8 multibyte encoding sequence.  That
char was two bytes long, not one, and you removed the first byte.
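
The same failure can be reproduced without a parser.  This is only a
sketch with a hypothetical helper name; it asks the JDK's UTF-8 decoder
to report malformed input instead of silently replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Raised for a broken multibyte sequence, such as a lone 0xA9.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] intact = { (byte) 0xC2, (byte) 0xA9 }; // full two-byte sequence
        byte[] broken = { (byte) 0xA9 };              // 0xC2 prefix removed
        System.out.println(isValidUtf8(intact)); // prints true
        System.out.println(isValidUtf8(broken)); // prints false
    }
}
```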

> So it seems UTF-8 will "allow" any character, IF it's escaped with
> a prefix - but a bare character is an exception.

Yep, you should read up on UTF-8; it's a slick encoding that lets ASCII
chars be one byte, but uses special prefix bytes to indicate the
presence of multibyte chars.  I think my book talks about it:
http://www.servlets.com/jservlet2.  At least I learned a lot about
charsets when writing the book.
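
As a rough illustration of those prefixes, the high bits of each byte
say what role it plays.  The class below is a hypothetical sketch that
looks at bit patterns only; it does not reject overlong or otherwise
illegal sequences the way a real decoder does:

```java
public class Utf8LeadByte {
    // Describes the role of one byte in a UTF-8 stream from its high bits.
    static String classify(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return "ascii";        // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return "continuation"; // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead-2";       // 110xxxxx
        if ((b & 0xF0) == 0xE0) return "lead-3";       // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead-4";       // 11110xxx
        return "invalid";
    }

    public static void main(String[] args) {
        System.out.println(classify(0x41)); // prints ascii
        System.out.println(classify(0xC2)); // prints lead-2
        System.out.println(classify(0xA9)); // prints continuation
    }
}
```

So 0xC2 announces a two-byte char and 0xA9 is its continuation byte,
which is why a bare 0xA9 with no lead byte in front of it is rejected.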

> Discovery # 3: (at least for me)
> It makes a difference whether you use a Java output stream or
> a Java "writer".
> 
> If I send XMLOutputter xo.output() a plain FileOutputStream
> it works just fine!  I do get the proper 2 byte sequence.
> 
> If I send xo.output() an OutputStreamWriter created from
> a BufferedOutputStream it breaks.  This is what I was doing
> wrong; I had thought the buffering would be important to have
> and didn't set any code page stuff.

There's a BIG warning in the Javadocs about this.  An OutputStreamWriter
created without an explicit encoding uses the platform default, so your
writer wasn't a UTF-8 Writer.  In the methods accepting a Writer it
warns you have to be DARN sure what you're doing.
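
If you do want the buffering, one way to be sure is to name the encoding
when you build the Writer.  The sketch below uses only JDK classes, with
a ByteArrayOutputStream standing in for your file and a hypothetical
helper name; a Writer built like this is what you would hand to the
outputter's Writer-accepting method:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterEncodingDemo {
    // Writes text through a buffered Writer using the given charset and
    // returns the resulting bytes as space-separated hex.
    static String hexWith(Charset cs, String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(new BufferedOutputStream(sink), cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Explicit UTF-8: matches what the XML declaration promises.
        System.out.println(hexWith(StandardCharsets.UTF_8, "\u00A9"));      // prints C2 A9
        // A Latin-1 writer emits the bare byte that a UTF-8 reader rejects.
        System.out.println(hexWith(StandardCharsets.ISO_8859_1, "\u00A9")); // prints A9
    }
}
```

The second call shows what a non-UTF-8 default encoding does to the
same character.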

So we're all good then.

-jh-