[jdom-interest] Dealing with binary characters in-memory -> outputter, Sample Code, Findings

Tue Sep 25 20:48:50 PDT 2001

Hi Jason,

> -----Original Message-----
> From: Jason Hunter [mailto:jhunter at acm.org]
> Sent: Tuesday, September 25, 2001 10:31 AM
> To: mbennett at ideaeng.com
> Cc: jdom-interest at jdom.org
> Subject: Re: [jdom-interest] Dealing with binary characters in-memory ->
> outputter
>
> I'm explaining how it's supposed to work.  It's possible reality doesn't
> quite match.  Do you want to send in the little test case?
>
> -jh-

Thanks for the support.  I was very interested to hear that you
had no problems.

I've included some code below, and discovered a few things in
the process.

The good news is that I was able to get it working by changing
some of my code.  I'm still a bit confused as to why these changes
fix it.

Discovery # 1:
In a "properly" output XML file using UTF-8 (the default), the
single character 0xA9 (the copyright symbol) is output as a TWO
byte sequence.  If you have 0xA9 in memory you will get:

	0xC2 0xA9

When read back in this properly collapses back to just 0xA9.
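
To make the byte sequence concrete, here is a minimal standalone
sketch (no JDOM involved) that encodes the in-memory character as
UTF-8 and prints the resulting bytes.  The class name Utf8EncodeDemo
and the helper utf8Hex are made up for illustration, and
StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class Utf8EncodeDemo {

    // Encode a string as UTF-8 and render each byte as hex.
    static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits
            sb.append(String.format("0x%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One char in memory, two bytes on disk
        System.out.println(utf8Hex("\u00A9")); // prints "0xC2 0xA9"
        // Plain ASCII stays a single byte
        System.out.println(utf8Hex("A"));      // prints "0x41"
    }
}
```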

Discovery # 2:
If you edit the XML file by hand and remove the 0xC2 prefix
you will get an exception when you read the file back in.

Message:
Error reading back in 'test.xml'.  Exception: org.jdom.JDOMException:
Error on line 1 of document file:///D:/data/java/jdomchars/test.xml:
An invalid XML character (Unicode: 0xa9) was found in the element
content of the document.

So it seems UTF-8 will "allow" any character, but anything above
0x7F has to be written as a multi-byte sequence with a lead byte
(0xC2 here) - a bare 0xA9 byte on its own is an error.
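
The same rule can be seen without an XML parser.  The sketch below
is only an illustration of the underlying UTF-8 behaviour, not of
what the parser does internally: it uses a strict
java.nio.charset.CharsetDecoder (Java 1.4 and later) to decode the
bytes with and without the 0xC2 lead byte.  The class name
Utf8StrictDecodeDemo and the helper decodeStrict are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8StrictDecodeDemo {

    // Returns the decoded text, or null if the bytes are not well-formed UTF-8.
    static String decodeStrict(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Malformed input, e.g. a continuation byte with no lead byte
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] withPrefix = { (byte) 0xC2, (byte) 0xA9 };
        byte[] bareByte   = { (byte) 0xA9 };

        // The two-byte form decodes to the single char 0xA9
        System.out.println(decodeStrict(withPrefix) != null);
        // The bare byte is rejected as malformed
        System.out.println(decodeStrict(bareByte) == null);
    }
}
```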

Discovery # 3: (at least for me)
It makes a difference whether you use a Java output stream or
a Java "writer".

If I send XMLOutputter xo.output() a plain FileOutputStream
it works just fine!  I do get the proper 2 byte sequence.

If I send xo.output() an OutputStreamWriter created from
a BufferedOutputStream it breaks.  This is what I was doing
wrong; I had thought the buffering would be important to have
and didn't specify an encoding, so the writer presumably fell
back to the platform default encoding rather than UTF-8.

Works:
	OutputStream outStream = new FileOutputStream( new File("test.xml") );
	xo.output( doc, outStream );

Breaks:
	OutputStream outStream = new FileOutputStream( new File("test.xml") );
	// Extra steps from old code
	OutputStream bOutStream = new BufferedOutputStream( outStream );
	Writer writer = new OutputStreamWriter( bOutStream );
	xo.output( doc, writer );
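
The difference can be reproduced without JDOM.  This is a rough
sketch under one assumption: ISO-8859-1 stands in for whatever the
platform default encoding was on the machine above (the actual
default is not known).  The class name WriterEncodingDemo and the
helper writeThrough are made up for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterEncodingDemo {

    // Push text through a buffered stream wrapped in a Writer that uses
    // the given charset, and return the bytes that reach the sink.
    static byte[] writeThrough(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            Writer writer = new OutputStreamWriter(
                new BufferedOutputStream(sink), charset);
            writer.write(text);
            writer.close(); // flushes the writer and the buffered stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9";

        // Assumed stand-in for a single-byte platform default encoding
        byte[] latin1 = writeThrough(copyright, StandardCharsets.ISO_8859_1);
        // Explicit UTF-8, matching what the XML file claims to be
        byte[] utf8 = writeThrough(copyright, StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes: " + latin1.length);
        System.out.println("UTF-8 bytes:      " + utf8.length);
    }
}
```

If this is indeed the cause, giving the writer an explicit encoding,
for example new OutputStreamWriter( bOutStream, "UTF-8" ), should
let the "Breaks" variant keep its buffering and still produce the
two byte sequence.  I have not checked this against every JDOM
version.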

I confess I'm still a bit confused about the many Java IO
options and combinations, and when you should use which
combo.  I imagine that, added to this, are codepage settings
or encodings properties.  Java 1.4 adds nio as well, should
be "interesting" :-)

The Code:

I show the code below in its "broken" form.  Changing a
couple of lines fixes it, as indicated in the *** comments.

File: testchar.java
===================

// Demonstrate the problems outputting un-escaped characters

import java.io.*;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

public class testchar
{

	// The string with a weird character
	//////////////////////////////////////
	static final String content =
		"Content: " +
		"MS Copyright symbol [" +
		(char) 0xA9 +
		"]";
	// Where to put test data
	static final String outputFileName = "test.xml";
	static final String outputTextFileName = "test.txt";

	// Create the doc, write it to disk, then read it back in
	//////////////////////////////////////////////////////////
	static void test()
	{

		// Step 1: Create an empty document
		// ===============================================
		SAXBuilder jdomBuilder = new SAXBuilder();
		Document jdomDocument = new Document( new Element("root") );

		// Step 2: add the questionable content from an
		// in-memory source
		// ===============================================
		jdomDocument.getRootElement().addContent( content );

		// Step 3: Write it to disk
		// ===============================================

		XMLOutputter xo = new XMLOutputter();
		// Or
		//XMLOutputter xo = new XMLOutputter( "  ", true, "encoding" );

		File outFile = new File( outputFileName );
		OutputStream outStream = null;
		try
		{
			outStream = new FileOutputStream( outFile );
		}
		catch (Exception e)
		{
			System.err.println( "Error outputting xml to file\n" + e );
			return;
		}

		// Extra steps from old code, apparently causes the problem
		// ========================================================
		OutputStream bOutStream = new BufferedOutputStream( outStream );
		Writer writer = new OutputStreamWriter( bOutStream );

		// Try outputting it
		try
		{
			// Printing document gives XML header with
			// default UTF-8 encoding
			//xo.output( jdomDocument, outStream );

			// Printing element gives no XML header, so
			// no encoding, but still defaults to UTF-8
			// so makes no difference
			// *** Works ***
			//xo.output( jdomDocument.getRootElement(), outStream );

			// Using the writer vs outStream breaks outbound
			// encoding somehow
			// *** Breaks ***
			xo.output( jdomDocument.getRootElement(), writer );

		}
		catch (Exception e)
		{
			System.out.println( "Error outputting jdom element: " + e );
		}

		try { outStream.close(); } catch (Exception e) { }
		//try { writer.close(); } catch (Exception e) { }

		// Step 3b: Also create a text file
		// ===================================================
		// Just for a double check, write the same content
		// to a text file.

		File outFile2 = new File( outputTextFileName );
		OutputStream outStream2 = null;
		try
		{
			outStream2 = new FileOutputStream( outFile2 );
		}
		catch (Exception e)
		{
			System.err.println( "Error opening text file\n" + e );
			return;
		}
		PrintWriter pw = new PrintWriter( outStream2 );
		pw.write( content );
		try { pw.close(); } catch (Exception e) { }

		// Step 4: Read the XML file back in
		// ==========================================

		Document rereadDoc;
		try {
			rereadDoc = jdomBuilder.build( outputFileName );
		}
		catch (Exception e) {
			System.err.println( "Error reading back in '" +
				outputFileName + "'.  Exception: " + e
				);
			return;
		}

		System.out.println( "Content = '" +
			rereadDoc.getRootElement().getText() +
			"'" );
	}

	public static void main( String[] args )
	{
		test();
	}

}