[jdom-interest] Dealing with binary characters in-memory ->
outputter, Sample Code, Findings
Mark Bennett
mbennett at ideaeng.com
Tue Sep 25 20:48:50 PDT 2001
Hi Jason,
> -----Original Message-----
> From: Jason Hunter [mailto:jhunter at acm.org]
> Sent: Tuesday, September 25, 2001 10:31 AM
> To: mbennett at ideaeng.com
> Cc: jdom-interest at jdom.org
> Subject: Re: [jdom-interest] Dealing with binary characters in-memory ->
> outputter
>
> I'm explaining how it's supposed to work. It's possible reality doesn't
> quite match. Do you want to send in the little test case?
>
> -jh-
Thanks for the support. I was very interested to hear that you
had no problems.
I've included some code below, and discovered a few things in
the process.
The good news is that I am able to get it working by changing
some of my code. I'm still a bit confused as to why these
changes fix it.
Discovery # 1:
In a "properly" output XML file using UTF-8 (the default) the
odd single character 0xA9 (MS copyright) is output as a TWO
byte sequence. If you have 0xA9 in memory you will get:
0xC2 0xA9
When read back in this properly collapses back to just 0xA9.
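The two bytes follow from the UTF-8 encoding rules: U+00A9 is above 0x7F, so it is written as a lead byte 0xC2 followed by a continuation byte 0xA9. A minimal sketch using only the JDK, with no JDOM involved. The class name Utf8Bytes is made up for illustration, and java.nio.charset.StandardCharsets needs Java 7 or later, which is newer than the JDK this mail was written against.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes
{
    // Encode a string as UTF-8 and return the bytes as space-separated hex
    static String utf8Hex( String s )
    {
        StringBuilder sb = new StringBuilder();
        for ( byte b : s.getBytes( StandardCharsets.UTF_8 ) )
        {
            if ( sb.length() > 0 ) sb.append( ' ' );
            sb.append( String.format( "%02X", b & 0xFF ) );
        }
        return sb.toString();
    }

    // Decode UTF-8 bytes back into a string
    static String fromUtf8( byte[] bytes )
    {
        return new String( bytes, StandardCharsets.UTF_8 );
    }

    public static void main( String[] args )
    {
        // The in-memory character from the test program
        String copyright = String.valueOf( (char) 0xA9 );
        System.out.println( utf8Hex( copyright ) );

        // Reading the two bytes back collapses them to one character
        String back = fromUtf8( new byte[] { (byte) 0xC2, (byte) 0xA9 } );
        System.out.println( back.length() + " char, code point 0x"
            + Integer.toHexString( back.charAt( 0 ) ) );
    }
}
```

Run as written, this should print "C2 A9" followed by "1 char, code point 0xa9".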
Discovery # 2:
If you edit the ASCII XML file and remove the 0xC2 prefix
you will get an exception when you read the file back in.
Message:
Error reading back in 'test.xml'. Exception: org.jdom.JDOMException:
Error on line 1 of document file:///D:/data/java/jdomchars/test.xml:
An invalid XML character (Unicode: 0xa9) was found in the element
content of the document.
So it seems UTF-8 will "allow" any character above 0x7F only as a
multi-byte sequence (here with the 0xC2 prefix) - a bare 0xA9 byte
is an error.
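The exception is consistent with how UTF-8 works: 0xA9 on its own has the bit pattern of a continuation byte, so a strict decoder has no lead byte to attach it to. The sketch below uses the JDK's CharsetDecoder, which is an assumption chosen for illustration and not the parser that JDOM uses; the class name StrictUtf8 is hypothetical. It shows the same accept/reject behavior on raw bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8
{
    // Returns true if the bytes are well-formed UTF-8, false otherwise
    static boolean isValidUtf8( byte[] bytes )
    {
        try
        {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput( CodingErrorAction.REPORT )
                .onUnmappableCharacter( CodingErrorAction.REPORT )
                .decode( ByteBuffer.wrap( bytes ) );
            return true;
        }
        catch ( CharacterCodingException e )
        {
            // Malformed input, e.g. a continuation byte with no lead byte
            return false;
        }
    }

    public static void main( String[] args )
    {
        byte[] good = { '[', (byte) 0xC2, (byte) 0xA9, ']' };
        byte[] bare = { '[', (byte) 0xA9, ']' };
        System.out.println( "with 0xC2 prefix: " + isValidUtf8( good ) );
        System.out.println( "bare 0xA9:        " + isValidUtf8( bare ) );
    }
}
```

The first line should report true and the second false, matching what happens when the 0xC2 is removed from test.xml by hand.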
Discovery # 3: (at least for me)
It makes a difference whether you use a Java output stream or
a Java "writer".
If I send XMLOutputter xo.output() a plain FileOutputStream
it works just fine! I do get the proper 2 byte sequence.
If I send xo.output() an OutputStreamWriter created from
a BufferedOutputStream it breaks. This is what I was doing
wrong; I had thought the buffering would be important to have
and didn't set any encoding on the writer.
Works:
OutputStream outStream = new FileOutputStream( new File("test.xml") );
xo.output( doc, outStream );
Breaks:
OutputStream outStream = new FileOutputStream( new File("test.xml") );
// Extra steps from old code
OutputStream bOutStream = new BufferedOutputStream( outStream );
Writer writer = new OutputStreamWriter( bOutStream );
xo.output( doc, writer );
I confess I'm still a bit confused about the many Java IO
options and combinations, and when you should use which
combo. I imagine that, added to this, are codepage settings
or encodings properties. Java 1.4 adds nio as well, should
be "interesting" :-)
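A likely explanation for the stream/writer difference, offered as an assumption rather than something confirmed in this thread: an OutputStreamWriter built without a charset argument uses the platform default encoding, and a Latin-1 style default (Cp1252 on many Windows setups) writes U+00A9 as the single byte 0xA9. When XMLOutputter is handed a raw OutputStream it gets to choose the encoding itself. The sketch below uses only the JDK and a made-up class name, WriterEncoding; ISO-8859-1 stands in for such a platform default.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncoding
{
    // Push text through OutputStreamWriter -> BufferedOutputStream and
    // return the bytes that arrive at the bottom, as space-separated hex
    static String bytesWritten( String text, Charset charset )
    {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try
        {
            Writer writer = new OutputStreamWriter(
                new BufferedOutputStream( sink ), charset );
            writer.write( text );
            // Closing the outermost writer flushes the whole chain
            writer.close();
        }
        catch ( IOException e )
        {
            throw new UncheckedIOException( e );
        }
        StringBuilder sb = new StringBuilder();
        for ( byte b : sink.toByteArray() )
        {
            if ( sb.length() > 0 ) sb.append( ' ' );
            sb.append( String.format( "%02X", b & 0xFF ) );
        }
        return sb.toString();
    }

    public static void main( String[] args )
    {
        String text = String.valueOf( (char) 0xA9 );
        // Explicit UTF-8: the two byte form a UTF-8 XML parser expects
        System.out.println( "UTF-8:      "
            + bytesWritten( text, StandardCharsets.UTF_8 ) );
        // ISO-8859-1 stands in for a Latin-1 style platform default
        System.out.println( "ISO-8859-1: "
            + bytesWritten( text, StandardCharsets.ISO_8859_1 ) );
        // Whatever this JVM picks when no charset is given
        System.out.println( "default (" + Charset.defaultCharset() + "): "
            + bytesWritten( text, Charset.defaultCharset() ) );
    }
}
```

If this explanation holds, the writer variant of the test program could be repaired by constructing the writer as new OutputStreamWriter( bOutStream, "UTF-8" ), so that the bytes on disk match the UTF-8 the parser assumes when there is no XML declaration.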
The Code:
I show the code below in its "broken" form. Changing a
couple of lines fixes it, as indicated in the *** comments.
File: testchar.java
===================
// Demonstrate the problems outputting un-escaped characters
import java.io.*;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
public class testchar
{
    // The string with a weird character
    //////////////////////////////////////
    static final String content =
        "Content: " +
        "MS Copyright symbol [" +
        (char) 0xA9 +
        "]";
    // Where to put test data
    static final String outputFileName = "test.xml";
    static final String outputTextFileName = "test.txt";
    // Create the doc, write it to disk, then read it back in
    //////////////////////////////////////////////////////////
    static void test()
    {
        // Step 1: Create an empty document
        // ===============================================
        SAXBuilder jdomBuilder = new SAXBuilder();
        Document jdomDocument = new Document( new Element("root") );
        // Step 2: add the questionable content from an
        // in memory source
        // ===============================================
        jdomDocument.getRootElement().addContent( content );
        // Step 3: Write it to disk
        // ===============================================
        XMLOutputter xo = new XMLOutputter();
        // Or
        //XMLOutputter xo = new XMLOutputter( " ", true, "encoding" );
        File outFile = new File( outputFileName );
        OutputStream outStream = null;
        try
        {
            outStream = new FileOutputStream( outFile );
        }
        catch (Exception e)
        {
            System.err.println( "Error outputting xml to file\n" + e );
            return;
        }
        // Extra steps from old code, apparently causes the problem
        // ========================================================
        OutputStream bOutStream = new BufferedOutputStream( outStream );
        Writer writer = new OutputStreamWriter( bOutStream );
        // Try outputting it
        try
        {
            // Printing document gives XML header with
            // default UTF-8 encoding
            //xo.output( jdomDocument, outStream );
            // Printing element gives no XML header, so
            // no encoding, but still defaults to UTF-8
            // so makes no difference
            // *** Works ***
            //xo.output( jdomDocument.getRootElement(), outStream );
            // Using the writer vs outStream breaks outbound
            // encoding somehow
            // *** Breaks ***
            xo.output( jdomDocument.getRootElement(), writer );
        }
        catch (Exception e)
        {
            System.out.println( "Error outputting jdom element. " + e );
        }
        // Close the outermost wrapper so anything still buffered is
        // flushed before the file is closed; this closes outStream too
        try { writer.close(); } catch (Exception e) { }
        // Step 3b: Also create a text file
        // ===================================================
        // Just for a double check, write the same content
        // to a text file.
        File outFile2 = new File( outputTextFileName );
        OutputStream outStream2 = null;
        try
        {
            outStream2 = new FileOutputStream( outFile2 );
        }
        catch (Exception e)
        {
            System.err.println( "Error opening text file for output\n" + e );
            return;
        }
        PrintWriter pw = new PrintWriter( outStream2 );
        pw.write( content );
        try { pw.close(); } catch (Exception e) { }
        // Step 4: Read the XML file back in
        // ==========================================
        Document rereadDoc;
        try {
            rereadDoc = jdomBuilder.build( outputFileName );
        }
        catch (Exception e) {
            System.err.println( "Error reading back in '" +
                outputFileName + "'. Exception: " + e
                );
            return;
        }
        System.out.println( "Content = '" +
            rereadDoc.getRootElement().getText() +
            "'" );
    }
    public static void main( String[] args )
    {
        test();
    }
}