[jdom-interest] B9-rc1: inputstreams, or readers: Invalid encoding name "KSC5601"

Alex Rosen arosen at novell.com
Fri Apr 18 08:23:21 PDT 2003


When you use an InputStream, the parser can read the encoding name from
the XML file and set up its own Reader with the right encoding.

When you use a Reader, it's your responsibility to set it up with the
right encoding, which in this case means you'd need to read the encoding
name out of the file yourself instead of letting the parser do it for
you. So, if at all possible, you should use an InputStream, not a Reader.
(I could've sworn that the JavaDoc mentioned this, but I don't see it.)
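
To make the difference concrete, here is a minimal sketch using only the
JAXP classes bundled with the JDK rather than JDOM. The class name
EncodingDemo, its helper methods, and the ISO-8859-1 sample document are
all made up for illustration; they are not from Rolf's test case. The
idea is that a byte stream lets the parser honour the encoding
declaration, while a Reader hands the parser characters that have
already been decoded, correctly or not.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EncodingDemo {

    // Hypothetical sample: a document declared and encoded as ISO-8859-1
    // whose root element contains "caf" followed by e-acute (U+00E9).
    public static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<word>caf\u00e9</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Parses the source and returns the text of the root element.
    public static String rootText(InputSource source) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(source).getDocumentElement().getTextContent();
    }

    // Byte stream: the parser reads the encoding declaration itself.
    public static String viaStream(byte[] data) throws Exception {
        return rootText(new InputSource(new ByteArrayInputStream(data)));
    }

    // Character stream: the caller picks the charset; the parser only
    // ever sees the characters this Reader produces.
    public static String viaReader(byte[] data, Charset charset) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset);
        return rootText(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        byte[] data = sampleBytes();
        String expected = "caf\u00e9";
        System.out.println("stream:              " + expected.equals(viaStream(data)));
        System.out.println("reader (ISO-8859-1): "
                + expected.equals(viaReader(data, StandardCharsets.ISO_8859_1)));
        System.out.println("reader (UTF-8):      "
                + expected.equals(viaReader(data, StandardCharsets.UTF_8)));
    }
}
```

If this is right, it would also explain why the Reader case "passes" in
Rolf's test: FileReader decodes with the platform default charset, so the
parser never has to resolve "KSC5601" at all, though the characters it
receives may not be the intended ones.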

Alex

>>> Rolf Lear <rlear at algorithmics.com> 4/17/2003 9:32:33 AM >>>
My point is that the data fails in SAXBuilder if it is processed as an
InputStream, but passes as a Reader. That is, the encoding is processed
"just fine" through a Reader InputSource, but not through an InputStream.

As I say, I am unsure where the bug is, or even if this is a bug, but
it certainly is suspicious.

Attached is the Zipped XMLDocument which fails "well-formedness" as a
ByteStream, but passes as a Reader.

Here is my test code:

==============================
import java.io.FileInputStream;
import java.io.FileReader;

import org.jdom.input.SAXBuilder;

public class MainParse {

    public static void main(String[] args) {
        try {
            new SAXBuilder().build(new FileInputStream(args[0]));
            System.out.println("PASSED: Processed file as an input stream.");
        } catch (Exception e) {
            System.out.println("FAILED: Processed file as an input stream.");
            e.printStackTrace();
        }
        try {
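            // FileReader decodes with the platform default charset; the
            // encoding named in the XML declaration is not consulted.
            new SAXBuilder().build(new FileReader(args[0]));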
            System.out.println("PASSED: Processed file as a Reader.");
        } catch (Exception e) {
            System.out.println("FAILED: Processed file as a Reader.");
            e.printStackTrace();
        }
    }
}
==================================

and this is my output from the command:
java -cp .:/lib/jaxen-jdom.jar:./lib/jdom.jar:./lib/xerces.jar MainParse mydoc_raw.xml


FAILED: Processed file as an input stream.
org.jdom.input.JDOMParseException: Error on line 1: Invalid encoding name "KSC5601".
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:381)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:684)
        at MainParse.main(MainParse.java:23)
Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)
        ... 2 more
Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:684)
        at MainParse.main(MainParse.java:23)
Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:684)
        at MainParse.main(MainParse.java:23)
PASSED: Processed file as a Reader.

Rolf



-----Original Message-----
From: Jason Hunter [mailto:jhunter at acm.org] 
Sent: Wednesday, April 16, 2003 6:48 PM
To: Rolf Lear
Cc: Jdom-Interest (E-mail)
Subject: Re: [jdom-interest] B9-rc1: inputstreams, or readers: Invalid encoding name "KSC5601"


It may be that the encoding name isn't known to XML but may be known
to Java.  There's a Xerces feature to tell it to respect Java names for
encodings.  Try that.
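
As a rough sketch of what "try that" could look like: Xerces documents
the feature under the URI
http://apache.org/xml/features/allow-java-encodings. The example below
sets it through plain JAXP so that it runs without JDOM on the
classpath; the class name JavaEncodingsFeature and the helper tryEnable
are hypothetical. With JDOM, the equivalent would presumably be a
SAXBuilder.setFeature(...) call with the same URI, assuming the JDOM
release in use provides that method. Whether the underlying parser then
actually accepts "KSC5601" depends on the parser and on the JVM's
charset aliases, so this only shows how to request the feature.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

public class JavaEncodingsFeature {

    // Feature URI as documented by Xerces.
    public static final String ALLOW_JAVA_ENCODINGS =
            "http://apache.org/xml/features/allow-java-encodings";

    // Asks the factory to accept Java encoding names in XML declarations.
    // Returns false if the underlying parser rejects the feature.
    public static boolean tryEnable(SAXParserFactory factory) {
        try {
            factory.setFeature(ALLOW_JAVA_ENCODINGS, true);
            return true;
        } catch (ParserConfigurationException
                | SAXNotRecognizedException
                | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        System.out.println("allow-java-encodings accepted: " + tryEnable(factory));
    }
}
```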

-jh-

> Rolf Lear wrote:
> 
> I have been trying to find/fix performance issues in JDOM, and was
> playing around with the Verifier.
> 
> To test the effect of changes to the Verifier, I first load an XML
> Document into memory, then parse it using SAXBuilder.build.
> 
> To test weird XML, I found this:
>
> http://ropas.kaist.ac.kr/viewcvs/viewcvs.cgi/*checkout*/n/nXml/testdata/document/mydoc_raw.xml?rev=HEAD&content-type=text/xml
> 
> which is partially Korean.
> 
> First, remove the Doctype declaration in the document.
> 
> My program does the following (See the code at the end).
> 
> It loads the file up as an array of bytes.
> It loads the file up as an array of Char.
> 
> It parses each through SAXBuilder.build using an InputStream on the
> bytes, and a Reader on the chars:
> InputSource source = new InputSource(new ByteArrayInputStream(bytedata));
> and
> InputSource source = new InputSource(new CharArrayReader(chardata));
> 
> Now, parsing the Reader passes, and the InputStream fails with:
> Invalid encoding name "KSC5601" (in Xerces).
> 
> org.jdom.input.JDOMParseException: Error on line 1: Invalid encoding name "KSC5601".
>         at org.jdom.input.SAXBuilder.build(SAXBuilder.java:381)
>         at MainTest.main(MainTest.java:77)
> Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)
>         ... 1 more
> Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)
>         at MainTest.main(MainTest.java:77)
> Caused by: org.xml.sax.SAXParseException: Invalid encoding name "KSC5601".
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at org.jdom.input.SAXBuilder.build(SAXBuilder.java:370)
>         at MainTest.main(MainTest.java:77)
> 
> Now I am the first to admit that my Unicode/charset knowledge is
> really flaky, so any suggestions as to whether this is a bug in my
> code, JDOM, or Xerces are welcome.
> 
> Rolf
> 
> ======================================================
> /*package default.*/
> import java.io.ByteArrayInputStream;
> import java.io.CharArrayReader;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileReader;
> import java.io.IOException;
> 
> import org.jdom.JDOMException;
> import org.jdom.input.SAXBuilder;
> import org.xml.sax.InputSource;
> 
> public class MainTest {
> 
>     private static byte[] loadedFileBytes(String filename) throws IOException {
>         File file = new File(filename);
>         byte[] buffer = new byte[(int)file.length()];
>         FileInputStream fis = new FileInputStream(file);
>         try {
>             int got = 0;
>             int size = buffer.length;
>             // read() may return fewer bytes than requested, so loop until full.
>             while (got < size) {
>                 int read = fis.read(buffer, got, size - got);
>                 if (read >= 0) {
>                     got += read;
>                 } else {
>                     throw new IOException("do not expect end of file before " + size + " bytes, but got it at " + got + " bytes.");
>                 }
>             }
>             if (fis.read() != -1) {
>                 throw new IOException("Thought we read to end of file, but there is still more.....");
>             }
>         } finally {
>             fis.close();
>         }
>         return buffer;
>     }
> 
>     private static char[] loadedFileChars(String filename) throws IOException {
>         File file = new File(filename);
>         // FileReader decodes with the platform default charset, not the
>         // encoding named in the XML declaration.
>         FileReader fr = new FileReader(file);
>         try {
>             StringBuffer sb = new StringBuffer();
>             int read = 0;
>             char[] buffer = new char[1024*4];
>             while ((read = fr.read(buffer)) >= 0) {
>                 sb.append(buffer, 0, read);
>             }
>             return sb.toString().toCharArray();
>         } finally {
>             fr.close();
>         }
>     }
> 
>     public static void main(String[] args) throws ClassNotFoundException, IOException {
>         // Force the Verifier class to load so its cost is not charged to the first build.
>         long start = System.currentTimeMillis();
>         Class.forName("org.jdom.Verifier").getDeclaredMethods();
>         long load = System.currentTimeMillis() - start;
>         System.out.println("Loaded Verifier Class: " + load + "ms.");
>         int iterations = Integer.parseInt(args[0]);
>         SAXBuilder builder = new SAXBuilder(false);
>         for (int i = 1; i < args.length; i++) {
>             start = System.currentTimeMillis();
>             byte[] bytedata = loadedFileBytes(args[i]);
>             char[] chardata = loadedFileChars(args[i]);
>             load = System.currentTimeMillis() - start;
>             System.out.println("Loaded Data in File '" + args[i] + "' in " + load + "ms. " + (bytedata.length / 1024) + "KB. " + (chardata.length / 1024) + " KChars About to SAXBuild");
>
>             // Byte stream: the parser must resolve the declared encoding itself.
>             try {
>                 for (int j = 0; j < iterations; j++) {
>                     InputSource source = new InputSource(new ByteArrayInputStream(bytedata));
>                     start = System.currentTimeMillis();
>                     builder.build(source);
>                     load = System.currentTimeMillis() - start;
>                     System.out.println("SAXBuilder built document '" + args[i] + "' (BYTES) iteration " + j + " in " + load + "ms.");
>                 }
>             } catch (JDOMException e) {
>                 e.printStackTrace();
>             } catch (IOException ioe) {
>                 ioe.printStackTrace();
>             }
>             // Character stream: the data has already been decoded by FileReader.
>             try {
>                 for (int j = 0; j < iterations; j++) {
>                     InputSource source = new InputSource(new CharArrayReader(chardata));
>                     start = System.currentTimeMillis();
>                     builder.build(source);
>                     load = System.currentTimeMillis() - start;
>                     System.out.println("SAXBuilder built document '" + args[i] + "' (CHARS) iteration " + j + " in " + load + "ms.");
>                 }
>             } catch (JDOMException e) {
>                 e.printStackTrace();
>             } catch (IOException ioe) {
>                 ioe.printStackTrace();
>             }
>         }
>     }
> }
> ======================================================



