[jdom-interest] SAXBuilder: How to handle non UTF-8 characters? (JDOMParseException)

Matthias Klein matthias at cmklein.de
Wed Nov 10 20:01:35 PST 2004


I have several files containing the results of an ItemSearchRequest at
Amazon.com. Most files are 100-200 kB XML files which, according to their
XML declaration, are UTF-8 encoded.

I read the file from Amazon's REST interface (url is the address that
responds with the XML file):

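    // No charset is given here, so the platform default encoding is used.
    Reader reader = new InputStreamReader(url.openStream());
    BufferedReader bufferedreader = new BufferedReader(reader);
    StringBuffer sb = new StringBuffer();
    int c;
    // Read one character at a time until end of stream or a NUL character.
    while (((c = bufferedreader.read()) != -1) && (c != 0)) {
        sb.append((char) c);
    }
    result = sb.toString();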

The string "result" will then be written into a RandomAccessFile.
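
A possible source of the problem is the decoding step above:
new InputStreamReader(InputStream) uses the platform default charset, which
on many Windows systems is not UTF-8. The post does not show how the string
is written back, so this is an assumption, not a confirmed diagnosis. A
minimal sketch that avoids decoding altogether by copying the response bytes
unchanged could look like this (RawCopy and copy are hypothetical names, not
part of the original program):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical helper, not part of the original program.
class RawCopy {
    // Copies every byte from in to out without turning bytes into characters,
    // so the encoding of the data cannot be altered on the way.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HTTP response: a small UTF-8 document with a non-ASCII character.
        byte[] utf8 = "<a>\u00e9</a>".getBytes("UTF-8");
        ByteArrayOutputStream saved = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(utf8), saved);
        // The saved bytes should be identical to the received bytes.
        System.out.println(Arrays.equals(utf8, saved.toByteArray()));
    }
}
```

In the program above this would be called with url.openStream() as the source
and a FileOutputStream for the result file as the destination, closing both
streams afterwards.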

Yet when I try to build a JDOM Document from the file using

   Document doc = builder.build(file);

I keep getting a JDOMParseException for some of the files. The reason
appears to be that those files contain bytes that are not valid UTF-8.
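
To confirm that assumption without involving the parser, the saved bytes can
be run through a strict UTF-8 decoder. This is only a sketch; Utf8Check and
isWellFormedUtf8 are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original program.
class Utf8Check {
    // Returns true if data decodes as UTF-8 without any malformed sequence.
    static boolean isWellFormedUtf8(byte[] data) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown on the first byte sequence that is not valid UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {0x3C, 0x61, 0x3E, (byte) 0xC3, (byte) 0xA9}; // "<a>" + UTF-8 e-acute
        byte[] bad = {0x3C, 0x61, 0x3E, (byte) 0xE9};               // "<a>" + Latin-1 e-acute
        System.out.println(isWellFormedUtf8(good));
        System.out.println(isWellFormedUtf8(bad));
    }
}
```

A file that fails this check was damaged before the parser ever saw it, which
points at the download or write step and not at SAXBuilder.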

Question: How can I get the SAXBuilder to ignore those characters? Does
anybody know the reason why those characters even appear every once in a
while?
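
One way to make the parser tolerate such bytes, offered as a sketch and not
as a recommendation: decode the file yourself with a UTF-8 decoder configured
to substitute its replacement character (U+FFFD) for malformed input, and
hand the resulting Reader to the builder. The names LenientUtf8, reader and
readAll are hypothetical. The affected characters are lost, so this hides the
corruption; it does not repair it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original program.
class LenientUtf8 {
    // Wraps a byte stream in a Reader that replaces malformed UTF-8 instead of failing.
    static Reader reader(InputStream in) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new InputStreamReader(in, decoder);
    }

    // Reads the whole Reader into a String (used here only for the demonstration).
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "<a>", one stray Latin-1 byte, "</a>": not valid UTF-8 as a whole.
        byte[] damaged = {0x3C, 0x61, 0x3E, (byte) 0xE9, 0x3C, 0x2F, 0x61, 0x3E};
        String text = readAll(reader(new ByteArrayInputStream(damaged)));
        // The stray byte shows up as U+FFFD in the decoded text.
        System.out.println(text.indexOf('\uFFFD') >= 0);
    }
}
```

The reader would then be passed to the builder, for example
builder.build(LenientUtf8.reader(new FileInputStream(file))). Only
SAXBuilder.build(Reader) comes from JDOM; everything else is illustrative.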

Below is the first part of the exception I mentioned.

Thanks

Matt


org.jdom.input.JDOMParseException: Error on line 1 of document
file:/d:/JavaCode/result.xml: Character conversion error: "Malformed UTF-8
char -- is an XML encoding declaration missing?" (line number may be too
low)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:465)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:810)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:789)
        at AmazonConnector.doItemSearch(AmazonConnector.java:98)
        at MidTermProjectMain.main(MidTermProjectMain.java:41)
Caused by: org.xml.sax.SAXParseException: Character conversion error:
"Malformed UTF-8 char -- is an XML encoding declaration missing?"
(line number may be too low)
        at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1100)
        at org.apache.crimson.parser.InputEntity.fillbuf(InputEntity.java:1072)
        at org.apache.crimson.parser.InputEntity.isEOF(InputEntity.java:262)
        at org.apache.crimson.parser.InputEntity.parsedContent(InputEntity.java:472)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1871)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1552)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1824)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1552)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1824)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1552)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1824)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1552)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1824)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1552)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1824)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1552)
        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:534)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:318)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
        ... 4 more




