[jdom-interest] JDOM Exception: invalid XML character (Unicode: 0xb) found
Dennis Sosnoski
dms at sosnoski.com
Tue Jul 23 21:30:26 PDT 2002
As Les said, it's safe to check for these values directly in the file
bytes - they're not changed by the UTF-8 encoding, and UTF-8 never uses
those byte values when encoding other characters. If you're reading the
file as a *character* stream, the file encoding won't matter anyway,
since the bytes of the file will already have been converted into
characters.
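To make the byte-level check concrete, here is a minimal sketch in Java.
The class and method names (ControlByteFilter, strip) are made up for
illustration and are not part of JDOM. It assumes the file is in UTF-8 or
another ASCII-compatible encoding; it would not be safe for UTF-16, where
bytes below 0x20 can be part of ordinary characters.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only; not a JDOM class.
final class ControlByteFilter {

    /**
     * Returns a copy of the input with every byte in 0x00-0x1F removed,
     * except tab (0x09), line feed (0x0A) and carriage return (0x0D).
     * Assumes an ASCII-compatible encoding such as UTF-8.
     */
    static byte[] strip(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (byte b : in) {
            int v = b & 0xFF; // Java bytes are signed; compare as unsigned
            boolean illegal = v < 0x20 && v != 0x09 && v != 0x0A && v != 0x0D;
            if (!illegal) {
                out.write(v);
            }
        }
        return out.toByteArray();
    }
}
```

Bytes of multi-byte UTF-8 sequences are all 0x80 or higher, so they pass
through this filter untouched.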
The character values below 0x20 (other than tab, line feed, and carriage
return) aren't the only ones that can cause problems, but they're the
only ones you'll ever encounter in ASCII files. The other character
values illegal in XML are 0xD800-0xDFFF, 0xFFFE-0xFFFF, and anything
above 0x10FFFF. See
http://www.w3c.org/TR/2000/REC-xml-20001006#charsets for the official
reference on this.
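If you are working on characters rather than bytes, the full set of legal
ranges can be checked per code point. The following is a rough sketch
based on the Char production in the XML 1.0 recommendation; XmlChars,
isXmlChar and stripInvalid are names invented for this example, not JDOM
API.

```java
// Hypothetical helper for illustration only; not a JDOM class.
final class XmlChars {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point that is not legal in XML 1.0 dropped. */
    static String stripInvalid(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // A valid surrogate pair yields one supplementary code point;
            // an unpaired surrogate yields its own value and is rejected.
            int cp = s.codePointAt(i);
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Whether to drop the offending characters silently, as here, or to replace
or report them is a design choice that depends on where the data comes
from.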
- Dennis
Les Hill wrote:
>From: "Charlie Wu" <cwu at brocade.com>
>
>>The other question then, is: can I go over my XML file as a character
>>stream and evaluate them byte by byte and remove anything between 0 and
>>0x20 (except the 3 you mentioned)? Would this be a problem for UTF-8
>>because they could be multi-byted?
>
>No. 0x00-0x7F are one-byte only in UTF-8.
>
>For more info, here is a recycled answer:
>
>Alex Rosen writes:
>
>>Read more about Unicode and the various
>>encodings, e.g. http://www.cl.cam.ac.uk/~mgk25/unicode.html
>
>Les Hill
>leh at galaxynine.com