[jdom-interest] JDOM Exception: invalid XML character (Unicode: 0xb) found

Tue Jul 23 21:30:26 PDT 2002

As Les said, it's safe to check for these values directly in the file 
bytes when the file is UTF-8: those byte values are not changed by the 
encoding, and UTF-8 never uses them as part of a multi-byte character. 
If you're reading the file as a *character* stream, the file encoding 
won't matter anyway, since the bytes of the file will already have been 
converted into characters.
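
To make the byte-level check concrete, here is a minimal sketch in Java. 
The class and method names are made up for illustration, and it assumes 
the input is UTF-8 or plain ASCII (it would not be safe for UTF-16, where 
these byte values do occur inside other characters). It drops every byte 
below 0x20 except tab, line feed, and carriage return.

```java
import java.io.ByteArrayOutputStream;

public class ControlByteFilter {
    // Hypothetical helper: returns a copy of the input with bytes
    // 0x00-0x1F removed, keeping tab (0x09), LF (0x0A) and CR (0x0D).
    // Assumes the bytes are UTF-8 or ASCII.
    public static byte[] stripIllegalControlBytes(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        for (byte b : input) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean legal = v >= 0x20 || v == 0x09 || v == 0x0A || v == 0x0D;
            if (legal) {
                out.write(v);
            }
        }
        return out.toByteArray();
    }
}
```

Bytes of 0x80 and above pass through untouched, so multi-byte UTF-8 
sequences stay intact.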

The character values below 0x20 aren't the only ones that can be 
problems, but they're the only ones you'll ever encounter in ASCII files 
(other character values illegal in XML are 0xD800-0xDFFF, 0xFFFE-0xFFFF, 
and anything above 0x10FFFF). See 
http://www.w3c.org/TR/2000/REC-xml-20001006#charsets for the official 
reference on this.
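
If you work on characters instead of bytes, the full set of legal values 
can be checked in one place. This is only a sketch: the class and method 
names are invented here, and the ranges are my reading of the Char 
production in the XML 1.0 recommendation (tab, LF, CR, 0x20-0xD7FF, 
0xE000-0xFFFD, 0x10000-0x10FFFF).

```java
public class XmlChars {
    // Hypothetical helper: true if the code point is a legal XML 1.0 character.
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: returns the text with every illegal code point dropped.
    public static String stripInvalid(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // A valid surrogate pair is read as one supplementary code point;
            // an unpaired surrogate comes back as 0xD800-0xDFFF and is rejected.
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Running the text through something like stripInvalid before handing it to 
JDOM should avoid the exception, at the cost of silently losing the 
offending characters.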

  - Dennis

Les Hill wrote:

>From: "Charlie Wu" <cwu at brocade.com>
>
>>The other question then, is: can I go over my XML file as a character
>>stream and evaluate them byte by byte and remove anything between 0 and
>>0x20 (except the 3 you mentioned)? Would this be a problem for UTF-8
>>because they could be multi-byted?
>
>No. 0x00-0x7F are one-byte only in UTF-8.
>
>For more info, here is a recycled answer:
>
>Alex Rosen writes:
>
>>Read more about Unicode and the various
>>encodings, e.g. http://www.cl.cam.ac.uk/~mgk25/unicode.html
>
>Les Hill
>leh at galaxynine.com