[jdom-interest] CDATA inconsistency

Elliotte Rusty Harold elharo at metalab.unc.edu
Sat Nov 2 15:33:25 PST 2002


At 11:32 AM -0800 11/2/02, Malachi de AElfweald wrote:
>"unmatched halves of surrogate pairs".... That would be assuming 
>UTF-8 specifically,
>would it not? ISO-8859-1, for example, does not have surrogate pairs.
>

No, it's assuming Java. A Java char is *not* a Unicode character. It 
is a UTF-16 code unit. In UTF-16 (UTF-8 does not use surrogate 
pairs), characters from outside the Basic Multilingual Plane (BMP) 
are represented as two consecutive surrogate code units: a high 
surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to 
U+DFFF). However, neither Java nor JDOM does any checking to make 
sure the surrogates match up like they're supposed to. Both just 
assume each char is legal.
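
To make the point concrete, here is a rough sketch of the kind of 
check that is missing. It is an illustration only, not anything JDOM 
actually does. The class name SurrogateCheck and the method 
hasUnmatchedSurrogate are made up for this example, and it assumes a 
JDK that provides Character.isHighSurrogate, Character.isLowSurrogate 
and String.codePointCount (Java 5 or later).

```java
public final class SurrogateCheck {

    private SurrogateCheck() {}

    /**
     * Hypothetical helper: returns true if s contains a surrogate
     * char that is not part of a well-formed high+low pair.
     */
    public static boolean hasUnmatchedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed immediately by a low one.
                if (i + 1 >= s.length()
                        || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return true;
                }
                i++; // skip the low half of the well-formed pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate with no high surrogate before it.
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
        // so it takes two Java chars.
        String gClef = "\uD834\uDD1E";
        System.out.println(gClef.length());                          // 2
        System.out.println(gClef.codePointCount(0, gClef.length())); // 1
        System.out.println(hasUnmatchedSurrogate(gClef));            // false
        System.out.println(hasUnmatchedSurrogate("a\uD834b"));       // true
    }
}
```

A String holding "a\uD834b" is perfectly acceptable to Java, which is 
exactly the problem: nothing stops it from ending up in a document 
unless something like the loop above is run first.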
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          XML in a  Nutshell, 2nd Edition (O'Reilly, 2002)          |
|              http://www.cafeconleche.org/books/xian2/              |
|  http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/  |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
+----------------------------------+---------------------------------+


