[jdom-interest] CDATA inconsistency
Elliotte Rusty Harold
elharo at metalab.unc.edu
Sat Nov 2 15:33:25 PST 2002
At 11:32 AM -0800 11/2/02, Malachi de AElfweald wrote:
>"unmatched halves of surrogate pairs".... That would be assuming
>UTF-8 specifically,
>would it not? ISO-8859-1, for example, does not have surrogate pairs.
>
No, it's assuming Java. A Java char is *not* a Unicode character. It
is a UTF-16 code unit. In UTF-16 (UTF-8 does not use surrogate
pairs), characters from outside the Basic Multilingual Plane (BMP)
are represented as two consecutive surrogate code units: a high
surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF).
However, neither Java nor JDOM does any checking to make sure the
surrogates match up like they're supposed to. Both just assume each
char is legal.
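
To make that concrete, here is a rough sketch of the kind of check that
is not being done. The class and method names (SurrogateCheck,
hasUnmatchedSurrogate) are made up for illustration and are not part of
JDOM. It also assumes a JDK that provides Character.isHighSurrogate and
Character.isLowSurrogate, which only arrived in Java 1.5; on an older
JDK you would compare against the numeric ranges by hand.

```java
public class SurrogateCheck {

    /**
     * Hypothetical helper: returns true if the string contains a high
     * surrogate that is not immediately followed by a low surrogate,
     * or a low surrogate that is not immediately preceded by a high one.
     */
    public static boolean hasUnmatchedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair: skip its low surrogate
                } else {
                    return true; // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high one
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, written as its two UTF-16 code units
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                      // 2
        System.out.println(hasUnmatchedSurrogate(clef));        // false
        // A lone high surrogate is still accepted as a perfectly good char
        System.out.println(hasUnmatchedSurrogate("a\uD834b"));  // true
    }
}
```

Note that the String holding the lone surrogate is constructed without
complaint; only an explicit scan like this one notices the problem.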
--
+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
| XML in a Nutshell, 2nd Edition (O'Reilly, 2002) |
| http://www.cafeconleche.org/books/xian2/ |
| http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/ |
+----------------------------------+---------------------------------+
| Read Cafe au Lait for Java News: http://www.cafeaulait.org/ |
| Read Cafe con Leche for XML News: http://www.cafeconleche.org/ |
+----------------------------------+---------------------------------+