[jdom-interest] SAXHandler / CDATA / entities

Ingo Struck ingo at ingostruck.de
Tue Nov 19 14:11:50 PST 2002


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi...

> I am confused by your statement....
>
> JDOM does cope with CDATA just fine. You can put all of those characters in
> a CDATA now.
Right... I was wrong on this point; it really works.

What does *not* work properly is the decoding of characters.
The basic problem is that the decoding happens *before* parsing. For example,
if I want to avoid the CDATA section, I would just write something like:

<SomeNode>Here is some embedded HTML with a &#60;br&#62; in it.</SomeNode>

(The reason for using numeric character references is that most characters
can be encoded with a uniform length, a fact that could be used to speed up
the escaping process significantly. If you want all ASCII characters at a
uniform length, the hexadecimal form is even better.)
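A minimal sketch of the uniform-length idea, not taken from JDOM: the class
name UniformEscaper and the choice to escape only the five markup characters
are assumptions for illustration. Because every reference has the same
length ("&#x3C;" is six characters), the output size can be computed in one
counting pass and the buffer filled without any resizing.

```java
// Hypothetical helper, not part of JDOM.
final class UniformEscaper {
    /** Every reference written here is exactly this long, e.g. "&#x3C;". */
    static final int REF_LENGTH = 6;
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Assumption: only the five XML markup characters are escaped. */
    static boolean needsEscape(char c) {
        return c == '<' || c == '>' || c == '&' || c == '"' || c == '\'';
    }

    static String escape(String text) {
        // Pass 1: count the characters that need a reference.
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (needsEscape(text.charAt(i))) {
                count++;
            }
        }
        if (count == 0) {
            return text;
        }
        // Each escaped char grows from 1 to REF_LENGTH chars, so the
        // final size is known up front.
        char[] out = new char[text.length() + count * (REF_LENGTH - 1)];
        int pos = 0;
        // Pass 2: copy, writing a two-digit hex reference where needed.
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (needsEscape(c)) {
                out[pos++] = '&';
                out[pos++] = '#';
                out[pos++] = 'x';
                out[pos++] = HEX[c >> 4];   // all five chars are below 0x80
                out[pos++] = HEX[c & 0xF];
                out[pos++] = ';';
            } else {
                out[pos++] = c;
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // prints: Here is some embedded HTML with a &#x3C;br&#x3E; in it.
        System.out.println(escape("Here is some embedded HTML with a <br> in it."));
    }
}
```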
If you feed the <SomeNode> element above into JDOM, the characters are decoded to
 
<SomeNode>Here is some embedded HTML with a <br> in it.</SomeNode>

which, of course, is not valid XML. The solution provided here (excluding the
five "named" entities and, as I proposed as a fix, the corresponding numeric
character references) is the wrong approach IMHO. It would be much cleaner to
parse the document first and decode the characters *afterwards*. Then you can
be 100% sure that the parsed document contains only the nodes of the
serialized form and not some "embedded" content that has been decoded and
parsed by mistake.
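To make the "decode afterwards" step concrete, here is a rough sketch. It
assumes the text content is available with its numeric character references
still intact once the node structure has been built. The class LateDecoder
is hypothetical and not part of JDOM; it handles only decimal and hexadecimal
references, not named entities.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JDOM.
final class LateDecoder {
    // Matches "&#60;" (decimal) or "&#x3C;" (hexadecimal).
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /**
     * Decodes numeric character references in already-parsed text.
     * Out-of-range values make parseInt or toChars throw; a real
     * implementation would have to decide how to report that.
     */
    static String decode(String text) {
        Matcher m = CHAR_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            String decoded = new String(Character.toChars(codePoint));
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Here is some embedded HTML with a <br> in it.
        System.out.println(
                decode("Here is some embedded HTML with a &#60;br&#62; in it."));
    }
}
```

Since the decoding runs on a plain string after the tree exists, the
resulting "<br>" can only ever be character data, never a new node.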

Kind regards

Ingo Struck

- -- 
ingo at ingostruck.de
Use PGP: http://ingostruck.de/ingostruck.gpg with fingerprint
C700 9951 E759 1594 0807  5BBF 8508 AF92 19AA 3D24
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE92rcrhQivkhmqPSQRAuH0AJ9i0YvAs1r+n55uwrJdYVrI8Cr1MgCgpsI1
gMZzGUA+A7umw1zJEWZOs8g=
=ZAWf
-----END PGP SIGNATURE-----



