[jdom-interest] Attribute.getSerializedForm bug [eg]

Jason Hunter jhunter at collab.net
Fri Apr 13 11:24:35 PDT 2001


> > It does matter.  That logic will have to move into XMLOutputter.
> > Did you post about that problem before?
> >
> 
> This logic is already included in XMLOutputter.escapeAttributeEntities

Good.

> By the way, I see that Megginson's writeEsc method in samples.sax.XMLWriter
> outputs character references for the characters greater than '\u007f', as
> follows:
> 
>         if (ch[i] > '\u007f') {
>             write("&#");
>             write(Integer.toString(ch[i]));
>             write(';');
> 
> Should XMLOutputter acquire this logic, too?

Hmm... that's a pretty ASCII-centric view of the world.  Someone dealing
with Japanese documents in Shift_JIS would probably want the text output
to be in pure Shift_JIS so it works well with their editor, and wouldn't
want every Japanese character represented by a character reference.
Such a plan would also increase document size about fivefold.

The general "right" solution is probably to check each character as it
goes out and, if it's not in the chosen encoding's character set, to
output a character reference.  The problem is that for many encodings
such a check isn't fast at all (less than this, greater than that, less
than this, greater than that), nor is the information about which chars
are in which character set easily available (to my knowledge).
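For illustration only, here is a minimal sketch of that per-character
check.  It assumes a JDK that ships java.nio.charset (added in JDK 1.4,
so later than this post), and the class and method names are made up for
the example; it is not what XMLOutputter does.  It only deals with
unencodable characters, not with escaping &, < and friends.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of JDOM.
final class EncodingAwareEscaper {

    /**
     * Returns the text with every code point the named encoding cannot
     * represent replaced by a decimal character reference (&#NNN;).
     */
    static String escapeUnencodable(String text, String encodingName) {
        // Assumption: the encoding name is one the running JVM supports.
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            // Test the whole code point so surrogate pairs stay together.
            CharSequence piece = text.subSequence(i, i + len);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += len;
        }
        return out.toString();
    }
}
```

Called with "US-ASCII" this behaves like escaping everything above
\u007f; called with "UTF-8" it leaves well-formed text alone.  How fast
canEncode is for a multi-byte encoding such as Shift_JIS would need to
be measured rather than assumed.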

Brainstorming options, maybe choices we present to the programmer:

a) Just write the text.  Don't do any checking.  I know what I'm doing
and want things fast.

b) Escape anything that's not ASCII (above \u007f).

c) Escape anything that's out of my chosen charset/encoding. 
Interestingly, this is the same as (a) for UTF-8 or UCS-2 where all
chars can be represented.  And it's similar to (b) for Latin-1 because
you'd just escape what's above \u00ff.

d) Let people override 
  protected String escapeElementEntities(String st) 
if they want special behavior.
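To make the four options concrete, here is one way they could be put
together.  This is a sketch under assumptions, not a proposal for the
actual XMLOutputter API: the class, the Mode enum, and needsEscape are
all hypothetical names, and the charset check again assumes
java.nio.charset is available.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch of options (a) through (d); not JDOM code.
class EscapePolicySketch {

    enum Mode {
        NONE,           // (a) write the text as-is
        NON_ASCII,      // (b) escape anything above \u007f
        OUT_OF_CHARSET  // (c) escape what the encoding cannot represent
    }

    private final Mode mode;
    private final CharsetEncoder encoder;

    EscapePolicySketch(Mode mode, String encodingName) {
        this.mode = mode;
        this.encoder = Charset.forName(encodingName).newEncoder();
    }

    // (d) a subclass can override this hook for special behavior.
    protected boolean needsEscape(int codePoint) {
        switch (mode) {
            case NON_ASCII:
                return codePoint > 0x7F;
            case OUT_OF_CHARSET:
                return !encoder.canEncode(
                        new String(Character.toChars(codePoint)));
            default:
                return false;
        }
    }

    String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (needsEscape(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this shape the fast path (a) costs one switch per character, and
the expensive check is only paid by callers who ask for (c).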

If we only supported UTF-8, UCS-2, Latin-1, and ASCII, it'd be trivial
and proper to do (c).  It's the Shift_JIS charsets that make things
tricky.  I'd like to support any encoding Java knows about, but how can
we easily/quickly determine if a given char is in a character set or
not?
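The "trivial" case for those four encodings can be sketched as a single
upper bound per encoding.  The names below are hypothetical and the
list of accepted encoding names is only an example; anything without a
simple bound, such as Shift_JIS, is rejected rather than guessed at.

```java
import java.util.Locale;

// Hypothetical sketch: option (c) for encodings with a simple upper bound.
final class ThresholdEscaper {

    /** Highest code point that can be written without a character reference. */
    static int highestPlainCodePoint(String encoding) {
        switch (encoding.toUpperCase(Locale.ROOT)) {
            case "US-ASCII":
            case "ASCII":
                return 0x7F;
            case "ISO-8859-1":
            case "LATIN-1":
                return 0xFF;
            case "UCS-2":
                return 0xFFFF;
            case "UTF-8":
                return Character.MAX_CODE_POINT;
            default:
                throw new IllegalArgumentException(
                        "no simple threshold for " + encoding);
        }
    }

    static String escape(String text, String encoding) {
        int limit = highestPlainCodePoint(encoding);
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > limit) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

One comparison per character is about as cheap as a check can get,
which is why these encodings are the easy ones.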

-jh-


