[jdom-interest] Dealing with binary characters in-memory -> outputter

Mon Sep 24 17:56:49 PDT 2001

> 
> I'm sorry to be so dense but I don't think this works.
>

Right!  I don't think it can work.  A character entity will be expanded by
the parser.  In element content we don't check for valid characters, for
performance reasons.  So now 0xA9 is expanded into a char.  On output you no
longer have any character entity, so the char is output as is.  On the second
read, there is no way the parser should accept this: it is not a valid XML
character, so it is rejected on the second pass.

Am I close?
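
For reference, here is a rough sketch of the kind of validity check the
parser applies on that second read.  The ranges are the Char production from
the XML 1.0 spec; the class and method names are made up for illustration and
are not the actual Verifier code.

```java
class XmlCharCheck {
    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first char that is not a legal XML character,
    // or -1 if the whole string is legal. (Hypothetical helper.)
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // A control character expanded from a character reference ends up
        // in memory as a plain char and fails the check when written raw.
        System.out.println(firstInvalidIndex("ab\u0001c"));
        System.out.println(firstInvalidIndex("plain text"));
    }
}
```

Running a check like this on element content at build or output time would
catch the problem before the second parse, at the cost of the per-character
test we currently skip.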

> I like your proposed approach.  Our plan thus far has been: For UTF-8
> encoding (the default) no encoding is necessary except for special
> characters (i.e. <) which we take care of.  For other encodings you set,
> you're responsible for handling things yourself.  
> 
> Your approach is to help handle other encodings.  Sounds good to me. 
> The problem we hit when thinking about it earlier is there's no standard
> library of which chars are in each character set.  So there may have to
> be a few supported "other encodings" (aka Latin-1) and other ones you
> want to use you have to write yourself (and perhaps donate).
> 
> Thoughts?

Yeah, we talked about this when we were talking about character entities that
are lost by the parser and handed to JDOM via SAX.  I think the idea is good
but I would like it to not be tied to the XMLOutputter methods.  What we
really need to know is if a character is in a particular encoding.  This is
a lot like the isXMLChar routines we have now in Verifier but encoding
specific.  If we knew to use a particular encoding while parsing, we could
determine what characters should be recast as character entities in the DTD.
We could also use the encoding to escape the correct characters on
outputting.  We must also be able to ignore the encoding entirely so we don't
have to check this unless the encoding is specified.
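
As a rough illustration of the "is this character in the encoding" test, the
sketch below leans on java.nio.charset.CharsetEncoder.canEncode from the JDK
rather than a hand-built table.  That is an assumption on my part, not
something JDOM does; EncodingAwareEscaper and escape() are hypothetical names.
Anything the target encoding cannot represent is written as a numeric
character reference.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodingAwareEscaper {
    // Hypothetical helper, not part of JDOM: escapes markup characters and
    // any character the named encoding cannot represent.
    static String escape(String text, String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            switch (cp) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:
                    if (encoder.canEncode(text.subSequence(i, i + len))) {
                        // Representable in the target encoding: emit as is.
                        out.appendCodePoint(cp);
                    } else {
                        // Not representable: fall back to a character reference.
                        out.append("&#x")
                           .append(Integer.toHexString(cp).toUpperCase())
                           .append(';');
                    }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 \u20ac 5 < 6";
        System.out.println(escape(text, "ISO-8859-1"));
        System.out.println(escape(text, "US-ASCII"));
    }
}
```

With ISO-8859-1 only the euro sign is turned into &#x20AC;, while with
US-ASCII the e-acute becomes &#xE9; as well.  Note this sketch does not check
XML validity of the characters, only whether the encoding can carry them.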

To test the Verifier class, I made a class that parsed the valid XML
characters from the XML spec document.  Maybe what would make sense is to
create a code generator that reads from an XML document and generates the
encoding class source from there.  We could provide encodings for the most
common cases of course.  Use cases beyond this would have to either create
their own encoding XML and generate the class, or subclass XMLOutputter as
was done in the workaround mentioned earlier.
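
To make the generator idea concrete, here is a minimal sketch.  Everything in
it is an assumption for illustration: the <encoding>/<range> description
format, the generated class shape, and the EncodingClassGenerator name.  It
uses the JDK's DOM parser only so the example stands alone.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class EncodingClassGenerator {
    // Reads a hypothetical description such as
    //   <encoding name="Latin1"><range from="0000" to="00FF"/></encoding>
    // and returns Java source for a class with a contains(int) test.
    static String generate(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("bad encoding description", e);
        }
        StringBuilder src = new StringBuilder();
        src.append("public final class ")
           .append(root.getAttribute("name")).append("Chars {\n");
        src.append("    public static boolean contains(int c) {\n");
        NodeList ranges = root.getElementsByTagName("range");
        for (int i = 0; i < ranges.getLength(); i++) {
            Element range = (Element) ranges.item(i);
            // Range bounds are hexadecimal code points in this sketch.
            int from = Integer.parseInt(range.getAttribute("from"), 16);
            int to = Integer.parseInt(range.getAttribute("to"), 16);
            src.append(String.format(
                "        if (c >= 0x%04X && c <= 0x%04X) return true;\n", from, to));
        }
        src.append("        return false;\n    }\n}\n");
        return src.toString();
    }

    public static void main(String[] args) {
        System.out.print(generate(
            "<encoding name=\"Latin1\"><range from=\"0000\" to=\"00FF\"/></encoding>"));
    }
}
```

The generated class is just a chain of range tests, which keeps the lookup
cheap and means an encoding nobody has shipped yet only needs a new XML file.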

But then I'm anxious for the Packer game to start ;-)  Maybe I'll shoot out
a starter class at halftime...

> 
> -jh-
> 
> "Trimmer, Todd" wrote:
> > 
> > Attila Szegedi writes:
> > 
> > The XMLOutputter authors do a pretty good job of &# escaping "common
> > renegade" characters, so maybe the ultimate solution is to add this one to
> > the set... The problem is that for every encoding, the set of chars that
> > must be escaped is different, and solving this problem on a per-encoding
> > basis would be too expensive, either in memory or in time terms. Using the
> > newly-introduced Encoder interface in java.io. of JDK1.4 should help, but
> > it'll take time until it gets mainstream...
> > 
> > -=-=-=-
> > 
> > I have never seen XMLOutputter produce a "&#" escaping under any encoding.
> > Looking at the source for escapeAttributeEntities() and
> > escapeElementEntities(), I don't see how it possibly could.
> > org.jdom.output.XMLWriter DOES escape characters this way, yet it does not
> > take the encoding into consideration.
> > 
> > If different encodings need different characters escaped, then why not have
> > a static inner class for each encoding? Sounds like a good use of a Strategy
> > Pattern to me.
> > 
> > By having them be inner classes we are marrying the encodings to the
> > XMLOutputter. It would be better if a programmer can supply his own Encoder
> > via a setter method for a more esoteric encoding. Yes, java.io.Encoder is a
> > JDK1.4 thing, but it doesn't look too hard for us to roll our own
> > org.jdom.output.Encoder interface, with stock implementations for the most
> > common encodings.
> > 
> > I, too, came across the same problems with XMLOutputter that Bennett was
> > having. I was also trying to use JDOM to read and manipulate HTML and then
> > spit it out to another process. The lack of "&#" disturbed me so much that I
> > subclassed XMLOutputter as HTMLOutputter and overrode
> > escapeAttributeEntities() and escapeElementEntities() to "&#"-escape
> > ISO-Latin characters above 168. Yes, it's a specific fix to a specific
> > problem, but, Bennett, I propose you use this workaround until the solution
> > with the Strategy Pattern can be written.
> > 
> > To get the ball rolling, what do readers of this newsgroup propose
> > org.jdom.output.Encoder have other than the following?
> > 
> > package org.jdom.output;
> > 
> > public interface Encoder
> > {
> >         // Escapes text for use in an attribute value.
> >         String escapeAttributeEntities(String st);
> > 
> >         // Escapes text for use in element content.
> >         String escapeElementEntities(String st);
> > }
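
To show how the pieces of that proposal might fit together, here is a small
sketch of the Strategy arrangement: the proposed Encoder interface, one stock
implementation, and a setter on the outputter.  None of this exists in JDOM;
AsciiEncoder, SketchOutputter and setEncoder are names I made up, and the
stock implementation simply "&#"-escapes everything above 0x7F.

```java
// The interface as proposed above (methods are implicitly public).
interface Encoder {
    String escapeAttributeEntities(String st);
    String escapeElementEntities(String st);
}

// Hypothetical stock implementation for a 7-bit ASCII target.
class AsciiEncoder implements Encoder {
    public String escapeAttributeEntities(String st) { return escape(st, true); }
    public String escapeElementEntities(String st) { return escape(st, false); }

    private static String escape(String st, boolean inAttribute) {
        StringBuilder out = new StringBuilder(st.length());
        for (int i = 0; i < st.length(); ) {
            int cp = st.codePointAt(i);
            if (cp == '<') out.append("&lt;");
            else if (cp == '>') out.append("&gt;");
            else if (cp == '&') out.append("&amp;");
            else if (cp == '"' && inAttribute) out.append("&quot;");
            else if (cp > 0x7F) out.append("&#").append(cp).append(';');
            else out.append((char) cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}

// Stand-in for XMLOutputter: it delegates escaping to whichever
// Encoder was supplied through the setter.
class SketchOutputter {
    private Encoder encoder = new AsciiEncoder();

    void setEncoder(Encoder encoder) { this.encoder = encoder; }

    String outputText(String text) { return encoder.escapeElementEntities(text); }

    public static void main(String[] args) {
        System.out.println(new SketchOutputter().outputText("\u00a9 2001 <jdom>"));
    }
}
```

Someone with a more esoteric encoding would write their own Encoder and hand
it to setEncoder, without having to subclass the outputter.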
> > _______________________________________________
> > To control your jdom-interest membership:
> > http://lists.denveronline.net/mailman/options/jdom-interest/youraddr@yourhost.com