[jdom-interest] A suggested performance improvement

Vadim.Strizhevsky at morganstanley.com Vadim.Strizhevsky at morganstanley.com
Mon Mar 17 19:06:26 PST 2003


In my own testing I also found that Verifier was taking a significant
portion of the time (at least 20%) of reading XML files in. What I also
found concerning is that, as far as I can tell, when reading an XML file
this duplicates the identical effort that the parser has already
done. JDOM will never get callbacks from a correct SAX parser with
characters that would fail these tests. So in theory JDOM should only do
this for text that doesn't come from an XML parser, or at least doesn't
come from a "correct" XML parser. Not sure how you could do this, as the
same API is used to construct the JDOM structure in both cases, but I
think it's worth thinking about.
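
One possible shape for this, as a rough sketch only: keep the checked
entry point for application code and give the parser-driven builder a
separate unchecked one. The names CheckedText, of and fromParser are
made up for illustration and are not part of the JDOM API, and the check
here looks at single chars only (no surrogate pairs).

```java
// Hypothetical sketch: two construction paths for the same text holder.
final class CheckedText {
    private final String value;

    private CheckedText(String value) {
        this.value = value;
    }

    // Path for application code: the text is verified before it is accepted.
    static CheckedText of(String text) {
        String reason = check(text);
        if (reason != null) {
            throw new IllegalArgumentException(reason);
        }
        return new CheckedText(text);
    }

    // Path for a builder fed by a conforming parser: the parser has already
    // rejected illegal characters, so the text is stored without re-checking.
    static CheckedText fromParser(String text) {
        return new CheckedText(text);
    }

    String value() {
        return value;
    }

    // Returns null if the text is fine, otherwise the reason it is not.
    private static String check(String text) {
        if (text == null) {
            return "A null is not a legal XML value";
        }
        for (int i = 0, len = text.length(); i < len; i++) {
            char c = text.charAt(i);
            boolean ok = c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD);
            if (!ok) {
                return "0x" + Integer.toHexString(c)
                        + " is not a legal XML character";
            }
        }
        return null;
    }
}
```

The cost is that anything calling the unchecked path has to be trusted,
so it would have to stay out of the public API.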

Both Xerces and Crimson (the only two I checked) do this using very similar
but more efficient test routines. Xerces actually has a complete array
built up statically for all 0000-FFFF chars, so checks are simple
lookups. Crimson uses divide and conquer based on byte ranges.
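
To make the table-lookup idea concrete, here is a minimal sketch. It is
not the actual Xerces code; the class name XmlCharTable is made up, and
the ranges are the XML 1.0 Char production limited to what fits in a
single 16-bit char.

```java
// Hypothetical sketch of a precomputed validity table for XML 1.0 characters.
final class XmlCharTable {
    // One flag per 16-bit char value (0x0000-0xFFFF), filled once at class load.
    private static final boolean[] VALID = new boolean[0x10000];

    static {
        VALID[0x9] = true;   // tab
        VALID[0xA] = true;   // line feed
        VALID[0xD] = true;   // carriage return
        for (int c = 0x20; c <= 0xD7FF; c++) {
            VALID[c] = true;
        }
        for (int c = 0xE000; c <= 0xFFFD; c++) {
            VALID[c] = true;
        }
    }

    private XmlCharTable() {
    }

    // A single array read replaces the chain of range comparisons.
    static boolean isXMLCharacter(char c) {
        return VALID[c];
    }
}
```

The table costs 64K booleans of memory, which is the trade-off against
the chain of comparisons.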

-Vadim

On Mon, 17 Mar 2003, Alex Rosen wrote:

> Very interesting. I guess your document has lots of text content in it?
>
> What platform and VM are you running on? It's too bad HotSpot doesn't
> inline isXMLCharacter, I guess it's too big. Making it final doesn't
> help does it?
>
> Anyway, your suggestion seems like a good idea, though a bummer that we
> have to do it.
>
> BTW - the last two "if"s in isXMLCharacter are useless, since a char
> can never be more than FFFF.
>
> Which brings up another point. If I understand things correctly, JDK
> 1.5 will support Unicode characters larger than FFFF, which will
> probably be represented by surrogate pairs, so all these isXML...
> methods will need to be completely revamped at that time. (You won't be
> able to check for a valid character by checking just one char.) What a
> mess.
>
> Plus, if we ever want to support XML 1.1 when it comes out, we'll need
> to figure out what to do with Verifier again - we'll need two different
> versions then. If it weren't for Verifier, all we'd need to deal with is
> outputting version="1.1".
>
> Yup... I really hate the Verifier.
>
> Alex
>
> >>> Tom Oke <tomo at elluminate.com> 3/16/2003 8:18:44 PM >>>
> I have noticed, on large XML files, that the majority of the CPU time
> is going into the routines: Verifier.isXMLCharacter and
> Verifier.checkCharacterData.
>
> I had initially modified isXMLCharacter to have it check the most
> likely range of data first, to get a short exit, and this took off
> about 25% of the CPU used in some large files, for the JDOM read.
>
> However, in the thread doing the JDOM input, 62% of the time
> was still in isXMLCharacter and 16% was in checkCharacterData,
> which calls isXMLCharacter.
>
> The biggest bang for the buck came from wrapping the
> if statement that calls isXMLCharacter in a test for the
> most likely good range. This is seen below in the two
> lines:
>
>             char c = text.charAt(i);
>             if (!(c > 0x1F && c < 0xD800)) {
>
> This reduced checkCharacterData to 1.32% of the thread use,
> and isXMLCharacter doesn't really show up at all.
>
> Hopefully this is a reasonable change to submit to JDOM?
>
> What follows is the full code for Verifier.checkCharacterData.
>
>
>
>     public static final String checkCharacterData(String text) {
>         if (text == null) {
>             return "A null is not a legal XML value";
>         }
>
>         // do check
>         for (int i = 0, len = text.length(); i<len; i++) {
>             char c = text.charAt(i);
>             if (!(c > 0x1F && c < 0xD800)) {
>                 if (!isXMLCharacter(text.charAt(i))) {
>                     // Likely this character can't be easily displayed
>                     // because it's a control so we use its hexadecimal
>                     // representation in the reason.
>                     return ("0x" + Integer.toHexString(text.charAt(i))
>                             + " is not a legal XML character");
>                 }
>             }
>         }
>
>         // If we got here, everything is OK
>         return null;
>     }
>
> Tom Oke
> _______________________________________________
> To control your jdom-interest membership:
> http://lists.denveronline.net/mailman/options/jdom-interest/youraddr@yourhost.com
>
>



