[jdom-interest] Internal DTD subset verification
Dennis Sosnoski
dms at sosnoski.com
Thu May 9 09:53:23 PDT 2002
Philip Nelson wrote:
>One caveat though. A while back I remember testing with MinML2 and saw that
>most of the time during the parse was spent converting from byte to char (using
>IBM profiling tools). Could the use of a byte array be skewing the results
>worse than reading from disk?
>
Conversion from bytes to chars is definitely a large part of actual
parse time. But unless your documents are coming to you already in
Unicode (and AFAIK Java doesn't support direct Unicode I/O) it's going
to be an inevitable part of the parsing one way or another. To me it
makes sense to include the conversion in the parse time for that reason:
some parsers include custom code to handle conversions from common
encodings, providing a real performance boost that will apply in actual
applications.
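To make the conversion step concrete, here is a minimal sketch of turning
UTF-8 bytes into chars with the standard library decoder. It is only an
illustration, not code from any of the parsers mentioned: the class name
ByteToCharSketch and the method decodeUtf8 are made up for this example,
and it assumes a Java version that has java.nio.charset.StandardCharsets.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of JDOM or any parser.
class ByteToCharSketch {

    // Decode a UTF-8 byte array into a char array using the JDK decoder.
    // A parser reading from a byte source has to do equivalent work,
    // either through the JDK or through its own conversion code.
    static char[] decodeUtf8(byte[] input) {
        CharBuffer decoded = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(input));
        char[] result = new char[decoded.remaining()];
        decoded.get(result);
        return result;
    }

    public static void main(String[] args) {
        // "\u00e9" (e with acute accent) takes two bytes in UTF-8 but one char.
        byte[] doc = "<doc>caf\u00e9</doc>".getBytes(StandardCharsets.UTF_8);
        char[] chars = decodeUtf8(doc);
        // prints "16 bytes -> 15 chars"
        System.out.println(doc.length + " bytes -> " + chars.length + " chars");
    }
}
```

Every byte of the document passes through a step like this before the
parser can look at it as character data, which is why it shows up so
prominently in profiles.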
The one case I can think of where you'd get your documents already in
char[] form is when they're coming from a database or such. Even here,
you'd probably be better off storing them as UTF-8 and accessing them as
byte[] instead, letting the parser convert to char as needed. The
overhead should be lower this way than if you have the database do the
conversion.
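As a rough sketch of the two input paths, the example below hands the
same document to a JAXP SAX parser once as byte[] (the parser does the
decoding) and once as already-decoded chars. This is not the benchmark
code from this thread; the class ParseInputSketch and its helper methods
are hypothetical names, and it assumes whatever SAX parser the JDK
provides by default.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of JDOM or any parser.
class ParseInputSketch {

    // Minimal handler that counts start tags so the parse does real work.
    static final class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            count++;
        }
    }

    // Parse one source with the default JAXP SAX parser, return element count.
    static int countElements(InputSource source) {
        try {
            ElementCounter counter = new ElementCounter();
            SAXParserFactory.newInstance().newSAXParser().parse(source, counter);
            return counter.count;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Byte path: the parser detects the encoding and converts to chars itself.
    static int parseFromBytes(byte[] utf8) {
        return countElements(new InputSource(new ByteArrayInputStream(utf8)));
    }

    // Char path: the text was decoded before the parser ever saw it.
    static int parseFromChars(String text) {
        return countElements(new InputSource(new StringReader(text)));
    }

    public static void main(String[] args) {
        String text = "<root><item>caf\u00e9</item><item/></root>";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("from bytes: " + parseFromBytes(utf8));
        System.out.println("from chars: " + parseFromChars(text));
    }
}
```

Timing only the second call would leave the decoding cost out of the
measurement, even though something upstream still had to pay it.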
- Dennis