[jdom-interest] Internal DTD subset verification

Thu May 9 09:53:23 PDT 2002

Philip Nelson wrote:

>One caveat though.  A while back I remember testing with MinML2 and saw that
>most of the time during the parse was spent converting from byte to char (using
>IBM profiling tools).  Could the use of a byte array be skewing the results
>more than reading from disk would?
>
Conversion from bytes to chars is definitely a large part of actual
parse time. But unless your documents are coming to you already in
Unicode (and AFAIK Java doesn't support direct Unicode I/O), it's going
to be an inevitable part of parsing one way or another. Because of
this, it makes sense to me to include the conversion in the parse time.
Some parsers include custom code to handle conversions from common
encodings, which gives a real performance boost that carries over to
actual applications.
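
To make the conversion step concrete, here is a minimal sketch of the
byte-to-char decoding that happens somewhere whenever a parser is
handed a byte stream. The class and method names (DecodeSketch,
decodeUtf8) are hypothetical, and it uses StandardCharsets from a later
JDK than the ones current for this thread. It only illustrates the
step; it is not how any particular parser implements it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class DecodeSketch {
    // Hypothetical helper: decodes UTF-8 bytes into chars through the
    // standard InputStreamReader path. A parser with custom decoding
    // code would replace this generic step with its own loop.
    static String decodeUtf8(byte[] data) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            // Each read() call pulls bytes and converts them to chars.
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

Timing a loop like this separately from the parse itself would show
how much of the total the conversion accounts for.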

The one case I can think of where you'd get your documents already in
char[] form is when they're coming from a database or a similar source.
Even here, you'd probably be better off storing them as UTF-8 and
accessing them as byte[] instead, letting the parser convert to char as
needed. The overhead should be lower this way than if you have the
database do the conversion.
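
As a rough sketch of the byte[] approach, assuming the document has
already been fetched as UTF-8 bytes, the array can be handed straight
to a JAXP SAX parser so the parser does the decoding itself. The class
and method names (ByteParseSketch, countElements) are hypothetical, and
counting elements is just a stand-in for real handler work.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ByteParseSketch {
    // Hypothetical helper: parses UTF-8 encoded XML directly from a
    // byte array and returns the number of elements seen. No String or
    // char[] copy of the document is built before parsing.
    static int countElements(byte[] utf8Xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        final int[] count = {0};
        parser.parse(new ByteArrayInputStream(utf8Xml), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        return count[0];
    }
}
```

Whether this actually beats having the database return chars depends
on the driver and the parser, so it is worth measuring both paths.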

  - Dennis