[jdom-interest] Re: Manipulating a very large XML file

Mon Mar 14 18:56:56 PST 2005

--- Jason Robbins <jrobbins at tigris.org> wrote:
> 
> 
> >We have a large XML file (around 5 GB) that should be modified
> >based on certain business rules. What parser can be used other
> >than DOM? Is it possible to create a tree structure just for
> >the segment that should be modified?
> 
> As others pointed out, if you have 5GBs of data, you probably
> should not be keeping it in one XML file.  Did you mean 5MB?

Actually, I have heard of XML files of that size
(it came up because that's beyond the 32-bit int
range, which causes problems for location info...)
However, at that point it makes sense to consider a
streaming approach (SAX, StAX): those scale to large
files reasonably well, since they never need to hold
the whole document in memory.
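
To make the streaming idea a bit more concrete, here is a minimal
sketch using the StAX event API (javax.xml.stream). It copies a
document from a Reader to a Writer one event at a time and rewrites
only the text inside one named element. The class name, the method
name, and the "upper-case the text" rule are made up for illustration;
a real business rule would go where the comment says so.

```java
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of JDOM or any existing library.
class StreamingRewrite {

    // Copies XML from 'in' to 'out' event by event. Only the current
    // event is held in memory, so the input size does not matter much.
    static void rewrite(Reader in, Writer out, String target)
            throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer =
                XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        boolean inTarget = false;
        while (reader.hasNext()) {
            XMLEvent ev = reader.nextEvent();
            if (ev.isStartElement() && target.equals(
                    ev.asStartElement().getName().getLocalPart())) {
                inTarget = true;
            } else if (ev.isEndElement() && target.equals(
                    ev.asEndElement().getName().getLocalPart())) {
                inTarget = false;
            } else if (inTarget && ev.isCharacters()) {
                // Placeholder "business rule": upper-case the text.
                String text = ev.asCharacters().getData();
                ev = events.createCharacters(text.toUpperCase());
            }
            writer.add(ev); // everything else is passed through as-is
        }
        writer.close();
        reader.close();
    }
}
```

For a real 5 GB file the Reader and Writer would wrap file streams
rather than strings. If a rule needs a small tree for one segment,
the events for just that segment could be collected and built into
a tree, while the rest of the document is still passed straight
through.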

...
> I am thinking of an internal data structure that would represent
> the complete DOM or JDOM tree for an N-byte XML file in more like
> N/2 to N bytes of RAM.  This data structure would be fully parsed

I am quite sceptical about getting to N/2. Although
there is a bit of redundancy (element and attribute
names), sharing those names usually won't do more than
offset the natural overhead (4 bytes per reference to
a shared name; or 8 bytes in namespace-aware cases).
Text content is usually in UTF-8, which generally
uses one byte per char (for docs that have English
text, that is), so the text in the file is already
fairly compact. It would be possible to use gzip or a
similar algorithm on text to get to N/2, but that
would mostly help docs with long text segments, and
it would be slower to access.
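
As a rough illustration of that gzip trade-off, here is a small
sketch of a text holder that keeps its content gzip-compressed and
inflates it on every read. The class CompressedText and its methods
are hypothetical, not anything JDOM provides. It is only meant to
show why long text segments benefit, why short ones do not (the gzip
header and trailer alone cost about 18 bytes), and where the extra
access cost comes from.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical text node storage: gzip-compressed UTF-8 bytes.
class CompressedText {
    private final byte[] packed;

    CompressedText(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        packed = buf.toByteArray();
    }

    // Number of bytes actually held for this text segment.
    int storedBytes() {
        return packed.length;
    }

    // Every read has to inflate the data again: this is the
    // "slower to access" part of the trade-off.
    String getText() throws IOException {
        try (GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(packed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

Comparing storedBytes() with the UTF-8 length of the original string
for a few typical segments of a given document would show quickly
whether this is worth doing for that document.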

That said, a JDOM implementation that is more aware
of memory limitations would definitely be worth
considering. The first concern, though, may be the
fact that JDOM was not designed for multiple
implementations.

-+ Tatu +-
