[jdom-interest] Re: Manipulating a very large XML file
Tatu Saloranta
cowtowncoder at yahoo.com
Mon Mar 14 18:56:56 PST 2005
--- Jason Robbins <jrobbins at tigris.org> wrote:
>
>
> >We have a large XML file (around 5 GB) that should
> be modified based on
> >certain business rules. What parser can be used
> other than DOM ? Is it
> >possible to create a tree structure just for the
> segment that should be
> >modified ?
>
> As others pointed out, if you have 5GBs of data, you
> probably should
> not be keeping it in one XML file. Did you mean
> 5MB?
Actually, I have heard of XML files of that size
(it came up because that's beyond the signed 32-bit
int range, which causes problems for location info...)
However, at that point it would make sense to consider
a streaming (SAX, StAX) approach: they will scale to
large files reasonably well.
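
As a rough sketch of what the streaming approach could look like with
the StAX event API (javax.xml.stream): copy events from input to
output and only touch the ones a rule cares about, so memory use stays
roughly constant regardless of file size. The class name
StreamingRewrite, the targetName parameter and the "upper-case the
text" rule are all made up for illustration; real business rules
would go where the text is modified.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.Locale;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of JDOM or StAX.
class StreamingRewrite {

    // Copies the document from 'in' to 'out' one event at a time.
    // Text found anywhere inside an element named 'targetName' is
    // upper-cased (a stand-in for a real business rule).
    static void rewrite(Reader in, Writer out, String targetName)
            throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer =
                XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        int depthInTarget = 0; // > 0 while inside a target element
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement() && e.asStartElement().getName()
                    .getLocalPart().equals(targetName)) {
                depthInTarget++;
            } else if (e.isEndElement() && e.asEndElement().getName()
                    .getLocalPart().equals(targetName)) {
                depthInTarget--;
            }
            if (depthInTarget > 0 && e.isCharacters()) {
                String text = e.asCharacters().getData();
                e = events.createCharacters(text.toUpperCase(Locale.ROOT));
            }
            writer.add(e); // everything else passes through unchanged
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

For a real file one would pass buffered file readers/writers instead
of in-memory ones; only the current event is held at any time.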
...
> I am thinking of an internal data structure that
> would represent
> the complete DOM or JDOM tree for an N-byte XML file
> in more like
> N/2 to N bytes of RAM. This data structure would be
> fully parsed
I am quite sceptical about getting to N/2: although
there is a bit of redundancy (element and attribute
names), sharing names usually won't do more than offset
the natural overhead (4 bytes per reference value to
represent shared names; or for namespace-aware cases,
8 bytes). Text content is usually in UTF-8, which
generally uses one byte per char (for docs that have
English text, that is). It would be possible to use a
gzip (etc.) algorithm on the text to get to N/2, but
that would only/mostly help for docs with long text
segments, and would be slower to access.
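
To make the gzip trade-off concrete, here is a minimal sketch using
java.util.zip that stores a text segment as gzip-compressed UTF-8
bytes. The class CompressedText and its method names are hypothetical,
not anything in JDOM. Since gzip adds a fixed header and trailer, a
very short segment ends up larger than its raw bytes, while a long
repetitive one shrinks a lot; and every read has to pay for
decompress().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical holder for compressed text content.
class CompressedText {

    // Encodes the text as UTF-8 and gzips the resulting bytes.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buf);
        gz.write(text.getBytes(StandardCharsets.UTF_8));
        gz.close(); // finishes the gzip stream and writes the trailer
        return buf.toByteArray();
    }

    // Reverses compress(): gunzips and decodes the bytes as UTF-8.
    static String decompress(byte[] data) throws IOException {
        GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = gz.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        gz.close();
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```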
That said, a JDOM implementation that is more aware of
memory limitations would definitely be worth
considering. The first concern, though, may be the
fact that JDOM was not designed for multiple
implementations.
-+ Tatu +-