[jdom-interest] Re: Manipulating a very large XML file
Jason Robbins
jrobbins at tigris.org
Tue Mar 15 07:57:24 PST 2005
Hi Tatu, Cecil, and everyone,
On Mon, 2005-03-14 at 18:56, Tatu Saloranta wrote:
> I am quite sceptical about getting to N/2, since
> although there is a bit of redundancy (element and
> attribute names), it usually won't do more than offset
> the natural overhead (4 bytes per reference value to
> represent shared names; or for namespace-aware cases,
> 8 bytes).
Exactly. That was the hardest thing for me to get around.
But reducing the number of bytes per node in the representation
of the tree structure from around 80 down to more like 8
is the key to the whole thing.
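
To make the 8-bytes-per-node figure concrete, here is a minimal sketch of one possible layout. It is not the design being hinted at in this message, which is deliberately left unexplained. It assumes elements only (no text or attributes), nodes stored in document order, and two ints per node: an index into a shared name table and the index one past the node's last descendant. The class name CompactTree and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: an element tree in document order, two ints (8 bytes) per node. */
final class CompactTree {
    private int[] nameId = new int[16];      // per node: index into the shared name table
    private int[] subtreeEnd = new int[16];  // per node: index one past its last descendant
    private int size = 0;

    private final List<String> names = new ArrayList<>();          // shared element names
    private final Map<String, Integer> nameIndex = new HashMap<>();

    private int[] open = new int[16];  // stack of currently open elements (build time only)
    private int depth = 0;

    /** Appends a new element in document order and returns its node index. */
    int startElement(String name) {
        if (size == nameId.length) {
            nameId = Arrays.copyOf(nameId, size * 2);
            subtreeEnd = Arrays.copyOf(subtreeEnd, size * 2);
        }
        Integer id = nameIndex.get(name);
        if (id == null) {
            id = names.size();
            names.add(name);
            nameIndex.put(name, id);
        }
        nameId[size] = id;
        if (depth == open.length) {
            open = Arrays.copyOf(open, depth * 2);
        }
        open[depth++] = size;
        return size++;
    }

    /** Closes the most recently opened element. */
    void endElement() {
        subtreeEnd[open[--depth]] = size;
    }

    String name(int node) {
        return names.get(nameId[node]);
    }

    /** Returns the first child's index, or -1 if the node has no children. */
    int firstChild(int node) {
        return subtreeEnd[node] > node + 1 ? node + 1 : -1;
    }

    /** Returns the next sibling's index, or -1 if the node is its parent's last child. */
    int nextSibling(int node, int parent) {
        int next = subtreeEnd[node];
        return next < subtreeEnd[parent] ? next : -1;
    }

    int nodeCount() {
        return size;
    }
}
```

In this layout the first child of a node is simply the next slot, so no child or parent references are stored per node. The trade-off is that the tree is append-only and finding a node's parent needs extra work, which is one reason a real design would likely look different.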
> Text content is usually in UTF-8, which
> generally uses one byte per char (for docs that have
> English text that is). It would be possible to use
> gzip (etc) algorithm for text, to get to N/2, but that
> would only/mostly help for docs with long text
> segments... and would be slower to access.
You are right, that would be way too slow. If your document
is heavy on text, it would take more like N bytes of RAM, where
N is the size of the XML file. If your document is heavy on
markup, the ratio can get lower.
I am not thinking of using any standard text compression.
But, I suppose that some aspects of my tree structure representation
could probably be found in a book about compression.
On Tue, 2005-03-15 at 06:15, New, Cecil (GE Trans) wrote:
> A long time ago (when Java 1.4 came out) someone suggested that the NIO
> package with memory mapped files might be used to address this sort of
> problem. I imagine it would be quite slow, however.
Right, I agree that it would be too slow. I think that approach also
rapidly degrades to just having the whole thing in RAM if your
code randomly accesses nodes throughout the tree. So, I am not
thinking of doing it that way.
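
For readers who have not used it, the memory-mapped NIO approach suggested above looks roughly like the sketch below. It only illustrates the suggestion and is not the approach being proposed in this message. The fixed 8-byte record layout, the class name MappedNodeFile, and both of its methods are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration: fixed-size node records read through a memory-mapped file. */
final class MappedNodeFile {
    // Assumed record layout: two big-endian ints per node (name id, then one other int).
    static final int RECORD_BYTES = 8;

    /** Maps the whole file read-only; the mapping stays valid after the channel is closed. */
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Reads the first int of the given node's record using an absolute get. */
    static int nameIdAt(MappedByteBuffer buffer, int node) {
        return buffer.getInt(node * RECORD_BYTES);
    }
}
```

Each lookup can cause a page fault when the requested record is not resident, which is why random access across the tree tends to pull most of the file into RAM. A single mapping is also limited to Integer.MAX_VALUE bytes, so a very large file would need several mappings.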
I don't mean to turn this thread into a guessing game. I don't
want to explain too much, in part because I don't have everything
worked out yet. I just wanted to get a feel for whether other
people think that there is a need that would justify the effort
to actually build it.
Thanks,
jason!
--
P.S. You might also be interested in my latest project, ReadySET Pro.
http://www.readysetpro.com/