[jdom-interest] Re: Manipulating a very large XML file
Jason Hunter
jhunter at xquery.com
Wed Mar 16 22:22:42 PST 2005
Jason Robbins wrote:
> Oh, I agree. As a computer scientist, I know that 8*N and N/2
> are both O(N), so from that point of view, it really doesn't matter.
> In the long run, as N continues to grow, people absolutely need
> to switch to a database approach.
Yes, exactly. I'm sure you also would agree that if you can change the
coefficient or the exponent, you do the exponent first. Change O(N) to
O(1), then worry about the performance of the O(1). That's why I would
encourage people interested in supporting larger XML quantities to
devote their efforts to the database approach.
However, if that doesn't interest you, then yes, there is improvement to
be had by changing the coefficient, and that improvement does have
practical benefit.
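To make the coefficient-versus-exponent point concrete, here is a minimal Java sketch (the class and method names are mine, not from this thread): tuning a linear scan only shrinks the constant factor and leaves each lookup O(N), while building a hash index once makes each subsequent lookup O(1).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LookupDemo {
    // O(N) per lookup: scans every record on every call.
    // Tuning this loop only changes the coefficient, not the exponent.
    static String scanLookup(List<String[]> records, String key) {
        for (String[] r : records) {
            if (r[0].equals(key)) return r[1];
        }
        return null;
    }

    // One O(N) pass builds an index; every lookup after that is O(1).
    static Map<String, String> buildIndex(List<String[]> records) {
        Map<String, String> index = new HashMap<>();
        for (String[] r : records) index.put(r[0], r[1]);
        return index;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[]{"a", "1"},
                new String[]{"b", "2"});
        System.out.println(scanLookup(records, "b"));          // prints 2
        System.out.println(buildIndex(records).get("b"));      // prints 2
    }
}
```

The database approach is the buildIndex side of this sketch: pay once to organize the data, then answer each query without touching the whole content set.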
>>Or if you want a commercial grade solution, look at Mark Logic. You can
>>get a 30 day trial that supports data sets up to 1G
>>(http://xqzone.marklogic.com). The official product goes four orders of
>>magnitude larger than that. It's really fun.
>
> Cool. Hardcore!
>
> If a dataset contains gigabytes, doesn't that make it more likely
> that the results of a given query could be tens of megabytes?
In my experience, result size isn't generally proportional to the
content set size.
O'Reilly for example is loading all their book and article content into
Mark Logic. It's a fair bit of content, but typical results are scoped
to the size appropriate for human consumption (multiples of chapters and
sections).
But yes, as you predict, some queries may return multi-megabyte answers.
Custom book printing is a good example where you output a large XSL-FO
document. From what I see, result size depends more on the nature of
the query than on the content set size.
> In a relational database, the RDBMS can return a large rowset
> as a stream, and the application goes through it row-by-row.
> If an XML query results in a big nodelist, that could certainly
> be streamed. But, if it results in a big sub-tree, doesn't
> that need to be represented in RAM in an efficient way?
Yes, absolutely, that's a case where a more memory-efficient
implementation would come in handy. Normally XQuery returns a sequence
of XML nodes that you handle the way you handle rows in a relational
model (pull based). The less memory it takes to handle each node, the
more efficient your handling, and if any node is large...
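The pull-based handling described above can be sketched with StAX, a Java pull-parsing API (the XML and the counting logic are my own illustration, not anything from the thread): each node is visited as an event and then discarded, so memory use stays flat no matter how large the result is.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class PullDemo {
    // Count <row> elements one event at a time; no tree is ever built,
    // so memory does not grow with the size of the result sequence.
    static int countRows(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int rows = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "row".equals(reader.getLocalName())) {
                rows++; // handle one node, then let the parser move on
            }
        }
        reader.close();
        return rows;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                countRows("<results><row>1</row><row>2</row></results>")); // prints 2
    }
}
```

This is the streaming-rowset analogy: the application sees one node at a time, exactly as it would see one row at a time from an RDBMS cursor. The hard case, as noted above, is a single node that is itself huge.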
In designing JDOM there's always been a tradeoff between features and
memory size. We've tried to strike a middle ground. It sounds like
you're thinking of preserving the features but changing the
representation, probably trading time for memory?
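One representation change in that spirit can be sketched as follows (this class is hypothetical, not JDOM code): element names in a large document repeat heavily, so spending one hash lookup per node (time) to share a single String instance per distinct name (memory) is exactly the kind of time-for-memory trade mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: interning repeated element names so every node
// with the same name shares one String instance instead of its own copy.
public class NameInterner {
    private final Map<String, String> pool = new HashMap<>();

    public String intern(String name) {
        String shared = pool.get(name);   // extra lookup: the time cost
        if (shared == null) {
            pool.put(name, name);
            shared = name;
        }
        return shared;                    // shared instance: the memory win
    }
}
```

A parser that routes every element name through such a pool keeps the full feature set visible to the application while shrinking the per-node footprint.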
>>Here's a screencast I did with Jon Udell showing off XQuery to
>>manipulate some O'Reilly books in docbook format:
>> http://weblog.infoworld.com/udell/2005/02/15.html#a1177
>
> Very cool. I definitely need to learn more about xquery.
I think you'll enjoy it. Let me know if you have questions. There's a
vendor neutral mailing list at http://xquery.com and a Mark Logic
specific one at http://xqzone.marklogic.com.
-jh-