[jdom-interest] Maximizing Efficiency of XPATH calls
Jason Hunter
jhunter at xquery.com
Fri Sep 9 13:59:57 PDT 2005
Paul Libbrecht wrote:
> I'd be interested to know how much performance one can expect of such
> engines... we keep processing XML but end up storing and retrieving
> with Lucene, which has really good performance (around 1500 queries/second
> against about 7000 items totaling 50 MB on a single PC, spread across
> about 20 typical queries).
>
> The XML databases I tried were way slower (by a factor of 10 or 100), but
> I never got as far as building a real index.
>
> What does a commercial product such as Mark Logic achieve?
If your XML database is written in Java, you can pretty much assume it's
not going to be fast. I love Java, made my career off Java, and wrote
JDOM explicitly *for* Java, but you don't write fast databases in Java.
Fast databases need too much low-level control -- over the filesystem,
threading, and memory management. I'll bet whatever you tried was
written in Java.
Mark Logic's engine is in C++. Thank God some people still remember
C++. :) It's designed with a search engine style architecture (similar
to Lucene) but with indexes that understand the structure of the
documents. To get that across, it helps to understand that in MarkLogic
Server a query of //foo is a wee bit faster than /a/b/foo. That's
because to satisfy the first query it only needs to know where <foo>
instances are, which it knows easily from its index. For the second it
needs to find <foo> instances but also consult other indexes to make sure
each one is under a <b> which is under an <a>. Both queries are very fast
(the join between indexes for the second query is fast), so both start
streaming answers basically instantaneously, because they can be fully
answered from indexes.
I say "start to stream" above because to actually receive all <foo>
elements on a 5 Tera data set itself can take a while.
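To make that concrete, here's a rough sketch of the query shapes involved
(the element names a, b, and foo are just placeholders, and this is an
illustration rather than benchmark code):

    (: answered straight from the element index: where are the <foo> elements? :)
    //foo,

    (: answered by joining the <foo> index with the indexes that establish
       the <b>-under-<a> containment :)
    /a/b/foo,

    (: a positional predicate is an easy way to pull back just the first few
       hits instead of materializing every <foo> in a huge data set :)
    (//foo)[1 to 10]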
Other queries like //foo[title = "x"] are also fast on large data sets
because they too can be fully resolved from indexes. So too with
//foo[cts:contains(title, "foo")], which looks for titles containing the
token "foo" (token-based like Lucene, so it correctly doesn't match
"food"). The cts: namespace is a Mark Logic extension; regular XQuery
doesn't yet have a standardized fast token-based text search.
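Side by side, the two predicate styles look like this (foo, title, and the
value "x" are placeholders; cts:contains is the Mark Logic extension just
mentioned, so this is a sketch for that engine rather than standard XQuery):

    (: value comparison, resolved entirely from indexes :)
    //foo[title = "x"],

    (: token-based text match: finds the token "foo" in the title,
       so "food" is correctly not a match :)
    //foo[cts:contains(title, "foo")]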
You can go beyond XPath in XQuery to do fancy things like
for $hit in
    cts:search(//chapter[cts:contains(title, "servlet")]/sect1/para,
               cts:word-query("apply style", "stemmed"))[1 to 10]
return
    <span>
      <book>{ $hit/ancestor::book/title/text() }</book>
      { htmllib:render($hit) }
    </span>
This says to search all para elements under sect1 elements of all
chapters whose titles contain "servlet" (case insensitive) for the
phrase "apply style", with stemming enabled so "applying styles" would be
a legal match too, and then to return the top 10 most relevant hits. For
each $hit item it returns a <span> containing the title of the book that
holds the paragraph and, underneath, the paragraph rendered nicely as HTML
(using a user-defined function; see the sketch just below). O'Reilly's
doing stuff like this with their content using Mark Logic, against many
gigs of book and article content. I showed some fun demos of this at FOO
Camp.
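By the way, htmllib:render above is nothing special, just a user-defined
function. A purely hypothetical sketch (the namespace URI and the rendering
logic here are made up for illustration) might look like:

    declare namespace htmllib = "http://example.com/htmllib";

    (: hypothetical rendering function: wrap the matched paragraph's
       text in an HTML <p> element :)
    declare function htmllib:render($para as element(para)) as element(p)
    {
      <p>{ $para//text() }</p>
    };

    htmllib:render(<para>Apply the style before saving.</para>)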
If you want to see all the text search stuff, I wrote a paper titled
"XQuery Search and Update" for XTech 2005 available at
http://idealliance.org/proceedings/xtech05/papers/02-04-01/.
Anyway, as you can tell by my return address, I've been enjoying XQuery
quite a lot. :)
-jh-