[jdom-interest] Feature Request

John Cowan cowan at ccil.org
Sat Feb 21 10:52:30 PST 2004


Dennis Sosnoski scripsit:

> I wanted HTML to be parsed with a SAX API (and without namespaces, so that 
> I could easily use XPath on the constructed model). 

Since there's demand for this, I'll make sure the SAX namespaces and
namespace-prefixes features work correctly in the next release.
Currently they can be set and cleared but don't change anything.
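
For reference, here is a minimal sketch of how a caller would ask any
SAX2 XMLReader for non-namespace parsing through the two standard
feature URIs. It uses the JDK's built-in parser as a stand-in, not
TagSoup; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFeatureSketch {
    // Standard SAX2 feature URIs.
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    // Parses doc with namespace processing off and returns the qNames seen.
    // Uses the JDK parser as a stand-in (assumption: not TagSoup itself).
    static List<String> qNamesWithoutNamespaces(String doc) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setFeature(NAMESPACES, false);
        reader.setFeature(NAMESPACE_PREFIXES, true);

        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName); // raw name, prefix included
            }
        });
        reader.parse(new InputSource(new StringReader(doc)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            qNamesWithoutNamespaces("<p:a xmlns:p='urn:x'><p:b/></p:a>"));
    }
}
```

With namespaces off, a handler works purely with qNames, which is what
makes prefix-free XPath over the resulting model straightforward.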

>        schema.elementType("span", Schema.M_ANY, Schema.M_ANY, 0);
>        schema.elementType("div", Schema.M_ANY, Schema.M_ANY, 0);
>        schema.elementType("table", Schema.M_ANY, Schema.M_ANY, 0);
>        schema.elementType("br", Schema.M_EMPTY, Schema.M_ANY, 0);

I'd be interested in knowing why these particular ones were important.
I understand the issue with script and style.

> John, I 
> should also mention that I ran into cases where the parser was not 
> clearing itself properly when starting a new parse, I think because the 
> Parser.theSaved field was not being set to null.

Thanks; I'll add that to the to-do list for 0.9.2.  (I forgot to test
for parser reusability; my tests always instantiate a new one.)
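
A reusability check of the kind described could look roughly like the
sketch below: parse the same input twice with one XMLReader and compare
the events seen. The JDK parser stands in for the parser under test,
and ReuseCheckSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ReuseCheckSketch {
    // Runs one parse on the given reader and records start-element qNames.
    static List<String> elementNames(XMLReader reader, String doc)
            throws Exception {
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(doc)));
        return names;
    }

    // True if a second parse on the SAME reader yields the same events,
    // i.e. no state leaked over from the first parse.
    static boolean sameResultOnReuse(XMLReader reader, String doc)
            throws Exception {
        List<String> first = elementNames(reader, doc);
        List<String> second = elementNames(reader, doc);
        return first.equals(second);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader (assumption); swap in the parser under test.
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println(
            sameResultOnReuse(reader, "<html><body><p/></body></html>"));
    }
}
```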

> >>The only downside I've noticed is that the handling it uses to 
> >>turn HTML into XHTML can go berserk in some cases of real-world HTML, 
> >>such as <script> and <style> elements within the <body> (it properly 
> >>tries to force them into a <head> element, so you end up with multiple 
> >><head>s and <body>s).

TagSoup's content models are implicitly of the form (A|B|C|...)*, so
it thinks the content model of the html element is (head|body)*.
I may do some special-casery to fix this, but probably not for 0.9.2
unless I see a very easy way to do it.
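
To illustrate why (head|body)* lets multiple heads and bodies through,
the sketch below models the children of html as a comma-terminated list
and checks it against two regular expressions. This is only an
illustration of the content-model idea, not how TagSoup is implemented;
the stricter "one head, then one body" model is added for comparison
and all names are hypothetical.

```java
import java.util.regex.Pattern;

public class ContentModelSketch {
    // Loose model as described above: any number of head/body, any order.
    static final Pattern LOOSE = Pattern.compile("((head|body),)*");
    // Stricter model for comparison (assumption): one head, then one body.
    static final Pattern STRICT = Pattern.compile("head,body,");

    // Encodes child element names as "name,name,..." for matching.
    static String encode(String... children) {
        StringBuilder sb = new StringBuilder();
        for (String child : children) {
            sb.append(child).append(',');
        }
        return sb.toString();
    }

    static boolean looseAllows(String... children) {
        return LOOSE.matcher(encode(children)).matches();
    }

    static boolean strictAllows(String... children) {
        return STRICT.matcher(encode(children)).matches();
    }

    public static void main(String[] args) {
        String[] repeated = {"head", "body", "head", "body"};
        System.out.println("loose:  " + looseAllows(repeated));
        System.out.println("strict: " + strictAllows(repeated));
    }
}
```

The loose model accepts the repeated sequence and the strict one
rejects it, which matches the multiple <head>/<body> output reported
above.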

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  jcowan at reutershealth.com
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say "Gosh!"
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.


