[jdom-interest] Feature Request
Dennis Sosnoski
dms at sosnoski.com
Thu Feb 19 21:22:43 PST 2004
The big issue is just that real-world HTML often cannot be transformed
directly into XHTML. In my case I didn't really want XHTML, though - I
wanted HTML to be parsed with a SAX API (and without namespaces, so that
I could easily use XPath on the constructed model). The code below shows
how I'm using it with dom4j. I force off namespaces (the HTMLSchema
subclass) and disable a number of the containment rules that it would
otherwise enforce (the schema.elementType() lines below), before passing
the configured parser to the dom4j SAXReader:
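    import org.ccil.cowan.tagsoup.HTMLSchema;
    import org.ccil.cowan.tagsoup.Parser;
    import org.ccil.cowan.tagsoup.Schema;
    import org.dom4j.io.SAXReader;
    import org.xml.sax.XMLReader;

    // create and configure document builder
    XMLReader parser = new Parser();

    // report elements with no namespace URI and no prefix
    Schema schema = new HTMLSchema() {
        public String getURI() {
            return "";
        }
        public String getPrefix() {
            return "";
        }
    };

    // relax the containment rules for elements that show up anywhere
    schema.elementType("span", Schema.M_ANY, Schema.M_ANY, 0);
    schema.elementType("div", Schema.M_ANY, Schema.M_ANY, 0);
    schema.elementType("script", Schema.M_ANY, Schema.M_ANY, 0);
    schema.elementType("style", Schema.M_ANY, Schema.M_ANY, 0);
    schema.elementType("table", Schema.M_ANY, Schema.M_ANY, 0);
    schema.elementType("br", Schema.M_EMPTY, Schema.M_ANY, 0);
    parser.setProperty
        ("http://www.ccil.org/~cowan/tagsoup/properties/schema",
        schema);

    // hand the configured parser to dom4j (no validation)
    m_reader = new SAXReader(parser, false);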
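
As a rough illustration of why dropping namespaces makes XPath simpler, the
sketch below uses only the JDK's DOM and XPath classes (not TagSoup or dom4j).
The class name NoNamespaceXPath, the countMatches helper, and the sample
markup are made up for the example. With namespace awareness on, the
unprefixed //div should match nothing, because the elements live in the XHTML
namespace; with it off, the same expression should find both divs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class NoNamespaceXPath {

    // Parse the markup with or without namespace awareness and count
    // how many nodes the XPath expression selects.
    static int countMatches(String xml, String xpath, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("parse or XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><div>a</div><div>b</div></body></html>";
        // Namespace-aware: an unprefixed name test does not see XHTML elements.
        System.out.println("aware:   " + countMatches(xhtml, "//div", true));
        // Namespace-unaware: plain element names are enough.
        System.out.println("unaware: " + countMatches(xhtml, "//div", false));
    }
}
```

The namespace-aware case can be made to work by registering a prefix through
a NamespaceContext, but that is exactly the extra ceremony the HTMLSchema
subclass above is meant to avoid.
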
I'm copying John Cowan, the author of TagSoup, on this so he can comment
on cleaner/easier ways of accomplishing the same type of thing. John, I
should also mention that I ran into cases where the parser was not
clearing itself properly when starting a new parse, I think because the
Parser.theSaved field was not being set to null.
- Dennis
Dennis M. Sosnoski
Enterprise Java, XML, and Web Services Support
http://www.sosnoski.com
Redmond, WA 425.885.7197
Chris B. wrote:
>Thanks for that! It gives me another one to try.
>
>For what I'm doing I need the best HTML parser I can lay my hands on,
>and the more cruddy HTML it can grok without blowing up, the happier I
>will be. Do you think it is better than Neko and JTidy?
>
>I'll also take whatever patches you've got as well for testing.
>
>Dennis Sosnoski wrote:
>
>
>
>>I'd suggest instead using TagSoup
>>(http://www.ccil.org/~cowan/XML/tagsoup). It implements its own SAX2
>>parser for HTML, so it doesn't interfere with anything else in your
>>system. The only downside I've noticed is that the handling it uses to
>>turn HTML into XHTML can go berserk in some cases of real-world HTML,
>>such as <script> and <style> elements within the <body> (it properly
>>tries to force them into a <head> element, so you end up with multiple
>><head>s and <body>s). I've figured out how to easily patch it to get
>>around some of these issues, so let me know if you run into problems.
>>
>> - Dennis
>>
>>Chris B. wrote:
>>
>>
>>
>>>Jeremy.Prellwitz at siras.com wrote:
>>>
>>>
>>>
>>>
>>>
>>>>It is not NekoHTML that i'm worried about.
>>>>
>>>>
>>>>
>>>I'm worried about it because I suspect I will have to do some major
>>>work on either NekoHTML or JTidy for a project I'm working on, and I
>>>want to understand the situation as clearly as possible, because if
>>>that happens I *may* have an opportunity to fix Neko properly.
>>>
>>>
>>>
>>>
>>>
>>>>It is parsing regular XML documents in the same webapp.
>>>>
>>>>
>>>>
>>>According to the Neko web site....
>>>" The Xerces2 implementation dynamically instantiates the default
>>>parser configuration to construct parser objects via the Jar service
>>>facility. The Jar file |nekohtmlXni.jar| contains a
>>>|META-INF/services| file that is read by Xerces2 implementation for
>>>this purpose."
>>>
>>>If I understand this correctly, if you don't use nekohtmlXni.jar,
>>>then you won't have the problem?
>>>
>>>
>>>
>>>
>>>
>>>
>>>>Basically, NekoHTML interferes with the
>>>>creation of Xerces parsers. When I create a SAXBuilder object, it
>>>>creates a parser that is using the HTML configuration set up by
>>>>NekoHTML.
>>>>If I could create my own Xerces parser, instantiate it with the
>>>>specific standard configuration class that it needs, and then pass
>>>>it into the constructor of the SAXBuilder object, then I wouldn't
>>>>have to worry about the SAXBuilder object creating a parser on its
>>>>own that uses the HTML configuration set up by NekoHTML.
>>>>
>>>>
>>>>-jeremy
>>>>
>>>>
>>>>
>>>>"Chris B." <chris at tech.com.au>
>>>>02/19/2004 05:55 PM
>>>>To: Jeremy.Prellwitz at siras.com
>>>>cc: jdom-interest at jdom.org
>>>>Subject: Re: [jdom-interest] Feature Request
>>>>
>>>>As much as I think its a good idea, how would it help you directly,
>>>>since NekoHTML doesn't seem to conform to XMLReader? (Which seems to be
>>>>its problem).
>>>>
>>>>
>>>>Jeremy.Prellwitz at siras.com wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>This is what I was trying to describe, just without mentioning it as
>>>>>specifically/concisely as you just did. I wouldn't have brought up
>>>>>my own little issue if I didn't think that passing in your own
>>>>>XMLReader instance could be useful to others. It seems like a simple
>>>>>enough change to the SAXBuilder.java class, and coincidentally, it
>>>>>would smooth out my code a little bit. :-)
>>>>>
>>>>>-jeremy
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>It seems to me that supplying your own XMLReader is a sensible enough
>>>>>>activity that it deserves a proper method or constructor in
>>>>>>SAXBuilder to pass it in.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>"Chris B." <chris at tech.com.au>
>>>>>02/19/2004 05:00 PM
>>>>>To: Jason Hunter <jhunter at xquery.com>
>>>>>cc: Jeremy.Prellwitz at siras.com, jdom-interest at jdom.org
>>>>>Subject: Re: [jdom-interest] Feature Request
>>>>>
>>>>>Jason Hunter wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>Sounds like nekohtml is being a Bad Citizen, but I think you can do
>>>>>>exactly what you want by subclassing SAXBuilder and overriding
>>>>>>createParser().
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>It seems to me that supplying your own XMLReader is a sensible enough
>>>>>activity that it deserves a proper method or constructor in SAXBuilder
>>>>>to pass it in.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>_______________________________________________
>>>>>To control your jdom-interest membership:
>>>>>http://lists.denveronline.net/mailman/options/jdom-interest/youraddr@yourhost.com