[jdom-interest] Feature Request
Chris B.
chris at tech.com.au
Thu Feb 19 19:51:19 PST 2004
Thanks for that! It gives me another one to try.
For what I'm doing I need the best HTML parser I can lay my hands on,
and the more cruddy HTML it can grok without blowing up, the happier I
will be. Do you think it is better than NekoHTML and JTidy?
I'll also take whatever patches you've got for testing.
Dennis Sosnoski wrote:
> I'd suggest instead using TagSoup
> (http://www.ccil.org/~cowan/XML/tagsoup). It implements its own SAX2
> parser for HTML, so doesn't interfere with anything else in your
> system. The only downside I've noticed is that the handling it uses to
> turn HTML into XHTML can go berserk in some cases of real-world HTML,
> such as <script> and <style> elements within the <body> (it properly
> tries to force them into a <head> element, so you end up with multiple
> <head>s and <body>s). I've figured out how to easily patch it to get
> around some of these issues, so let me know if you run into problems.
>
> - Dennis
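
One way to check whether a given page trips the multiple <head>/<body> behaviour described above is to count those elements in the SAX event stream. The following is only a rough sketch: HeadBodyCounter is a hypothetical helper (not part of TagSoup or JDOM), and the JDK's bundled parser stands in for the HTML parser so that it runs without extra jars. Any SAX2 XMLReader could be passed to count() instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of TagSoup or JDOM.
final class HeadBodyCounter extends DefaultHandler {
    int heads;
    int bodies;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the reader is not namespace-aware, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("head".equalsIgnoreCase(name)) heads++;
        if ("body".equalsIgnoreCase(name)) bodies++;
    }

    // Runs the markup through the given reader and returns the filled-in counter.
    static HeadBodyCounter count(XMLReader reader, String markup) {
        HeadBodyCounter counter = new HeadBodyCounter();
        reader.setContentHandler(counter);
        try {
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return counter;
    }

    // Stand-in reader from the JDK (an assumption made so the sketch is self-contained).
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("could not create XMLReader", e);
        }
    }

    public static void main(String[] args) {
        // Well-formed input that mimics the doubled-up output described above.
        String doubled = "<html><head/><body/><head/><body/></html>";
        HeadBodyCounter c = count(jdkReader(), doubled);
        System.out.println("heads=" + c.heads + " bodies=" + c.bodies);
    }
}
```

A count above one for either element would flag a document worth looking at by hand, both before and after applying any patch.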
>
> Chris B. wrote:
>
>> Jeremy.Prellwitz at siras.com wrote:
>>
>>
>>
>>> It is not NekoHTML that I'm worried about.
>>>
>>
>>
>> I'm worried about it because I suspect I will have to do some major
>> work on either NekoHTML or JTidy for a project I'm working on, and I
>> want to understand the situation as clearly as possible, because if
>> that happens I *may* have an opportunity to fix Neko properly.
>>
>>
>>
>>> It is parsing regular XML documents in the same webapp.
>>>
>>
>>
>> According to the Neko web site....
>> " The Xerces2 implementation dynamically instantiates the default
>> parser configuration to construct parser objects via the Jar service
>> facility. The Jar file |nekohtmlXni.jar| contains a
>> |META-INF/services| file that is read by Xerces2 implementation for
>> this purpose."
>>
>> If I understand this correctly, if you don't use nekohtmlXni.jar,
>> then you won't have the problem?
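
A quick way to test that theory is to ask the class loader which jars carry such a services file. This is a minimal sketch using only the JDK. The resource name below is my assumption about the file Xerces2 looks up (the quoted text only says "META-INF/services"), and ParserConfigProbe is a made-up class name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic class; the name is made up for this sketch.
final class ParserConfigProbe {

    // Assumed name of the services file consulted by Xerces2; not quoted in this thread.
    static final String SERVICE_FILE =
            "META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration";

    // Returns the URL of every copy of the named resource visible to the loader.
    static List<URL> findResources(ClassLoader loader, String name) {
        try {
            return new ArrayList<URL>(Collections.list(loader.getResources(name)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> hits = findResources(ClassLoader.getSystemClassLoader(), SERVICE_FILE);
        if (hits.isEmpty()) {
            System.out.println("No jar on the classpath registers a parser configuration.");
        }
        for (URL hit : hits) {
            // Each URL names the jar that contributed the services file.
            System.out.println("Registered by: " + hit);
        }
    }
}
```

Inside a webapp the relevant loader would more likely be the thread's context class loader than the system one, so the argument to findResources() may need changing there.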
>>
>>
>>
>>
>>> Basically, NekoHTML interferes with the creation of Xerces parsers.
>>> When I create a SAXBuilder object, it creates a parser that uses the
>>> HTML configuration set up by NekoHTML. If I could create my own Xerces
>>> parser, instantiate it with the specific standard configuration class
>>> it needs, and then pass it into the constructor of the SAXBuilder
>>> object, I wouldn't have to worry about the SAXBuilder object creating
>>> a parser on its own that uses the HTML configuration set up by
>>> NekoHTML.
>>>
>>>
>>> -jeremy
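
The shape being asked for here is that the caller builds the XMLReader and the consumer merely accepts it. Below is a minimal sketch of that shape using only the JDK. ExplicitReaderSketch and its methods are hypothetical names, rootName() stands in for a builder that takes a reader as a parameter, and the JDK's bundled JAXP factory stands in for a Xerces parser built with an explicit configuration class, which is not shown in this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class for illustration; not JDOM or Xerces API.
final class ExplicitReaderSketch {

    // The one place where the parser is created, so its setup is under the caller's control.
    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("could not create XMLReader", e);
        }
    }

    // Consumer that never creates a parser itself; it uses whatever reader it is handed.
    static String rootName(XMLReader reader, String xml) {
        final String[] root = new String[1];
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Remember only the first element seen, which is the document root.
                if (root[0] == null) {
                    root[0] = qName;
                }
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        System.out.println("Reader class: " + reader.getClass().getName());
        System.out.println("Root element: " + rootName(reader, "<doc><item/></doc>"));
    }
}
```

With the createParser() override mentioned further down in this thread, a SAXBuilder subclass could presumably return the result of something like newReader() to get the same effect today.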
>>>
>>>
>>>
>>> "Chris B." <chris at tech.com.au>
>>> 02/19/2004 05:55 PM
>>> To: Jeremy.Prellwitz at siras.com
>>> cc: jdom-interest at jdom.org
>>> Subject: Re: [jdom-interest] Feature Request
>>>
>>> As much as I think it's a good idea, how would it help you directly,
>>> since NekoHTML doesn't seem to conform to XMLReader? (Which seems to be
>>> its problem.)
>>>
>>>
>>> Jeremy.Prellwitz at siras.com wrote:
>>>
>>>
>>>
>>>
>>>
>>>> This is what I was trying to describe, just without mentioning it as
>>>> specifically or concisely as you just did. I wouldn't have brought up
>>>> my own little issue if I didn't think that passing in your own
>>>> XMLReader instance could be useful to others. It seems like a simple
>>>> enough change to the SAXBuilder.java class, and coincidentally, it
>>>> would smooth out my code a little bit. :-)
>>>>
>>>> -jeremy
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> It seems to me that supplying your own XMLReader is a sensible enough
>>>>> activity that it deserves a proper method or constructor in SAXBuilder
>>>>> to pass it in.
>>>>
>>>> "Chris B." <chris at tech.com.au>
>>>> 02/19/2004 05:00 PM
>>>> To: Jason Hunter <jhunter at xquery.com>
>>>> cc: Jeremy.Prellwitz at siras.com, jdom-interest at jdom.org
>>>> Subject: Re: [jdom-interest] Feature Request
>>>>
>>>> Jason Hunter wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Sounds like nekohtml is being a Bad Citizen, but I think you can do
>>>>> exactly what you want by subclassing SAXBuilder and overriding
>>>>> createParser().
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> It seems to me that supplying your own XMLReader is a sensible enough
>>>> activity that it deserves a proper method or constructor in SAXBuilder
>>>> to pass it in.
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> To control your jdom-interest membership:
>>>> http://lists.denveronline.net/mailman/options/jdom-interest/youraddr@yourhost.com