[jdom-interest] Feature Request

Thu Feb 19 21:22:43 PST 2004

The big issue is just that real-world HTML often cannot be transformed 
directly into XHTML. In my case I didn't really want XHTML, though - I 
wanted HTML to be parsed with a SAX API (and without namespaces, so that 
I could easily use XPath on the constructed model). The code below shows 
how I'm using it with dom4j. I force off namespaces (the HTMLSchema 
subclass) and disable a number of the containment rules that it would 
otherwise enforce (the schema.elementType() lines below), before passing 
the configured parser to the dom4j SAXReader:

        // classes used by this fragment:
        //   org.ccil.cowan.tagsoup.Parser, Schema, HTMLSchema
        //   org.dom4j.io.SAXReader, org.xml.sax.XMLReader

        // create and configure document builder
        XMLReader parser = new Parser();

        // force namespaces off: empty namespace URI and prefix
        Schema schema = new HTMLSchema() {
            public String getURI() {
                return "";
            }
            public String getPrefix() {
                return "";
            }
        };

        // relax the containment rules for these elements
        schema.elementType("span", Schema.M_ANY, Schema.M_ANY, 0);
        schema.elementType("div", Schema.M_ANY, Schema.M_ANY, 0);
        schema.elementType("script", Schema.M_ANY, Schema.M_ANY, 0);
        schema.elementType("style", Schema.M_ANY, Schema.M_ANY, 0);
        schema.elementType("table", Schema.M_ANY, Schema.M_ANY, 0);
        schema.elementType("br", Schema.M_EMPTY, Schema.M_ANY, 0);

        // install the modified schema, then hand the parser to dom4j
        parser.setProperty(
            "http://www.ccil.org/~cowan/tagsoup/properties/schema", schema);
        m_reader = new SAXReader(parser, false);
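
To illustrate why I want the namespace gone before using XPath, here is a
minimal sketch that uses only the JDK (no TagSoup or dom4j, so it is an
analogy rather than the code I actually run). The class name
NamespaceXPathDemo, the countMatches helper, and the sample document are
made up for this example. When the elements carry the XHTML namespace, an
unprefixed XPath step such as //div should match nothing, and you have to
fall back on local-name() tests or register a prefix. With namespace
processing off, the plain expression should match directly.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of TagSoup, dom4j, or JDOM.
final class NamespaceXPathDemo {

    // Small well-formed sample whose elements sit in the XHTML namespace.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div>one</div><div>two</div></body></html>";

    // Parses xml with or without namespace processing and returns the
    // number of nodes selected by the XPath expression expr.
    public static int countMatches(String xml, boolean namespaceAware,
                                   String expr) {
        try {
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the demo short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Namespace-aware model: the unprefixed name does not match.
        System.out.println(countMatches(XHTML, true, "//div"));
        // Namespace-aware model: the verbose workaround.
        System.out.println(
            countMatches(XHTML, true, "//*[local-name()='div']"));
        // Namespace processing off: the plain expression matches.
        System.out.println(countMatches(XHTML, false, "//div"));
    }
}
```

The same reasoning applies to the dom4j model built from the TagSoup
parser: once the elements have no namespace URI, expressions like //div
can be written without any prefix mapping.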

I'm copying John Cowan, the author of TagSoup, on this so he can comment 
on cleaner/easier ways of accomplishing the same type of thing. John, I 
should also mention that I ran into cases where the parser was not 
clearing itself properly when starting a new parse, I think because the 
Parser.theSaved field was not being set to null.

  - Dennis

Dennis M. Sosnoski
Enterprise Java, XML, and Web Services Support
http://www.sosnoski.com
Redmond, WA  425.885.7197

Chris B. wrote:

>Thanks for that! It gives me another one to try.
>
>For what I'm doing I need the best HTML parser I can lay my hands on, 
>and the more cruddy HTML it can grok without blowing up, the happier I 
>will be. Do you think it is better than Neko and JTidy?
>
>I'll also take whatever patches you've got as well for testing.
>
>Dennis Sosnoski wrote:
>
>>I'd suggest instead using TagSoup 
>>(http://www.ccil.org/~cowan/XML/tagsoup). It implements its own SAX2 
>>parser for HTML, so doesn't interfere with anything else in your 
>>system. The only downside I've noticed is that the handling it uses to 
>>turn HTML into XHTML can go berserk in some cases of real-world HTML, 
>>such as <script> and <style> elements within the <body> (it properly 
>>tries to force them into a <head> element, so you end up with multiple 
>><head>s and <body>s). I've figured out how to easily patch it to get 
>>around some of these issues, so let me know if you run into problems.
>>
>> - Dennis
>>
>>Chris B. wrote:
>>
>>>Jeremy.Prellwitz at siras.com wrote:
>>>
>>>>It is not NekoHTML that I'm worried about.
>>>>
>>>I'm worried about it because I suspect I will have to do some major 
>>>work on either NekoHTML or JTidy for a project I'm working on, and I 
>>>want to understand the situation as clearly as possible, because if 
>>>that happens I *may* have an opportunity to fix Neko properly.
>>>
>>>>It is parsing regular XML documents in the same webapp. 
>>>>
>>>According to the Neko web site....
>>>" The Xerces2 implementation dynamically instantiates the default 
>>>parser configuration to construct parser objects via the Jar service 
>>>facility. The Jar file |nekohtmlXni.jar| contains a 
>>>|META-INF/services| file that is read by Xerces2 implementation for 
>>>this purpose."
>>>
>>>If I understand this correctly, if you don't use nekohtmlXni.jar, 
>>>then you won't have the problem?
>>>
>>>>Basically, NekoHTML interferes with the creation of Xerces parsers.
>>>>When I create a SAXBuilder object, it creates a parser that uses the
>>>>HTML configuration set up by NekoHTML. If I could create my own Xerces
>>>>parser, instantiate it with the specific standard configuration class
>>>>that it needs, and then pass it into the constructor of the SAXBuilder
>>>>object, I wouldn't have to worry about the SAXBuilder object creating
>>>>a parser on its own that uses the HTML configuration set up by
>>>>NekoHTML.
>>>>
>>>>
>>>>-jeremy
>>>>
>>>>
>>>>From: "Chris B." <chris at tech.com.au>
>>>>Date: 02/19/2004 05:55 PM
>>>>To: Jeremy.Prellwitz at siras.com
>>>>cc: jdom-interest at jdom.org
>>>>Subject: Re: [jdom-interest] Feature Request
>>>>
>>>>As much as I think it's a good idea, how would it help you directly,
>>>>since NekoHTML doesn't seem to conform to XMLReader? (Which seems to be
>>>>its problem).
>>>>
>>>>
>>>>Jeremy.Prellwitz at siras.com wrote:
>>>>
>>>>>This is what I was trying to describe, just without stating it as
>>>>>specifically or concisely as you just did. I wouldn't have brought up
>>>>>my own little issue if I didn't think that passing in your own
>>>>>XMLReader instance could be useful to others. It seems like a simple
>>>>>enough change to the SAXBuilder.java class, and coincidentally, it
>>>>>would smooth out my code a little bit. :-)
>>>>>
>>>>>-jeremy
>>>>>
>>>>>>It seems to me that supplying your own XMLReader is a sensible enough
>>>>>>activity that it deserves a proper method or constructor in
>>>>>>SAXBuilder to pass it in.
>>>>>>
>>>>>From: "Chris B." <chris at tech.com.au>
>>>>>Date: 02/19/2004 05:00 PM
>>>>>To: Jason Hunter <jhunter at xquery.com>
>>>>>cc: Jeremy.Prellwitz at siras.com, jdom-interest at jdom.org
>>>>>Subject: Re: [jdom-interest] Feature Request
>>>>>
>>>>>Jason Hunter wrote:
>>>>>
>>>>>>Sounds like nekohtml is being a Bad Citizen, but I think you can do
>>>>>>exactly what you want by subclassing SAXBuilder and overriding
>>>>>>createParser().
>>>>>>
>>>>>It seems to me that supplying your own XMLReader is a sensible enough
>>>>>activity that it deserves a proper method or constructor in SAXBuilder
>>>>>to pass it in.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>_______________________________________________
>>>>>To control your jdom-interest membership:
>>>>>http://lists.denveronline.net/mailman/options/jdom-interest/youraddr@yourhost.com 