[jdom-interest] Parsing Microsoft Word Documents

Paul Reeves p_a_reeves at hotmail.com
Fri Dec 24 02:37:13 PST 2004


Hugo

There hasn't been an official JTidy release for donkey's years, but that doesn't 
mean it doesn't work! It is more than up to the task. I wouldn't hold your 
breath for a new release in the next few months...

If you are using NekoHTML, I find that if you convert the document back 
from a JDOM document to a DOM document and output it with an 
org.apache.xml.serialize.HTMLSerializer, it usually comes out 
looking OK.
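
A rough sketch of the output step, under a few assumptions. The JDOM 
document is taken to have been converted to a W3C DOM document already 
(org.jdom.output.DOMOutputter does this, not shown here because it needs 
JDOM on the classpath). The JDK's own Transformer with the "html" output 
method is used as a stand-in for the Xerces HTMLSerializer so that the 
example runs on a bare JDK. The class name HtmlOutputSketch and the 
sampleDocument() helper are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class HtmlOutputSketch {

    // Serializes a W3C DOM document using HTML output rules.
    // Stand-in for the Xerces HTMLSerializer (assumption, see above).
    static String toHtml(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    // Hypothetical stand-in for the DOM document that the
    // JDOM-to-DOM conversion would hand over.
    static Document sampleDocument() throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element html = dom.createElement("html");
        Element body = dom.createElement("body");
        Element p = dom.createElement("p");
        p.appendChild(dom.createTextNode("Hello from Word"));
        body.appendChild(p);
        html.appendChild(body);
        dom.appendChild(html);
        return dom;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toHtml(sampleDocument()));
    }
}
```

The point of going through an HTML serializer rather than an XML one is 
that it writes the markup the way browsers expect it, with no XML 
declaration, which may be why the result tends to look better.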

merry chrimbo

Paul

>From: Hugo Garcia <hugo.a.garcia at gmail.com>
>Reply-To: Hugo Garcia <hugo.a.garcia at gmail.com>
>To: jdom-interest at jdom.org
>Subject: Re: [jdom-interest] Parsing Microsoft Word Documents
>Date: Thu, 23 Dec 2004 14:56:13 -0500
>
>I didn't try JTidy since the release is so old. I would rather wait for the
>new release.  TagSoup didn't work because it doesn't support the
>namespaces I need in order to use XPath.
>
>NekoHTML parses the document correctly, yet when I view the result in
>Firefox (Linux) the document looks funny. I suspect it might be the
>character set, which is specified as a Windows one, but I am not sure. I
>am using XPath to modify a clone of the input document.
>
>Any input from your experience parsing the HTML generated by Microsoft
>Word is welcome.
>
>
>This is the initial code that sets things in motion:
>
>	public void run() throws FitException {
>		fixtureDocumentProccessor = new FixtureDocumentProcessor();
>		Document fixtureDocument = null;
>		try {
>			SAXBuilder builder = new SAXBuilder("org.cyberneko.html.parsers.SAXParser");
>			builder.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
>			builder.setFeature("http://cyberneko.org/html/features/override-doctype", false);
>			URL fileURL = inputFile.toURL();
>			fixtureDocument = builder.build(fileURL);
>		} catch (IOException e) {
>			e.printStackTrace();
>		} catch (JDOMException e) {
>			e.printStackTrace();
>		}
>		this.outputFitResults(fixtureDocumentProccessor.parse(fixtureDocument));
>	}
>
>
>-------------
>-H
>
>
>On Sat, 18 Dec 2004 11:14:11 +0000, Paul Reeves <p_a_reeves at hotmail.com> 
>wrote:
> > This isn't technically a JDOM question...
> >
> > Get hold of JTidy http://sourceforge.net/projects/jtidy or even better,
> > nekohtml http://www.apache.org/~andyc/neko/doc/html/
> >
> > Both will fix your unquoted attribute problem and also attempt to correct
> > unbalanced tags - JTidy also has a "clean word" facility which is rather
> > useful
> >
> > Paul
> >
> > >From: Hugo Garcia <hugo.a.garcia at gmail.com>
> > >Reply-To: Hugo Garcia <hugo.a.garcia at gmail.com>
> > >To: jdom-interest at jdom.org
> > >Subject: [jdom-interest] Parsing Microsoft Word Documents
> > >Date: Fri, 17 Dec 2004 11:56:57 -0500
> > >
> > >Hi
> > >
> > >I am trying to parse a Microsoft Word document with the SAXBuilder but
> > >I get an error that attributes must be quoted. When I look at the
> > >document I see that indeed some attributes, especially in various meta
> > >tags, are not quoted. I wonder if anyone has run into this problem and,
> > >if so, whether you have a workaround or solution.
> > >
> > >thanks
> > >
> > >-H
> > >_______________________________________________
> > >To control your jdom-interest membership:
> > >http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com
> >
> >



