[jdom-interest] Parsing Microsoft Word Documents

Hugo Garcia hugo.a.garcia at gmail.com
Thu Dec 23 11:56:13 PST 2004


I didn't try JTidy since the release is so old; I'd rather wait for the
new release. TagSoup didn't work because it doesn't support
namespaces, which I need in order to use XPath.
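
To show why namespace handling matters for XPath, here is a minimal
sketch. It uses the JAXP classes from the JDK (javax.xml.xpath) rather
than JDOM's own XPath class, so that it runs without extra jars. The
class name NamespacedXPath, the helper count() and the prefix "h" are
made up for this illustration. In XPath 1.0 an unprefixed name such as
//p only matches elements in no namespace, so elements in the XHTML
namespace have to be addressed through a bound prefix.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespacedXPath {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical helper: parse the XML string with namespaces enabled
    // and return how many nodes the XPath expression selects.
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the illustrative prefix "h" to the XHTML namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML.equals(uri) ? "h" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML.equals(uri)
                            ? Collections.singletonList("h").iterator()
                            : Collections.<String>emptyList().iterator();
                }
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html xmlns=\"" + XHTML + "\"><body><p>a</p><p>b</p></body></html>";
        System.out.println(count(xml, "//p"));   // prints 0: unprefixed name, no namespace
        System.out.println(count(xml, "//h:p")); // prints 2: prefix bound to XHTML
    }
}
```

With JDOM the same idea would go through the namespace support of its
XPath class instead of a NamespaceContext.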

NekoHTML parses the document correctly, yet when I view the result in
Firefox (Linux) the document looks funny. I suspect it might be the
character set, which is specified as a Windows one, but I am not sure. I
am using XPath to modify a clone of the input document.
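
One way to test the character-set suspicion, as a rough sketch only: the
same bytes decode to different characters depending on the charset
assumed. The example assumes the file is windows-1252, which has not
been confirmed for this document, and the class name CharsetCheck and
the helper decode() are made up for the illustration. Bytes 0x93 and
0x94 are the curly double quotes in windows-1252 but map to invisible
control characters in ISO-8859-1, which is one way text can end up
looking odd in a browser.

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    // Hypothetical helper: decode raw bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Curly quotes around "hi" as they would be stored in windows-1252.
        byte[] raw = { (byte) 0x93, 'h', 'i', (byte) 0x94 };

        String asWindows = decode(raw, "windows-1252");
        String asLatin1 = decode(raw, "ISO-8859-1");

        // First character of each decoding, as a hex code point.
        System.out.println(Integer.toHexString(asWindows.charAt(0))); // prints 201c
        System.out.println(Integer.toHexString(asLatin1.charAt(0)));  // prints 93
    }
}
```

If the output file declares one charset but is written in another, the
browser performs the second kind of decoding.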

Any input from your experience parsing the HTML generated by Microsoft
Word is welcome.


This is the initial code that sets things in motion:

	public void run() throws FitException {
		fixtureDocumentProccessor = new FixtureDocumentProcessor();
		Document fixtureDocument = null;
		try {
			// Use NekoHTML as the underlying SAX parser so malformed HTML is accepted.
			SAXBuilder builder = new SAXBuilder("org.cyberneko.html.parsers.SAXParser");
			// Report element names in lower case.
			builder.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
			builder.setFeature("http://cyberneko.org/html/features/override-doctype", false);
			URL fileURL = inputFile.toURL();
			fixtureDocument = builder.build(fileURL);
		} catch (IOException e) {
			e.printStackTrace();
		} catch (JDOMException e) {
			e.printStackTrace();
		}
		// Note: fixtureDocument is still null here if the build above failed.
		this.outputFitResults(fixtureDocumentProccessor.parse(fixtureDocument));
	}


-------------
-H


On Sat, 18 Dec 2004 11:14:11 +0000, Paul Reeves <p_a_reeves at hotmail.com> wrote:
> This isn't technically a JDOM question....
> 
> Get hold of JTidy http://sourceforge.net/projects/jtidy or even better,
> NekoHTML http://www.apache.org/~andyc/neko/doc/html/
> 
> Both will fix your unquoted attribute problem and also attempt to correct
> unbalanced tags. JTidy also has a "clean word" facility which is rather
> useful.
> 
> Paul
> 
> >From: Hugo Garcia <hugo.a.garcia at gmail.com>
> >Reply-To: Hugo Garcia <hugo.a.garcia at gmail.com>
> >To: jdom-interest at jdom.org
> >Subject: [jdom-interest] Parsing Microsoft Word Documents
> >Date: Fri, 17 Dec 2004 11:56:57 -0500
> >
> >Hi
> >
> >I am trying to parse a Microsoft Word document with the SAXBuilder but
> >I get an error that attributes must be quoted. When I look at the
> >document I see that indeed some attributes, especially in various meta
> >tags, are not quoted. I wonder if anyone has run into this problem and,
> >if so, whether you have a workaround or solution.
> >
> >thanks
> >
> >-H

