[jdom-interest] Simple xhtml/entity resolver?

Thu Mar 29 09:54:45 PDT 2012

     Hello,

  Thanks for all the advice, but it seems I did not make myself
sufficiently clear.

  My situation is this: someone else has already parsed the XHTML and
gave me the JDOM Element that represents a fragment of it.

  Let us say the original fragment looks something like this:

  "<p><b>&copy; 2012</b> by <em>Dewey, Cheetham &amp; Howe</em></p>"
  "<p><b>&#169; 2012</b> by <em>Dewey, Cheetham &amp; Howe</em></p>"
  "<p><b>&#x00a9; 2012</b> by <em>Dewey, Cheetham &amp; Howe</em></p>"

  I never get to see that fragment; I only get an object of type
Element. What I want to get is a String that looks roughly like this:

  "© 2012 by Dewey, Cheetham & Howe"

  A simple lightweight solution that is roughly acceptable in most
simple cases is fine for my purpose.

  So I am trying a recursive method that iterates over
Element.getContent(), and I am wondering what to do when a content
item happens to be an EntityRef.

package cbit.vcell.model.summaries;

import org.jdom.Comment;
import org.jdom.DocType;
import org.jdom.Element;
import org.jdom.EntityRef;
import org.jdom.ProcessingInstruction;
import org.jdom.Text;

public class XHTMLToPlainTextConverter {

	public static String convert(Element element) {
		String text = "";
		for(Object content : element.getContent()) {
			if(content instanceof Comment) {
				// ignore
			} else if(content instanceof DocType) {
				// ignore
			} else if(content instanceof Element) {
				Element childElement = (Element) content;
				text = text + convert(childElement);
			} else if(content instanceof EntityRef) {
				EntityRef ref = (EntityRef) content;
				text = text + ref; // ???
			} else if(content instanceof ProcessingInstruction) {
				// ignore
			} else if(content instanceof Text) {
				Text childText = (Text) content;
				text = text + childText.getText();
			} else {
				// ignore, should not happen
			}
		}
		return text;
	}

}
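
  For the EntityRef branch marked "???", one lightweight option is a
small lookup keyed by the entity name. The following is only a sketch
under assumptions: EntityNames and resolve() are hypothetical names,
not part of JDOM, and the table covers just a handful of entities. As
far as I know, a conforming XML parser expands numeric character
references and the five predefined entities itself, so those normally
arrive as Text; the "#..." handling below only illustrates how a code
point becomes a String if such a name ever shows up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of JDOM: maps an entity name to plain text.
final class EntityNames {

    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        // The five entities predefined by XML.
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        // A few common XHTML entities (an assumed, incomplete selection).
        NAMED.put("nbsp", "\u00a0");
        NAMED.put("copy", "\u00a9");
        NAMED.put("reg", "\u00ae");
    }

    private EntityNames() {
    }

    // Returns the replacement text for the name, or "&name;" if it is unknown.
    static String resolve(String name) {
        String known = NAMED.get(name);
        if (known != null) {
            return known;
        }
        if (name.startsWith("#")) {
            try {
                // "#x..." is hexadecimal, "#..." is decimal.
                int codePoint = name.startsWith("#x")
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    return new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Not a number after all; fall through to the unresolved form.
            }
        }
        return "&" + name + ";";
    }
}
```

  With that in place, the branch could read
text = text + EntityNames.resolve(ref.getName());
which leaves unknown entities visible as "&name;" instead of dropping
them silently.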

  Thanks!

     Take care
     Oliver

On Thu, Mar 29, 2012 at 12:19 PM, Chris Pratt <thechrispratt at gmail.com> wrote:
> Another option I've used in the past is changing the underlying SAX parser
> that JDOM uses to TagSoup (http://ccil.org/~cowan/XML/tagsoup/).  That
> parser is tuned for parsing HTML that is not fully XML-compliant.
>
>   (*Chris*)
>
> On Thu, Mar 29, 2012 at 8:47 AM, Olivier Jaquemet
> <olivier.jaquemet at jalios.com> wrote:
>>
>> Hi Oliver,
>>
>> JDOM is a great tool for parsing XML...
>>
>> ... but for an XHTML fragment (which may not be completely XHTML
>> compliant ... ?)
>> and especially for text extraction, I would strongly suggest jsoup
>> http://jsoup.org/
>>
>>  String text = org.jsoup.Jsoup.parse(html).text();
>>
>> Whatever your HTML looks like, it will work like a charm (even if it is an
>> ugly WYSIWYG copy-paste from Word or an ugly HTML export from some website)
>>
>> Olivier
>>
>>
>> On 29/03/2012 15:23, Oliver Ruebenacker wrote:
>>>
>>>      Hello,
>>>
>>>   I need a simple way to convert some XHTML fragments, provided as a
>>> JDOM Element, into plain text. I am willing to ignore most HTML tags
>>> and consider only the most commonly used predefined entities.
>>>
>>>   In JDOM, an entity reference has a name, a public id and a system
>>> id. I think I know what the name means, for named entities. But what
>>> about numeric entities, how do I get the code point? And what are
>>> public id and system id?
>>>
>>>   Thanks!
>>>
>>>      Take care
>>>      Oliver
>>>
>>
>> --
>> Olivier Jaquemet<olivier.jaquemet at jalios.com>
>> Ingénieur R&D Jalios S.A. - http://www.jalios.com/
>> @OlivierJaquemet +33970461480
>>
>>
>>
>> _______________________________________________
>> To control your jdom-interest membership:
>> http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com

-- 
Oliver Ruebenacker, Computational Cell Biologist
Virtual Cell (http://vcell.org)
SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)
http://www.oliver.curiousworld.org