[jdom-interest] Simple xhtml/entity resolver?
Oliver Ruebenacker
curoli at gmail.com
Thu Mar 29 09:54:45 PDT 2012
Hello,
Thanks for all the advice, but it seems I did not make myself
sufficiently clear.
My situation is this: someone else has already parsed the XHTML and given me
the JDOM Element that represents a fragment of it.
Let us say the original fragment looks something like this:
"<p><b>&copy; 2012</b> by <em>Dewey, Cheetham &amp; Howe</em></p>"
I never get to see that fragment; I only receive an object of type
Element. What I want to get is a String that looks roughly like this:
"© 2012 by Dewey, Cheetham & Howe"
A simple lightweight solution that is roughly acceptable in most
simple cases is fine for my purpose.
So I am trying a recursive method that iterates over
Element.getContent(), and I am wondering what to do when a content item
happens to be an EntityRef:
package cbit.vcell.model.summaries;

import org.jdom.Comment;
import org.jdom.DocType;
import org.jdom.Element;
import org.jdom.EntityRef;
import org.jdom.ProcessingInstruction;
import org.jdom.Text;

public class XHTMLToPlainTextConverter {

    public static String convert(Element element) {
        String text = "";
        for(Object content : element.getContent()) {
            if(content instanceof Comment) {
                // ignore
            } else if(content instanceof DocType) {
                // ignore
            } else if(content instanceof Element) {
                Element childElement = (Element) content;
                text = text + convert(childElement);
            } else if(content instanceof EntityRef) {
                EntityRef ref = (EntityRef) content;
                text = text + ref; // ???
            } else if(content instanceof ProcessingInstruction) {
                // ignore
            } else if(content instanceof Text) {
                Text childText = (Text) content;
                text = text + childText.getText();
            } else {
                // ignore, should not happen
            }
        }
        return text;
    }

}
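
One possible way to fill in the EntityRef branch, sketched under assumptions: since an EntityRef carries a name (via getName()), a small lookup table from entity names to replacement text may be enough for the simple cases described above. The class EntityNames and its method resolve are hypothetical helpers, not part of JDOM, and the table covers only the five XML predefined entities plus a few common XHTML ones. Numeric character references such as &#169; are normally expanded by the parser itself, so they would be expected to arrive as ordinary Text rather than as EntityRef.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of JDOM): maps a few common entity names to plain text. */
final class EntityNames {

    private static final Map<String, String> KNOWN = new HashMap<String, String>();

    static {
        // The five entities predefined by XML.
        KNOWN.put("amp", "&");
        KNOWN.put("lt", "<");
        KNOWN.put("gt", ">");
        KNOWN.put("quot", "\"");
        KNOWN.put("apos", "'");
        // A handful of XHTML entities; an assumption about what is "commonly used".
        KNOWN.put("copy", "\u00A9");
        KNOWN.put("reg", "\u00AE");
        KNOWN.put("nbsp", "\u00A0");
        KNOWN.put("ndash", "\u2013");
        KNOWN.put("mdash", "\u2014");
    }

    private EntityNames() {
        // static helper, no instances
    }

    /** Returns the replacement text for a known entity name, or "&name;" unchanged otherwise. */
    static String resolve(String name) {
        String replacement = KNOWN.get(name);
        return replacement != null ? replacement : "&" + name + ";";
    }
}
```

With such a helper, the line marked "???" could become: text = text + EntityNames.resolve(ref.getName()); Unknown names are passed through as "&name;" so that nothing is silently dropped.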
Thanks!
Take care
Oliver
On Thu, Mar 29, 2012 at 12:19 PM, Chris Pratt <thechrispratt at gmail.com> wrote:
> Another option I've used in the past is changing the underlying SAX parser
> that JDOM uses to TagSoup (http://ccil.org/~cowan/XML/tagsoup/). Their
> parser is tuned to parsing HTML that is not fully XML compliant.
>
> (*Chris*)
>
> On Thu, Mar 29, 2012 at 8:47 AM, Olivier Jaquemet
> <olivier.jaquemet at jalios.com> wrote:
>>
>> Hi Oliver,
>>
>> JDOM is a great tool for parsing XML...
>>
>> ... but for an XHTML fragment (which may not be completely XHTML compliant
>> ... ?)
>> and especially for text extraction, I would strongly suggest JSoup
>> http://jsoup.org/
>>
>> String text = org.jsoup.Jsoup.parse(html).text();
>>
>> Whatever your HTML is, it will work like a charm (even if it is an ugly
>> copy-paste from a WYSIWYG editor such as Word, or an ugly HTML export from some website)
>>
>> Olivier
>>
>>
>> On 29/03/2012 15:23, Oliver Ruebenacker wrote:
>>>
>>> Hello,
>>>
>>> I need a simple way to convert some XHTML fragments, provided as a
>>> JDOM Element, into plain text. I am willing to ignore most HTML tags
>>> and consider only the most commonly used predefined entities.
>>>
>>> In JDOM, an entity reference has a name, a public id and a system
>>> id. I think I know what the name means, for named entities. But what
>>> about numeric entities, how do I get the code point? And what are
>>> public id and system id?
>>>
>>> Thanks!
>>>
>>> Take care
>>> Oliver
>>>
>>
>> --
>> Olivier Jaquemet<olivier.jaquemet at jalios.com>
>> Ingénieur R&D Jalios S.A. - http://www.jalios.com/
>> @OlivierJaquemet +33970461480
>>
>>
>>
>> _______________________________________________
>> To control your jdom-interest membership:
>> http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com
>
--
Oliver Ruebenacker, Computational Cell Biologist
Virtual Cell (http://vcell.org)
SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)
http://www.oliver.curiousworld.org