[jdom-interest] Simple xhtml/entity resolver?
Paul Libbrecht
paul at hoplahup.net
Thu Mar 29 14:24:26 PDT 2012
Oliver,
I'm curious: did you ever actually get an EntityRef?
In my experience, SAXBuilder never gives you them. It also converts every numeric character reference to the corresponding character.
Still, I tried to answer your question and could not.
Looking at XMLOutputter, I saw that it simply writes out the entity reference itself (the ampersand, the name, and a semicolon), and indeed the EntityRef object carries no information that would let you "resolve" it.
The last step, entity resolution, is really the business of the DTD.
The entity declarations for XHTML are among the reasons the XHTML DTD is so heavy. If I remember correctly, MathML has an entity definition table that may be easier to process (it is also available as XML, if needed).
Also, beware if you want to parse XHTML:
- with a DTD, and without some public/private catalog, the DTD is loaded from the W3C very slowly (and the W3C refuses the requests after a while)
- without a DTD, all entity references are broken.
... but maybe you don't parse it yourself?
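The catalog idea in the first bullet can be sketched with the standard SAX EntityResolver interface. This is only a minimal sketch: the class name LocalDtdResolver and the local file path are hypothetical, and it assumes you have saved copies of the DTD (and of the entity files it includes) on disk.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that redirects known public IDs to local copies. */
final class LocalDtdResolver implements EntityResolver {

    // public ID -> system ID (URL or path) of the locally stored copy
    private final Map<String, String> localCopies = new HashMap<String, String>();

    /** Registers a local copy for one public ID. */
    void register(String publicId, String localSystemId) {
        localCopies.put(publicId, localSystemId);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = localCopies.get(publicId);
        if (local == null) {
            // Unknown entity: returning null lets the parser use its default
            // behaviour, which usually means fetching the original system ID.
            return null;
        }
        InputSource source = new InputSource(local);
        source.setPublicId(publicId);
        return source;
    }
}
```

With JDOM, such a resolver would be handed to the builder through SAXBuilder.setEntityResolver(...) before parsing. The XHTML DTDs pull in separate entity files, so those would need to be registered as well for the named entities to resolve.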
All in all, may I conjecture that the EntityRef objects are actually created programmatically? If so, you need to expand them in your own program using the table mentioned above (this could be a nice contribution).
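A minimal sketch of such a table-based expansion, assuming a small hand-written subset of the XHTML entity names is enough. The class XhtmlEntityTable and its resolve method are hypothetical names, not part of JDOM, and a real table would have to be filled from the XHTML entity files.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper mapping a few XHTML entity names to their characters. */
final class XhtmlEntityTable {

    // Illustrative subset only; the XHTML DTDs declare many more names.
    private static final Map<String, String> ENTITIES = new HashMap<String, String>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("apos", "'");
        ENTITIES.put("nbsp", "\u00A0");   // no-break space
        ENTITIES.put("copy", "\u00A9");   // copyright sign
        ENTITIES.put("reg", "\u00AE");    // registered sign
        ENTITIES.put("eacute", "\u00E9"); // e with acute accent
    }

    private XhtmlEntityTable() {
        // static helper, not meant to be instantiated
    }

    /**
     * Returns the replacement text for an entity name, or the unexpanded
     * reference (ampersand, name, semicolon) when the name is not in the table.
     */
    static String resolve(String name) {
        String value = ENTITIES.get(name);
        return value != null ? value : "&" + name + ";";
    }
}
```

In the EntityRef branch of the converter quoted below, the name would come from ref.getName(), so that branch could become text = text + XhtmlEntityTable.resolve(ref.getName()). Unknown names are left as written, which mirrors what XMLOutputter prints.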
Hope it helps.
paul
On 29 March 2012 at 18:54, Oliver Ruebenacker wrote:
> Hello,
>
> Thanks for all the advice, but it seems I did not make myself
> sufficiently clear.
>
> My situation is this: someone else already parsed XHTML and gave me
> the JDOM element that represents a fragment of it.
>
> Let us say the original fragment looks something like this:
>
> "<p><b>© 2012</b> by <em>Dewey, Cheetham & Howe</em></p>"
> "<p><b>© 2012</b> by <em>Dewey, Cheetham & Howe</em></p>"
> "<p><b>© 2012</b> by <em>Dewey, Cheetham Howe</em></p>"
>
> I never get to see that fragment, but instead an object of type
> Element. What I want to get is a String that looks roughly like this:
>
> "© 2012 by Dewey, Cheetham & Howe"
>
> A simple lightweight solution that is roughly acceptable in most
> simple cases is fine for my purpose.
>
> So I am trying a recursive method that iterates over
> Element.getContent() and then I am wondering what to do if the content
> happens to be EntityRef?
>
> package cbit.vcell.model.summaries;
>
> import org.jdom.Comment;
> import org.jdom.DocType;
> import org.jdom.Element;
> import org.jdom.EntityRef;
> import org.jdom.ProcessingInstruction;
> import org.jdom.Text;
>
> public class XHTMLToPlainTextConverter {
>
>     public static String convert(Element element) {
>         String text = "";
>         for(Object content : element.getContent()) {
>             if(content instanceof Comment) {
>                 // ignore
>             } else if(content instanceof DocType) {
>                 // ignore
>             } else if(content instanceof Element) {
>                 Element childElement = (Element) content;
>                 text = text + convert(childElement);
>             } else if(content instanceof EntityRef) {
>                 EntityRef ref = (EntityRef) content;
>                 text = text + ref; // ???
>             } else if(content instanceof ProcessingInstruction) {
>                 // ignore
>             } else if(content instanceof Text) {
>                 Text childText = (Text) content;
>                 text = text + childText.getText();
>             } else {
>                 // ignore, should not happen
>             }
>         }
>         return text;
>     }
>
> }
>
> Thanks!
>
> Take care
> Oliver
>
> On Thu, Mar 29, 2012 at 12:19 PM, Chris Pratt <thechrispratt at gmail.com> wrote:
>> Another option I've used in the past is changing the underlying SAX parser
>> that jDOM uses to TagSoup ( http://ccil.org/~cowan/XML/tagsoup/). Their
>> parser is tuned for parsing HTML that is not fully XML-compliant.
>>
>> (*Chris*)
>>
>> On Thu, Mar 29, 2012 at 8:47 AM, Olivier Jaquemet
>> <olivier.jaquemet at jalios.com> wrote:
>>>
>>> Hi Oliver,
>>>
>>> JDom is a great tool for parsing XML...
>>>
>>> ... but for an XHTML fragment (which may not be completely XHTML compliant
>>> ... ?)
>>> and especially for text extraction, I would strongly suggest JSoup
>>> http://jsoup.org/
>>>
>>> String text = org.jsoup.Jsoup.parse(html).text();
>>>
>>> Whatever your HTML is, it will work like a charm (even if it is an ugly
>>> WYSIWYG copy-paste from Word or an ugly HTML export from some website)
>>>
>>> Olivier
>>>
>>>
>>> On 29/03/2012 15:23, Oliver Ruebenacker wrote:
>>>>
>>>> Hello,
>>>>
>>>> I need a simple way to convert some XHTML fragments, provided as a
>>>> JDOM Element, into plain text. I am willing to ignore most HTML tags
>>>> and consider only the most commonly used predefined entities.
>>>>
>>>> In JDOM, an entity reference has a name, a public id and a system
>>>> id. I think I know what the name means, for named entities. But what
>>>> about numeric entities, how do I get the code point? And what are
>>>> public id and system id?
>>>>
>>>> Thanks!
>>>>
>>>> Take care
>>>> Oliver
>>>>
>>>
>>> --
>>> Olivier Jaquemet<olivier.jaquemet at jalios.com>
>>> Ingénieur R&D Jalios S.A. - http://www.jalios.com/
>>> @OlivierJaquemet +33970461480
>>>
>>>
>>>
>>> _______________________________________________
>>> To control your jdom-interest membership:
>>> http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com
>>
>>
>>
>
>
>
> --
> Oliver Ruebenacker, Computational Cell Biologist
> Virtual Cell (http://vcell.org)
> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)
> http://www.oliver.curiousworld.org
>