[jdom-interest] Simple xhtml/entity resolver?
Rolf Lear
jdom at tuis.net
Thu Mar 29 08:35:44 PDT 2012
Discussing character escapes using a web-based mail client is probably not
the smartest thing I have done...
It gets especially complicated once replies are quoted, etc.
Sorry, but the 'second example' should read:
> Having said that, you must understand that JDOM *expects* to be given
> 'un-escaped' data. If you tell JDOM to set the value for attribute 'attb'
> to the String '& #169;' (expanded with a space to preserve formatting),
> then JDOM will do that, and, when you output the value, it will escape
> the '&' for you so that the value '& #169;' (again expanded with a
> space) is preserved. For example, if we add the following lines to the
> above program:
>
> doc.getRootElement().setAttribute("attb", "& #169;"); // expanded with a space
> xout.output(doc, System.out);
>
> the output is now:
>
> ©
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" />
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" attb="&amp; #169;" />
Note how I have expanded the char escapes with a space to preserve
formatting... this may just make things more complicated... I don't know.
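
For anyone who wants to check the two points in that example without a mail client in the way, here is a minimal sketch. It uses the JDK's built-in DOM parser and serializer rather than JDOM (an assumption, made only so it runs without extra jars); the class name CharRefDemo and the 'hex' attribute are made up for illustration. JDOM's SAXBuilder and XMLOutputter should behave the same way on these two points, although the exact output layout will differ.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK DOM stands in for JDOM here.
class CharRefDemo {

    /** Parses an XML string with the JDK's default DOM parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    /** Serializes the document to a String without the XML declaration. */
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize", e);
        }
    }

    public static void main(String[] args) {
        // Decimal and hex character references for the same character.
        Document doc = parse("<root att='&#169;' hex='&#x00A9;' />");
        Element root = doc.getDocumentElement();

        // The parser has already resolved both references to one char.
        System.out.println(root.getAttribute("att").length());
        System.out.println(root.getAttribute("att").equals(root.getAttribute("hex")));

        // Setting an already-escaped string: the '&' is escaped on output,
        // so the six literal characters survive a round trip.
        root.setAttribute("attb", "&#169;");
        System.out.println(serialize(doc));
    }
}
```

The hex attribute is there because the same rule applies to the hexadecimal form: both are character references that the parser resolves before the tree is built, so no EntityRef-like object is ever created for them.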
Rolf
On Thu, 29 Mar 2012 11:15:26 -0400, Rolf Lear <jdom at tuis.net> wrote:
> Ahh,
>
> In order to discuss the 'entity' processing, you need to be careful
> about how you specify the 'location' of the data...
>
> For example, there are three basic 'locations' for content when we
> consider JDOM: the 'unparsed XML', the 'JDOM Document', and the 'output'.
>
> Also, when you say &169; do you mean &169; or do you actually mean
> &#169;? There is a *big* difference....
>
> When you parse 'unparsed XML' the parser will always translate character
> escapes to the actual character, for example, &#169; will become ©. JDOM
> will never see the '&#169;'. If, for example, in the 'unparsed XML' file,
> you had <root att="&#169;" />, then, when parsed and given to JDOM, you
> will always have the single char © as root.getAttributeValue("att").
>
> When you output that value from JDOM, JDOM will use the 'charset' of the
> output destination to determine whether the © char needs to be escaped.
> For example, the following 'program':
>
> SAXBuilder builder = new SAXBuilder();
> Document doc = builder.build(new StringReader("<root att='&#169;' />"));
> System.out.println(doc.getRootElement().getAttributeValue("att"));
> XMLOutputter xout = new XMLOutputter();
> xout.output(doc, System.out);
>
> outputs:
>
> ©
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" />
>
>
> Having said that, you must understand that JDOM *expects* to be given
> 'un-escaped' data. If you tell JDOM to set the value for attribute 'attb'
> to the String '&#169;' then JDOM will do that, and, when you output the
> value, it will escape the '&' for you so that the value '&#169;' is
> preserved. For example, if we add the following lines to the above
> program:
>
> doc.getRootElement().setAttribute("attb", "&#169;");
> xout.output(doc, System.out);
>
> the output is now:
>
> ©
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" />
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" attb="&amp;#169;" />
>
>
> So, making sure that we have a good understanding of the concept of
> character escapes, you must realize that they are *not* EntityReferences:
> you should never see any JDOM object representing a character escape.
>
> On the other hand, if you had the entity reference '&copy;' in your
> 'unparsed XML', the parser (by default) should have replaced it with the
> appropriate character(s) when the document was parsed. Again, JDOM will
> see the character © and not the reference '&copy;'. A 'default' parser
> will fail to parse a document if it has references that cannot be
> resolved. If you change the default parse behaviour (to remove the
> entity-resolve process), then instead of the © character, you will have
> a JDOM EntityRef with the name 'copy'.
>
> In other words, you have to go out of your way to create EntityRef
> instances. If you want to ignore the processes the parser uses to
> resolve entities, then you will need to scan the JDOM tree, look for
> EntityRefs, and manually replace them with the appropriate Text, using
> whatever strategy you want to use.
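
One possible 'strategy' for that last step is a plain lookup table from entity name to replacement text. The sketch below is only an illustration: the class SimpleEntityResolver and its resolve method are hypothetical names, the table holds just a handful of the XHTML named entities, and it is deliberately independent of JDOM so it runs on its own. With JDOM, the name would presumably come from EntityRef.getName(), and the returned String would go into the Text that replaces the EntityRef.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps an entity name to its replacement text.
class SimpleEntityResolver {

    private static final Map<String, String> NAMED = new HashMap<>();

    static {
        // A small, incomplete sample of XHTML named entities.
        NAMED.put("copy", "\u00A9");  // copyright sign
        NAMED.put("reg", "\u00AE");   // registered sign
        NAMED.put("nbsp", "\u00A0");  // non-breaking space
        NAMED.put("eacute", "\u00E9");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    /**
     * Returns the replacement text for an entity name such as "copy".
     * Unknown names are returned in their "&name;" form so that nothing
     * is silently dropped from the rendered text.
     */
    static String resolve(String name) {
        String replacement = NAMED.get(name);
        return replacement != null ? replacement : "&" + name + ";";
    }

    public static void main(String[] args) {
        System.out.println(resolve("copy"));
        System.out.println(resolve("nosuch"));
    }
}
```

Falling back to the "&name;" form for unknown names is a design choice for the 'primitive browser' use case; throwing an exception would be closer to what a default parser does with an unresolvable reference.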
>
>
>
> In a more general answer to your original question 'how do I basically
> replace a browser', though, what you really want to be doing is a
> Transform on your JDOM document, to create an appropriate output for
> your needs. The transform you use will depend on what results you want.
> Have a look at the XSLTransform class in JDOM, as well as the various
> resources on the net for XSL Transformations.
>
>
> Rolf
>
>
>
> On Thu, 29 Mar 2012 10:28:26 -0400, Oliver Ruebenacker
> <curoli at gmail.com> wrote:
>> Hello Rolf,
>>
>> I think there is a misunderstanding. I don't want to output as XML.
>> I want to render the XHTML as text like a very primitive browser would
>> display it.
>>
>> I'm building a String by traversing the tree by calling
>> Element.getContent(). For example, a © can be encoded in XML as
>> "&copy;". Presumably, the Element tree would contain an EntityRef with
>> name "copy". But what if an XML document contains "&169;" or
>> "&x00A9;"? What would the EntityRef object look like?
>>
>> Thanks!
>>
>> Take care
>> Oliver
>>
>> On Thu, Mar 29, 2012 at 9:46 AM, Rolf Lear <jdom at tuis.net> wrote:
>>>
>>> Hi Oliver.
>>>
>>> If you already have the XHTML content as JDOM Elements, then you
>>> should be able to (just) do:
>>>
>>> XMLOutputter xout = new XMLOutputter();
>>> String fragment = xout.outputString(element);
>>>
>>> If you want to change the format of the output (indenting, etc.), you
>>> can add a 'Format' to the XMLOutputter with:
>>>
>>> XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
>>> String fragment = xout.outputString(element);
>>>
>>>
>>> I think you may be chasing a red herring with the Entity References.
>>>
>>> The EntityRef code is a 'CYA' implementation, but, in reality, the
>>> SystemID and PublicID are never going to be needed in regular usage.
>>>
>>> The only place I know of where you get entity references is if you
>>> specify that your input parser should ignore entity-reference lookups
>>> when parsing; in JDOM you will then end up with an EntityRef instead
>>> of its 'underlying' text.
>>>
>>> Rolf
>>>
>>>