[jdom-interest] Simple xhtml/entity resolver?

Rolf Lear jdom at tuis.net
Thu Mar 29 08:35:44 PDT 2012


Discussing character escapes using a web-based mail client is probably not
the smartest thing I have done...

Especially complicated when replies are made, etc.

Sorry, but the 'second example' should read:

> Having said that, you must understand that JDOM *expects* to be given
> 'un-escaped' data. If you tell JDOM to set the value for attribute 'attb'
> to the String '& #169;' (expanded with a space to preserve formatting),
> then JDOM will do that, and, when you output the value, it will escape
> the '&' for you so that the value '& #169;' (again expanded with a space)
> is preserved.... for example, if we add the following lines to the above
> program:
> 
> 		doc.getRootElement().setAttribute("attb", "& #169;"); // expanded with a space
> 		xout.output(doc, System.out);
> 
> the output is now:
> 
> ©
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" />
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" attb="&amp; #169;" />

note how I have expanded the char escapes with a space to preserve
formatting... this may just make things more complicated... I don't know.

Rolf


On Thu, 29 Mar 2012 11:15:26 -0400, Rolf Lear <jdom at tuis.net> wrote:
> Ahh,
> 
> In order to discuss the 'entity' processing, you need to be careful about
> how you specify the 'location' of the data...
> 
> For example, there are three basic 'locations' for content when we
> consider JDOM, the 'unparsed XML', the 'JDOM Document', and the 'output'.
> 
> Also, when you say &169; do you mean &169; or do you actually mean &#169;?
> There is a *big* difference....
> 
> When you parse 'unparsed XML' the parser will always translate character
> escapes to the actual character, for example, &#169; will become ©. JDOM
> will never see the '&#169;'. If, for example, in the 'unparsed XML' file,
> you had <root att="&#169;" />, then, when parsed and given to JDOM, you will
> always have the single char © as root.getAttributeValue("att").
> 
> When you output that value from JDOM, JDOM will use the 'charset' of the
> output destination to determine whether the © char needs to be escaped. For
> example, the following 'program':
> 
> 		SAXBuilder builder = new SAXBuilder();
> 		Document doc = builder.build(new StringReader("<root att='&#169;' />"));
> 		System.out.println(doc.getRootElement().getAttributeValue("att"));
> 		XMLOutputter xout = new XMLOutputter();
> 		xout.output(doc, System.out);
> 
> outputs:
> 
> ©
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" />
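
The point that both the decimal and the hexadecimal form of a character escape collapse to the same single character before the application sees them can be checked without JDOM. The following is a minimal sketch that assumes the JDK's built-in DOM parser stands in for the parser behind SAXBuilder; the class and method names (CharRefDemo, parsedAtt) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharRefDemo {
    /** Parses the XML text and returns the root element's 'att' attribute value. */
    static String parsedAtt(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("att");
    }

    public static void main(String[] args) throws Exception {
        String decimal = parsedAtt("<root att='&#169;' />");
        String hex = parsedAtt("<root att='&#xA9;' />");
        // Both escapes arrive as one character, U+00A9.
        System.out.println(decimal.length() + " " + hex.length() + " " + decimal.equals(hex));
        // prints "1 1 true"
        System.out.println((int) decimal.charAt(0));
        // prints "169"
    }
}
```

Nothing in the parsed value records which escape form was used in the source text, which is why no object representing the escape ever shows up in the tree.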
> 
> 
> Having said that, you must understand that JDOM *expects* to be given
> 'un-escaped' data. If you tell JDOM to set the value for attribute 'attb'
> to the String '&#169;' then JDOM will do that, and, when you output the
> value, it will escape the '&' for you so that the value '&#169;' is
> preserved.... for example, if we add the following lines to the above
> program:
> 
> 		doc.getRootElement().setAttribute("attb", "&#169;");
> 		xout.output(doc, System.out);
> 
> the output is now:
> 
> ©
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" />
> <?xml version="1.0" encoding="UTF-8"?>
> <root att="©" attb="&amp;#169;" />
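
The same "escape on output" behaviour can be reproduced with the JDK alone. This is a sketch under the assumption that the JDK's DOM and identity Transformer are an acceptable stand-in for JDOM's XMLOutputter; EscapeOnOutputDemo and serializeWithAttb are hypothetical names.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapeOnOutputDemo {
    /** Builds <root attb="..."/> in memory and returns its serialized form. */
    static String serializeWithAttb(String value) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        // The value is stored as-is: six literal characters when given "&#169;".
        root.setAttribute("attb", value);
        doc.appendChild(root);

        // An identity transform serializes the tree and escapes markup characters.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The '&' is written as '&amp;', so re-parsing gives back the same six characters.
        System.out.println(serializeWithAttb("&#169;"));
    }
}
```

Because the in-memory value is treated as un-escaped data, the serializer has to write the ampersand as an escape; otherwise a later parse would turn the text into a single © character.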
> 
> 
> So, making sure that we have a good understanding of the concept of
> character escapes, you must realize that they are *not* EntityReferences...
> you should never see any JDOM object representing a character escape.
> 
> On the other hand, if you had the entity reference '&copy;' in your
> 'unparsed XML', the parser (by default) should have replaced it with the
> appropriate character(s) when the document was parsed. Again, JDOM will see
> the character © and not the reference '&copy;'. A 'default' parser will
> fail to parse a document if it has references that cannot be resolved. If
> you change the default parse behaviour (to remove the entity-resolve
> process), then instead of the © character, you will have a JDOM EntityRef
> with the name 'copy'.
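
A small sketch of the two default behaviours described above: a declared entity is expanded by the parser, and an undeclared one makes the parse fail. It assumes the JDK's built-in DOM parser rather than JDOM, and the inline DTD declaring 'copy' is invented for the example. In JDOM the switch that turns expansion off is SAXBuilder.setExpandEntities(false); that variant is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class EntityRefDemo {
    /** Parses the XML text with default settings and returns the root element's text. */
    static String rootText(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)))
                 .getDocumentElement().getTextContent();
    }

    /** Returns false when the parser rejects the document as not well-formed. */
    static boolean parses(String xml) throws Exception {
        try {
            rootText(xml);
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // 'copy' is declared in an internal DTD subset, so the parser can resolve it.
        String declared = "<!DOCTYPE root [<!ENTITY copy \"&#169;\">]><root>&copy;</root>";
        // No declaration anywhere, so the reference cannot be resolved.
        String undeclared = "<root>&copy;</root>";

        System.out.println(rootText(declared).equals("\u00A9")); // prints "true"
        System.out.println(parses(undeclared));                  // prints "false"
    }
}
```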
> 
> In other words, you have to go out of your way to create EntityRef
> instances. If you want to ignore the processes the parser uses to resolve
> entities, then you will need to scan the JDOM tree, look for EntityRefs,
> and manually replace them with the appropriate Text.... using whatever
> strategy you want to use.
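
One possible "strategy" for the manual replacement is a plain lookup table from entity name to replacement text. The sketch below covers only that lookup step; the table is a hypothetical three-entry sample, not the full XHTML entity set. In JDOM the name would come from EntityRef.getName(), and the result would be wrapped in a Text node and swapped in for the EntityRef, which is left out here.

```java
import java.util.Map;

public class EntityNameResolver {
    // Hypothetical sample table: a few XHTML entity names, far from complete.
    private static final Map<String, String> KNOWN = Map.of(
            "copy", "\u00A9",
            "nbsp", "\u00A0",
            "eacute", "\u00E9");

    /** Returns the replacement text for a name, or the reference itself if unknown. */
    static String resolve(String name) {
        String text = KNOWN.get(name);
        // Unknown names are written back out as '&name;' so nothing is silently lost.
        return text != null ? text : "&" + name + ";";
    }

    public static void main(String[] args) {
        System.out.println(resolve("copy").equals("\u00A9")); // prints "true"
        System.out.println(resolve("bogus"));                 // prints "&bogus;"
    }
}
```

Falling back to the literal reference for unknown names is only one choice; throwing an exception or substituting a placeholder would work equally well, depending on how strict the rendering needs to be.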
> 
> 
> 
> In a more general answer to your original question 'how do I basically
> replace a browser', though, what you really want to be doing is a Transform
> on your JDOM document, to create an appropriate output for your needs. The
> transform you use will depend on what results you want. Have a look at the
> XSLTransformer class in JDOM, as well as the various resources on the net for
> XSL Transformations.
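
To make the transform suggestion concrete, here is a minimal sketch of an XSL transformation that renders markup as plain text. It assumes the JDK's javax.xml.transform API rather than JDOM's own transform class, and the one-template stylesheet is a made-up example that simply collapses whitespace; a real "primitive browser" would need templates for paragraphs, line breaks and so on.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XhtmlToTextDemo {
    // Hypothetical stylesheet: output the document's text with whitespace collapsed.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:value-of select='normalize-space(.)'/></xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML text and returns the text result. */
    static String toText(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The character escape is resolved by the parser before the stylesheet runs.
        System.out.println(toText("<html><body><p>Copyright &#169;   2012</p></body></html>"));
    }
}
```

With the text output method no markup escaping is applied to the result, so the © character comes out as the character itself rather than as an escape.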
> 
> 
> Rolf
> 
> 
> 
> On Thu, 29 Mar 2012 10:28:26 -0400, Oliver Ruebenacker <curoli at gmail.com>
> wrote:
>> Hello Rolf,
>> 
>>   I think there is a misunderstanding. I don't want to output as XML.
>> I want to render the XHTML as text like a very primitive browser would
>> display it.
>> 
>>   I'm building a String by traversing the tree by calling
>> Element.getContent(). For example, a © can be encoded in XML as
>> "&copy;". Presumably, the Element tree would contain an EntityRef with
>> name "copy". But what if an XML document contains "&169;" or
>> "&x00A9;"? What would the EntityRef object look like?
>> 
>>   Thanks!
>> 
>>      Take care
>>      Oliver
>> 
>> On Thu, Mar 29, 2012 at 9:46 AM, Rolf Lear <jdom at tuis.net> wrote:
>>>
>>> Hi Oliver.
>>>
>>> If you already have the XHTML content as JDOM Elements, then you should
>>> be able to (just) do:
>>>
>>> XMLOutputter xout = new XMLOutputter();
>>> String fragment = xout.outputString(element);
>>>
>>> If you want to change the format of the output (indenting, etc.), you can
>>> add a 'Format' to the XMLOutputter with:
>>>
>>> XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
>>> String fragment = xout.outputString(element);
>>>
>>>
>>> I think you may be chasing a red herring with the Entity References.
>>>
>>> The EntityRef code is a 'CYA' implementation, but, in reality, the
>>> SystemID and PublicID are never going to be needed in regular usage.
>>>
>>> The only place I know of where you have entity references is if you
>>> specify your input parser should ignore entity-reference lookups when
>>> parsing, and in JDOM you will end up with an EntityRef instead of its
>>> 'underlying' text.
>>>
>>> Rolf
>>>
>>>

