[jdom-interest] What does the encoding really mean?

Elliotte Rusty Harold elharo at metalab.unc.edu
Wed Nov 28 05:07:25 PST 2001


Fred Clewis wrote:


>>  Suppose you have a UTF-8 XML file (with multibyte sequences), parse it
>>  to build a document, and then output it to a Unicode string in Java that
>>  you perhaps send somewhere using MQSeries.  In the MQSeries transport it
>>  is described as CCSID 1200, Unicode, and it is stored as two-byte Unicode.
>>  The XML data still says encoding="UTF-8".  Well, at that moment in memory,
>>  that is untrue.  Is that OK?  Does the original encoding from the file,
>>  "UTF-8", need to be preserved like this for some subsequent purpose?  Does
>>  it need to be changed to "UCS-2"?

Yes, this is OK, though less than ideal. When encoding metadata from
outside the XML document conflicts with the encoding declaration inside
it, the external metadata wins. If the MQSeries transport (whatever that
is) says that the data is encoded in UCS-2, then a parser should honor
that, even if the encoding declaration says something different. This is
one of the more obscure and lesser-known parts of the XML specification.
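
As a rough illustration of the external metadata winning, here is a
minimal sketch using the JAXP parser bundled with the JDK. The class name
ExternalEncodingDemo and the helper rootText are hypothetical, made up for
this example. The payload declares encoding="UTF-8" but is really UTF-16BE
bytes; the receiver decodes it with the charset the transport reported and
hands the parser characters, so the stale declaration should not be used
for decoding. JDOM's SAXBuilder also has a build(Reader) method, so the
same approach should carry over.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ExternalEncodingDemo {

    // Hypothetical helper: decode the bytes with the charset named by the
    // transport, then give the parser a character stream instead of bytes.
    static String rootText(byte[] payload, Charset transportCharset) {
        try {
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(payload), transportCharset);
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The parser receives already-decoded characters here.
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The declaration says UTF-8, but the bytes below are UTF-16BE.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<greeting>caf\u00e9</greeting>";
        byte[] payload = xml.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(rootText(payload, StandardCharsets.UTF_16BE));
    }
}
```

Handing the same bytes to the parser as a raw InputStream, with no
external charset, would leave it to guess from the byte pattern and the
declaration, which is exactly where the mismatch can bite.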
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+


