[jdom-interest] [PATCH] Provide surrogate pair support to jdom

Per Norrman per.norrman at austers.se
Wed Aug 25 02:29:18 PDT 2004

As a consequence, XMLOutputter also needs to be patched. Currently,
there are no checks for surrogate pairs when escaping PCDATA content.
Try serializing a document that contains a supplementary Unicode character
using, for instance, ISO-8859-1!

Anyhow, a patch for fixing this is attached.
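
To make the XMLOutputter problem concrete, here is a minimal standalone sketch. It is not the attached patch and not JDOM code: the class and method names are hypothetical, and it assumes the code point methods added in Java 5 (String.codePointAt, Character.charCount). It walks the text one code point at a time and writes a single character reference for anything the target encoding cannot represent, so a surrogate pair is never split into two references. It does not escape markup characters such as & or <, and it does not reject unpaired surrogates.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class SurrogateEscapeSketch {

    // Hypothetical helper, not part of JDOM: replaces every code point that
    // the named charset cannot encode with a hexadecimal character reference.
    static String escapeUnencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder(text.length() + 20);
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a valid high/low surrogate pair into one value.
            int codePoint = text.codePointAt(i);
            // charCount is 2 for supplementary characters, 1 otherwise.
            int width = Character.charCount(codePoint);
            String piece = text.substring(i, i + width);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D11E (musical symbol G clef) is stored as the pair D834 DD1E.
        String text = "clef: \uD834\uDD1E";
        System.out.println(escapeUnencodable(text, "ISO-8859-1"));
    }
}
```

Run as is, this prints clef: &#x1d11e; which is one reference for the whole pair rather than one per surrogate half.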


Dave Byrne wrote:

> Below is a patch to provide decoding of surrogate pairs in
> Verifier.checkCharacterData. Currently, if a surrogate pair is in a document,
> each half of the pair will be sent independently to Verifier.isXMLCharacter,
> which will throw an IllegalDataException.  This patch combines the surrogate
> pairs into a single character which passes the tests in
> Verifier.isXMLCharacter().
> The patch is against CVS from this morning.
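
The check Dave describes can be sketched roughly as follows. This is an illustration only, not his patch and not the JDOM Verifier: the class and method names are hypothetical, and returning a reason string (or null when the text is fine) is simply a choice made for this sketch. A high surrogate followed by a low surrogate is combined into one code point before it is tested against the XML 1.0 Char production, and a surrogate without a partner is still reported as illegal.

```java
final class SurrogateCheckSketch {

    // XML 1.0 Char production, applied to a full code point instead of a char.
    static boolean isXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical check: returns null if every character is legal,
    // otherwise a short description of the first offending value.
    static String checkCharacterData(String text) {
        for (int i = 0; i < text.length(); i++) {
            int cp = text.charAt(i);
            boolean isHigh = cp >= 0xD800 && cp <= 0xDBFF;
            if (isHigh && i + 1 < text.length()) {
                char low = text.charAt(i + 1);
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    // Combine the pair and skip over the low surrogate.
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                    i++;
                }
            }
            // An unpaired surrogate keeps its raw value and fails here.
            if (!isXmlCodePoint(cp)) {
                return "0x" + Integer.toHexString(cp) + " is not a legal XML character";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(checkCharacterData("a\uD834\uDD1Eb")); // valid pair
        System.out.println(checkCharacterData("a\uD834b"));       // lone high surrogate
    }
}
```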

-------------- next part --------------
Index: XMLOutputter.java
RCS file: /home/cvspublic/jdom/src/java/org/jdom/output/XMLOutputter.java,v
retrieving revision 1.110
diff -r1.110 XMLOutputter.java
>         char highSurrogate = 0xD800;
<                         entity = "&#x" + Integer.toHexString(ch) + ";";
>                         if (isHighSurrogate(ch)) {
>                             highSurrogate = ch;
>                             continue;
>                         } else if (isLowSurrogate(ch)) {
>                             entity = encodeSurrogatePair(highSurrogate, ch);
>                             if (buffer == null) {
>                                 buffer = new StringBuffer(str.length() + 20);
>                                 buffer.append(str.substring(0, i-1));
>                             }
>                             buffer.append(entity);
>                             entity = null;
>                             continue;
>                         } else {
>                             entity = "&#x" + Integer.toHexString(ch) + ";";
>                         }
>     /**
>      * Return true if the character is a Unicode High Surrogate character
>      */
>     private boolean isHighSurrogate(char ch) {
>         return ch >= 0xD800 && ch <= 0xDBFF;
>     }
>     /**
>      * Return true if the character is a Unicode Low Surrogate character
>      */
>     private boolean isLowSurrogate(char ch) {
>         return ch >= 0xDC00 && ch <= 0xDFFF;
>     }
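>     /**
>      * Convert a Unicode surrogate pair to a character reference
>      * @param highSurrogate the high (leading) surrogate, 0xD800 to 0xDBFF
>      * @param lowSurrogate the low (trailing) surrogate, 0xDC00 to 0xDFFF
>      * @return a hexadecimal character reference for the combined code point
>      */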
>     private String encodeSurrogatePair(char highSurrogate, char lowSurrogate) {
>         int high =  (highSurrogate & 0x7FF) << 10;
>         int low = lowSurrogate - 0xDC00;
>         int codePoint = (high | low) + 0x10000; 
>         return "&#x" + Integer.toHexString(codePoint) + ';';
>     }
<         String entity;
>         String entity = null;
>         char highSurrogate = 0xD800;
<                         entity = "&#x" + Integer.toHexString(ch) + ";";
>                         if (isHighSurrogate(ch)) {
>                             highSurrogate = ch;
>                             continue;
>                         } else if (isLowSurrogate(ch)) {
>                             entity = encodeSurrogatePair(highSurrogate, ch);
>                             if (buffer == null) {
>                                 buffer = new StringBuffer(str.length() + 20);
>                                 buffer.append(str.substring(0, i-1));
>                             }
>                             buffer.append(entity);
>                             entity = null;
>                             continue;
>                         } else {
>                             entity = "&#x" + Integer.toHexString(ch) + ";";
>                         }
