[jdom-interest] [PATCH] Provide surrogate pair support to jdom
Per Norrman
per.norrman at austers.se
Wed Aug 25 02:29:18 PDT 2004
As a consequence, XMLOutputter also needs to be patched. Currently,
there are no checks for surrogate pairs when escaping PCDATA content.
Try serializing a document that contains a supplementary Unicode character
using, for instance, ISO-8859-1!
Anyhow, a patch for fixing this is attached.
/pmn
Dave Byrne wrote:
> Below is a patch to provide decoding of surrogate pairs in
> Verifier.checkCharacterData. Currently if a surrogate pair is in a document,
> each half of the pair will be sent independently to Verifier.isXMLCharacter
> which will throw an IllegalDataException. This patch combines the surrogate
> pairs into a single character which passes the tests in
> Verifier.isXMLCharacter()
>
> The patch is against CVS from this morning.
>
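
For readers following the thread, the arithmetic behind "combining the
surrogate pairs into a single character" can be sketched as below. This is
an illustration only, not code from either patch: the class
SurrogateSketch and its method names are hypothetical, and the
cross-check against Character.toCodePoint assumes a Java 5 or later
runtime.

```java
// Sketch only: SurrogateSketch and its methods are hypothetical names,
// not part of JDOM or of the attached patch.
final class SurrogateSketch {

    /** Combine a surrogate pair by hand, using the standard UTF-16 arithmetic. */
    static int combine(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    /** Format a code point as a hexadecimal character reference. */
    static String toCharRef(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint) + ";";
    }

    public static void main(String[] args) {
        String s = "\uD835\uDD04";          // U+1D504 as two UTF-16 code units
        char high = s.charAt(0);
        char low = s.charAt(1);

        int byHand = combine(high, low);
        int byLibrary = Character.toCodePoint(high, low); // Java 5+ only

        System.out.println(toCharRef(byHand));    // prints "&#x1d504;"
        System.out.println(byHand == byLibrary);  // prints "true"
    }
}
```

The point of the sketch is that one reference must be written for the
combined code point; writing a reference for each half separately would
name a surrogate code unit, which is not a legal XML character.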
-------------- next part --------------
Index: XMLOutputter.java
===================================================================
RCS file: /home/cvspublic/jdom/src/java/org/jdom/output/XMLOutputter.java,v
retrieving revision 1.110
diff -r1.110 XMLOutputter.java
1327a1328
> char highSurrogate = 0xD800;
1361c1362,1376
< entity = "&#x" + Integer.toHexString(ch) + ";";
---
> if (isHighSurrogate(ch)) {
> highSurrogate = ch;
> continue;
> } else if (isLowSurrogate(ch)) {
> entity = encodeSurrogatePair(highSurrogate, ch);
> if (buffer == null) {
> buffer = new StringBuffer(str.length() + 20);
> buffer.append(str.substring(0, i-1));
> }
> buffer.append(entity);
> entity = null;
> continue;
> } else {
> entity = "&#x" + Integer.toHexString(ch) + ";";
> }
1393a1409,1435
>
> /**
> * Return true if the character is a Unicode High Surrogate character
> */
> private boolean isHighSurrogate(char ch) {
> return ch >= 0xD800 && ch <= 0xDBFF;
> }
>
> /**
> * Return true if the character is a Unicode Low Surrogate character
> */
> private boolean isLowSurrogate(char ch) {
> return ch >= 0xDC00 && ch <= 0xDFFF;
> }
>
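> /**
> * Convert a Unicode surrogate pair to a hexadecimal character reference
> * @param highSurrogate the leading code unit (0xD800-0xDBFF)
> * @param lowSurrogate the trailing code unit (0xDC00-0xDFFF)
> * @return the character reference for the combined code point
> */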
> private String encodeSurrogatePair(char highSurrogate, char lowSurrogate) {
> int high = (highSurrogate & 0x7FF) << 10;
> int low = lowSurrogate - 0xDC00;
> int codePoint = (high | low) + 0x10000;
> return "&#x" + Integer.toHexString(codePoint) + ';';
> }
1408c1450
< String entity;
---
> String entity = null;
1409a1452
> char highSurrogate = 0xD800;
1432c1475,1489
< entity = "&#x" + Integer.toHexString(ch) + ";";
---
> if (isHighSurrogate(ch)) {
> highSurrogate = ch;
> continue;
> } else if (isLowSurrogate(ch)) {
> entity = encodeSurrogatePair(highSurrogate, ch);
> if (buffer == null) {
> buffer = new StringBuffer(str.length() + 20);
> buffer.append(str.substring(0, i-1));
> }
> buffer.append(entity);
> entity = null;
> continue;
> } else {
> entity = "&#x" + Integer.toHexString(ch) + ";";
> }