[jdom-interest] [PATCH] Provide surrogate pair support to jdom
Dave Byrne
dave-lists at intelligentendeavors.com
Tue Aug 24 09:58:19 PDT 2004
Below is a patch that adds decoding of surrogate pairs to
Verifier.checkCharacterData. Currently, if a document contains a surrogate
pair, each half of the pair is passed separately to Verifier.isXMLCharacter,
which rejects it, and the result is an IllegalDataException. The patch
combines each surrogate pair into a single character value, which passes the
tests in Verifier.isXMLCharacter().
The patch is against CVS from this morning.
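
For reference, the arithmetic used to combine a pair can be sketched on its
own. This is only a minimal standalone illustration, not part of the patch;
the class name SurrogateDecodeSketch and the helper decode() are made up for
the example.

```java
class SurrogateDecodeSketch {
    // Combine a high surrogate (0xD800-0xDBFF) and a low surrogate
    // (0xDC00-0xDFFF) into one code point, using the same formula the
    // patch applies inline. The caller is assumed to have already
    // checked that both halves are in range.
    static int decode(char high, char low) {
        return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is D834 DD1E in UTF-16.
        int cp = decode('\uD834', '\uDD1E');
        System.out.println("0x" + Integer.toHexString(cp));
    }
}
```

On Java 5 and later, Character.toCodePoint(high, low) should give the same
result.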
Thanks
Dave Byrne
--- Verifier.old Fri Feb 6 01:28:30 2004
+++ Verifier.java Tue Aug 24 09:55:39 2004
@@ -137,7 +137,6 @@
* characters allowed by the XML 1.0 specification. The C0 controls
* (e.g. null, vertical tab, formfeed, etc.) are specifically excluded
* except for carriage return, linefeed, and the horizontal tab.
- * Surrogates are also excluded.
* <p>
* This method is useful for checking element content and attribute
* values. Note that characters
@@ -155,15 +154,41 @@
return "A null is not a legal XML value";
}
- // do check
- for (int i = 0, len = text.length(); i<len; i++) {
- if (!isXMLCharacter(text.charAt(i))) {
- // Likely this character can't be easily displayed
- // because it's a control so we use it'd hexadecimal
- // representation in the reason.
- return ("0x" + Integer.toHexString(text.charAt(i))
- + " is not a legal XML character");
- }
+
+ for(int i = 0; i < text.length(); i++) {
+
+ int ch = text.charAt(i);
+
+ if (ch >= 0xD800 && ch <= 0xDBFF) {
+ //encountered the first part of a surrogate pair
+ //make sure that the next char is the low-surrogate
+ char low;
+
+ try {
+ low = text.charAt(i + 1);
+ } catch(IndexOutOfBoundsException ex) {
+ return "Surrogate Pair Truncated";
+ }
+
+ if (low < 0xDC00 || low > 0xDFFF) {
+ //the low surrogate is not present
+ return "Illegal Surrogate Pair";
+ }
+ else {
+ //it's a good pair, calculate the true value of
+ //the character to then pass to isXMLCharacter()
+ ch = 0x10000 + (ch - 0xD800) * 0x400 + (low - 0xDC00);
+ i++;
+ }
+ }
+
+ if (!isXMLCharacter(ch)) {
+ // Likely this character can't be easily displayed
+ // because it's a control so we use its hexadecimal
+ // representation in the reason.
+ return ("0x" + Integer.toHexString(ch)
+ + " is not a legal XML character");
+ }
}
// If we got here, everything is OK
@@ -715,11 +740,11 @@
* character is a character according to production 2 of the
* XML 1.0 specification.
*
- * @param c <code>char</code> to check for XML compliance
+ * @param c <code>int</code> to check for XML compliance
* @return <code>boolean</code> true if it's a character,
* false otherwise
*/
- private static boolean isXMLCharacter(char c) {
+ private static boolean isXMLCharacter(int c) {
if (c == '\n') return true;
if (c == '\r') return true;
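
The reason isXMLCharacter needs an int parameter can be seen from production
2 (Char) of the XML 1.0 specification, which allows code points up to
0x10FFFF. The sketch below is a rough standalone illustration of that
production, assuming nothing about how Verifier actually implements the
check; the class XmlCharSketch and the method isXmlChar() are hypothetical
names.

```java
class XmlCharSketch {
    // Hypothetical stand-in for Verifier.isXMLCharacter(int).
    // Ranges follow XML 1.0 production 2:
    //   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int c) {
        if (c == 0x9 || c == 0xA || c == 0xD) return true;
        if (c >= 0x20 && c <= 0xD7FF) return true;
        if (c >= 0xE000 && c <= 0xFFFD) return true;
        return c >= 0x10000 && c <= 0x10FFFF;
    }

    public static void main(String[] args) {
        // A lone surrogate half falls outside every allowed range.
        System.out.println(isXmlChar(0xD834));
        // The decoded pair falls in the supplementary range.
        System.out.println(isXmlChar(0x1D11E));
    }
}
```

A char can only hold values up to 0xFFFF, so the last range is unreachable
unless the parameter is widened to int.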