[jdom-interest] XMLOutputter problems with Unicode
Ian Lea
ian at digimem.net
Wed Jul 3 03:04:42 PDT 2002
FileWriter uses the platform default encoding, whatever encoding you
pass to XMLOutputter. Try using an OutputStreamWriter created with
"UTF-8" and wrapped around a FileOutputStream.
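
A minimal sketch of that approach, using only the JDK so that it runs
without JDOM on the classpath. The class name, method name and sample
text are made up for illustration. With JDOM you would hand the same
kind of writer to XMLOutputter.output(doc, writer) in place of the
FileWriter.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8WriterDemo {

    // Writes text to a file as UTF-8, whatever the platform default encoding is.
    static void writeUtf8(String fileName, String text) throws IOException {
        // The charset is named explicitly instead of relying on the default.
        Writer writer = new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8");
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // \u201C is the left double quotation mark (decimal 8220).
        writeUtf8("test.xml", "<indexes>\u201C</indexes>");
        byte[] bytes = Files.readAllBytes(Paths.get("test.xml"));
        System.out.println(bytes.length + " bytes written");
    }
}
```
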
--
Ian.
ian at digimem.net
> madeinstein at hotmail.com (Mad Einstein) wrote
>
> I tried to do it like this:
>
> Element root = new Element("indexes");
> Document doc = new Document(root); //sample JDom Document
>
> FileWriter fw = new FileWriter("test.xml",false);
> new XMLOutputter(" ", true, "UTF-8").output(doc,fw);
>
> And the result was, as I said, one byte (93 hex) instead of \u201C (decimal 8220)
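
A single byte 93 hex is what the left double quotation mark (U+201C,
decimal 8220) becomes when it is encoded as windows-1252. The sketch
below assumes that was the platform default on the machine in question;
the thread does not say so. It prints the bytes for both encodings. In
UTF-8 the same character takes three bytes. The class and method names
are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytesDemo {

    // Formats a byte array as space-separated upper-case hex, e.g. "E2 80 9C".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String quote = "\u201C"; // left double quotation mark, decimal 8220

        // Assumption: the platform default encoding was windows-1252.
        byte[] cp1252 = quote.getBytes(Charset.forName("windows-1252"));
        byte[] utf8 = quote.getBytes(StandardCharsets.UTF_8);

        System.out.println("windows-1252: " + toHex(cp1252)); // 93
        System.out.println("UTF-8:        " + toHex(utf8));   // E2 80 9C
    }
}
```
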
>
> Should I use a different writer? Do you know any writers that will give me
> proper Unicode output?
>
> Thanks,
>
> Mad Einstein
>
> ----- Original Message -----
> From: "Jason Hunter" <jhunter at servlets.com>
> To: "Mad Einstein" <madeinstein at hotmail.com>
> Cc: <jdom-interest at jdom.org>
> Sent: Tuesday, July 02, 2002 8:06 PM
> Subject: Re: [jdom-interest] XMLOutputter problems with Unicode
>
>
> > Your solution is one approach. However, if you simply leave the
> > outputter's encoding as UTF-8 (the default) and pass in an output stream
> > or a writer designed for UTF-8, then characters are encoded correctly
> > without needing to be escaped. That should be faster than your
> > solution. If you don't see that happening, you probably passed in an
> > improper writer or changed the encoding.
> >
> > -jh-
> >
> > > Mad Einstein wrote:
> > >
> > > The current XMLOutputter class (Version 8) doesn't support Unicode
> > > characters with character codes of 128 and above.
> > >
> > > I was trying to save the character \u201C (decimal 8220) to XML
> > > using XMLOutputter, and the resulting file contained one byte
> > > (93 hex) instead of two bytes. I then couldn't parse this file using
> > > SAXBuilder, nor could I open it in Internet Explorer.
> > >
> > > I was reading about different algorithms that convert Unicode to XML
> > > and HTML, and I think this one is the best:
> > >
> > > ----------------------------------------------------------------------
> > > http://czyborra.com/utf/#UTF-8
> > >
> > > HTML's Numerical Character References
> > >
> > > A somewhat more standardized encoding option is specified by HTML. RFC
> > > 2070 allows us to reference just any Unicode character within any HTML
> > > document of any charset by using the decimal numeric character
> > > reference &#12345; as in:
> > >
> > > putwchar(c)
> > > {
> > > if (c < 0x80 && c != '&' && c != '<') putchar(c);
> > > else printf ("&#%d;", c);
> > > }
> > >
> > > Decimal numbers for Unicode characters are also used in Windows NT's
> > > Alt-12345 input method but are still of so little mnemonic value that
> > > a hexadecimal alternative ? is being supported by the newer
> > > standards HTML 4.0 and XML 1.0. Apart from that, hexadecimal numbers
> > > aren't that easy to memorize either. SGML has long allowed symbolic
> > > character entities for some character references like ? for ??
> > > and ? for the ?,? but the table of supported entities differs
> > > from browser to browser.
> > >
> > > ----------------------------------------------------------------------
> > >
> > > I wrote this method for the conversion
> > >
> > > This method converts these three characters (&, <, >), as well as
> > > all characters with codes of 128 and above, to decimal character
> > > references of the form &#{char_code}; The output now works with any
> > > parser supporting XML 1.0.
> > >
> > > /**
> > >  * Converts a Unicode character to an HTML decimal character reference.
> > >  * Characters with a code below 128 (decimal), apart from '>', '<'
> > >  * and '&', need no escaping. The rest are converted to the form
> > >  * &#{char_code};
> > >  * Example:
> > >  * <br> \u003F --> ? (unchanged)
> > >  * @param value Unicode character
> > >  * @return the character reference, or null if the character needs
> > >  *         no escaping
> > >  */
> > > public String convertTEXTtoHTML(char value)
> > > {
> > >     int code = value; // numeric value of the char, 0..65535
> > >
> > >     if ((code < 128) && (value != '&') && (value != '<') && (value != '>'))
> > >     {
> > >         // null is taken to mean "no replacement" by the caller
> > >         return null;
> > >     }
> > >     return "&#" + code + ";";
> > > }
> > >
> > > and I changed XMLOutputter.escapeElementEntities(String str) method
> > >
> > > default :
> > > entity = convertTEXTtoHTML(ch);
> > > break;
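
A per-char method like the one above sees characters outside the Basic
Multilingual Plane as two separate surrogate values, and a reference to
a surrogate value is not a legal XML character reference. The sketch
below is a standalone variant that walks the string by code point. It
is not JDOM code, and the class and method names are made up for
illustration.

```java
public class NumericEscapeDemo {

    // Replaces '&', '<', '>' and every non-ASCII code point with a decimal
    // character reference such as &#8220;
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a surrogate pair into one code point.
            int cp = text.codePointAt(i);
            if (cp < 128 && cp != '&' && cp != '<' && cp != '>') {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a &#60; b &#8220;quoted&#8221;
        System.out.println(escape("a < b \u201Cquoted\u201D"));
    }
}
```
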
> > >
> > > Maybe there is a different solution for this problem, but it works
> > > fine.
> > >
> > > Mad Einstein