[jdom-interest] special characters problem

Pramodh Peddi peddip at contextmedia.com
Tue Nov 4 08:58:36 PST 2003


Hi,
Unfortunately my problem is not solved yet!
Initially I thought it was the JDOM parser that was distorting the
special chars. But I figured out that the transformation is causing this (of
course, the transformer uses a parser internally!).
This problem occurs ONLY on UNIX machines. It works fine on
Windows machines.
I am using Java 1.4.1's Transformer to transform my xml files. The xml file
has ™ (TM symbol), and the transformer spits out some funky chars in place of
this TM character. There are many such special chars in the source xml file
(like copyright, registered mark, etc).
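
A Windows-only success is consistent with the two platforms having different default charsets. As a minimal sketch (the class and method names here are made up for illustration, and it assumes a newer Java than 1.4.1), the same byte from a windows-1252 file decodes to different characters depending on which charset is used:

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Decode raw bytes with an explicitly named charset
    // instead of relying on the platform default.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x99 is how the TM symbol is stored in a windows-1252 file.
        byte[] tm = { (byte) 0x99 };
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("as windows-1252:  U+"
            + Integer.toHexString(decode(tm, "windows-1252").charAt(0)));
        System.out.println("as ISO-8859-1:    U+"
            + Integer.toHexString(decode(tm, "ISO-8859-1").charAt(0)));
    }
}
```

Decoded as windows-1252 the byte becomes U+2122 (the TM symbol); decoded as ISO-8859-1 it becomes the control character U+0099.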

In the application, I am trying to preserve the encoding. If I don't do
that, it is spitting out "?" marks in place of special chars.
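
For reference, "?" is what Java writes when a character cannot be represented in the charset used for encoding. A small sketch of that behaviour, with hypothetical names and assuming a newer Java than 1.4.1:

```java
import java.nio.charset.Charset;

class EncodingDemo {
    // Encode text with the named charset and return the bytes as hex.
    static String toHex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String tm = "\u2122"; // the TM symbol
        System.out.println("windows-1252: " + toHex(tm, "windows-1252"));
        System.out.println("UTF-8:        " + toHex(tm, "UTF-8"));
        // ISO-8859-1 has no TM symbol, so getBytes substitutes '?' (0x3f).
        System.out.println("ISO-8859-1:   " + toHex(tm, "ISO-8859-1"));
    }
}
```

So a single "?" suggests the character was lost while encoding, whereas several funky chars in place of one suggest bytes in one encoding being read back as another.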

I would appreciate any help. I am desperate for any advice and suggestions.

Source xml looks like this (I extracted a part of it to make it small):
****************source xml*********************
<content>
    <category name = "MavicaCLIE&#8482;"/>
</content>
*********************************************

This is what I get out of the transformation process:
---------------------------------------------------------------------
<content>
    <container>MavicaCLIEâÂ"¢</container>
</content>
--------------------------------------------------------------------

And the stylesheet looks like the following:
=============================================
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="/product_metadata">
<content>
   <container>
    <xsl:value-of select="category/@name" disable-output-escaping="yes"/>
   </container>
</content>
</xsl:template>

</xsl:stylesheet>

==============================================

This is what I am doing in the application (source xml file has
"windows-1252" encoding):
**********************************************************************
ByteArrayOutputStream rawfileOutputStream = new ByteArrayOutputStream();

// get the file
filePath = (String) taxKeyIt.next();
file = (SftpFile) taxFileMap.get(filePath);
filePath = ServiceUtils.getFileNameFromURL(filePath);
if (filePath != null) {
    sftp.get(filePath, rawfileOutputStream);
    rawfileOutputStream.close();
}

String content = new String(rawfileOutputStream.toByteArray(), "windows-1252");
log.info("initialContent: " + content);
content = this.removeDTDFromMetadata(content);

// transform the file
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer(
    new StreamSource(new URL(this.taxXSLT).openStream()));

ByteArrayInputStream rawfileInputStream =
    new ByteArrayInputStream(content.getBytes("windows-1252"));
ByteArrayOutputStream transformedFileOutputStream = new ByteArrayOutputStream();

File transformedFile = new File("../server/ic/deploy/data.war/" + this.taxXSLTResult);
FileOutputStream out = new FileOutputStream(transformedFile);

transformer.transform(
    new StreamSource(new InputStreamReader(rawfileInputStream)),
    new StreamResult(new OutputStreamWriter(out, "UTF-8")));
rawfileInputStream.close();
transformedFileOutputStream.close();
// move file up to data dir
}

*******************************************************************************
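
One thing that may be worth checking in the code above: `new InputStreamReader(rawfileInputStream)` names no charset, so it decodes the windows-1252 bytes with the platform default, which differs between Windows and UNIX. Below is a hedged sketch of handing the transformer byte streams on both sides instead. The class name, the inline stylesheet and the sample document are all stand-ins for illustration, not the real files, and it assumes the XML declaration naming windows-1252 is still present after the DTD is removed:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TransformEncodingSketch {
    // Stand-in stylesheet (an assumption, not the real taxXSLT file).
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\" encoding=\"UTF-8\"/>"
        + "<xsl:template match=\"/content\">"
        + "<xsl:value-of select=\"category/@name\"/>"
        + "</xsl:template></xsl:stylesheet>";

    // Bytes in, bytes out: no Reader or Writer is created here, so the
    // platform default charset is never consulted. The parser takes the
    // input encoding from the XML declaration.
    static byte[] transform(byte[] xmlBytes) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(
                new StreamSource(new ByteArrayInputStream(xmlBytes)),
                new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"windows-1252\"?>"
            + "<content><category name=\"MavicaCLIE\u2122\"/></content>";
        byte[] result = transform(xml.getBytes(Charset.forName("windows-1252")));
        String text = new String(result, Charset.forName("UTF-8"));
        System.out.println("TM symbol preserved: " + text.endsWith("\u2122"));
    }
}
```

If a Reader really is needed, `new InputStreamReader(rawfileInputStream, "windows-1252")` names the charset explicitly. Note also that the output file is UTF-8, so it will look garbled in any viewer that opens it as windows-1252 or ISO-8859-1.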

Thanks,

pramodh.

----- Original Message -----
From: "Alex Rosen" <arosen at novell.com>
To: <peddip at contextmedia.com>; <manish.sharan at divlogic.com>
Cc: <jdom-interest at jdom.org>
Sent: Sunday, November 02, 2003 3:12 PM
Subject: Re: [jdom-interest] special characters problem


> You shouldn't need to set the file.encoding option. You just need to
> make sure that you read and write the file using the encoding that the
> file is actually in, rather than the platform's default encoding
> (whatever that may be). The easiest way to do this is to let JDOM do it
> for you - give it an InputStream (instead of a Reader or a String) when
> parsing, and an OutputStream (instead of a Writer) when streaming. If
> you can't do this, then you'll have to manually make sure that you're
> using the right encoding when creating a Reader or Writer.
>
> Alex
>
> >>> <manish.sharan at divlogic.com> 10/30/2003 3:52:44 PM >>>
> Hi Pramodh
> This is not a JDOM issue.
>
> But I had the same problem -- my app worked fine with a French XHTML file
> on Windows, but on Linux it turned all special characters to '?'.
>
> I was dealing with ISO-8859-1 encoded files (French XHTML). My app read
> the file and simply saved it to a file on my disk. The resulting file
> looked OK on Windows. However, when I ran this test on Linux, the result
> had a lot of '?'.
>
> So before you pass the string to JDOM, you need to make sure that you
> read it correctly.
>
> In my case, I fixed the problem by explicitly defining a character set
> with the InputStream: InputStreamReader( inputStream,"ISO-8859-1" )
>
> In Java options, use -Dfile.encoding=ISO_8859-1
>
> Please note that I don't have my development machine before me, so my
> code and sample may not be exactly correct.
>
>
> Regards
> -manish
>
> Quoting Pramodh Peddi <peddip at contextmedia.com>:
>
> > Hi Manish,
> > Thanks for responding! Did you have exactly the same problem? i.e.,
> > working fine on Windows but not on Unix?
> >
> > Can you tell me exactly what should be done in Java to do this? I am
> > using Java 1.4.1. Should I mention the file.encoding in JAVA_OPTS? If so,
> > what should I mention? And is this all I should do to make it work?
> > Is the way I build the document OK?
> >
> > Sorry for asking too many questions :-)!
> >
> > You are right, I am using InputStreams to read external data.
> >
> > Thanks,
> > pramodh.
> > ----- Original Message -----
> > From: <manish.sharan at divlogic.com>
> > To: "Pramodh Peddi" <peddip at contextmedia.com>
> > Cc: <jdom-interest at jdom.org>
> > Sent: Thursday, October 30, 2003 1:41 PM
> > Subject: Re: [jdom-interest] special characters problem
> >
> >
> > > I recently solved this kind of problem by enforcing the charset
> > > encoding all the way, from the JVM "file.encoding" option to using the
> > > charset encoding name whenever using any InputStreams to read
> > > external data.
> > >
> > > The Windows and Unix/Linux behavioral difference with respect to
> > > special characters is due to the differing default charset encoding.
> > >
> > > Hope this helps.
> > > -manish
> > >
> > >
> > > Quoting Pramodh Peddi <peddip at contextmedia.com>:
> > >
> > > > Hi,
> > > > I am using JDOM Beta 8 for XML parsing. We happen to have a lot
> > > > of special characters (like registered marks, copyright symbols,
> > > > trademarks, and many other funky chars). After building the document,
> > > > the parser is converting the characters into "?" characters. This is
> > > > what I am doing to build the document:
> > > >
> > > > ****************************************************************************
> > > > // Method to return a Document object given an xml String
> > > > public Document getDocumentfromString(String xmlString)
> > > >         throws Exception {
> > > >     Document schemaDoc = null;
> > > >     SAXBuilder builder = new SAXBuilder(false);
> > > >     String resultingXML = null;
> > > >     if (!StringUtils.isEmpty(xmlString)) {
> > > >         try {
> > > >             schemaDoc = builder.build(new StringReader(xmlString));
> > > >         } catch (JDOMException jdomex) {
> > > >             throw new Exception("Document could not be built: " + jdomex);
> > > >         }
> > > >     } else {
> > > >         log.info("xmlString is null");
> > > >     }
> > > >     return schemaDoc;
> > > > }
> > > >
> > > > ****************************************************************************
> > > >
> > > > It is working fine on a Windows (2000) machine, but spitting out "?"
> > > > symbols in place of special chars on UNIX machines.
> > > >
> > > > I used to use schemaDoc = builder.build(new
> > > > java.io.ByteArrayInputStream(xmlString.getBytes()));
> > > >
> > > > to build the document in place of StringReader, but it was changing
> > > > the encoding and throwing an exception saying the special chars don't
> > > > belong to UTF-8. So, I changed it to StringReader, which doesn't throw
> > > > exceptions but converts the special chars to "?".
> > > >
> > > > I also tried using builder.build(new
> > > > java.io.ByteArrayInputStream(xmlString.getBytes("UTF-8")));
> > > > but that didn't help either.
> > > >
> > > > Again, the "?" are occurring only on UNIX machines; it works fine on
> > > > Windows machines.
> > > >
> > > > I would appreciate any help.
> > > >
> > > > Thank you,
> > > > pramodh.
> > > >
> > > > _______________________________________________
> > > > To control your jdom-interest membership:
> > > > http://lists.denveronline.net/mailman/options/jdom-
> > > interest/youraddr at yourhost.com