The Secret Behind Reversible Encoding Schemes

In one of our previous engagements, we needed to make a slight modification to a generated PDF bill image before displaying it to the Customer Service Reps (CSRs). The generated PDF bill image was stored in a SQL Server database in an “Image” column. This image record would then be retrieved for display using either an ASP.NET page or a JSP page, depending on how the CSRs retrieved the account information.

The task at hand was to find all occurrences of a particular file extension (e.g., .txt) in the PDF image and replace them with another file extension (e.g., .doc). So, our processing logic would simply be as follows:

  1. Retrieve the bill image from the SQL Server database and store it in a byte array
  2. Convert the byte array into a String object for the search and replace operation
  3. Convert the resulting String object back into a byte array for rendering in a browser

The search/replace operation in step 2 was done very easily in C#, as follows (note: assume PdfDs is the dataset containing a record retrieved from the image column in the SQL Server table):

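// Step 1: read the PDF bytes from the "pdf_image" column
byte[] m_Data = (byte[])PdfDs.Tables[0].Rows[0]["pdf_image"];
// Step 2: decode with Encoding.Default (the system ANSI code page on .NET Framework) and replace
string m_DataConvert = System.Text.Encoding.Default.GetString(m_Data);
m_DataConvert = m_DataConvert.Replace(".txt", ".doc");
// Step 3: encode the string back into bytes with the same encoding
m_Data = System.Text.Encoding.Default.GetBytes(m_DataConvert);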

As can be seen from the above code, we utilized the “Default” encoding in C# for our Search/Replace operation.  It worked as expected, and the resulting byte array was then successfully rendered in the browser.

We expected to be able to use similar logic to perform the same operation in Java. That is, we expected that we could also use the machine default encoding in Java to perform the Search/Replace operation. Therefore, our initial code in Java looked something like this (note: assume that m_ResultSet is the result set containing a record retrieved from the image column in the SQL Server table):

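// Step 1: read the PDF bytes from the first column of the current row
byte[] m_PdfByteArray = m_ResultSet.getBytes(1);
ByteArrayOutputStream m_Baos = new ByteArrayOutputStream();  // java.io.ByteArrayOutputStream
m_Baos.write(m_PdfByteArray);
// Step 2: no charset argument, so the platform default encoding is used
String m_InitPDF = m_Baos.toString();
m_InitPDF = m_InitPDF.replace(".txt", ".doc");
// Step 3: again no charset argument, so the platform default encoding is used
byte[] m_bytePdfFinal = m_InitPDF.getBytes();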

Notice that we did not pass any argument to the toString() and getBytes() method calls in the above code, which let Java use the platform’s default character encoding. That should have been the end of it, except that somehow we had corrupted the resulting byte array. When the byte array was rendered in the browser, we got nothing but a blank screen! What happened to it? This should have been very simple, as demonstrated in our C# code above.
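
A quick way to see which encoding those no-argument calls pick up is to ask the JVM directly. The following is a minimal sketch; the class and method names are made up for illustration. Note that the answer depends on the operating system, locale, and Java version (JDK 18 and later default to UTF-8 unless configured otherwise), so it may not match what we saw on our servers.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Hypothetical helper: name of the charset used by the no-argument
    // ByteArrayOutputStream.toString() and String.getBytes() calls.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The value printed here varies by platform and JVM settings.
        System.out.println("Default charset: " + defaultCharsetName());
    }
}
```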

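After further research, we found that the platform’s default encoding for our Java runtime was Cp1252 (windows-1252), which is not reversible for arbitrary binary data. Cp1252 leaves five byte values undefined (0x81, 0x8D, 0x8F, 0x90, and 0x9D); Java decodes each of them to the replacement character U+FFFD and then encodes that character back as '?' (0x3F). The resulting byte array can therefore differ from the incoming one, even if the replace method call changes nothing.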
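
The loss can be observed without a database or a PDF by pushing every possible byte value through a decode/encode round trip. This is only a sketch: the class and method names are hypothetical, and it assumes the JVM ships the windows-1252 charset, which is typical but not guaranteed by the Java specification.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class Cp1252RoundTrip {
    // Hypothetical helper: returns the byte values (0-255) that do NOT
    // survive a bytes -> String -> bytes round trip in the given charset.
    static List<Integer> lossyBytes(Charset cs) {
        List<Integer> lossy = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            byte[] in = { (byte) b };
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length != 1 || out[0] != in[0]) {
                lossy.add(b);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available on this JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        for (int b : lossyBytes(cp1252)) {
            System.out.printf("0x%02X is not preserved%n", b);
        }
    }
}
```

A PDF usually contains compressed binary streams, so it is very likely to include at least one of the byte values this check reports.
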
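The trick was to find a “reversible” encoding scheme, that is, one that maps every byte value to a distinct character and back again. One such scheme is ISO8859_1 (ISO-8859-1, also known as Latin-1), which maps the bytes 0x00 through 0xFF directly to the code points U+0000 through U+00FF. Thus, the Java code above needed to be modified as follows: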

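// Step 1: read the PDF bytes from the first column of the current row
byte[] m_PdfByteArray = m_ResultSet.getBytes(1);
ByteArrayOutputStream m_Baos = new ByteArrayOutputStream();  // java.io.ByteArrayOutputStream
m_Baos.write(m_PdfByteArray);
// Step 2: decode with ISO8859_1 so that every byte maps to exactly one character
String m_InitPDF = m_Baos.toString("ISO8859_1");
m_InitPDF = m_InitPDF.replace(".txt", ".doc");
// Step 3: encode with the same charset to get the untouched bytes back unchanged
byte[] m_bytePdfFinal = m_InitPDF.getBytes("ISO8859_1");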
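
For what it is worth, the same idea can be written without the intermediate ByteArrayOutputStream by using the StandardCharsets.ISO_8859_1 constant (available since Java 7), which also avoids the checked UnsupportedEncodingException that the string-named overloads declare. The sketch below is not the code we deployed; the class name, method name, and test data are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1Replace {
    // Hypothetical helper: search/replace inside binary data via a
    // lossless ISO-8859-1 round trip. Both strings are assumed to
    // contain only characters in the range U+0000 to U+00FF.
    static byte[] replaceInBytes(byte[] data, String search, String replacement) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return text.replace(search, replacement).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for the PDF: all 256 byte values followed by a file name.
        byte[] name = "bill.txt".getBytes(StandardCharsets.ISO_8859_1);
        byte[] input = new byte[256 + name.length];
        for (int i = 0; i < 256; i++) {
            input[i] = (byte) i;
        }
        System.arraycopy(name, 0, input, 256, name.length);

        byte[] output = replaceInBytes(input, ".txt", ".doc");

        // The binary prefix should come back byte for byte.
        boolean prefixIntact = Arrays.equals(
                Arrays.copyOf(input, 256), Arrays.copyOf(output, 256));
        String tail = new String(output, 256, output.length - 256,
                StandardCharsets.ISO_8859_1);
        System.out.println("prefix intact: " + prefixIntact + ", tail: " + tail);
    }
}
```

One design point worth keeping in mind: .txt and .doc have the same length, so the replacement does not shift any bytes. A PDF's cross-reference table stores byte offsets, so a replacement of a different length could still produce a broken file even with a reversible encoding.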

In conclusion, while the default character encoding happened to work for our Search/Replace operation in C#, the same approach failed in Java on our platform. Instead, we needed to explicitly name a reversible encoding scheme such as ISO8859_1, which happened to be a non-default one!