Legacy Databases with non ISO 8859-1 in CHAR Columns

Introduction

The traditional way of storing character data is using arrays of bytes. This is what you get when you use the CHARACTER or CHARACTER VARYING datatypes in Mimer SQL. These datatypes have often been used to store any character data, not just ISO 8859-1 (aka Latin1) strings. This poses a problem to newer versions of Mimer SQL which treats CHARACTER and CHARACTER VARYING data types as strictly ISO 8859-1.

Description

Since Mimer SQL assumes a CHARACTER column contains ISO 8859-1 characters, using these columns to store other character encodings poses several problems. Mainly, these issues are that the collation ordering is undefined, and that certain SQL functions (most notably UPPER and LOWER) expect ISO 8859-1 strings.

As long as the database programmer is aware of this, thus avoiding such functionality on these columns, everything will work fine. The data inserted into the database will be returned when it is retrieved.

However, if you are using Java or .Net the situation will become a little different. These environments use Unicode based strings. This is also the only character string supported by the database interface. In these situations things are a little more cumbersome since it is assumed that the contents in the CHARACTER columns is ISO 8859-1. There is the same problem using ODBC or Embedded SQL if you are retrieving the data into wide character strings (SQLWCHAR and wchar_t respectively). Everytime a Java String object is sent from and to the server, it is converted from the byte based character string (in which ISO 8859-1 encoding is assumed) to a Unicode string.

Solution

The recommended approach to solve the problem is to upgrade the database schema to use NATIONAL CHARACTER (NCHAR) columns. Later on, we propose SQL-functions and statements to upgrade a table to use NCHAR columns. It is possible to use the functions in conjunction with views and instead-of-triggers to make the upgrade entirely seamless.

The recommended solution will be harder to use in those situations where there are legacy applications, perhaps using Embedded SQL or ODBC, which assumes that the backend column is a CHARACTER. See further down for a solution for these legacy applications. If there are Level2 applications accessing the CHARACTER column, this approach is impossible. The obsolete Level2 client interface cannot access NATIONAL CHARACTER columns.

Another option is to leave the tables as they are, and to use views and instead-of-triggers to project NATIONAL CHARACTER data to Java and .Net clients. Since this solution doesn’t change the tables, older applications (most notably Level2-applications) will continue to work unchanged.

An inferior solution, but in many cases acceptable, is to specifically use the functions described below in queries and updates from the Java (or .Net) environment. The major disadvantage of using the functions in a deployed environment is performance and that care must be taken when these are used in WHERE-conditions, or else, available indexes might not be used.

The following table lists the advantages and disadvantages with each solution:

	Upgrading table	Leaving table as is	Leaving table as is with no views/triggers
Advantages	¤ Seamless implementation. ¤ Good performance for NATIONAL CHARACTER data. ¤ Best solution for Java and .Net environments. ¤ Works well in J2EE and other application server environments.	¤ Seamless implementation. ¤ Best performance for CHARACTER data. ¤ Fast implementation, even for very large tables. ¤ Works well in J2EE and other application server environments.	¤ No database changes.
Disadvantages	¤ Less performance for data returned as CHARACTER. ¤ If the table is very large, the upgrade process may take some time. ¤ Level2 applications cannot access the upgraded table. ¤ Queries expecting that the CHARACTER column is indexed may run slower since only the base table is indexed and not the view.	¤ Less performance for data returned as NATIONAL CHARACTER. ¤ Queries expecting that the NATIONAL CHARACTER column is indexed may run slower since only the base table is indexed and not the view.	¤ Will not work in with many application servers. ¤ No seamless implementation, database programmers must use functions in their queries. ¤ Care must be taken when using WHERE-conditions to make sure indexes are used.

Functions to convert to and from Unicode

The basis of the solutions outlined above are two SQL functions that perform the appropriate conversion to and from Unicode strings. In essence, these functions will provide a custom CAST from CHARACTER to NATIONAL CHARACTER and vice versa.

The attached archive contains both SQL-99 compliant functions to convert all ISO 8859-2 through ISO 8859-16 character encodings to a NATIONAL CHARACTER variable. NATIONAL CHARACTER data will, in turn, be handled correctly by any client interface. In addition, the archive includes a Perl script to generate SQL-99 functions from the ISO 8859 mappings specifications (also included) downloaded from the Unicode consortium website. The Unicode consortium has published several other mappings on their web page. The code in the archive requires Mimer SQL 9.2 or later.

The functions described above are then used like the following:

The statement…

select OUR_CHAR_COL from OUR_TABLE

…should become…

select "NCHAR_FROM_ISO8859-7"(OUR_CHAR_COL) from OUR_TABLE

INSERT:s are handled similarly, for instance:

insert into OUR_TABLE values ("NCHAR_TO_ISO8859-7"(?))

As discussed earlier, using these functions in the actual queries and DML statements is bad for a number of reasons. What we really want is a seamless solution. Depending on your requirements, most notably if you have existing Level2 applications that must continue to work, you can choose either to upgrade the tables or leave them as they are. Advantages and disadvantages of each solution are listed in the table above.
More on this later, but first a few tips on how to upgrade the tables.
Using the functions to upgrade a table to Unicode

I believe the idea is best described by using an example. Say this is your table:

create table THE_TABLE (C1 integer primary key, C2 varchar(200));

The first step is to create a new table using a NATIONAL CHARACTER VARYING (NVARCHAR) column instead.

create table NEW_TABLE(C1 integer primary key, C2 nvarchar(200));

The new table must be populated by using data from the old table, but doing the necessary conversions on the character column.

insert into NEW_TABLE select c1,"NCHAR_FROM_ISO8859-7"(c2) from THE_TABLE;

Upgrading a table with referential constraints

If there are referential constraints on the old table, these tables must be upgraded as well. This will have a cascading effect since the datatypes of those tables must be changed to NATIONAL CHARACTER VARYING as well. Say that we have the following situation:

 create table THE_TABLE(C1 char(5) primary key,C2 date, C3 integer);
 create table REF_TABLE(COL1 date primary key,
                        COL2 char(5) references THE_TABLE);

We first upgrade the THE_TABLE:

 create table NEW_TABLE(C1 nchar(5) primary key, C2 date, C3 integer);
 insert into NEW_TABLE
 select "NCHAR_FROM_ISO8859-7"(C1),C2,C3 from THE_TABLE;

We then repeat the procedure on the referencing table:

 create table NEW_REF_TABLE(COL1 date primary key,
                            COL2 nchar(5) references NEW_TABLE);
 insert into NEW_REF_TABLE
 select "NCHAR_FROM_ISO8859-7"(COL1),COL2 from THE_TABLE;

Seamless implementation of an upgraded table

In order to achieve a seamless implementation for both new Java and .Net applications as well as for legacy Level2, Embedded SQL or ODBC applications, we must use views to project data accordingly, and use instead-of-triggers to handle DML statements on that view.
Since the base table is upgraded to NATIONAL CHARACTER (see above) we provide views with CHARACTER columns. Note that we drop the old table and provide a view with the same name.

drop table THE_TABLE;
 create view THE_TABLE(C1,C2) as
 select C1,"NCHAR_TO_ISO8859-7"(C2) from NEW_TABLE;

The view ensures that old applications can query the table seamlessly. To handle DML-statements, we create instead-of-triggers. The instead-of-triggers trigger on INSERT, DELETE and UPDATE statements on views. One trigger for each statement type is required.

 create trigger THE_TABLE_TRIGGER_INSERT instead of insert on THE_TABLE
 referencing new table as NEW_BD
 begin atomic
   insert into NEW_TABLE select c1, "NCHAR_FROM_ISO8859-7"(c2) from NEW_BD;
 end

 create trigger THE_TABLE_TRIGGER_UPDATE instead of update on THE_TABLE
 referencing new table as NEW_BD
 when (exists (select * from NEW_BD))
 begin atomic
   update NEW_TABLE
   set C2 = (select "NCHAR_FROM_ISO8859-7"(c2)
             from NEW_BD
             where NEW_TABLE.C1=NEW_BD.C1)
   where NEW_TABLE.C1 in (select C1 from NEW_BD);
 end

 create trigger THE_TABLE_TRIGGER_DELETE instead of delete on THE_TABLE
 referencing old table as OLD_BD
 when (exists (select * from OLD_BD))
 begin atomic
   delete from NEW_TABLE where NEW_TABLE.C1 in (select C1 from OLD_BD);
 end

Seamless implementation of a table as it is

As we noted above, it is also possible to leave the tables as they are, with CHARACTER columns, and use views and instead-of-triggers to handle Unicode applications.
In this case, the table is left in place, and we create a view to return NATIONAL CHARACTER data.

create view NEW_TABLE(C1,C2) as
 select C1,"NCHAR_FROM_ISO8859-7"(C2) from THE_TABLE;

Similarly, instead-of-triggers are created to handle DML statements on this view.

 create trigger NEW_TABLE_TRIGGER_INSERT instead of insert on NEW_TABLE
 referencing new table as NEW_BD
 begin atomic
   insert into THE_TABLE select c1, "NCHAR_TO_ISO8859-7"(c2) from NEW_BD;
 end

 create trigger NEW_TABLE_TRIGGER_UPDATE instead of update on NEW_TABLE
 referencing new table as NEW_BD
 when (exists (select * from NEW_BD))
 begin atomic
   update THE_TABLE
   set C2 = (select "NCHAR_TO_ISO8859-7"(c2)
             from NEW_BD
             where THE_TABLE.C1=NEW_BD.C1)
   where THE_TABLE.C1 in (select C1 from NEW_BD);
 end

 create trigger NEW_TABLE_TRIGGER_DELETE instead of delete on NEW_TABLE
 referencing old table as OLD_BD
 when (exists (select * from OLD_BD))
 begin atomic
   delete from THE_TABLE where THE_TABLE.C1 in (select C1 from OLD_BD);
 end

Links

Various mappings to Unicode (from the Unicode Consortium web site) »
Zip archive with the code »

Graphic Element - Cube