Character data, Unicode and Collations

Introduction

While modeled on the ASCII character set, the Unicode Standard goes far beyond ASCII’s limited ability to encode only the upper- and lowercase letters A through Z. It provides the capacity to encode all characters used for the written languages of the world – more than 1 million characters can be encoded. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility.

In addition to the Unicode standard, there is a technical standard called Unicode Collation Algorithm (UCA). UCA provides a specification for how to compare two Unicode strings while remaining conform to the requirements of The Unicode Standard. Comparisons like this are done by using so called Collations.

This article presents the Mimer SQL perspective of character data, Unicode and collations. See also the basic concepts of national characters and linguistic sorting described in the Mimer SQL Documentation Set.

(The Unicode Standard is fully compatible with the international standard ISO/IEC 10646, Information Technology – Universal Multiple-Octet Coded Character Set (UCS). The Unicode Collation Algorithm (UCA) standard is kept synchronized with the ISO/IEC 14651 standard for International String Ordering).

Unicode Data (NCHAR)

Mimer SQL implements Unicode using the data type NCHAR. You may use any of the three encoding forms UTF-8, UTF-16, or UTF-32 when you store NCHAR data in the database. (UTF stands for Unicode Transformation Format, which is standard character encoding schemes in accordance with ISO 10646). The encoding forms are fully transparent; for example, you may use UTF-16 to store data, and you can use UTF-8 for fetching data. The NCHAR data type is logically UTF-32, however, it is stored in a compressed form.

The CHAR data type is based on ISO 8859-1 (Latin-1), which is a true subset of Unicode, and therefore CHAR and NCHAR are fully compatible. Of course, storing NCHAR data in a CHAR column have certain restrictions.

Unicode string literal

A Unicode-string-literal facilitates the specification of Unicode characters in an ASCII environment. It consists of a sequence of Unicode characters enclosed in quotation marks and preceded by the letter U and an ampersand (&). Two consecutive single quotation marks within a string are interpreted as a single quotation mark. Unicode characters are given by four hexadecimal digits preceded by a backslash character or, by six hexadecimal digits preceded with a backslash character and a plus character. Two consecutive backslash characters within a string are interpreted as a single backslash character. The case of the preceding U is irrelevant.

Example:

U&'\0141ód\017A' -- 'Łódź' (Polish town)

Normalization

A Unicode character can have several equivalent representations. There are pre-composed characters and there are combining characters that can be used together with base characters to form a specific character.

Consider the letter E with circumflex and dot below, a character that occurs in Vietnamese, Ệ. This character has five possible representations in Unicode. (About the notation below; In running text, individual Unicode code points usually are expressed as U+n, where n is from four to six hexadecimal digits):

U+0045 LATIN CAPITAL LETTER E
U+0302 COMBINING CIRCUMFLEX ACCENT
U+0323 COMBINING DOT BELOW
U+0045 LATIN CAPITAL LETTER E
U+0323 COMBINING DOT BELOW
U+0302 COMBINING CIRCUMFLEX ACCENT
U+00CA LATIN CAPITAL LETTER E WITH CIRCUMFLEX
U+0323 COMBINING DOT BELOW
U+1EB8 LATIN CAPITAL LETTER E WITH DOT BELOW
U+0302 COMBINING CIRCUMFLEX ACCENT
U+1EC6 LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND DOT BELOW

Any two of these sequences should compare equal. The Normalization Form C (NFC) of all five sequences is U+1EC6.

In Mimer SQL, Unicode data (NCHAR) is automatically transformed to NFC. When needed, literals and variables are implicitly normalized. The result of a concatenation will always be normalized, and string functions, like UPPER and LOWER, will always return a normalized result string. This will assert that all Unicode data will be in NFC, thus giving the expected result in search operations.

Example:

 SQL>create table t(c nchar(1));
 SQL>insert into t values(u&'E\0302\0323');
 SQL>insert into t values(u&'E\0323\0302');
 SQL>insert into t values(u&'\00CA\0323');
 SQL>insert into t values(u&'\1EB8\0302');
 SQL>insert into t values(u&'\1EC6');
 SQL>select count(c) as equal from t where c = u&'\1EC6';

 EQUAL
 =====
 5

The normalization forms are fully described in the Unicode standard annex #15. Mimer SQL have passed the Unicode normalization conformance test.

Collations and Linguistic Sorting

Character and national character data is sorted according to specific collations.

A collation, also known as a collating sequence, is a database object containing a set of rules that determines how character strings are compared, searched and alphabetically sorted. The rules in the collation determine whether one character string is less than, equal to or greater than another. A collation also determines how case-sensitivity and accents are handled, using different comparison levels.

You can specify a different collation for ordering characters when you create or alter a table or create a domain.

If you have specified a collation for a column, the collation is used implicitly in SQL statements.

You only need to explicitly use a collation clause in SQL statements if you want to override the default collation or the collation you specified when creating or altering the table or creating the domain. The following example will explicitly give a Swedish sort order (where swedish_3 is a predefined collation in Mimer SQL):

SELECT name
  FROM adress
  ORDER BY name COLLATE swedish_3

Since Unicode is a universal character set the Unicode sorting order can be employed on any arbitrary character set that is a subset of Unicode.

The default Unicode sorting order is provided in allkeys.txt. This table (the Default Unicode Collation Element Table) provides a mapping from characters to collation elements, i.e. sorting sequences, for all the explicitly weighted characters.

Mimer SQL has passed the UCA conformance test.

Multilevel Comparison

As mentioned above there are different levels of comparisons to pay attention to, such as case and accent sensitivity. There are four levels of comparison keys; the Primary level, the Secondary level, the Tertiary level and the Quaternary level:

Primary level
Typically, this is used to denote differences between base characters (for example, “a” < “b”). It is the strongest difference. For example, dictionaries are divided into different sections by base character. This is also called the level-1 strength.
Secondary level
Accents in the characters are usually considered secondary differences (for example, “ab” < “áb” < “ac”). A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.
Note: In some languages (such as Icelandic), certain accented letters are considered to be separate base characters.
Tertiary level
Upper- and lower-case differences in characters are distinguished at the tertiary level (for example, “ab” < “Ab” < “áb”). In addition, a variant of a letter differs from the base form on the tertiary level (such as “a” and “ª”). A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called level-3 strength.
Quaternary level
When punctuation is ignored (such as space and hyphen) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, “ab” < “a c” < “a-c” < “ac”). A quaternary difference is ignored when there is a primary, secondary, or tertiary difference. This is also called the level-4 strength.

Multilevel comparison means the following; Two strings are compared on the primary level. If the comparison for this level fails to establish a unique and determined sequence for the strings, the second level are taken into consideration. If this likewise fails to produce a unique sequence, the tertiary level is invoked, and after this the quaternary level is used. If still a unique sequence cannot be established, the two strings are regarded as equivalent.

How far to go in this comparison chain is decided by the definition of the collation used on the data. See “Tailorings” below.

Alternate Weighting

Concerning sorting of punctuation characters (such as SPACE, HYPHEN, COMMA, etc.) alternate weighting can be used.

The collating keys for these characters are known as alternate collation elements. These elements can be treated differently depending on the weighting method used:

Non-ignorable
The Non-ignorable method means that alternate collation elements are treated as normal collation elements. This is the default option.
Shifted
The Shifted method sets the quaternary level (4:th) equal to the value of the primary level for alternate collation elements, and then the primary, secondary, and tertiary levels are set to zero. The collation elements for all other characters with a non-zero primary level weight, will receive a fourth-level weight of 0xFFFF. Please note that if the primary weight is zero, the fourth-level weight is also zero.

Shift-trimmed
The Shift-trimmed method is equal to the Shifted method concerning the alternate collation elements, but all other collation elements are set to zero. This will emulate POSIX behavior.

The following gives an example of the alternate weighting differences.

Non-ignorable Shifted Shift-trimmed

de luge           death            death
de Luge           de luge          deluge
de-luge           de-luge          de luge
de-Luge           deluge           de-luge
death             de Luge          deLuge
deluge            de-Luge          de Luge
deLuge            deLuge           de-Luge
demark            demark           demark

Contracting Character Sequences

Many languages have digraphs, i.e. two characters that should be considered as one, which actually counts as separate letters. In traditional Spanish, “ch” sorts between “c” and “d”, and “ll” sorts between “l” and “m”. Two characters are mapped into a single collation element that cause the combination to be ordered differently from either character individually.

Another example of contractions are “lj” and “nj” in Bosnian and Croatian, which sorts after “l” and “n” respectively.

Order without contraction Order with contraction “nj” sorting after “n”:

Na                           Na
Ni                           Ni
Nj                           Nk
Nja                          Nz
Njz                          Nj
Nk                           Nja
Nz                           Njz
Oa                           Oa

Expanding Characters

A single character can map to a sequence of collation elements. For instance, “ß” is equivalent to “ss”. In traditional German “ä”, “ö”, and “ü” sort as though they were “ae”, “oe”, and “ue” respectively.

Tailorings

Mimer SQL includes predefined collations for many languages. Usually there are two predefined collations per language, one for the primary comparison level and one for the tertiary comparison level. These are named _, e.g. SWEDISH_1.

If a new collation should be created, this is done by customizing an existing collation using a tailoring. A tailoring is a set of rules and attributes that forms a so-called collation delta string, which is used as the basis when creating a collation. The tailoring describes how to modify an existing collation definition to get the new one. A collation is created by the CREATE COLLATION statement.

Attributes

When creating a collation, the tailoring string can include attribute settings for comparison level, accent order, which case that should be first in order and alternate weighting.

Attributes are optional, the bold values in the table below are the default values.

Option Value Description

Level          [Level 1]                   Sort level for the collation.
               [Level 2]

[Level 3]

               [Level 4]
 
Accent order

[AccentOrder Forward]

       Secondary level ordering direction.
               [AccentOrder Backward] 
Case first

[CaseFirst Lower]

           Tertiary level case ordering.
               [CaseFirst Upper] 
Alternate

[Alternate Non-ignorable]

   Alternatives for punctuation characters.
               [Alternate Shifted]
               [Alternate Shift-trimmed]

Rules

The rules in a tailoring string defines how to change the underlying collation. Each rule contains a string of ordered characters that starts with a reset value.

Symbol Example Example description

&      &Z       Reset at this letter. Rules will be relative to this letter from here on.
<      a < b    Identifies a primary difference between "a" and "b".
<<     e << ê   Identifies a secondary difference between "e" and "ê".
<<<    s <<< S  Identifies a tertiary difference between "s" and "S".
=      i = y    Signifies no difference between "i" and "y".
"      ","      The quoted character "," (comma).
#      #0141#   Hexadecimal representation of "Ł" (L with stroke).

Note: “;” can be used to represent secondary relations and “,” to represent tertiary relations, instead of “<<” and “<<<” respectively.

Examples

A Danish tailoring example

 [level 4]
 [casefirst upper]
 [alternate shifted]
 & TH << þ <<< Þ
 & Y << ü <<< Ü
 & Z < æ <<< Æ << ä <<< Ä < ø <<< Ø << ö <<< Ö < å <<< Å << aa <<< Aa <<< AA

A Create Collation example

In the following example the my_collation is created. The new collation is based upon the existing SWEDISH_3 collation and tailored so that “aa” is sorted as a variant of “å” and that upper case letters are sorted before lower case letters:

CREATE COLLATION my_collation
 FROM swedish_3
 USING '[casefirst upper] &Å << aa <<< Aa <<< AA'

A sorting example

Here is an example of how to sort two fields properly; in this case ‘last name’, ‘first name’. The example data is from the excellent book Unicode Demystified by Richard Gillam.

 SQL>create table name(last varchar(32),first varchar(32));
 SQL>insert into name values('van Dusen','Paul');
 SQL>insert into name values('van Dusen','Tom');
 SQL>insert into name values('vanDusen','Paul');
 SQL>insert into name values('vanDusen','Tom');
 SQL>insert into name values('Van Dusen','Paul');
 SQL>insert into name values('Van Dusen','Tom');
 SQL>insert into name values('Van','Steve');
 SQL>insert into name values('Van','Bill');
 SQL>create collation names from EOR
 SQL&using '[level 4][alternate shifted]&9<","';          -- "," before "A" 
 SQL>select last || ', ' || first as fullname
 SQL&from name order by fullname collate names;

 FULLNAME
 ========
 Van, Bill
 Van, Steve
 van Dusen, Paul
 vanDusen, Paul
 Van Dusen, Paul
 van Dusen, Tom
 vanDusen, Tom
 Van Dusen, Tom

Sorting specifics

Backward Accent Ordering

Some languages, particularly French, require words to be ordered on the secondary level by comparing backwards from right to left. This behavior is also called “French accent sorting” or “French secondary order”. See the following standard example:

English ordering French ordering

Cote             Cote
Coté             Côte
Côte             Coté
Côté             Côté

Thai/Lao reordering

The Thai and Lao scripts contain left-joining vowels, that precede the consonants. To get the right sort order, the collation element for the vowel is exchanged with the collation element for the consonant that follows the vowel.

EOR – European Ordering Rules

The Unicode Default Order (and ISO/IEC 14651) have defined the default Latin alphabet to contain not only the base letters “A” through “Z”, but also a number of more or less language specific base letters. One example, the Polish “Ł” (L with stroke) is not a variant of “L”, it is a separate base letter between “L” and “M”.

The EOR, European Ordering Rules, ENV 13710 (and ISO 12199 – Alphabetical ordering of multilingual terminological and lexicographical data represented in the Latin alphabet) have taken a more natural approach: The alphabet is “A” through “Z”, and the other language specific letters are secondary variants of the corresponding base letter.

In Mimer SQL, we have used the EOR tailoring as the basis for all specific language tailorings.

Language support

Using the Unicode Default Collation ordering, Mimer SQL supports sorting for in principle all languages by default. Still, some languages need a specific tailoring to achieve a correct sort order. Mimer SQL contains a large number of such predefined tailorings for specific languages, see the Mimer SQL Collation charts page for further details on languages and collations.

Graphic Element - Cube