Mimer SQL Documentation TOC PREV NEXT INDEX

Mimer SQL Developer Site


Collations and Linguistic Sorting


As stated in the previous section, character and national character data is sorted according to specific collations.

A collation, also known as a collating sequence, is a database object containing a set of rules that determines how character strings are compared, searched and alphabetically sorted. The rules in the collation determine whether one character string is less than, equal to or greater than another. A collation also determines how case-sensitivity and accents are handled.

You can specify a different collation for ordering characters when you create or alter a table or create a domain.

If you have specified a collation for a column, the collation is used implicitly in SQL statements.

You only need to explicitly use a collation clause in SQL statements if you want to override the default collation or the collation you specified when creating or altering the column or creating the domain.

For more information, see Mimer SQL User's Manual, Collations.

Since Unicode is a universal character set the Unicode sorting order can be employed on any arbitrary character set that is a subset of Unicode.

The default Unicode sorting order is provided in http://www.unicode.org/reports/tr10/allkeys.txt. This table (the Default Unicode Collation Element Table) provides a mapping from characters to collation elements for all the explicitly weighted characters.

Multilevel Comparisons

As mentioned above there are different levels of comparisons to pay attention to, such as case and accent sensitivity. From the Default Unicode Collation Element Table, referred to above, the following definition for the letter b is picked:

 0062  ; [.0A29.0020.0002.0062] # LATIN SMALL LETTER B

Within square brackets there are four levels of comparison keys; the Primary level, the Secondary level, the Tertiary level and the Quaternary level:

Primary level:

Typically, this is used to denote differences between base characters (for example, a < b). It is the strongest difference. For example, dictionaries are divided into different sections by the base character. This is also called the level-1 strength.

Mimer SQL's predefined level 1 collations have names ending with _1, e.g. ENGLISH_1.

Secondary level:

Accents in the characters are usually considered secondary differences (for example, ab < b < ac). A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.

Note: In some languages (such as Icelandic), certain accented letters are considered to be separate base characters.
Tertiary level:

Upper and lower case differences in characters are distinguished at the tertiary level (for example, ab < Ab < b). In addition, a variant of a letter differs from the base form on the tertiary level (such as a and ). A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called level-3 strength.

Mimer SQL's predefined level 3 collations have names ending with _3, e.g. ENGLISH_3.

Quaternary level:

When punctuation is ignored (such as space and hyphen) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, ab < a c < a-c < ac). A quaternary difference is ignored when there is a primary, secondary, or tertiary difference. This is also called the level-4 strength.

Multilevel comparison means the following; Two strings are compared on the primary level. If the comparison for this level fails to establish a unique and determined sequence for the strings, the second level are taken into consideration. If this likewise fails to produce a unique sequence, the tertiary level is invoked, and after this the quaternary level is used. If still a unique sequence cannot be established, the two strings are regarded as equivalent.

How far to go in this comparison chain is decided by the definition of the collation used on the data. See Tailorings.

Alternate Weighting

Alternate collation elements, i.e. punctuation, can be treated different depending on the weighting method used:

Non-ignorable

Alternate collation elements are treated as normal collation elements. This is the default option.

Shifted

Alternate collation elements are set to zero at the primary, secondary, and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements, with a non-zero primary weight, will receive a fourth-level weight of 0xFFFF. If the primary weight is zero, the fourth-level weight is also zero.

Shift-trimmed

Alternate collation elements are set to zero at the primary, secondary, and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements are set to zero. This will emulate POSIX behavior.

The following gives an example of the alternate weighting differences.

Non-ignorable
Shifted
Shift-trimmed
de luge
death
death
de Luge
de luge
deluge
de-luge
de-luge
de luge
de-Luge
deluge
de-luge
death
de Luge
deLuge
deluge
de-Luge
de Luge
deLuge
deLuge
de-Luge
demark
demark
demark

Tailorings

A tailoring is a set of rules and attributes that forms a so called collation delta string, which is used as the basis when creating a collation. When a new collation is to be created, the tailoring describes how to modify an existing collation definition to get the new one. A collation is created by the CREATE COLLATION statement, see CREATE COLLATION.

Attributes

When creating a collation, the tailoring string can include attribute settings for comparison level, accent order, which case that should be first in order and alternate weighting.

Attributes are optional.

Option
Values
Description
Level
[Level 1]
[Level 2]
[Level 3]
[Level 4]
Sort level for the collation.
[Level 3] is default.
Accent order
[AccentOrder Forward]
[AccentOrder Backward]
Secondary level ordering direction.
[AccentOrder Forward] is default.
Case first
[CaseFirst Lower]
[CaseFirst Upper]
Tertiary level case ordering.
[CaseFirst Lower] is default.
Alternate
[Alternate Non-ignorable]
[Alternate Shifted]
[Alternate Shift-trimmed]
Alternatives for punctuation.
[Alternate Non-ignorable] is default.
Hiragana
[Hiragana On]
[Hiragana Off]
Option for Japanese sorting.
Use
[Hiragana On][Level 4] for full Japans ordering.
[Hiragana Off] is default.
Numeric
[Numeric On]
[Numeric Off]
Option for numeric sorting.
[Numeric Off] is default.

Rules

The rules in a tailoring string defines how to change the underlying collation. Each rule contains a string of ordered characters that starts with a reset value.

Symbol
Example
Example description
&
&Z
Reset at this letter. Rules will be relative to this letter from here on.
<
a < b
Identifies a primary level difference between a and b.
<<
e <<
Identifies a secondary level difference between e and .
<<<
s <<< S
Identifies a tertiary level difference between s and S.
=
i = y
Signifies no difference between i and y.
"
","
The quoted character , (comma).
#
#0141#
Hexadecimal representation of Polish L with stroke.

Note: ; can be used to represent secondary relations and , to represent tertiary relations, instead of << and <<< respectively.

Example:

The following is a Danish tailoring example:

[level 4]

[casefirst upper]

[alternate shifted]

& Y << <<<

& Z < <<< << <<< < <<< << <<< < <<< << aa <<< Aa <<< AA

Sorting

Here is an example on how to sort numerical data properly:

 SQL>CREATE TABLE alphanum (codes VARCHAR(10));
 SQL>INSERT INTO  alphanum VALUES('A123');
 SQL>INSERT INTO  alphanum VALUES('A234');
 SQL>INSERT INTO  alphanum VALUES('A23');
 SQL>INSERT INTO  alphanum VALUES('A3');
 SQL>INSERT INTO  alphanum VALUES('A1');
 
 SQL>-- Regular order [Numeric Off]
 SQL>SELECT * FROM alphanum ORDER BY codes;
 
 CODES
 ==========
 A1
 A123
 A23
 A234
 A3
 
 SQL>-- Numeric order [Numeric On]
 SQL>CREATE COLLATION numeric FROM eor USING '[Numeric On]';
 SQL>SELECT * FROM alphanum ORDER BY codes COLLATE numeric;
 
 CODES
 ==========
 A1
 A3
 A23
 A123
 A234
 

Here is an example on how to sort two fields properly; in this case 'last name', 'first name':

 SQL>create table name(last varchar(32),first varchar(32));
 SQL>insert into  name values('van Diesel','Peter');
 SQL>insert into  name values('van Diesel','Thomas');
 SQL>insert into  name values('vanDiesel','Peter');
 SQL>insert into  name values('vanDiesel','Thomas');
 SQL>insert into  name values('Van Diesel','Peter');
 SQL>insert into  name values('Van Diesel','Thomas');
 SQL>insert into  name values('Van','Stephan');
 SQL>insert into  name values('Van','Buster');
 SQL>create collation names from EOR_1
 SQL&using '[level 4][alternate shifted]&9<","'; -- ',' before 'A'
 SQL>select last || ', ' || first as fullname
 SQL&from name order by fullname collate names;
 
 FULLNAME
 ========
 Van, Buster
 Van, Stephan
 van Diesel, Peter
 vanDiesel, Peter
 Van Diesel, Peter
 van Diesel, Thomas
 vanDiesel, Thomas
 Van Diesel, Thomas

Collating details

Expanding Characters

A single character can map to a sequence of collation elements. For instance, is equivalent to ss. In traditional German , , and sort as though they were ae, oe, and ue respectively.

Contracting Character Sequences

Many languages have digraphs, which actually counts as separate letters. In traditional Spanish, ch sorts between c and d, and ll sorts between l and m. Two characters are mapped into a single collation element that cause the combination to be ordered differently from either character individually.

Another example of contractions are lj and nj in Bosnian and Croatian, which sorts after l and n respectively.

Order without contraction
Order with contraction "nj" sorting after "n"
Na
Na
Ni
Ni
Nj
Nk
Nja
Nz
Njz
Nj
Nk
Nja
Nz
Njz
Oa
Oa

Thai/Lao reordering

The Thai and Lao scripts contain left-joining vowels, that precede the consonants. To get the right sort order, the collation element for the vowel is exchanged with the collation element for the consonant that follows the vowel.

Backward Accent Ordering

Some languages, particularly French, require words to be ordered on the secondary level by comparing backwards from right to left.

Example:

English ordering
French ordering
Cote
Cote
Cot
Cte
Cte
Cot
Ct
Ct


Mimer
Mimer Information Technology AB
Voice: +46 18 780 92 00
Fax: +46 18 780 92 40
info@mimer.se
Mimer SQL Documentation TOC PREV NEXT INDEX