Collations and Linguistic Sorting
As stated in the previous section, character and national character data is sorted according to specific collations.
A collation, also known as a collating sequence, is a database object containing a set of rules that determines how character strings are compared, searched and alphabetically sorted. The rules in the collation determine whether one character string is less than, equal to or greater than another. A collation also determines how case-sensitivity and accents are handled.
You can specify a different collation for ordering characters when you create or alter a table or create a domain.
If you have specified a collation for a column, the collation is used implicitly in SQL statements.
You only need to explicitly use a collation clause in SQL statements if you want to override the default collation or the collation you specified when creating or altering the column or creating the domain.
For more information, see Mimer SQL User's Manual, Collations.
Since Unicode is a universal character set the Unicode sorting order can be employed on any arbitrary character set that is a subset of Unicode.
The default Unicode sorting order is provided in http://www.unicode.org/reports/tr10/allkeys.txt. This table (the Default Unicode Collation Element Table) provides a mapping from characters to collation elements for all the explicitly weighted characters.
As mentioned above there are different levels of comparisons to pay attention to, such as case and accent sensitivity. From the Default Unicode Collation Element Table, referred to above, the following definition for the letter b is picked:0062 ; [.0A29.0020.0002.0062] # LATIN SMALL LETTER B
Within square brackets there are four levels of comparison keys; the Primary level, the Secondary level, the Tertiary level and the Quaternary level:
Typically, this is used to denote differences between base characters (for example, a < b). It is the strongest difference. For example, dictionaries are divided into different sections by the base character. This is also called the level-1 strength.
Mimer SQL's predefined level 1 collations have names ending with _1, e.g. ENGLISH_1.
Accents in the characters are usually considered secondary differences (for example, ab < áb < ac). A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.
- Note: In some languages (such as Icelandic), certain accented letters are considered to be separate base characters.
Upper and lower case differences in characters are distinguished at the tertiary level (for example, ab < Ab < áb). In addition, a variant of a letter differs from the base form on the tertiary level (such as a and ª). A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called level-3 strength.
Mimer SQL's predefined level 3 collations have names ending with _3, e.g. ENGLISH_3.
When punctuation is ignored (such as space and hyphen) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, ab < a c < a-c < ac). A quaternary difference is ignored when there is a primary, secondary, or tertiary difference. This is also called the level-4 strength.
Multilevel comparison means the following; Two strings are compared on the primary level. If the comparison for this level fails to establish a unique and determined sequence for the strings, the second level are taken into consideration. If this likewise fails to produce a unique sequence, the tertiary level is invoked, and after this the quaternary level is used. If still a unique sequence cannot be established, the two strings are regarded as equivalent.
How far to go in this comparison chain is decided by the definition of the collation used on the data. See Tailorings.
Alternate collation elements, i.e. punctuation, can be treated different depending on the weighting method used:
Alternate collation elements are treated as normal collation elements. This is the default option.
Alternate collation elements are set to zero at the primary, secondary, and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements, with a non-zero primary weight, will receive a fourth-level weight of 0xFFFF. If the primary weight is zero, the fourth-level weight is also zero.
Alternate collation elements are set to zero at the primary, secondary, and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements are set to zero. This will emulate POSIX behavior.
The following gives an example of the alternate weighting differences.
de luge death death de Luge de luge deluge de-luge de-luge de luge de-Luge deluge de-luge death de Luge deLuge deluge de-Luge de Luge deLuge deLuge de-Luge demark demark demark
A tailoring is a set of rules and attributes that forms a so called collation delta string, which is used as the basis when creating a collation. When a new collation is to be created, the tailoring describes how to modify an existing collation definition to get the new one. A collation is created by the CREATE COLLATION statement, see CREATE COLLATION.
When creating a collation, the tailoring string can include attribute settings for comparison level, accent order, which case that should be first in order and alternate weighting.
Attributes are optional.
Level [Level 1][Level 2][Level 3][Level 4] Sort level for the collation.[Level 3] is default. Accent order [AccentOrder Forward][AccentOrder Backward] Secondary level ordering direction.[AccentOrder Forward] is default. Case first [CaseFirst Lower][CaseFirst Upper] Tertiary level case ordering.[CaseFirst Lower] is default. Alternate [Alternate Non-ignorable][Alternate Shifted][Alternate Shift-trimmed] Alternatives for punctuation.[Alternate Non-ignorable] is default. Hiragana [Hiragana On][Hiragana Off] Option for Japanese sorting.Use
[Hiragana On][Level 4] for full Japans ordering.[Hiragana Off] is default.
Numeric [Numeric On][Numeric Off] Option for numeric sorting.[Numeric Off] is default.
The rules in a tailoring string defines how to change the underlying collation. Each rule contains a string of ordered characters that starts with a reset value.
& &Z Reset at this letter. Rules will be relative to this letter from here on. < a < b Identifies a primary level difference between a and b. << e << ê Identifies a secondary level difference between e and ê. <<< s <<< S Identifies a tertiary level difference between s and S. = i = y Signifies no difference between i and y. " "," The quoted character , (comma). # #0141# Hexadecimal representation of Polish L with stroke.
Note: ; can be used to represent secondary relations and , to represent tertiary relations, instead of << and <<< respectively.
The following is a Danish tailoring example:
Here is an example on how to sort numerical data properly:SQL>CREATE TABLE alphanum (codes VARCHAR(10)); SQL>INSERT INTO alphanum VALUES('A123'); SQL>INSERT INTO alphanum VALUES('A234'); SQL>INSERT INTO alphanum VALUES('A23'); SQL>INSERT INTO alphanum VALUES('A3'); SQL>INSERT INTO alphanum VALUES('A1'); SQL>-- Regular order [Numeric Off] SQL>SELECT * FROM alphanum ORDER BY codes; CODES ========== A1 A123 A23 A234 A3 SQL>-- Numeric order [Numeric On] SQL>CREATE COLLATION numeric FROM eor USING '[Numeric On]'; SQL>SELECT * FROM alphanum ORDER BY codes COLLATE numeric; CODES ========== A1 A3 A23 A123 A234
Here is an example on how to sort two fields properly; in this case 'last name', 'first name':SQL>create table name(last varchar(32),first varchar(32)); SQL>insert into name values('van Diesel','Peter'); SQL>insert into name values('van Diesel','Thomas'); SQL>insert into name values('vanDiesel','Peter'); SQL>insert into name values('vanDiesel','Thomas'); SQL>insert into name values('Van Diesel','Peter'); SQL>insert into name values('Van Diesel','Thomas'); SQL>insert into name values('Van','Stephan'); SQL>insert into name values('Van','Buster'); SQL>create collation names from EOR_1 SQL&using '[level 4][alternate shifted]&9<","'; -- ',' before 'A' SQL>select last || ', ' || first as fullname SQL&from name order by fullname collate names; FULLNAME ======== Van, Buster Van, Stephan van Diesel, Peter vanDiesel, Peter Van Diesel, Peter van Diesel, Thomas vanDiesel, Thomas Van Diesel, Thomas
A single character can map to a sequence of collation elements. For instance, ß is equivalent to ss. In traditional German ä, ö, and ü sort as though they were ae, oe, and ue respectively.
Contracting Character Sequences
Many languages have digraphs, which actually counts as separate letters. In traditional Spanish, ch sorts between c and d, and ll sorts between l and m. Two characters are mapped into a single collation element that cause the combination to be ordered differently from either character individually.
Another example of contractions are lj and nj in Bosnian and Croatian, which sorts after l and n respectively.
Na Na Ni Ni Nj Nk Nja Nz Njz Nj Nk Nja Nz Njz Oa Oa
The Thai and Lao scripts contain left-joining vowels, that precede the consonants. To get the right sort order, the collation element for the vowel is exchanged with the collation element for the consonant that follows the vowel.
Backward Accent Ordering
Some languages, particularly French, require words to be ordered on the secondary level by comparing backwards from right to left.
Mimer Information Technology AB
Voice: +46 18 780 92 00
Fax: +46 18 780 92 40