|
|
Collations and Linguistic Sorting
This chapter provides the basic concepts of national characters and linguistic sorting.
The default Unicode sorting order is provided in http://www.unicode.org/reports/tr10/allkeys.txt. This table (the Default Unicode Collation Element Table) provides a mapping from characters to collation elements for all the explicitly weighted characters.
Multilevel Comparisons
As mentioned in the previous chapter there are different levels of comparisons to pay attention to, such as case and accent sensitivity. From the Default Unicode Collation Element Table, referred to above, the following definition for the letter b is picked:
0062 ; [.0A29.0020.0002.0062] # LATIN SMALL LETTER B
Within square brackets there are four levels of comparison keys; the Primary level, the Secondary level, the Tertiary level and the Quaternary level:
Primary level:
Typically, this is used to denote differences between base characters (for example, a < b). It is the strongest difference. For example, dictionaries are divided into different sections by the base character. This is also called the level-1 strength.
Mimer SQL's predefined level 1 collations have names ending with _1, e.g. ENGLISH_1.
Secondary level:
Accents in the characters are usually considered secondary differences (for example, ab < áb < ac). A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.
Mimer SQL's predefined level 2 collations have names ending with _2, e.g. ENGLISH_2.
- Note: In some languages (such as Icelandic), certain accented letters are considered to be separate base characters.
Tertiary level:
Upper and lower case differences in characters are distinguished at the tertiary level (for example, ab < Ab < áb). In addition, a variant of a letter differs from the base form on the tertiary level (such as a and ª). A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called level-3 strength.
Mimer SQL's predefined level 3 collations have names ending with _3, e.g. ENGLISH_3.
Quaternary level:
When punctuation is ignored (such as space and hyphen) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, ab < a c < a-c < ac). A quaternary difference is ignored when there is a primary, secondary, or tertiary difference. This is also called the level-4 strength.
Multilevel comparison means the following; Two strings are compared on the primary level. If the comparison for this level fails to establish a unique and determined sequence for the strings, the second level are taken into consideration. If this likewise fails to produce a unique sequence, the tertiary level is invoked, and after this the quaternary level is used. If still a unique sequence cannot be established, the two strings are regarded as equivalent.
How far to go in this comparison chain is decided by the definition of the collation used on the data. See Tailorings.
Alternate Weighting
Alternate collation elements, i.e. punctuation, can be treated different depending on the weighting method used:
Non-ignorable
Alternate collation elements are treated as normal collation elements. This is the default option.
Shifted
Alternate collation elements are set to zero at the primary, secondary, and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements, with a non-zero primary weight, will receive a fourth-level weight of 0xFFFF. If the primary weight is zero, the fourth-level weight is also zero.
Shift-trimmed
Alternate collation elements are set to zero at the primary, secondary, and tertiary level, and the fourth-level weight is set to the primary weight. All other collation elements are set to zero. This will emulate POSIX behavior.
The following gives an example of the alternate weighting differences.
Tailorings
A tailoring is a set of rules and attributes that forms a so called collation delta string, which is used as the basis when creating a collation. When a new collation is to be created, the tailoring describes how to modify an existing collation definition to get the new one. A collation is created by the CREATE COLLATION statement, see CREATE COLLATION.
Attributes
When creating a collation, the tailoring string can include attribute settings for comparison level, accent order, which case that should be first in order and alternate weighting.
Special sort rules
Rules
The rules in a tailoring string defines how to change the underlying collation. Each rule contains a string of ordered characters that starts with a reset value.
Note: ; can be used to represent secondary relations and , to represent tertiary relations, instead of << and <<< respectively.
Example:
The following is a Danish tailoring example:
Sorting
Here is an example on how to sort numerical data properly:
SQL>CREATE TABLE alphanum (codes VARCHAR(10)); SQL>INSERT INTO alphanum VALUES('A123'); SQL>INSERT INTO alphanum VALUES('A234'); SQL>INSERT INTO alphanum VALUES('A23'); SQL>INSERT INTO alphanum VALUES('A3'); SQL>INSERT INTO alphanum VALUES('A1'); SQL>-- Regular order [Numeric Off] SQL>SELECT * FROM alphanum ORDER BY codes; CODES ========== A1 A123 A23 A234 A3 SQL>-- Numeric order [Numeric On] SQL>CREATE COLLATION numeric FROM eor USING '[Numeric On]'; SQL>SELECT * FROM alphanum ORDER BY codes COLLATE numeric; CODES ========== A1 A3 A23 A123 A234Here is an example on how to sort two fields properly; in this case 'last name', 'first name':
SQL>create table name(last varchar(32),first varchar(32)); SQL>insert into name values('van Diesel','Peter'); SQL>insert into name values('van Diesel','Thomas'); SQL>insert into name values('vanDiesel','Peter'); SQL>insert into name values('vanDiesel','Thomas'); SQL>insert into name values('Van Diesel','Peter'); SQL>insert into name values('Van Diesel','Thomas'); SQL>insert into name values('Van','Stephan'); SQL>insert into name values('Van','Buster'); SQL>create collation names from EOR_1 SQL&using '[level 4][alternate shifted]&9<","'; -- ',' before 'A' SQL>select last || ', ' || first as fullname SQL&from name order by fullname collate names; FULLNAME ======== Van, Buster Van, Stephan van Diesel, Peter vanDiesel, Peter Van Diesel, Peter van Diesel, Thomas vanDiesel, Thomas Van Diesel, ThomasCollating Details
Expanding Characters
A single character can map to a sequence of collation elements. For instance, ß is equivalent to ss. In German Phonebook ä, ö, and ü sort as though they were ae, oe, and ue respectively.
Contracting Character Sequences
Many languages have digraphs, which actually counts as separate letters. In traditional Spanish, ch sorts between c and d, and ll sorts between l and m. Two characters are mapped into a single collation element that cause the combination to be ordered differently from either character individually.
Another example of contractions are lj and nj in Bosnian and Croatian, which sorts after l and n respectively.
Order without contraction
Order with contraction "nj" sorting after "n"
Na Na Ni Ni Nj Nk Nja Nz Njz Nj Nk Nja Nz Njz Oa OaBackward Accent Ordering
Some languages, particularly French, require words to be ordered on the secondary level by comparing backwards from right to left.
Example:
Indic
Attribute: [indic]
Function: Method for traditional Indic collationThe following transformations are applied before sorting:
- Consonant + Virama => Consonant
- Consonant + Vowel-sign => Consonant + Vowel
- Consonant => Consonant + A
The method for traditional Indic collation effectively works for the following scripts:
- Devanagari (Hindi, Konkani, Marathi, Nepali, and Sanskrit)
- Bengali (Assamese, Bengali, and Manipuri)
- Gujarati
- Oriya
- Telugu
- Kannada
- Malayalam
- Sinhala
The famous authoritative Monier-Williams: Sanskrit-English Dictionary is a good reference:
http://www.ibiblio.org/sripedia/ebooks/mw/
http://www.ibiblio.org/sripedia/ebooks/mw/0000/mw__0033.htmlThe [indic] attribute also works for Tamil, but with different rules as used in the authoritative University of Madras: Tamil Lexicon
http://dsal.uchicago.edu/dictionaries/tamil-lex/Punjabi does not need any tailoring, the default order follows the rules in the Punjabi University: Punjabi-English Dictionary ISBN:8173800960.
Without the [indic] attribute, a very large tailoring is needed for traditional collation. See the following example for Devanagari.
·#<#0915##094D# Γ#<#0916##094D# Δ#<#0917##094D# Ε#<#0918##094D# Ζ#<#0919##094D# Η#<#091A##094D# [A#<#091B##094D# [B#<#091C##094D# [C#<#091D##094D# [D#<#091E##094D# [E#<#091F##094D# [F#<#0920##094D# Θ#<#0921##094D# Ι#<#0922##094D# Κ#<#0923##094D# Λ#<#0924##094D# Μ#<#0925##094D# Ν#<#0926##094D# Ξ#<#0927##094D# Ο#<#0928##094D# Π#<#092A##094D# \A#<#092B##094D# \B#<#092C##094D# \C#<#092D##094D# \D#<#092E##094D# \E#<#092F##094D# \F#<#0930##094D# ΢#<#0932##094D# Τ#<#0933##094D# Υ#<#0935##094D# Χ#<#0936##094D# Ψ#<#0937##094D# Ω#<#0938##094D# Ϊ#<#0939##094D# Γ##094D##0905#=#0915# Γ##094D##0906#=#0915##093E# Γ##094D##0907#=#0915##093F# Γ##094D##0908#=#0915##0940# Γ##094D##0909#=#0915##0941# Γ##094D##090A#=#0915##0942# Γ##094D##090B#=#0915##0943# Γ##094D##0960#=#0915##0944# Γ##094D##090C#=#0915##0962# Γ##094D##0961#=#0915##0963# Γ##094D##090D#=#0915##0945# Γ##094D##090E#=#0915##0946# Γ##094D##090F#=#0915##0947# Γ##094D##0910#=#0915##0948# Γ##094D##0911#=#0915##0949# Γ##094D##0912#=#0915##094A# Γ##094D##0913#=#0915##094B# Γ##094D##0914#=#0915##094C# ... same pattern for #0916#..#0938# (32) ... Ϋ##094D##0905#=#0939# Ϋ##094D##0906#=#0939##093E# Ϋ##094D##0907#=#0939##093F# Ϋ##094D##0908#=#0939##0940# Ϋ##094D##0909#=#0939##0941# Ϋ##094D##090A#=#0939##0942# Ϋ##094D##090B#=#0939##0943# Ϋ##094D##0960#=#0939##0944# Ϋ##094D##090C#=#0939##0962# Ϋ##094D##0961#=#0939##0963# Ϋ##094D##090D#=#0939##0945# Ϋ##094D##090E#=#0939##0946# Ϋ##094D##090F#=#0939##0947# Ϋ##094D##0910#=#0939##0948# Ϋ##094D##0911#=#0939##0949# Ϋ##094D##0912#=#0939##094A# Ϋ##094D##0913#=#0939##094B# Ϋ##094D##0914#=#0939##094C#Japanese
Attribute: [CJK Kanji]
Function: JIS X 4061-1996 rules for SOUND/ITERATION MARKSThis attribute is an implementation of JIS X 4061-1996 and the collation rules are based on that standard.
The following criteria are considered in order until the collation order is determined. By default, Levels 1 to 4 are applied and Level 5 is ignored (as JIS does).
Level 1: alphabetic ordering
The character classes are sorted in the following order:
Space characters, Symbols and Punctuations, Digits, Latin Letters, Greek Letters, Cyrillic Letters, Hiragana/Katakana letters, Kanji ideographs.In the class, alphabets are collated alphabetically; Kana letters are AIUEO-betically (in the Gozyuon order).
For Kanji, see Kanji Classes.
Other characters are collated as defined.
Characters not defined as a collation element are ignored and skipped on collation.
Level 2: diacritic ordering
In the Latin vowels, the order is as shown the following list.
One without diacritical mark, then with diacritical mark.In Kana, the order is as shown the following list.
A voiceless kana, the voiced, then the semi-voiced (if exists). (eg. Ka before Ga; Ha before Ba before Pa)Level 3: case ordering
A small Latin character is less than the corresponding capital character.
In Kana, the order is as shown in the following list:
replaced PROLONGED SOUND MARK(U+30FC); Small Kana; replaced ITERATION MARK (U+309D, U+309E, U+30FD or U+30FE); then normal kanaFor example, Katakana A + PROLONGED SOUND MARK, Katakana A + Small Katakana A, Katakana A + ITERATION MARK, Katakana A + Katakana A.
Level 4: variant ordering
Hiragana is lesser than Katakana.
Level 5: width ordering
A character that belongs to the block Halfwidth and Fullwidth Forms is greater than the corresponding normal character.
Note: According to the JIS standard, the level 5 should be ignored.
Kanji Classes
There are three Kanji classes:
It comprises five Kanji-like characters, i.e. U+3003, U+3005, U+4EDD, U+3006, U+3007. Any Kanji except U+4EDD are ignored on collation.
It comprises JIS levels 1 and 2 kanji in addition to the minimum Kanji class. Sorted in the JIS order. Any Kanji excepting those defined by JIS X 0208 are ignored on collation.
All the CJK Unified Ideographs in addition to the minimum Kanji class. Sorted in the Unicode order.
Note: This is the implemented class.
Korean
Attribute: [CJK Hanja]
Function: Special sort table accessHanja characters are sorted with secondary difference from the corresponding Hangul character.
Vietnamese
Attribute: [CJK ChuNom]
Function: Syllable by syllable processingIn lexical ordering, differences in letters are treated as primary, differences in tone markings as secondary, and differences in case as tertiary differences. Ordering according to primary and secondary differences proceeds syllable by syllable. According to this principle, a dictionary lists "ban mai" before "bàn cát" because the secondary difference in the first syllable takes precedence over the primary difference in the second.
|
Mimer Information Technology AB Voice: +46 18 780 92 00 Fax: +46 18 780 92 40 info@mimer.se |
|
|