Different locales, organizations and implementations may require different sort orders, even for the same language. This section describes how you can customize sort orders for your MME implementation.
The CLDR (Common Locale Data Repository) tables shipped with the MME use the standard Unicode ordering produced by the CLDR project. These tables are converted from the standard CLDR POSIX format and shipped as binary files suitable for libqdb_cldr.so. They are located in /etc/cldr/.
For information about converting CLDR POSIX tables, see the mkcldr page in the Neutrino Utilities Reference. For information about the CLDR project, see www.unicode.org/cldr/.
The language collation (or sorting) DLL, libqdb_cldr.so, implements the Unicode Collation Algorithm. This algorithm is a multi-level comparison algorithm, where each character has a number of weights.
These character weights typically correspond to the base character, accents, case, and punctuation. The primary weight has the most relevance, while lower-level weights provide a tie-breaking role; that is, they determine sort order when two or more different elements are assigned the same primary weight.
For example, depending on specific language, locale and even institutional conventions, the characters “a” and “á”, and “A” and “Á” might all be assigned the same primary weight, but require differentiation at the secondary and tertiary levels respectively.
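The multi-level comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the weight values and the names TOY_WEIGHTS and sort_key are invented for the example and are not taken from any CLDR table or from libqdb_cldr.so.

```python
# Toy weights as (primary, secondary, tertiary). The values are
# illustrative assumptions, not data from a real CLDR table.
TOY_WEIGHTS = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),  # same base letter, differs by case (tertiary)
    "á": (1, 1, 0),  # same base letter, differs by accent (secondary)
    "Á": (1, 1, 1),
    "b": (2, 0, 0),
    "B": (2, 0, 1),
}


def sort_key(text):
    """Build a key that compares all primary weights first, then all
    secondary weights, then all tertiary weights."""
    weights = [TOY_WEIGHTS[ch] for ch in text]
    primary = tuple(w[0] for w in weights)
    secondary = tuple(w[1] for w in weights)
    tertiary = tuple(w[2] for w in weights)
    return (primary, secondary, tertiary)


words = ["áb", "Ab", "b", "ab", "aB", "Áb"]
print(sorted(words, key=sort_key))
# ['ab', 'aB', 'Ab', 'áb', 'Áb', 'b']
```

All words beginning with some form of "a" share the same primary weights, so the accent and case differences only break ties among them, and "b" sorts last on its primary weight alone.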
For more information about the Unicode Collation Algorithm, see www.unicode.org/reports/tr10/.
The Unicode Collation Algorithm supports the character contractions and expansions required for correct sorting in some languages. For example, in traditional Spanish “ch” is considered a single letter that sorts after “c” and before “d”, and in German, “æ” is sorted as “a” followed by “e”. Correct sorting in these languages requires, respectively, contraction and expansion of the characters being sorted.
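As a rough sketch of how contractions and expansions affect sorting, the following Python fragment maps text to collation elements before looking up weights. The tables PRIMARY, CONTRACTIONS and EXPANSIONS are hypothetical and heavily simplified; a real locale table would not normally combine both tailorings, which are merged here only for brevity.

```python
# Toy primary weights; illustrative assumptions, not real CLDR data.
PRIMARY = {"a": 1, "b": 2, "c": 3, "ch": 4, "d": 5,
           "e": 6, "h": 7, "o": 8, "u": 9}
CONTRACTIONS = ("ch",)          # two characters become one element
EXPANSIONS = {"æ": ("a", "e")}  # one character becomes two elements


def collation_elements(text):
    """Split text into collation elements, applying contractions
    and expansions from the toy tables above."""
    elements = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            elements.append(pair)   # contraction: consume two characters
            i += 2
        elif text[i] in EXPANSIONS:
            elements.extend(EXPANSIONS[text[i]])  # expansion
            i += 1
        else:
            elements.append(text[i])
            i += 1
    return elements


def primary_key(text):
    """Return the tuple of primary weights for text."""
    return tuple(PRIMARY[e] for e in collation_elements(text))


print(sorted(["da", "cha", "cu", "ca"], key=primary_key))
# ['ca', 'cu', 'cha', 'da']
```

With the contraction in place, "cha" sorts after "cu" because "ch" is weighted as a single letter between "c" and "d"; a plain code point sort would place "cha" before "cu".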
The Unicode Common Locale Data Repository includes sort order files for more than 200 locales. Each locale has a table of character weights and of contraction and expansion sequences that describes the sort ordering for that locale.
The filenames for these files take the form language_LOCALE.UTF-8.src. For example, the file for French in Canada is fr_CA.UTF-8.src.
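The naming convention can be checked or generated with a small helper such as the one below. The function names are hypothetical, and the pattern assumes a two- or three-letter lowercase language code and a two-letter uppercase locale code; CLDR files that follow other patterns are not covered by this sketch.

```python
import re

# Assumed pattern: language_LOCALE.UTF-8.src, for example fr_CA.UTF-8.src.
_SRC_NAME = re.compile(r"^([a-z]{2,3})_([A-Z]{2})\.UTF-8\.src$")


def source_filename(language, locale):
    """Build a sort order source filename from its two parts."""
    return f"{language}_{locale}.UTF-8.src"


def parse_source_filename(name):
    """Return (language, locale) or raise ValueError if the name
    does not follow the assumed pattern."""
    match = _SRC_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected sort order filename: {name!r}")
    return match.group(1), match.group(2)


print(source_filename("fr", "CA"))
# fr_CA.UTF-8.src
```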
To add a new locale sort order:
Use the mkcldr utility to convert CLDR POSIX files to binary files suitable for libqdb_cldr.so. For example, the following commands convert the file for German used in Switzerland:
$ cd cldr-1.4.1/posix
$ mkcldr -c UTF-8.cm de_CH.UTF-8.src /etc/cldr/de_CH
The UTF-8.cm file is simply a database that maps textual character descriptions to their Unicode value; it is used in parsing the collation information.
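If several locales need converting, the command line can be assembled programmatically. The sketch below only builds the argument list; it mirrors the single mkcldr invocation shown above and assumes nothing about other mkcldr options. The function name mkcldr_command and its default values are hypothetical.

```python
import posixpath


def mkcldr_command(locale, out_dir="/etc/cldr", charmap="UTF-8.cm"):
    """Build the argument list for converting one locale, following
    the layout of the documented example:
        mkcldr -c UTF-8.cm de_CH.UTF-8.src /etc/cldr/de_CH
    """
    source = f"{locale}.UTF-8.src"
    target = posixpath.join(out_dir, locale)
    return ["mkcldr", "-c", charmap, source, target]


for name in ("de_CH", "fr_CA"):
    print(" ".join(mkcldr_command(name)))
```

On a system where mkcldr is installed, each list could be passed to subprocess.run() with cwd set to the directory that holds the POSIX source files.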
If none of the standard CLDR locale files defines exactly the sort order you require, you can tailor the character weights in an existing file, which you can then save and convert to use in your custom implementation.
The POSIX format for sort order files is a processed output with explicit weights already assigned to the sort order. It may be simplest, therefore, to tailor a sort order by modifying the XML file from which a sort order file was generated.
That is, to tailor a sort order:
To be used, your new (language_LOCALE.UTF-8.src) file must be in the /etc/cldr directory or in the directory specified by $QDB_CLDR_PATH, as required by your system configuration. Ensure that the new file is copied to this location, either as part of the XML to POSIX conversion with mkcldr, or by a simple file copy afterwards.
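A deployment script might work out the target directory as follows. This is a sketch under one stated assumption: that $QDB_CLDR_PATH, when set and non-empty, takes precedence over /etc/cldr. The text above does not specify the precedence, and the function name cldr_directory is hypothetical.

```python
import os

DEFAULT_CLDR_DIR = "/etc/cldr"


def cldr_directory(environ=None):
    """Return the directory where sort order files are expected.

    Assumption: a non-empty QDB_CLDR_PATH overrides the default.
    """
    env = os.environ if environ is None else environ
    return env.get("QDB_CLDR_PATH") or DEFAULT_CLDR_DIR


print(cldr_directory({}))
# /etc/cldr
```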