Customizing language sort orders for `libqdb_cldr.so`

Different locales, organizations and implementations may require different sort orders, even for the same language. This section describes how you can customize sort orders for your MME implementation:

Standard language sort order files
Sort order algorithm
Adding a new sort order locale
Tailoring a sort order algorithm

Standard language sort order files

The CLDR (Common Locale Data Repository) tables shipped with the MME use the standard Unicode ordering produced by the CLDR project. These tables are converted from the standard CLDR POSIX format and shipped as binary files suitable for libqdb_cldr.so. They are located at etc/cldr/:

cs_CZ — Czech
da_DK — Danish
de_DE — German (Germany)
en_US — English (U.S.)
es_ES — Spanish (Spain)
fr_FR — French (France)
it_IT — Italian
ja_JP — Japanese
ko_KR — Korean
nb_NO — Norwegian (Bokmål)
sv_SE — Swedish
zh_CN — Chinese (simplified)

For information about converting CLDR POSIX tables, see the mkcldr page in the Neutrino Utilities Reference. For information about the CLDR project, see www.unicode.org/cldr/.

Sort order algorithm

The language collation (or sorting) DLL, libqdb_cldr.so, implements the Unicode Collation Algorithm. This algorithm is a multi-level comparison algorithm, where each character has a number of weights.

These character weights typically correspond to the base character, accents, case, and punctuation. The primary weight has the most relevance, while lower-level weights provide a tie-breaking role; that is, they determine sort order when two or more different elements are assigned the same primary weight.

For example, depending on the specific language, locale, and even institutional conventions, the characters “a”, “á”, “A”, and “Á” might all be assigned the same primary weight and be differentiated only at the lower levels: the accent at the secondary level and the case at the tertiary level.
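The level-by-level comparison can be illustrated with a minimal Python sketch. The weight table is a toy invented for this illustration; it is not CLDR data and says nothing about how libqdb_cldr.so stores or applies weights. Each character maps to an assumed (primary, secondary, tertiary) tuple, and the sort key compares all primary weights first, then all secondary weights, then all tertiary weights.

```python
# Toy weight table: character -> (primary, secondary, tertiary).
# These values are illustrative assumptions, not real CLDR weights.
TOY_WEIGHTS = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),
    "á": (1, 1, 0),
    "Á": (1, 1, 1),
    "b": (2, 0, 0),
    "B": (2, 0, 1),
}


def sort_key(text):
    """Build a key that compares primary weights, then secondary, then tertiary."""
    weights = [TOY_WEIGHTS[ch] for ch in text]
    primary = tuple(w[0] for w in weights)
    secondary = tuple(w[1] for w in weights)
    tertiary = tuple(w[2] for w in weights)
    return (primary, secondary, tertiary)


words = ["b", "Áb", "ab", "áb", "Ab", "a"]
print(sorted(words, key=sort_key))
# ['a', 'ab', 'Ab', 'áb', 'Áb', 'b']
```

In this toy table the accent and the case only break ties: “ab”, “Ab”, “áb”, and “Áb” share the same primary weights, so the secondary and tertiary levels alone decide their relative order.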

For more information about the Unicode Collation Algorithm, see www.unicode.org/reports/tr10/.

Contractions and expansions

The Unicode Collation Algorithm supports the character contractions and expansions required for correct sorting in some languages. For example, in traditional Spanish “ch” is considered a single letter that sorts after “c” and before “d”, and in German, “æ” is sorted as “a” followed by “e”. Correct sorting in these languages requires, respectively, contraction and expansion of the characters being sorted.
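As a rough illustration of both mechanisms, the following Python sketch maps character sequences to lists of primary weights. All names and weight values are assumptions made up for the example; they are not taken from any CLDR table. A contraction is a two-character entry with a single weight, and an expansion is a one-character entry with two weights.

```python
# Toy collation elements: character sequence -> list of primary weights.
# All values are illustrative assumptions, not CLDR data.
TOY_ELEMENTS = {
    "a": [10],
    "c": [30],
    "ch": [35],     # contraction: two characters, one weight between "c" and "d"
    "d": [40],
    "e": [50],
    "h": [80],
    "o": [150],
    "æ": [10, 50],  # expansion: one character, the weights of "a" then "e"
}


def primary_key(text):
    """Return the primary weights of text, preferring a two-character match."""
    weights = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in TOY_ELEMENTS:
            # Contraction: consume both characters as a single element.
            weights.extend(TOY_ELEMENTS[pair])
            i += 2
        else:
            # Single character; an expansion contributes several weights.
            weights.extend(TOY_ELEMENTS[text[i]])
            i += 1
    return weights


print(sorted(["da", "cha", "co", "ca"], key=primary_key))
# ['ca', 'co', 'cha', 'da']
print(primary_key("æ") == primary_key("ae"))
# True
```

A plain code-point sort would place “cha” before “co”; with the contraction, every word starting with “ch” sorts after every word starting with a plain “c”.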

Locale data files

The Unicode Common Locale Data Repository includes sort order files for more than 200 locales. Each locale file contains a table of character weights, together with the contraction and expansion sequences, that describe the sort order for that locale.

The filenames for these files take the form language_LOCALE.UTF-8.src. For example, the file for French in Canada is fr_CA.UTF-8.src.

Adding a new sort order locale

To add a new locale sort order:

Download the required CLDR locale file from cldr.unicode.org.
Convert the downloaded sort order table to the binary data format used by libqdb_cldr.so. See “Converting CLDR POSIX files” below.
Configure your application to use the new sort order.

Converting CLDR POSIX files

Use the mkcldr utility to convert CLDR POSIX files to binary files suitable for libqdb_cldr.so. For example, the following commands convert the file for German as used in Switzerland:

$ cd cldr-1.4.1/posix
$ mkcldr -c UTF-8.cm de_CH.UTF-8.src /etc/cldr/de_CH

The UTF-8.cm file is simply a database that maps textual character descriptions to their Unicode values; mkcldr uses it when parsing the collation information.
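When several locales need converting, the same invocation can be generated in a loop. The Python sketch below only builds the argument lists; it uses nothing beyond the command form shown in the example above (the -c option, the source file, and the output path). The helper name mkcldr_command and its default values are hypothetical.

```python
import posixpath


def mkcldr_command(locale, out_dir="/etc/cldr", charmap="UTF-8.cm"):
    """Build the argument list for one conversion, mirroring the example above."""
    source = f"{locale}.UTF-8.src"            # for example, de_CH.UTF-8.src
    target = posixpath.join(out_dir, locale)  # for example, /etc/cldr/de_CH
    return ["mkcldr", "-c", charmap, source, target]


for locale in ("de_CH", "fr_CA"):
    print(" ".join(mkcldr_command(locale)))
# mkcldr -c UTF-8.cm de_CH.UTF-8.src /etc/cldr/de_CH
# mkcldr -c UTF-8.cm fr_CA.UTF-8.src /etc/cldr/fr_CA
```

To run a command rather than print it, pass the list to subprocess.run(cmd, check=True, cwd=...) with cwd set to the directory that holds the POSIX files, on a host where mkcldr is installed.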

Tailoring a sort order algorithm

If none of the standard CLDR locale files defines exactly the sort order you require, you can tailor the character weights in an existing file, which you can then save and convert to use in your custom implementation.

The POSIX format for sort order files is a processed output with explicit weights already assigned to the sort order. It may be simplest, therefore, to tailor a sort order by modifying the XML file from which a sort order file was generated.

That is, to tailor a sort order:

Download from unicode.org/repos/cldr/trunk/docs/web/repository_access.html and unzip the files for the latest release:
- the XML/LDML files (core.zip)
- the Java tools for generating POSIX files from the XML files (tools.zip)
Open the XML file for the sort order you want to change and tailor the character weights as required.
Use the GeneratePOSIX utility from the Java tools download to generate a POSIX file with the tailored sort order.
Use the QNX mkcldr utility to convert the new POSIX file to a binary file suitable for libqdb_cldr.so, as explained above in “Converting CLDR POSIX files”.

To use your new sort order, the binary file generated by mkcldr must be in the /etc/cldr directory or in the directory specified by $QDB_CLDR_PATH, as required by your system configuration. Either write the file directly to this location as part of the POSIX-to-binary conversion with mkcldr, or copy it there afterwards.
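A deployment script might resolve the target location along the following lines. This is a sketch under two assumptions that the text does not confirm: that $QDB_CLDR_PATH names a single directory, and that it takes precedence over /etc/cldr when it is set. The function names are hypothetical.

```python
import os
import posixpath


def cldr_dir(environ=None):
    """Return the directory assumed to hold the binary sort order files."""
    env = os.environ if environ is None else environ
    # Assumption: QDB_CLDR_PATH, when set and non-empty, overrides /etc/cldr.
    return env.get("QDB_CLDR_PATH") or "/etc/cldr"


def sort_order_path(locale, environ=None):
    """Return the expected path of the binary file for a locale such as de_CH."""
    return posixpath.join(cldr_dir(environ), locale)


print(sort_order_path("de_CH", {}))
# /etc/cldr/de_CH
print(sort_order_path("de_CH", {"QDB_CLDR_PATH": "/opt/mme/cldr"}))
# /opt/mme/cldr/de_CH
```

Passing the environment as a parameter keeps the lookup easy to check on a development host without touching the real environment.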