Collations

In a database, a collation is a set of rules that determine how data is sorted and compared, usually applied to how text data is sorted in different languages for the purpose of indexing text data and for making comparisons between text values.  Different languages have different character sets and ordering.   A collation allows character data  for a given language to be sorted using rules that define the correct character sequence, with options for specifying case-sensitivity, accent marks, kana character types, use of symbols or punctuation, character width, and word sorting.

 

Collations are important anywhere text is sorted or compared and in many other settings as well.  Using US-English standard settings or other defaults can get such operations wrong when the language is other than US-style English.   Manifold therefore provides support for different collations in many settings, such as index on text fields for tables, within transform templates that operate on text fields, and when working with DBMS packages.

 

Manifold can specify a desired collation when creating an index on a table, so Manifolds use of that index will result in correct ordering of text values and correct comparisons between text values.  Manifold can connect to external databases and use indexes within those databases correctly by supporting collations such databases may use.  

External Collations

The Windows operating system provides a built-in collection of collations that programs, such as Manifold, which utilize Windows facilities can count on being available all of the time.  Manifold refers to such collations that are built into Windows simply as collations.  

 

We call a collation external when it is not implemented by Windows.   Prior editions of Manifold once had only those collations implemented by Windows, but current editions have those and other collations as well, for example, custom collations used by  PostgreSQL,  MySQL, or other DBMS packages that have been re-implemented using Manifold code to support how those DBMS packages use collations.  In addition, Manifold supports collations used within ICU, a library for all things Unicode developed and maintained by IBM.  ICU is used by many, if not all, significant applications that support Unicode, including Chrome.

 

Manifold refers to collations not built into Windows as external for three reasons:

 

 

We can think of external collations as not always guaranteed to be available.   Normally, we would use them with data sources that already use them as a part of the database, so if a project can connect to the database the collation will be there as well.  But when it comes to saving data persistently in .map project files, it is better to use a collation implemented  within Windows.  One exception is that external collations based on ICU are likely good enough to be used for persistent data as well, since current ICU collations are included with Manifold, they are based on the Unicode standard and ICU is committed to supporting that standard.

Collations and Databases

Databases very often use text values as unique identifiers, so being able to use indexes on text fields is very important for performance. That can cause a variety of problems, including:

 

 

Manifold's ability to launch queries that partially execute within Manifold and partially within the external database requires Manifold to simultaneously handle external collations and collations specified within Windows.   

The mfd_collate System Data Table

When creating a data source on an external database, for example, in PostgreSQL or Oracle or some other DBMS, Manifold creates a virtual System Data folder that hosts two to four virtual tables, created on the fly for housekeeping.  These tables appear to be part of the data source, but do not exist within the data source and instead are created on the fly for our convenience as if they were in the data source.  They allow treating various external databases as if they were native Manifold components.  

 

An mfd_meta and mfd_root table are always in such a virtual System Data folder. Usually an mfd_srid table listing coordinate systems as spatial reference identifiers also appears, if the database uses SRIDs.    In addition, if the database supports collations an mfd_collate table appears, listing all the collations available in the database.

 

eg_collation_postgresql_mfd_collate.png

The illustration above shows a typical mfd_collate table, from a data source created on a gisdb database within PostgreSQL as enhanced with PostGIS.  We can Copy the value of such a collation and then Paste it into the Custom tab of the Collations dialog (see below) for a database to define a Custom collation.

 

See the discussion in the DBMS Data Sources - Notes topic.

 

Collations and Transform Templates

Transform templates in the Transform panel that use international languages, indicated with Intl in the  template name, allow specifying the collation used for the language.  

 

il_trans_lower_case_intl_text_dlg.png

 

The Language choice shows the current collation in use, with neutral meaning whatever is the default collation, that is, whatever is the current language set by default for the Windows system in use.  

 

To choose a collation:

 

  1. Click the collation picker button.

  2. Choose Edit Collation in the drop down menu.

  3. In the Collation dialog, click the desired collation.

  4. Check boxes for desired collation options.

  5. Press OK.

 

 

btn_collation_picker.png

Clicking the collation picker button calls up a menu that allows choosing any available collation we desire, including changing options for the current collation.

 

mnu_collation_picker.png

 

Collation Dialog Controls

Edit Collation

Launch the Collation dialog to choose a collation or to alter the current collation, for example, by changing options.

(list of favorites)

A list provides one-click choice of any collations in the Favorites list.    The neutral and neutral, nocase collations are listed by default.

   Edit Favorites

Manage the Favorites list of collations, for example, by adding a collation to the list, deleting a favorite, or renaming a favorite.   Only collations from the Standard tab (built into Manifold) may be added as favorites.

 

Collation Dialog

Click the Edit Collation command in the drop down menu to launch the Collation dialog.

 

dlg_collation.pngdlg_collation_database_tab.png

 

The Standard tab always appears.  Use neutral for the default collation language, or the name of a collation taken from a roster of over 150 supported collations, similar to those used by PostgreSQL.   Collations are identified by names such as en-US, en-GB and zh-CN.  The only option allowed for neutral is Ignore case.   Choosing a collation other than neutral will enable other options, even in cases where the option, such as Ignore kana type in the case of Italian languages, will not be used.

Collation Dialog Controls

Standard tab

The Standard tab always appears, listing collations built into Manifold.

Database tab

The Database tab appears for tables hosted within some DBMS data sources, listing collations available in the database.  The collations listed are those listed in the mfd_collate table

Custom tab

The Custom tab appears for tables hosted within some DBMS data sources.  We can Copy collations definitions from the mfd_collate table of a database and Paste into the Custom tab, and then modify them to add a custom collation.  

(Filter Box)

Our best friend when sifting through long lists.   Enter text, such as French into the filter box and only those collations which include that text in their names will be displayed.  

(list pane)

Click on a collation in the list to choose it.

   Ignore case

Ignore case.  The default is case-sensitive sort order.

  Ignore accent

Ignore diacritical marks, accents, and other similar characters, referred to as non-space characters in Unicode jargon.  The default is to use such characters.

  Ignore symbols

Ignore symbols and punctuation.  The default is to utilize symbols and punctuation.

Ignore kana type

Ignore kana type for Asian languages.  The default is to utilize the kana type, so that two instances of the same character in, say, hiragana and katakana, are considered to be different when sorting.

Ignore width

Ignore character widths for Asian languages. The default is to utilize width, so that two instances of the same character which only vary in width are considered to be different when sorting.

Word sort

Ignore space between words, that is, sort only on the characters that make up words and not also the spaces between words. The default is that space between words matters when sorting, so that words are not treated distinctly but only the full content of the string, words and spaces both.

OK

Choose the specified collation with given options.

Cancel

Exit the dialog without making any changes.

Favorite Collations

Collations that have been added to the Favorites collection are automatically listed in the drop down menu for the collations picker button.   By default, two collations are listed as favorites: the neutral collation and neutral, nocase, meaning to use the neutral collation but ignoring case.

 

Called by the Edit Favorites choice in the collation picker button drop down menu, the Favorites dialog for collations allows building and editing a list of frequently used collations.  Any favorites we add will appear in the pull down lists of favorites in menus to pick collations, allowing a one-click choice of a favorite.    Only collations from the Standard tab of the Collations dialog (collations built into Manifold) can appear as favorites.

 

dlg_favorite_collations.png

Controls

btn_sh_add_coord_sys.png

Add - Add a new favorite.  Launches the Collation dialog.

btn_sh_add_defaults_coord_sys.png

Add Defaults - Adds factory default favorite collations to the current list.  Use this to restore factory settings if any of the original default favorites have been deleted.

Ctrl-click

Ctrl-click a favorite to select it or to de-select it.

Ctrl-A

Select all favorites.

Ctrl-I

Invert the selection.  A quick way to select none is to Ctrl-A and then Ctrl-I.

btn_sh_move_top.png

Move to Top - Moves the selected entries to the top of the list.  The entry must be selected for this button to be enabled.

btn_sh_move_up.png

Move Up - Moves the selected entries up one position in the list.  The entry must be selected for this button to be enabled.

btn_sh_move_down.png

Move Down - Moves the selected entries down one position in the list.  The entry must be selected for this button to be enabled.

btn_sh_move_bottom.png

Move to Bottom - Moves the selected entries to the bottom of the list.  The entry must be selected for this button to be enabled.

btn_sh_delete.png

Delete - Remove the selected entries from the list.   The entry must be selected for this button to be enabled.

Add to Favorites

Enabled when the lower pane contains a collation.  Choosing a collation through other means and then launching the Favorites dialog will populate the lower pane with that collation.  Pressing the Add to Favorites button will add it to the Favorites list.

 

Example:  Add a New Collation to Favorites

We can add a new collation to the Favorites list of collations through any Transform template that provides a collation picker button.  In this example we have a component open, such as a table, with a text field in it.  We choose the Transform panel in the Contents pane and choose a text field as the target, in this case a field called Temp_text.   We choose the Lower Case, Intl transform template.

 

eg_add_favorite_collation01_01.png

 

btn_collation_picker.pngWe click the collation picker button and choose Edit Favorites in the drop down menu to launch the Favorites dialog.

 

eg_add_favorite_collation01_02.png

 

We click the Add button, which launches the Collation dialog to allow us to choose a collation.

 

eg_add_favorite_collation01_03.png

 

We choose the Italian (Italy) collation, which uses a collation identification of it-IT,  and press OK.  In this example we do not choose any of the options, such as to Ignore case.  

 

eg_add_favorite_collation01_04.png

 

The new collation appears in the Favorites list, already selected in case we want to move it up or down in the list of favorites.   We press OK.

 

eg_add_favorite_collation01_05.png

 

btn_collation_picker.pngThe next time we click the collation picker button, the new it-IT choice appears in the one-click list of favorites.

 

Example:  Add the Current Collation to Favorites

If we have already chosen a collation we can add it to the list of favorites.  In the example below we first choose a new collation, and then we add it to the list of favorites.

 

eg_add_favorite_collation01_06.png

 

btn_collation_picker.pngWe click the collation picker button and choose Edit Collation in the drop down menu to launch the Collation dialog.

 

eg_add_favorite_collation01_07.png

 

In the Collation dialog we choose the Malay (Malaysia) collation identified with ms_MY and press OK.

 

eg_add_favorite_collation01_08.png

 

The new collation appears as the specified collation in the Language parameter.  Now that we have specified a new collation, suppose we decide to save it in the Favorites list?  That is easy to do.

 

btn_collation_picker.pngWe click the collation picker button and choose Edit Favorites in the drop down menu to launch the Favorites dialog.

 

eg_add_favorite_collation01_09.png

 

The lower pane of the Favorites dialog shows the collation currently specified, in this case, the ms-MY collation.  To add it to the Favorites list we simply press the Add to Favorites button.

 

eg_add_favorite_collation01_10.png

 

It appears in the Favorites list.  We press OK.

 

eg_add_favorite_collation01_11.png

 

btn_collation_picker.pngThe next time we click the collation picker button, the new it-IT choice appears in the one-click list of favorites.   If we would like to go back to neutral, we simply click that choice.

 

Notes

Infrastructure for external collations - Given the varying nature of external collations in different databases, achieving good support for external collations requires significant infrastructure within Manifold.  That infrastructure includes:

 

 

 

ANSI collates are limited to the neutral language - ANSI collates are limited to the neutral language, because we cannot rely on a  linguistic meaning of characters that changes between machines.  Some discussion on what follows from that restriction:

 

If field f is ANSI, then ... ORDER BY f COLLATE 'fr-FR' automatically uses the neutral language. with no error raised, COLLATE 'fr-FR' is accepted and then replaced with 'neutral' at runtime. There are two reasons for why there is no error raised.

 

 

 

The second point above is actually a feature. If we have an ANSI field and still want to use a linguistic collate on it, we can convert it to Unicode dynamically: ... ORDER BY CAST(f AS NVARCHAR) COLLATE 'fr-FR' will use 'fr-FR'.

 

Title case in the 'neutral' language - Manifold builds in the past implemented casing for the neutral language using linguistic rules for en-US.  Current builds now treat the neutral language differently, using the following rules:   All uninterrupted sequences of letter + digit + apostrophe are treated as separate words. The first letter in each word is converted to upper case. All other letters are converted to lower case. Digits and apostrophes are left unchanged.

 

Linguistic rules for en-US are somewhat similar to the above, but they also handle abbreviations.  For example USA in title case remains USA because all letters are upper case.

 

Log examples, showing the function and the result:

 

> ? StringToTitleCase('mary')

nvarchar: Mary

 

> ? StringToTitleCase('anDRew')

nvarchar: Andrew

 

> ? StringToTitleCase('abc123')

nvarchar: Abc123

 

> ? StringToTitleCase('123abc')

nvarchar: 123Abc

 

> ? StringToTitleCase('o\'hara')

nvarchar: O'hara --- why not O'Hara? Consider the en-US case...

 

> ? StringToTitleCaseCollate('o\'hara', CollateCode('en-US'))

nvarchar: O'hara --- Same result

 

 

See Also

Tables

 

Data Types

 

Indexes

 

Queries

 

Schema

 

Command Window

 

Command Window - Query Builder

 

SQL Statements

 

SQL Functions

 

SQL Operators

 

COLLATE

 

Example: Create a Table with a Constraint - Create a simple table that includes a simple constraint upon one of its fields.  

 

Adding an Index to a Table - A basic topic on adding an index using the built-in capabilities of the mfd_id field.

 

Example: Add a Spatial Index to a Table - A typical use of an index is to provide a spatial index on a geom field in a table, so the geom data can be visualized in a drawing. This topic provides the step by step procedure for adding a spatial index.

 

Example: Add a UUID-based Index to a Table - Create a new computed field that is filled with UUID values on creation and then create an index on that field.   This technique creates an indexed field that has guaranteed unique values for all records and thus the indexed field and record values may be used in other projects.