COLLATE

Collations are rules that specify how to compare text values in various international languages. The COLLATE command allows us to incorporate such rules into indexes for text values. Collations can be used with both Unicode and ANSI text values.  

 

COLLATE '<name>'

 

<name> - The name of a collation taken from a roster of over 150 supported collations, similar to those used by PostgreSQL.   Collations are identified by names such as en-US, en-GB and zh-CN.   An empty name is allowed and will specify the default collation, which is based on US English and will perform the fastest.  

 

Important: the Collate(<name>, <useCase>, <useAccent>, <useSymbols>) : <value> function is different than COLLATE command for indexes.   The COLLATE command is a part of the INDEX statement that specifies a collation for use in an index.   The Collate() function returns a number that identifies a collation with specified options, which number can be used as an argument with other SQL functions that utilize collations.

 

The usual options for indexes are support.   There can be any number of options or no options.  Available options include:

 

ASC or DESC- sort in ascending order (default) or descending order.

CASE or NOCASE - case-sensitive or case-insensitive sort.

ACCENT or NOACCENT - use (default) or ignore diacritical marks and other similar characters, referred to as non-space characters in Unicode jargon.

SYMBOLS or NOSYMBOLS - use (default) or ignore symbols and punctuation.

 

Using Collations

Collations can be utilized in the following circumstances:

 

 

SQL Functions using Collations

The following SQL functions utilize collations.

 

Collate(<name>, <useCase>, <useAccent>, <useSymbols>) : <value>

The Collate function takes the name of a collation and True/False options to use case, accents and symbols, and returns a numeric value that encodes that collation which can be used as the <collation> argument in functions that use collations.

 

? Collate('en-US', True, True, True)

 

The above generates the number 928, which encodes the use of a US English collation sensitive to case, accents and symbols. 

 

? Collate('en-US', False, True, True)

 

The above generates the number 930, which encodes the use of a US English collation where upper or lower case does not matter but which is sensitive to accents and symbols. 

StringCompareCollate(<string>, <string>, <collation>) : <value>

Same as Compare but compares two strings using the specified collation.  

 

Returns -1 if the first string is less than the second string, 1 if the first string is greater than the second string and 0 if the first and second strings are the same.   

 

In the examples that follow 928 is the collation code for en-US (US English) that is case sensitive.   930 is the collation code for en-US where case does not matter.

 

? StringCompareCollate('John','john', 928)

 

returns 1, meaning the first string is greater than the second string.  

 

? StringCompareCollate('John','john', 930)

 

returns 0, meaning the two strings are the same (as they are if case does not matter).

StringCompareKeyCollate(<string>, <collation>) : <value>

Given a string and a specified collation returns a binary representation of the text using the specified collation.  The produced key can be compared to keys for other strings with the same collation, and such comparisons will yield the same result as if the comparison was done using StringCompareCollate(). However, comparing binary keys is usually much faster than comparing strings using a given collation.

StringContainsCollate(<string>, <substring>, <collation>) : <boolean>

Given a string, a substring and a collation returns True if the string contains the substring using the rules of the specified collation.   

 

In the examples that follow 928 is the collation code for en-US (US English) that is case sensitive.   930 is the collation code for en-US where case does not matter.

 

? StringContainsCollate('Long John Silver', 'john', 928)

 

returns 0 for False,  because if case matters john does not occur in the first string.

 

? StringContainsCollate('Long John Silver', 'john', 930)

 

returns 1 for True,  because if case does not matter john occurs as the second word in the first string.

StringEndsWithCollate(<string>, <substring>, <collation>) : <boolean>

Given a string, a substring and a collation returns True if the string ends with the substring using the rules of the specified collation.    See the StringContainsCollate function for similar examples.

StringStartsWithCollate(<string>, <substring>, <collation>) : <boolean>

Given a string, a substring and a collation returns True if the string starts with the substring using the rules of the specified collation.    See the StringContainsCollate function for similar examples.

StringFindCollate(<string>, <substring>, <collation>) : <position>

Given a string, a substring and a collation returns the position, beginning with 0,  in the string of the first character of the substring if the string contains the substring using the rules of the specified collation.   If the string does not contain the substring return -1.

 

In the examples that follow 928 is the collation code for en-US (US English) that is case sensitive.   930 is the collation code for en-US where case does not matter.

 

? StringFindCollate('Long John Silver','john', 928)

 

returns -1 because if case matters john does not occur in the first string.

 

? StringFindCollate('Long John Silver','john', 930)

 

returns 5 because if case does not matter the substring john begins at the sixth character in the first string.

 

See the Text Position template in the Transform Templates - Numeric topic.

StringFindNthCollate(<string>, <substring>, <occurrence>, <collation>) : <position>

Given a string, a substring and an <occurrence> number, n, (zero based counting, with the first <occurrence> being 0, the second being 1 and so on) and a collation, return the number of the character position in the text, also zero based, where the substring matches the nth time using the rules of the specified collation.   If not found at the specified nth position return -1.

 

In the examples that follow 930 is the collation code for en-US (US English) that is not case sensitive.

 

? StringFindNthCollate('Long John Silver','L', 1, 930)

 

returns 12 because if case does not matter the second instance (indicated by 1 in zero based counting) of the substring 'L' occurs at the thirteenth character (indicated by 12 in zero based counting).

 

? StringFindNthCollate('Long John Silver','L', 2, 930)

 

returns -1 because there is no third instance of the substring 'L'.

 

See the Text Position, Nth template in the Transform Templates - Numeric topic.

StringReplaceCollate(<string>, <source>, <target>, <collation>) : <string>

Given a string, a source string, a target string and a collation return a modified string with all occurrences of the source string, using the rules of the specified collation,  replaced by the target string.   If no instances of the source string are found, returns the original string.   

 

In the example that follows 930 is the collation code for en-US (US English) that is not case sensitive.

 

? StringReplaceCollate('Long John Silver','long', 'Short', 930)

 

The above expression returns:

 

nvarchar: Short John Silver

 

See the Replace Text, All, Intl template in the Transform Templates - Text topic.

StringReplaceNthCollate(<string>, <source>, <target>, <occurrence>, <collation>) : <string>

Given a string, a source string, a target string, an <occurrence> number, n, (zero based counting, with the first <occurrence> being 0, the second being 1 and so on) and a collation, return a modified string with the nth occurrence of the source string, using the rules of the specified collation,  replaced by the target string.  If the source string is not found at the specified nth position returns the original string.

 

In the examples that follow 930 is the collation code for en-US (US English) that is not case sensitive.

 

? StringReplaceNthCollate('Long John Silver','L', 'X', 1, 930)

 

The above expression returns:

 

nvarchar: Long John SiXver

 

? StringReplaceNthCollate('Long John Silver','L', 'X', 2, 930)

 

The above expression returns the original string since there is no third occurrence of the source string 'L'.

 

See the Replace Text, Nth, Intl template in the Transform Templates - Text topic.

StringStartsWithCollate(<string>, <substring>, <collation>) : <boolean>

Given a string, a substring and a collation returns True if the string starts with the substring using the rules of the specified collation.    See the StringContainsCollate function for similar examples.

StringToLowerCaseCollate(<string>, <collation>) : <string>

Given a string and a collation returns the same string all in lower case using the rules of the specified collation.  Ignores <useCase> option in the encoded collation since the result is all lower case.

 

See the Lower Case, Intl template in the Transform Templates - Text topic.

StringToTitleCaseCollate(<string>, <collation>) : <string>

Given a string and a collation returns the same string in title case using the rules of the specified collation.  Ignores <useCase> option in the encoded collation since the result is title case.

 

See the Title Case, Intl template in the Transform Templates - Text topic.

StringToUpperCaseCollate(<string>, <collation>) : <string>

Given a string and a collation returns the same string all in upper case using the rules of the specified collation.  Ignores <useCase> option in the encoded collation since the result is all upper case.

 

See the Upper Case, Intl template in the Transform Templates - Text topic.

 

Notes

Manifold utilizes collations differently than some other database systems.

 

Many database systems tie collations to individual text fields so that each field gets a collation, either the default collation or a collation explicitly specified in CREATE TABLE.   In such systems when the query engine sees a construction such as ORDER BY f, it uses the collation for the field f and orders accordingly and it resolves expressions such as f = 'abc' using the collation for f, which may or may not be case sensitive. When the engine encounters an expression such as f <= g it will fail the operation in cases where the collation for field f differs from that for field g.  The expression fails because the operator has two competing sets of rules to apply and cannot choose between them.

 

Such systems allow resolutions of such conflicts by explicitly using COLLATE to temporarily match the collation of f to g for the duration of the operator, using an expression such as f COLLATE ... <= g.   Such systems provide other means to manage conflicts that can arise when collations are tied to individual fields, for example, providing rules showing how collations are conveyed through functions, with each function having its own custom rules.  For example, a rule might specify the result of Substring(f, ...) will have the same collation as f.

 

Manifold takes a different approach and ties collations to specific constructs which do comparisons such as ORDER.  There are advantages and disadvantages to that approach.

 

The biggest advantages are that there are no conflicts and no hidden rules.  Everything specific to collations is clearly visible.   For example, we can be sure that an expression such as f = 'A' is case-sensitive.  Another big and important advantage is that we do not face a choice of either creating a way of conveying collation information both ways (both for arguments and return values) for external functions, or having to live with external functions being  different from internal functions in that external functions could not use collations.

 

The main disadvantages to Manifold's approach is that first, if we always use a specific collation with some field we will have to write that collation more than once (although we can reduce the need for repetitive work by using functions) and second, other database systems more frequently tie collations to individual fields so the Manifold approach might at first be unfamiliar.

 

Collations in Manifold use PostgreSQL style names, typically using an abbreviation for the language, such as fr for French or en for English and a two letter code for the country, such as US for the United States, FR for France and BE for Belgium.   French as spoken in Belgium has the collation name fr-BE, French as spoken in France is fr-FR, English as spoken in the US is en-US, and English as spoken in Great Britain en-GB.  MySQL locale values are similar, but using an underscore character instead of a hyphen, for constructions such as fr_BE.